Linux/LLM Memory Bandwidth Testing with mbw / nvbandwidth

The #1 factor that determines whether your LLM will run quickly or not is memory bandwidth:
- Memory Bandwidth of your RAM.
- Memory Bandwidth of your GPU.
- How quickly data can get between the GPU and the CPU over your very slow PCIe bus.
1. Memory Bandwidth of your RAM.
- In Linux we can test this quickly using the mbw tool
sudo apt install mbw -y
Once it is installed, you can run it with simply:
mbw 512
This runs mbw's default set of tests, 10 iterations each, against a 512 MiB array.

In LLM terms this is very slow memory, at only 8.5-15.5 GB/s of bandwidth.
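If you want to dig in further, mbw also takes flags for the iteration count and test type. A minimal sketch (the exact flags can vary between versions, so check mbw -h on your system):
# Run 20 iterations of only the plain memcpy test against a 1024 MiB array
mbw -n 20 -t0 1024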
2. Memory Bandwidth From your CPU to your GPU.
We can now move on to testing our CPU-to-GPU (PCIe) bandwidth using nvbandwidth.
You will need to install some prerequisite tools first, including nvcc:
sudo apt update -y
sudo apt install git build-essential libboost-program-options-dev cmake -y
sudo apt install nvidia-cuda-toolkit
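Before building, it is worth confirming that the CUDA compiler actually landed on your PATH:
nvcc --version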
Once you have the supporting build tools, you can then fetch and configure nvbandwidth:
git clone https://github.com/NVIDIA/nvbandwidth.git
cd nvbandwidth
cmake .
If the configure step went cleanly, it will look like this:

Alternatively, you can configure with multi-node support enabled:
cmake -DMULTINODE=1 .
In my case there was a build error that required editing testcase.cpp as follows:
- Comment out line 61
- Add supportsMulticast = 1; below it:
// CU_ASSERT(cuDeviceGetAttribute(&supportsMulticast, CU_DEVICE_ATTRIBUTE_MULTICAST_SUPPORTED, dev));
supportsMulticast = 1;
After this, the simplest and cleanest way to get it running is to call the installation script:
sudo ./debian_install.sh
You should see a clean build like this:

Running it with no options will automatically query all GPUs in your system and benchmark them in various configurations - GPU to GPU, GPU to CPU, and CPU to GPU. There are actually 35 different tests you can run (shown below).
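To kick off that default run from the build directory, it is simply:
./nvbandwidth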

We can see the information for both video cards, showing VERY SLOW performance on one of them:
Device 0: NVIDIA GeForce RTX 3060 Ti (00000000:03:00)
Device 1: NVIDIA GeForce RTX 3060 Ti (00000000:0a:00)
Running device_to_host_memcpy_ce.
memcpy CE CPU(row) <- GPU(column) bandwidth (GB/s)
            0        1
 0       6.78    25.68
- Card 0 - 6.78 GB/s / Card 1 - 25.68 GB/s!? How! Why?!
- The result comes from a generic low-end consumer motherboard that gives you one fast PCIe slot, never expecting you to run more than one GPU. But this right here is the major cause of bottlenecks in your home LLM - a lack of PCIe bandwidth! (a quick way to check this is shown directly below)
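We can ask the driver what link each card actually negotiated (assuming a reasonably recent nvidia-smi; these PCIe query fields are the ones current drivers expose):
nvidia-smi --query-gpu=name,pcie.link.gen.current,pcie.link.width.current,pcie.link.gen.max,pcie.link.width.max --format=csv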
We can identify the motherboard without walking over to the computer by simply running:
sudo dmidecode -t baseboard
This gives us all the details on this basic motherboard:
Handle 0x0002, DMI type 2, 15 bytes
Base Board Information
Manufacturer: ASUSTeK COMPUTER INC.
Product Name: TUF GAMING X570-PLUS
Version: Rev X.0x
Serial Number: 191060782600654
Asset Tag: Default string
Features:
Board is a hosting board
Board is replaceable
Location In Chassis: Default string
Chassis Handle: 0x0003
Type: Motherboard
Contained Object Handles: 0
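If you only want the board model string (handy for scripting), dmidecode can also print individual fields:
sudo dmidecode -s baseboard-manufacturer
sudo dmidecode -s baseboard-product-name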
At this point you really do need to pull up the manual for your motherboard and check out its slot specifications:

Using the following speed table, we can get a quick idea of what we are dealing with:
PCIe Speed Table
The following table outlines the maximum theoretical speeds for different generations of PCI Express (PCIe) based on the number of lanes. Each generation doubles the data transfer rate compared to the previous one.
PCIe Speed by Generation and Lane Count
| PCIe Generation | Year Released | Data Transfer Rate | Bandwidth (x1) | Bandwidth (x16) |
|---|---|---|---|---|
| PCIe 1.0 | 2003 | 2.5 GT/s | 250 MB/s | 4.0 GB/s |
| PCIe 2.0 | 2007 | 5.0 GT/s | 500 MB/s | 8.0 GB/s |
| PCIe 3.0 | 2010 | 8.0 GT/s | 1.0 GB/s | 16.0 GB/s |
| PCIe 4.0 | 2017 | 16.0 GT/s | 2.0 GB/s | 32.0 GB/s |
| PCIe 5.0 | 2019 | 32.0 GT/s | 4.0 GB/s | 64.0 GB/s |
| PCIe 6.0 | 2022 | 64.0 GT/s | 8.0 GB/s | 128.0 GB/s |
Key Points
- Lanes: PCIe connections can have different lane configurations (x1, x4, x8, x16).
- Backward Compatibility: Newer PCIe generations are generally backward compatible with older versions.
- Real-World Performance: Actual speeds may be about 15% lower due to system overhead and other factors.
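As a back-of-the-envelope check against our own numbers (a sketch assuming the 128b/130b line encoding used by PCIe 3.0/4.0 and ignoring protocol overhead), the theoretical link speeds work out to:
# Theoretical throughput = GT/s x lanes x (128/130 encoding) / 8 bits per byte
printf "PCIe 4.0 x16: %.1f GB/s\n" "$(echo '16*16*128/130/8' | bc -l)"   # ~31.5 GB/s -> our ~25 GB/s slot
printf "PCIe 3.0 x8 : %.1f GB/s\n" "$(echo '8*8*128/130/8' | bc -l)"     # ~7.9 GB/s -> our ~6.8 GB/s slot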
In our case we have one card at roughly 25 GB/s and one at 6.78 GB/s, so clearly we have quite a bottleneck, and yes, this is definitely going to significantly affect performance.
- We are looking at one slot giving us 'gaming performance' of ~25 GB/s, and one slot most likely running at something like PCIe 3.0 x8 (or a 4.0 x4 link with the same bandwidth). We can actually look this all up with a bit of inspection.
lspci | grep NVIDIA
Will give us:
03:00.0 VGA compatible controller: NVIDIA Corporation GA104 [GeForce RTX 3060 Ti Lite Hash Rate] (rev a1)
03:00.1 Audio device: NVIDIA Corporation GA104 High Definition Audio Controller (rev a1)
0a:00.0 VGA compatible controller: NVIDIA Corporation GA104 [GeForce RTX 3060 Ti] (rev a1)
0a:00.1 Audio device: NVIDIA Corporation GA104 High Definition Audio Controller (rev a1)
Note the bus numbers; we can use them to get more information. To inspect the first card:
lspci -s <bus:device.function>   # the general form; so:
sudo lspci -s 03:00.0 -vvv
- This must be run as sudo; without root you get some information but not the link capabilities:
LnkCap: Port #0, Speed 16GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <512ns, L1 <4us
When we check out the second card:
lspci -s 0a:00.0 -vvv
And our link cap is:
LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
Note that LnkCap/LnkCap2 only tell us what the card itself supports (up to 16 GT/s here); what the slot actually negotiated is reported in the LnkSta line, and that, again, comes down to the motherboard.
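To cut straight to those lines for both cards:
sudo lspci -s 03:00.0 -vvv | grep -E 'LnkCap|LnkSta'
sudo lspci -s 0a:00.0 -vvv | grep -E 'LnkCap|LnkSta'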
We could carry on from here, but the end result is simply that we do not have industrial-grade equipment, so we should not expect industrial-grade performance... But for a home LLM on a minuscule budget, it is respectable.
Part 2: Internal Inspection of nvbandwidth:
- There are actually a large number of tests that nvbandwidth supports:
nvbandwidth --list
0, host_to_device_memcpy_ce:
1, device_to_host_memcpy_ce:
2, host_to_device_bidirectional_memcpy_ce:
3, device_to_host_bidirectional_memcpy_ce:
4, device_to_device_memcpy_read_ce:
5, device_to_device_memcpy_write_ce:
6, device_to_device_bidirectional_memcpy_read_ce:
7, device_to_device_bidirectional_memcpy_write_ce:
8, all_to_host_memcpy_ce:
9, all_to_host_bidirectional_memcpy_ce:
10, host_to_all_memcpy_ce:
11, host_to_all_bidirectional_memcpy_ce:
12, all_to_one_write_ce:
13, all_to_one_read_ce:
14, one_to_all_write_ce:
15, one_to_all_read_ce:
16, host_to_device_memcpy_sm:
17, device_to_host_memcpy_sm:
18, host_to_device_bidirectional_memcpy_sm:
19, device_to_host_bidirectional_memcpy_sm:
20, device_to_device_memcpy_read_sm:
21, device_to_device_memcpy_write_sm:
22, device_to_device_bidirectional_memcpy_read_sm:
23, device_to_device_bidirectional_memcpy_write_sm:
24, all_to_host_memcpy_sm:
25, all_to_host_bidirectional_memcpy_sm:
26, host_to_all_memcpy_sm:
27, host_to_all_bidirectional_memcpy_sm:
28, all_to_one_write_sm:
29, all_to_one_read_sm:
30, one_to_all_write_sm:
31, one_to_all_read_sm:
32, host_device_latency_sm:
33, device_to_device_latency_sm:
34, device_local_copy: << - This one measures internal GB/s!
So, to see the internal bandwidth (GB/s) of the video card, we actually want test 34:
./nvbandwidth -t 34
We see the true speed results internal to the card.
Device 0: NVIDIA GeForce RTX 3060 Ti (00000000:03:00)
Device 1: NVIDIA GeForce RTX 3060 Ti (00000000:0a:00)
Running device_local_copy.
memcpy local GPU(column) bandwidth (GB/s)
            0         1
 0      207.27    207.21
SUM device_local_copy 414.48
NOTE: The reported results may not reflect the full capabilities of the platform.
Performance can vary with software drivers, hardware clocks, and system topology.
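If you just want the handful of numbers that matter for LLM work, you can also run individual tests by name rather than by index (recent nvbandwidth builds accept either with -t; ./nvbandwidth --help will confirm for your build):
./nvbandwidth -t host_to_device_memcpy_ce
./nvbandwidth -t device_to_host_memcpy_ce
./nvbandwidth -t device_local_copy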
Summary - This really shows you where your bottlenecks can sit. Even though the NVIDIA 3060 Ti (now considered a very basic baseline card) can have internal memory bandwidth of "448 GB/s", when an LLM is inferencing across a very slow PCIe 3.0 x8 link you will lose a LOT of your speed.
This is why 48 GB and 96 GB video cards are astronomically priced: they allow an entire LLM to run without ever spilling out onto the congested and comparatively slow PCIe bus.
The measured copy speed of the 3060 Ti was only about 207 GB/s, not even close to the various claimed figures of 336/448 GB/s. You can see on this thread that NVIDIA publishes theoretical peak numbers for its cards, not typical real-world results! Once you know this, you can judge what you are really getting out of your GPUs, CPUs, and LLMs, and hedge your expectations accordingly.
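For reference, the 448 GB/s figure is simply the theoretical peak of the memory interface (14 Gbps GDDR6 on a 256-bit bus, per the published spec); note too that a local copy both reads and writes every byte, so the copy rate understates the raw DRAM traffic:
# Theoretical peak = per-pin data rate (Gbps) x bus width (bits) / 8 bits per byte
echo "$(( 14 * 256 / 8 )) GB/s peak"   # 448 GB/s on paper vs ~207 GB/s measured copy throughput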
The most important purchase you can make to future-proof your LLM setup is to start with a motherboard that offers true PCIe 4.0 x16 on all four slots! That throws a lot of people, because that effectively means a server board, and those usually start at around $1,200 alone.