LLM Workups: Running BLOOMZ 7.1B on a 3060 Ti.
"Write the python code to run BLOOMZ 7.1B LLM on a machine where some of the work must be offloaded to the CPU as the GPU is only a 3060ti."
- The first time an LLM is run, it takes a significant amount of time because the model weights (the tensors) must be downloaded and cached locally.
- This LLM is known for being fairly accurate, reliable, and cost effective: you don't need a $6,000 video card to run it.
- The argument max_memory={0: "4GiB", "cpu": "20GiB"} had to be added so that GPU VRAM use is capped and the rest of the model spills over to system RAM.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
# Load the tokenizer and model. The model is 'bigscience/bloomz-7b1'.
# device_map="auto" enables automatic offloading to CPU when GPU VRAM (e.g., 8GB on RTX 3060 Ti) is insufficient.
model_name = "bigscience/bloomz-7b1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",                       # Handles GPU/CPU offloading automatically
    torch_dtype=torch.float16,               # FP16 for memory efficiency
    trust_remote_code=True,
    max_memory={0: "4GiB", "cpu": "20GiB"}   # Cap GPU VRAM use; offload the rest to CPU RAM
)
# Create a text generation pipeline
generator = pipeline("text-generation", model=model, tokenizer=tokenizer, device_map="auto")
# Test the model with a sample prompt
prompt = "Hello, I am testing the BLOOMZ-7b1 model. What is the capital of Japan?"
output = generator(prompt, max_new_tokens=50, do_sample=True, top_k=50, top_p=0.95)
# Print the generated output
print(output[0]['generated_text'])

Its output was clear and concise. Very nice.
Hello, I am testing the BLOOMZ-7b1 model. What is the capital of Japan? Tokyo
Process finished with exit code 0
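A quick way to sanity-check how the model was split under the max_memory cap is to print model.hf_device_map, which from_pretrained fills in whenever device_map="auto" is used. A minimal sketch, assuming the model object loaded above:

# Show which modules ended up on the GPU (device 0) and which were offloaded to the CPU.
# Assumes `model` is the BLOOMZ model loaded above with device_map="auto".
for module_name, device in model.hf_device_map.items():
    print(f"{module_name}: {device}")

Next, we rewrote the code so that it would time token generation at various lengths: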
import time
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
# Load the tokenizer and model. The model is 'bigscience/bloomz-7b1'.
# device_map="auto" enables automatic offloading to CPU when GPU VRAM (e.g., 8GB on RTX 3060 Ti) is insufficient.
model_name = "bigscience/bloomz-7b1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",                       # Handles GPU/CPU offloading automatically
    torch_dtype=torch.float16,               # FP16 for memory efficiency
    trust_remote_code=True,
    max_memory={0: "4GiB", "cpu": "20GiB"}   # Cap GPU VRAM use; offload the rest to CPU RAM
)
# Create a text generation pipeline
generator = pipeline("text-generation", model=model, tokenizer=tokenizer, device_map="auto")
# Define the prompt
prompt = "Hello, What do you think about Japan?"
# Loop over token increments from 50 to 200 in steps of 25
for max_tokens in range(50, 201, 25):
    start_time = time.perf_counter()  # Start timing
    output = generator(prompt, max_new_tokens=max_tokens, do_sample=True, top_k=50, top_p=0.95)
    end_time = time.perf_counter()  # End timing
    elapsed_time = end_time - start_time
    print(f"Time to generate {max_tokens} tokens: {elapsed_time:.2f} seconds")
    print(output[0]['generated_text'])  # Optional: Print the generated text for verification
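One caveat on these numbers: max_new_tokens is only an upper bound, and as the results below show, BLOOMZ usually stops at its end-of-sequence token well before reaching it, so the elapsed time reflects the tokens actually produced rather than the limit requested. A minimal sketch for measuring throughput directly, assuming the generator, tokenizer, and prompt defined above:

# Hedged sketch: count the tokens actually generated and compute tokens per second.
# Assumes `generator`, `tokenizer`, and `prompt` are defined as in the script above.
start_time = time.perf_counter()
output = generator(prompt, max_new_tokens=200, do_sample=True, top_k=50, top_p=0.95)
elapsed_time = time.perf_counter() - start_time
prompt_tokens = len(tokenizer(prompt)["input_ids"])
total_tokens = len(tokenizer(output[0]["generated_text"])["input_ids"])
new_tokens = total_tokens - prompt_tokens
print(f"Generated {new_tokens} new tokens in {elapsed_time:.2f} s "
      f"({new_tokens / elapsed_time:.2f} tokens/s)")

Running the timing loop shows that this is a terse and concise LLM: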
Hello, What do you think about Japan? Japan
Time to generate 75 tokens: 7.91 seconds
Hello, What do you think about Japan? Japan is a beautiful country
Time to generate 100 tokens: 5.30 seconds
Hello, What do you think about Japan? I love it
Time to generate 125 tokens: 18.46 seconds
Hello, What do you think about Japan? This is my first post, so I'd like to give a review
Time to generate 150 tokens: 11.91 seconds
Hello, What do you think about Japan? I think it is a wonderful country.
Time to generate 175 tokens: 6.60 seconds
Hello, What do you think about Japan? Japan is a country
Time to generate 200 tokens: 13.87 seconds
Hello, What do you think about Japan? I want to know what you think about Japan

Summary: This LLM stayed consistent and would not extrapolate, even when pushed to through generation settings or follow-up prompts in the console. It is very good for very short summaries, which would make it an excellent classifier.
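Since the model reliably answers in a handful of tokens, a classification-style prompt plays to that strength. The sketch below is hypothetical (the reviews, labels, and prompt wording are invented for illustration) and assumes the generator pipeline created above:

# Hypothetical example: using BLOOMZ's terse answers for zero-shot sentiment labeling.
# Assumes `generator` is the pipeline created above; the reviews and labels are invented.
reviews = [
    "The food was cold and the staff ignored us.",
    "Absolutely loved the new museum exhibit!",
]
for review in reviews:
    prompt = f"Classify the sentiment of this review as positive or negative: {review}\nSentiment:"
    output = generator(prompt, max_new_tokens=3, do_sample=False)
    label = output[0]["generated_text"][len(prompt):].strip()  # Keep only the model's appended answer
    print(f"{review} -> {label}")

Keeping max_new_tokens small and turning sampling off leans into the terseness observed above, so the model's reply is effectively just the label.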