LLM TurboQuant Example! Qwen3.5 27B Agentic Workflow Primer. Saving $900/month in Tokens Speed Running our House LLM and Clobbering the Claudes.
Part 3. The Hard Dollar Figures.

The first question we wanted answered was this: if we turn our context way down (which speeds up token generation) and run our house LLM 24 hours a day on various tasks, how much money are we saving?

If you want the back story on how we ran a full Qwen3.5 27B on a single 4080ti with 16 GB of VRAM, here is the guide on how to do it!

LLM TurboQuant Example! Running Llama.cpp Qwen3.5-27B-TQ3_1S Entirely on a 16GB 4080ti! 32,000 tokens!! (All on the GPU Part 1)
If you have ever run an LLM from system RAM alone, it is disastrously slow. The remedy requires very large amounts of RAM on the GPU, and Nvidia, seeing this, made sure to charge appropriately. The bottleneck every time is the slow speed of the PCIe

We fired up our house LLM in a fast mode and threw the following question at it:

/usr/bin/llama-server -m /home/c/PythonProject/TurboResearcher/models/Qwen3.5-27B-TQ3_1S.gguf --host 192.168.1.3 --n-gpu-layers -1 --flash-attn on --cache-type-k q8_0 --cache-type-v turbo3 -c 4096 --temp 0.7
What is the cost per day and month for an LLM that does the following speeds 5, 15, 30, 60 tokens per second when the output cost per token is $25 per million tokens?
  • We noticed that generation speed was about 13 tokens/second at both 4096 and 8192 context. At 16384 it slowed to about 5.0 tokens/s. We tried an odd fit of 12000 and got the same, about 5.0 tokens/s. So for our purposes, 8192 is the sweet spot.
  • It worked the problem. The answer did not complete within a 4096 context length; at 16384 it took 12 minutes to answer; at the sweet spot of 8192 it finished in 3 minutes 32 seconds. Here is the answer:

Based on the parameters provided, here is the cost breakdown.

Calculation Assumptions

  • Operation: Continuous generation (24 hours/day, 7 days/week).
  • Timeframe: 1 Day (24 hours) and 1 Month (30 days).
  • Cost Rate: $25 per 1 million tokens ($0.000025 per token).
  • Seconds in a Day: 86,400.

Cost Formulas

  1. Daily Tokens: $\text{Speed (tokens/sec)} \times 86{,}400$
  2. Daily Cost: $\text{Daily Tokens} \div 1{,}000{,}000 \times \$25$
  • Simplified: $\text{Speed} \times 2.16$
  3. Monthly Cost: $\text{Daily Cost} \times 30$
  • Simplified: $\text{Speed} \times 64.8$

Cost Table

| Speed (Tokens/Sec) | Daily Cost (USD) | Monthly Cost (USD) |
| ------------------ | ---------------- | ------------------ |
| 5                  | $10.80           | $324.00            |
| 15                 | $32.40           | $972.00            |
| 30                 | $64.80           | $1,944.00          |
| 60                 | $129.60          | $3,888.00          |

Summary

  • At 5 tokens/sec: Approximately $11/day or $324/month.
  • At 15 tokens/sec: Approximately $32/day or $972/month.
  • At 60 tokens/sec: Approximately $130/day or $3,888/month.

Note: These costs assume the model is actively generating output 100% of the time at the specified speed. If the model is idle or waiting for input, the actual cost will be lower.
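The cost math above is simple enough to sanity-check yourself. Here is a minimal Python sketch of it: tokens per day are speed times 86,400 seconds, and cost is tokens divided by one million times the $25 rate. The function name and constants are ours, not from any API.

```python
# Sanity-check of the cost table: tokens/day = speed * 86,400 seconds,
# cost = tokens / 1,000,000 * $25 per million output tokens.
RATE_PER_MILLION = 25.00   # $ per 1M output tokens
SECONDS_PER_DAY = 86_400
DAYS_PER_MONTH = 30

def daily_cost(tokens_per_sec: float) -> float:
    """Dollar cost of one day of continuous generation at the given speed."""
    tokens_per_day = tokens_per_sec * SECONDS_PER_DAY
    return tokens_per_day / 1_000_000 * RATE_PER_MILLION

for speed in (5, 15, 30, 60):
    d = daily_cost(speed)
    print(f"{speed:>3} tok/s -> ${d:,.2f}/day  ${d * DAYS_PER_MONTH:,.2f}/month")
```

Running it reproduces the table exactly, e.g. 5 tok/s gives $10.80/day and $324.00/month.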

Claude Code is absolutely way faster, but you can also spend thousands of dollars per month. In other words, if you are a person who needs Claude and bulk-automates, then house LLMs are something to seriously look at. We didn't do this with a $5,000 DGX Spark; we are pulling this off on a used 4080ti we picked up off Facebook Marketplace. Claude's output cost per token is currently $25/million, and even at a $15/million input cost, what isn't told is that every recursive query feeds its previous thoughts back in, which can push usage costs upwards of $50/day. We discovered that a single query, even on openrouter.ai with a cheap LLM, cost us 74 cents, and it only answered a single question. We typically need about 80 questions a day, so in this scenario the usage cost runs roughly $59/day (80 × $0.74), way outside most people's budgets.

Yes, corporations with $40M budgets will just retain talent and write the check, but home producers who can save money should do so.

In Summary

There are real money savings in running these house LLMs because our token costs are now $0. But YES, you will need to adapt and rethink how you code. Small steps. Don't ask it to create a database API; take the time to write out the small steps of each function of that API. You should be verifying the code anyway.

Our plan is starting to formulate.

  • Break the tasks up into small contextual pieces and plan them.
  • Recursively feed each task, in a series, into the LLM and let it work while you sleep.
  • By preplanning your software weeks in advance, you can have it work all day and night: while you are driving, while you are sleeping, while you are reviewing its answers. And its cost is $0.
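The plan above can be sketched as a small overnight task runner. This is a hedged sketch, not our finished tooling: it assumes the llama-server started earlier is reachable at 192.168.1.3 on the default port 8080 and exposes llama.cpp's standard `/completion` endpoint; the task strings and output file name are made-up placeholders.

```python
# Hypothetical overnight task runner: feeds a preplanned list of small,
# self-contained prompts to a local llama-server one at a time and logs
# each answer to a JSONL file for morning review.
import json
import urllib.request

SERVER = "http://192.168.1.3:8080/completion"  # assumed host/port

def make_payload(task: str, max_tokens: int = 1024) -> dict:
    # Keep each task small so prompt + answer fit the 8192-token sweet spot.
    return {"prompt": task, "n_predict": max_tokens, "temperature": 0.7}

def run_tasks(tasks: list[str], out_path: str = "answers.jsonl") -> None:
    with open(out_path, "a") as out:
        for task in tasks:
            req = urllib.request.Request(
                SERVER,
                data=json.dumps(make_payload(task)).encode(),
                headers={"Content-Type": "application/json"},
            )
            with urllib.request.urlopen(req) as resp:
                answer = json.load(resp).get("content", "")
            out.write(json.dumps({"task": task, "answer": answer}) + "\n")

# Example usage (placeholder tasks, one small step each):
# run_tasks([
#     "Write a Python function insert_user(conn, name, email) for sqlite3.",
#     "Write a unit test for insert_user using an in-memory sqlite3 database.",
# ])
```

Each task is one small, verifiable step, matching the "small contextual pieces" approach: the queue drains sequentially while you sleep, and you review `answers.jsonl` in the morning.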

Onto Part 4... To be written soon!

Linux Rocks Every Day