Using local LLMs vs APIs

One of the questions I get asked about the LLM for Mortals book is: if you are working with sensitive data, why spend all this time showing how to use public APIs? Should you not be showing how to use local models instead?

I do have several examples of using local models in the book: GLiNER for named entity recognition, docling for OCR, and ChromaDB as a local vector database, all of which I have used in real-life production applications. My friend and colleague Gio has a recent blog post on using the newest Qwen 3.5 9 billion parameter model for text classification, routing 311 calls to the correct department based on their descriptions.

Although this model can run on Gio’s consumer hardware (a GPU with 8 GB of VRAM), it is slow, taking 20 to 30 seconds to process a single record. For comparison, I used OpenRouter to call the same model through an API (see code and results here). For the twenty cases that Gio classified, the API takes less than a second per record on average.
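For readers who want to replicate this, OpenRouter exposes an OpenAI-compatible chat completions endpoint, so the standard `openai` client works against it. Here is a minimal sketch; the model slug, department labels, and prompt wording are my own illustrative assumptions, not Gio’s exact setup:

```python
# Sketch of calling a small Qwen model through OpenRouter's
# OpenAI-compatible endpoint. Model slug, labels, and prompt
# are illustrative assumptions, not Gio's exact setup.
import os

DEPARTMENTS = ["Sanitation", "Transportation", "Parks", "Housing"]  # example labels

def build_prompt(description: str) -> str:
    """Ask the model to pick exactly one department for a 311 call."""
    labels = ", ".join(DEPARTMENTS)
    return (
        f"Classify this 311 call into one of: {labels}.\n"
        f"Call description: {description}\n"
        "Answer with the department name only."
    )

def classify(description: str) -> str:
    """Send one record to OpenRouter and return the predicted department."""
    from openai import OpenAI  # pip install openai

    client = OpenAI(
        base_url="https://openrouter.ai/api/v1",
        api_key=os.environ["OPENROUTER_API_KEY"],
    )
    response = client.chat.completions.create(
        model="qwen/qwen3-8b",  # substitute whatever slug OpenRouter lists for your model
        messages=[{"role": "user", "content": build_prompt(description)}],
        max_tokens=30,  # routing labels are short, so cap output tokens
    )
    return response.choices[0].message.content.strip()
```

Because OpenRouter speaks the OpenAI wire format, switching between models (or providers) is just a change to `base_url` and `model`.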

You can classify all twenty examples sequentially via the API in the time it takes to process one example in Gio’s local setup. (I do not see documented throughput quotas for OpenRouter; it is quite possible you could also send all twenty at once in parallel to get the total processing time down to about one second.)
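The fan-out itself is a few lines with the standard library. In this sketch, `classify` is a dummy stand-in for a real per-record API call; with real network calls, total wall-clock time is roughly one request’s latency rather than twenty:

```python
# Pattern for sending all 20 classification requests concurrently.
# `classify` is a placeholder for a real per-record API call.
from concurrent.futures import ThreadPoolExecutor

def classify(description: str) -> str:
    # Placeholder: replace with a real OpenRouter/OpenAI-compatible call.
    return "Sanitation" if "trash" in description.lower() else "Transportation"

descriptions = [f"record {i}: trash pileup" for i in range(20)]  # dummy inputs

# One thread per in-flight request; threads work well here because
# the real workload is I/O-bound (waiting on the API).
with ThreadPoolExecutor(max_workers=20) as pool:
    labels = list(pool.map(classify, descriptions))

print(len(labels))  # one result per record, in input order
```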

Many of the individuals in my network are concerned about paying for API usage, but the APIs are incredibly cheap. This particular model on OpenRouter costs ten cents per one million input tokens and fifteen cents per one million output tokens. In this sample, the mean input is less than 500 tokens and the mean output is less than 30 tokens, so you can process close to 20,000 records for $1 with this model.

To put that in perspective, New York City had 3.7 million 311 calls in 2025. You could process the entire annual volume of 311 calls for roughly $200 with this model. Actual GPU costs are pretty variable at the moment, but it would take several years of processing 311 calls in this manner for the API costs to eclipse the cost of purchasing even a single GPU. (And this ignores that you would need multiple machines to handle that volume, let alone electricity and upkeep.) Clearly, using the API in this scenario is the better deal price-wise relative to purchasing your own compute.
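The arithmetic behind these figures, using the sample’s upper bounds of 500 input and 30 output tokens per record (so the true cost is slightly lower):

```python
# Back-of-the-envelope cost check using the OpenRouter prices quoted above:
# $0.10 per million input tokens, $0.15 per million output tokens.
INPUT_PRICE = 0.10 / 1_000_000   # dollars per input token
OUTPUT_PRICE = 0.15 / 1_000_000  # dollars per output token

# Upper bounds from the sample: mean input < 500 tokens, mean output < 30.
cost_per_record = 500 * INPUT_PRICE + 30 * OUTPUT_PRICE
records_per_dollar = 1 / cost_per_record

nyc_annual_calls = 3_700_000
annual_cost = nyc_annual_calls * cost_per_record

print(f"${cost_per_record:.7f} per record")      # about $0.0000545
print(f"{records_per_dollar:,.0f} per dollar")   # close to 20,000 records
print(f"${annual_cost:.2f} for NYC annually")    # roughly $200
```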

So when do local models make sense? There are really two questions you should ask. Question 1: is the smaller model accurate enough for your use case? I do not doubt the Qwen 9 billion parameter model (or maybe even the 4 billion or 1 billion variants) could accomplish this routing task reasonably well. Question 2: can the task be done in batch, using compute you already have?

So if you are an analyst who already has a GPU and you want to classify 1,000 documents, just let the computer go brr overnight and be done with it. You have only saved about 10 cents relative to using the API, but there is not much harm in using the compute you already have available.

For folks with sensitive data applications, there are reasonable solutions with the major foundation model providers through AWS, Azure, or Google Cloud, so data sensitivity alone is not an excuse to avoid APIs.