TG
ai·local-llm·en·7 min read

llama.cpp hits 100k stars: the local agents era has begun

Why llama.cpp's milestone and Pi running Qwen3.6 on a MacBook in airplane mode mark the start of the second AI revolution — local, sovereign, and plural.

Ler em português
llama.cpp hits 100k stars: the local agents era has begun

Two posts on X, in the same week, capture pretty well where local AI landed in 2026.

The first one is from Georgi Gerganov, creator of llama.cpp, celebrating the project hitting 100k stars on GitHub. He uses the milestone to write an honest reflection on the state of the local LLM movement.

The second one is from Julien Chaumond, CTO of Hugging Face, showing the Pi coding agent running Qwen3.6 27B via llama.cpp on a MacBook Pro, in full airplane mode, tackling non-trivial tasks in HF's own code repositories — at quality he describes as "very, very close to the latest Opus on Claude Code".

That's not a coincidence. It's a signal.

The milestone: 100k stars and 1500+ contributors

llama.cpp isn't a new project. But getting to 100k stars with 1500+ contributors, and doing it the way it does — in C/C++, vendor-neutral, running on basically any hardware — says a lot about what kind of software stack the community is betting on.

As Georgi puts it:

"The technology is too important to be locked to a vendor. It has to be developed in the open, by the community, alongside independent hardware vendors."

That's the point people miss when they polarize between "local LLM is the future" and "local LLM is a toy". The right question isn't whether it's as good as GPT-5 today. It's what kind of infrastructure do you want sustaining software for the next 10 years.

What changed in 12 months

A year ago, the argument against local agents was basically one thing: memory and compute cost.

Available models were too expensive for long-context, multi-tool, validation-loop work. There was no obvious path from "this runs on my laptop" to "this does something useful".

Georgi himself admits he didn't expect the agentic era to reach local territory this fast. What flipped the script:

  • gpt-oss moved the floor — the first time local tool calling really worked inside the constraints of a daily-driver device.
  • Qwen3.5 and Qwen3.6 are a qualitative jump, covering a broad range of sizes for different machines.
  • GLM-4.7-Flash, MiniMax-M2.5 and Coder variants widened the practical menu.

And Julien closing this loop with Pi + Qwen3.6 27B + llama.cpp on a MacBook Pro shows the real result: actual coding agent, in actual production code, with no internet.

We don't need frontier intelligence for most tasks

This is maybe Georgi's most important argument, and the most ignored in the hype:

  • we don't need frontier intelligence to automate search and email
  • we don't need trillion-parameter models to summarize docs
  • we don't need GPU data centers to turn off the garage lights

There is a useful level of intelligence that fits in a MacBook. And a huge amount of real-world problems that fits inside that level.

Comparing local vs. hosted capability at one specific moment is the kind of debate that ages badly. What matters is the trajectory, and the trajectory is clear: the share of useful work you can do offline is growing fast.

Another point of Georgi's worth pulling out: a lot of what looks like "bad local model" is, in practice, bad harness.

From the typed task in the client to the final token, there's a long chain that's still fragile:

  • per-model chat template parsing
  • prompt construction
  • tool-calling adapters
  • context assembly
  • subtle inference bugs

Most people trying to "plug Qwen3.5 into Claude Code" and getting frustrated are bumping into exactly that chain. Not the model.

His recommendation is pragmatic:

  1. start with full-quality models that fit your hardware
  2. know exactly what your harness does — write your own, or use the llama-server webui (now with native MCP support)
  3. only after it works, look for optimizations (quantize for speed, tune params)

That last point connects directly to what I wrote in Agent Harness Engineering in practice: the agent is model + harness. Swapping the model without understanding the harness rarely solves anything.

Models that ship today

For anyone wanting to actually try it, here's Georgi's own list of models he uses in real apps (chat, MCP, coding):

  • gpt-oss-120b
  • Qwen3-Coder-30B
  • GLM-4.7-Flash
  • MiniMax-M2.5
  • Qwen3.5-35B-A3B
  • (plus Qwen3.6 27B from Julien's setup)

For most of them, Q8_0 quantizations preserve original quality well. Model size stopped being the wall; what matters now is picking what matches the hardware you already have.

It's not just llama.cpp: the ggml.org family

Worth remembering llama.cpp isn't alone. It's part of a broader ecosystem built around ggml — the C tensor library Georgi maintains — now living under the ggml-org GitHub organization.

Two projects in that family worth tracking:

  • llama.cpp — LLM inference in C/C++. 110k+ stars, 18k+ forks. Runs on CPU, NVIDIA/AMD/Intel GPU, Metal (Apple Silicon), Vulkan, and even embedded ARM. MIT.
  • whisper.cpp — C/C++ port of OpenAI's Whisper. ~50k stars. Fully local audio transcription and translation, in the same spirit: no API, no cloud, any decent hardware. MIT.

Put them together and you already have an end-to-end local voice assistant: whisper.cpp transcribes, llama.cpp thinks and replies. All on the same laptop, nothing leaving the machine.

That's the architectural pattern that gives the sovereignty conversation real weight: small, open, pluggable parts, no single-vendor dependency.

Airplane mode as a banner

What stands out in Julien's post isn't the model size. It's the airplane mode.

A coding agent attacking a real repository, in real code, calling no API at all:

  • Efficiency — no network latency, no provider queue.
  • Security — sensitive code never leaves the machine.
  • Privacy — no prompt shows up in someone else's log.
  • Sovereignty — your stack doesn't die when a vendor hikes prices, changes ToS or vanishes.

He calls it the "second AI revolution". The framing tracks: the first one was the explosion of frontier models via API. The second is what happens once enough capability stops being a monopoly of whoever owns the GPU farm.

And as he says, "most people haven't realized this yet." Those who do gain an edge before the market prices it in.

What to do now

If you work with applied AI, three practical moves for this week:

  1. Install llama.cpp and run llama-server locally with a model from the list above. Use its webui with MCP to feel tool calling working.
  2. Write (or audit) your harness. If you don't control the agent's steps, you don't control the outcome. Stop trying to make closed tools talk to open models by osmosis.
  3. Pick a use case that doesn't need frontier. Summarizing internal docs, classifying tickets, indexing and searching email, touching code in a repo. Start there before trying to "replace Claude Code".

Wrapping up

Will frontier AI live in data centers? Always. There are tasks where that's the right call.

But the assumption that all AI has to run out there is dying. And it's dying because:

  • user hardware got better
  • open models got better
  • the stack (llama.cpp + ggml) got robust
  • the community decided this is too important to depend on a vendor

100k stars on llama.cpp isn't a vanity score. It's a vote. About what kind of AI infrastructure the world wants to build.

If I had to bet, 2026 will be the year a lot of people look at their own stack and realize they didn't need to pay for cloud for half of what they're doing.

References

Thiago Marinho

May 16, 2026 · Brazil