ai·rag·software-engineering·7 min read

Why Markdown is the lingua franca of LLMs and what it changes for your RAG

Markdown is not just a human-friendly format — it's the syntax LLMs were trained on at scale. Why it improves table understanding, chunking, and RAG quality — and where the claim deserves nuance.

A strong claim has been circulating on LinkedIn and technical forums:

"Markdown is the preferred language of LLMs for structured data. Exporting a table as Markdown — instead of plain text — makes the model understand the relationship between rows and columns much better. That's why Docling exporting to Markdown isn't just convenience — it's a technical decision that improves RAG."

Can I confirm that? Mostly yes, but it deserves nuance, because "preferred language" is a big phrase and the fine print matters.

Why Markdown works so well with LLMs

The core premise holds. LLMs learned Markdown extensively from training data:

  • GitHub alone has billions of .md files — READMEs, docs, issues, wikis.
  • Stack Overflow, Reddit, Discourse, public Notion pages, and static blogs are all written in Markdown or a close dialect of it.
  • Modern technical documentation (Next.js, Vercel, FastAPI, LangChain) — Markdown.

When you write Markdown, you're writing in the model's native register. Tokens like #, ##, -, **, |, and ``` carry clear semantics learned from millions of examples: hierarchy, list, emphasis, table cell, code block.

Plain text flattens that information. The model has to infer structure — and often infers it wrong.

Tables: where the difference shows up in benchmarks

For tabular data, there's empirical evidence.

In studies comparing input formats for LLMs on table-comprehension tasks:

  • Markdown key-value reached around 61% accuracy.
  • Raw CSV stayed around 44% on the same tasks.

The intuition is direct: explicit structure gives the model cues for pairing each value with its header. In a pipe table, the separator row (| --- | --- |) and the repeated | delimiters mark the columns; in a key-value layout, every value sits right next to its field name. Raw CSV relies on comma position alone, which gets ambiguous when cells contain commas, line breaks, or nested values.
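To make the comparison concrete, here is a minimal sketch that renders the same rows as raw CSV, as a Markdown pipe table, and as a Markdown key-value layout. The function names and the "## Record N" layout are my own assumptions for illustration; the benchmarks mentioned above may have used a different key-value shape.

```python
import csv
import io


def to_csv(headers, rows):
    """Raw CSV: structure is carried only by comma position and quoting."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows([headers, *rows])
    return buf.getvalue().strip()


def _cell(value):
    # A literal pipe or newline would break a pipe table, so escape or flatten them.
    return str(value).replace("|", "\\|").replace("\n", " ")


def to_markdown_table(headers, rows):
    """Pipe table with the separator row placed under the header."""
    lines = [
        "| " + " | ".join(_cell(h) for h in headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    lines += ["| " + " | ".join(_cell(v) for v in row) + " |" for row in rows]
    return "\n".join(lines)


def to_markdown_kv(headers, rows):
    """One block per record, each value next to its field name (assumed layout)."""
    blocks = []
    for i, row in enumerate(rows, start=1):
        pairs = [f"{h}: {v}" for h, v in zip(headers, row)]
        blocks.append("\n".join([f"## Record {i}", *pairs]))
    return "\n\n".join(blocks)


if __name__ == "__main__":
    headers = ["name", "city", "notes"]
    rows = [["Ana", "Recife", "likes a, b"], ["Bo", "Porto", "x|y"]]
    for render in (to_csv, to_markdown_table, to_markdown_kv):
        print(render(headers, rows), end="\n\n")
```

Printing the three outputs side by side is a quick way to see what the model actually receives: in the CSV version the cell with a comma only survives thanks to quoting, while the other two keep the header-to-value link visible on every row.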

For RAG retrieval, header-based Markdown chunking has been reported to improve retrieval accuracy by up to 35% over unstructured text. The reasoning is simple: a heading is an explicit signal of "end of topic, start of another", which is exactly the boundary a chunker needs.
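Header-based chunking is simple enough to sketch with the standard library. The version below is a minimal illustration, not a replacement for a real splitter: Chunk and chunk_by_headings are hypothetical names, and it only handles ATX headings (#, ##, ...), skipping lines inside fenced code blocks so a "# comment" is not mistaken for a heading.

```python
import re
from dataclasses import dataclass

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
FENCE = "`" * 3  # code-fence marker, built this way to keep the example fence-safe


@dataclass
class Chunk:
    headings: list  # heading path, e.g. ["Guide", "Install"]
    text: str       # body text that sits under that path


def chunk_by_headings(markdown: str) -> list:
    chunks = []
    path = []      # stack of (level, title) for the current heading path
    body = []      # lines collected since the last heading
    in_fence = False

    def flush():
        text = "\n".join(body).strip()
        if text:
            chunks.append(Chunk([title for _, title in path], text))
        body.clear()

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()  # a heading closes the previous chunk
            level = len(match.group(1))
            # Drop headings at the same or deeper level before pushing the new one.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2)))
        else:
            body.append(line)
    flush()
    return chunks
```

Each chunk carries its full heading path, so the retriever can index "Guide > Install" together with the body instead of an anonymous slice of text.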

Where the claim deserves a caveat

"Preferred language" is too strong as a universal rule. Markdown is not always the best format — it depends on the task.

1. Typed structured extraction → JSON wins. If you need the model to return { "name": "...", "age": 32 }, force JSON with a schema. Markdown has no type contract.

2. Preserving complex web page structure → HTML can win. The HtmlRAG paper (WWW 2025) shows exactly this: for knowledge modeling in RAG systems built on web pages, HTML outperforms plain text because headers, attributes, and table structure inherent to the DOM get lost in conversion. Markdown recovers part of that, but not all (attributes, classes, metadata).

3. Markdown has a token cost. Syntax characters (|, ---, #, **) inflate token counts. In large-scale embedding pipelines, that matters — it's not a net win in every scenario.

4. Embeddings vs. LLM context are different stages. In many RAG pipelines, plain text is used for embeddings (embedding models tend to be robust to structural noise) and Markdown is delivered to the LLM at generation time. Markdown shines mostly in the context window, not necessarily in vectorization.
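Points 3 and 4 can be combined in one small sketch: keep the Markdown for the context window and derive a stripped plain-text version for the embedding step. This is a deliberately crude regex-based stripper under my own assumptions (markdown_to_plain and build_record are hypothetical names, and it only handles headings, bullets, bold, inline code, and pipe tables); a production pipeline would likely use a proper Markdown parser.

```python
import re

# Matches table separator rows such as "| --- | :---: |" (and bare "---" rules).
SEPARATOR_ROW = re.compile(r"^\s*\|?(\s*:?-{3,}:?\s*\|)*\s*:?-{3,}:?\s*\|?\s*$")


def markdown_to_plain(markdown: str) -> str:
    """Strip common Markdown syntax so only the words reach the embedder."""
    out = []
    for line in markdown.splitlines():
        if SEPARATOR_ROW.match(line):
            continue                                  # drop separator rows entirely
        line = re.sub(r"^\s*#{1,6}\s+", "", line)     # heading markers
        line = re.sub(r"^\s*[-*+]\s+", "", line)      # bullet markers
        line = re.sub(r"\*\*|`", "", line)            # bold and inline-code markers
        line = line.replace("|", " ")                 # table cell separators
        line = re.sub(r"\s+", " ", line).strip()      # collapse leftover whitespace
        if line:
            out.append(line)
    return "\n".join(out)


def build_record(chunk_markdown: str) -> dict:
    """Two views of one chunk: plain text to embed, Markdown to show the LLM."""
    return {
        "embed_text": markdown_to_plain(chunk_markdown),
        "context_text": chunk_markdown,
    }
```

The embed_text field is shorter than the original, which is where the token savings come from, while context_text keeps the structure for generation. Whether the stripped version actually retrieves better is something to measure on your own corpus.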

About Docling specifically

Docling's choice to export to Markdown is a solid technical decision — and it's not "just convenience." But the more precise reason is:

  • Markdown preserves semantic structure (heading hierarchy, tables, lists, reading order) that plain-text extraction destroys.
  • That structure serves two purposes: smart (header-aware) chunking and LLM comprehension at generation time.
  • Docling uses models like TableFormer to convert complex tables with high fidelity — exactly what most other tools break.
  • Clean output lets frameworks like LlamaIndex and LangChain consume it directly.

So yes: Docling → Markdown → RAG is a well-founded technical choice. Not because of the "preferred language" mystique, but because it preserves structural signals that survive the full pipeline all the way to the model.

Worth noting: the official Docling integration with LangChain offers two export modes:

  • ExportType.DOC_CHUNKS (default) — each chunk becomes a LangChain Document already carrying structured metadata (heading, page, bounding box).
  • ExportType.MARKDOWN — the whole document as a single Markdown string, typically combined with MarkdownHeaderTextSplitter to chunk by #/##/###.

The fact that the default is chunks-with-metadata, not raw Markdown, reinforces the core thesis of this post: the win is preserving structure — heading, hierarchy, reading order, table boundaries. Markdown is one way to carry that structure; explicit chunk metadata is another. In both cases, the common enemy is flattened plain text.
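The two modes can be sketched as below. This assumes the langchain-docling and langchain-text-splitters packages are installed; the function names and the "h1"/"h2"/"h3" metadata keys are my own choices for illustration, and the exact metadata each mode returns may vary by version, so check it against the release you use.

```python
def load_doc_chunks(file_path: str):
    """Default mode: one LangChain Document per Docling chunk, with metadata."""
    # Imported lazily because these are optional third-party dependencies.
    from langchain_docling import DoclingLoader
    from langchain_docling.loader import ExportType

    loader = DoclingLoader(file_path=file_path, export_type=ExportType.DOC_CHUNKS)
    return loader.load()


def load_markdown_chunks(file_path: str):
    """Markdown mode: export the whole document, then split on headings."""
    from langchain_docling import DoclingLoader
    from langchain_docling.loader import ExportType
    from langchain_text_splitters import MarkdownHeaderTextSplitter

    loader = DoclingLoader(file_path=file_path, export_type=ExportType.MARKDOWN)
    docs = loader.load()  # the full document as Markdown text

    splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")]
    )
    # Each resulting chunk records the headings it sits under in its metadata.
    return [chunk for doc in docs for chunk in splitter.split_text(doc.page_content)]
```

Running both on the same file and printing the metadata of the first few chunks is a quick way to see which structural signals each path preserves.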

When to use what — practical rule

| Scenario | Recommended format |
| --- | --- |
| Context delivered to the LLM (RAG) | Markdown |
| Structured extraction with fixed fields | JSON (with schema) |
| Crawling structure-rich web pages | Preserved HTML |
| Tables with headers and cells | Markdown or HTML table |
| Logs, dumps, dense data without visual layout | Plain text |
| Long documents with hierarchical sections | Markdown |
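For the structured-extraction case, "JSON with a schema" means the model's output gets checked against a type contract before anything downstream trusts it. Here is a minimal standard-library sketch; PERSON_SCHEMA and parse_typed are hypothetical names, and a real project would more likely use a JSON Schema validator or a typed model library.

```python
import json

# Hypothetical contract: field name -> required Python type.
PERSON_SCHEMA = {"name": str, "age": int}


def parse_typed(raw: str, schema: dict) -> dict:
    """Parse model output as JSON and enforce the field/type contract."""
    data = json.loads(raw)  # raises a ValueError subclass on invalid JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    missing = schema.keys() - data.keys()
    extra = data.keys() - schema.keys()
    if missing or extra:
        raise ValueError(f"missing={sorted(missing)} extra={sorted(extra)}")

    for key, expected in schema.items():
        value = data[key]
        # bool is a subclass of int in Python, so reject it for int fields.
        wrong_bool = expected is int and isinstance(value, bool)
        if wrong_bool or not isinstance(value, expected):
            raise ValueError(f"{key!r} should be {expected.__name__}, got {value!r}")
    return data
```

A reply like {"name": "Ana", "age": "32"} fails here because the age is a string, which is exactly the kind of contract a Markdown answer cannot express.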

Rephrasing the original claim

Instead of:

"Markdown is the preferred language of LLMs for structured data."

I'd say:

"Markdown is one of the most effective formats for preserving semantic structure in RAG pipelines, especially for tables, hierarchical documents, and any content that will be read by an LLM. For typed extraction, prefer JSON. To preserve full DOM fidelity, consider HTML."

Wrapping up

The intuition behind the original claim is correct: LLMs understand Markdown very well because they were trained on it at scale, and that has measurable impact on table comprehension and RAG retrieval quality.

But "lingua franca" doesn't mean "universally best format." It's a shared vocabulary — useful where the model needs to read structure, less useful where the model needs to emit typed structure or where you need to preserve full fidelity of a rich representation.

If you're building RAG today: start by exporting to good Markdown, chunk by headings, and measure. You'll feel the difference in the first benchmark.

Known alternatives in the PDF/Office → Markdown space worth putting on your shortlist and comparing yourself:

  • Docling — IBM Research, strong on tables via TableFormer.
  • Marker — popular in RAG communities, rivals Docling.
  • MarkItDown — Microsoft, covers PDF, Office, HTML, and images.

This is not a ranking — it's a starting point. Run all three on your corpus and measure retrieval plus generation quality before picking one.
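For the retrieval half of that measurement, a hit rate at k is a reasonable first metric: for each test question, did at least one relevant chunk show up in the top k results? The sketch below assumes you bring your own retrieve function and a small hand-labeled set of (query, relevant chunk IDs) pairs; hit_rate_at_k is a hypothetical helper, not part of any of the tools above.

```python
def hit_rate_at_k(cases, retrieve, k=5):
    """Fraction of queries with at least one relevant chunk in the top k.

    cases: iterable of (query, set_of_relevant_chunk_ids)
    retrieve: callable mapping a query to a ranked list of chunk ids
    """
    cases = list(cases)
    if not cases:
        return 0.0
    hits = 0
    for query, relevant_ids in cases:
        top_k = retrieve(query)[:k]
        if any(chunk_id in relevant_ids for chunk_id in top_k):
            hits += 1
    return hits / len(cases)


if __name__ == "__main__":
    # Toy data standing in for a real index built from one converter's output.
    ranking = {
        "refund policy": ["c3", "c1"],
        "pricing": ["c7"],
        "sla": ["c2", "c4"],
    }
    cases = [("refund policy", {"c1"}), ("pricing", {"c7"}), ("sla", {"c9"})]
    print(hit_rate_at_k(cases, lambda q: ranking[q], k=2))
```

Chunk IDs will differ between converters, so in practice the relevance labels need to be rebuilt per converter, or replaced by a check on whether the expected answer text appears in the retrieved chunk.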

Academic endorsement

Beyond the community's practical experience, recent literature also backs the core points of this post:

In short: community intuition is aligned with what's coming out of the labs. Markdown isn't magic — it's the lowest common denominator between "humans can read it" and "LLMs have seen millions of examples of it."

Thiago Marinho

May 21, 2026 · Brazil