You've probably seen documents wreck AI systems.
You feed a scanned PDF into a model, and it confidently hallucinates half the table, messes up the formula, and completely ignores the stamp in the corner. You throw a contract at it and get back a wall of text with no structure. You try to extract fields from an invoice and end up fixing more than you extracted.
Document understanding has been the quiet embarrassment of the AI boom. Models that can write poetry and pass bar exams still struggle with a two-column research paper.
GLM-OCR, released last week by Zhipu AI and Tsinghua University, is a serious attempt to fix that, and what makes it interesting isn't just the benchmark scores. It's how small it is.
We're talking 0.9 billion parameters. For reference, most capable multimodal models start at 7B and go up from there. GLM-OCR is doing competitive document parsing at a fraction of that size, and the paper explains exactly how.
Let's get into it.
Why Document OCR Is Still a Hard Problem
Before we talk about GLM-OCR, it helps to understand why this problem is harder than it looks.
When most people hear "OCR," they picture text recognition: scan a page, get the words back. Tools like Tesseract have done that reasonably well for decades. But modern document understanding is a completely different beast.
Real documents are messy. A single page might contain:
- Multi-column flowing text that doesn't follow a simple reading order
- Tables with merged cells and nested headers
- Mathematical formulas with subscripts, superscripts, and Greek letters
- Handwritten annotations in the margins
- Official seals or stamps layered on top of text
- Code blocks with specific indentation that matters
- Mixed languages in the same paragraph
Getting all of that right, and outputting it in a structured format a downstream system can actually use, requires something much more sophisticated than traditional OCR pipelines.
Recent large multimodal models (LMMs) handle this better. Models like GPT-4o and Gemini can parse complex documents reasonably well. But they're expensive to run, slow for production workloads, and not something you can deploy on an edge device or fine-tune for your specific domain without significant resources.
GLM-OCR is designed to live in that gap: better than traditional OCR, cheaper and smaller than frontier multimodal models.
The Architecture: Three Pieces That Work Together
GLM-OCR is built from three components. Understanding each one helps you see why the design choices make sense.
1. The Vision Encoder: CogViT (0.4B)
The first component is a visual encoder called CogViT, which takes the raw image of a document page and turns it into a set of numerical representations (called embeddings) that capture what's visually present.
Think of it like the model's eyes. It looks at the page and compresses what it sees into a form the language part of the model can understand. At 0.4B parameters, it's already a substantial model on its own. It was pretrained specifically on image-text pairs and grounding tasks, meaning it learned to connect visual regions with language descriptions.
2. The Cross-Modal Connector
This is a small bridge layer that sits between the vision encoder and the language decoder. Its job is translation: it takes the visual embeddings from CogViT and maps them into the same representational space that the language model speaks.
It's lightweight by design. The real processing happens on either side of it; the connector just makes sure the two components can communicate.
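To make the connector's role concrete, here is a toy sketch that assumes it is a single linear projection. The article only says it is a small bridge layer, so the real design may well differ; the dimensions and random weights are made up purely for illustration.

```python
import random

def make_projection(in_dim, out_dim, seed=0):
    """Random (out_dim x in_dim) matrix standing in for learned connector weights."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]

def project(visual_tokens, weights):
    """Map each visual embedding (width in_dim) to the decoder's width (out_dim)."""
    return [
        [sum(w * x for w, x in zip(row, token)) for row in weights]
        for token in visual_tokens
    ]

# Hypothetical sizes: 4 visual tokens of width 8, decoder width 6.
VISION_DIM, DECODER_DIM = 8, 6
visual_tokens = [[float(i + j) for j in range(VISION_DIM)] for i in range(4)]
weights = make_projection(VISION_DIM, DECODER_DIM)
decoder_inputs = project(visual_tokens, weights)

print(len(decoder_inputs), "tokens of width", len(decoder_inputs[0]))
```

The number of tokens is unchanged; only their width changes, so the decoder can consume them alongside its own text embeddings.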
3. The Language Decoder: GLM (0.5B)
The final piece is a 0.5B parameter language model, a version of the General Language Model (GLM) architecture from Tsinghua. This is the part that generates the output: the Markdown text, the JSON structure, the extracted fields.
The language decoder has read a lot of text during pretraining. It knows what a properly formatted table looks like in Markdown. It knows the syntax of LaTeX formulas. It knows that an invoice has fields like vendor name, date, and total amount. That knowledge shapes what it produces when the visual encoder hands it a document to interpret.
Combined, the full model is 0.9B parameters: small enough to run locally on a modern laptop and to deploy with Ollama or vLLM without specialized infrastructure.
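A quick back-of-the-envelope check on the laptop claim. The component sizes come from the text above; the bytes-per-parameter values are the usual ones for each precision. The estimate covers weights only and ignores the connector, activations, and the KV cache, so treat it as a lower bound rather than a measured footprint.

```python
ENCODER_PARAMS = 0.4e9  # CogViT, from the article
DECODER_PARAMS = 0.5e9  # GLM decoder, from the article

BYTES_PER_PARAM = {"fp32": 4, "fp16/bf16": 2, "int8": 1}

def weight_memory_gb(num_params, bytes_per_param):
    """Memory needed just to hold the weights, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

total_params = ENCODER_PARAMS + DECODER_PARAMS
for precision, nbytes in BYTES_PER_PARAM.items():
    print(f"{precision:>9}: {weight_memory_gb(total_params, nbytes):.1f} GB")
```

At 16-bit precision the weights come to about 1.8 GB, which is comfortably within the memory of most recent laptops.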
The Clever Part: Multi-Token Prediction
Here's one of the most technically interesting things GLM-OCR does, and it's worth slowing down on.
Standard language models are autoregressive. That means they generate one token at a time, each token conditioned on everything that came before it. This is fine for creative generation, where each word genuinely depends on the last. But for OCR, it's inefficient.
Why? Because OCR output is largely deterministic. If the document says "Invoice Number: 4829-B," the model shouldn't need to agonize over each character. The pattern is clear. The output is constrained.
GLM-OCR addresses this with Multi-Token Prediction (MTP). Instead of generating one token per step, the model is trained to predict multiple tokens simultaneously. During training, it learns to predict 10 tokens at once. At inference time, it actually generates 5.2 tokens per decoding step on average.
Averaging 5.2 tokens per step instead of one translates, per the paper, to approximately 50% higher throughput compared to standard autoregressive decoding, without sacrificing output quality. The paper also notes that the authors use a parameter-sharing scheme across draft models to keep the memory overhead manageable.
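The intuition can be shown with a toy draft-and-verify loop. This is a sketch in the spirit of speculative decoding, not the paper's implementation: the "model" here simply knows the target string, and `make_noisy_drafter` is a hypothetical drafter that guesses wrong at fixed intervals.

```python
def greedy_decode(target):
    """Baseline: emit exactly one token per decoding step."""
    out, steps = [], 0
    while len(out) < len(target):
        out.append(target[len(out)])
        steps += 1
    return out, steps

def make_noisy_drafter(target, error_every):
    """Hypothetical drafter: proposes up to k tokens, wrong at every Nth position."""
    def draft(prefix, k):
        start = len(prefix)
        proposal = []
        for i in range(start, min(start + k, len(target))):
            wrong = (i + 1) % error_every == 0
            proposal.append("?" if wrong else target[i])
        return proposal
    return draft

def multi_token_decode(target, draft_fn, k):
    """Each step: draft up to k tokens, keep the longest correct prefix,
    then emit one verified token so decoding always makes progress."""
    out, steps = [], 0
    while len(out) < len(target):
        pos = len(out)
        draft = draft_fn(out, k)
        accepted = 0
        for offset, token in enumerate(draft):
            if pos + offset < len(target) and token == target[pos + offset]:
                accepted += 1
            else:
                break
        out.extend(draft[:accepted])
        if len(out) < len(target):
            out.append(target[len(out)])
        steps += 1
    return out, steps

target = list("Invoice Number: 4829-B")
baseline_out, baseline_steps = greedy_decode(target)
mtp_out, mtp_steps = multi_token_decode(target, make_noisy_drafter(target, 6), k=4)
print(f"one-token steps: {baseline_steps}, multi-token steps: {mtp_steps}")
```

Both loops produce the same text; the multi-token loop just needs fewer steps. Fewer steps does not translate one-to-one into wall-clock speedup, which is consistent with the gap between 5.2 tokens per step and the roughly 50% throughput gain quoted above.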
In practical terms: GLM-OCR processes 1.86 PDF pages per second. For a production document pipeline handling thousands of pages per day, that difference is significant.
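To put that figure in daily terms, here is the arithmetic. It assumes the quoted rate is sustained around the clock on a single instance, which the article does not claim; real throughput will depend on hardware and page complexity.

```python
PAGES_PER_SECOND = 1.86  # figure quoted in the article
SECONDS_PER_DAY = 24 * 60 * 60

pages_per_day = PAGES_PER_SECOND * SECONDS_PER_DAY

def hours_to_process(num_pages, pages_per_second=PAGES_PER_SECOND):
    """Wall-clock hours to process num_pages at a constant rate."""
    return num_pages / pages_per_second / 3600

print(f"pages per day at a sustained rate: {pages_per_day:,.0f}")
print(f"hours for a 100,000-page backlog: {hours_to_process(100_000):.1f}")
```

Under those assumptions a single instance clears roughly 160,000 pages a day, and a 100,000-page backlog takes about 15 hours.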
The Two-Stage Pipeline
Another key design choice is how GLM-OCR handles a full document page: it doesn't just look at the whole thing at once and hope for the best.
Stage 1: Layout Analysis
Before any text recognition happens, the system runs a separate layout detection model called PP-DocLayout-V3. This model scans the page and identifies its structural regions: where the text columns are, where the tables live, where figures and formulas are located, where headers and footers fall.
The output of this stage is essentially a map of the page: "here's a table from coordinates A to B, here's a paragraph block from C to D, here's a formula region at E."
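One way to picture that map is as a list of typed bounding boxes. The `Region` schema below is hypothetical and is not PP-DocLayout-V3's actual output format; the pixel coordinate convention and the naive ordering rule are assumptions too.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Region:
    kind: str    # e.g. "text", "table", "formula", "figure"
    bbox: tuple  # (x0, y0, x1, y1) in pixels, origin at top-left (assumed)

def reading_order(regions):
    """Naive ordering: top-to-bottom, then left-to-right.
    Multi-column pages need something smarter than this."""
    return sorted(regions, key=lambda r: (r.bbox[1], r.bbox[0]))

layout = [
    Region("table", (50, 400, 550, 700)),
    Region("text", (50, 100, 550, 380)),
    Region("formula", (200, 720, 400, 760)),
]
ordered = reading_order(layout)
for region in ordered:
    print(region.kind, region.bbox)
```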
Stage 2: Parallel Region Recognition
With the map in hand, GLM-OCR processes each detected region in parallel. The table gets processed as a table. The formula gets processed as a formula. The paragraph gets processed as flowing text.
This is fundamentally different from how a general vision-language model would approach the same page, which is typically to encode the whole image and then generate text in a left-to-right sweep. The two-stage approach is both more accurate (each region gets specialized attention) and more efficient (regions can be processed concurrently).
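The dispatch pattern looks roughly like the sketch below. The three recognizer functions are hypothetical stand-ins for model calls on cropped regions, and a thread pool is used only to illustrate concurrency; a real GPU deployment would more likely batch the regions into one forward pass.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-type recognizers; each would really call the model
# on the cropped region with a type-specific prompt.
def recognize_text(crop):
    return f"paragraph from {crop}"

def recognize_table(crop):
    return f"| table from {crop} |"

def recognize_formula(crop):
    return f"$$ formula from {crop} $$"

RECOGNIZERS = {
    "text": recognize_text,
    "table": recognize_table,
    "formula": recognize_formula,
}

def recognize_region(region):
    """Route one (kind, crop) pair to the recognizer for its type."""
    kind, crop = region
    return RECOGNIZERS[kind](crop)

def recognize_page(regions, max_workers=4):
    """Process regions concurrently; map() returns results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(recognize_region, regions))

regions = [("text", "crop_0"), ("table", "crop_1"), ("formula", "crop_2")]
results = recognize_page(regions)
print("\n".join(results))
```

Because `map()` preserves input order, the results can be stitched back together in reading order without extra bookkeeping.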
Two Operating Modes: Parsing vs. Extraction
GLM-OCR isn't a single-mode system. The paper describes two distinct use cases with different output paths.
Document Parsing is for when you want the full content of a document in a structured format. You feed the pipeline a page image; it runs layout detection, processes all the regions, and assembles the output as Markdown or JSON. This is what you'd use to digitize a research paper, convert a scanned report, or build a knowledge base from archival documents.
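The final assembly step in parsing mode might look like the sketch below. The block schema and the per-type formatting rules are assumptions made for illustration; the article does not describe the actual output schema.

```python
import json

def block_to_markdown(block):
    """Format one recognized block; text and tables are assumed to arrive
    already Markdown-formatted from the recognizer."""
    kind, content = block["kind"], block["content"]
    if kind == "heading":
        return f"# {content}"
    if kind == "formula":
        return f"$$\n{content}\n$$"
    return content

def to_markdown(blocks):
    """Join blocks (already in reading order) with blank lines."""
    return "\n\n".join(block_to_markdown(b) for b in blocks)

def to_json(blocks):
    """Alternative output path: keep the block structure explicit."""
    return json.dumps({"blocks": blocks}, ensure_ascii=False, indent=2)

blocks = [
    {"kind": "heading", "content": "Results"},
    {"kind": "formula", "content": "E = mc^2"},
    {"kind": "text", "content": "Energy and mass are equivalent."},
]
print(to_markdown(blocks))
```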
Key Information Extraction (KIE) is different. Instead of transcribing everything, the goal is to pull out specific fields: vendor name, invoice total, contract date, patient name. For KIE, the model skips the two-stage pipeline and instead takes the full document image plus a task prompt, then directly generates a JSON object with the requested fields.
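On the calling side, KIE boils down to building a prompt and validating the JSON that comes back. The prompt wording and both helper functions below are hypothetical; the article does not show the prompt format the model was actually trained on.

```python
import json

def build_kie_prompt(fields):
    """Hypothetical task prompt listing the fields to extract."""
    field_list = ", ".join(fields)
    return (
        "Extract the following fields from the document and answer with a "
        f"single JSON object whose keys are exactly: {field_list}. "
        "Use null for any field that is not present."
    )

def parse_kie_response(raw, fields):
    """Parse the model output and keep only the requested fields.
    Missing fields become None; extra keys are dropped."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    return {name: data.get(name) for name in fields}

fields = ["vendor_name", "total", "date"]
prompt = build_kie_prompt(fields)
# Stand-in for a model response, to show the validation step.
raw_response = '{"vendor_name": "Acme", "total": "42.00", "extra": 1}'
record = parse_kie_response(raw_response, fields)
print(record)
```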
The distinction matters because it reflects how document AI is actually used in the real world. Sometimes you need everything. Sometimes you need exactly one thing.
How It Was Trained
The training recipe has four stages, with the second split into two sub-stages, and the progression is deliberate.
Stage 1 trains the vision encoder in isolation on image-text pairs and grounding data. The goal is to give it a solid visual foundation before it ever sees document-specific content.
Stage 2.1 is multimodal pretraining: the vision encoder and language decoder are jointly trained on image-text pairs, document parsing data, grounding tasks, and visual question answering (VQA). This is where the model learns to connect images to language in a general sense.
Stage 2.2 introduces the MTP objective. The model starts learning to predict multiple tokens per step, building the faster decoding capability into its weights rather than adding it as a post-hoc trick.
Stage 3 is supervised fine-tuning on OCR-specific tasks: text recognition, formula transcription, table structure recovery, and KIE. The model sees the specific output formats it will be expected to generate in production.
Stage 4 applies reinforcement learning, specifically a method called GRPO (Group Relative Policy Optimization). This is where the training gets genuinely interesting.
Instead of using a single reward signal, the team designed task-specific rewards:
- For text recognition: Normalized Edit Distance (how similar is the predicted text to the ground truth, character by character)
- For formula recognition: CDM score (a metric that compares the visual rendering of predicted vs. ground truth formulas)
- For table recognition: TEDS score (Tree-Edit-Distance-based Similarity, which measures how well the predicted table structure matches the original)
- For KIE: Field-level F1 (how many of the extracted fields are correct)
They also added structural penalties: repetition penalties to stop the model from looping, malformed structure penalties, and JSON validation constraints to ensure outputs are parseable.
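Two of these rewards are simple enough to sketch directly. The code below is an illustration, not the paper's reward code: turning edit distance into a reward as one minus the normalized distance, the exact-match rule for fields, and the penalty value for malformed JSON are all assumptions. CDM and TEDS are omitted because they need formula rendering and tree edit distance respectively.

```python
import json

def edit_distance(a, b):
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def text_reward(pred, truth):
    """1 - normalized edit distance, so 1.0 means a perfect transcription."""
    if not pred and not truth:
        return 1.0
    return 1.0 - edit_distance(pred, truth) / max(len(pred), len(truth))

def field_f1(pred, truth):
    """Field-level F1 with exact-match values (matching rule is assumed)."""
    correct = sum(1 for k, v in pred.items() if k in truth and truth[k] == v)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(truth) if truth else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def kie_reward(raw_output, truth, malformed_penalty=-1.0):
    """Unparseable or non-object output gets a flat penalty (value assumed)."""
    try:
        pred = json.loads(raw_output)
    except json.JSONDecodeError:
        return malformed_penalty
    if not isinstance(pred, dict):
        return malformed_penalty
    return field_f1(pred, truth)

truth = {"vendor": "Acme", "total": "12"}
print(text_reward("Invoice 4829-8", "Invoice 4829-B"))
print(kie_reward('{"vendor": "Acme", "total": "10"}', truth))
print(kie_reward("not json", truth))
```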
The effect is that the model doesn't just learn to be "generally good." It learns what correctness specifically means for each task it might face.
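For context on how such rewards are consumed: GRPO, as introduced in the DeepSeekMath paper, samples a group of outputs for the same input and scores each one relative to the rest of its group. A minimal sketch of that normalization follows. The use of population standard deviation and the `eps` value are assumptions, and the article does not say how GLM-OCR's training configures this step.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four sampled transcriptions of the same page.
rewards = [0.90, 0.70, 0.95, 0.40]
advantages = group_relative_advantages(rewards)
for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f} advantage={a:+.2f}")
```

Outputs that beat their group's average get a positive advantage and are reinforced; below-average ones are discouraged. No separate value network is needed, which is part of the method's appeal for a small model.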
The Benchmark Numbers (With Context)
The paper reports strong scores across the major document understanding benchmarks:
| Benchmark | GLM-OCR Score |
|---|---|
| OmniDocBench v1.5 | 94.6 |
| OCRBench (Text) | 94.0 |
| UniMERNet (Formulas) | 96.5 |
| PubTabNet (Tables) | 85.2 |
| TEDS_TEST (Tables) | 86.0 |
| Nanonets-KIE | 93.7 |
| Handwritten-KIE | 86.1 |
These are genuinely strong numbers, especially for a sub-1B model. But the paper is transparent about where it doesn't lead.
On PubTabNet, MinerU 2.5 scores 88.4 vs. GLM-OCR's 85.2, so tables remain a relative weak point. On the KIE benchmarks, Gemini-3-Pro scores higher on both Nanonets-KIE and Handwritten-KIE, though those results are listed as reference-only and excluded from the ranking.
The honest framing: GLM-OCR is the best among open-source models of comparable size. It's not definitively the best at everything.
What This Means in Practice
Here's where I think the real significance lies: not in the benchmarks, but in what this architecture makes possible.
Local deployment is real. GLM-OCR runs on Ollama. You can run document parsing on your own hardware, with no API calls and no data leaving your infrastructure. For companies handling sensitive documents (legal, medical, financial), that's not a nice-to-have; it's a requirement.
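As a rough sketch of what a local call could look like, the snippet below targets Ollama's standard `/api/generate` endpoint using only the standard library. The model tag `glm-ocr` and the prompt text are assumptions; check `ollama list` and the model card for the real values.

```python
import base64
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL_TAG = "glm-ocr"  # assumption: the actual tag may differ

def build_request(image_bytes, prompt, model=MODEL_TAG):
    """Build a non-streaming generate request with one base64-encoded image."""
    payload = {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def parse_page(image_path, prompt="Convert this page to Markdown."):
    """Send one page image to the local server and return the generated text.
    Requires a running Ollama server with the model pulled."""
    with open(image_path, "rb") as f:
        request = build_request(f.read(), prompt)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

Nothing here leaves `localhost`, which is the whole point for sensitive documents.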
The cost floor just dropped. The paper mentions a MaaS API priced at 0.2 RMB per million tokens. That's roughly $0.027 per million tokens. For a document pipeline processing millions of pages, the economics are completely different from frontier model APIs.
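Here is what that price means for a concrete workload. The per-token prices are the article's figures; the tokens-per-page value is a hypothetical placeholder, and whether billing counts input tokens, output tokens, or both is not stated, so treat the result as an order-of-magnitude estimate.

```python
PRICE_RMB_PER_MILLION_TOKENS = 0.2    # from the article
PRICE_USD_PER_MILLION_TOKENS = 0.027  # the article's approximate conversion

def pipeline_cost(num_pages, tokens_per_page):
    """Total API cost for a batch of pages, in both currencies."""
    million_tokens = num_pages * tokens_per_page / 1e6
    return {
        "rmb": million_tokens * PRICE_RMB_PER_MILLION_TOKENS,
        "usd": million_tokens * PRICE_USD_PER_MILLION_TOKENS,
    }

# Hypothetical workload: one million pages at 1,000 tokens per page.
cost = pipeline_cost(num_pages=1_000_000, tokens_per_page=1_000)
print(f"{cost['rmb']:.0f} RMB (about ${cost['usd']:.0f})")
```

Under those assumptions, a million pages costs about 200 RMB, or roughly $27.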
Fine-tuning is accessible. The model supports fine-tuning through LLaMA-Factory. If you have domain-specific documents (specialized legal templates, medical forms, engineering schematics), you can adapt GLM-OCR to your exact use case without starting from scratch.
The inference stack is standard. vLLM, SGLang, and Ollama support means you can slot GLM-OCR into existing deployment infrastructure without custom tooling.
Taken together, this isn't just a research result. It's a production-ready system that competes with much larger models while remaining small enough to actually deploy.
The Broader Pattern
GLM-OCR is part of a trend that's worth paying attention to: the compression of capability into smaller models through smarter architecture and training, rather than brute-force scale.
The Multi-Token Prediction approach is similar in spirit to speculative decoding techniques that have emerged across the field. The task-specific RL reward design is a more sophisticated version of what makes systems like AlphaCode and DeepSeek-R1 tick. The two-stage pipeline reflects a lesson from classical computer vision that the LLM era briefly forgot: sometimes the right approach is structured preprocessing, not end-to-end learning on raw pixels.
What Zhipu AI and Tsinghua have done here is package a set of good ideas into a coherent, deployable system. The result is a model that punches well above its weight class.
If you're building anything that touches document processing (RAG pipelines, contract analysis, invoice automation, research paper ingestion), GLM-OCR is worth experimenting with.
Paper: arxiv.org/abs/2603.10910
Model: huggingface.co/zai-org/GLM-OCR
Repo: github.com/zai-org/GLM-OCR