VTCBench: Can Vision-Language Models Understand Long Contexts with Vision-Text Compression?
VTCBench is the first comprehensive benchmark specifically designed to evaluate the long-context understanding capabilities of Vision-Language Models (VLMs) within the Vision-Text Compression (VTC) paradigm.
VTC is an emerging framework that converts long texts into dense 2D visual representations (images), achieving token compression ratios of 2-10x compared to standard text tokenization. VTCBench rigorously assesses whether VLMs can actually understand this compressed information or if they are merely performing surface-level OCR.
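As a concrete illustration of the VTC ratio, the sketch below renders a text onto an image and computes $r_\text{VTC}$ as the number of text tokens divided by the number of vision tokens the rendered page would consume. The tokenizer, the 28-px patch size, and the crude line-wrapping are illustrative assumptions, not VTCBench's exact rendering pipeline.

```python
# Minimal sketch of the VTC ratio: text tokens vs. vision tokens.
# Tokenizer, patch size, and rendering details are illustrative assumptions,
# not VTCBench's exact pipeline.
import math
from PIL import Image, ImageDraw, ImageFont
import tiktoken

def render_text(text: str, width: int = 816, font_size: int = 16) -> Image.Image:
    """Render text onto a white canvas, wrapping at a fixed character count."""
    font = ImageFont.load_default()          # stand-in for Helvetica
    chars_per_line = 90                      # crude wrapping heuristic
    lines = [text[i:i + chars_per_line] for i in range(0, len(text), chars_per_line)]
    line_height = font_size + 4
    img = Image.new("RGB", (width, line_height * len(lines) + 20), "white")
    draw = ImageDraw.Draw(img)
    for row, line in enumerate(lines):
        draw.text((10, 10 + row * line_height), line, fill="black", font=font)
    return img

def vtc_ratio(text: str, image: Image.Image, patch: int = 28) -> float:
    """r_VTC = (# text tokens) / (# vision tokens for the rendered image)."""
    text_tokens = len(tiktoken.get_encoding("cl100k_base").encode(text))
    vision_tokens = math.ceil(image.width / patch) * math.ceil(image.height / patch)
    return text_tokens / vision_tokens

context = "long context goes here ... " * 200
page = render_text(context)
print(f"r_VTC ≈ {vtc_ratio(context, page):.2f}")
```

Under these assumptions, higher ratios come from packing more text into the same image area (smaller fonts) or from coarser vision patching, at the cost of legibility.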
🚀 Key Features
Three Core Tasks: Retrieval, Reasoning, and Memory
- VTC-Retrieval: A visual “Needle-In-A-Haystack” (NIAH) test. Requires locating “needles” (key-value pairs) embedded within a large “haystack” of distractors.
- VTC-Reasoning: Tests associative reasoning with minimized literal overlap between query and key, requiring inference of latent associations.
- VTC-Memory: Multi-turn conversations testing long-term memory retention.
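To make the retrieval task concrete, the sketch below plants a key-value "needle" at a chosen depth inside a distractor haystack and emits the matching QA query. The templates mirror the examples in the task table below; the helper names and the distractor logic are illustrative, not the benchmark's actual data generator.

```python
# Illustrative NIAH sample builder (not VTCBench's actual generator):
# plant a key-value "needle" at a chosen depth inside distractor text.
import random

NEEDLE_TEMPLATE = "One of the special magic numbers for {key} is: {value}."
QUERY_TEMPLATE = "What's the special magic number for {key}?"

def build_retrieval_sample(haystack_sentences: list[str], key: str, value: int,
                           depth: float = 0.5, seed: int = 0) -> dict:
    """Insert the needle at `depth` (0 = start, 1 = end) of the haystack."""
    rng = random.Random(seed)
    sentences = list(haystack_sentences)
    pos = int(depth * len(sentences))
    sentences.insert(pos, NEEDLE_TEMPLATE.format(key=key, value=value))
    # Optional distractor needle with a different key, as in the task table.
    sentences.insert(rng.randrange(len(sentences)),
                     NEEDLE_TEMPLATE.format(key="distracting-information",
                                            value=value - 1))
    return {
        "context": " ".join(sentences),          # later rendered to an image
        "question": QUERY_TEMPLATE.format(key=key),
        "answer": str(value),
    }

sample = build_retrieval_sample(["Essay sentence."] * 500,
                                key="long-context", value=2026)
```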
VTCBench-Wild
A "wild" variant designed to simulate real-world visual diversity (e.g., varying fonts, backgrounds, and layouts).

Two Evaluation Settings
- Predefined VTC Ratio: Predetermines the compression ratio (e.g., $r_\text{VTC}=2.0$) to compare model intelligence at a standardized information density.
- Predefined Rendering: Uses a fixed document format (12-pt Helvetica, 96 DPI) to simulate realistic document processing.
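The two settings fix different knobs: the first fixes the compression ratio and lets the rendering follow, while the second fixes the rendering and lets the ratio fall out of the layout. The sketch below expresses them as configurations; the field names and the 28-px patch size are assumptions, while the 2.0 ratio, 12-pt Helvetica, and 96 DPI values are the ones quoted above.

```python
# Illustrative configuration of the two evaluation settings.
# Field names and the patch size are assumptions; the ratio, font,
# and DPI values are quoted from the benchmark description.
from dataclasses import dataclass
import math

@dataclass
class PredefinedRatio:
    r_vtc: float = 2.0      # target compression ratio
    patch: int = 28         # assumed vision patch size (model dependent)

    def vision_token_budget(self, n_text_tokens: int) -> int:
        """How many vision tokens the rendered context may consume."""
        return math.ceil(n_text_tokens / self.r_vtc)

@dataclass
class PredefinedRendering:
    font: str = "Helvetica"
    font_size_pt: int = 12
    dpi: int = 96           # the resulting r_VTC then follows from the layout

# e.g., a 32k-token context at r_VTC = 2.0 must fit in ~16k vision tokens
print(PredefinedRatio().vision_token_budget(32_000))
```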
Extensive Model Coverage
Benchmarks 13 leading models, including GPT-5, Gemini-2.5 Pro, Gemma, Glyph, and the Qwen2.5, Qwen3, and InternVL3.5 series, among others.
Easily extensible to new models via our server-client evaluation framework.
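New models are typically plugged into such a harness behind an HTTP endpoint. The sketch below shows one common pattern, an OpenAI-compatible chat endpoint receiving the rendered context image plus the question. It is a generic illustration only; the endpoint URL, model name, and prompt are placeholders, and the actual interface of VTCBench's server-client framework is documented in the Usage Guide.

```python
# Generic OpenAI-compatible client sketch for a locally served VLM.
# NOT VTCBench's own client; endpoint, model name, and prompt are placeholders.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def ask_over_image(image_path: str, question: str, model: str = "my-vlm") -> str:
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
                {"type": "text", "text": question},
            ],
        }],
        temperature=0.0,
    )
    return resp.choices[0].message.content

print(ask_over_image("context_page.png",
                     "What's the special magic number for long-context?"))
```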
📊 Benchmark Tasks
| Task | Task Categories | Context Example | Evaluation Example |
|---|---|---|---|
| VTC-Retrieval (NIAH) | Lexical Matching, Multi-Hop Tracing, Aggregation | Dynamic query/key-value with types: word-word, word-number, uuid-number. *(visual example)*<br>(essays...)<br>One of the special magic numbers for long-context is: 2026.<br>... One of the special magic numbers for distracting-information is: 2025. | QA Variant:<br>Q: What's the special magic number for long-context?<br>A: 2026.<br>Completion Variant:<br>Prompt: One of the special magic numbers for long-context is:<br>Completion: 2026. |
| VTC-Reasoning (NIAH) | Associative Reasoning, Question-Answering | Dynamic query/key-value with types: event/action-person. *(visual example)*<br>(books...)<br>There was a vegan guest, named Katie. | One-Hop Reasoning:<br>Q: Which character cannot eat fish-based meals?<br>A: Katie.<br>Two-Hop Reasoning:<br>Q: Which character cannot eat Brandade meals?<br>A: Katie. |
| VTC-Memory (QA) | Memory, Question-Answering | No dynamic query/key-value; fully static. *(visual example)*<br>(conversations...)<br>Caroline: Researching adoption agencies—it's been a dream to have a family and give a loving home to kids who need it. | Q: What did Caroline research?<br>A: Adoption agencies. |
| VTCBench-Wild | All of the above | A more challenging variant of the above tasks, introducing visual diversity to simulate real-world document conditions. | |
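The gold answers in the examples above are short strings, so a simple normalized substring match is a reasonable first-pass metric for the QA-style variants. The sketch below is only illustrative; the exact scoring used by VTCBench may differ.

```python
# Illustrative first-pass scorer for the short-answer examples above.
# The official metric may differ; this is only a sketch.
import re

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", " ", text.lower())).strip()

def score(prediction: str, gold: str) -> float:
    """1.0 if the gold answer appears in the normalized prediction."""
    return float(normalize(gold) in normalize(prediction))

assert score("The special magic number is 2026.", "2026") == 1.0
assert score("Katie cannot eat fish-based meals.", "Katie") == 1.0
```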
📈 Main Findings

- Perception ≠ Comprehension: While many VLMs excel at OCR and simple retrieval, their performance collapses on reasoning and memory tasks compared to text-only LLMs.
- Length Fragility: VLM performance degrades significantly as the context length increases (e.g., from 1k up to 32k tokens).
- Parameter Sensitivity: VTC performance is highly sensitive to font size and to the spatial positioning of information.
🛠 Usage & Data
Please refer to the Usage Guide for instructions on how to use VTCBench.
📄 Citation
@misc{zhao2025vtcbenchvisionlanguagemodelsunderstand,
title={VTCBench: Can Vision-Language Models Understand Long Context with Vision-Text Compression?},
author={Hongbo Zhao and Meng Wang and Fei Zhu and Wenzhuo Liu and Bolin Ni and Fanhu Zeng and Gaofeng Meng and Zhaoxiang Zhang},
year={2025},
eprint={2512.15649},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2512.15649},
}