VTCBench: Can Vision-Language Models Understand Long Contexts with Vision-Text Compression?
VTCBench is the first comprehensive benchmark specifically designed to evaluate the long-context understanding capabilities of Vision-Language Models (VLMs) within the Vision-Text Compression (VTC) paradigm.
VTC is an emerging framework that converts long texts into dense 2D visual representations (images), achieving token compression ratios of 2-10x compared to standard text tokenization. VTCBench rigorously assesses whether VLMs can actually understand this compressed information or if they are merely performing surface-level OCR.
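As a concrete illustration of the VTC ratio, the sketch below renders a text onto an image and computes $r_\text{VTC}$ as the number of text tokens divided by the number of vision tokens the rendered page would consume. The tokenizer, the 28-px patch size, and the crude line-wrapping are illustrative assumptions, not VTCBench's exact rendering pipeline.

```python
# Minimal sketch of the VTC ratio: text tokens vs. vision tokens.
# Tokenizer, patch size, and rendering details are illustrative assumptions,
# not VTCBench's exact pipeline.
import math
from PIL import Image, ImageDraw, ImageFont
import tiktoken

def render_text(text: str, width: int = 816, font_size: int = 16) -> Image.Image:
    """Render text onto a white canvas, wrapping at a fixed character count."""
    font = ImageFont.load_default()          # stand-in for Helvetica
    chars_per_line = 90                      # crude wrapping heuristic
    lines = [text[i:i + chars_per_line] for i in range(0, len(text), chars_per_line)]
    line_height = font_size + 4
    img = Image.new("RGB", (width, line_height * len(lines) + 20), "white")
    draw = ImageDraw.Draw(img)
    for row, line in enumerate(lines):
        draw.text((10, 10 + row * line_height), line, fill="black", font=font)
    return img

def vtc_ratio(text: str, image: Image.Image, patch: int = 28) -> float:
    """r_VTC = (# text tokens) / (# vision tokens for the rendered image)."""
    text_tokens = len(tiktoken.get_encoding("cl100k_base").encode(text))
    vision_tokens = math.ceil(image.width / patch) * math.ceil(image.height / patch)
    return text_tokens / vision_tokens

context = "long context goes here ... " * 200
page = render_text(context)
print(f"r_VTC ≈ {vtc_ratio(context, page):.2f}")
```

Under these assumptions, higher ratios come from packing more text into the same image area (smaller fonts) or from coarser vision patching, at the cost of legibility.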
🚀 Key Features
Three Core Tasks: Retrieval, Reasoning, and Memory
- VTC-Retrieval: A visual “Needle-In-A-Haystack” (NIAH) test. Requires locating “needles” (key-value pairs) embedded within a large “haystack” of distractors.
- VTC-Reasoning: Tests associative reasoning with minimized literal overlap between query and key, requiring inference of latent associations.
- VTC-Memory: Multi-turn conversations testing long-term memory retention.
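To make the retrieval task concrete, the sketch below plants a key-value "needle" at a chosen depth inside a distractor haystack and emits the matching QA query. The templates mirror the examples in the task table below; the helper names and the distractor logic are illustrative, not the benchmark's actual data generator.

```python
# Illustrative NIAH sample builder (not VTCBench's actual generator):
# plant a key-value "needle" at a chosen depth inside distractor text.
import random

NEEDLE_TEMPLATE = "One of the special magic numbers for {key} is: {value}."
QUERY_TEMPLATE = "What's the special magic number for {key}?"

def build_retrieval_sample(haystack_sentences: list[str], key: str, value: int,
                           depth: float = 0.5, seed: int = 0) -> dict:
    """Insert the needle at `depth` (0 = start, 1 = end) of the haystack."""
    rng = random.Random(seed)
    sentences = list(haystack_sentences)
    pos = int(depth * len(sentences))
    sentences.insert(pos, NEEDLE_TEMPLATE.format(key=key, value=value))
    # Optional distractor needle with a different key, as in the task table.
    sentences.insert(rng.randrange(len(sentences)),
                     NEEDLE_TEMPLATE.format(key="distracting-information",
                                            value=value - 1))
    return {
        "context": " ".join(sentences),          # later rendered to an image
        "question": QUERY_TEMPLATE.format(key=key),
        "answer": str(value),
    }

sample = build_retrieval_sample(["Essay sentence."] * 500,
                                key="long-context", value=2026)
```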
VTCBench-Wild
A "wild" variant designed to simulate real-world visual diversity (e.g., varying fonts, backgrounds, and layouts).

Two Evaluation Settings
- Predefined VTC Ratio: Predetermines the compression ratio (e.g., $r_\text{VTC}=2.0$) to compare model intelligence at a standardized information density.
- Predefined Rendering: Uses a fixed document format (12-pt Helvetica, 96 DPI) to simulate realistic document processing.
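The two settings fix different knobs: the first fixes the compression ratio and lets the rendering follow, while the second fixes the rendering and lets the ratio fall out of the layout. The sketch below expresses them as configurations; the field names and the 28-px patch size are assumptions, while the 2.0 ratio, 12-pt Helvetica, and 96 DPI values are the ones quoted above.

```python
# Illustrative configuration of the two evaluation settings.
# Field names and the patch size are assumptions; the ratio, font,
# and DPI values are quoted from the benchmark description.
from dataclasses import dataclass
import math

@dataclass
class PredefinedRatio:
    r_vtc: float = 2.0      # target compression ratio
    patch: int = 28         # assumed vision patch size (model dependent)

    def vision_token_budget(self, n_text_tokens: int) -> int:
        """How many vision tokens the rendered context may consume."""
        return math.ceil(n_text_tokens / self.r_vtc)

@dataclass
class PredefinedRendering:
    font: str = "Helvetica"
    font_size_pt: int = 12
    dpi: int = 96           # the resulting r_VTC then follows from the layout

# e.g., a 32k-token context at r_VTC = 2.0 must fit in ~16k vision tokens
print(PredefinedRatio().vision_token_budget(32_000))
```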
Extensive Model Coverage
Benchmarks 13 leading models, including GPT-5, Gemini-2.5 Pro, Gemma, Glyph, and the Qwen2.5, Qwen3, and InternVL3.5 series, among others.
Easily extensible to new models via our server-client evaluation framework.
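New models are typically plugged into such a harness behind an HTTP endpoint. The sketch below shows one common pattern, an OpenAI-compatible chat endpoint receiving the rendered context image plus the question. It is a generic illustration only; the endpoint URL, model name, and prompt are placeholders, and the actual interface of VTCBench's server-client framework is documented in the Usage Guide.

```python
# Generic OpenAI-compatible client sketch for a locally served VLM.
# NOT VTCBench's own client; endpoint, model name, and prompt are placeholders.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def ask_over_image(image_path: str, question: str, model: str = "my-vlm") -> str:
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
                {"type": "text", "text": question},
            ],
        }],
        temperature=0.0,
    )
    return resp.choices[0].message.content

print(ask_over_image("context_page.png",
                     "What's the special magic number for long-context?"))
```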
📊 Benchmark Tasks
| Task | Task Categories | Context Example | Evaluation Example |
|---|---|---|---|
| VTC-Retrieval (NIAH) | Lexical Matching, Multi-Hop Tracing, Aggregation | Dynamic query/key-value with types: word-word, word-number, uuid-number. *(visual example)*<br>(essays...)<br>One of the special magic numbers for long-context is: 2026.<br>... One of the special magic numbers for distracting-information is: 2025. | QA Variant:<br>Q: What's the special magic number for long-context?<br>A: 2026.<br>Completion Variant:<br>Prompt: One of the special magic numbers for long-context is:<br>Completion: 2026. |
| VTC-Reasoning (NIAH) | Associative Reasoning, Question-Answering | Dynamic query/key-value with types: event/action-person. *(visual example)*<br>(books...)<br>There was a vegan guest, named Katie. | One-Hop Reasoning:<br>Q: Which character cannot eat fish-based meals?<br>A: Katie.<br>Two-Hop Reasoning:<br>Q: Which character cannot eat Brandade meals?<br>A: Katie. |
| VTC-Memory (QA) | Memory, Question-Answering | No dynamic query/key-value; fully static. *(visual example)*<br>(conversations...)<br>Caroline: Researching adoption agencies—it's been a dream to have a family and give a loving home to kids who need it. | Q: What did Caroline research?<br>A: Adoption agencies. |
| VTCBench-Wild | All of the above | A more challenging variant of the above tasks, introducing visual diversity to simulate real-world document conditions. | |
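The gold answers in the examples above are short strings, so a simple normalized substring match is a reasonable first-pass metric for the QA-style variants. The sketch below is only illustrative; the exact scoring used by VTCBench may differ.

```python
# Illustrative first-pass scorer for the short-answer examples above.
# The official metric may differ; this is only a sketch.
import re

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", " ", text.lower())).strip()

def score(prediction: str, gold: str) -> float:
    """1.0 if the gold answer appears in the normalized prediction."""
    return float(normalize(gold) in normalize(prediction))

assert score("The special magic number is 2026.", "2026") == 1.0
assert score("Katie cannot eat fish-based meals.", "Katie") == 1.0
```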
📈 Main Findings

- Perception ≠ Comprehension: While many VLMs excel at OCR and simple retrieval, their performance collapses on reasoning and memory tasks compared to text-only LLMs.
- Length Fragility: VLM performance degrades significantly as the context length increases (e.g., from 1k up to 32k tokens).
- Parameter Sensitivity: VTC performance is highly sensitive to font size and to the spatial positioning of information.
🛠 Usage & Data
Please refer to the Usage Guide for instructions on how to use VTCBench.
📄 Citation
@misc{zhao2025vtcbenchvisionlanguagemodelsunderstand,
title={VTCBench: Can Vision-Language Models Understand Long Context with Vision-Text Compression?},
author={Hongbo Zhao and Meng Wang and Fei Zhu and Wenzhuo Liu and Bolin Ni and Fanhu Zeng and Gaofeng Meng and Zhaoxiang Zhang},
year={2025},
eprint={2512.15649},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2512.15649},
}