# Evaluation Guide
This project adopts a server-client architecture. It requires a running OpenAI-compatible LLM/VLM server (e.g., vLLM Serving, the OpenAI API) to provide LLM/VLM inference services.
## Evaluation Framework (Client)

This repo provides the evaluation framework, i.e., the client side of the project.
To set up the evaluation framework, you can use uv (recommended) or pip:

```sh
uv venv
uv sync
uv run playwright install chromium

# or using pip:
pip install -e .
playwright install chromium
```
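As a quick sanity check that Playwright's Chromium installed correctly, you can launch it headlessly once. This is a minimal sketch, assuming the environment set up above is active:

```sh
uv run python - <<'EOF'
from playwright.sync_api import sync_playwright

# launch and close headless Chromium once; raises if the browser is missing
with sync_playwright() as p:
    p.chromium.launch().close()
print("chromium OK")
EOF
```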
<details>
<summary>More on Playwright installation...</summary>

This project depends on [DeOCR](https://pypi.org/project/deocr/), which in turn depends on [Playwright](https://pypi.org/project/playwright/) to render text to images in a browser. Below is a copy of DeOCR's installation instructions. Please follow the instructions from [DeOCR](https://pypi.org/project/deocr/) whenever possible.

```sh
pip install deocr[playwright,pymupdf]
# activate your python environment, then install playwright deps
playwright install chromium
```

If you have trouble installing Playwright, or have host-switching problems (e.g., Slurm), we suggest a hacky fix like this:

```sh
# put a libasound.so.2 file (a fake one is also fine) in $HOME/.local/lib
# and then export the lib path for playwright to find it:
export LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/.local/lib
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/.local/lib
```

</details>
We provide ready-to-use shell/Slurm scripts for parallel evaluation in the `slurm/` folder; they are equivalent to the commands below.
**VTCBench evaluation:**

```sh
uv run examples/run.py \
    --model config/model/qwen_2.5_vl_7b.json \
    --data config/data/nolima.json \
    --data.context_length 1000 \
    --render config/render/default.yml \
    # --run.num_tasks 1  # for smoke test
```
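If you want to sweep settings without the Slurm scripts, here is a minimal sketch that reuses only the flags shown above; the specific context lengths are placeholders:

```sh
# run the same evaluation at several context lengths
for len in 1000 2000 4000; do
  uv run examples/run.py \
      --model config/model/qwen_2.5_vl_7b.json \
      --data config/data/nolima.json \
      --data.context_length "$len" \
      --render config/render/default.yml
done
```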
**VTCBench-Wild evaluation:**

```sh
# no rendering or context-length params; they come from the -wild dataset
uv run examples/run_wild.py \
    --model config/model/qwen_2.5_vl_7b.json \
    --data.path MLLM/VTCBench \
    --data.split Retrieval \
    # --run.num_tasks 1  # for smoke test
```
Collect results by running `uv run examples/collect.py results`, or `uv run examples/collect.py /path/to/results/`. This prints a table like the one below:
```
                     contains_all  ROUGE-L  json_id
render_css model_id
           Qwen3-8B         99.38    74.35      800
```
## vLLM Serving

To set up a vLLM serving endpoint, please refer to the vLLM Serving Documentation.
A simple example to get you started, with deps managed via pyproject.toml:

```sh
# set up a vllm environment separately, parallel to this repo:
# mkdir ../vllm-0.11 && cd ../vllm-0.11
uv init
uv venv
uv add vllm==0.11.0  # optionally flash-attn: https://github.com/Dao-AILab/flash-attention

# serve your model
uv run vllm serve Qwen/Qwen3-VL-2B-Instruct --port 8001

# to test your endpoint
curl http://localhost:8001/v1/models
```
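Beyond listing models, you can exercise the standard OpenAI-compatible chat route of the server; the model name must match the one being served:

```sh
# send one chat request to the endpoint served above
curl http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen3-VL-2B-Instruct",
        "messages": [{"role": "user", "content": "Hello"}]
      }'
```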
## Known Dependency Constraints

The following are our dependency recommendations for known models to avoid potential issues. Upgrade or downgrade with caution.
| Model Name | Dependencies |
|---|---|
| Qwen3-VL Series | `vllm==0.11.0`, `transformers==4.57.1` |
| moonshotai/Kimi-VL-A3B-Instruct | `vllm==0.9.2`, `transformers<4.54` |
| InternVL3.5 Series | `vllm==0.10.1.1`, `transformers==4.57.1` |
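For example, to match the Kimi-VL row in a dedicated serving environment (same uv workflow as in the vLLM Serving section; the quotes keep the shell from treating `<` as a redirect):

```sh
uv add vllm==0.9.2 "transformers<4.54"
```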