# Data

## VTCBench-Wild
Evaluating every permutation of dataset × task × formatting is a huge undertaking. We hope to simplify it with VTCBench-Wild, a static yet "wild" version of VTCBench with 2.2k samples randomly drawn from diverse datasets, tasks, and common document formats.
```shell
hf download --repo-type dataset MLLM-CL/VTCBench --local-dir data/VTCBench
```
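Conceptually, the "wild" split freezes a single random draw over the dataset × task × format grid. A toy sketch of such sampling follows — the format names are illustrative placeholders, not the actual release script:

```python
import itertools
import random

# Axes of the evaluation grid (the format names are placeholders,
# not the formats actually used by VTCBench-Wild).
datasets = ["RULER", "NoLiMa", "LoCoMo"]
tasks = ["retrieval", "reasoning", "memory"]
formats = ["format_a", "format_b", "format_c"]

rng = random.Random(0)  # fixed seed -> a static, reproducible split
grid = list(itertools.product(datasets, tasks, formats))
samples = [rng.choice(grid) for _ in range(2200)]  # ~2.2k samples
```

Because the seed is fixed, every run draws the same 2.2k combinations, which is what makes a "static yet wild" benchmark reproducible.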
## Dataset Overview
| VTCBench | Dataset | Metric | Needle | Haystack | Evaluated by | License |
|---|---|---|---|---|---|---|
| VTC-Retrieval | RULER | containsAll | word/uuid/number | essay | Completion/QA | Apache-2.0 |
| VTC-Reasoning | NoLiMa | containsAll | character/event | book | QA | Adobe Research |
| VTC-Memory | LoCoMo | ROUGE-L | NA | conversations | QA | CC BY-NC 4.0 |
Metrics:

- `containsAny` checks whether the prediction contains any one of the ground truths as a substring, e.g.:
  - $1.0$ with `pred="magic number is 6822442"`, `gt=["6822442"]`
  - $1.0$ with `pred="magic number is 6822442"`, `gt=["1234567", "6822442"]`
  - $0.0$ with `pred="magic number is 1234567"`, `gt=["6822442"]`
- `containsAll` checks whether the prediction contains all of the ground truths, scored as the fraction matched, e.g.:
  - $1.0$ with `pred="magic number is 6822442"`, `gt=["6822442"]`
  - $0.5$ with `pred="magic number is 6822442"`, `gt=["1234567", "6822442"]`
  - $0.0$ with `pred="magic number is 1234567"`, `gt=["6822442"]`
- `ROUGE-L` is computed using `rouge-score`.

For details, refer to the implementation: `metrics.py`.
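Based on the examples above, the two substring metrics can be sketched as follows — a minimal reimplementation for illustration, not the repository's `metrics.py`:

```python
def contains_any(pred: str, gts: list[str]) -> float:
    """1.0 if any ground truth occurs as a substring of the prediction."""
    return float(any(gt in pred for gt in gts))

def contains_all(pred: str, gts: list[str]) -> float:
    """Fraction of ground truths that occur as substrings of the prediction."""
    return sum(gt in pred for gt in gts) / len(gts)
```

For example, `contains_all("magic number is 6822442", ["1234567", "6822442"])` yields $0.5$, matching the second `containsAll` case above.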
## VTC-Retrieval (RULER)
Download from our own fork of RULER on Hugging Face.
We converted 4 tasks from the RULER paper, namely [S,MK,MV,MQ]-NIAH. Each task contains 30 samples, 10 for each needle key-value type: (word-number, uuid-number, word-word).
```shell
hf download --repo-type dataset MLLM-CL/RULER --local-dir data/RULER
```
A sample data point for S-NIAH (word-number):

- Needle: `One of the special magic numbers for yielding-grain is: 6822442.`
- Question Template: `{haystack_w_needle} What is the special magic number for yielding-grain mentioned in the provided text?`

## VTC-Reasoning (NoLiMa)
Download via the NoLiMa Hugging Face repository:
```shell
hf download --repo-type dataset amodaresi/NoLiMa --local-dir data/NoLiMa
```
Optionally, if you have a custom path, modify the config file accordingly: [config/data/nolima.json](/VTCBench/config/data/nolima.json).

```json
{
  "needle_set_path": ["data/NoLiMa/needlesets/needle_set.json"],
  "haystack_dir": "data/NoLiMa/haystack/rand_shuffle",
  "...": "..."
}
```

## VTC-Memory (LoCoMo)
Download from the LoCoMo GitHub repository:
```shell
mkdir -p data/LoCoMo
wget -P data/LoCoMo https://raw.githubusercontent.com/snap-research/locomo/refs/heads/main/data/locomo10.json
python examples/convert.py data/LoCoMo/locomo10.json
```
> [!NOTE]
> LoCoMo does not have haystacks; the context is provided directly in each needle's `context` field.
> Pairing needles with random haystacks would yield context irrelevant to the needles/QAs. Simply put:
>
> ❌ $M$ haystacks × $N$ needles/QAs;
> ✔️ $N$ needles/QAs, each with its own context.
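The distinction can be made concrete with a toy sketch. Only the `context` field mirrors the note above; every other name here is illustrative, not the actual LoCoMo schema:

```python
# Toy LoCoMo-style needles: each carries its own conversation context
# (field names other than "context" are illustrative, not the real schema).
needles = [
    {"question": "Where did Alice travel?",
     "context": "conversation about Alice's trip"},
    {"question": "What hobby did Bob pick up?",
     "context": "conversation about Bob's hobby"},
]

# ✔️ N needles/QAs, each evaluated against its own context:
pairs = [(n["context"], n["question"]) for n in needles]

# ❌ Not M random haystacks × N needles/QAs — a randomly drawn haystack
# would be unrelated to the question being asked.
```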
Optionally, if you have a custom path: `examples/convert.py` will output folders parallel to the input file. Modify the config file accordingly: [config/data/locomo.json](/VTCBench/config/data/locomo.json).

```json
{
  "needle_set_path": ["data/LoCoMo/needlesets/4_SingleHop.json"],
  "haystack_dir": "data/LoCoMo/haystack",
  "...": "..."
}
```

## Rendered Samples
| VTC-Retrieval | VTC-Reasoning | VTC-Memory |
|---|---|---|
| ![]() | ![]() | ![]() |