# VTCBench-Wild

[Dataset on HF](https://huggingface.co/datasets/MLLM-CL/VTCBench)

Evaluating every combination of dataset × task × document format is a huge undertaking.

We aim to simplify it with VTCBench-Wild, a static yet "wild" version of VTCBench, with 2.2k samples randomly drawn from diverse datasets, tasks, and common document formats.

```sh
hf download --repo-type dataset MLLM-CL/VTCBench --local-dir data/VTCBench
```
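
Once downloaded, the samples can be inspected with the `datasets` library. This is only a sketch: it assumes the repo is in a format `datasets` can auto-load, and the split and field names are illustrative, so check the dataset card for the actual schema.

```python
# Sketch only: assumes the HF repo auto-loads with `datasets`; the split name
# and fields below are illustrative, not confirmed by this README.
from datasets import load_dataset

ds = load_dataset("MLLM-CL/VTCBench", split="train")
print(len(ds))       # expect on the order of 2.2k samples
print(ds[0].keys())  # inspect the available fields
```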

## Dataset Overview

| VTCBench | Dataset | Metric | Needle | Haystack | Evaluated by | License |
| --- | --- | --- | --- | --- | --- | --- |
| VTC-Retrieval | RULER | containsAll | word/uuid/number | essay | Completion/QA | Apache-2.0 |
| VTC-Reasoning | NoLiMa | containsAll | character/event | book | QA | Adobe Research |
| VTC-Memory | LoCoMo | ROUGE-L | NA | conversations | QA | CC BY-NC 4.0 |

Metrics:

- `containsAny` checks whether the prediction contains the ground truth (or any one of the ground truths), e.g.:
  - $1.0$ with `pred="magic number is 6822442"`, `gt=["6822442"]`
  - $1.0$ with `pred="magic number is 6822442"`, `gt=["1234567", "6822442"]`
  - $0.0$ with `pred="magic number is 1234567"`, `gt=["6822442"]`
- `containsAll` checks whether the prediction contains all of the ground truths, scoring the fraction it contains, e.g.:
  - $1.0$ with `pred="magic number is 6822442"`, `gt=["6822442"]`
  - $0.5$ with `pred="magic number is 6822442"`, `gt=["1234567", "6822442"]`
  - $0.0$ with `pred="magic number is 1234567"`, `gt=["6822442"]`
- ROUGE-L is computed using the rouge-score package.
- For details, refer to the implementation: metrics.py. A minimal sketch of these metrics follows this list.
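
A minimal sketch of the three metrics, assuming the plain substring/fraction semantics described above (the authoritative version is metrics.py; the function names and signatures here are illustrative):

```python
# Minimal sketches of the metrics described above; see metrics.py for the
# reference implementation.
from rouge_score import rouge_scorer


def contains_any(pred: str, gts: list[str]) -> float:
    """1.0 if the prediction contains any one of the ground truths, else 0.0."""
    return 1.0 if any(gt in pred for gt in gts) else 0.0


def contains_all(pred: str, gts: list[str]) -> float:
    """Fraction of the ground truths contained in the prediction."""
    return sum(gt in pred for gt in gts) / len(gts)


def rouge_l(pred: str, gt: str) -> float:
    """ROUGE-L F-measure computed with the rouge-score package."""
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    return scorer.score(gt, pred)["rougeL"].fmeasure


assert contains_any("magic number is 6822442", ["1234567", "6822442"]) == 1.0
assert contains_all("magic number is 6822442", ["1234567", "6822442"]) == 0.5
```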

## VTC-Retrieval (RULER)

Download from our own fork of RULER on Hugging Face:

We converted 4 tasks from the RULER paper, namely [S, MK, MV, MQ]-NIAH. Each task contains 30 samples, 10 for each needle key-value type: word-number, uuid-number, and word-word.

```sh
hf download --repo-type dataset MLLM-CL/RULER --local-dir data/RULER
```

A sample data point for S-NIAH (word-number):

- Needle: `One of the special magic numbers for yielding-grain is: 6822442.`
- Question Template: `{haystack_w_needle} What is the special magic number for yielding-grain mentioned in the provided text?`
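
To make the template concrete, here is a rough sketch of how such a prompt could be assembled from the sample above; the haystack text and the needle-placement logic are illustrative, not the exact conversion code used in this repo.

```python
import random

# Needle and question template from the S-NIAH sample above.
needle = "One of the special magic numbers for yielding-grain is: 6822442."
template = (
    "{haystack_w_needle} What is the special magic number for "
    "yielding-grain mentioned in the provided text?"
)

# Illustrative haystack: in the real data this is a long essay.
sentences = ["Essay sentence one.", "Essay sentence two.", "Essay sentence three."]

# Drop the needle at a random position between sentences, then fill the template.
pos = random.randint(0, len(sentences))
sentences.insert(pos, needle)
prompt = template.format(haystack_w_needle=" ".join(sentences))
print(prompt)
```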

## VTC-Reasoning (NoLiMa)

Download from the NoLiMa Hugging Face repo:

```sh
hf download --repo-type dataset amodaresi/NoLiMa --local-dir data/NoLiMa
```

Optionally, if you downloaded to a custom path, modify the config file accordingly: [config/data/nolima.json](/VTCBench/config/data/nolima.json).

```json
{
  "needle_set_path": ["data/NoLiMa/needlesets/needle_set.json"],
  "haystack_dir": "data/NoLiMa/haystack/rand_shuffle",
  "...": "..."
}
```
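
As a sanity check, you can read this config and verify that the downloaded files sit where it expects them. A minimal sketch, using the keys from the config shown above; adjust the paths if you used a custom `--local-dir`:

```python
import json
from pathlib import Path

# Keys ("needle_set_path", "haystack_dir") follow config/data/nolima.json above.
with open("config/data/nolima.json") as f:
    cfg = json.load(f)

for needle_set in cfg["needle_set_path"]:
    assert Path(needle_set).exists(), f"missing needle set: {needle_set}"

haystacks = sorted(Path(cfg["haystack_dir"]).iterdir())
print(f"{len(haystacks)} haystack files in {cfg['haystack_dir']}")
```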

## VTC-Memory (LoCoMo)

Download from the LoCoMo GitHub repo:

```sh
mkdir -p data/LoCoMo
wget -P data/LoCoMo https://raw.githubusercontent.com/snap-research/locomo/refs/heads/main/data/locomo10.json
python examples/convert.py data/LoCoMo/locomo10.json
```

> [!NOTE]
> Be aware that LoCoMo does not have haystacks; the context is provided directly in each needle's `context` field.
> Pairing needles with random haystacks would produce irrelevant context for the needles/QAs. Simply put:
> ❌ $M$ haystacks × $N$ needles/QAs;
> ✔️ $N$ needles/QAs, each with its own context.

`examples/convert.py` writes its output folders next to the input file. Optionally, if you used a custom path, modify the config file accordingly: [config/data/locomo.json](/VTCBench/config/data/locomo.json).

```json
{
  "needle_set_path": ["data/LoCoMo/needlesets/4_SingleHop.json"],
  "haystack_dir": "data/LoCoMo/haystack",
  "...": "..."
}
```
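
To illustrate the note above, each converted LoCoMo needle/QA carries its own conversation context rather than sharing a haystack. Below is a rough sketch of inspecting the converted needle set; apart from the `context` field mentioned in the note, the field names and the top-level list structure are assumptions, so check the output of `examples/convert.py` for the exact schema.

```python
import json

# Path taken from config/data/locomo.json shown above.
with open("data/LoCoMo/needlesets/4_SingleHop.json") as f:
    needles = json.load(f)  # assumed to be a list of needle/QA entries

sample = needles[0]
print(sample.get("question"))          # assumed field name
print(len(sample.get("context", "")))  # per-needle conversation context (see note above)
```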

## Rendered Samples

Rendered sample pages from VTC-Retrieval, VTC-Reasoning, and VTC-Memory.