# VTCBench-Wild

[Dataset on HF](https://huggingface.co/datasets/MLLM-CL/VTCBench)

Evaluating every combination of dataset × task × document format is a huge undertaking.

We aim to simplify it with VTCBench-Wild, a static yet "wild" version of VTCBench, with 2.2k samples randomly drawn from diverse datasets, tasks, and common document formats.

```sh
hf download --repo-type dataset MLLM-CL/VTCBench --local-dir data/VTCBench
```
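
Once downloaded, the samples can be inspected with the `datasets` library. This is only a sketch: it assumes the repo is in a format `datasets` can auto-load, and the split and field names are illustrative, so check the dataset card for the actual schema.

```python
# Sketch only: assumes the HF repo auto-loads with `datasets`; the split name
# and fields below are illustrative, not confirmed by this README.
from datasets import load_dataset

ds = load_dataset("MLLM-CL/VTCBench", split="train")
print(len(ds))       # expect on the order of 2.2k samples
print(ds[0].keys())  # inspect the available fields
```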

## Dataset Overview

| VTCBench | Dataset | Metric | Needle | Haystack | Evaluated by | License |
| --- | --- | --- | --- | --- | --- | --- |
| VTC-Retrieval | RULER | containsAll | word/uuid/number | essay | Completion/QA | Apache-2.0 |
| VTC-Reasoning | NoLiMa | containsAll | character/event | book | QA | Adobe Research |
| VTC-Memory | LoCoMo | ROUGE-L | NA | conversations | QA | CC BY-NC 4.0 |

Metrics:

- `containsAny` checks whether the prediction contains the ground truth (or any one of the ground truths), e.g.:
  - $1.0$ with `pred="magic number is 6822442"`, `gt=["6822442"]`
  - $1.0$ with `pred="magic number is 6822442"`, `gt=["1234567", "6822442"]`
  - $0.0$ with `pred="magic number is 1234567"`, `gt=["6822442"]`
- `containsAll` checks whether the prediction contains all of the ground truths, scoring the fraction it contains, e.g.:
  - $1.0$ with `pred="magic number is 6822442"`, `gt=["6822442"]`
  - $0.5$ with `pred="magic number is 6822442"`, `gt=["1234567", "6822442"]`
  - $0.0$ with `pred="magic number is 1234567"`, `gt=["6822442"]`
- ROUGE-L is computed using the rouge-score package.
- For details, refer to the implementation: metrics.py. A minimal sketch of these metrics follows this list.
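
A minimal sketch of the three metrics, assuming the plain substring/fraction semantics described above (the authoritative version is metrics.py; the function names and signatures here are illustrative):

```python
# Minimal sketches of the metrics described above; see metrics.py for the
# reference implementation.
from rouge_score import rouge_scorer


def contains_any(pred: str, gts: list[str]) -> float:
    """1.0 if the prediction contains any one of the ground truths, else 0.0."""
    return 1.0 if any(gt in pred for gt in gts) else 0.0


def contains_all(pred: str, gts: list[str]) -> float:
    """Fraction of the ground truths contained in the prediction."""
    return sum(gt in pred for gt in gts) / len(gts)


def rouge_l(pred: str, gt: str) -> float:
    """ROUGE-L F-measure computed with the rouge-score package."""
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    return scorer.score(gt, pred)["rougeL"].fmeasure


assert contains_any("magic number is 6822442", ["1234567", "6822442"]) == 1.0
assert contains_all("magic number is 6822442", ["1234567", "6822442"]) == 0.5
```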

## VTC-Retrieval (RULER)

Download from our own fork of RULER on Hugging Face:

We converted 4 tasks from the RULER paper, namely [S, MK, MV, MQ]-NIAH. Each task contains 30 samples, 10 for each needle key-value type: word-number, uuid-number, and word-word.

```sh
hf download --repo-type dataset MLLM-CL/RULER --local-dir data/RULER
```

A sample data point for S-NIAH (word-number):

- Needle: `One of the special magic numbers for yielding-grain is: 6822442.`
- Question Template: `{haystack_w_needle} What is the special magic number for yielding-grain mentioned in the provided text?`
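
To make the template concrete, here is a rough sketch of how such a prompt could be assembled from the sample above; the haystack text and the needle-placement logic are illustrative, not the exact conversion code used in this repo.

```python
import random

# Needle and question template from the S-NIAH sample above.
needle = "One of the special magic numbers for yielding-grain is: 6822442."
template = (
    "{haystack_w_needle} What is the special magic number for "
    "yielding-grain mentioned in the provided text?"
)

# Illustrative haystack: in the real data this is a long essay.
sentences = ["Essay sentence one.", "Essay sentence two.", "Essay sentence three."]

# Drop the needle at a random position between sentences, then fill the template.
pos = random.randint(0, len(sentences))
sentences.insert(pos, needle)
prompt = template.format(haystack_w_needle=" ".join(sentences))
print(prompt)
```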

## VTC-Reasoning (NoLiMa)

Download from the NoLiMa Hugging Face repo:

```sh
hf download --repo-type dataset amodaresi/NoLiMa --local-dir data/NoLiMa
```

Optionally, if you downloaded to a custom path, modify the config file accordingly: [config/data/nolima.json](/VTCBench/config/data/nolima.json).

```json
{
  "needle_set_path": ["data/NoLiMa/needlesets/needle_set.json"],
  "haystack_dir": "data/NoLiMa/haystack/rand_shuffle",
  "...": "..."
}
```
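
As a sanity check, you can read this config and verify that the downloaded files sit where it expects them. A minimal sketch, using the keys from the config shown above; adjust the paths if you used a custom `--local-dir`:

```python
import json
from pathlib import Path

# Keys ("needle_set_path", "haystack_dir") follow config/data/nolima.json above.
with open("config/data/nolima.json") as f:
    cfg = json.load(f)

for needle_set in cfg["needle_set_path"]:
    assert Path(needle_set).exists(), f"missing needle set: {needle_set}"

haystacks = sorted(Path(cfg["haystack_dir"]).iterdir())
print(f"{len(haystacks)} haystack files in {cfg['haystack_dir']}")
```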

## VTC-Memory (LoCoMo)

Download from the LoCoMo GitHub repo:

```sh
mkdir -p data/LoCoMo
wget -P data/LoCoMo https://raw.githubusercontent.com/snap-research/locomo/refs/heads/main/data/locomo10.json
python examples/convert.py data/LoCoMo/locomo10.json
```

> [!NOTE]
> Be aware that LoCoMo does not have haystacks; the context is provided directly in each needle's `context` field.
> Pairing needles with random haystacks would produce irrelevant context for the needles/QAs. Simply put:
> ❌ $M$ haystacks × $N$ needles/QAs;
> ✔️ $N$ needles/QAs, each with its own context.

`examples/convert.py` writes its output folders next to the input file. Optionally, if you used a custom path, modify the config file accordingly: [config/data/locomo.json](/VTCBench/config/data/locomo.json).

```json
{
  "needle_set_path": ["data/LoCoMo/needlesets/4_SingleHop.json"],
  "haystack_dir": "data/LoCoMo/haystack",
  "...": "..."
}
```
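
To illustrate the note above, each converted LoCoMo needle/QA carries its own conversation context rather than sharing a haystack. Below is a rough sketch of inspecting the converted needle set; apart from the `context` field mentioned in the note, the field names and the top-level list structure are assumptions, so check the output of `examples/convert.py` for the exact schema.

```python
import json

# Path taken from config/data/locomo.json shown above.
with open("data/LoCoMo/needlesets/4_SingleHop.json") as f:
    needles = json.load(f)  # assumed to be a list of needle/QA entries

sample = needles[0]
print(sample.get("question"))          # assumed field name
print(len(sample.get("context", "")))  # per-needle conversation context (see note above)
```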

## Rendered Samples

Rendered sample pages from VTC-Retrieval, VTC-Reasoning, and VTC-Memory.