| Domain | Training | Evaluation: In-Distribution | Evaluation: Out-of-Distribution |
|---|---|---|---|
| Math | DAPO-Math-17k [1] | AIME [2, 3] | Math500 [4] |
| Chemistry | SciKnowEval [5] | SciKnowEval [5] | GPQA [6] |
| Tool Use | ToolAlpaca [7] | ToolAlpaca [7] | BFCLv4 [8] |
| Code | Dolci-Think-RL-7B [9] | Dolci-Think-RL-7B [9] | LCBv6 [10] |
| Logic | - | - | ZLogic [11] |
| Knowledge | - | - | MMLU-R [12] |
Embedding MMD Heatmap
Maximum Mean Discrepancy (MMD) heatmap between embeddings of two datasets.
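As a rough illustration of how such a heatmap could be computed, the sketch below estimates squared MMD between every pair of embedding sets. The source does not state which kernel or estimator was used, so the Gaussian (RBF) kernel, the bandwidth `gamma`, the biased V-statistic estimator, and the synthetic `demo` embeddings are all assumptions made for this example only.

```python
import numpy as np


def rbf_kernel(a, b, gamma):
    """Gaussian kernel matrix k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


def mmd2(x, y, gamma):
    """Biased (V-statistic) estimate of squared MMD between samples x and y."""
    return (
        rbf_kernel(x, x, gamma).mean()
        + rbf_kernel(y, y, gamma).mean()
        - 2.0 * rbf_kernel(x, y, gamma).mean()
    )


def mmd_matrix(embeddings, gamma):
    """Pairwise squared-MMD matrix over a dict of name -> (n_i, d) arrays."""
    names = list(embeddings)
    m = np.zeros((len(names), len(names)))
    for i, a in enumerate(names):
        for j, b in enumerate(names):
            if j > i:  # the matrix is symmetric; the diagonal stays 0
                m[i, j] = m[j, i] = mmd2(embeddings[a], embeddings[b], gamma)
    return names, m


# Synthetic stand-ins for dataset embeddings (assumption: not real data).
rng = np.random.default_rng(0)
demo = {
    "set_a": rng.normal(0.0, 1.0, size=(200, 8)),
    "set_b": rng.normal(0.0, 1.0, size=(200, 8)),  # same distribution as set_a
    "set_c": rng.normal(3.0, 1.0, size=(200, 8)),  # shifted distribution
}
names, matrix = mmd_matrix(demo, gamma=1.0 / 8)
```

The resulting `matrix` can be passed to any heatmap plotting routine. In this toy setup the entry for `set_a` vs. `set_c` should be much larger than the entry for `set_a` vs. `set_b`, since only the former pair is drawn from different distributions.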
Embedding t-SNE Map
t-SNE map of dataset embeddings.
  1. https://huggingface.co/datasets/BytedTsinghua-SIA/DAPO-Math-17k 

  2. https://huggingface.co/datasets/math-ai/aime24 

  3. https://huggingface.co/datasets/math-ai/aime25 

  4. https://huggingface.co/datasets/math-ai/math500 

  5. https://huggingface.co/datasets/hicai-zju/SciKnowEval

  6. https://huggingface.co/datasets/Idavidrein/gpqa 

  7. https://github.com/tangqiaoyu/ToolAlpaca/tree/main/data

  8. https://github.com/ShishirPatil/gorilla/blob/main/berkeley-function-call-leaderboard/bfcl_eval/data/BFCL_v4_multiple.json 

  9. https://huggingface.co/datasets/allenai/Dolci-Think-RL-7B

  10. https://huggingface.co/datasets/livecodebench/code_generation_lite 

  11. https://huggingface.co/datasets/allenai/ZebraLogicBench-private 

  12. https://huggingface.co/datasets/edinburgh-dawg/mmlu-redux-2.0