Label 200 examples per task. Your labels become the ground truth for model evaluation.
Save each task's labeled examples to agents/subnano/data/<task>/gold-eval.jsonl, replacing <task> with the task name.
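
As a rough illustration of the steps above, here is a minimal Python sketch that writes one task's labels as JSON Lines to the expected path. The record fields ("input", "label"), the helper names, and the "sentiment" task name are assumptions made for the example; the instructions only specify the count (200 per task) and the file location, so adapt the schema to whatever your task actually uses.

```python
import json
from pathlib import Path

# From the instructions: 200 labeled examples per task.
EXPECTED_COUNT = 200


def gold_eval_path(task: str, root: Path = Path(".")) -> Path:
    """Build agents/subnano/data/<task>/gold-eval.jsonl under the given root."""
    return root / "agents" / "subnano" / "data" / task / "gold-eval.jsonl"


def write_gold_eval(task: str, examples: list, root: Path = Path(".")) -> Path:
    """Write one JSON object per line (hypothetical helper, not project code)."""
    if len(examples) != EXPECTED_COUNT:
        raise ValueError(
            f"expected {EXPECTED_COUNT} examples for {task!r}, got {len(examples)}"
        )
    path = gold_eval_path(task, root)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for example in examples:
            f.write(json.dumps(example, ensure_ascii=False) + "\n")
    return path


if __name__ == "__main__":
    # Assumed record shape: the real schema may differ.
    labeled = [{"input": f"example {i}", "label": "positive"} for i in range(200)]
    print(write_gold_eval("sentiment", labeled))
```

Checking the count before writing catches a short or oversized label set early, before it is used as ground truth.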