{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "480066b8",
   "metadata": {},
   "source": [
    "<h1 align='center'>✨ Welcome to the AMD AI Premier League (AAIPL)! ✨</h1>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "078a566b-fa28-4a9f-8382-de6aeeacfcaf",
   "metadata": {},
   "source": [
    "<!-- <img src=\"./assets/aaipl.png\"> -->\n",
    "<img src=\"./assets/AMDAAIPL.png\">"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "09b4b3fe",
   "metadata": {
    "jp-MarkdownHeadingCollapsed": true
   },
   "source": [
    "\n",
    "---\n",
    "## Task\n",
    "You will be building:\n",
    "1.  **A question agent** that will ask $N$ puzzle-based questions based on provided [topics](./assets/topics.json).\n",
    "    - Create your model in [question_model.py](./agents/question_model.py) (it will be called by [question_agent.py](./agents/question_agent.py) for evaluation)\n",
    "    - *Your question agent must output questions in the format specified in [sample_question.json](./assets/sample_question.json)*.\n",
    "2. **An answer agent** that answers questions asked from a question agent.\n",
    "    -  Create your model in [answer_model.py](./agents/answer_model.py) (it will be called by [answer_agent.py](./agents/answer_agent.py) for evaluation)\n",
    "    -  *Your answer agent must output answers in the format specified in [sample_answer.json](./assets/sample_answer.json)*.\n",
    "---\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5ed3cfa0",
   "metadata": {},
   "source": [
    "## Instructions\n",
    "\n",
    "1. How to initiate your workstation:\n",
    "    1. Go to **dev.amd-ai-academy.com**.\n",
    "    2. Type in the Team ID and Password from your printout.\n",
    "    3. Sign in.\n",
    "1. Read through this README.ipynb for more details on the challenge.\n",
    "    - **Note:** If members of your team are working from the notebook simultaneously, please coordinate to ensure you do not overwrite each other's work.\n",
    "1. You can **only** use the models provided in `/root/.cache/huggingface/hub`.\n",
    "    - These will be *read-only* - please **copy** a model you'd like to use into the `AAIPL/hf_models` folder. Here you can edit the model.\n",
    "    - If you attempt to change the models in the original folder, you will be **immediately disqualified**.\n",
    "1. Check out our [Synthetic Data Generation and Unsloth Tutorial](./tutorial.ipynb) for training tips and tricks.\n",
    "1. Before the deadline, kindly ensure that you push your code to Github (except `hf_models`).\n",
    "    - You can use the [`git.sh`](./git.sh) script to easily push it. "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3cfd0427",
   "metadata": {},
   "source": [
    "## 📚 Table of Contents:\n",
    "- 📝 [Task](#task)\n",
    "- ⚙️ [Instructions](#instructions)\n",
    "- 🏏 [Tournament Overview](#tournament-overview)\n",
    "- 📋 [Guidelines](#guidelines)\n",
    "    - [Format](#format-overview)\n",
    "- 🛠️ [Submission](#️what-you-will-submit)\n",
    "- ⚠️ [Restrictions](#restrictions)\n",
    "- 📂 [Directory & Files overview](#directory--files-overview)\n",
    "- 🎮 [Getting started](#getting-started)\n",
    "    - 🚀 [Env Setup](#env-setup)\n",
    "    - 🤔 [Q-Agent](#q-agent)\n",
    "        - ✅ [Basic format-checks for questions from Q-agent](#basic-format-checks-for-questions-from-q-agent)\n",
    "    - 🤖 [A-agent](#a-agent)\n",
    "        - ✅ [Basic format-checks for answers from A-agent](#basic-format-checks-for-answers-from-a-agent)\n",
    "- 🏅 [Evaluation](#evaluation)\n",
    "    - 📊 [Scoring Criteria](#scoring-criteria)\n",
    "    - 🧮 [Scoring Example](#scoring-example)\n",
    "- ⏱ [Time Limit](#time-limit)\n",
    "<!-- - 🏆 [LeaderBoard UI/UX](#leaderboard-uiux) -->"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9386cb37",
   "metadata": {
    "jp-MarkdownHeadingCollapsed": true
   },
   "source": [
    "## Tournament Overview\n",
    "<!-- 🏏  -->\n",
    "* All matches in this tournament will be **1v1** knockout format where two teams, Team-A vs Team-B, will compete with their Q-agent (question agent) and A-agent (answer agent). You can think of this as a cricket match or baseball game where teams will switch sides.\n",
    "  * Before the matchups begin, there will be an **elimination round** where we will test your A-Agent against a hidden set of questions. The teams' A-Agents that scores the highest on these seeding questions will move onto the elimination stage. \n",
    "* Like in cricket, each match has two innings:\n",
    "    -   1st inning:\n",
    "        *   $N$ Question from the Q-agent (Team-A) and their corresponding $N$ answers from the A-agent (Team-B).\n",
    "        *   Q-agent score (Team-A): Say, $40$\n",
    "        *   A-agent score (Team-B): $60$\n",
    "\n",
    "    -   2nd inning:\n",
    "        *   $N$ Question from the Q-agent (Team-B) and their respective $N$ responses from the A-agent (Team-A).\n",
    "        *   Q-agent score (Team-B): Say, $70$\n",
    "        *   A-agent score (Team-A): $30$\n",
    "    -   Final Score:\n",
    "        *   Team-A score $=$ 1st inning Q-agent score $+$ 2nd inning A-agent score $= 40 + 30 = 70$\n",
    "        *   Team-B score $=$ 1st inning A-agent score $+$ 2nd inning Q-agent score $= 60 + 70 = 130$\n",
    "    -   Winner: **Team-B** with a score of $130$.\n",
    "\n",
    "For more info on how scoring is done, refer to the [scoring criteria section](#scoring-criteria).\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2deab9cf",
   "metadata": {
    "jp-MarkdownHeadingCollapsed": true
   },
   "source": [
    "## Guidelines\n",
    "<!-- 📋  -->\n",
    "\n",
    "### Format\n",
    "We will only consider responses from the Q-agent and the A-agent which follow the below format.\n",
    "\n",
    "*Note*: While having an explanation/reasoning is a plus, not having them doesn't disqualify the question or answer being correct.\n",
    "\n",
    "#### Q-Agent\n",
    "Given a topic, the Q-agent should generate questions in the specified JSON format:\n",
    "\n",
    "```json\n",
    "{\n",
    "    \"topic\": \"<Topic of the Question>\",\n",
    "    \"question\": \"<full question text>\",\n",
    "    \"choices\": [\n",
    "        \"A) <choice A text>\",\n",
    "        \"B) <choice B text>\",\n",
    "        \"C) <choice C text>\",\n",
    "        \"D) <choice D text>\"\n",
    "    ],\n",
    "    \"answer\": \"<correct choice letter only>\",\n",
    "    \"explanation\": \"brief explanation within 100 words for why the answer is correct\"\n",
    "}\n",
    "```\n",
    "\n",
    "The **\"Topic\"**, **\"Question\"**, **\"Choices\"**, and **\"Answer\"** will be verified for correctness.\n",
    "\n",
    "#### A-Agent\n",
    "Given a Question and Choices, A-agent should produce answer in the format of:\n",
    "\n",
    "```json\n",
    "{\n",
    "    \"answer\": \"<correct choice letter only>\",\n",
    "    \"reasoning\": \"brief reasoning within 100 words for why the answer is correct\"\n",
    "}\n",
    "```\n",
    "\n",
    "The **\"Answer\"** key will be compared with **\"Answer\"** from the opponent's Q-agent.\n",
    "\n",
    "**<u>Note</u>**: *Once again, we will **only** consider those responses from the Q-agent and the A-agent which follow the above format.*"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b858a803",
   "metadata": {},
   "source": [
    "## Submission\n",
    "<!-- 🛠️  -->\n",
    "You need to submit your code which should contain these main files:\n",
    "1. All work must be within the `AAIPL` folder. Do NOT change the folder name.\n",
    "1. No need to upload anything anywhere, we'll collect your agent code from your Jupyter Server at the end of the challenge.\n",
    "   1. The agents will be called by `python -m agents.question_agent` and `python -m agents.answer_agent`, respectively.\n",
    "1. ENSURE model checkpoint(s) (e.g., `model.safetensors` or `.pt` or `.pth`) is (are) loading and expected files are getting generated from Q-agent and A-agent, when inference is done.\n",
    "   1. Outputs must be saved to `outputs/questions.json` and `outputs/answers.json`, respectively.\n",
    "1. **<u>Note</u>: You are not required to generate any `.json` for us, we'll do that for you during evaluation setting a specific value to $N$.**\n",
    "\n",
    "You can test your submission by running the commands in the [Getting Started](#getting-started) section.\n",
    "\n",
    "<u><span style=\"color: blue\">Note</span></u>: These files will be checked for any hardcoding, RAG, or other unfair practices.<br>\n",
    "<u><span style=\"color: red\">Remarks / Caution</span></u>: A-agent is equally important as Q-agent. So, please do focus on both."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fe1cc2ce",
   "metadata": {},
   "source": [
    "## Restrictions\n",
    "<!-- ⚠️ -->\n",
    "\n",
    "1.  **<span style=\"color: red\">NO</span> LAST Minute Submission**: The submission deadline is strict. Any changes to your code after the deadline may disqualify your submission.\n",
    "1.  RAG (Retrieval Augmented Generation) techniques are not allowed.\n",
    "1.  Adversarial approaches will lead to disqualification, e.g. making A-agents hallucinate.\n",
    "1.  Usage of models other than what was provided will lead to disqualification.\n",
    "1.  Only English language is allowed for both Q-agent and A-agent.\n",
    "1.  Strictly stay within the `max_tokens` limits specified in `agen.yaml` & `qgen.yaml`. Other parameters can be changed.\n",
    "1.  Questions must pertain to the topics listed in `topics.json`.\n",
    "1.  Each question should be generated under `13 secs`. Questions exceeding this limit will not be considered.\n",
    "1.  Each answer should be generated under `9 secs`. Answers exceeding this limit will not be considered.\n",
    "\n",
    "Feel free to reach out in the Discord channel for any clarifications or questions!"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7b562cb4",
   "metadata": {},
   "source": [
    "## Directory & Files overview\n",
    "<!-- 📂  -->\n",
    "\n",
    "```plaintext\n",
    ".\n",
    "├── agents\n",
    "│   ├── question_model.py\n",
    "│   ├── question_model_llama.py (example code using Unsloth Llama)\n",
    "│   ├── question_agent.py\n",
    "│   ├── answer_model.py\n",
    "│   ├── answer_model_llama.py (example code using Unsloth Llama)\n",
    "│   └── answer_agent.py\n",
    "├── assets\n",
    "│   ├── topics_example.json # example questions w.r.t each topic\n",
    "│   ├── topics.json # Topics on which we require to generate questions\n",
    "│   ├── sample_question.json # File specifying expected format of questions generated\n",
    "│   └── sample_answer.json # Expected format of answers generated\n",
    "├── utils\n",
    "│   └── build_prompt.py # prompt-tuning scripts\n",
    "├── README.ipynb\n",
    "├── tutorial.ipynb # Synthetic Data Generation and Unsloth Tutorial\n",
    "├── tutorial_config.yaml # Config file for tutorial\n",
    "├── qgen.yaml # Generation specific parameters for Q-agent\n",
    "└── agen.yaml # Generation specific parameters for A-agent\n",
    "```\n",
    "   "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2187a198",
   "metadata": {},
   "source": [
    "## Getting started\n",
    "<!-- 🎮  -->\n",
    "Let's get started with running the Q-agent and A-agent framework.\n",
    "\n",
    "### Environment Setup\n",
    "<!-- 🚀 -->"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "62e583a6",
   "metadata": {
    "vscode": {
     "languageId": "powershell"
    }
   },
   "outputs": [],
   "source": [
    "# Import basic packages\n",
    "import json\n",
    "from typing import Dict, Any, List"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "93fdb0b0",
   "metadata": {},
   "source": [
    "### Q-Agent\n",
    "<!-- 🤔 -->\n",
    "You will update the model in `question_model.py`, which will be invoked by `question_agent.py`. In the provided skeleton, we have used the base Qwen3-4B model for Q-Agent but you should experiment with other models and techniques. Check out our [Synthetic Data Generation and Unsloth Tutorial](./tutorial.ipynb) for training tips and tricks.\n",
    "\n",
    "Generated questions must pertain to the topics mentioned in `topics.json` file. Additional topics will be added for the tournament finals.\n",
    "\n",
    "__Topics:__\n",
    "1.  `Logical Reasoning`: Syllogisms\n",
    "2.  `Puzzles`: Seating Arrangements (Linear, Circular)\n",
    "3.  `Blood Relations and Family Tree`: Puzzles involving generations and family tree logic\n",
    "4.  `Alphanumeric Series`: Mixed series questions\n",
    "\n",
    "Sample questions and answers are available in the [assets folder](./assets)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "4f181848",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Loading Q-Agent model...\n",
      "`torch_dtype` is deprecated! Use `dtype` instead!\n",
      "Loading checkpoint shards: 100%|██████████████████| 3/3 [00:02<00:00,  1.27it/s]\n",
      "Device set to use cuda:0\n",
      "Q-Agent Loaded!\n",
      "STEPS: 100%|██████████████████████████████████████| 4/4 [01:21<00:00, 20.30s/it]\n",
      "Generated 20 questions!\n",
      "{  \n",
      "  \"topic\": \"Syllogisms\",\n",
      "  \"question\": \"Statement I: All A are B\\nStatement II: Some B are C\\nStatement III: All C are D\\nConclusion I: Some D are A\\nConclusion II: All B are D\",\n",
      "  \"choices\": [\"A) A) If only conclusion I follow\", \"B) B) If only conclusion II follows\", \"C) C) If conclusion I and II both follow\", \"D) D) If neither conclusion I nor conclusion II follows\"],\n",
      "  \"answer\": \"D\",\n",
      "  \"explanation\": \"Conclusion I requires a link between D and A, which is not established. Conclusion II assumes all B are D, but only some B are C, and C ⊆ D, so B and D may not fully overlap.\"\n",
      "}\n",
      "{  \n",
      "  \"topic\": \"Seating Arrangements (Linear, Circular)\",  \n",
      "  \"question\": \"In a circular table with 8 seats, 4 people are seated such that each person is seated between two others. If A is directly opposite B, C is seated two seats to the left of D, and E is seated immediately to the right of F, who is seated three seats to the right of G, and G is seated directly opposite H, which person is seated two seats to the left of the person opposite to the person two seats to the right of E?\",  \n",
      "  \"choices\": [\"A) A\", \"B) C\", \"C) G\", \"D) H\"],  \n",
      "  \"answer\": \"C\",  \n",
      "  \"explanation\": \"The circular arrangement places A opposite B, C two left of D, E right of F, G opposite H, and F three right of G. Following the sequence, E's right is F, opposite is G, two left of G is C. Thus, C is correct.\"\n",
      "}\n",
      "{  \n",
      "  \"topic\": \"Syllogisms\",\n",
      "  \"question\": \"Statement I: All A are B\\nStatement II: Some B are C\\nStatement III: No C is D\\nConclusion I: Some A are not D\\nConclusion II: All B are not D\",\n",
      "  \"choices\": [\"A) A) If only conclusion I follow\", \"B) B) If only conclusion II follows\", \"C) C) If conclusion I and II both follow\", \"D) D) If neither conclusion I nor conclusion II follows\"],\n",
      "  \"answer\": \"B\",\n",
      "  \"explanation\": \"Conclusion II is valid as B and C overlap, and C and D are disjoint. Conclusion I is not guaranteed since A and D may not intersect.\"\n",
      "}\n",
      "{  \n",
      "  \"topic\": \"Mixed Series (Alphanumeric)\",\n",
      "  \"question\": \"Identify the correct continuation of the sequence: A3E, B6I, C9M, D12O, E15R, _____\",\n",
      "  \"choices\": [\"A) A) F18T\", \"B) B) G17S\", \"C) C) F18U\", \"D) D) G18S\"],\n",
      "  \"answer\": \"C\",\n",
      "  \"explanation\": \"First letters: A, B, C, D, E → F. Numbers: 3, 6, 9, 12, 15 → 18. Second letters: E, I, M, O, R → U (alphabetical order). Only C) F18U follows all rules.\"\n",
      "}\n",
      "{  \n",
      "  \"topic\": \"Seating Arrangements (Linear, Circular)\",\n",
      "  \"question\": \"In a circular table with 8 people, each facing the center, the following conditions hold: A is seated directly opposite D, B is seated two seats to the left of E, F is seated between G and H, and I is seated immediately to the right of J. If G is seated three seats to the right of A, and J is seated two seats to the left of F, who is seated two seats to the left of the person directly opposite B?\",\n",
      "  \"choices\": [\"A) H\", \"B) G\", \"C) I\", \"D) J\"],\n",
      "  \"answer\": \"A\",\n",
      "  \"explanation\": \"Arrangement: A-G-H-F-J-I-B-E-D. B is at position 2. Opposite B is E (position 5). Two left of E is H (position 3).\"\n",
      "}\n",
      "{  \n",
      "  \"topic\": \"Syllogisms\",\n",
      "  \"question\": \"Statement I: All A are B\\nStatement II: Some B are C\\nStatement III: No C is D\\nConclusion I: Some A are not D\\nConclusion II: All B are not D\",\n",
      "  \"choices\": [\"A) A) If only conclusion I follow\", \"B) B) If only conclusion II follows\", \"C) C) If conclusion I and II both follow\", \"D) D) If neither conclusion I nor conclusion II follows\"],\n",
      "  \"answer\": \"A\",\n",
      "  \"explanation\": \"Conclusion II is invalid as'some B are C' and 'no C is D' don't directly link B to D. Conclusion I holds via transitive inference: A ⊆ B, B ∩ C ≠ ∅, C ∩ D = ∅ ⇒ A ∩ D = ∅.\"\n",
      "}\n",
      "{  \n",
      "  \"topic\": \"Mixed Series (Alphanumeric)\",  \n",
      "  \"question\": \"Which of the following is NOT a valid continuation of the series: T3Z, U4B, V5C, W6D, X7E, _____?\",  \n",
      "  \"choices\": [\"A) A) Y8F\", \"B) B) Y9F\", \"C) C) Z8E\", \"D) D) Y8G\"],  \n",
      "  \"answer\": \"D\",  \n",
      "  \"explanation\": \"The series alternates between two patterns: letters advance by +1 (T→U→V→W→X→Y), numbers increase by +1 (3→4→5→6→7→8), and the third letter cycles through Z→B→C→D→E→F. D is incorrect as it breaks the letter cycle and number sequence.\"\n",
      "}\n",
      "{  \n",
      "  \"topic\": \"Syllogisms\",\n",
      "  \"question\": \"Statement I: All A are B.\\nStatement II: Some B are C.\\nStatement III: All C are D.\\nConclusion I: Some D are A.\\nConclusion II: All B are D.\",\n",
      "  \"choices\": [\"A) A) If only conclusion I follow\", \"B) B) If only conclusion II follows\", \"C) C) If conclusion I and II both follow\", \"D) D) If neither conclusion I nor conclusion II follows\"],\n",
      "  \"answer\": \"D\",\n",
      "  \"explanation\": \"Conclusion I requires D to overlap A, which isn't guaranteed. Conclusion II assumes B ⊆ D, but B only overlaps C, which is a subset of D. No definite overlap exists between B and D.\"\n",
      "}\n",
      "{  \n",
      "  \"topic\": \"Seating Arrangements (Linear, Circular)\",\n",
      "  \"question\": \"In a circular table with 8 people, A, B, C, D, E, F, G, and H, each seated at equal distances. A is directly opposite B. C is seated two seats to the left of D. E is seated three seats to the right of F, and G is seated directly to the left of H. If the table is rotated such that A moves to the position originally occupied by D, which person is now directly opposite A?\",\n",
      "  \"choices\": [\"A) B\", \"B) C\", \"C) E\", \"D) G\"],\n",
      "  \"answer\": \"D\",\n",
      "  \"explanation\": \"Original positions: A opposite B, D two seats left of C, F three left of E, G left of H. After rotation, A moves to D's spot. D's original position was two seats left of C, so new A's position is two seats left of C. Opposite A is now G, as G was originally left of H, and the table rotation preserves relative positions.\"\n",
      "}\n",
      "{  \n",
      "  \"topic\": \"Mixed Series (Alphanumeric)\",  \n",
      "  \"question\": \"Which of the following is NOT a valid continuation of the series: A3B, C5D, E7F, G9H, _____?\",  \n",
      "  \"choices\": [\"A) A) I11J\", \"B) B) K13L\", \"C) C) J11I\", \"D) D) H11G\"],  \n",
      "  \"answer\": \"C\",  \n",
      "  \"explanation\": \"The pattern alternates between increasing letters (A, C, E, G, I, K) and odd numbers (3, 5, 7, 9, 11, 13). The letters progress alphabetically, and the numbers increase by 2. Option C reverses the letter order and skips a step, making it invalid.\"\n",
      "}\n",
      "{  \n",
      "  \"topic\": \"Seating Arrangements (Linear, Circular)\",\n",
      "  \"question\": \"In a circular table with 8 people, A, B, C, D, E, F, G, and H, each seated at equal intervals. A is seated directly opposite F. B is seated two seats to the left of D, and E is seated three seats to the right of C. If G is seated two seats to the right of H, and H is seated directly opposite E, who is seated directly opposite to the person who is two seats to the left of G?\",\n",
      "  \"choices\": [\"A) A) B\", \"B) B) C\", \"C) C) D\", \"D) D) E\"],\n",
      "  \"answer\": \"D\",\n",
      "  \"explanation\": \"From given conditions, the circular arrangement is: A, H, G, F, E, D, C, B. Opposite G (position 3) is C. Two seats left of G is F. Opposite F is A. But the question asks for opposite of person two left of G, which is F's opposite, A. Wait, but the correct answer is D. Correction: The correct arrangement leads to the person two left of G being E, opposite of E is C. But the correct answer is D. Wait, this requires rechecking. The correct answer is D due to the specific seat positions derived from the constraints.\"\n",
      "}\n",
      "{  \n",
      "  \"topic\": \"Family tree logic\",\n",
      "  \"question\": \"Ravi claims, 'The woman in the photo is my mother’s only daughter-in-law’s daughter, but not my sister.’ How is the woman related to Ravi?\",\n",
      "  \"choices\": [\"A) A) Cousin\", \"B) B) Niece\", \"C) C) Sister\", \"D) D) Aunt\"],\n",
      "  \"answer\": \"D\",\n",
      "  \"explanation\": \"Ravi's mother's daughter-in-law is his wife's mother. The daughter of this woman is Ravi's sister-in-law. Since the woman is not Ravi's sister, she must be his aunt.\"\n",
      "}\n",
      "{  \n",
      "  \"topic\": \"Syllogisms\",\n",
      "  \"question\": \"Statement I: All A are B.\\nStatement II: Some B are C.\\nStatement III: No C is D.\\nConclusion I: Some A are C.\\nConclusion II: All A are not D.\",\n",
      "  \"choices\": [\"A) A) If only conclusion I follow\", \"B) B) If only conclusion II follows\", \"C) C) If conclusion I and II both follow\", \"D) D) If neither conclusion I nor conclusion II follows\"],\n",
      "  \"answer\": \"C\",\n",
      "  \"explanation\": \"Statement I (All A are B) and II (Some B are C) allow for some A being C. Statement III (No C is D) ensures A cannot be D, so II holds. Both conclusions are valid.\"\n",
      "}\n",
      "{  \n",
      "  \"topic\": \"Syllogisms\",\n",
      "  \"question\": \"Statement I: All mammals are warm-blooded\\nStatement II: Some warm-blooded animals are birds\\nStatement III: All birds are reptiles\\nConclusion I: Some reptiles are mammals\\nConclusion II: Some mammals are birds\",\n",
      "  \"choices\": [\"A) A) If only conclusion I follow\", \"B) B) If only conclusion II follows\", \"C) C) If conclusion I and II both follow\", \"D) D) If neither conclusion I nor conclusion II follows\"],\n",
      "  \"answer\": \"D\",\n",
      "  \"explanation\": \"Statement III contradicts Statement II (birds aren't reptiles). No direct link between reptiles and mammals. Conclusion II is invalid as birds aren't mammals. No valid overlap between reptiles and mammals.\"\n",
      "}\n",
      "{  \n",
      "  \"topic\": \"Syllogisms\",  \n",
      "  \"question\": \"Statement I: All A are B. Statement II: Some B are C. Statement III: No C is D. Conclusions I: Some A are not D. Conclusion II: All B are not D.\",  \n",
      "  \"choices\": [\"A) A) If only conclusion I follow\", \"B) B) If only conclusion II follows\", \"C) C) If conclusion I and II both follow\", \"D) D) If neither conclusion I nor conclusion II follows\"],  \n",
      "  \"answer\": \"D\",  \n",
      "  \"explanation\": \"Conclusion I: A ⊆ B, B ∩ C ≠ ∅, C ⊆ ¬D. A may overlap with C, but no direct link to D. Conclusion II: B ∩ C ≠ ∅, but C ⊆ ¬D. B could overlap with D. Neither conclusion is definitively supported.\"\n",
      "}\n",
      "{  \n",
      "  \"topic\": \"Family tree logic\",\n",
      "  \"question\": \"If A is the brother of B, B is the sister of C, and C is the daughter of D, but D is not the parent of A, and E is the parent of D, then how is E related to A?\",\n",
      "  \"choices\": [\"A) A) Father\", \"B) B) Uncle\", \"C) C) Cousin\", \"D) D) Brother-in-law\"],\n",
      "  \"answer\": \"B\",\n",
      "  \"explanation\": \"D is the parent of C and E is D's parent. A and C are siblings (B is their sibling). Since D is not A's parent, E is A's grandparent's parent, making E A's uncle.\"\n",
      "}\n",
      "{  \n",
      "  \"topic\": \"Mixed Series (Alphanumeric)\",  \n",
      "  \"question\": \"Which of the following is NOT a valid continuation of the series: R3T, S5U, T7V, U9W, _____?\",  \n",
      "  \"choices\": [\"A) A) V11X\", \"B) B) W10Y\", \"C) C) X12Z\", \"D) D) V11Y\"],  \n",
      "  \"answer\": \"D\",  \n",
      "  \"explanation\": \"The series follows a pattern where the first letter increments by 1 (R→S→T→U→V), the number increases by 2 (3→5→7→9→11), and the third letter increments by 1 (T→U→V→W→X). Option D is invalid as it uses V11Y, which breaks the third letter pattern (should be X).\"\n",
      "}\n",
      "{  \n",
      "  \"topic\": \"Mixed Series (Alphanumeric)\",\n",
      "  \"question\": \"Which of the following is NOT a valid continuation of the sequence: ZYX_7, WVU_5, TST_3, RQP_1, _____?\",\n",
      "  \"choices\": [\"A) A) PON_2\", \"B) B) ONO_0\", \"C) C) NOP_2\", \"D) D) NMP_1\"],\n",
      "  \"answer\": \"B\",\n",
      "  \"explanation\": \"The first letters decrease by 3 each time (Z→W→T→R→P). The second letters are the reverse of the first (Y→V→S→Q→O). The numbers decrease by 2 each time (7→5→3→1→...). The third letters are the same as the second. Only B has an incorrect number (0 instead of 2).\"\n",
      "}\n",
      "{  \n",
      "  \"topic\": \"Seating Arrangements (Linear, Circular)\",\n",
      "  \"question\": \"In a circular table with 8 people, each facing the center, A is seated two seats to the left of B, and C is seated directly opposite D. E is seated between F and G, with F two seats to the right of E. H is seated two seats to the left of A. If G is seated directly opposite B, and F is seated two seats to the left of H, who is seated directly opposite E?\",\n",
      "  \"choices\": [\"A) A) B\", \"B) B) D\", \"C) C) H\", \"D) D) C\"],\n",
      "  \"answer\": \"C\",\n",
      "  \"explanation\": \"Arrangement: A-H-F-E-G-C-D-B. Opposite E is H. C is opposite D, so H is opposite E. D is opposite C, so H is not opposite D. A is opposite B, so H is not opposite B. F is opposite G, so H is not opposite F.\"\n",
      "}\n",
      "{  \n",
      "  \"topic\": \"Syllogisms\",\n",
      "  \"question\": \"Statements: I. All A are B. II. Some B are C. III. No C is D. Conclusions: I. Some A are C. II. No A are D. III. Some B are not D. Which conclusion(s) must be true?\",\n",
      "  \"choices\": [\"A) A) Only I and II\", \"B) B) Only II and III\", \"C) C) Only I and III\", \"D) D) Only III\"],\n",
      "  \"answer\": \"B\",\n",
      "  \"explanation\": \"II is true via All A are B and No C is D. III is true via Some B are C and No C is D. I is not necessarily true as A and C may be disjoint.\"\n",
      "}\n",
      "\n",
      "==================================================\n",
      "\n",
      "\n",
      "Time taken per batch generation: [28.97447919845581, 17.809133291244507, 17.675289154052734, 16.74118399620056]\n",
      "Tokens generated per batch: [549, 541, 604, 524]\n",
      "Total Time Taken: 81.200 seconds; Total Tokens: 2218; TGPS: 27.315 seconds\n",
      "\n",
      "\n",
      "\n",
      "++++++++++++++++++++++++++++++++++++++++++++++++++\n",
      "\n",
      "Saved to outputs/questions.json!\n"
     ]
    }
   ],
   "source": [
    "# Run the following code to generate questions.\n",
    "# For demo purpose, we have used the base Qwen3-4B model for Q-Agent. Participants are expected to improve upon this\n",
    "!python -m agents.question_agent \\\n",
    "    --output_file \"outputs/questions.json\" \\\n",
    "    --num_questions 20 \\\n",
    "    --verbose"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9e511c33",
   "metadata": {},
   "source": [
    "#### Basic format-checks for questions from Q-agent\n",
    "\n",
    "Generated questions must follow the [format instructions](#format-overview). All questions generated from the Q-agent will be filtered and validated before being sent to the opponent's A-agent. We generate two version of questions, one is the raw, unfiltered one `questions.json` and the other is `filtered_questions.json` after passing through the below example filter. The full filtering and validation process is part of the judging system and is not demonstrated here.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a3770ec5",
   "metadata": {},
   "outputs": [],
   "source": [
    "from transformers import AutoTokenizer\n",
    "tokenizer = AutoTokenizer.from_pretrained(\"Qwen/Qwen3-4B\", padding_side='left')\n",
    "\n",
    "def count_tokens_q(text: str) -> int:\n",
    "    \"\"\"Count the number of tokens using Qwen3-4B tokenizer\"\"\"\n",
    "    return len(tokenizer.encode(text, add_special_tokens=False))\n",
    "\n",
    "def filter_questions(questions: List[str|Dict[str, str|Any]]) -> List[Dict[str, str|Any]]:\n",
    "    def basic_checks(q2: Dict[str, str])->bool:\n",
    "        # check required keys\n",
    "        required_keys = ['topic', 'question', 'choices', 'answer']\n",
    "        if all((key in q2) for key in required_keys):\n",
    "            # check choices format\n",
    "            checks = all(isinstance(choice, str) and len(choice) > 2 and choice[0].upper() in 'ABCD' for choice in q2['choices'])\n",
    "            if isinstance(q2['choices'], list) and len(q2['choices']) == 4 and checks:\n",
    "                # check answer format\n",
    "                # Check token length\n",
    "                check_len = sum(count_tokens_q(q2[k]) for k in ['question', 'answer'])\n",
    "                check_len += sum(count_tokens_q(choice) for choice in q2['choices']) - 15\n",
    "                if check_len < 130:\n",
    "                    if check_len + count_tokens_q(q2.get('explanation', 'None')) <= 1024:\n",
    "                        # Extra Checks: (PLUS checks) len(q2['answer']) == 1 and q2['answer'].upper() in 'ABCD':\n",
    "                        if isinstance(q2['answer'], str):\n",
    "                            return True\n",
    "        return False\n",
    "    correct_format_question = []\n",
    "    for i, q in enumerate(questions):\n",
    "        if isinstance(q, dict):\n",
    "            if basic_checks(q):\n",
    "                correct_format_question.append(q)\n",
    "        elif isinstance(q, str):\n",
    "            try:\n",
    "                q1 = json.loads(q)\n",
    "                if basic_checks(q1):\n",
    "                    correct_format_question.append(q1)\n",
    "            except json.JSONDecodeError:\n",
    "                # If JSON decoding fails, skip this answer\n",
    "                print(f\"Skipping invalid JSON at index {i}: {q}\")\n",
    "                continue\n",
    "        else:\n",
    "            continue\n",
    "    if len(correct_format_question) >= 0.5 * len(questions):\n",
    "        return correct_format_question\n",
    "    return list()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8a66e521",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "with open(\"outputs/questions.json\", \"r\") as f:\n",
    "    questions = json.load(f)\n",
    "\n",
    "filtered_questions = filter_questions(questions)\n",
    "\n",
    "with open(\"outputs/filtered_questions.json\", \"w\") as f:\n",
    "    json.dump(filtered_questions, f, indent=4)\n",
    "\n",
    "print(f\"Number of questions: {len(questions)}\")\n",
    "print(f\"Number of filtered questions: {len(filtered_questions)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8f96209d",
   "metadata": {},
   "source": [
    "### A-agent\n",
    "<!-- 🤖  -->\n",
    "You will update the model in `answer_model.py`, which will be invoked by `answer_agent.py`. In the provided skeleton, we have again used the base Qwen3-4B model for A-Agent but you should experiment with other models and techniques. Check out our [Synthetic Data Generation and Unsloth Tutorial](./tutorial.ipynb) for training tips and tricks."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "d51af0a4",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Loading A-Agent model...\n",
      "`torch_dtype` is deprecated! Use `dtype` instead!\n",
      "Loading checkpoint shards: 100%|██████████████████| 3/3 [00:04<00:00,  1.48s/it]\n",
      "Device set to use cuda:0\n",
      "Warming up GPU...\n",
      "A-Agent Loaded & Warmed Up!\n",
      "STEPS: 100%|███████████████████████████████████| 4/4 [00:31<00:00,  7.99s/batch]\n",
      "\n",
      "=== Question 1 ===\n",
      "Question: Statement I: All A are B\n",
      "Statement II: Some B are C\n",
      "Statement III: All C are D\n",
      "Conclusion I: Some D are A\n",
      "Conclusion II: All B are D\n",
      "Expected: D\n",
      "Model Answer:\n",
      "{  \n",
      "  \"answer\": \"D\",\n",
      "  \"reasoning\": \"None of the conclusions logically follow from the premises. Conclusion I assumes an inverse relationship not supported by 'All A are B' and 'Some B are C'. Conclusion II incorrectly generalizes 'Some B are C' to 'All B are D', which isn't justified by the given statements.\"\n",
      "}\n",
      "\n",
      "=== Question 2 ===\n",
      "Question: In a circular table with 8 seats, 4 people are seated such that each person is seated between two others. If A is directly opposite B, C is seated two seats to the left of D, and E is seated immediately to the right of F, who is seated three seats to the right of G, and G is seated directly opposite H, which person is seated two seats to the left of the person opposite to the person two seats to the right of E?\n",
      "Expected: C\n",
      "Model Answer:\n",
      "{  \n",
      "  \"answer\": \"D\",\n",
      "  \"reasoning\": \"By mapping positions around the circle, we find that starting from E's position, moving two seats right leads to H, whose opposite is G. Two seats left of G is D.\"\n",
      "}\n",
      "\n",
      "=== Question 3 ===\n",
      "Question: Statement I: All A are B\n",
      "Statement II: Some B are C\n",
      "Statement III: No C is D\n",
      "Conclusion I: Some A are not D\n",
      "Conclusion II: All B are not D\n",
      "Expected: B\n",
      "Model Answer:\n",
      "{  \n",
      "  \"answer\": \"A\",\n",
      "  \"reasoning\": \"From 'All A are B' and 'Some B are C', we can't directly infer anything about A and D. However, since 'No C is D', combining it with 'Some B are C' implies some B are not D. Since all A are B, those same A's would also be non-D, making Conclusion I valid. Conclusion II claims all B are not D, which isn't necessarily true because some B could overlap with D if they're not in the C group.\"\n",
      "}\n",
      "\n",
      "=== Question 4 ===\n",
      "Question: Identify the correct continuation of the sequence: A3E, B6I, C9M, D12O, E15R, _____\n",
      "Expected: C\n",
      "Model Answer:\n",
      "{  \n",
      "  \"answer\": \"C\",\n",
      "  \"reasoning\": \"The pattern alternates between letters in the alphabet (A→B→C→D→E) and numbers increasing by 3 (3, 6, 9, 12, 15). The next letter after E is F, and the number increases to 18. The last letter in the sequence is R; the next should be U (since S would break the alternating pattern). Thus, 'F18U' fits.\"\n",
      "}\n",
      "\n",
      "=== Question 5 ===\n",
      "Question: In a circular table with 8 people, each facing the center, the following conditions hold: A is seated directly opposite D, B is seated two seats to the left of E, F is seated between G and H, and I is seated immediately to the right of J. If G is seated three seats to the right of A, and J is seated two seats to the left of F, who is seated two seats to the left of the person directly opposite B?\n",
      "Expected: A\n",
      "Model Answer:\n",
      "{  \n",
      "  \"answer\": \"D\",\n",
      "  \"reasoning\": \"By placing A at position 1, D is opposite at 5. G is three seats right of A (position 4). F is between G (4) and H, so F=6, H=7. J is two left of F (J=4), but that conflicts with G's placement. Adjusting positions shows J must be at 2, making I at 3. B is two left of E; if B=6, then E=2 (conflict). Correct arrangement places B at 7, E at 1. The person opposite B (opposite of 7 is 3) is I.\"\n",
      "}\n",
      "\n",
      "=== Question 6 ===\n",
      "Question: Statement I: All A are B\n",
      "Statement II: Some B are C\n",
      "Statement III: No C is D\n",
      "Conclusion I: Some A are not D\n",
      "Conclusion II: All B are not D\n",
      "Expected: A\n",
      "Model Answer:\n",
      "{  \n",
      "  \"answer\": \"A\",\n",
      "  \"reasoning\": \"From 'All A are B' and 'Some B are C', we can't directly infer anything about A and D. However, since 'No C is D', combining it with 'Some B are C' implies some B are not D. Since all A are B, those same A's would also be non-D, making Conclusion I valid. Conclusion II claims all B are not D, which isn't necessarily true because some B could overlap with D if they're not in the C group.\"\n",
      "}\n",
      "\n",
      "=== Question 7 ===\n",
      "Question: Which of the following is NOT a valid continuation of the series: T3Z, U4B, V5C, W6D, X7E, _____?\n",
      "Expected: D\n",
      "Model Answer:\n",
      "{  \n",
      "  \"answer\": \"B\",\n",
      "  \"reasoning\": \"The pattern alternates between increasing letters by one (T→U→V→W→X→Y) and incrementing numbers (3→4→5→6→7→8), while the third character cycles through Z,B,C,D,E,F... The inconsistency lies in option B's number (9 instead of 8).\"\n",
      "}\n",
      "\n",
      "=== Question 8 ===\n",
      "Question: Statement I: All A are B.\n",
      "Statement II: Some B are C.\n",
      "Statement III: All C are D.\n",
      "Conclusion I: Some D are A.\n",
      "Conclusion II: All B are D.\n",
      "Expected: D\n",
      "Model Answer:\n",
      "{  \n",
      "  \"answer\": \"D\",\n",
      "  \"reasoning\": \"None of the conclusions logically follow from the premises. Conclusion I assumes an inverse relationship not supported by 'All A are B' and 'Some B are C'. Conclusion II incorrectly generalizes 'Some B are C' to 'All B are D', which isn't justified by the given statements.\"\n",
      "}\n",
      "\n",
      "=== Question 9 ===\n",
      "Question: In a circular table with 8 people, A, B, C, D, E, F, G, and H, each seated at equal distances. A is directly opposite B. C is seated two seats to the left of D. E is seated three seats to the right of F, and G is seated directly to the left of H. If the table is rotated such that A moves to the position originally occupied by D, which person is now directly opposite A?\n",
      "Expected: D\n",
      "Model Answer:\n",
      "{  \n",
      "  \"answer\": \"D\",\n",
      "  \"reasoning\": \"After rotating so A takes D's original spot, the new positions shift everyone clockwise by four places. Originally, A was opposite B; after rotation, D becomes adjacent to A. Since E is three seats right of F, and G is left of H, the person opposite A in the new arrangement is D.\"\n",
      "}\n",
      "\n",
      "=== Question 10 ===\n",
      "Question: Which of the following is NOT a valid continuation of the series: A3B, C5D, E7F, G9H, _____?\n",
      "Expected: C\n",
      "Model Answer:\n",
      "{  \n",
      "  \"answer\": \"C\",\n",
      "  \"reasoning\": \"The pattern alternates between increasing letters (A→C→E→G→I→K→...) and odd numbers (3,5,7,9,...). The third element in each pair should follow the sequence. Option C has 'J' which skips 'I', breaking the alphabetical order.\"\n",
      "}\n",
      "\n",
      "=== Question 11 ===\n",
      "Question: In a circular table with 8 people, A, B, C, D, E, F, G, and H, each seated at equal intervals. A is seated directly opposite F. B is seated two seats to the left of D, and E is seated three seats to the right of C. If G is seated two seats to the right of H, and H is seated directly opposite E, who is seated directly opposite to the person who is two seats to the left of G?\n",
      "Expected: D\n",
      "Model Answer:\n",
      "{  \n",
      "  \"answer\": \"D\",\n",
      "  \"reasoning\": \"Given that H is opposite E, and G is two seats right of H, then G is two seats right of H, making E opposite G. Since E is three seats right of C, C must be three seats left of E. With A opposite F, placing others accordingly shows D is opposite the person two seats left of G.\"\n",
      "}\n",
      "\n",
      "=== Question 12 ===\n",
      "Question: Ravi claims, 'The woman in the photo is my mother’s only daughter-in-law’s daughter, but not my sister.’ How is the woman related to Ravi?\n",
      "Expected: D\n",
      "Model Answer:\n",
      "{  \n",
      "  \"answer\": \"B\",\n",
      "  \"reasoning\": \"Ravi's mother's only daughter-in-law is his wife. The woman is her daughter, making her Ravi's sister-in-law. However, since she isn't Ravi's sister, she must be his niece.\"\n",
      "}\n",
      "\n",
      "=== Question 13 ===\n",
      "Question: Statement I: All A are B.\n",
      "Statement II: Some B are C.\n",
      "Statement III: No C is D.\n",
      "Conclusion I: Some A are C.\n",
      "Conclusion II: All A are not D.\n",
      "Expected: C\n",
      "Model Answer:\n",
      "{  \n",
      "  \"answer\": \"D\",\n",
      "  \"reasoning\": \"None of the conclusions logically follow from the premises. Conclusion I assumes an overlap between A and C that isn't guaranteed by 'All A are B' and 'Some B are C'. Conclusion II incorrectly infers 'not D' for all A based on 'No C is D', which doesn't directly relate to A's relationship with D.\"\n",
      "}\n",
      "\n",
      "=== Question 14 ===\n",
      "Question: Statement I: All mammals are warm-blooded\n",
      "Statement II: Some warm-blooded animals are birds\n",
      "Statement III: All birds are reptiles\n",
      "Conclusion I: Some reptiles are mammals\n",
      "Conclusion II: Some mammals are birds\n",
      "Expected: D\n",
      "Model Answer:\n",
      "{  \n",
      "  \"answer\": \"D\",\n",
      "  \"reasoning\": \"Statement III claims all birds are reptiles, which contradicts biological facts but must be taken as given. From the premises, no direct link exists between reptiles and mammals. Conclusion I assumes some reptiles are mammals, which isn't supported by any premise. Similarly, Conclusion II suggests some mammals are birds, which also lacks logical support based on the provided statements.\"\n",
      "}\n",
      "\n",
      "=== Question 15 ===\n",
      "Question: Statement I: All A are B. Statement II: Some B are C. Statement III: No C is D. Conclusions I: Some A are not D. Conclusion II: All B are not D.\n",
      "Expected: D\n",
      "Model Answer:\n",
      "{  \n",
      "  \"answer\": \"A\",\n",
      "  \"reasoning\": \"Conclusion I follows because combining 'All A are B' and 'No C is D' allows inferring some A are not D via syllogistic logic. However, Conclusion II does not logically follow since 'Some B are C' and 'No C is D' do not necessarily imply all B are not D.\"\n",
      "}\n",
      "\n",
      "=== Question 16 ===\n",
      "Question: If A is the brother of B, B is the sister of C, and C is the daughter of D, but D is not the parent of A, and E is the parent of D, then how is E related to A?\n",
      "Expected: B\n",
      "Model Answer:\n",
      "{  \n",
      "  \"answer\": \"B\",\n",
      "  \"reasoning\": \"E is the parent of D, who is the mother of C. Since A's father is not D, E must be the uncle of A through their shared mother.\"\n",
      "}\n",
      "\n",
      "=== Question 17 ===\n",
      "Question: Which of the following is NOT a valid continuation of the series: R3T, S5U, T7V, U9W, _____?\n",
      "Expected: D\n",
      "Model Answer:\n",
      "{  \n",
      "  \"answer\": \"D\",\n",
      "  \"reasoning\": \"The pattern alternates between increasing letters by one position in the alphabet and adding an odd number sequence. The next term should follow 'U9W' with 'V11X', but 'D) V11Y' has Y instead of X, breaking the consistent alphabetical progression.\"\n",
      "}\n",
      "\n",
      "=== Question 18 ===\n",
      "Question: Which of the following is NOT a valid continuation of the sequence: ZYX_7, WVU_5, TST_3, RQP_1, _____?\n",
      "Expected: B\n",
      "Model Answer:\n",
      "{  \n",
      "  \"answer\": \"B\",\n",
      "  \"reasoning\": \"The pattern alternates between descending letters and decreasing odd numbers. The first three letters in each term decrease by two (ZYX → WVU → TST → RQP), while the number decreases by 2 each time (7→5→3→1). The next should be 'NOP' with 2 (C), but B's 'ONO_0' breaks the consistent decremental pattern.\"\n",
      "}\n",
      "\n",
      "=== Question 19 ===\n",
      "Question: In a circular table with 8 people, each facing the center, A is seated two seats to the left of B, and C is seated directly opposite D. E is seated between F and G, with F two seats to the right of E. H is seated two seats to the left of A. If G is seated directly opposite B, and F is seated two seats to the left of H, who is seated directly opposite E?\n",
      "Expected: C\n",
      "Model Answer:\n",
      "{  \n",
      "  \"answer\": \"D\",\n",
      "  \"reasoning\": \"By mapping positions based on given clues, we find that E's direct opposite is D.\"\n",
      "}\n",
      "\n",
      "=== Question 20 ===\n",
      "Question: Statements: I. All A are B. II. Some B are C. III. No C is D. Conclusions: I. Some A are C. II. No A are D. III. Some B are not D. Which conclusion(s) must be true?\n",
      "Expected: B\n",
      "Model Answer:\n",
      "{  \n",
      "  \"answer\": \"B\",\n",
      "  \"reasoning\": \"Only conclusions II and III logically follow. Conclusion II follows because no C is D and some B are C implies some B are not D. Conclusion I does not necessarily hold since all A are B but there's no direct link to C.\"\n",
      "}\n",
      "BATCH - 0\n",
      "Tokens: 324, Time: 9.851 seconds\n",
      "TGPS: 32.889 seconds\n",
      "BATCH - 1\n",
      "Tokens: 268, Time: 8.644 seconds\n",
      "TGPS: 31.003 seconds\n",
      "BATCH - 2\n",
      "Tokens: 276, Time: 6.973 seconds\n",
      "TGPS: 39.581 seconds\n",
      "BATCH - 3\n",
      "Tokens: 204, Time: 6.477 seconds\n",
      "TGPS: 31.495 seconds\n",
      "\n",
      "==================================================\n",
      "Total Time: 31.946 seconds; Total Tokens: 1072; TGPS: 33.557 seconds\n"
     ]
    }
   ],
   "source": [
    "# Same instructions apply for the answer agent.\n",
    "# For demo purpose, we have used the base Qwen3-4B model for A-agent. Participants are expected to improve upon this.\n",
    "!python -m agents.answer_agent \\\n",
    "    --input_file \"outputs/filtered_questions.json\" \\\n",
    "    --output_file \"outputs/answers.json\" \\\n",
    "    --verbose"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3891529b",
   "metadata": {},
   "source": [
    "#### Basic format-checks for answers from A-agent\n",
    "Generated answers must follow the [format instructions](#format-overview). The following filter is added into the `answer_agent.py`. Similar to before, two versions are saved, `answers.json` and `filtered_answers.json`. The latter is used for evaluation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3acad45e",
   "metadata": {},
   "outputs": [],
   "source": [
    "from transformers import AutoTokenizer\n",
    "tokenizer = AutoTokenizer.from_pretrained(\"Qwen/Qwen3-4B\", padding_side='left')\n",
    "\n",
    "def count_tokens_a(text: str) -> int:\n",
    "    \"\"\"Count the number of tokens in the text using the agent's tokenizer\"\"\"\n",
    "    return len(tokenizer.encode(text, add_special_tokens=False))\n",
    "\n",
    "def filter_answers(ans: List[str|Dict[str, str]]) -> List[Dict[str, str]]:\n",
    "    r\"\"\"Filter answers to ensure they are in the correct format\"\"\"\n",
    "    def basic_checks(a1: Dict[str, str])->bool:\n",
    "        # check required keys\n",
    "        required_keys = ['answer']\n",
    "        if all((key in a1) and isinstance(a1[key], str) for key in required_keys):\n",
    "            if len(a1['answer']) == 1 and (a1['answer'] not in 'ABCDabcd'):\n",
    "                    return False\n",
    "            check_len = count_tokens_a(a1['answer'])\n",
    "            if check_len < 50:\n",
    "                check_len += count_tokens_a(a1.get('reasoning', 'None'))\n",
    "                if check_len < 512:\n",
    "                    # check answer format - EXTRA checks\n",
    "                    # if len(a1['answer']) == 1 and a1['answer'].upper() in 'ABCD':\n",
    "                    return True\n",
    "        return False\n",
    "\n",
    "    filtered_answers = []\n",
    "    for i, a in enumerate(ans):\n",
    "        if isinstance(a, dict):\n",
    "            if basic_checks(a):\n",
    "                filtered_answers.append(a)\n",
    "            else:\n",
    "                filtered_answers.append(None)\n",
    "        elif isinstance(a, str):\n",
    "            # Basic checks: at least with correct JSON format\n",
    "            try:\n",
    "                a1 = json.loads(a)\n",
    "                if basic_checks(a1):\n",
    "                    filtered_answers.append(a1)\n",
    "                else:\n",
    "                    filtered_answers.append(None)\n",
    "            except json.JSONDecodeError:\n",
    "                # If JSON decoding fails, skip this answer\n",
    "                print(f\"Skipping invalid JSON at index {i}: {a}\")\n",
    "                filtered_answers.append(None)\n",
    "                continue\n",
    "        else:\n",
    "            # If the answer is neither a dict nor a str, skip it\n",
    "            print(f\"Skipping unsupported type at index {i}: {type(a)}\")\n",
    "            filtered_answers.append(None)\n",
    "    return filtered_answers"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "49a4301d",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "with open(\"outputs/answers.json\", \"r\") as f:\n",
    "    answers = json.load(f)\n",
    "filtered_answers = filter_answers(answers)\n",
    "\n",
    "\n",
    "print(f\"Number of answers: {len(answers)}\")\n",
    "print(f\"Number of filtered answers: {len(filtered_answers)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "79a6a911",
   "metadata": {},
   "source": [
    "## Evaluation\n",
    "<!-- 🏅  -->"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f2aa2284",
   "metadata": {},
   "source": [
    "### Scoring Criteria\n",
    "\n",
    "<!-- 📊  -->\n",
    "\n",
    "Scores are assigned based on: out of $N$ questions from Q-agent, how many an A-agent can answer and vice-versa. *No negative marking for wrong answers.*\n",
    "\n",
    "$$\\text{A-agent Score} = \\dfrac{\\#\\ \\text{of questions correctly answered with expected format}}{N}\\times 100$$\n",
    "$$\\text{Q-agent Score} = \\dfrac{\\#\\ \\text{of questions incorrectly answered by A-agent}}{N}\\times 100$$\n",
    "\n",
    "\n",
    "$N$ denotes the number of filtered / format-correct questions. **Teams whose Q-agent fails to generate at least $50\\%$ of `num_questions` (where `num_questions` ranges from $2$ to $1000+$) of the questions correctly (as per [format-checking](#format-overview)) will be automatically disqualified.**<br>\n",
    "\n",
    "In case of **TIE**, closed benchmark questions will be used to evaluate the answer agents (A-agent) and rank the teams accordingly.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ba1f7ccf",
   "metadata": {},
   "source": [
    "### Scoring Example"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "57c11592",
   "metadata": {},
   "outputs": [],
   "source": [
    "# calculate scores...\n",
    "N = len(filtered_questions)\n",
    "assert N == len(filtered_answers), \"Number of questions and answers must match.\"\n",
    "num_correct_answers = len([1 for q,a in zip(filtered_questions, filtered_answers) if a is not None and q['answer'] == a['answer']])\n",
    "\n",
    "# Here the answer may be correct, but since q['answer'] is not an option letter is not there, we face problems\n",
    "# Below shown is one way of simple string parsing\n",
    "num_correct_answers = len([1 for q,a in zip(filtered_questions, filtered_answers) if a is not None and q['answer'][0] == a['answer']])\n",
    "\n",
    "a_score = num_correct_answers*100/(N+1e-9)\n",
    "q_score = (N-num_correct_answers)*100/(N+1e-9)\n",
    "# Announce the scores\n",
    "print(f\"Number of questions: {N}\")\n",
    "print(f\"Number of correct answers: {num_correct_answers}\")\n",
    "print(\"Scores:\")\n",
    "print(f\"Team B: A-agent score: {a_score:.2f}\")\n",
    "print(f\"Team A: Q-agent score: {q_score:.2f}\")\n",
    "print(f\"Innings 1 winner: {'Team A' if q_score > a_score else 'Team B' if q_score < a_score else 'Draw'}\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
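
The scoring rule from the notebook's "Scoring Criteria" cell can be sketched as a small standalone function. This is a minimal sketch under stated assumptions, not the organizers' evaluator: the name `score_innings`, the sample records, and the choice to compare only the uppercased first character of each answer are all hypothetical and introduced here for illustration. As in the notebook, a `None` entry stands for an answer that failed the format filter.

```python
from typing import Dict, List, Optional, Tuple


def score_innings(
    questions: List[Dict[str, str]],
    answers: List[Optional[Dict[str, str]]],
) -> Tuple[float, float]:
    """Return (a_score, q_score) as percentages of the N format-correct questions.

    Hypothetical helper: answers[i] is None when the i-th answer was rejected
    by the format filter.
    """
    if len(questions) != len(answers):
        raise ValueError("Number of questions and answers must match.")
    n = len(questions)
    if n == 0:
        return 0.0, 0.0

    correct = 0
    for q, a in zip(questions, answers):
        if a is None:
            continue  # rejected answers count as incorrect
        # Assumption: compare option letters only, case-insensitively.
        expected = q["answer"].strip()[:1].upper()
        given = a["answer"].strip()[:1].upper()
        if expected and expected == given:
            correct += 1

    a_score = 100.0 * correct / n          # questions answered correctly
    q_score = 100.0 * (n - correct) / n    # questions the A-agent got wrong
    return a_score, q_score


# Made-up sample data, not taken from any real run.
sample_questions = [{"answer": "B"}, {"answer": "D"}, {"answer": "A"}, {"answer": "C"}]
sample_answers = [{"answer": "B"}, None, {"answer": "a"}, {"answer": "D"}]

a_score, q_score = score_innings(sample_questions, sample_answers)
print(f"A-agent score: {a_score:.2f}, Q-agent score: {q_score:.2f}")
```

Because both scores are computed over the same $N$, they always sum to 100 when $N > 0$; the early return for an empty list takes the place of the `1e-9` guard used in the notebook cell.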
