webarena/scripts/webarena-zeno.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exploring WebArena Results with Zeno \n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[Zeno](https://zenoml.com/) provides interative interface to explore the results of your agents in WebArena. You can easily\n",
    "* Visualize the trajectories\n",
    "* Compare the performance of different agents\n",
    "* Interactively select and analyze trajectories with various filters such as trajectory length "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install zeno_client"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import json\n",
    "import os\n",
    "from dotenv import load_dotenv\n",
    "\n",
    "import zeno_client"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We first need to convert and combine the output `HTML` trajectories into a single `JSON` file using the `html2json` script:\n",
    "Remember to change `result_folder` to the path you saved your `render_*.html`. The results will be saved to `{{result_folder}}/json_dump.json`. For example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!python html2json.py --result_folder ../cache/918_text_bison_001_cot --config_json ../config_files/test.raw.json\n",
    "!python html2json.py --result_folder ../cache/919_gpt35_16k_cot --config_json ../config_files/test.raw.json\n",
    "!python html2json.py --result_folder ../cache/919_gpt35_16k_cot_na --config_json ../config_files/test.raw.json\n",
    "!python html2json.py --result_folder ../cache/919_gpt35_16k_direct --config_json ../config_files/test.raw.json\n",
    "!python html2json.py --result_folder ../cache/919_gpt35_16k_direct_na --config_json ../config_files/test.raw.json\n",
    "!python html2json.py --result_folder ../cache/919_gpt4_8k_cot --config_json ../config_files/test.raw.json"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next you will record the json file names in `RESULT_JSONS` and provide the model tag in `RESULT_NAMES`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "RESULT_JSONS = [\n",
    "    \"../cache/918_text_bison_001_cot/json_dump.json\", \n",
    "    \"../cache/919_gpt35_16k_cot/json_dump.json\",\n",
    "    \"../cache/919_gpt35_16k_cot_na/json_dump.json\",\n",
    "    \"../cache/919_gpt35_16k_direct/json_dump.json\",\n",
    "    \"../cache/919_gpt35_16k_direct_na/json_dump.json\",\n",
    "    \"../cache/919_gpt4_8k_cot/json_dump.json\",\n",
    "    ]\n",
    "RESULT_NAMES = [\"palm-2-cot-uahint\", \"gpt35-cot\", \"gpt35-cot-uahint\", \"gpt35-direct\", \"gpt35-direct-uahint\", \"gpt4-cot\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Obtaining Data\n",
    "\n",
    "We can use the first results file to create the base `dataset` we'll upload to Zeno with just the initial prompt intent."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "with open(RESULT_JSONS[0], \"r\") as f:\n",
    "    raw_json: dict = json.load(f)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.DataFrame(\n",
    "    {\n",
    "        \"example_id\": list(raw_json.keys()),\n",
    "        \"site\": [\", \".join(x[\"sites\"]) for x in raw_json.values()],\n",
    "        \"eval_type\": [\", \".join(x[\"eval_types\"]) for x in raw_json.values()],\n",
    "        \"achievable\": [x[\"achievable\"] for x in raw_json.values()],\n",
    "        \"context\": [\n",
    "            json.dumps(\n",
    "                [\n",
    "                    {\n",
    "                        \"role\": \"system\",\n",
    "                        \"content\": row[\"intent\"],\n",
    "                    }\n",
    "                ]\n",
    "            )\n",
    "            for row in raw_json.values()\n",
    "        ],\n",
    "    }\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Authenticate and Create a Project\n",
    "\n",
    "We can now create a new [Zeno](https://zenoml.com) project and upload this data.\n",
    "\n",
    "Create an account and API key by signing up at [Zeno Hub](https://hub.zenoml.com) and going to your [Account page](http://hub.zenoml.com/account). Save the API key in a `.env` file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# read ZENO_API_KEY from .env file\n",
    "load_dotenv(override=True)\n",
    "\n",
    "client = zeno_client.ZenoClient(\"os.environ.get(\"ZENO_API_KEY\")\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "project = client.create_project(\n",
    "    name=\"WebArena Tester\",\n",
    "    view={\n",
    "        \"data\": {\n",
    "            \"type\": \"list\",\n",
    "            \"elements\": {\"type\": \"message\", \"content\": {\"type\": \"markdown\"}},\n",
    "            \"collapsible\": \"top\",\n",
    "        },\n",
    "        \"label\": {\"type\": \"markdown\"},\n",
    "        \"output\": {\n",
    "            \"type\": \"list\",\n",
    "            \"elements\": {\n",
    "                \"type\": \"message\",\n",
    "                \"highlight\": True,\n",
    "                \"content\": {\"type\": \"markdown\"},\n",
    "            },\n",
    "            \"collapsible\": \"top\",\n",
    "        },\n",
    "    },\n",
    "    metrics=[\n",
    "        zeno_client.ZenoMetric(name=\"success\", type=\"mean\", columns=[\"success\"]),\n",
    "        zeno_client.ZenoMetric(\n",
    "            name=\"# of go backs\", type=\"mean\", columns=[\"# of go_backs\"]\n",
    "        ),\n",
    "        zeno_client.ZenoMetric(name=\"# of steps\", type=\"mean\", columns=[\"# of steps\"]),\n",
    "    ],\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "project.upload_dataset(df, id_column=\"example_id\", data_column=\"context\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Uploading Model Outputs\n",
    "\n",
    "We can now upload the full trajectory outputs for our models.\n",
    "\n",
    "If you want to display the images, you will need to upload the images to a publically accessible location and provide the URL in the `image_url` field."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "image_base_url = None"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def format_message(row):\n",
    "    return_list = []\n",
    "    for message in row[\"messages\"]:\n",
    "        role = \"user\" if \"user\" in message else \"assistant\"\n",
    "\n",
    "        if role == \"user\":\n",
    "            if image_base_url:\n",
    "                content = (\n",
    "                    \"[![image](%s/%s)](%s/%s)\\n%s\"\n",
    "                    % (\n",
    "                        image_base_url,\n",
    "                        \"/\".join(message[\"image\"].split(\"/\")[-2:]),\n",
    "                        image_base_url,\n",
    "                        \"/\".join(message[\"image\"].split(\"/\")[-2:]),\n",
    "                        message[role],\n",
    "                    )\n",
    "                )\n",
    "            else:\n",
    "                content = message[role]\n",
    "        else:\n",
    "            content = message[role]\n",
    "        return_list.append({\"role\": role, \"content\": content})\n",
    "    return return_list"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_system_df(result_path: str):\n",
    "    with open(result_path, \"r\") as f:\n",
    "        json_input: dict = json.load(f)\n",
    "    return pd.DataFrame(\n",
    "        {\n",
    "            \"example_id\": list(json_input.keys()),\n",
    "            \"# of clicks\": [\n",
    "                sum(\n",
    "                    [\n",
    "                        1\n",
    "                        for x in r[\"messages\"]\n",
    "                        if \"assistant\" in x and \"`click\" in x[\"assistant\"]\n",
    "                    ]\n",
    "                )\n",
    "                for r in json_input.values()\n",
    "            ],\n",
    "            \"# of types\": [\n",
    "                sum(\n",
    "                    [\n",
    "                        1\n",
    "                        for x in r[\"messages\"]\n",
    "                        if \"assistant\" in x and \"`type\" in x[\"assistant\"]\n",
    "                    ]\n",
    "                )\n",
    "                for r in json_input.values()\n",
    "            ],\n",
    "            \"# of go_backs\": [\n",
    "                sum(\n",
    "                    [\n",
    "                        1\n",
    "                        for x in r[\"messages\"]\n",
    "                        if \"assistant\" in x and \"`go_back\" in x[\"assistant\"]\n",
    "                    ]\n",
    "                )\n",
    "                for r in json_input.values()\n",
    "            ],\n",
    "            \"# of steps\": [len(r[\"messages\"]) for r in json_input.values()],\n",
    "            \"context\": [json.dumps(format_message(row)) for row in json_input.values()],\n",
    "            \"success\": [r[\"success\"] for r in json_input.values()],\n",
    "        }\n",
    "    )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for i, system in enumerate(RESULT_JSONS):\n",
    "    output_df = get_system_df(system)\n",
    "    project.upload_system(\n",
    "        output_df, name=RESULT_NAMES[i], id_column=\"example_id\", output_column=\"context\"\n",
    "    ) "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "zeno-build",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}