mirror of
https://github.com/web-arena-x/webarena.git
synced 2026-02-06 03:06:47 +00:00
338 lines
11 KiB
Plaintext
338 lines
11 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Exploring WebArena Results with Zeno \n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"[Zeno](https://zenoml.com/) provides interative interface to explore the results of your agents in WebArena. You can easily\n",
|
|
"* Visualize the trajectories\n",
|
|
"* Compare the performance of different agents\n",
|
|
"* Interactively select and analyze trajectories with various filters such as trajectory length "
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"!pip install zeno_client"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"import pandas as pd\n",
|
|
"import json\n",
|
|
"import os\n",
|
|
"from dotenv import load_dotenv\n",
|
|
"\n",
|
|
"import zeno_client"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"We first need to convert and combine the output `HTML` trajectories into a single `JSON` file using the `html2json` script:\n",
|
|
"Remember to change `result_folder` to the path you saved your `render_*.html`. The results will be saved to `{{result_folder}}/json_dump.json`. For example:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"!python html2json.py --result_folder ../cache/918_text_bison_001_cot --config_json ../config_files/test.raw.json\n",
|
|
"!python html2json.py --result_folder ../cache/919_gpt35_16k_cot --config_json ../config_files/test.raw.json\n",
|
|
"!python html2json.py --result_folder ../cache/919_gpt35_16k_cot_na --config_json ../config_files/test.raw.json\n",
|
|
"!python html2json.py --result_folder ../cache/919_gpt35_16k_direct --config_json ../config_files/test.raw.json\n",
|
|
"!python html2json.py --result_folder ../cache/919_gpt35_16k_direct_na --config_json ../config_files/test.raw.json\n",
|
|
"!python html2json.py --result_folder ../cache/919_gpt4_8k_cot --config_json ../config_files/test.raw.json"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Next you will record the json file names in `RESULT_JSONS` and provide the model tag in `RESULT_NAMES`"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"RESULT_JSONS = [\n",
|
|
" \"../cache/918_text_bison_001_cot/json_dump.json\", \n",
|
|
" \"../cache/919_gpt35_16k_cot/json_dump.json\",\n",
|
|
" \"../cache/919_gpt35_16k_cot_na/json_dump.json\",\n",
|
|
" \"../cache/919_gpt35_16k_direct/json_dump.json\",\n",
|
|
" \"../cache/919_gpt35_16k_direct_na/json_dump.json\",\n",
|
|
" \"../cache/919_gpt4_8k_cot/json_dump.json\",\n",
|
|
" ]\n",
|
|
"RESULT_NAMES = [\"palm-2-cot-uahint\", \"gpt35-cot\", \"gpt35-cot-uahint\", \"gpt35-direct\", \"gpt35-direct-uahint\", \"gpt4-cot\"]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Obtaining Data\n",
|
|
"\n",
|
|
"We can use the first results file to create the base `dataset` we'll upload to Zeno with just the initial prompt intent."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"with open(RESULT_JSONS[0], \"r\") as f:\n",
|
|
" raw_json: dict = json.load(f)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"df = pd.DataFrame(\n",
|
|
" {\n",
|
|
" \"example_id\": list(raw_json.keys()),\n",
|
|
" \"site\": [\", \".join(x[\"sites\"]) for x in raw_json.values()],\n",
|
|
" \"eval_type\": [\", \".join(x[\"eval_types\"]) for x in raw_json.values()],\n",
|
|
" \"achievable\": [x[\"achievable\"] for x in raw_json.values()],\n",
|
|
" \"context\": [\n",
|
|
" json.dumps(\n",
|
|
" [\n",
|
|
" {\n",
|
|
" \"role\": \"system\",\n",
|
|
" \"content\": row[\"intent\"],\n",
|
|
" }\n",
|
|
" ]\n",
|
|
" )\n",
|
|
" for row in raw_json.values()\n",
|
|
" ],\n",
|
|
" }\n",
|
|
")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Authenticate and Create a Project\n",
|
|
"\n",
|
|
"We can now create a new [Zeno](https://zenoml.com) project and upload this data.\n",
|
|
"\n",
|
|
"Create an account and API key by signing up at [Zeno Hub](https://hub.zenoml.com) and going to your [Account page](http://hub.zenoml.com/account). Save the API key in a `.env` file."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# read ZENO_API_KEY from .env file\n",
|
|
"load_dotenv(override=True)\n",
|
|
"\n",
|
|
"client = zeno_client.ZenoClient(\"os.environ.get(\"ZENO_API_KEY\")\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"project = client.create_project(\n",
|
|
" name=\"WebArena Tester\",\n",
|
|
" view={\n",
|
|
" \"data\": {\n",
|
|
" \"type\": \"list\",\n",
|
|
" \"elements\": {\"type\": \"message\", \"content\": {\"type\": \"markdown\"}},\n",
|
|
" \"collapsible\": \"top\",\n",
|
|
" },\n",
|
|
" \"label\": {\"type\": \"markdown\"},\n",
|
|
" \"output\": {\n",
|
|
" \"type\": \"list\",\n",
|
|
" \"elements\": {\n",
|
|
" \"type\": \"message\",\n",
|
|
" \"highlight\": True,\n",
|
|
" \"content\": {\"type\": \"markdown\"},\n",
|
|
" },\n",
|
|
" \"collapsible\": \"top\",\n",
|
|
" },\n",
|
|
" },\n",
|
|
" metrics=[\n",
|
|
" zeno_client.ZenoMetric(name=\"success\", type=\"mean\", columns=[\"success\"]),\n",
|
|
" zeno_client.ZenoMetric(\n",
|
|
" name=\"# of go backs\", type=\"mean\", columns=[\"# of go_backs\"]\n",
|
|
" ),\n",
|
|
" zeno_client.ZenoMetric(name=\"# of steps\", type=\"mean\", columns=[\"# of steps\"]),\n",
|
|
" ],\n",
|
|
")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"project.upload_dataset(df, id_column=\"example_id\", data_column=\"context\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Uploading Model Outputs\n",
|
|
"\n",
|
|
"We can now upload the full trajectory outputs for our models.\n",
|
|
"\n",
|
|
"If you want to display the images, you will need to upload the images to a publically accessible location and provide the URL in the `image_url` field."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"image_base_url = None"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"def format_message(row):\n",
|
|
" return_list = []\n",
|
|
" for message in row[\"messages\"]:\n",
|
|
" role = \"user\" if \"user\" in message else \"assistant\"\n",
|
|
"\n",
|
|
" if role == \"user\":\n",
|
|
" if image_base_url:\n",
|
|
" content = (\n",
|
|
" \"[](%s/%s)\\n%s\"\n",
|
|
" % (\n",
|
|
" image_base_url,\n",
|
|
" \"/\".join(message[\"image\"].split(\"/\")[-2:]),\n",
|
|
" image_base_url,\n",
|
|
" \"/\".join(message[\"image\"].split(\"/\")[-2:]),\n",
|
|
" message[role],\n",
|
|
" )\n",
|
|
" )\n",
|
|
" else:\n",
|
|
" content = message[role]\n",
|
|
" else:\n",
|
|
" content = message[role]\n",
|
|
" return_list.append({\"role\": role, \"content\": content})\n",
|
|
" return return_list"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"def get_system_df(result_path: str):\n",
|
|
" with open(result_path, \"r\") as f:\n",
|
|
" json_input: dict = json.load(f)\n",
|
|
" return pd.DataFrame(\n",
|
|
" {\n",
|
|
" \"example_id\": list(json_input.keys()),\n",
|
|
" \"# of clicks\": [\n",
|
|
" sum(\n",
|
|
" [\n",
|
|
" 1\n",
|
|
" for x in r[\"messages\"]\n",
|
|
" if \"assistant\" in x and \"`click\" in x[\"assistant\"]\n",
|
|
" ]\n",
|
|
" )\n",
|
|
" for r in json_input.values()\n",
|
|
" ],\n",
|
|
" \"# of types\": [\n",
|
|
" sum(\n",
|
|
" [\n",
|
|
" 1\n",
|
|
" for x in r[\"messages\"]\n",
|
|
" if \"assistant\" in x and \"`type\" in x[\"assistant\"]\n",
|
|
" ]\n",
|
|
" )\n",
|
|
" for r in json_input.values()\n",
|
|
" ],\n",
|
|
" \"# of go_backs\": [\n",
|
|
" sum(\n",
|
|
" [\n",
|
|
" 1\n",
|
|
" for x in r[\"messages\"]\n",
|
|
" if \"assistant\" in x and \"`go_back\" in x[\"assistant\"]\n",
|
|
" ]\n",
|
|
" )\n",
|
|
" for r in json_input.values()\n",
|
|
" ],\n",
|
|
" \"# of steps\": [len(r[\"messages\"]) for r in json_input.values()],\n",
|
|
" \"context\": [json.dumps(format_message(row)) for row in json_input.values()],\n",
|
|
" \"success\": [r[\"success\"] for r in json_input.values()],\n",
|
|
" }\n",
|
|
" )"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"for i, system in enumerate(RESULT_JSONS):\n",
|
|
" output_df = get_system_df(system)\n",
|
|
" project.upload_system(\n",
|
|
" output_df, name=RESULT_NAMES[i], id_column=\"example_id\", output_column=\"context\"\n",
|
|
" ) "
|
|
]
|
|
}
|
|
],
|
|
"metadata": {
|
|
"kernelspec": {
|
|
"display_name": "zeno-build",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.10.12"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 2
|
|
}
|