In this package, you can use MT-bench questions and prompts to evaluate your models with LLM-as-a-judge.
MT-bench is a set of challenging multi-turn open-ended questions for evaluating chat assistants.
To automate the evaluation process, we prompt strong LLMs like GPT-4 to act as judges and assess the quality of the models' responses.
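The judge works by wrapping each question and answer in a grading prompt. The sketch below is illustrative only; the wording and the `build_judge_prompt` helper are not the actual templates shipped with this package (those live in the repo's judge prompt data):

```python
# Illustrative sketch of single-answer grading with an LLM judge.
# The prompt wording below is NOT the exact template used by MT-bench;
# it only shows the general shape of the approach.

def build_judge_prompt(question: str, answer: str) -> str:
    """Assemble a grading prompt asking the judge for a 1-10 score."""
    return (
        "[Instruction]\n"
        "Please act as an impartial judge and rate the quality of the "
        "response on a scale of 1 to 10. Output the score as [[score]].\n\n"
        f"[Question]\n{question}\n\n"
        f"[Assistant's Answer]\n{answer}\n"
    )

prompt = build_judge_prompt("What is 2+2?", "2+2 equals 4.")
```

The resulting string is then sent to the judge model (e.g., GPT-4) via its chat API, and the bracketed score is parsed from the reply.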
```
git clone https://github.com/lm-sys/FastChat.git
cd FastChat
pip install -e ".[model_worker,llm_judge]"
```
We provide pre-generated model answers and judgments for some models.
You can view them at this demo.
To download the pre-generated data, use
```
python3 download_mt_bench_pregenerated.py
```
After downloading the data, you can browse it locally with

```
python3 qa_browser.py --share
```

You can also use this QA browser later to view the answers generated by your own models.
```
python gen_model_answer.py --model-path [MODEL-PATH] --model-id [MODEL-ID]
```
Arguments:

- `[MODEL-PATH]` is the path to the weights, which can be a local folder or a Hugging Face repo ID.
- `[MODEL-ID]` is a name you give to the model.

e.g.,

```
python gen_model_answer.py --model-path lmsys/vicuna-7b-v1.5 --model-id vicuna-7b-v1.5
```

The answers will be saved to `data/mt_bench/model_answer/[MODEL-ID].jsonl`.
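Each line of the answer file is one JSON record. A minimal reader might look like this; the `question_id`/`choices`/`turns` field names are an assumption about the file layout, so check one line of your own output before relying on them:

```python
import json

def load_answers(path):
    """Read a model_answer .jsonl file into a dict keyed by question_id.

    Assumes each record stores the per-turn replies under
    choices[0]["turns"], as written by gen_model_answer.py.
    """
    answers = {}
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            answers[rec["question_id"]] = rec["choices"][0]["turns"]
    return answers
```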
To make sure FastChat loads the correct prompt template, see the supported models and how to add a new model here.
You can also specify `--num-gpus-per-model` for model parallelism (needed for large models, e.g., 65B) and `--num-gpus-total` to parallelize answer generation across multiple GPUs.
There are several options to use GPT-4 as a judge, such as pairwise winrate and single-answer grading.
In MT-bench, we recommend single-answer grading as the default mode.
This mode asks GPT-4 to grade and score a model's answer directly, without pairwise comparison.
For each turn, GPT-4 gives a score on a scale of 1 to 10. We then compute the average score over all turns.
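The per-model score is just the mean over all graded turns. A minimal sketch of that averaging step, with field names assumed for illustration:

```python
from collections import defaultdict

def average_scores(judgments):
    """Average single-answer grades (one per turn) per model.

    `judgments` is a list of dicts; the "model" and "score" keys
    are assumed here for illustration.
    """
    per_model = defaultdict(list)
    for j in judgments:
        per_model[j["model"]].append(j["score"])
    return {m: sum(s) / len(s) for m, s in per_model.items()}

judgments = [
    {"model": "vicuna-7b-v1.5", "turn": 1, "score": 8},
    {"model": "vicuna-7b-v1.5", "turn": 2, "score": 6},
]
print(average_scores(judgments))  # {'vicuna-7b-v1.5': 7.0}
```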
```
export OPENAI_API_KEY=XXXXXX  # set the OpenAI API key
python gen_judgment.py --model-list [LIST-OF-MODEL-ID] --parallel [num-concurrent-api-call]
```
e.g.,
```
python gen_judgment.py --model-list vicuna-13b-v1.3 alpaca-13b llama-13b claude-v1 gpt-3.5-turbo gpt-4 --parallel 2
```
The judgments will be saved to `data/mt_bench/model_judgment/gpt-4_single.jsonl`.
To show the scores for selected models:

```
python show_result.py --model-list vicuna-13b-v1.3 alpaca-13b llama-13b claude-v1 gpt-3.5-turbo gpt-4
```

To show the scores for all models:

```
python show_result.py
```
Besides score-based single-answer grading, we also support two additional grading options based on win rates:
- `pairwise-baseline`: run pairwise comparison against a baseline model.
- `pairwise-all`: run pairwise comparison between all model pairs on all questions.

```
python gen_judgment.py --mode pairwise-baseline --model-list vicuna-13b-v1.3 alpaca-13b llama-13b --parallel 2
```

The judgments will be saved to `data/mt_bench/model_judgment/gpt-4_pair.jsonl`.

```
python show_result.py --mode pairwise-baseline
```
Another option is to run pairwise comparisons on all possible pairs.
This can be more expensive as the number of models increases, but it gives more comprehensive information.
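The cost grows quadratically with the number of models: n models give n·(n−1)/2 pairs, each judged on every question. A quick back-of-the-envelope count (per-turn and position-swap calls aside):

```python
from math import comb

def num_pairwise_judgments(n_models: int, n_questions: int = 80) -> int:
    """Judgment calls for pairwise-all: one per model pair per question.

    80 is the number of MT-bench questions; multi-turn and
    position-swapped calls would multiply this further.
    """
    return comb(n_models, 2) * n_questions

print(num_pairwise_judgments(6))   # 15 pairs * 80 questions = 1200
print(num_pairwise_judgments(12))  # 66 pairs * 80 questions = 5280
```

Doubling the number of models roughly quadruples the bill, which is why single-answer grading is the recommended default.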
```
python gen_judgment.py --mode pairwise-all --model-list [LIST-OF-MODEL-ID] --parallel [num-concurrent-api-call]
python show_result.py --mode pairwise-all
```
Run

```
python gen_api_answer.py --model [MODEL-NAME]
```

to generate answers from GPT-3.5/4 and Claude. You can use this Colab notebook to plot the radar figure for MT-bench.
We can also use vLLM for answer generation, which can be faster for the models supported by vLLM.
```
python3 -m fastchat.serve.controller
python3 -m fastchat.serve.vllm_worker --model-path [MODEL-PATH]
python3 -m fastchat.serve.openai_api_server --host localhost --port 8000
```
`[MODEL-PATH]` is the path to the weights, which can be a local folder or a Hugging Face repo ID.

```
python gen_api_answer.py --model [MODEL-NAME] --openai-api-base http://localhost:8000/v1 --parallel 50
```

- `[MODEL-NAME]` is the name of the model from Step 1.
- `--parallel` is the number of concurrent API calls to the vLLM worker.

We released 3.3K human annotations for model responses generated by 6 models in response to 80 MT-bench questions. The dataset is available at lmsys/mt_bench_human_judgments.
This Colab notebook shows how to compute the agreement between humans and the GPT-4 judge with the dataset. Our results show that humans and the GPT-4 judge achieve over 80% agreement, the same level of agreement as between humans.
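Agreement here means the fraction of comparisons on which two judges reach the same verdict (A wins, B wins, or tie). A simplified sketch of that metric; the notebook's exact setup (e.g., how ties are handled) may differ:

```python
def agreement(verdicts_a, verdicts_b):
    """Fraction of comparisons where two judges give the same verdict.

    Each verdict list holds one label per comparison,
    e.g. "A", "B", or "tie".
    """
    assert len(verdicts_a) == len(verdicts_b)
    matches = sum(a == b for a, b in zip(verdicts_a, verdicts_b))
    return matches / len(verdicts_a)

human = ["A", "B", "tie", "A", "A"]
gpt4  = ["A", "B", "A",   "A", "B"]
print(agreement(human, gpt4))  # 3 matches out of 5 -> 0.6
```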
Please cite the following paper if you find the code or datasets helpful.
@misc{zheng2023judging,
title={Judging LLM-as-a-judge with MT-Bench and Chatbot Arena},
author={Lianmin Zheng and Wei-Lin Chiang and Ying Sheng and Siyuan Zhuang and Zhanghao Wu and Yonghao Zhuang and Zi Lin and Zhuohan Li and Dacheng Li and Eric P. Xing and Hao Zhang and Joseph E. Gonzalez and Ion Stoica},
year={2023},
eprint={2306.05685},
archivePrefix={arXiv},
primaryClass={cs.CL}
}