MS MARCO Document Ranking
MS MARCO (Microsoft Machine Reading Comprehension) is a large-scale dataset. The current release has 1,010,916 unique real queries, generated by sampling and anonymizing Bing usage logs. The corpus for the document ranking task has 3.2 million documents, and the training set has 367,013 queries. More details are available at MSMARCO Document Ranking.
Results
Results of the runs we submitted.
| Retriever | Reranker | Coor-Ascent | dev | eval |
|---|---|---|---|---|
| ANCE FirstP | - | - | 0.373 | 0.334 |
| ANCE MaxP | - | - | 0.383 | 0.342 |
| ANCE FirstP+BM25 | BERT Base FirstP | - | 0.407 | - |
| ANCE FirstP+BM25 | BERT Base FirstP | + | 0.431 | 0.380 |
| ANCE MaxP | BERT Base MaxP | - | 0.409 | - |
| ANCE MaxP | BERT Base MaxP | + | 0.432 | 0.391 |
Datasets & Checkpoints
For BERT FirstP, we concatenate the title and content of each document with a '[SEP]' token. For BERT MaxP, we use only the content of each document. To reproduce our runs, preprocess the official document file into the format doc_id \t doc.
| Type | File | Records | Format | Description |
|---|---|---|---|---|
| Corpus | msmarco-docs.tsv | 3,213,835 | tsv: docid, url, title, body | Document Collections |
| Train | msmarco-doctrain-queries.tsv | 367,013 | tsv: qid, query | Training Queries |
| Train | msmarco-doctrain-qrels.tsv | 384,597 | TREC qrels | Training Query-Doc Relevance Labels |
| Train | Training-Data-FirstP | 7,340,240 | tsv: qid, docid, label | ANCE FirstP training data |
| Train | Training-Data-MaxP | 7,340,240 | tsv: qid, docid, label | ANCE MaxP training data |
| Dev | msmarco-docdev-queries.tsv | 5,193 | tsv: qid, query | Dev Queries |
| Dev | msmarco-docdev-qrels.tsv | 5,478 | TREC qrels | Dev Query-Doc Relevance Labels |
| Dev | ANCE-FirstP-dev-top100 | 519,300 | TREC submission | ANCE FirstP dev top100 |
| Dev | ANCE-MaxP-dev-top100 | 519,300 | TREC submission | ANCE MaxP dev top100 |
| Test | docleaderboard-queries.tsv | 5,793 | tsv: qid, query | Test Queries |
| Test | ANCE-FirstP-eval-top100 | 579,300 | TREC submission | ANCE FirstP eval top100 |
| Test | ANCE-MaxP-eval-top100 | 579,300 | TREC submission | ANCE MaxP eval top100 |
| Model | BERT-Base-ANCE-FirstP | - | - | BERT Base ANCE FirstP checkpoint |
| Model | BERT-Base-ANCE-MaxP | - | - | BERT Base ANCE MaxP checkpoint |
| Model | F-MaxP | - | - | BERT Base ANCE MaxP Coor-Ascent weights |
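The conversion from the official corpus file to the doc_id \t doc format is not shown in this document, so the following is only a minimal sketch of what it might look like. It assumes each row of msmarco-docs.tsv has exactly the four tab-separated fields listed above (docid, url, title, body). The function names `convert_line` and `convert_file` are hypothetical and are not part of the repository.

```python
def convert_line(line: str, mode: str = "firstp") -> str:
    """Turn one `docid \\t url \\t title \\t body` row into `doc_id \\t doc`."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 4:
        raise ValueError(f"expected 4 tab-separated fields, got {len(fields)}")
    docid, _url, title, body = fields
    if mode == "firstp":
        # FirstP input: title and content joined by a [SEP] marker.
        doc = f"{title} [SEP] {body}"
    elif mode == "maxp":
        # MaxP input: content only.
        doc = body
    else:
        raise ValueError(f"unknown mode: {mode}")
    return f"{docid}\t{doc}"


def convert_file(src_path: str, dst_path: str, mode: str) -> None:
    """Stream the corpus line by line so the 3.2M documents never sit in memory."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(convert_line(line, mode) + "\n")
```

Under these assumptions, `convert_file("data/msmarco-docs.tsv", "data/msmarco-docs-firstp.tsv", "firstp")` would produce the FirstP document file, and the same call with `"maxp"` the MaxP one.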
Inference
BERT FirstP
We provide the ANCE FirstP top-100 documents for the dev and docleaderboard queries in standard TREC format, hosted on Aliyun and available for download.
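For readers unfamiliar with the TREC run format, here is a small sketch of how such a file can be read. It assumes the usual six whitespace-separated columns (query id, the literal Q0, document id, rank, score, run tag). The helper name `read_trec_run` is hypothetical.

```python
from collections import defaultdict


def read_trec_run(lines):
    """Group (docid, rank, score) tuples by query id from TREC run lines."""
    run = defaultdict(list)
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        qid, _q0, docid, rank, score, _tag = parts
        run[qid].append((docid, int(rank), float(score)))
    return dict(run)
```

Because the function accepts any iterable of lines, an open file handle can be passed directly. With 100 documents for each of the 5,193 dev queries, the dev run has the 519,300 records listed in the table above.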
Preprocess the dev and eval datasets. msmarco-docs-firstp.tsv is the preprocessed document file, where each line is doc_id \t title [SEP] content:
python data/preprocess.py -input_trec data/ANCE_FirstP_dev.trec -input_qrels data/msmarco-docdev-qrels.tsv -input_queries data/msmarco-docdev-queries.tsv -input_docs data/msmarco-docs-firstp.tsv -output data/msmarco-doc_dev_firstp.jsonl
python data/preprocess.py -input_trec data/ANCE_FirstP_eval.trec -input_queries data/docleaderboard-queries.tsv -input_docs data/msmarco-docs-firstp.tsv -output data/msmarco-doc_eval_firstp.jsonl
The checkpoint of BERT Base FirstP is available at BERT-Base-ANCE-FirstP. Now you can reproduce ANCE FirstP + BERT Base FirstP, MRR@100(dev): 0.4079.
CUDA_VISIBLE_DEVICES=0 \
python inference.py \
-task classification \
-model bert \
-max_input 12800000 \
-test ./data/msmarco-doc_dev_firstp.jsonl \
-vocab bert-base-uncased \
-pretrain bert-base-uncased \
-checkpoint ./checkpoints/bert-base_ance_firstp.bin \
-res ./results/bert-base_ance_dev_firstp.trec \
-max_query_len 64 \
-max_doc_len 445 \
-batch_size 256
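The MRR@100 figure quoted above can be sanity-checked independently of the repository's own evaluation code. The sketch below is one plain way to compute it. It assumes the run is already reduced to a ranked list of document ids per query, and it averages over the queries present in the qrels, which is an assumption about how the reported number is normalized. The name `mrr_at_k` is hypothetical.

```python
def mrr_at_k(run, qrels, k=100):
    """Mean reciprocal rank of the first relevant document within the top k.

    run:   {qid: [docid, ...]} ranked best first
    qrels: {qid: set of relevant docids}
    """
    total = 0.0
    for qid, relevant in qrels.items():
        for rank, docid in enumerate(run.get(qid, [])[:k], start=1):
            if docid in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(qrels) if qrels else 0.0
```

For example, if one query has its relevant document at rank 2 and another at rank 1, the result is (0.5 + 1.0) / 2 = 0.75.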
BERT MaxP
ANCE MaxP top-100 documents of dev and docleaderboard queries are also provided.
Preprocess the dev and eval datasets. msmarco-docs-maxp.tsv is the preprocessed document file, where each line is doc_id \t content. The commands below assume the downloaded ANCE MaxP runs are saved as data/ANCE_MaxP_dev.trec and data/ANCE_MaxP_eval.trec:
python data/preprocess.py -input_trec data/ANCE_MaxP_dev.trec -input_qrels data/msmarco-docdev-qrels.tsv -input_queries data/msmarco-docdev-queries.tsv -input_docs data/msmarco-docs-maxp.tsv -output data/msmarco-doc_dev_maxp.jsonl
python data/preprocess.py -input_trec data/ANCE_MaxP_eval.trec -input_queries data/docleaderboard-queries.tsv -input_docs data/msmarco-docs-maxp.tsv -output data/msmarco-doc_eval_maxp.jsonl
The checkpoint of BERT Base MaxP is available at BERT-Base-ANCE-MaxP. Now you can reproduce ANCE MaxP + BERT Base MaxP, MRR@100(dev): 0.4094.
CUDA_VISIBLE_DEVICES=0 \
python inference.py \
-task classification \
-model bert \
-max_input 12800000 \
-test ./data/msmarco-doc_dev_maxp.jsonl \
-vocab bert-base-uncased \
-pretrain bert-base-uncased \
-checkpoint ./checkpoints/bert-base_ance_maxp.bin \
-res ./results/bert-base_ance_dev_maxp.trec \
-max_query_len 64 \
-max_doc_len 445 \
-maxp \
-batch_size 64
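The difference between the FirstP and MaxP settings can be illustrated with a toy sketch. It is not the code behind the -maxp flag. It assumes non-overlapping passages of max_doc_len whitespace tokens (the real model works on BERT word pieces) and a caller-supplied `score_passage` function standing in for the BERT reranker. All names here are hypothetical.

```python
from typing import Callable, List

Scorer = Callable[[str, List[str]], float]


def split_passages(doc_tokens: List[str], max_doc_len: int = 445) -> List[List[str]]:
    """Cut a document into consecutive, non-overlapping passages."""
    passages = [doc_tokens[i:i + max_doc_len]
                for i in range(0, len(doc_tokens), max_doc_len)]
    return passages or [[]]  # an empty document still yields one empty passage


def firstp_score(query: str, doc_tokens: List[str], score_passage: Scorer,
                 max_doc_len: int = 445) -> float:
    """FirstP: score only the first passage of the document."""
    return score_passage(query, doc_tokens[:max_doc_len])


def maxp_score(query: str, doc_tokens: List[str], score_passage: Scorer,
               max_doc_len: int = 445) -> float:
    """MaxP: score every passage and keep the best one."""
    return max(score_passage(query, p)
               for p in split_passages(doc_tokens, max_doc_len))
```

MaxP scores several passages per document instead of one, which is one plausible reason the MaxP commands use a smaller batch size than the FirstP ones.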
We also provide the weights of the BERT Base MaxP features learned by Coor-Ascent: F-MaxP. First, generate the BERT Base MaxP features for the eval dataset.
CUDA_VISIBLE_DEVICES=0 \
python gen_feature.py \
-task classification \
-model bert \
-max_input 12800000 \
-dev ./data/msmarco-doc_eval_maxp.jsonl \
-vocab bert-base-uncased \
-pretrain bert-base-uncased \
-checkpoint ./checkpoints/bert-base_ance_maxp.bin \
-res ./features/bert-base_ance_eval_maxp_features \
-max_query_len 64 \
-max_doc_len 445 \
-maxp \
-batch_size 64
Then, we compute the ranking score using the weights.
java -jar LeToR/RankLib-2.1-patched.jar -load checkpoints/f_maxp.ca -rank features/bert-base_ance_eval_maxp_features -score f0.score
python LeToR/gen_trec.py -dev data/msmarco-doc_eval_maxp.jsonl -res results/bert-base_ance_eval_maxp_ca.trec -k -1
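Conceptually, the scoring step above applies a linear combination of the features, since RankLib's Coordinate Ascent ranker learns one weight per feature. The sketch below shows that computation on a single feature line. It assumes the features file follows the LETOR text format that RankLib reads (`label qid:<id> <fid>:<value> ... # comment`) and that the comment carries the document id, which is a guess about gen_feature.py's output. The function names are hypothetical.

```python
def parse_feature_line(line):
    """Split one LETOR-format line into (label, qid, features, comment)."""
    body, _, comment = line.partition("#")
    tokens = body.split()
    label = float(tokens[0])
    qid = tokens[1].split(":", 1)[1]
    features = {int(fid): float(value)
                for fid, value in (tok.split(":", 1) for tok in tokens[2:])}
    return label, qid, features, comment.strip()


def linear_score(features, weights):
    """Weighted sum of feature values; features without a weight contribute 0."""
    return sum(weights.get(fid, 0.0) * value for fid, value in features.items())
```

Sorting the documents of each query by this score in descending order gives the final ranking that is written out in TREC format.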
Training
You can also finetune BERT yourself instead of using our checkpoints.
BERT FirstP
We provide our training data (qid, docid, label): Training-Data-FirstP. For each training query, 10 negative documents are randomly sampled from the ANCE FirstP top-100 documents. Because the full dev set is too large to evaluate every 10,000 steps, we evaluate only the top-100 documents of the first 50 dev queries: msmarco-doc_dev_firstp-50.jsonl.
CUDA_VISIBLE_DEVICES=0 \
python train.py \
-task classification \
-model bert \
-train queries=./data/msmarco-doctrain-queries.tsv,docs=./data/msmarco-docs-firstp.tsv,qrels=./data/msmarco-doctrain-qrels.tsv,trec=./data/bids_marco-doc_ance-firstp-10.tsv \
-max_input 12800000 \
-save ./checkpoints/bert-base-firstp.bin \
-dev ./data/msmarco-doc_dev_firstp-50.jsonl \
-qrels ./data/msmarco-docdev-qrels.tsv \
-vocab bert-base-uncased \
-pretrain bert-base-uncased \
-res ./results/bert.trec \
-metric mrr_cut_100 \
-max_query_len 64 \
-max_doc_len 445 \
-epoch 1 \
-batch_size 4 \
-lr 3e-6 \
-n_warmup_steps 100000 \
-eval_every 10000
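The negative sampling described above can be sketched as follows. This is an illustration only: the script used to build Training-Data-FirstP is not shown here, so the seeding, the exclusion of labeled positives from the candidate pool, and the function name `sample_negatives` are all assumptions.

```python
import random


def sample_negatives(top_docs, positives, n_neg=10, seed=0):
    """Randomly pick up to n_neg non-relevant documents from a query's top-100 list."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    candidates = [docid for docid in top_docs if docid not in positives]
    return rng.sample(candidates, min(n_neg, len(candidates)))
```

Each sampled document id would then be written as a row with label 0 for the query, alongside the rows for its relevant documents.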
After BERT finetuning, we choose the checkpoint that scores best on the dev set and use it to generate BERT features.
CUDA_VISIBLE_DEVICES=0 \
python gen_feature.py \
-task classification \
-model bert \
-max_input 12800000 \
-dev ./data/msmarco-doc_dev_firstp.jsonl \
-vocab bert-base-uncased \
-pretrain bert-base-uncased \
-checkpoint ./checkpoints/bert-base-firstp.bin \
-res ./features/bert-base_ance_dev_firstp_features \
-max_query_len 64 \
-max_doc_len 445 \
-batch_size 256
Then, we run Coor-Ascent on these features using RankLib to learn the weight of each feature.
java -jar LeToR/RankLib-2.1-patched.jar -train features/bert-base_ance_dev_firstp_features -ranker 4 -metric2t RR@100 -save checkpoints/f_firstp.ca
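To give a feel for what coordinate ascent does with these features, here is a deliberately simplified sketch that tunes one weight at a time and keeps a change only when mean reciprocal rank improves. It is not RankLib's implementation, which differs in its search details. The step sizes, the number of rounds, and all function names are assumptions chosen for illustration.

```python
def reciprocal_rank(scores, labels, k=100):
    """1/rank of the first relevant document in the top k, else 0."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
    for rank, i in enumerate(order, start=1):
        if labels[i] > 0:
            return 1.0 / rank
    return 0.0


def mean_rr(weights, queries, k=100):
    """queries: list of (feature_rows, labels), one entry per query."""
    total = 0.0
    for rows, labels in queries:
        scores = [sum(w * f for w, f in zip(weights, row)) for row in rows]
        total += reciprocal_rank(scores, labels, k)
    return total / len(queries)


def coordinate_ascent(queries, n_features, steps=(0.5, 0.1, 0.02), n_rounds=5):
    """Greedy per-coordinate search over linear feature weights."""
    weights = [1.0 / n_features] * n_features
    best = mean_rr(weights, queries)
    for _ in range(n_rounds):
        improved = False
        for j in range(n_features):
            for step in steps:
                for delta in (step, -step):
                    trial = list(weights)
                    trial[j] += delta  # move along a single coordinate
                    score = mean_rr(trial, queries)
                    if score > best:
                        weights, best, improved = trial, score, True
        if not improved:
            break  # no coordinate move helped in this round
    return weights, best
```

On a toy query where the first feature favors an irrelevant document and the second favors the relevant one, the search lowers the first weight until the relevant document ranks first.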
Finally, generate the features for the eval dataset and compute the ranking scores using the learned feature weights, following the same steps as in the Inference section.
BERT MaxP
We provide our training data (qid, docid, label): Training-Data-MaxP. For each training query, 10 negative documents are randomly sampled from the ANCE MaxP top-100 documents. Because the full dev set is too large to evaluate every 10,000 steps, we evaluate only the top-100 documents of the first 50 dev queries: msmarco-doc_dev_maxp-50.jsonl.
Train.
CUDA_VISIBLE_DEVICES=0,1,2,3 \
python train.py \
-task classification \
-model bert \
-train queries=./data/msmarco-doctrain-queries.tsv,docs=./data/msmarco-docs-maxp.tsv,qrels=./data/msmarco-doctrain-qrels.tsv,trec=./data/bids_marco-doc_ance-maxp-10.tsv \
-max_input 12800000 \
-save ./checkpoints/bert-base-maxp.bin \
-dev ./data/msmarco-doc_dev_maxp-50.jsonl \
-qrels ./data/msmarco-docdev-qrels.tsv \
-vocab bert-base-uncased \
-pretrain bert-base-uncased \
-res ./results/bert.trec \
-metric mrr_cut_100 \
-max_query_len 64 \
-max_doc_len 445 \
-maxp \
-epoch 1 \
-batch_size 8 \
-lr 2e-5 \
-n_warmup_steps 50000 \
-eval_every 10000
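The reduced dev file used for periodic evaluation could be derived from the full preprocessed dev file along the following lines. This is a sketch under assumptions: the JSON field holding the query id is taken to be `query_id`, which is a guess about the output of data/preprocess.py, and the helper name `first_n_queries` is hypothetical.

```python
import json


def first_n_queries(jsonl_lines, n=50, qid_key="query_id"):
    """Keep every record that belongs to the first n distinct queries."""
    kept = []
    seen = set()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        qid = json.loads(line)[qid_key]
        if qid not in seen:
            if len(seen) >= n:
                continue  # a later query: drop its records
            seen.add(qid)
        kept.append(line)
    return kept
```

With 100 candidate documents per query, the resulting file would hold about 5,000 records, small enough to evaluate every 10,000 training steps.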
After BERT finetuning, we choose the checkpoint that scores best on the dev set and use it to generate BERT features.
CUDA_VISIBLE_DEVICES=0 \
python gen_feature.py \
-task classification \
-model bert \
-max_input 12800000 \
-dev ./data/msmarco-doc_dev_maxp.jsonl \
-vocab bert-base-uncased \
-pretrain bert-base-uncased \
-checkpoint ./checkpoints/bert-base-maxp.bin \
-res ./features/bert-base_ance_dev_maxp_features \
-max_query_len 64 \
-max_doc_len 445 \
-maxp \
-batch_size 64
Then, we run Coor-Ascent on these features using RankLib to learn the weight of each feature.
java -jar LeToR/RankLib-2.1-patched.jar -train features/bert-base_ance_dev_maxp_features -ranker 4 -metric2t RR@100 -save checkpoints/f_maxp.ca
Finally, generate the features for the eval dataset and compute the ranking scores using the learned feature weights, following the same steps as in the Inference section.