TF-IDF and BM25

Using the classic Cranfield dataset, this notebook shows how to use TF-IDF and BM25 to compute similarity scores between a query and each document, and then evaluates the retrieved documents with precision and recall. Note that the ranking of the returned documents is not yet considered: the evaluation treats the results as an unordered set.
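As a rough illustration of the idea before loading the real data, the sketch below scores a tiny made-up corpus with both methods and computes set-based precision and recall. Everything in it is an assumption for illustration only: the toy documents, the query, the relevance judgments, whitespace tokenization, the plain `log(N / df)` IDF, and the BM25 parameters `k1=1.5` and `b=0.75`. The helper names (`tfidf_score`, `bm25_score`, `precision_recall`) are hypothetical, and the implementation used later in the notebook may differ.

```python
import math
from collections import Counter

# Toy corpus, query, and relevance judgments (made up, not from Cranfield).
toy_docs = {
    1: "boundary layer flow over a flat plate",
    2: "heat transfer in laminar boundary layer",
    3: "supersonic wing design",
}
toy_query = "boundary layer heat"
toy_relevant = {2}

# Whitespace tokenization is an assumption; real pipelines usually also
# lowercase, remove stop words, and stem.
tokenized = {doc_id: text.split() for doc_id, text in toy_docs.items()}
n_docs = len(tokenized)
avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs

# Document frequency: number of documents containing each term.
doc_freq = Counter(term for tokens in tokenized.values() for term in set(tokens))


def tfidf_score(query_terms, tokens):
    """Sum of tf * idf over matching query terms, with idf = log(N / df)."""
    tf = Counter(tokens)
    return sum(tf[t] * math.log(n_docs / doc_freq[t]) for t in query_terms if t in tf)


def bm25_score(query_terms, tokens, k1=1.5, b=0.75):
    """Okapi BM25 with a smoothed, non-negative IDF (one common variant)."""
    tf = Counter(tokens)
    score = 0.0
    for t in query_terms:
        if t not in tf:
            continue
        idf = math.log((n_docs - doc_freq[t] + 0.5) / (doc_freq[t] + 0.5) + 1)
        # Length normalization: longer-than-average documents are penalized.
        norm = tf[t] + k1 * (1 - b + b * len(tokens) / avg_len)
        score += idf * tf[t] * (k1 + 1) / norm
    return score


def precision_recall(retrieved, relevant):
    """Set-based precision and recall; ranking is ignored."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


query_terms = toy_query.split()
tfidf_scores = {d: tfidf_score(query_terms, toks) for d, toks in tokenized.items()}
bm25_scores = {d: bm25_score(query_terms, toks) for d, toks in tokenized.items()}

# Treat every document with a positive BM25 score as "retrieved".
retrieved = {d for d, s in bm25_scores.items() if s > 0}
precision, recall = precision_recall(retrieved, toy_relevant)
print(tfidf_scores, bm25_scores, precision, recall)
```

In this toy setup, documents 1 and 2 both match the query, so both are retrieved, while only document 2 is judged relevant. That is the kind of case where set-based precision drops even though recall is perfect, and where taking the ranking into account would give a more informative picture.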

import numpy as np
import pandas as pd

# load data into dataframes
docs = pd.read_json("data/cranfield_docs.json")
queries = pd.read_json("data/cranfield_queries.json")
relevance = pd.read_json("data/cranfield_relevance.json")

docs.head()

queries.head()