The idea for this blog post came after finishing the lab on TF-IDF of the edx Spark specialisation courses.
In this course the labs follow a step-by-step approach where you need to write a few lines of code at every step. The lab is very detailed and easy to follow. However, I found that by focusing on a single step at a time I was missing the big picture of what's happening overall.
This blog post is my attempt to summarise and clarify things. I've written it mostly for myself but it can benefit others as well. It covers mostly the theory and doesn't reproduce the lab's code (as I think that's not allowed by the edx course policy); the snippets below are just rough illustrative sketches of my own.
The theory
TF-IDF stands for Term Frequency – Inverse Document Frequency. It’s a score that represents the importance of a word in a set of documents.
It’s made of 2 terms:
- TF (Term Frequency) represents how often a word appears in a document. The idea is that the more a word appears in a document, the more important it is. It is computed as the number of times a word appears in a document divided by the total number of words in that document: \(tf = n / T\) (where \(n\) is the number of times the word appears in the document and \(T\) is the total number of words in the document).
- IDF (Inverse Document Frequency) represents how rare a word is in a set of documents. The idea is that it's more significant for 2 documents to have a rare word in common than a common one. IDF is computed as \(N / n(t)\) where \(N\) is the total number of documents and \(n(t)\) is the number of documents containing the word \(t\).
TF-IDF is then computed as \( tf \times idf\).
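For a quick made-up example: if the word 'spark' appears 2 times in a 10-word document then \(tf = 2 / 10 = 0.2\); if it appears in 1 document out of a set of 3 then \(idf = 3 / 1 = 3\); its TF-IDF score in that document is therefore \(0.2 \times 3 = 0.6\).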
TF is a local score as it depends only on a given document: the same word has a different TF value for every document it appears in.
IDF is a global score as the score of a given word is the same for all documents.
TF-IDF to perform Entity resolution
TF-IDF can be used to measure the similarity between documents. It is especially useful for entity resolution, when we want to identify duplicates between 2 different datasets.
In this post I am going to detail how to use TF-IDF to perform an entity resolution between 2 datasets.
Tokenisation
The first step is to tokenise the documents. This operation transforms every document into a list of tokens (or bag of words). At this stage we also remove stop-words (common words such as ‘the’, ‘a’, ‘to’, … that don’t carry much meaning). We can also perform stemming (removing word suffixes – e.g. plural forms or conjugations).
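As a rough illustration (plain Python rather than the lab's PySpark code, with a tiny made-up stop-word list):

```python
import re

# A tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "to", "of", "and", "in", "is"}

def tokenise(document):
    """Split a document into lowercase word tokens and drop stop-words."""
    tokens = re.findall(r"[a-z0-9]+", document.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(tokenise("The quick brown fox jumps over the lazy dog"))
# ['quick', 'brown', 'fox', 'jumps', 'over', 'lazy', 'dog']
```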
Term frequency
We are now ready to implement a TF function that computes the TF scores for a tokenised document. This step associates a TF score with every unique token of every document.
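A minimal sketch of such a function (again plain Python, not the lab's implementation):

```python
from collections import Counter

def tf(tokens):
    """Map each unique token to its frequency: count / total number of tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    return {token: count / total for token, count in counts.items()}

print(tf(["spark", "rdd", "spark", "lab"]))
# {'spark': 0.5, 'rdd': 0.25, 'lab': 0.25}
```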
Inverse Document Frequency
The next step consists in computing the IDF values with the following steps (a code sketch follows the list):
- eliminate duplicate tokens inside each document
- count, for each unique token, the number of documents it appears in, \(n(t)\)
- compute the total number of documents \(N\)
- compute the IDF for each token as \(N/n(t)\)
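Here is a minimal plain-Python sketch of these steps, assuming the corpus is stored as a dict mapping a document id to its token list (the names and the toy corpus are mine, not the lab's):

```python
from collections import Counter

def idf(corpus):
    """Compute IDF = N / n(t) for every token of a corpus of tokenised documents."""
    n_docs = len(corpus)
    # Count in how many documents each token appears (duplicates removed per document).
    doc_freq = Counter(token for tokens in corpus.values() for token in set(tokens))
    return {token: n_docs / df for token, df in doc_freq.items()}

corpus = {"doc1": ["spark", "rdd", "spark"], "doc2": ["spark", "dataframe"]}
print(idf(corpus))
# e.g. {'spark': 1.0, 'rdd': 2.0, 'dataframe': 2.0} (key order may vary)
```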
TF-IDF scores
The IDF values are the same across all documents, so we can share them to compute the TF-IDF score of every token in every document.
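Reusing the `tf` and `idf` sketches above on the same toy corpus, a TF-IDF function could simply look like this:

```python
def tf_idf(tokens, idfs):
    """Compute TF-IDF for one tokenised document, given the shared IDF dictionary."""
    tfs = tf(tokens)
    return {token: tfs[token] * idfs[token] for token in tfs}

idfs = idf(corpus)
print(tf_idf(corpus["doc1"], idfs))
# {'spark': 0.666..., 'rdd': 0.666...}
```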
Similarity scores
Now that we have a list of TF-IDF values for every token in every document, we can measure the similarity between documents.
A common way to measure similarity is the cosine similarity. The idea is to treat the tokens as dimensions: a document is therefore represented by an n-dimensional vector. Then, to compare 2 documents we just have to compare 2 n-dimensional vectors. The idea is that 2 vectors are similar if they point in the same direction. We measure this by computing the cosine of the angle between the 2 vectors:
\(\cos \theta = \frac{a \cdot b}{\|a\| \, \|b\|}\)
\(\cos \theta\) is our similarity score.
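Since only the tokens present in a document get a non-zero coordinate, the TF-IDF dictionaries can be treated as sparse vectors. A possible sketch of the cosine similarity on such dictionaries:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as {token: weight} dicts."""
    dot = sum(weight * b.get(token, 0.0) for token, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

print(cosine_similarity({"spark": 0.6, "rdd": 0.3}, {"spark": 0.4, "dataframe": 0.2}))
# ≈ 0.8
```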
Finding duplicates
We are now ready to chase duplicates in our dataset. The brute-force approach consists in comparing every document against every other document. We then just have to study the distribution of the similarity scores and set an appropriate threshold to identify duplicates. (It's worth plotting precision, recall and F1 score to pick the correct threshold.)
This approach works on small datasets but clearly doesn’t scale.
Instead it would be better to build an inverted index mapping tokens to documents and only compute similarities for documents having tokens in common. This optimised approach is illustrated in the picture below.
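A rough plain-Python sketch of the inverted-index idea (the document ids and contents are made up; a Spark version would express the same steps as RDD transformations):

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(corpus):
    """Use an inverted index (token -> document ids) to keep only the pairs of
    documents that share at least one token."""
    index = defaultdict(set)
    for doc_id, tokens in corpus.items():
        for token in set(tokens):
            index[token].add(doc_id)
    pairs = set()
    for doc_ids in index.values():
        pairs.update(combinations(sorted(doc_ids), 2))
    return pairs

corpus = {"a": ["spark", "rdd"], "b": ["spark", "dataframe"], "c": ["python"]}
print(candidate_pairs(corpus))
# {('a', 'b')} -- 'c' shares no token with the others, so it is never compared
```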
