Term Frequency-Inverse Document Frequency (TF-IDF)
Introduction
TF-IDF is a statistical measure used to evaluate the importance of a word in a document relative to a collection of documents (a corpus). It is widely used in information retrieval, text mining, and natural language processing (NLP). The goal of TF-IDF is to weigh how often a term appears in a document against how common it is across the corpus, highlighting words that are significant to a specific document but not widespread in the collection as a whole.

Understanding TF-IDF
Components of TF-IDF
TF-IDF is composed of two main components:

1. Term Frequency (TF): This measures how frequently a term appears in a document. It is calculated as:

$$ TF(t, d) = \frac{f(t, d)}{N_d} $$

Where:
- $f(t, d)$ is the number of times term $t$ appears in document $d$.
- $N_d$ is the total number of terms in document $d$.

2. Inverse Document Frequency (IDF): This measures how rare, and therefore how informative, a term is across the entire corpus; terms that occur in many documents receive a lower weight. It is calculated as:

$$ IDF(t, D) = \log\left(\frac{N}{n(t)}\right) $$

Where:
- $N$ is the total number of documents in the corpus.
- $n(t)$ is the number of documents containing term $t$.
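As a minimal sketch of these two formulas (assuming simple pre-tokenized input and a base-10 logarithm to match the worked example later in this section; the function names are illustrative, not taken from any particular library):

```python
import math

def term_frequency(term, tokens):
    """TF(t, d): occurrences of the term divided by the total number of tokens."""
    return tokens.count(term) / len(tokens)

def inverse_document_frequency(term, tokenized_corpus):
    """IDF(t, D): log of (total documents / documents containing the term).

    Assumes the term occurs in at least one document; otherwise the division fails.
    """
    containing = sum(1 for tokens in tokenized_corpus if term in tokens)
    return math.log10(len(tokenized_corpus) / containing)

# Tiny usage example with two toy documents.
docs = [["the", "cat", "sat"], ["the", "dog", "ran"]]
print(term_frequency("cat", docs[0]))            # 1/3 ≈ 0.333
print(inverse_document_frequency("cat", docs))   # log10(2/1) ≈ 0.301
```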
Calculating TF-IDF
The TF-IDF score for a term $t$ in document $d$ is given by the product of its TF and IDF:

$$ TFIDF(t, d, D) = TF(t, d) \times IDF(t, D) $$
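In practice this product is rarely computed by hand; libraries such as scikit-learn build the full term-document weight matrix in one call. The sketch below applies it to the three short documents used in the example that follows. Note that TfidfVectorizer uses a smoothed, natural-log IDF and L2-normalizes each document vector by default, so its numbers will not match the hand-computed values exactly; the sketch only illustrates the workflow.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "Cats and dogs are great pets.",
]

vectorizer = TfidfVectorizer()                   # smoothed IDF, L2 normalization (defaults)
tfidf_matrix = vectorizer.fit_transform(corpus)  # rows = documents, columns = vocabulary terms

# Weight of the term "cat" in the first document.
cat_index = vectorizer.vocabulary_["cat"]
print(tfidf_matrix[0, cat_index])
```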
Example Calculation

Consider a small corpus of three documents:
- Document 1: "The cat sat on the mat."
- Document 2: "The dog sat on the log."
- Document 3: "Cats and dogs are great pets."

Let's calculate the TF-IDF for the term "cat" in Document 1.

1. Calculate TF:
   - $f(\text{cat}, \text{Document 1}) = 1$ (appears once)
   - $N_{\text{Document 1}} = 6$ (total words)
   - $TF(\text{cat}, \text{Document 1}) = \frac{1}{6} \approx 0.167$
2. Calculate IDF:
   - $N = 3$ (total documents)
   - $n(\text{cat}) = 2$ ("cat" appears in Document 1 and, as "Cats", in Document 3)
   - $IDF(\text{cat}, D) = \log\left(\frac{3}{2}\right) \approx 0.176$ (using a base-10 logarithm)
3. Calculate TF-IDF:
   - $TFIDF(\text{cat}, \text{Document 1}, D) = 0.167 \times 0.176 \approx 0.029$
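The arithmetic above can be checked with a few lines of Python (a sketch that assumes the same conventions as the example: 6 word tokens in Document 1, a base-10 logarithm, and "Cats" in Document 3 counted as an occurrence of "cat"):

```python
import math

tf_cat = 1 / 6                # f(cat, Document 1) / N_Document1
idf_cat = math.log10(3 / 2)   # log(N / n(cat)), base-10 as in the example
tfidf_cat = tf_cat * idf_cat

print(round(tf_cat, 3), round(idf_cat, 3), round(tfidf_cat, 3))
# 0.167 0.176 0.029
```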