Understanding TF-IDF in Natural Language Processing

Introduction: TF-IDF (Term Frequency-Inverse Document Frequency) is a widely used technique in natural language processing for scoring how important a term is to a document within a collection of documents. In this blog post, we will explore the concept of TF-IDF and demonstrate its implementation in Python, using NLTK for preprocessing and scikit-learn for the weighting.

TF-IDF Calculation:

TF-IDF is calculated using the following mathematical equation:

TF-IDF = TF * IDF

Where:

  • TF (Term Frequency) measures the frequency of a term in a document. It is calculated as the ratio of the number of times a term appears in a document to the total number of terms in the document.

  • IDF (Inverse Document Frequency) measures the importance of a term in the entire collection of documents. It is calculated as the logarithm of the ratio of the total number of documents to the number of documents containing the term.
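As a minimal sketch of the two definitions above, the following Python functions compute TF and IDF for documents that are already tokenized. The function names and the toy corpus are illustrative, and the base-10 logarithm is an assumption; a different base only rescales the values.

```python
import math

def term_frequency(term, tokens):
    """Share of the document's tokens that are `term`."""
    return tokens.count(term) / len(tokens)

def inverse_document_frequency(term, documents):
    """log10(total documents / documents containing `term`).

    Assumes `term` occurs in at least one document.
    """
    containing = sum(1 for tokens in documents if term in tokens)
    return math.log10(len(documents) / containing)

def tf_idf(term, tokens, documents):
    return term_frequency(term, tokens) * inverse_document_frequency(term, documents)

# Hypothetical toy corpus, already tokenized.
docs = [["red", "apple"], ["green", "apple"], ["red", "green", "apple"]]
print(tf_idf("red", docs[0], docs))    # term found in two of three documents
print(tf_idf("apple", docs[0], docs))  # term found in every document scores 0.0
```

A term that occurs in every document gets an IDF of log(1) = 0, so its TF-IDF is 0 no matter how often it appears.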

Let's consider the following example texts:

Texts = ['Girl is good', 'Boy is good', 'Girl and Boy is Good']

To calculate the TF-IDF values for the terms in these texts, we need to perform the following steps:

  1. Tokenization (after lowercasing):

    • Text: 'Girl is good'

    • Tokens: ['girl', 'is', 'good']

  2. Stemming:

    • Stemmed Tokens: ['girl', 'is', 'good']

  3. Stopword Removal:

    • Stopword-Free Tokens: ['girl', 'good']

  4. Preprocessing (joining the remaining tokens back into a string):

    • Preprocessed Text: 'girl good'

We repeat the above steps for each text in the collection. The processed texts become:

Processed Texts = ['girl good', 'boy good', 'girl boy good']
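The steps above can be sketched without any external library. This is only an illustration under stated assumptions: the set STOPWORDS is a tiny hand-picked stand-in for a real stopword list, whitespace splitting stands in for a proper tokenizer, and stemming is skipped because the example words are left unchanged by it.

```python
# Hypothetical stand-in for a full stopword list; covers only this example.
STOPWORDS = {"is", "and"}

def preprocess(text):
    tokens = text.lower().split()                       # lowercase + naive tokenization
    tokens = [t for t in tokens if t.isalpha()]         # keep alphabetic tokens only
    tokens = [t for t in tokens if t not in STOPWORDS]  # stopword removal
    return " ".join(tokens)                             # rebuild the text

texts = ['Girl is good', 'Boy is good', 'Girl and Boy is Good']
processed = [preprocess(text) for text in texts]
print(processed)  # ['girl good', 'boy good', 'girl boy good']
```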

Now, let's take the term 'girl' as an example to calculate its TF and IDF values:

  • Term Frequency (TF):

    • 'girl good': Term 'girl' appears once, and the total number of terms in the document is 2.

    • TF('girl', 'girl good') = 1 / 2 = 0.5

    • 'girl boy good': Term 'girl' appears once, and the total number of terms in the document is 3.

    • TF('girl', 'girl boy good') = 1 / 3 ≈ 0.333

  • Inverse Document Frequency (IDF):

    • The term 'girl' appears in two documents out of the three in the collection.

    • IDF('girl') = log10(3 / 2) ≈ 0.176 (using the base-10 logarithm)

Finally, we can calculate the TF-IDF value for the term 'girl' in each document by multiplying its TF and IDF values:

  • TF-IDF('girl', 'girl good') = TF('girl', 'girl good') * IDF('girl') ≈ 0.5 * 0.176 ≈ 0.088

  • TF-IDF('girl', 'girl boy good') = TF('girl', 'girl boy good') * IDF('girl') ≈ 0.333 * 0.176 ≈ 0.059

We repeat this calculation for each term in the processed texts to obtain the TF-IDF values.
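To see the result of repeating the calculation, here is a small sketch that applies the same formulas to every term in the processed texts. It assumes the base-10 logarithm used in the worked example and rounds to three decimals; the helper name tf_idf_table is illustrative.

```python
import math
from collections import Counter

processed = ['girl good', 'boy good', 'girl boy good']

def tf_idf_table(texts):
    docs = [text.split() for text in texts]
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in docs for term in set(tokens))
    table = []
    for tokens in docs:
        counts = Counter(tokens)
        # TF (count / document length) times IDF (log10 of n_docs / df).
        row = {
            term: round(counts[term] / len(tokens) * math.log10(n_docs / df), 3)
            for term, df in doc_freq.items()
        }
        table.append(row)
    return table

for text, row in zip(processed, tf_idf_table(processed)):
    print(text, row)
```

Note that 'good' scores 0 in every document: it appears in all three texts, so its IDF is log(3 / 3) = 0.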

Now, let's implement the TF-IDF calculation in Python, using NLTK for preprocessing and scikit-learn's TfidfVectorizer for the weighting.

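import nltk
from pprint import pprint
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize import word_tokenize
from nltk.stem import SnowballStemmer
from nltk.corpus import stopwords

# One-time downloads of the tokenizer models and the stopword list
# (newer NLTK releases may also need 'punkt_tab').
nltk.download('punkt')
nltk.download('stopwords')

texts = ['Girl is good', 'Boy is good', 'Girl and Boy is Good']

# Build the stemmer and the stopword set once, outside the loop.
stemmer = SnowballStemmer('english')
stop_words = set(stopwords.words('english'))

processed = []
for text in texts:
    # Lowercase and tokenize.
    tokens = word_tokenize(text.lower())
    # Keep alphabetic tokens that are not stopwords.
    tokens = [token for token in tokens if token.isalpha() and token not in stop_words]
    # Stem the remaining tokens and rebuild the text.
    preprocessed = [stemmer.stem(token) for token in tokens]
    processed.append(" ".join(preprocessed))

pprint(processed)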

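import pandas as pd

# use_idf=False yields term frequencies only (L2-normalized per document by default).
Tf = TfidfVectorizer(use_idf=False)
# use_idf=True also learns the IDF weights, exposed as idf.idf_.
idf = TfidfVectorizer(use_idf=True)
term_freq = Tf.fit_transform(processed)
inver_doc_freq = idf.fit_transform(processed)

Tf_m = term_freq.toarray()  # shape: (documents, terms)
idf_m = idf.idf_            # shape: (terms,)
# Broadcasting scales each term column by that term's IDF weight.
tfidf = Tf_m * idf_m
df = pd.DataFrame(tfidf, columns=Tf.get_feature_names_out())
print(df)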

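The code snippet above demonstrates the calculation of TF-IDF values for the given example texts. First, we lowercase and tokenize each text using the word_tokenize function from NLTK. Next, we drop non-alphabetic tokens and English stopwords, then stem the remaining tokens with SnowballStemmer to reduce words to their root form, and join them back into a string. Note that the code removes stopwords before stemming, the reverse of the order in the walkthrough above; for these example texts the result is the same.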

After preprocessing, we use scikit-learn's TfidfVectorizer. We create two instances: Tf with use_idf=False, which computes term frequencies only, and idf with use_idf=True, which also learns the inverse document frequency weights. With default settings both instances L2-normalize each document vector, so the term frequencies are not the plain count ratios from the worked example.

We fit and transform the processed texts with Tf and idf respectively. We then obtain the TF-IDF matrix by multiplying each column of the term frequency matrix (Tf_m) by the corresponding IDF weight from idf.idf_ (idf_m); since idf_m is a one-dimensional array, NumPy broadcasting applies it across all rows. Finally, we create a pandas DataFrame to display the TF-IDF values for each term.
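The values scikit-learn produces will not match the hand calculation, because TfidfVectorizer does not use the plain formula by default. As documented for scikit-learn, its default settings use raw counts for TF, a smoothed natural-log IDF of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each document vector. The sketch below reproduces those defaults in plain Python, so it should match what idf.fit_transform(processed) returns; the function name sklearn_style_tfidf is illustrative and not part of any library, and whitespace splitting is assumed to be equivalent to the vectorizer's tokenizer for these texts.

```python
import math
from collections import Counter

processed = ['girl good', 'boy good', 'girl boy good']

def sklearn_style_tfidf(texts):
    docs = [text.split() for text in texts]
    n_docs = len(docs)
    doc_freq = Counter(term for tokens in docs for term in set(tokens))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    rows = []
    for tokens in docs:
        counts = Counter(tokens)
        weights = {t: counts[t] * idf[t] for t in counts}       # raw count * IDF
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})  # L2 normalization
    return rows

rows = sklearn_style_tfidf(processed)
print({term: round(w, 3) for term, w in rows[0].items()})  # {'girl': 0.79, 'good': 0.613}
```

Because of the "+ 1" in the smoothed IDF, a term that appears in every document (here 'good') keeps a non-zero weight instead of dropping to 0 as in the hand calculation.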

Conclusion:

TF-IDF is a simple and effective technique for weighting terms in textual data. It quantifies how relevant a term is to a document relative to the rest of the collection and is widely used in NLP applications such as search and text classification. In this blog post, we have explored the mathematical concept of TF-IDF and demonstrated its implementation using Python, NLTK, and scikit-learn. With TF-IDF weights in hand, documents can be compared, ranked, or passed to downstream models as numeric features.

Support Aswin Lal by becoming a sponsor. Any amount is appreciated!