Introduction: TF-IDF (Term Frequency-Inverse Document Frequency) is a popular technique in natural language processing for representing how important a term is to a document within a collection of documents. In this blog post, we will explore the concept of TF-IDF and demonstrate its implementation in Python using the NLTK and scikit-learn libraries.
TF-IDF Calculation:
TF-IDF is calculated using the following mathematical equation:
TF-IDF = TF * IDF
Where:
TF (Term Frequency) measures the frequency of a term in a document. It is calculated as the ratio of the number of times a term appears in a document to the total number of terms in the document.
IDF (Inverse Document Frequency) measures the importance of a term in the entire collection of documents. It is calculated as the logarithm of the ratio of the total number of documents to the number of documents containing the term.
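In plain Python, these two definitions look like the following minimal sketch (as in the worked example below, we use a base-10 logarithm; the choice of base only rescales the values):
import math

def term_frequency(term, doc_tokens):
    # Fraction of the document's tokens that equal the term
    return doc_tokens.count(term) / len(doc_tokens)

def inverse_document_frequency(term, docs_tokens):
    # Log of (total documents / documents containing the term)
    containing = sum(1 for doc in docs_tokens if term in doc)
    return math.log10(len(docs_tokens) / containing)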
Let's consider the following example texts:
Texts = ['Girl is good', 'Boy is good', 'Girl and Boy is Good']
To calculate the TF-IDF values for the terms in these texts, we need to perform the following steps:
Tokenization:
- Text: 'Girl is good'
- Tokens (lowercased): ['girl', 'is', 'good']
Stopword Removal:
- Stopword-Free Tokens: ['girl', 'good']
Stemming:
- Stemmed Tokens: ['girl', 'good']
Rejoining:
- Preprocessed Text: 'girl good'
We repeat the above steps for each text in the collection. The processed texts become:
Processed Texts = ['girl good', 'boy good', 'girl boy good']
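As a quick illustration, here is how these steps play out for a single text with NLTK (a minimal sketch; it assumes the punkt and stopwords resources have been downloaded, as in the complete script later in this post):
from nltk.tokenize import word_tokenize
from nltk.stem import SnowballStemmer
from nltk.corpus import stopwords

stemmer = SnowballStemmer('english')
stop_words = set(stopwords.words('english'))

tokens = word_tokenize('Girl is good'.lower())       # ['girl', 'is', 'good']
tokens = [t for t in tokens if t not in stop_words]  # ['girl', 'good']
stems = [stemmer.stem(t) for t in tokens]            # ['girl', 'good']
print(' '.join(stems))                               # girl good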
Now, let's take the term 'girl' as an example to calculate its TF and IDF values:
Term Frequency (TF):
'girl good': Term 'girl' appears once, and the total number of terms in the document is 2.
TF('girl', 'girl good') = 1 / 2 = 0.5
'girl boy good': Term 'girl' appears once, and the total number of terms in the document is 3.
TF('girl', 'girl boy good') = 1 / 3 ≈ 0.333
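Both TF values are easy to verify directly:
doc = 'girl good'.split()
print(doc.count('girl') / len(doc))   # 0.5
doc = 'girl boy good'.split()
print(doc.count('girl') / len(doc))   # 0.3333...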
Inverse Document Frequency (IDF):
The term 'girl' appears in two documents out of the three in the collection.
IDF('girl') = log(3 / 2) ≈ 0.176 (this post uses a base-10 logarithm; any base works, as long as it is applied consistently)
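This value can be checked in one line:
import math
print(math.log10(3 / 2))   # 0.1760912...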
Finally, we can calculate the TF-IDF value for the term 'girl' in each document by multiplying its TF and IDF values:
TF-IDF('girl', 'girl good') = TF('girl', 'girl good') * IDF('girl') ≈ 0.5 * 0.176 ≈ 0.088
TF-IDF('girl', 'girl boy good') = TF('girl', 'girl boy good') * IDF('girl') ≈ 0.333 * 0.176 ≈ 0.059
We repeat this calculation for each term in the processed texts to obtain the TF-IDF values.
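Putting the two together, a short self-contained snippet reproduces the hand-calculated TF-IDF values for 'girl' in every document:
import math

processed = ['girl good', 'boy good', 'girl boy good']
docs = [text.split() for text in processed]

# IDF for 'girl': it appears in 2 of the 3 documents
idf_girl = math.log10(len(docs) / sum(1 for doc in docs if 'girl' in doc))

for doc in docs:
    tf_girl = doc.count('girl') / len(doc)
    print(doc, round(tf_girl * idf_girl, 3))
# ['girl', 'good'] 0.088
# ['boy', 'good'] 0.0
# ['girl', 'boy', 'good'] 0.059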
Now, let's implement the TF-IDF calculation in Python, using NLTK for preprocessing and scikit-learn's TfidfVectorizer for the TF-IDF computation.
import nltk
import pandas as pd
from pprint import pprint
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize import word_tokenize
from nltk.stem import SnowballStemmer
from nltk.corpus import stopwords

nltk.download('punkt')      # tokenizer model
nltk.download('stopwords')  # stopword lists

texts = ['Girl is good', 'Boy is good', 'Girl and Boy is Good']

stemmer = SnowballStemmer('english')
stop_words = set(stopwords.words('english'))

processed = []
for text in texts:
    tokens = word_tokenize(text.lower())
    tokens = [token for token in tokens if token.isalpha() and token not in stop_words]
    preprocessed = [stemmer.stem(token) for token in tokens]
    processed.append(" ".join(preprocessed))
pprint(processed)

Tf = TfidfVectorizer(use_idf=False)   # term frequencies only
idf = TfidfVectorizer(use_idf=True)   # fitted to obtain the IDF weights
term_freq = Tf.fit_transform(processed)
idf.fit(processed)

Tf_m = term_freq.toarray()   # term-frequency matrix (one row per document)
idf_m = idf.idf_             # one IDF weight per term
tfidf = Tf_m * idf_m         # multiply each column by its IDF weight
df = pd.DataFrame(tfidf, columns=Tf.get_feature_names_out())
print(df)
The code snippet above demonstrates the calculation of TF-IDF values for the given example texts. First, we tokenize each lowercased text using the word_tokenize function from NLTK. Next, we drop non-alphabetic tokens and stopwords, then apply the SnowballStemmer to reduce the remaining words to their root form, and rejoin the stems into a processed string.
After preprocessing, we create two instances of TfidfVectorizer: Tf with use_idf=False, which computes only (length-normalized) term frequencies, and idf with use_idf=True, which we fit so that its idf_ attribute exposes the inverse document frequency of each term. We fit and transform the processed texts with Tf, fit idf, and then obtain the TF-IDF matrix by multiplying each column of the term-frequency matrix (Tf_m) by the corresponding IDF weight (idf_m). Finally, we create a pandas DataFrame to display the TF-IDF values for each term. Note that these values will not exactly match the hand calculation above: by default, scikit-learn L2-normalizes each document vector and uses a smoothed natural-log IDF, log((1 + n) / (1 + df)) + 1, rather than the plain base-10 logarithm used in our worked example.
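In practice, TfidfVectorizer is usually used in a single step rather than split into two instances; here is a minimal sketch of that standard usage on the same processed texts:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

processed = ['girl good', 'boy good', 'girl boy good']

vectorizer = TfidfVectorizer()                # use_idf=True is the default
matrix = vectorizer.fit_transform(processed)  # sparse document-term matrix
print(pd.DataFrame(matrix.toarray(), columns=vectorizer.get_feature_names_out()))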
Conclusion:
TF-IDF is a powerful technique for representing the importance of terms in textual data. It allows us to quantify the relevance of terms within a collection of documents and is widely used in various NLP applications. In this blog post, we have explored the mathematical concept of TF-IDF and demonstrated its implementation in Python using the NLTK and scikit-learn libraries. By understanding and applying TF-IDF, we can extract valuable insights and build robust NLP models.