Sentiment Analysis with Naive Bayes Classifier using NLTK and Scikit-learn

Introduction:

In this blog post, we will explore how to perform sentiment analysis on movie reviews using a Naive Bayes classifier. Sentiment analysis is a popular task in natural language processing: given a piece of text, decide which sentiment or opinion it expresses (here, whether a review is positive or negative). We will use the NLTK (Natural Language Toolkit) library in Python to preprocess the reviews and the Scikit-learn library to vectorize them and train the classifier.

Step 1: Preprocessing the Data

To begin, we import the necessary modules and define a preprocessing function. The preprocessing function performs the following steps:

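Tokenization: The text is split into sentences with NLTK's sent_tokenize, and each sentence is split into individual words with word_tokenize.
Filtering: Tokens that are not purely alphabetic, such as punctuation and numbers, are dropped.
Lemmatization: Each remaining word is lowercased and reduced to its base or dictionary form using NLTK's WordNetLemmatizer. Because no part-of-speech tag is passed, the lemmatizer treats every word as a noun, so "movies" becomes "movie" while verb forms such as "running" are left unchanged.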

Snippet:

from nltk.corpus import movie_reviews
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import WordNetLemmatizer

def preprocess(text):
    """Return text as lowercase, lemmatized, alphabetic-only words joined by spaces."""
    lemmatizer = WordNetLemmatizer()
    sentences = sent_tokenize(text)
    filtered_sentences = []
    for sentence in sentences:
        # Keep alphabetic tokens only, then lowercase and lemmatize each one.
        words = [lemmatizer.lemmatize(word.lower()) for word in word_tokenize(sentence) if word.isalpha()]
        filtered_sentences.append(" ".join(words))
    return " ".join(filtered_sentences)

Step 2: Loading and Preparing the Data

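Next, we load NLTK's movie_reviews corpus, which contains 1,000 positive and 1,000 negative reviews. We label positive reviews 1 and negative reviews 0, build separate lists for training and testing (the test list takes the first 500 reviews of each class), and run every review through the preprocess function defined above. The two lists must not share any reviews, or the accuracy measured in Step 5 will be inflated.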
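
The split can be sketched on toy data as follows. The helper name split_by_class and its n_test parameter are illustrative assumptions, not part of NLTK or Scikit-learn; the point is only that the training and test slices are disjoint.

```python
def split_by_class(positive, negative, n_test):
    """Hold out the first n_test items of each class for testing.

    `positive` and `negative` are lists of (text, label) pairs. The
    remaining items of each class form the training set, so the two
    sets never share a review.
    """
    test = positive[:n_test] + negative[:n_test]
    train = positive[n_test:] + negative[n_test:]
    return train, test


# Toy stand-ins for the (raw_review, label) pairs built from movie_reviews.
positive = [(f"good film {i}", 1) for i in range(4)]
negative = [(f"bad film {i}", 0) for i in range(4)]

train, test = split_by_class(positive, negative, n_test=1)

# Separate texts from labels, as the vectorizer and classifier expect.
train_texts = [text for text, label in train]
train_labels = [label for text, label in train]
test_texts = [text for text, label in test]
test_labels = [label for text, label in test]

# Sanity check: no review appears in both sets.
assert not set(train_texts) & set(test_texts)
print(len(train_texts), len(test_texts))
```

With the real corpus, the same idea applies with n_test=500 and the lists of (raw review, label) pairs read from movie_reviews.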

Step 3: Vectorization

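In this step, we use the TfidfVectorizer from Scikit-learn to convert the preprocessed text into numerical features. The TfidfVectorizer computes the Term Frequency-Inverse Document Frequency (TF-IDF) representation of the text: a word's weight in a review grows with how often it appears in that review and shrinks with the number of reviews in the corpus that contain it, so words that occur almost everywhere contribute little. The vectorizer is fitted on the training reviews only, and the test reviews are then transformed with the vocabulary and weights learned from the training set.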

Snippet:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
vectorized_train_reviews = vectorizer.fit_transform(processed_train_reviews)
vectorized_test_reviews = vectorizer.transform(processed_test_reviews)
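
To make the weighting concrete, here is a minimal pure-Python sketch of the calculation. It assumes the smoothed IDF formula, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization, which is what Scikit-learn's documentation describes for the vectorizer's default settings. The function name tfidf and the plain whitespace tokenization are simplifications for illustration, so the numbers are not guaranteed to match TfidfVectorizer on real text.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Assumes raw term counts, the smoothed IDF ln((1 + n) / (1 + df)) + 1,
    and L2 normalization of each document vector.
    """
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts within this document
        weights = {term: count * idf[term] for term, count in tf.items()}
        # Fall back to 1.0 so an empty document does not divide by zero.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


docs = ["great movie great cast", "boring movie", "great fun"]
for vector in tfidf(docs):
    print({term: round(weight, 3) for term, weight in vector.items()})
```

In the second toy document, "boring" receives a higher weight than "movie" because "movie" also appears in another document.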

Step 4: Training the Naive Bayes Classifier

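We create an instance of the MultinomialNB classifier from Scikit-learn and fit it to the vectorized training data together with the training labels. Multinomial Naive Bayes models each class by how often each word occurs in it, which makes it a popular choice for text classification tasks, including sentiment analysis. The model is defined for integer word counts, but the Scikit-learn documentation notes that fractional values such as TF-IDF weights also work in practice.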

Snippet:

from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB()
nb.fit(vectorized_train_reviews, train_classes_reviews)
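
For intuition about what the classifier learns, the following is a simplified pure-Python sketch of multinomial Naive Bayes on raw word counts. It is not Scikit-learn's implementation: the names train_nb and predict_nb are hypothetical, tokenization is a whitespace split, and alpha=1.0 is additive (Laplace) smoothing chosen to mirror MultinomialNB's default.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Fit a multinomial Naive Bayes model on whitespace-tokenized docs."""
    vocab = {word for doc in docs for word in doc.split()}
    class_doc_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())

    model = {}
    for label, n_docs in class_doc_counts.items():
        total = sum(word_counts[label].values())
        denom = total + alpha * len(vocab)
        model[label] = {
            "log_prior": math.log(n_docs / len(docs)),
            # Additive smoothing keeps unseen words from zeroing out a class.
            "log_likelihood": {
                w: math.log((word_counts[label][w] + alpha) / denom) for w in vocab
            },
        }
    return model


def predict_nb(model, doc):
    """Return the label with the highest log posterior for doc."""
    scores = {}
    for label, params in model.items():
        score = params["log_prior"]
        for word in doc.split():
            # Words outside the training vocabulary are ignored.
            score += params["log_likelihood"].get(word, 0.0)
        scores[label] = score
    return max(scores, key=scores.get)


train_docs = ["great fun movie", "wonderful cast great plot", "boring plot", "dull boring movie"]
train_labels = [1, 1, 0, 0]
model = train_nb(train_docs, train_labels)
print(predict_nb(model, "great cast"))       # prints 1
print(predict_nb(model, "boring and dull"))  # prints 0
```

Working in log space avoids the numerical underflow that multiplying many small probabilities would cause on full-length reviews.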

Step 5: Evaluating the Model

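Finally, we use the trained classifier to predict the sentiment of the test reviews and compare the predicted labels with the true labels. Accuracy is the fraction of test reviews whose predicted label matches the true one; because the test set is balanced (500 positive and 500 negative reviews), an accuracy of 0.5 corresponds to random guessing. We calculate the accuracy of the classifier and print it.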

Snippet:

pred_class = nb.predict(vectorized_test_reviews)

accuracy = (pred_class == test_classes_reviews).mean()
print("Accuracy:", accuracy)
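
Accuracy alone does not show which kind of mistake the classifier makes. As an optional extension, the sketch below counts true and false positives and negatives next to the accuracy. The function name evaluate is hypothetical, and the labels are made up for illustration; they are not results from the model above.

```python
def evaluate(true_labels, predicted_labels, positive=1):
    """Return accuracy plus confusion-matrix counts for a binary task."""
    pairs = list(zip(true_labels, predicted_labels))
    if not pairs:
        raise ValueError("need at least one label pair")
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return {"accuracy": (tp + tn) / len(pairs), "tp": tp, "tn": tn, "fp": fp, "fn": fn}


# Made-up labels for illustration only.
true_labels = [1, 1, 1, 0, 0, 0]
predicted_labels = [1, 1, 0, 0, 0, 1]
result = evaluate(true_labels, predicted_labels)
print(result)
```

The same function could be called with test_classes_reviews and pred_class from the snippet above.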

Entire Code:

from nltk.corpus import movie_reviews
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

def preprocess(text):
    """Return text as lowercase, lemmatized, alphabetic-only words joined by spaces."""
    lemmatizer = WordNetLemmatizer()
    sentences = sent_tokenize(text)
    filtered_sentences = []
    for sentence in sentences:
        # Keep alphabetic tokens only, then lowercase and lemmatize each one.
        words = [lemmatizer.lemmatize(word.lower()) for word in word_tokenize(sentence) if word.isalpha()]
        filtered_sentences.append(" ".join(words))
    return " ".join(filtered_sentences)

positive_reviews = [(movie_reviews.raw(fileid), 1) for fileid in movie_reviews.fileids(categories=['pos'])]
negative_reviews = [(movie_reviews.raw(fileid), 0) for fileid in movie_reviews.fileids(categories=['neg'])]

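# Hold out the first 500 reviews of each class for testing and train on the
# remaining 500, so that no review appears in both sets. (A slice such as
# [:-1500:-1] on a 1,000-item list returns every item, reversed, which would
# leak the test reviews into the training set.)
all_train_reviews = positive_reviews[500:] + negative_reviews[500:]
all_test_reviews = positive_reviews[:500] + negative_reviews[:500]

# Separate the raw texts from their labels.
all_train_reviews_only = [review for review, tag in all_train_reviews]
train_classes_reviews = [tag for review, tag in all_train_reviews]
all_test_reviews_only = [review for review, tag in all_test_reviews]
test_classes_reviews = [tag for review, tag in all_test_reviews]

# Apply the same preprocessing to both sets.
processed_train_reviews = [preprocess(review) for review in all_train_reviews_only]
processed_test_reviews = [preprocess(review) for review in all_test_reviews_only]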

vectorizer = TfidfVectorizer()
nb = MultinomialNB()
vectorized_train_reviews = vectorizer.fit_transform(processed_train_reviews)
vectorized_test_reviews = vectorizer.transform(processed_test_reviews)

nb.fit(vectorized_train_reviews, train_classes_reviews)
pred_class = nb.predict(vectorized_test_reviews)

accuracy = (pred_class == test_classes_reviews).mean()
print("Accuracy:", accuracy)

Conclusion:

In this blog post, we have learned how to perform sentiment analysis on movie reviews using the Naive Bayes classifier. We used NLTK for text preprocessing and Scikit-learn for vectorization and classification. Sentiment analysis is a powerful technique that has various applications in areas like customer feedback analysis, social media sentiment analysis, and more.

By understanding and implementing sentiment analysis techniques, you can gain valuable insights from text data and make informed decisions based on the sentiment expressed in the text.

Remember, programming and natural language processing can be exciting fields to explore, and sentiment analysis is just one of the many interesting applications. Keep exploring and learning, and you'll be amazed at the possibilities that programming offers!

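Note: Before running the code, install nltk and scikit-learn and download the NLTK resources the code relies on: the movie_reviews corpus, the punkt tokenizer models (used by sent_tokenize and word_tokenize), and wordnet (used by WordNetLemmatizer). Each resource can be fetched once with nltk.download, for example nltk.download("movie_reviews"). Recent NLTK releases may ask for punkt_tab in place of punkt.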
