Named Entity Recognition (NER) using NLTK in Python Using Grammar


Introduction:

Named Entity Recognition (NER) is a core task in Natural Language Processing (NLP): identifying named entities in text and classifying them. In this guide, we explore the concept of NER and implement a simple grammar-based version of it with the NLTK library in Python. We walk through a step-by-step code example that labels the chunks matched by the grammar with BIO (Begin, Inside, Outside) tags.

Understanding Named Entity Recognition (NER):

Named entities are specific elements of a text that carry important information, such as the names of people, organizations, and locations, as well as dates. Extracting and classifying them supports NLP applications such as information extraction, question answering, and text summarization.

NER Implementation using NLTK:

To perform Named Entity Recognition using NLTK in Python, we will follow these steps:

  1. Tokenization:

    • Tokenization is the process of splitting a text into individual words or tokens. NLTK provides the word_tokenize() function to tokenize a sentence into words.

  2. Part-of-Speech (POS) Tagging:

    • POS tagging assigns a grammatical tag (noun, verb, adjective, and so on) to each word in a sentence. NLTK provides the pos_tag() function, which uses the Penn Treebank tagset by default.

  3. Chunking and NER Tagging:

    • Chunking is the process of grouping words together based on their POS tags. We define a grammar whose rules are regular-expression-like patterns over POS tags, and NLTK's RegexpParser class builds a chunk parser from that grammar.

    • Our grammar defines a single chunk type, NP, with rules that match noun phrases (an optional determiner or possessive pronoun, any adjectives, then one or more nouns) and runs of proper nouns (NNP). Because the rules look only at POS tags, the chunks are an approximation of named entities: common noun phrases are captured too, and no entity type (person, organization, location) is assigned.

    • The chunk parser is then applied to the POS-tagged words and returns a tree in which each matched chunk is an NP subtree.

  4. BIO Tagging:

    • BIO (Begin, Inside, Outside) tagging is a popular method to label the words within a named entity. It assigns a tag of "B" (Begin) to the first word of a named entity, "I" (Inside) to subsequent words of the entity, and "O" (Outside) to words that are not part of any named entity.

    • We iterate over the named entity tree and generate the BIO tags accordingly.
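The chunking step can be tried in isolation, without downloading any tokenizer or tagger models. The minimal sketch below feeds RegexpParser a hand-written list of (word, tag) pairs standing in for pos_tag() output. The sentence, its tags, and the two-rule grammar are illustrative assumptions, not output from a real tagger.

```python
from nltk.chunk import RegexpParser
from nltk.tree import Tree

# Hand-tagged tokens (an assumption for illustration; pos_tag() would
# normally produce these pairs using Penn Treebank tags).
tagged = [
    ("Barack", "NNP"), ("Obama", "NNP"), ("visited", "VBD"),
    ("the", "DT"), ("old", "JJ"), ("library", "NN"), (".", "."),
]

# Two rules, both producing NP chunks.
grammar = r"""
    NP: {<DT|PRP\$>?<JJ>*<NN>+}   # determiner/possessive, adjectives, nouns
        {<NNP>+}                  # runs of proper nouns
"""
parser = RegexpParser(grammar)
tree = parser.parse(tagged)

# Chunks are Tree children of the root; unchunked tokens stay plain tuples.
chunks = [
    " ".join(word for word, _tag in child.leaves())
    for child in tree
    if isinstance(child, Tree)
]
print(chunks)
```

Both "Barack Obama" and "the old library" come back as NP chunks, which shows why a purely tag-based grammar only approximates named entities.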

Let's now dive into the code example that demonstrates these steps:

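import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
from nltk.chunk import RegexpParser

# word_tokenize() and pos_tag() need NLTK's tokenizer and tagger models,
# which are fetched once with nltk.download(). The resource names depend on
# the NLTK version (for example "punkt" and "averaged_perceptron_tagger" in
# older releases, "punkt_tab" and "averaged_perceptron_tagger_eng" in newer ones).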

def bio_tag(sentence):
    words = word_tokenize(sentence)
    tagged_words = pos_tag(words)
    bio_tagged = []

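    # Define the chunk grammar; every rule below produces an NP chunk.
    # pos_tag() uses Penn Treebank tags, where possessive pronouns are PRP$.
    grammar = r"""
        NP: {<DT|PRP\$>?<JJ>*<NN>+}   # optional determiner/possessive, adjectives, nouns
            {<NNP>+}                  # one or more proper nouns
    """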
    cp = RegexpParser(grammar)

    # Perform named entity recognition
    named_entities = cp.parse(tagged_words)
    print(named_entities)

    # Generate BIO tags: a chunk subtree gets "B" for its first word and "I"
    # for the rest; a token outside any chunk is a plain tuple and gets "O"
    for subtree in named_entities:
        if isinstance(subtree, nltk.tree.Tree):
            bio_tagged.extend(["B"] + ["I"] * (len(subtree) - 1))
        else:
            bio_tagged.append("O")

    print(list(zip(words, bio_tagged)))

# User input
user_input = input("Enter a sentence: ")
bio_tag(user_input)
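
bio_tag() prints its result as a list of (word, tag) pairs. As a follow-up sketch, the helper below shows one way such pairs could be folded back into phrases. Note that bio_to_spans is a hypothetical name that is not part of NLTK or of the code above, and the sample pairs are hand-written for illustration.

```python
def bio_to_spans(pairs):
    """Group (word, tag) pairs into phrases; tag is "B", "I", or "O"."""
    spans = []
    current = []
    for word, tag in pairs:
        if tag == "B":
            # A new chunk starts: close the previous one, if any.
            if current:
                spans.append(" ".join(current))
            current = [word]
        elif tag == "I" and current:
            # Continue the chunk that is currently open.
            current.append(word)
        else:
            # "O" (or a stray "I" with no open chunk) ends the current chunk.
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


# Hand-written sample in the shape that bio_tag() prints.
pairs = [("Barack", "B"), ("Obama", "I"), ("visited", "O"), ("Paris", "B"), (".", "O")]
print(bio_to_spans(pairs))
```

Using "B" to mark the start of every chunk is what lets two adjacent chunks be told apart when they are decoded this way.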

Explanation of the Code:

The bio_tag() code above implements grammar-based entity tagging with NLTK in Python. Let's walk through each step:

  1. We import the required modules from NLTK: nltk, word_tokenize, pos_tag, and RegexpParser.

  2. The bio_tag() function takes a sentence as input and performs the NER process.

  3. We tokenize the input sentence using word_tokenize() and assign POS tags to the words using pos_tag().

  4. Next, we define the chunk grammar. Its rules are patterns over POS tags that match noun phrases and runs of proper nouns (NNP), and every match is labeled as an NP chunk.

  5. We create a chunk parser cp using RegexpParser and pass the grammar pattern as an argument.

  6. The chunk parser is applied to the POS-tagged words and returns a tree whose NP subtrees are the candidate named entities.

  7. We iterate over the children of the tree and generate a BIO tag for each word. If a child is a chunk subtree, its first word gets the "B" (Begin) tag and its subsequent words get "I" (Inside) tags. If a child is a single token outside any chunk, it gets the "O" (Outside) tag.

  8. The function prints the chunk tree right after parsing and, at the end, the words paired with their BIO tags.
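
As an alternative to the manual loop in step 7, NLTK ships a helper, tree2conlltags(), that converts a chunk tree into (word, POS tag, IOB tag) triples and keeps the chunk label in the tag (B-NP, I-NP). The sketch below applies it to a hand-built tree, which is an assumption made for illustration so that no tagger models are needed.

```python
from nltk.chunk import tree2conlltags
from nltk.tree import Tree

# Hand-built chunk tree (an assumption for illustration), shaped like the
# output of RegexpParser.parse().
tree = Tree("S", [
    Tree("NP", [("Barack", "NNP"), ("Obama", "NNP")]),
    ("visited", "VBD"),
    Tree("NP", [("Paris", "NNP")]),
    (".", "."),
])

# Each triple is (word, POS tag, IOB tag); the IOB tag carries the chunk label.
triples = tree2conlltags(tree)
for word, pos, iob in triples:
    print(word, pos, iob)
```

Keeping the chunk label in the tag matters once a grammar defines more than one chunk type, since bare "B" and "I" tags cannot distinguish between them.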

Conclusion:

In this guide, we explored Named Entity Recognition (NER) and a grammar-based implementation of it using NLTK in Python. NER is a fundamental NLP task that helps extract structured information from text. The step-by-step example shows how tokenization, POS tagging, chunking, and BIO tagging fit together, and it can serve as a starting point for entity tagging in your own NLP projects.

Did you find this article valuable?

Support Aswin Lal by becoming a sponsor. Any amount is appreciated!