Building an SMT Language Translation Model Using Python and NLTK

A statistical machine translator using IBM Model 1 (English -> Spanish)

Language translation plays a crucial role in bridging communication gaps between different cultures and communities. In this post, we will build a simple statistical machine translation (SMT) model using Python and the Natural Language Toolkit (NLTK). We will load a parallel corpus of English and Spanish sentences, preprocess the data, train an IBM Model 1 translation model, and write a function that interactively translates English sentences into Spanish word by word. Let's dive into the code and see how it all works.

Importing the Required Libraries:

To begin, we import the necessary libraries. We import pandas as pd to read the parallel corpus from a CSV file into a data frame, the re library for text cleaning, and nltk for language processing. From nltk.translate we also import AlignedSent, which holds one sentence pair, and IBMModel1, which trains the translation model.

import pandas as pd
import re
import nltk
from nltk.translate import AlignedSent, IBMModel1

Loading and Preprocessing the Data:

Next, we load the parallel corpus data from a CSV file using pandas. We extract the English and Spanish sentences into separate lists. Then, we define a function called clean_sentences that cleans the sentences by removing leading/trailing spaces, converting them to lowercase, and removing special characters and punctuation using regular expressions.

Download the CSV file from here; the code below expects it to have english and spanish columns.

df = pd.read_csv("./engspn.csv")
english_sentences = df['english'].tolist()
spanish_sentences = df['spanish'].tolist()

def clean_sentences(sentences):
    cleaned_sentences = []
    for sentence in sentences:
        sentence = sentence.strip()
        sentence = sentence.lower()
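        # Replace runs of non-word characters with a space; \W is Unicode-aware,
        # so accented Spanish letters such as ñ and á are kept
        sentence = re.sub(r"[\W_]+", " ", sentence)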
        cleaned_sentences.append(sentence.strip())
    return cleaned_sentences

cleaned_english_sentences = clean_sentences(english_sentences)
cleaned_spanish_sentences = clean_sentences(spanish_sentences)
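
To see what the cleaning step does to accented Spanish text, here is a small self-contained sketch. It compares an ASCII-only character class with a Unicode-aware one; the function names clean_ascii and clean_unicode and the sample sentence are illustrative assumptions, not part of the original code.

```python
import re
import unicodedata

def clean_ascii(sentence):
    # ASCII-only class: every accented letter counts as a "special character"
    sentence = sentence.strip().lower()
    return re.sub(r"[^a-zA-Z0-9]+", " ", sentence).strip()

def clean_unicode(sentence):
    # Normalize to composed form so that "ñ" is a single letter, then
    # replace runs of non-word characters (and underscores) with a space
    sentence = unicodedata.normalize("NFC", sentence).strip().lower()
    return re.sub(r"[\W_]+", " ", sentence).strip()

sample = "  ¿Dónde está el niño?  "
print(clean_ascii(sample))    # accented letters are replaced by spaces
print(clean_unicode(sample))  # accented letters survive
```

With the ASCII-only pattern, a word like niño is split into two fragments, so the model would learn those fragments instead of the real word. Keeping accented letters is usually the safer choice for a Spanish corpus.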

Training the Translation Model:

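To train the translation model, we use the NLTK library. We first create aligned sentence pairs from the cleaned English and Spanish sentences: each AlignedSent holds the word list of one English sentence together with the word list of its Spanish counterpart. Then, we train IBM Model 1 on these pairs. The second argument to IBMModel1 (10 here) is the number of training iterations.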

def train_translation_model(source_sentences, target_sentences):
    aligned_sentences = [AlignedSent(source.split(), target.split()) for source, target in zip(source_sentences, target_sentences)]
    ibm_model = IBMModel1(aligned_sentences, 10)
    return ibm_model

translation_model = train_translation_model(cleaned_english_sentences, cleaned_spanish_sentences)
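
To give a feel for what the training step estimates, here is a simplified pure-Python sketch of the expectation-maximization loop behind IBM Model 1. It is not NLTK's implementation: it leaves out the NULL token and the minimum-probability floor, and the function name train_ibm1_sketch and the three-sentence toy corpus are assumptions made for illustration.

```python
from collections import defaultdict

def train_ibm1_sketch(pairs, iterations=10):
    """Estimate t[e][s], roughly P(English word e | Spanish word s), with EM.

    `pairs` is a list of (english_tokens, spanish_tokens) tuples.
    """
    english_vocab = {e for en, _ in pairs for e in en}
    uniform = 1.0 / len(english_vocab)

    # Start with the same probability for every co-occurring word pair
    t = defaultdict(lambda: defaultdict(float))
    for en, es in pairs:
        for e in en:
            for s in es:
                t[e][s] = uniform

    for _ in range(iterations):
        count = defaultdict(lambda: defaultdict(float))
        total = defaultdict(float)
        # E-step: share each English word among the Spanish words in its pair
        for en, es in pairs:
            for e in en:
                norm = sum(t[e][s] for s in es)
                for s in es:
                    share = t[e][s] / norm
                    count[e][s] += share
                    total[s] += share
        # M-step: turn the expected counts back into probabilities
        for e, row in count.items():
            for s, c in row.items():
                t[e][s] = c / total[s]
    return t

toy_pairs = [
    ("the house".split(), "la casa".split()),
    ("a house".split(), "una casa".split()),
    ("the table".split(), "la mesa".split()),
]
t = train_ibm1_sketch(toy_pairs)
for word in ["the", "house", "a", "table"]:
    print(word, "->", max(t[word], key=t[word].get))
```

On this toy corpus the loop pairs house with casa and the with la, because those words keep appearing together across different sentences. The table is indexed with the English word first, which mirrors how the model above is queried as translation_table[source_word][target_word].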

Translating Input Sentences:

Finally, we define a function called translate_input that lets the user interactively enter English sentences to be translated into Spanish. The function reads the input, cleans it, splits it into words, and for each English word picks the Spanish word with the highest probability in the model's translation_table. Words that never appeared in the training data have no entries in the table and are skipped. The translated words are joined to form the output sentence, which is then displayed to the user. Because each word is translated independently, the output keeps the English word order.

def translate_input(ibm_model):
    while True:
        source_text = input("Enter the English sentence to translate (or 'q' to quit): ")
        if source_text.lower() == 'q':
            print("Quitting...")
            break
        # Clean the whole sentence first, then split it into words
        source_words = clean_sentences([source_text])[0].split()
        translated_words = []
        for source_word in source_words:
            max_prob = 0.0
            translated_word = None
            for target_word in ibm_model.translation_table[source_word]:
                prob = ibm_model.translation_table[source_word][target_word]
                if prob > max_prob:
                    max_prob = prob
                    translated_word = target_word
            if translated_word is not None:
                translated_words.append(translated_word)
        translated_text = ' '.join(translated_words)
        print("Translated text:", translated_text)
        print()

translate_input(translation_model)
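
The word-by-word lookup can also be written as a small non-interactive helper, which is easier to test than a loop around input(). The sketch below is a hedged variation, not the function above: the names best_translation and translate_words are hypothetical, the probabilities in toy_table are made up, and unlike the loop above it passes unknown words through unchanged instead of dropping them. It also ignores a None key, which NLTK's table can contain for the NULL token.

```python
def best_translation(candidates):
    """Return the highest-probability target word, ignoring the NULL token (None)."""
    real = {word: prob for word, prob in candidates.items() if word is not None}
    return max(real, key=real.get) if real else None

def translate_words(table, sentence):
    """Translate word by word using a nested mapping table[source][target] -> prob."""
    translated = []
    for word in sentence.lower().split():
        best = best_translation(table.get(word, {}))
        # Keep the original word when the table has nothing for it
        translated.append(best if best is not None else word)
    return " ".join(translated)

# Made-up probabilities, for illustration only
toy_table = {
    "the": {"la": 0.7, "casa": 0.2, None: 0.1},
    "house": {"casa": 0.8, "la": 0.2},
}
print(translate_words(toy_table, "The house rocks"))
```

Using table.get(word, {}) avoids inserting empty entries when the table is a defaultdict, as NLTK's translation_table is.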

Full Code:

import pandas as pd
import re
import nltk
from nltk.translate import AlignedSent, IBMModel1

# Load and Preprocess the Data
df = pd.read_csv("./engspn.csv")
english_sentences = df['english'].tolist()
spanish_sentences = df['spanish'].tolist()

def clean_sentences(sentences):
    cleaned_sentences = []
    for sentence in sentences:
        sentence = sentence.strip()
        sentence = sentence.lower()
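        # Replace runs of non-word characters with a space; \W is Unicode-aware,
        # so accented Spanish letters such as ñ and á are kept
        sentence = re.sub(r"[\W_]+", " ", sentence)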
        cleaned_sentences.append(sentence.strip())
    return cleaned_sentences

cleaned_english_sentences = clean_sentences(english_sentences)
cleaned_spanish_sentences = clean_sentences(spanish_sentences)

# Train the Translation Model
def train_translation_model(source_sentences, target_sentences):
    aligned_sentences = [AlignedSent(source.split(), target.split()) for source, target in zip(source_sentences, target_sentences)]
    ibm_model = IBMModel1(aligned_sentences, 10)
    return ibm_model

translation_model = train_translation_model(cleaned_english_sentences, cleaned_spanish_sentences)

# Translate Input Sentences
def translate_input(ibm_model):
    while True:
        source_text = input("Enter the English sentence to translate (or 'q' to quit): ")
        if source_text.lower() == 'q':
            print("Quitting...")
            break
        # Clean the whole sentence first, then split it into words
        source_words = clean_sentences([source_text])[0].split()
        translated_words = []
        for source_word in source_words:
            max_prob = 0.0
            translated_word = None
            for target_word in ibm_model.translation_table[source_word]:
                prob = ibm_model.translation_table[source_word][target_word]
                if prob > max_prob:
                    max_prob = prob
                    translated_word = target_word
            if translated_word is not None:
                translated_words.append(translated_word)
        translated_text = ' '.join(translated_words)
        print("Translated text:", translated_text)
        print()

translate_input(translation_model)

This is the complete code for building a language translation model using Python and NLTK.

To run the code, ensure that you have the required packages installed (for example, with pip install pandas nltk). Provide the path to your CSV file containing the parallel corpus, then run the script; the translate_input call at the end starts the interactive prompt for translating English sentences into Spanish.

Conclusion:

In this post, we have explored how to build a language translation model using Python and NLTK. We learned how to load and preprocess a parallel corpus, train an IBM Model 1 translation model, and create a translation function to interactively translate English sentences into Spanish. IBM Model 1 translates word by word, so it is best seen as a starting point: with Python and NLTK you can train the same kind of model on other language pairs and build on it from there.

Feel free to experiment with different parallel corpora, explore other translation models, or extend the functionality of the translation function. Happy translating!

If you have any questions or need further assistance, please let me know.
