Building an SMT Language Translation Model Using Python and NLTK
A Statistical Machine Translator Using IBM Model 1: English to Spanish
Language translation plays a crucial role in bridging communication gaps between different cultures and communities. In this blog, we will explore how to build a language translation model using Python and the Natural Language Toolkit (NLTK). We will use a parallel corpus containing English and Spanish sentences, preprocess the data, train an IBM Model 1 translation model, and create a translation function to interactively translate English sentences into Spanish. Let's dive into the code and learn how it all works.
Importing the Required Libraries:
To begin, we import the necessary libraries. We import pandas as pd to work with data frames and read the parallel corpus from a CSV file. Additionally, we import the re library for text cleaning and the nltk library for language processing tasks.
import pandas as pd
import re
import nltk
from nltk.translate import AlignedSent, IBMModel1
Loading and Preprocessing the Data:
Next, we load the parallel corpus from a CSV file using pandas and extract the English and Spanish sentences into separate lists. Then we define a function called clean_sentences that strips leading and trailing spaces, converts each sentence to lowercase, and replaces special characters and punctuation with spaces using a regular expression.
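Download the CSV file from here. The code below expects it to be saved as engspn.csv in the working directory, with columns named english and spanish.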
df = pd.read_csv("./engspn.csv")
english_sentences = df['english'].tolist()
spanish_sentences = df['spanish'].tolist()
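def clean_sentences(sentences):
    cleaned_sentences = []
    for sentence in sentences:
        sentence = sentence.strip()
        sentence = sentence.lower()
        # [\W_]+ matches runs of non-alphanumeric characters. Unlike [^a-zA-Z0-9]+,
        # it keeps accented letters such as "ñ" and "á", which Spanish needs.
        sentence = re.sub(r"[\W_]+", " ", sentence)
        cleaned_sentences.append(sentence.strip())
    return cleaned_sentences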
cleaned_english_sentences = clean_sentences(english_sentences)
cleaned_spanish_sentences = clean_sentences(spanish_sentences)
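To see what the cleaning pattern does to Spanish text, the sketch below compares an ASCII-only character class with a Unicode-aware one on a single sentence. The function names clean_ascii_only and clean_unicode_aware are hypothetical and exist only for this comparison; the NFC normalization step is an extra assumption that guards against accented letters stored as a base letter plus a combining mark.

```python
import re
import unicodedata


def clean_ascii_only(sentence):
    # ASCII-only class: every accented letter is treated as a separator.
    return re.sub(r"[^a-zA-Z0-9]+", " ", sentence.strip().lower()).strip()


def clean_unicode_aware(sentence):
    # Normalize to NFC so each accented letter is a single code point, then
    # replace runs of non-alphanumeric characters (and underscores) with a space.
    sentence = unicodedata.normalize("NFC", sentence.strip().lower())
    return re.sub(r"[\W_]+", " ", sentence).strip()


sample = "¿Dónde está el baño?"
print(clean_ascii_only(sample))     # d nde est el ba o
print(clean_unicode_aware(sample))  # dónde está el baño
```

With the ASCII-only class, "baño" turns into the two tokens "ba" and "o", so the model would never learn the real word. The Unicode-aware class keeps it intact while still dropping "¿" and "?".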
Training the Translation Model:
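To train the translation model, we use NLTK. We first pair each cleaned English sentence with its Spanish counterpart as an AlignedSent of word lists, then train IBM Model 1 on these pairs for 10 iterations of expectation-maximization (EM). IBM Model 1 learns word-level translation probabilities only; it does not model word order.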
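def train_translation_model(source_sentences, target_sentences):
    # AlignedSent(words, mots): the trained table is indexed as
    # translation_table[word][mot], here translation_table[english_word][spanish_word].
    aligned_sentences = [AlignedSent(source.split(), target.split()) for source, target in zip(source_sentences, target_sentences)]
    # The second argument is the number of EM training iterations.
    ibm_model = IBMModel1(aligned_sentences, 10)
    return ibm_model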
translation_model = train_translation_model(cleaned_english_sentences, cleaned_spanish_sentences)
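For readers curious about what the training call does internally, here is a minimal pure-Python sketch of IBM Model 1 training with EM on a three-sentence toy corpus. It is an illustration under simplifying assumptions, not NLTK's implementation: train_ibm1_sketch, best_target, and the toy sentences are all made up for this example. The sketch treats English as the source language, so its table is indexed as table[spanish_word][english_word].

```python
from collections import defaultdict


def train_ibm1_sketch(pairs, iterations=10):
    """Toy IBM Model 1 trainer (illustrative only, not NLTK's code).

    pairs: list of (source_tokens, target_tokens).
    Returns t where t[target_word][source_word] estimates
    P(target_word | source_word).
    """
    target_vocab = {w for _, tgt in pairs for w in tgt}
    uniform = 1.0 / len(target_vocab)
    # Start from a uniform table: every pairing is equally likely.
    t = defaultdict(lambda: defaultdict(lambda: uniform))

    for _ in range(iterations):
        count = defaultdict(lambda: defaultdict(float))
        total = defaultdict(float)
        # E-step: split each target word's mass over the source words
        # of its sentence (plus a NULL word, represented by None).
        for src, tgt in pairs:
            src_with_null = [None] + list(src)
            for tw in tgt:
                norm = sum(t[tw][sw] for sw in src_with_null)
                for sw in src_with_null:
                    share = t[tw][sw] / norm
                    count[tw][sw] += share
                    total[sw] += share
        # M-step: renormalize the expected counts per source word.
        t = defaultdict(lambda: defaultdict(float))
        for tw, row in count.items():
            for sw, c in row.items():
                t[tw][sw] = c / total[sw]
    return t


def best_target(t, source_word):
    # Target word with the highest P(target | source_word).
    return max(t, key=lambda tw: t[tw].get(source_word, 0.0))


toy_pairs = [
    (["green", "house"], ["casa", "verde"]),
    (["the", "house"], ["la", "casa"]),
    (["green", "book"], ["libro", "verde"]),
]
table = train_ibm1_sketch(toy_pairs, iterations=10)
for word in ["house", "green", "the", "book"]:
    print(word, "->", best_target(table, word))
```

Even though no sentence says which word pairs with which, "house" co-occurs with "casa" more consistently than with anything else, and each EM iteration shifts more probability toward that pairing.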
Translating Input Sentences:
Finally, we define a function called translate_input that lets the user interactively enter English sentences to translate into Spanish. The function reads the input, cleans and tokenizes it, and for each English word picks the Spanish word with the highest probability in the IBM Model 1 translation table. The chosen words are joined in the original English word order, so the result is a word-by-word gloss rather than a fluent translation. Words the model never saw during training are skipped.
def translate_input(ibm_model):
    while True:
        source_text = input("Enter the English sentence to translate (or 'q' to quit): ")
        if source_text.lower() == 'q':
            print("Quitting...")
            break
        # Clean the whole sentence first, then split it into words, so that
        # punctuation never produces empty or multi-word tokens.
        source_words = clean_sentences([source_text])[0].split()
        translated_words = []
        for source_word in source_words:
            max_prob = 0.0
            translated_word = None
            # Pick the Spanish word with the highest table entry for this English word.
            # Words not seen in training yield no candidates and are skipped.
            for target_word in ibm_model.translation_table[source_word]:
                prob = ibm_model.translation_table[source_word][target_word]
                if prob > max_prob:
                    max_prob = prob
                    translated_word = target_word
            if translated_word is not None:
                translated_words.append(translated_word)
        translated_text = ' '.join(translated_words)
        print("Translated text:", translated_text)
        print()
translate_input(translation_model)
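The lookup at the heart of the translation loop can be tried without training a model. The sketch below runs the same "highest-scoring candidate per word" idea on a small hand-written table. The function translate_words and the numbers in toy_table are hypothetical, chosen only to show the behavior. One design difference is assumed here: the None entry (the NULL word that IBM models use for "no counterpart") is dropped before taking the maximum, so a real candidate is preferred whenever one exists.

```python
def translate_words(table, source_words):
    """Word-by-word lookup: table[source_word] maps candidate words to scores."""
    translated = []
    for word in source_words:
        # Unknown words get an empty candidate set; the NULL entry is ignored.
        candidates = {t: p for t, p in table.get(word, {}).items() if t is not None}
        if candidates:
            translated.append(max(candidates, key=candidates.get))
    return translated


# Made-up scores for illustration, not output from a trained model.
toy_table = {
    "the": {"la": 0.6, "casa": 0.3, None: 0.1},
    "house": {"casa": 0.8, "la": 0.1},
    "is": {None: 0.9},
}
print(" ".join(translate_words(toy_table, ["the", "house", "is", "unknown"])))  # la casa
```

Both "is" (only a NULL candidate) and "unknown" (not in the table) are silently dropped, which is why output sentences can be shorter than the input.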
Full Code:
import pandas as pd
import re
import nltk
from nltk.translate import AlignedSent, IBMModel1
# Load and Preprocess the Data
df = pd.read_csv("./engspn.csv")
english_sentences = df['english'].tolist()
spanish_sentences = df['spanish'].tolist()
def clean_sentences(sentences):
    cleaned_sentences = []
    for sentence in sentences:
        sentence = sentence.strip()
        sentence = sentence.lower()
        # [\W_]+ keeps accented letters such as "ñ" and "á"; [^a-zA-Z0-9]+ would strip them.
        sentence = re.sub(r"[\W_]+", " ", sentence)
        cleaned_sentences.append(sentence.strip())
    return cleaned_sentences
cleaned_english_sentences = clean_sentences(english_sentences)
cleaned_spanish_sentences = clean_sentences(spanish_sentences)
# Train the Translation Model
def train_translation_model(source_sentences, target_sentences):
    aligned_sentences = [AlignedSent(source.split(), target.split()) for source, target in zip(source_sentences, target_sentences)]
    ibm_model = IBMModel1(aligned_sentences, 10)
    return ibm_model
translation_model = train_translation_model(cleaned_english_sentences, cleaned_spanish_sentences)
# Translate Input Sentences
def translate_input(ibm_model):
    while True:
        source_text = input("Enter the English sentence to translate (or 'q' to quit): ")
        if source_text.lower() == 'q':
            print("Quitting...")
            break
        # Clean the whole sentence first, then split it into words.
        source_words = clean_sentences([source_text])[0].split()
        translated_words = []
        for source_word in source_words:
            max_prob = 0.0
            translated_word = None
            for target_word in ibm_model.translation_table[source_word]:
                prob = ibm_model.translation_table[source_word][target_word]
                if prob > max_prob:
                    max_prob = prob
                    translated_word = target_word
            if translated_word is not None:
                translated_words.append(translated_word)
        translated_text = ' '.join(translated_words)
        print("Translated text:", translated_text)
        print()
translate_input(translation_model)
This is the complete code for building a language translation model using Python and NLTK.
To run the code, make sure the required packages (pandas and nltk) are installed, set the path in pd.read_csv to your CSV file containing the parallel corpus, and run the script. The final call to translate_input starts the interactive loop for translating English sentences into Spanish.
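If you do not have a corpus at hand yet, a tiny file in the expected layout is enough to check that the pipeline runs end to end. The sketch below writes one with Python's standard csv module. The sentences and the file name engspn_sample.csv are invented for this example, and two sentence pairs are far too few to train a useful model; they only confirm that the column names and encoding are right.

```python
import csv
import os
import tempfile

# Invented sample rows: (english, spanish).
rows = [
    ("The house is green.", "La casa es verde."),
    ("Where is the book?", "¿Dónde está el libro?"),
]

# Write to a temporary directory so no existing file is overwritten.
sample_path = os.path.join(tempfile.mkdtemp(), "engspn_sample.csv")
with open(sample_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["english", "spanish"])  # column names the article's code reads
    writer.writerows(rows)

# Read the file back to confirm the layout.
with open(sample_path, newline="", encoding="utf-8") as f:
    loaded = list(csv.DictReader(f))
print(loaded[0]["english"], "|", loaded[0]["spanish"])
```

Passing sample_path to pd.read_csv in place of "./engspn.csv" should then give a data frame with the two columns the rest of the code uses.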
Conclusion:
In this blog, we have explored how to build a language translation model using Python and NLTK. We learned how to load and preprocess a parallel corpus, train an IBM Model 1 translation model, and create a translation function to interactively translate English sentences into Spanish. Language translation is a fascinating field with numerous applications, and by leveraging the power of Python and NLTK, you can create your own translation models for various language pairs.
Feel free to experiment with different parallel corpora, explore other translation models, or extend the functionality of the translation function. Happy translating!