Neural Machine Translation — with Attention and Tensorflow 2.0

Shravan C
5 min read · Mar 4, 2020


[Header image source: https://www.kdnuggets.com/2019/08/tensorflow-20.html]

This article covers neural machine translation (NMT) for an Indian language, Hindi. After a lot of trial and error, I finally got the code working. It is based on the NMT tutorial on the TensorFlow website, and I have followed the encoder, decoder, and attention code from that tutorial as it is. This is an attempt to learn by coding, taking the slightly harder route of trying it on an Indic language like Hindi, which is not as straightforward as the example in the tutorial.

One major requirement for getting this to work well is data, and datasets for Indian languages are hard to come by. I picked three different datasets and merged them: one from manythings.org and two others, all of which are available in the GitHub repo. There is one more dataset if you are interested; I will keep this article updated if I try it as well. Training was done locally, hence only a limited amount of data was used. I will update this article, with details on memory consumption, if I am able to run on the whole dataset.

Before we start with the code: all of it is available on GitHub and Colab. This may not be the ideal approach to NMT; it is just an experiment on my part, which I hope can help others kick-start their own translation work, with datasets available for further study. I have modularized the code to make it more understandable.

Downloading the datasets locally

Here I download the datasets locally for my own convenience; this can be altered to suit specific needs. My intention was to get a working program first and then experiment with it.

import os
import shutil
import zipfile
import urllib3

http = urllib3.PoolManager()

def extract(path, url, zipfilename):
    # stream the archive to disk, then unzip it into `path`
    with http.request('GET', url, preload_content=False) as r, open(zipfilename, 'wb') as out_file:
        shutil.copyfileobj(r, out_file)
    print("file--->", zipfilename)
    with zipfile.ZipFile(zipfilename, 'r') as zip_ref:
        zip_ref.extractall(path)

# Dataset 1
url = 'https://github.com/shravanc/datasets/blob/master/hin_eng/hi-en.zip?raw=true'
filename = 'hi-en.zip'
path = os.getcwd()
zipfilename = os.path.join(path, filename)
extract(path, url, zipfilename)

This code downloads the dataset from GitHub and unzips it into the current working directory. Other datasets can be added or removed in the same way, depending on the purpose, as shown below. The datasets can be in different formats, so I kept each download step separate rather than making the code too generic.
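For example, a second archive can be pulled in with another call to the same helper (the URL and filename below are placeholders, not one of the actual datasets):

# Dataset 2 -- placeholder URL, swap in the real archive location
url = 'https://github.com/shravanc/datasets/blob/master/hin_eng/another.zip?raw=true'
zipfilename = os.path.join(path, 'another.zip')
extract(path, url, zipfilename)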

Data Preparation

#==================================DataPreparation=======================================
import os
import tensorflow as tf
from sklearn.model_selection import train_test_split
from lib.utils import unicode_to_ascii, preprocess_sentence, create_dataset, create_new_dataset, load_dataset, max_length, convert

# Sanity-check the preprocessing on one English/Hindi pair
en_sentence = u"May I borrow this book?"
hi_sentence = u"क्या मैं यह पुस्तक उधार ले सकता हूँ?"
print(preprocess_sentence(en_sentence))
print(preprocess_sentence(hi_sentence).encode('utf-8'))

# Dataset 1: tab-separated corpus from manythings.org
path_to_file = os.path.join(os.getcwd(), "parallel_corpora/hin.txt")
en_1, hi_1 = create_dataset(path_to_file, None)

# Dataset 2: hi-en train/test/dev splits
en_path = os.path.join(os.getcwd(), "hi-en/train.en")
hi_path = os.path.join(os.getcwd(), "hi-en/train.hi")
en_2, hi_2 = create_new_dataset(en_path, hi_path)

en_path = os.path.join(os.getcwd(), "hi-en/test.en")
hi_path = os.path.join(os.getcwd(), "hi-en/test.hi")
en_3, hi_3 = create_new_dataset(en_path, hi_path)

en_path = os.path.join(os.getcwd(), "hi-en/dev.en")
hi_path = os.path.join(os.getcwd(), "hi-en/dev.hi")
en_4, hi_4 = create_new_dataset(en_path, hi_path)

# Dataset 3: HindiEnCorp
en_path = os.path.join(os.getcwd(), "HindiEnCorp/data.en")
hi_path = os.path.join(os.getcwd(), "HindiEnCorp/data.hi")
en_5, hi_5 = create_new_dataset(en_path, hi_path)

# Merge all the parallel corpora
en = en_1 + en_2 + en_3 + en_4 + en_5
hi = hi_1 + hi_2 + hi_3 + hi_4 + hi_5
print(len(en))
print(len(hi))

# Try experimenting with the size of the dataset
num_examples = 350000
input_tensor, target_tensor, inp_lang, targ_lang = load_dataset(path_to_file, num_examples)

# Calculate max_length of the input and target tensors
max_length_targ, max_length_inp = max_length(target_tensor), max_length(input_tensor)

# Creating training and validation sets using an 80-20 split
input_tensor_train, input_tensor_val, target_tensor_train, target_tensor_val = train_test_split(input_tensor, target_tensor, test_size=0.2)

# Show lengths
print(len(input_tensor_train), len(target_tensor_train), len(input_tensor_val), len(target_tensor_val))

print("Input Language; index to word mapping")
convert(inp_lang, input_tensor_train[0])
print()
print("Target Language; index to word mapping")
convert(targ_lang, target_tensor_train[0])

BUFFER_SIZE = len(input_tensor_train)
BATCH_SIZE = 64
steps_per_epoch = len(input_tensor_train)//BATCH_SIZE
embedding_dim = 256
units = 1024
vocab_inp_size = len(inp_lang.word_index)+1
vocab_tar_size = len(targ_lang.word_index)+1

# Build the input pipeline with the tf.data API
dataset = tf.data.Dataset.from_tensor_slices((input_tensor_train, target_tensor_train)).shuffle(BUFFER_SIZE)
dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)

example_input_batch, example_target_batch = next(iter(dataset))
example_input_batch.shape, example_target_batch.shape
#==================================DataPreparation=======================================

This code formats each string from the dataset and adds start and end tokens so the attention model can identify where a sentence begins and ends. Then, with the help of the tf.data Dataset API, a data pipeline is created.
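For reference, a minimal preprocess_sentence along the lines of the tutorial could look like the sketch below. Note this is an assumption about what lib/utils.py does: the tutorial's original version also strips every non-Latin character, which would wipe out Devanagari text, so that filter has to be dropped or widened for Hindi.

import re

def preprocess_sentence(w):
    w = w.lower().strip()
    # put a space between words and punctuation: "book?" -> "book ?"
    w = re.sub(r"([?.!,¿])", r" \1 ", w)
    # collapse runs of whitespace into a single space
    w = re.sub(r'\s+', " ", w).strip()
    # add start/end tokens so the model learns sentence boundaries
    return '<start> ' + w + ' <end>'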

Attention, Encoder, and Decoder

For more details on attention, there is a very good article that gives an excellent visualization of the topic, though it may take several reads to fully sink in. There is also another good read on the subject.
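Since I follow the tutorial's attention code as it is, lib/attention.py should be essentially the tutorial's BahdanauAttention, sketched here for reference: additive attention that scores every encoder time step against the current decoder state and softmaxes the scores over the sequence.

import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super(BahdanauAttention, self).__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, query, values):
        # query: decoder hidden state, shape (batch_size, units)
        # values: encoder output, shape (batch_size, max_len, units)
        query_with_time_axis = tf.expand_dims(query, 1)
        # additive score, shape (batch_size, max_len, 1)
        score = self.V(tf.nn.tanh(
            self.W1(query_with_time_axis) + self.W2(values)))
        # normalize the scores over the time axis
        attention_weights = tf.nn.softmax(score, axis=1)
        # weighted sum of encoder outputs, shape (batch_size, units)
        context_vector = tf.reduce_sum(attention_weights * values, axis=1)
        return context_vector, attention_weights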

#==================================Encoder==========================
from lib.encoder import Encoder

encoder = Encoder(vocab_inp_size, embedding_dim, units, BATCH_SIZE)

# sample input
sample_hidden = encoder.initialize_hidden_state()
sample_output, sample_hidden = encoder(example_input_batch, sample_hidden)
print('Encoder output shape: (batch size, sequence length, units) {}'.format(sample_output.shape))
print('Encoder Hidden state shape: (batch size, units) {}'.format(sample_hidden.shape))
#==================================Encoder==========================

#==================================Attention========================
from lib.attention import BahdanauAttention

attention_layer = BahdanauAttention(10)
attention_result, attention_weights = attention_layer(sample_hidden, sample_output)
print("Attention result shape: (batch size, units) {}".format(attention_result.shape))
print("Attention weights shape: (batch_size, sequence_length, 1) {}".format(attention_weights.shape))
#==================================Attention========================

#==================================Decoder==========================
from lib.decoder import Decoder

decoder = Decoder(vocab_tar_size, embedding_dim, units, BATCH_SIZE)
sample_decoder_output, _, _ = decoder(tf.random.uniform((BATCH_SIZE, 1)),
                                      sample_hidden, sample_output)
print('Decoder output shape: (batch_size, vocab size) {}'.format(sample_decoder_output.shape))
#==================================Decoder==========================

Clubbing all the code into one file was not helping me understand it, so I split it into modules. This gives a better intuition for what the theory tells us: the output of one layer is fed into the next, and at the end the decoder predicts one word at a time.
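To make that chaining concrete, here is the tutorial's Decoder, which lib/decoder.py presumably mirrors: the attention layer condenses the encoder output into a context vector, which is concatenated with the embedded input token before a GRU and a dense layer produce logits for the next word.

class Decoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, dec_units, batch_sz):
        super(Decoder, self).__init__()
        self.batch_sz = batch_sz
        self.dec_units = dec_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(self.dec_units,
                                       return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer='glorot_uniform')
        self.fc = tf.keras.layers.Dense(vocab_size)
        self.attention = BahdanauAttention(self.dec_units)

    def call(self, x, hidden, enc_output):
        # attend over the encoder output using the previous decoder state
        context_vector, attention_weights = self.attention(hidden, enc_output)
        # x after embedding: (batch_size, 1, embedding_dim)
        x = self.embedding(x)
        # prepend the context vector to the embedded input token
        x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)
        output, state = self.gru(x)
        output = tf.reshape(output, (-1, output.shape[2]))
        # logits over the target vocabulary: (batch_size, vocab_size)
        x = self.fc(output)
        return x, state, attention_weights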

Training and Evaluate

Training is almost the same as on the TensorFlow website, whereas evaluation carries a small bug fix: the original code breaks if the input contains a word the tokenizer has never seen, so unknown words are now looked up safely and skipped.
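For completeness, the core of that training loop is the tutorial's teacher-forced train step. This is a sketch assuming the tutorial's Adam optimizer and masked cross-entropy loss, not necessarily the exact code in the repo.

optimizer = tf.keras.optimizers.Adam()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')

def loss_function(real, pred):
    # mask out the padding (index 0) so it does not contribute to the loss
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)
    loss_ *= tf.cast(mask, dtype=loss_.dtype)
    return tf.reduce_mean(loss_)

@tf.function
def train_step(inp, targ, enc_hidden):
    loss = 0
    with tf.GradientTape() as tape:
        enc_output, enc_hidden = encoder(inp, enc_hidden)
        dec_hidden = enc_hidden
        dec_input = tf.expand_dims([targ_lang.word_index['<start>']] * BATCH_SIZE, 1)
        # teacher forcing: feed the ground-truth token, not the prediction
        for t in range(1, targ.shape[1]):
            predictions, dec_hidden, _ = decoder(dec_input, dec_hidden, enc_output)
            loss += loss_function(targ[:, t], predictions)
            dec_input = tf.expand_dims(targ[:, t], 1)
    batch_loss = loss / int(targ.shape[1])
    variables = encoder.trainable_variables + decoder.trainable_variables
    gradients = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(gradients, variables))
    return batch_loss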

#==================================Evaluate=========================
import numpy as np

def evaluate(sentence):
    attention_plot = np.zeros((max_length_targ, max_length_inp))
    sentence = preprocess_sentence(sentence)

    # look up each word safely; skip words the tokenizer has never seen
    inputs = []
    for i in sentence.split(' '):
        if i == "":
            break
        wi = inp_lang.word_index.get(i)
        if wi:
            inputs.append(wi)

    inputs = tf.keras.preprocessing.sequence.pad_sequences([inputs],
                                                           maxlen=max_length_inp,
                                                           padding='post')
    inputs = tf.convert_to_tensor(inputs)

    result = ''
    hidden = [tf.zeros((1, units))]
    enc_out, enc_hidden = encoder(inputs, hidden)

    dec_hidden = enc_hidden
    dec_input = tf.expand_dims([targ_lang.word_index['<start>']], 0)

    for t in range(max_length_targ):
        predictions, dec_hidden, attention_weights = decoder(dec_input,
                                                             dec_hidden,
                                                             enc_out)
        # storing the attention weights to plot later on
        attention_weights = tf.reshape(attention_weights, (-1, ))
        attention_plot[t] = attention_weights.numpy()

        predicted_id = tf.argmax(predictions[0]).numpy()
        result += targ_lang.index_word[predicted_id] + ' '

        if targ_lang.index_word[predicted_id] == '<end>':
            return result, sentence, attention_plot

        # the predicted ID is fed back into the model
        dec_input = tf.expand_dims([predicted_id], 0)

    return result, sentence, attention_plot
#==================================Evaluate=========================
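A small wrapper then turns evaluate into a usable translate function. The sample sentence assumes, as in the tutorial's setup, that the input side here is Hindi.

def translate(sentence):
    result, sentence, attention_plot = evaluate(sentence)
    print('Input: %s' % (sentence))
    print('Predicted translation: {}'.format(result))

translate(u'क्या मैं यह पुस्तक उधार ले सकता हूँ?')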

And that's it for this article. I finally got it to work on my local machine, and I am looking forward to improving the results and trying more languages. The rest of the code is available on GitHub. It feels good to write your own translator, even though the results are not that good yet.
