Neural Machine Translation — with Attention and Tensorflow 2.0

Shravan C
5 min read · Mar 4, 2020


[Header image source: https://www.kdnuggets.com/2019/08/tensorflow-20.html]

This article covers neural machine translation (NMT) for an Indian language, Hindi. After a lot of trial and error, I finally got the code working. It is based on the NMT tutorial on the TensorFlow website, and I have followed the encoder, decoder, and attention code from that tutorial as it is. This is an attempt to learn by coding, taking the slightly harder route of trying it on an Indic language like Hindi, which is not as straightforward as the example in the tutorial.

One major requirement for getting this to work well is data, and datasets for Indian languages are hard to come by. I picked three different datasets and merged them: one from manythings.org and two others, all of which are available in the GitHub repo. There is one more dataset if you are interested; I will keep this article updated if I try it as well. Training was done locally, hence only a limited amount of data was used. I will update this article, with details on memory consumption, if I am able to run on the whole dataset.

Before we start with the code: all of it is available on GitHub and Colab. This may not be the ideal approach to NMT; it is just an experiment on my part, which I hope can help others kick-start their own translation work, with datasets available for further study. I have modularized the code to make it more understandable.

Downloading the datasets locally

Here I download the datasets locally for my own convenience; this can be altered to suit specific needs. My intention was to get a working program first and then experiment with it.

import os
import shutil
import zipfile
import urllib3

http = urllib3.PoolManager()

def extract(path, url, zipfilename):
    # stream the archive to disk, then unzip it into `path`
    with http.request('GET', url, preload_content=False) as r, open(zipfilename, 'wb') as out_file:
        shutil.copyfileobj(r, out_file)
    print("file--->", zipfilename)
    with zipfile.ZipFile(zipfilename, 'r') as zip_ref:
        zip_ref.extractall(path)

# Dataset 1
url = 'https://github.com/shravanc/datasets/blob/master/hin_eng/hi-en.zip?raw=true'
filename = 'hi-en.zip'
path = os.getcwd()
zipfilename = os.path.join(path, filename)
extract(path, url, zipfilename)

This code downloads the dataset from GitHub and unzips it into the current working directory. Other datasets can be added or removed in the same way, depending on the purpose, as shown below. The datasets can be in different formats, so I kept each download step separate rather than making the code too generic.
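For example, a second archive can be pulled in with another call to the same helper (the URL and filename below are placeholders, not one of the actual datasets):

# Dataset 2 -- placeholder URL, swap in the real archive location
url = 'https://github.com/shravanc/datasets/blob/master/hin_eng/another.zip?raw=true'
zipfilename = os.path.join(path, 'another.zip')
extract(path, url, zipfilename)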

Data Preparation

#==================================DataPreparation=======================================
import os
import tensorflow as tf
from sklearn.model_selection import train_test_split
from lib.utils import unicode_to_ascii, preprocess_sentence, create_dataset, create_new_dataset, load_dataset, max_length, convert

# Sanity-check the preprocessing on one English/Hindi pair
en_sentence = u"May I borrow this book?"
hi_sentence = u"क्या मैं यह पुस्तक उधार ले सकता हूँ?"
print(preprocess_sentence(en_sentence))
print(preprocess_sentence(hi_sentence).encode('utf-8'))

# Dataset 1: tab-separated corpus from manythings.org
path_to_file = os.path.join(os.getcwd(), "parallel_corpora/hin.txt")
en_1, hi_1 = create_dataset(path_to_file, None)

# Dataset 2: hi-en train/test/dev splits
en_path = os.path.join(os.getcwd(), "hi-en/train.en")
hi_path = os.path.join(os.getcwd(), "hi-en/train.hi")
en_2, hi_2 = create_new_dataset(en_path, hi_path)

en_path = os.path.join(os.getcwd(), "hi-en/test.en")
hi_path = os.path.join(os.getcwd(), "hi-en/test.hi")
en_3, hi_3 = create_new_dataset(en_path, hi_path)

en_path = os.path.join(os.getcwd(), "hi-en/dev.en")
hi_path = os.path.join(os.getcwd(), "hi-en/dev.hi")
en_4, hi_4 = create_new_dataset(en_path, hi_path)

# Dataset 3: HindiEnCorp
en_path = os.path.join(os.getcwd(), "HindiEnCorp/data.en")
hi_path = os.path.join(os.getcwd(), "HindiEnCorp/data.hi")
en_5, hi_5 = create_new_dataset(en_path, hi_path)

# Merge all the parallel corpora
en = en_1 + en_2 + en_3 + en_4 + en_5
hi = hi_1 + hi_2 + hi_3 + hi_4 + hi_5
print(len(en))
print(len(hi))

# Try experimenting with the size of the dataset
num_examples = 350000
input_tensor, target_tensor, inp_lang, targ_lang = load_dataset(path_to_file, num_examples)

# Calculate max_length of the input and target tensors
max_length_targ, max_length_inp = max_length(target_tensor), max_length(input_tensor)

# Creating training and validation sets using an 80-20 split
input_tensor_train, input_tensor_val, target_tensor_train, target_tensor_val = train_test_split(input_tensor, target_tensor, test_size=0.2)

# Show lengths
print(len(input_tensor_train), len(target_tensor_train), len(input_tensor_val), len(target_tensor_val))

print("Input Language; index to word mapping")
convert(inp_lang, input_tensor_train[0])
print()
print("Target Language; index to word mapping")
convert(targ_lang, target_tensor_train[0])

BUFFER_SIZE = len(input_tensor_train)
BATCH_SIZE = 64
steps_per_epoch = len(input_tensor_train)//BATCH_SIZE
embedding_dim = 256
units = 1024
vocab_inp_size = len(inp_lang.word_index)+1
vocab_tar_size = len(targ_lang.word_index)+1

# Build the input pipeline with the tf.data API
dataset = tf.data.Dataset.from_tensor_slices((input_tensor_train, target_tensor_train)).shuffle(BUFFER_SIZE)
dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)

example_input_batch, example_target_batch = next(iter(dataset))
example_input_batch.shape, example_target_batch.shape
#==================================DataPreparation=======================================

This code formats each string from the dataset and adds start and end tokens so the attention model can identify where a sentence begins and ends. Then, with the help of the tf.data Dataset API, a data pipeline is created.
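For reference, a minimal preprocess_sentence along the lines of the tutorial could look like the sketch below. Note this is an assumption about what lib/utils.py does: the tutorial's original version also strips every non-Latin character, which would wipe out Devanagari text, so that filter has to be dropped or widened for Hindi.

import re

def preprocess_sentence(w):
    w = w.lower().strip()
    # put a space between words and punctuation: "book?" -> "book ?"
    w = re.sub(r"([?.!,¿])", r" \1 ", w)
    # collapse runs of whitespace into a single space
    w = re.sub(r'\s+', " ", w).strip()
    # add start/end tokens so the model learns sentence boundaries
    return '<start> ' + w + ' <end>'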

Attention, Encoder, and Decoder

For more details on attention, there is a very good article that gives an excellent visualization of the topic, though it may take several reads to fully sink in. There is also another good read on the subject.
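Since I follow the tutorial's attention code as it is, lib/attention.py should be essentially the tutorial's BahdanauAttention, sketched here for reference: additive attention that scores every encoder time step against the current decoder state and softmaxes the scores over the sequence.

import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super(BahdanauAttention, self).__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, query, values):
        # query: decoder hidden state, shape (batch_size, units)
        # values: encoder output, shape (batch_size, max_len, units)
        query_with_time_axis = tf.expand_dims(query, 1)
        # additive score, shape (batch_size, max_len, 1)
        score = self.V(tf.nn.tanh(
            self.W1(query_with_time_axis) + self.W2(values)))
        # normalize the scores over the time axis
        attention_weights = tf.nn.softmax(score, axis=1)
        # weighted sum of encoder outputs, shape (batch_size, units)
        context_vector = tf.reduce_sum(attention_weights * values, axis=1)
        return context_vector, attention_weights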

#==================================Encoder==========================
from lib.encoder import Encoder

encoder = Encoder(vocab_inp_size, embedding_dim, units, BATCH_SIZE)

# sample input
sample_hidden = encoder.initialize_hidden_state()
sample_output, sample_hidden = encoder(example_input_batch, sample_hidden)
print('Encoder output shape: (batch size, sequence length, units) {}'.format(sample_output.shape))
print('Encoder Hidden state shape: (batch size, units) {}'.format(sample_hidden.shape))
#==================================Encoder==========================

#==================================Attention========================
from lib.attention import BahdanauAttention

attention_layer = BahdanauAttention(10)
attention_result, attention_weights = attention_layer(sample_hidden, sample_output)
print("Attention result shape: (batch size, units) {}".format(attention_result.shape))
print("Attention weights shape: (batch_size, sequence_length, 1) {}".format(attention_weights.shape))
#==================================Attention========================

#==================================Decoder==========================
from lib.decoder import Decoder

decoder = Decoder(vocab_tar_size, embedding_dim, units, BATCH_SIZE)
sample_decoder_output, _, _ = decoder(tf.random.uniform((BATCH_SIZE, 1)),
                                      sample_hidden, sample_output)
print('Decoder output shape: (batch_size, vocab size) {}'.format(sample_decoder_output.shape))
#==================================Decoder==========================

Clubbing all the code into one file was not helping me understand it, so I split it into modules. This gives a better intuition for what the theory tells us: the output of one layer is fed into the next, and at the end the decoder predicts one word at a time.
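To make that chaining concrete, here is the tutorial's Decoder, which lib/decoder.py presumably mirrors: the attention layer condenses the encoder output into a context vector, which is concatenated with the embedded input token before a GRU and a dense layer produce logits for the next word.

class Decoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, dec_units, batch_sz):
        super(Decoder, self).__init__()
        self.batch_sz = batch_sz
        self.dec_units = dec_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(self.dec_units,
                                       return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer='glorot_uniform')
        self.fc = tf.keras.layers.Dense(vocab_size)
        self.attention = BahdanauAttention(self.dec_units)

    def call(self, x, hidden, enc_output):
        # attend over the encoder output using the previous decoder state
        context_vector, attention_weights = self.attention(hidden, enc_output)
        # x after embedding: (batch_size, 1, embedding_dim)
        x = self.embedding(x)
        # prepend the context vector to the embedded input token
        x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)
        output, state = self.gru(x)
        output = tf.reshape(output, (-1, output.shape[2]))
        # logits over the target vocabulary: (batch_size, vocab_size)
        x = self.fc(output)
        return x, state, attention_weights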

Training and Evaluate

Training is almost the same as on the TensorFlow website, whereas evaluation carries a small bug fix: the original code breaks if the input contains a word the tokenizer has never seen, so unknown words are now looked up safely and skipped.
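For completeness, the core of that training loop is the tutorial's teacher-forced train step. This is a sketch assuming the tutorial's Adam optimizer and masked cross-entropy loss, not necessarily the exact code in the repo.

optimizer = tf.keras.optimizers.Adam()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')

def loss_function(real, pred):
    # mask out the padding (index 0) so it does not contribute to the loss
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)
    loss_ *= tf.cast(mask, dtype=loss_.dtype)
    return tf.reduce_mean(loss_)

@tf.function
def train_step(inp, targ, enc_hidden):
    loss = 0
    with tf.GradientTape() as tape:
        enc_output, enc_hidden = encoder(inp, enc_hidden)
        dec_hidden = enc_hidden
        dec_input = tf.expand_dims([targ_lang.word_index['<start>']] * BATCH_SIZE, 1)
        # teacher forcing: feed the ground-truth token, not the prediction
        for t in range(1, targ.shape[1]):
            predictions, dec_hidden, _ = decoder(dec_input, dec_hidden, enc_output)
            loss += loss_function(targ[:, t], predictions)
            dec_input = tf.expand_dims(targ[:, t], 1)
    batch_loss = loss / int(targ.shape[1])
    variables = encoder.trainable_variables + decoder.trainable_variables
    gradients = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(gradients, variables))
    return batch_loss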

#==================================Evaluate=========================
import numpy as np

def evaluate(sentence):
    attention_plot = np.zeros((max_length_targ, max_length_inp))
    sentence = preprocess_sentence(sentence)

    # look up each word safely; skip words the tokenizer has never seen
    inputs = []
    for i in sentence.split(' '):
        if i == "":
            break
        wi = inp_lang.word_index.get(i)
        if wi:
            inputs.append(wi)

    inputs = tf.keras.preprocessing.sequence.pad_sequences([inputs],
                                                           maxlen=max_length_inp,
                                                           padding='post')
    inputs = tf.convert_to_tensor(inputs)

    result = ''
    hidden = [tf.zeros((1, units))]
    enc_out, enc_hidden = encoder(inputs, hidden)

    dec_hidden = enc_hidden
    dec_input = tf.expand_dims([targ_lang.word_index['<start>']], 0)

    for t in range(max_length_targ):
        predictions, dec_hidden, attention_weights = decoder(dec_input,
                                                             dec_hidden,
                                                             enc_out)
        # storing the attention weights to plot later on
        attention_weights = tf.reshape(attention_weights, (-1, ))
        attention_plot[t] = attention_weights.numpy()

        predicted_id = tf.argmax(predictions[0]).numpy()
        result += targ_lang.index_word[predicted_id] + ' '

        if targ_lang.index_word[predicted_id] == '<end>':
            return result, sentence, attention_plot

        # the predicted ID is fed back into the model
        dec_input = tf.expand_dims([predicted_id], 0)

    return result, sentence, attention_plot
#==================================Evaluate=========================
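A small wrapper then turns evaluate into a usable translate function. The sample sentence assumes, as in the tutorial's setup, that the input side here is Hindi.

def translate(sentence):
    result, sentence, attention_plot = evaluate(sentence)
    print('Input: %s' % (sentence))
    print('Predicted translation: {}'.format(result))

translate(u'क्या मैं यह पुस्तक उधार ले सकता हूँ?')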

And that's it for this article. I finally got it to work on my local machine, and I am looking forward to improving the results and trying more languages. The rest of the code is available on GitHub. It feels good to write your own translator, even though the results are not that good yet.
