
At the time of writing this article, I had completed my Master's in Software Engineering a week earlier. For my final-semester thesis project, I did sentiment analysis for the Kannada language using Multilingual BERT.
TL;DR: Translated Dataset and Source Data.
After spending a lot of time searching for a Kannada dataset for sentiment analysis, I could find very little. Eventually I found a dataset related to Covid-19, which is a large, labeled collection. After downloading it, I set out to translate it into Kannada. First I tried to build my own machine translation model, an attempt that ended up as this article: Neural Machine Translation. Since there was very little data available to train the translation model, I ruled this method out. Next I checked the cost of the Google Translation API. It was too expensive, so I ruled that out as well. Then I looked for a freely available Python library like NLTK, but it had no Kannada support yet. Finally, I found another Python library that uses the Google Translation API internally: TextBlob is the library I settled on for the translation.
TextBlob's translation service has a limitation, though: because it relies on Google Translation, it can be used only 400 times a day. The dataset I wanted to translate had around 5 lakh (500,000) news articles. Once again, I was let down.
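To put that limit in perspective, here is a quick back-of-the-envelope calculation. It is only a sketch: it assumes one translation call per article, and the helper name `days_needed` is made up for illustration.

```python
import math

TOTAL_ARTICLES = 500_000  # "5 lakh" news articles in the source dataset
DAILY_LIMIT = 400         # translations per day, as quoted above


def days_needed(total_items: int, per_day: int) -> int:
    """Return the number of whole days needed to process total_items."""
    if per_day <= 0:
        raise ValueError("per_day must be positive")
    return math.ceil(total_items / per_day)


# Assumption: each article costs exactly one translation call.
days = days_needed(TOTAL_ARTICLES, DAILY_LIMIT)
print(f"{days} days, roughly {days / 365:.1f} years")
# prints "1250 days, roughly 3.4 years"
```

At that rate a single session would need more than three years to get through the full dataset, which is why a single-session approach was not workable.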
Google Colab to the Rescue
With this limitation, the first thing that came to mind was to run the translation on Google Colab with multiple sessions. The limit of 400 applies on Google Colab as well, but with multiple sessions open I was able to translate about 15 thousand articles a day. I continued until I had close to 70 thousand articles. The Google Colab notebook used to create this can be found here.
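The multi-session idea can be sketched roughly as follows. This is not the code from the notebook, just a minimal illustration under a few assumptions: the dataset is split into one shard per session, each session stops after 400 calls, and progress is kept in a dictionary so a later run can pick up where the previous one stopped. The names `split_into_shards`, `translate_shard`, and `fake_translate` are hypothetical, and `fake_translate` is a stand-in so the sketch runs offline; in a real session it would be replaced by a call to whichever translation library is in use.

```python
from typing import Callable, Dict, List

# Daily cap quoted in the article; assumed here to apply per session.
PER_SESSION_LIMIT = 400


def split_into_shards(items: List[str], num_shards: int) -> List[List[str]]:
    """Deal items round-robin into num_shards lists, one per session."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    return [items[i::num_shards] for i in range(num_shards)]


def translate_shard(
    shard: List[str],
    translate: Callable[[str], str],
    done: Dict[str, str],
    limit: int = PER_SESSION_LIMIT,
) -> int:
    """Translate up to `limit` untranslated items, storing results in `done`.

    Returns the number of calls made to `translate` in this run.
    """
    calls = 0
    for text in shard:
        if text in done:
            continue  # already translated in an earlier run
        if calls >= limit:
            break  # quota used up; resume in the next run
        done[text] = translate(text)
        calls += 1
    return calls


def fake_translate(text: str) -> str:
    """Hypothetical stand-in translator so the sketch runs offline."""
    return f"kn:{text}"


articles = [f"article {i}" for i in range(1000)]
shards = split_into_shards(articles, num_shards=2)

progress: Dict[str, str] = {}
first_run = translate_shard(shards[0], fake_translate, progress)
second_run = translate_shard(shards[0], fake_translate, progress)
print(first_run, second_run, len(progress))
# prints "400 100 500"
```

If the limit really is per session, about 15 thousand articles a day would correspond to roughly 38 sessions of 400 each, though that figure is an inference from the numbers above rather than something measured.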