OCR (pytesseract) and NLP (SpaCy) Application

Shravan C
3 min read · Nov 18, 2018

This project explores what OCR and NLP can do together, and looks for further problems they could help solve.

The problem I am trying to solve here is extracting the license number, issue date, and expiry date of real estate agents from their License Certificates. This article does not go into how the code is written or structured; it only demonstrates how a solution is obtained. The primary focus is on the task the code performs.

One approach is to extract each line of the certificate with OCR (the pytesseract Python library) and look for keywords inside each line. The assumption made here is that a keyword is always followed by a delimiter symbol, typically a colon (EXPIRES: , Effective: , License Number: ). Sample keywords are listed below:

LICENSE Number
EXPIRES
Expiration
Effective
Identification Number
Issuance Date
Expiration Date
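
The line-extraction step can be sketched roughly as follows. This is a minimal sketch, not the code from the repository: `text_to_lines` and `ocr_lines` are hypothetical names, and the sample certificate text is invented. It assumes `pytesseract.image_to_string` and Pillow's `Image.open` are available, together with the Tesseract binary.

```python
def text_to_lines(text):
    """Split raw OCR output into stripped, non-empty lines."""
    return [ln.strip() for ln in text.splitlines() if ln.strip()]


def ocr_lines(image_path):
    """Run Tesseract on an image file and return its text line by line."""
    # Imported lazily so text_to_lines can be used without the OCR stack.
    from PIL import Image
    import pytesseract

    return text_to_lines(pytesseract.image_to_string(Image.open(image_path)))


# Stand-in for OCR output; this certificate text is invented for illustration.
sample = (
    "STATE REAL ESTATE COMMISSION\n"
    "\n"
    "License Number: 12345\n"
    "  EXPIRES: November 10, 2028 \n"
)
lines = text_to_lines(sample)
```

With a real scan, `ocr_lines("certificate.png")` would produce the same kind of list, although OCR noise may call for extra cleanup.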

Once a keyword is found in a line, the line is split at the delimiter, based on the assumption stated above, and the result is stored in a list whose first element is the key and whose second element is the value. The pairs could be stored in a hash (a Python dict) just as well.
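
As a rough illustration of that split, here is a small sketch that handles only the colon case and collects the pairs into a dict. The function name `split_keyword_line` is hypothetical, and the case-insensitive matching is my assumption rather than something the project is known to do.

```python
KEYWORDS = [
    "License Number", "EXPIRES", "Expiration", "Effective",
    "Identification Number", "Issuance Date", "Expiration Date",
]


def split_keyword_line(line, keywords=KEYWORDS):
    """Return (key, value) if the line looks like 'keyword: value', else None."""
    key, sep, value = line.partition(":")  # split at the first colon only
    if not sep:
        return None  # no colon, so the assumption does not hold for this line
    key = key.strip()
    if key.lower() not in {k.lower() for k in keywords}:
        return None  # text before the colon is not a known keyword
    return key, value.strip()


lines = [
    "STATE REAL ESTATE COMMISSION",
    "License Number: 12345",
    "EXPIRES: November 10, 2028",
]
pairs = [p for p in map(split_keyword_line, lines) if p is not None]
fields = dict(pairs)  # the "hash" variant of the same data
```

The list keeps duplicates and order, while the dict gives direct lookup such as `fields["EXPIRES"]`.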

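The analyse function does all the work of splitting the lines under this assumption and putting the results into the list. When a keyword has no value on its own line, the function keeps appending the following lines to the value until it reaches a line that contains a pincode. Below is the code snippet.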

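import re

# isKeyWord, isEmpty, isPincode and push are helper functions that are not shown here.
def analyse(lines, keywords):
    slist = []  # collected [key, value] pairs
    i = 0
    while i < len(lines):
        # Split the line at any of the delimiters : - ~ #
        word = re.split(":|-|~|#", lines[i])
        # Require a delimiter (len > 1) so that word[1] is safe to read.
        if len(word) > 1 and isKeyWord(keywords, word[0]):
            if isEmpty(word[1]):
                # The value is on the following lines: keep appending them
                # until a line containing a pincode is reached.
                i += 1
                while i < len(lines):
                    word[1] += " " + lines[i]
                    if any(isPincode(part) for part in lines[i].split(", ")):
                        break
                    i += 1
            push(slist, word)
        i += 1
    return slist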
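
The snippet relies on four helpers that are not shown in this article. The versions below are guesses made only so that the snippet can be tried out; the real ones in the repository may differ. In particular, treating a "pincode" as a 5-digit US ZIP code (optionally ZIP+4) is my assumption.

```python
import re


def isKeyWord(keywords, text):
    """True if the text before the delimiter is one of the known keywords."""
    return text.strip().lower() in {k.lower() for k in keywords}


def isEmpty(text):
    """True if the text holds nothing but whitespace."""
    return text.strip() == ""


def isPincode(text):
    """True if the fragment ends in a 5-digit ZIP code (assumed format)."""
    return re.search(r"\b\d{5}(-\d{4})?$", text.strip()) is not None


def push(slist, word):
    """Append a trimmed [key, value] pair to the result list."""
    slist.append([word[0].strip(), word[1].strip()])


result = []
push(result, ["License Number", " 12345 "])
```

Note that `push` here keeps only the first two pieces of the split, so a value that itself contains one of the delimiters would be cut short.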

Enter the spaCy library:

Before starting with pytesseract, I had used the Google Vision API to get the text from a given image. At that time my focus was on getting the text analyzed, and I was searching for a ready-made library. I found spaCy very helpful. It comes with pre-trained entity detection, which works very well, and it is configurable, so it is up to the application or the individual how it is used. I would rather write a separate article explaining the power of the spaCy library. For now, I will stick to the topic of this article.

The analyse function described above works if there is one keyword in a line. If a line contains two keywords, it fails. The analyse code itself could probably be improved, but I chose to use the spaCy library to make the work quick and easy. I created a set of match rules so that spaCy identifies each keyword and creates an entity out of it. The interface for this integration is as below:

# line = 'License Number: 12345 Issued on: November 10, 2018 Expiry: November 10, 2028'
obj = spacy_lib.split_entity(line)
# obj = ['License Number: 12345', 'Issued on: November 10, 2018', 'Expiry: November 10, 2028']

Given a line, it returns a list of segments, one per identified keyword. The analyse function can then parse each segment just as it would parse a line with a single keyword.
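
One way such a `split_entity` could be built is with spaCy's rule-based `Matcher`. The following is only a sketch under my own assumptions and not the repository's implementation: the three token patterns are invented for this example line, and `split_at` is a hypothetical helper that cuts the line at the character offsets where keywords start.

```python
def split_at(text, starts):
    """Cut text into trimmed segments that begin at the given offsets."""
    bounds = sorted({0, len(text), *starts})
    segments = (text[a:b].strip() for a, b in zip(bounds, bounds[1:]))
    return [s for s in segments if s]


def split_entity(line):
    """Split a line wherever one of the keyword patterns begins."""
    # Imported lazily so split_at can be used without spaCy installed.
    import spacy
    from spacy.matcher import Matcher

    nlp = spacy.blank("en")  # tokenizer only, no trained model required
    matcher = Matcher(nlp.vocab)
    # Assumed patterns: keyword tokens followed by a colon token.
    matcher.add("KEYWORD", [
        [{"LOWER": "license"}, {"LOWER": "number"}, {"ORTH": ":"}],
        [{"LOWER": "issued"}, {"LOWER": "on"}, {"ORTH": ":"}],
        [{"LOWER": "expiry"}, {"ORTH": ":"}],
    ])
    doc = nlp(line)
    # doc[start].idx is the character offset of the first matched token.
    starts = [doc[start].idx for _, start, _ in matcher(doc)]
    return split_at(line, starts)


line = "License Number: 12345 Issued on: November 10, 2018 Expiry: November 10, 2028"
# Offsets computed by hand here, so this part runs without spaCy.
offsets = [line.index(k) for k in ("License Number", "Issued on", "Expiry")]
segments = split_at(line, offsets)
```

In real use the pipeline and matcher would be built once and reused rather than recreated on every call.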

Conclusion:

The code involves three parts:

  • Extract text from the License Certificate using pytesseract.
  • Process the extracted text by passing it to spaCy, leveraging the support the NLP library provides.
  • Analyze the processed data and extract the required fields from the certificate.

GitHub Link for the code base: https://github.com/shravanc/ocr_nlp

Shravan C

Software Engineer | Machine Learning Enthusiast | Super interested in Deep Learning with Tensorflow | GCP