Indic Tesseract - Malayalam OCR

tvsijin · February 27, 2017, 5:40pm

I would like to suggest an idea that I would like to work on. This project aims to improve the accuracy of Indic languages (Malayalam for now) in the Tesseract OCR engine.

The accuracy Tesseract achieves for any language depends on the initial training data and on resources derived from the corpus: the unicharambigs file, which describes ambiguities in recognition, and the standardised dictionaries, which also capture how frequently certain words are likely to occur.
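To illustrate the word-frequency part, here is a minimal sketch of how a frequency list could be derived from a text corpus. It assumes words are separated by whitespace and does no normalisation or punctuation handling; the function name `word_frequencies` and the sample sentence are only illustrative, not part of Tesseract or its training tools.

```python
from collections import Counter


def word_frequencies(corpus_text, top_n=None):
    """Count whitespace-separated words, most frequent first.

    Hypothetical helper for illustration; a real corpus would need
    punctuation stripping and Unicode normalisation first.
    """
    counts = Counter(corpus_text.split())
    # most_common(None) returns every word, ordered by descending count
    return counts.most_common(top_n)


if __name__ == "__main__":
    # Toy sample only, not real corpus data
    sample = "ഇത് ഒരു പരീക്ഷണം ആണ് ഇത് ഒരു വാക്യം"
    for word, count in word_frequencies(sample, top_n=3):
        print(word, count)
```

The most frequent words from such a list are the kind of input a frequent-words dictionary would be built from.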

Working on this project involves extensive work on collecting and creating a specific, logically standardised dataset from the corpus and then training on it. It may also need constant testing to measure accuracy, and repeating the training process to correct mistakes until 99% accuracy is achieved. I believe this will take a serious amount of manual work.
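One possible way to measure accuracy in each test round is character-level accuracy based on edit distance between the ground-truth text and the OCR output. The sketch below assumes accuracy is defined as 1 minus (edit distance / ground-truth length); that definition and the function names are my assumptions, not something Tesseract prescribes. It also compares Unicode code points, so a single Malayalam conjunct made of several code points can count as more than one error.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (code-point level)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_accuracy(ground_truth, ocr_output):
    """Return accuracy in [0, 1]; assumed definition, see note above."""
    if not ground_truth:
        return 1.0 if not ocr_output else 0.0
    errors = edit_distance(ground_truth, ocr_output)
    return max(0.0, 1.0 - errors / len(ground_truth))


if __name__ == "__main__":
    target = 0.99  # the 99% goal mentioned above
    score = character_accuracy("abcd", "abxd")
    print(score, score >= target)
```

Running this over a held-out set of page images after each training round would show whether the change moved the model towards the target.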

As of now, I have studied the project workflow and have also done some test runs. https://github.com/tvsijin/indic-tesseract