POS Tagging: A review of BIS POS tagset and ILCI-II Malayalam Text Corpus

discobot · September 30, 2020, 1:03pm

The Bureau of Indian Standards (BIS) published a part-of-speech (POS) tagset for Indian languages. POS tagging is the process of assigning a part-of-speech marker to each word in a given text. In this article, I review the tagset it defines. While developing the mlmorph project, I explored a candidate POS tagging schema for Malayalam. I did not choose the BIS tagset, for reasons I will explain in this article. Along with the tagset, we will also analyse the ILCI-II Malayalam text corpus published by TDIL, which uses the BIS POS tagset. I will start with some of the concepts and how they apply to different languages.

This is a companion discussion topic for the original entry at https://blog.smc.org.in/pos-tagging-a-review-of-bis-pos-tagset-and-ilci-ii-malayalam-text-corpus/