Sindhi Diacritics Restoration by Letter Level Learning Approach
Abstract
Sindhi is one of the languages that require diacritics for accurate reading and comprehension, yet diacritics are usually omitted in everyday writing. This omission introduces many syntactic, morphological, and phonological ambiguities for computational processing. Diacritics can be restored at the letter level or the word level. This paper uses a letter-level learning method for Sindhi diacritics restoration: the letters surrounding each target letter are extracted and stored in a feature vector, which is then compared with new examples taken from non-diacritized input text. The surrounding letters are computed with different window sizes, and N=5 is observed to be the most efficient. A k-nearest neighbor classifier is used to classify the instances, and the nearest instance is then used to replace the non-diacritized letter. Results are reported in terms of Diacritic Error Rate (DER), which is 1.9%. The proposed approach is tested on Sindhi but can be used for other Arabic-script-based languages, because the Sindhi character set is a superset of the Arabic character set.
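The letter-level idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that the window of N=5 is centered on the target letter (two letters on each side), that windows are compared by Hamming distance, that k=1, and that DER is the fraction of character positions whose restored diacritic differs from the reference. All function names (`split_letters`, `windows`, `train`, `restore`, `der`) and the padding symbol are hypothetical.

```python
import unicodedata

PAD = "_"  # hypothetical padding symbol for positions outside the text


def is_diacritic(ch):
    # Arabic-script diacritics are combining marks (non-zero combining class).
    return unicodedata.combining(ch) != 0


def split_letters(text):
    """Split text into base letters and the diacritic string attached to each."""
    letters, labels = [], []
    for ch in text:
        if is_diacritic(ch) and letters:
            labels[-1] += ch      # attach the mark to the preceding letter
        else:
            letters.append(ch)
            labels.append("")     # no diacritic seen yet for this letter
    return letters, labels


def windows(letters, n=5):
    """One window of n letters per position, centered on the target letter."""
    half = n // 2
    padded = [PAD] * half + list(letters) + [PAD] * half
    return [tuple(padded[i:i + n]) for i in range(len(letters))]


def train(diacritized_texts, n=5):
    """Store (window, diacritic label) instances from diacritized text."""
    instances = []
    for text in diacritized_texts:
        letters, labels = split_letters(text)
        instances.extend(zip(windows(letters, n), labels))
    return instances


def hamming(a, b):
    # Number of window positions at which the two letter contexts differ.
    return sum(x != y for x, y in zip(a, b))


def restore(plain_text, instances, n=5):
    """Give each letter the diacritic of its nearest stored instance (k=1)."""
    letters = list(plain_text)
    out = []
    for letter, win in zip(letters, windows(letters, n)):
        _, label = min(instances, key=lambda inst: hamming(inst[0], win))
        out.append(letter + label)
    return "".join(out)


def der(reference, hypothesis):
    """Fraction of character positions whose diacritic label is wrong."""
    _, ref = split_letters(reference)
    _, hyp = split_letters(hypothesis)
    errors = sum(r != h for r, h in zip(ref, hyp))
    return errors / len(ref)
```

A call such as `restore(plain, train(corpus))` would return the input with diacritics copied from the closest training contexts, and `der(reference, restored)` would score the result against a diacritized reference. A realistic system would need a large diacritized corpus and a faster nearest-neighbor search than this linear scan.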
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.