Sindhi Diacritics Restoration by Letter Level Learning Approach

J. A. MAHAR
G. Q. MEMON
H. SHAIKH

Abstract

Sindhi is one of the languages that require diacritics for accurate reading and comprehension, yet diacritics are largely omitted in everyday writing. This omission introduces many syntactic, morphological and phonological ambiguities for computational processing. Diacritics can be restored at the letter level or at the word level. In this paper, a letter-level learning method is used for Sindhi diacritics restoration: the letters surrounding each target letter are collected and stored in a feature vector, which is then compared with new examples taken from non-diacritized input text. The surrounding letters are computed with different window sizes, and N=5 is observed to be the most efficient. A k-nearest neighbor classifier is used to classify the instances, and the nearest instance is finally used to replace the non-diacritized letter. Results are reported in terms of Diacritic Error Rate (DER), which is 1.9%. The proposed approach is tested on Sindhi but can be used for other Arabic-script-based languages, because the Sindhi character set is a superset of the Arabic character set.
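The letter-level idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It rests on several assumptions that the abstract does not confirm: the window of N=5 is read as five letters centered on the target letter, the nearest instance is found with a simple positional mismatch count (k=1), diacritics are identified as Unicode nonspacing marks, and DER is taken as the fraction of letters whose restored diacritics differ from the reference. The names `split_letters`, `window`, `train`, `restore`, `diacritic_error_rate` and `PAD` are hypothetical.

```python
import unicodedata

PAD = "_"  # hypothetical padding symbol for positions outside the text


def split_letters(text):
    """Split text into [base_letter, diacritics] pairs.

    Assumption: a diacritic is any Unicode nonspacing mark (category Mn),
    which covers the Arabic-script vowel marks.
    """
    letters = []
    for ch in text:
        if unicodedata.category(ch) == "Mn" and letters:
            letters[-1][1] += ch        # attach the mark to the previous letter
        else:
            letters.append([ch, ""])    # start a new base letter
    return letters


def window(bases, i, n=5):
    """Feature vector: n base letters centered on position i (n assumed odd)."""
    half = n // 2
    return tuple(
        bases[j] if 0 <= j < len(bases) else PAD
        for j in range(i - half, i + half + 1)
    )


def train(diacritized_text, n=5):
    """Store one (feature vector, diacritics) example per letter."""
    letters = split_letters(diacritized_text)
    bases = [base for base, _ in letters]
    return [(window(bases, i, n), marks) for i, (_, marks) in enumerate(letters)]


def restore(plain_text, examples, n=5):
    """Give each letter the diacritics of its nearest stored example (k=1)."""
    bases = [base for base, _ in split_letters(plain_text)]
    restored = []
    for i, base in enumerate(bases):
        feats = window(bases, i, n)
        # Distance = number of window positions where the letters differ.
        _, marks = min(
            examples,
            key=lambda ex: sum(a != b for a, b in zip(ex[0], feats)),
        )
        restored.append(base + marks)
    return "".join(restored)


def diacritic_error_rate(reference, hypothesis):
    """Fraction of letters whose diacritics differ from the reference."""
    ref = split_letters(reference)
    hyp = split_letters(hypothesis)
    if len(ref) != len(hyp):
        raise ValueError("texts must contain the same number of base letters")
    if not ref:
        return 0.0
    errors = sum(r[1] != h[1] for r, h in zip(ref, hyp))
    return errors / len(ref)


if __name__ == "__main__":
    # Toy Arabic-script string (three letters, each followed by U+064E);
    # it is illustrative only and not taken from the paper's data.
    gold = "\u0643\u064E\u062A\u064E\u0628\u064E"
    plain = "".join(base for base, _ in split_letters(gold))
    restored = restore(plain, train(gold))
    print(diacritic_error_rate(gold, restored))
```

Ties between equally near examples are resolved here by taking the first stored example; the abstract does not say how ties are handled, nor which distance measure or value of k the authors used.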


How to Cite
J. A. MAHAR, G. Q. MEMON, & H. SHAIKH. (2011). Sindhi Diacritics Restoration by Letter Level Learning Approach. Sindh University Research Journal - SURJ (Science Series), 43(2). https://doi.org/10.26692/surj-ss.v43i2.6007