Automatic Diacritics Restoration System for Sindhi
Keywords:
N-grams; Viterbi Algorithm; Diacritics Restoration; Sindhi Language
Abstract
Sindhi is written in an Arabic-based script and, like Arabic, is usually written without diacritics in everyday applications. The absence of diacritics leaves the vowel sounds of a word's character sequence ambiguous, so a single written form may admit several readings. Morphological and lexical ambiguity further complicates the choice of the correct pronunciation in computational systems. To address this problem, this paper presents an improved mechanism that inserts diacritic marks into non-diacritized text by multiplying three N-gram probabilities within the Viterbi algorithm; the word probabilities are computed with unigram, bigram and trigram models. The system achieves a word error rate of 0.71% and a diacritic error rate of 3.21%. Because languages such as Arabic, Urdu and Persian share these characteristics with Sindhi, the proposed system may be similarly useful for those languages.
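The following is a minimal sketch of the kind of decoding the abstract describes, not the authors' implementation. It assumes a tiny invented corpus of Latin-letter toy tokens in which the vowels a, i and u stand in for diacritic marks, add-one smoothing, and log-probabilities in place of a raw product; the names `strip_diacritics`, `train` and `restore` are hypothetical. Each candidate is scored by the product of its unigram, bigram and trigram probabilities, and a Viterbi search over (previous word, current word) states keeps the best-scoring sequence.

```python
import math
from collections import Counter

START = "<s>"

def strip_diacritics(word, marks="aiu"):
    # Toy convention (assumption): a, i, u play the role of diacritic marks.
    return "".join(ch for ch in word if ch not in marks)

def train(sentences):
    """Collect n-gram counts and candidate diacritized forms per bare word."""
    uni, bi, tri = Counter(), Counter(), Counter()
    ctx1, ctx2 = Counter(), Counter()   # context counts for bigram/trigram
    candidates = {}
    for sent in sentences:
        padded = [START, START] + sent
        for i in range(2, len(padded)):
            w2, w1, w = padded[i - 2], padded[i - 1], padded[i]
            uni[w] += 1
            bi[(w1, w)] += 1
            tri[(w2, w1, w)] += 1
            ctx1[w1] += 1
            ctx2[(w2, w1)] += 1
            candidates.setdefault(strip_diacritics(w), set()).add(w)
    return uni, bi, tri, ctx1, ctx2, candidates

def restore(words, model):
    """Viterbi search; each step is scored by P_uni * P_bi * P_tri."""
    uni, bi, tri, ctx1, ctx2, candidates = model
    vocab, total = len(uni), sum(uni.values())

    def log_score(w2, w1, w):
        # Add-one smoothing (an assumption) keeps unseen n-grams non-zero.
        p1 = (uni[w] + 1) / (total + vocab)
        p2 = (bi[(w1, w)] + 1) / (ctx1[w1] + vocab)
        p3 = (tri[(w2, w1, w)] + 1) / (ctx2[(w2, w1)] + vocab)
        return math.log(p1) + math.log(p2) + math.log(p3)

    # State = (previous word, current word) -> (log-probability, best path).
    best = {(START, START): (0.0, [])}
    for word in words:
        # Unknown bare words are passed through unchanged.
        options = sorted(candidates.get(word, {word}))
        new_best = {}
        for (w2, w1), (logp, path) in best.items():
            for w in options:
                cand = logp + log_score(w2, w1, w)
                key = (w1, w)
                if key not in new_best or cand > new_best[key][0]:
                    new_best[key] = (cand, path + [w])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]

# Invented toy corpus: "ktb" is ambiguous between "kataba" and "kutub".
corpus = [
    ["huwa", "kataba", "risala"],
    ["hadhihi", "kutub", "jadida"],
]
model = train(corpus)
print(restore(["hw", "ktb", "rsl"], model))    # ['huwa', 'kataba', 'risala']
print(restore(["hdhh", "ktb", "jdd"], model))  # ['hadhihi', 'kutub', 'jadida']
```

In this sketch the ambiguous form `ktb` is resolved differently in the two inputs purely from its neighbouring words, which is the role the bigram and trigram terms play alongside the unigram term.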


