A Model for Sindhi Text Segmentation into Word Tokens

J. A. MAHAR
H. SHAIKH
G. Q. MEMON

Abstract

A corpus is a prerequisite for conducting experiments on computational linguistic applications for any language. Corpora are generally downloaded from the Internet in different formats, and the downloaded text usually contains word ambiguities that hinder computational processing. In the Sindhi language, two types of ambiguity are commonly observed: compound words typed without an embedded space, and typographical errors. Without correct segmentation of text into word tokens, it is difficult to obtain good results from linguistic applications; tokenization is therefore an essential component of natural language and speech processing applications. This paper presents a new model for correctly segmenting the words of the Sindhi language. The model consists of three layers: layer 1 takes the input text and segments it on white space, layer 2 segments simple and compound words, and layer 3 segments complex words. The tokenizer was tested on 2792 Sindhi words and achieved an accuracy of 91.76%.
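
The three-layer flow described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how each layer decides where to split, so everything beyond the layer order is an assumption made here for illustration: the lexicon lookup, the two-part split in layer 2, the greedy longest-match in layer 3, the function names, and the Latin placeholder words standing in for Sindhi word forms. It is not the authors' implementation.

```python
# Hypothetical toy lexicon; Latin placeholders stand in for Sindhi word forms.
LEXICON = {"sindh", "university", "research", "text", "word", "token"}


def layer1_whitespace(text):
    """Layer 1: take the input text and split it on white space."""
    return text.split()


def layer2_simple_compound(chunk, lexicon):
    """Layer 2 (assumed rule): accept a simple word found in the lexicon,
    or split a compound typed without a space into two lexicon words.
    Returns None when neither case applies."""
    if chunk in lexicon:
        return [chunk]
    for i in range(1, len(chunk)):
        left, right = chunk[:i], chunk[i:]
        if left in lexicon and right in lexicon:
            return [left, right]
    return None


def layer3_complex(chunk, lexicon):
    """Layer 3 (assumed rule): greedy longest-match over the lexicon for
    chunks that layer 2 could not resolve. A chunk that cannot be fully
    covered is returned unchanged rather than guessed at."""
    tokens = []
    start = 0
    while start < len(chunk):
        for end in range(len(chunk), start, -1):
            if chunk[start:end] in lexicon:
                tokens.append(chunk[start:end])
                start = end
                break
        else:
            # No lexicon word starts at this position: leave the chunk intact.
            return [chunk]
    return tokens


def segment(text, lexicon=LEXICON):
    """Run the three layers in order and return the list of word tokens."""
    tokens = []
    for chunk in layer1_whitespace(text):
        result = layer2_simple_compound(chunk, lexicon)
        if result is None:
            result = layer3_complex(chunk, lexicon)
        tokens.extend(result)
    return tokens


print(segment("sindh universityresearch textwordtoken"))
```

In this sketch, a chunk only reaches layer 3 when layer 2 fails, which mirrors the abstract's ordering of simple and compound words before complex ones. A real Sindhi tokenizer would also need to handle the typographical errors the abstract mentions, which this example does not attempt.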

Article Details

How to Cite
J. A. MAHAR, H. SHAIKH, & G. Q. MEMON. (2012). A Model for Sindhi Text Segmentation into Word Tokens. Sindh University Research Journal - SURJ (Science Series), 44(1). Retrieved from https://sujo.usindh.edu.pk/index.php/SURJ/article/view/5668