A Model for Sindhi Text Segmentation into Word Tokens

J. A. MAHAR; H. SHAIKH; G. Q. MEMON

Published: Mar 8, 2012

Keywords:

Sindhi Language; Tokenization; Corpora; Layers; Word Ambiguities

Abstract

A corpus is a prerequisite for conducting experiments with computational linguistic applications in any language. Corpora are generally downloaded from the Internet in different formats. Downloaded corpora usually contain several types of word ambiguity that affect computational processing; in Sindhi, two types are commonly observed: compound words typed without an embedded space, and typographical errors. Without correct segmentation of text into word tokens, it is difficult to obtain good results from linguistic applications. Tokenization is therefore an essential component of natural language and speech processing applications. This paper presents a new model for correctly segmenting the words of the Sindhi language. The model consists of three layers: layer 1 takes the input text and segments it on white space, layer 2 segments simple and compound words, and layer 3 segments complex words. The tokenizer was tested on 2792 Sindhi words and achieved an accuracy of 91.76%.
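The three-layer flow described above can be illustrated with a minimal sketch. The abstract does not specify how layers 2 and 3 work internally, so everything below is an assumption made for illustration: the lexicon lookup, the two-way split in layer 2, the greedy longest-match fallback in layer 3, and all function names are hypothetical and are not the authors' method. Latin placeholder words stand in for Sindhi vocabulary.

```python
from typing import List, Set


def layer1_whitespace(text: str) -> List[str]:
    """Layer 1: take the input text and split it on white space."""
    return text.split()


def layer2_simple_compound(chunk: str, lexicon: Set[str]) -> List[str]:
    """Layer 2 (assumed): accept a known simple word, or split a compound
    typed without a space into exactly two known words.
    Returns an empty list when the chunk cannot be resolved here."""
    if chunk in lexicon:
        return [chunk]
    for i in range(1, len(chunk)):
        left, right = chunk[:i], chunk[i:]
        if left in lexicon and right in lexicon:
            return [left, right]
    return []


def layer3_complex(chunk: str, lexicon: Set[str]) -> List[str]:
    """Layer 3 (assumed): greedy longest-match segmentation for chunks
    with more than two parts. Characters that match no lexicon entry
    are grouped into a single unknown token."""
    tokens: List[str] = []
    unknown = ""
    i = 0
    while i < len(chunk):
        match = ""
        # Try the longest candidate first, then shrink.
        for j in range(len(chunk), i, -1):
            if chunk[i:j] in lexicon:
                match = chunk[i:j]
                break
        if match:
            if unknown:
                tokens.append(unknown)
                unknown = ""
            tokens.append(match)
            i += len(match)
        else:
            unknown += chunk[i]
            i += 1
    if unknown:
        tokens.append(unknown)
    return tokens


def tokenize(text: str, lexicon: Set[str]) -> List[str]:
    """Run each white-space chunk through layer 2, then layer 3 if needed."""
    tokens: List[str] = []
    for chunk in layer1_whitespace(text):
        parts = layer2_simple_compound(chunk, lexicon) or layer3_complex(chunk, lexicon)
        tokens.extend(parts)
    return tokens


# Placeholder lexicon; a real system would use a Sindhi word list.
lexicon = {"ice", "cream", "shop", "book"}
print(tokenize("icecream book icecreamshop", lexicon))
```

With this placeholder lexicon, "icecream" is resolved in layer 2 as a two-part compound, while "icecreamshop" falls through to layer 3 because no two-way split matches. Typographical errors, the second ambiguity the abstract mentions, are not handled by this sketch.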

How to Cite

J. A. MAHAR, H. SHAIKH, & G. Q. MEMON. (2012). A Model for Sindhi Text Segmentation into Word Tokens. Sindh University Research Journal - SURJ (Science Series), 44(1). Retrieved from https://sujo.usindh.edu.pk/index.php/SURJ/article/view/5668

Issue

Vol. 44 No. 1 (2012): Sindh University Research Journal (Science Series) SURJ

Section

Articles
