Issues and Challenges in Sindhi OCR

D. N. HAKRO
I. A. ISMAILI
A. Z. TALIB
Z. BHATTI
G. N. MOJAI

Abstract

Optical Character Recognition (OCR) is the automated reading (recognition) of a written or printed document. Many languages are well served by OCR systems, but OCR is still lacking for the Sindhi language, which has a 5000-year history. OCR for Latin-script languages and for other languages with isolated (non-cursive) characters is comparatively easy to develop, whereas developing an OCR for a cursive language with a large character set, such as Sindhi, is challenging. Sindhi has 52 characters, compared with 28 in Arabic, 32 in Persian and 39 in Urdu. This paper presents the various scripts of the Sindhi language, including very old scripts, and the issues and challenges that the cursive nature and other features of the current standard script pose for Sindhi OCR. The main challenges include cursiveness; a large set of characters to recognize; a greater number of dotted characters, including four-dotted characters; more base-shape groups, in which characters share the same base shape and differ only in the number, placement and orientation of their dots; ambiguity between characters that differ only slightly; Unicode representation; context-sensitive shapes; ligatures; noise; skew; and fonts. We also provide a summary of the issues and challenges in the development of Sindhi OCR. This summary is useful for researchers working on OCR as well as on Sindhi computing.

Article Details

How to Cite
D. N. HAKRO, I. A. ISMAILI, A. Z. TALIB, Z. BHATTI, & G. N. MOJAI. (2014). Issues and Challenges in Sindhi OCR. Sindh University Research Journal - SURJ (Science Series), 46(2). Retrieved from https://sujo.usindh.edu.pk/index.php/SURJ/article/view/5337