Kannada Text Normalization in Source Analysis Phase of Machine Translation System

Abstract: Almost all documents used in text processing applications contain raw, real-world text. Some of the words in raw text appear in non-standard forms, so a text normalizer is needed to convert them into standard, consistent forms. The design of a text normalizer depends on the kind of data and on the application. In a Machine Translation System (MTS), a normalizer is required to categorize raw input text into morpheme-based and non-morpheme-based words and to process the non-morpheme-based words by assigning their respective Parts of Speech (PoS) tags. In this paper, a text normalizer is proposed to normalize Kannada source text in an MTS. The proposed text normalizer is tested on the Enabling Minority Language Engineering (EMILLE) corpus, and nearly 45% to 57% of the input text is filtered during normalization itself.

I. INTRODUCTION
The aim of an MTS is to convert input text from one language, called the source language, into a target language. An MTS has three main phases: i) source analysis, ii) semantic analysis, and iii) target language generation. In the source analysis phase, the raw input text needs to be normalized. In general, raw text consists of a set of paragraphs. These paragraphs need to be split into sentences, and the sentences into words/tokens. Tokens such as punctuation marks, numbers, acronyms, and abbreviations that are present in the raw text need to be extracted and processed during normalization itself. In this context, a text normalizer is proposed to normalize Kannada text (the source language) in the source analysis phase of a machine translation system. The paper is organized as follows. Section II surveys existing tokenizer tools and text normalizers. Section III describes the proposed text normalizer for the Kannada language in a machine translation system. Section IV explains the performance evaluation and result analysis of the proposed text normalizer on the EMILLE corpus. Section V concludes the paper.
II. LITERATURE SURVEY
Text segmentation and tokenization are two important tasks in the normalization of raw text. Many tokenizers are reported in the literature for both Indian and non-Indian languages. Most of these tokenizers treat the space character as the delimiter and split the given text into a set of tokens. These tokenizers work well for both Indian and non-Indian languages. A dedicated tokenizer, the Indic tokenizer [9], is designed specifically for Indian languages. Some limitations are observed in the Indic tokenizer. These limitations are listed below.
• Numbers containing a period, comma, or hyphen are split into separate tokens.
• Abbreviations and acronyms are separated, because the period (.) is treated as a delimiter.
• A digit followed by alphabets, or alphabets followed by a digit, is not split into separate tokens.
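The three limitations listed above can be made concrete with a small regular-expression tokenizer. The following is only a minimal sketch and not the tokenizer proposed in this paper: the names `TOKEN_RE`, `tokenize`, and `ABBREVIATIONS`, the abbreviation entries, and the limit of three characters per acronym part are all assumptions introduced for illustration. Letters are assumed to be Latin or to lie in the Kannada Unicode block.

```python
import re

# Assumed character classes. Explicit ranges are used so that Kannada vowel
# signs and the virama (combining marks) stay inside the word.
LETTER = r"A-Za-z\u0C80-\u0CE5\u0CF1-\u0CF3\u200C\u200D"  # letters, signs, ZWNJ/ZWJ
DIGIT = r"0-9\u0CE6-\u0CEF"                               # ASCII and Kannada digits

# Hypothetical abbreviation lexicon; a real system would need a Kannada list.
ABBREVIATIONS = ("Mr", "Mrs", "Dr", "Prof")
ABBR = "|".join(re.escape(a) for a in sorted(ABBREVIATIONS, key=len, reverse=True))

# Alternatives are tried in order, so the more specific patterns come first.
TOKEN_RE = re.compile(
    rf"""
      (?:{ABBR})\.                       # listed abbreviation kept with its period
    | (?:[{LETTER}]{{1,3}}\.){{2,}}      # dotted acronym such as U.S.A.
    | [{DIGIT}]+(?:[.,\-][{DIGIT}]+)+    # number with internal period, comma, or hyphen
    | [{DIGIT}]+                         # plain number
    | [{LETTER}]+                        # word (run of letters)
    | \S                                 # any other single non-space character
    """,
    re.VERBOSE,
)


def tokenize(text: str) -> list:
    """Return the tokens of `text` in order of appearance."""
    return TOKEN_RE.findall(text)


print(tokenize("Dr. Rao paid 1,250.50 for 25kg on 12-05-2019 in U.S.A."))
```

In this sketch, `1,250.50` and `12-05-2019` each remain one token, `Dr.` and `U.S.A.` keep their periods, and `25kg` is split into `25` and `kg`.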
Detailed descriptions of these limitations, with examples, are given in Table I. To overcome these limitations, a special tokenizer is proposed. The literature shows that text normalizers exist for both Indian and non-Indian languages, but most of them are designed for text-to-speech synthesis applications [10][11][12][13][14][15]. No text normalizer for MTS is reported in the literature, so there is a need to design one. In this paper, a text tokenizer and a text normalizer for the Kannada language in MTS are presented.

III. PROPOSED WORK
The architecture of the proposed text normalizer in the source analysis phase of MTS is shown in Fig. 1. There are six phases in the text normalization process, viz., i) segmentation of text into a set of sentences, ii) splitting of sentences into a set of tokens, iii) assignment of a unique identification number to each token, iv) identification and classification of tokens, v) PoS tagging of non-morpheme-based words, and vi) removal of redundant morpheme-based words. A detailed description of these six phases is given below.
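Phases iii) to vi) can be sketched on an already tokenized sentence as follows. This is an illustrative outline under stated assumptions, not the implementation evaluated in this paper: the tag names `NUM` and `SYM`, the `Token` class, and the helper names are hypothetical, and "removal of redundant morpheme-based words" is read here as dropping repeated occurrences of the same word.

```python
import re
import unicodedata
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Digits optionally joined by period, comma, or hyphen (assumed number format).
NUMBER_RE = re.compile(r"\d+(?:[.,\-]\d+)*")


@dataclass
class Token:
    token_id: int                   # phase iii: unique identification number
    text: str
    pos_tag: Optional[str] = None   # None means a morpheme-based word


def tag_non_morpheme(text: str) -> Optional[str]:
    """Phases iv and v: return a hypothetical PoS tag, or None for a word."""
    if NUMBER_RE.fullmatch(text):
        return "NUM"
    # Unicode categories starting with P (punctuation) or S (symbol).
    if all(unicodedata.category(ch)[0] in "PS" for ch in text):
        return "SYM"
    return None


def normalize_tokens(raw_tokens: List[str]) -> Tuple[List[Token], List[Token]]:
    """Return (tagged non-morpheme tokens, unique morpheme-based words)."""
    tagged, words, seen = [], [], set()
    for token_id, text in enumerate(raw_tokens, start=1):
        tag = tag_non_morpheme(text)
        if tag is not None:
            tagged.append(Token(token_id, text, tag))
        elif text not in seen:      # phase vi: skip repeated words
            seen.add(text)
            words.append(Token(token_id, text))
    return tagged, words


tagged, words = normalize_tokens(["rama", "bought", "25", "books", ",", "rama", "left", "."])
print([(t.token_id, t.text, t.pos_tag) for t in tagged])
print([(w.token_id, w.text) for w in words])
```

Only the tokens in `words` would need to be passed on to morphological analysis; the tagged tokens are already resolved during normalization.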

Segmentation of text into set of sentences:
Sentence segmentation is the process of dividing running text into sentences. In natural language processing applications, sentence boundary disambiguation, i.e., deciding where sentences begin and end, is a major problem. Because the full stop character is also used in abbreviations, acronyms, decimal points, email addresses, etc., a full stop may or may not terminate a sentence. For example, if the full stop is treated as the delimiter, the sentence "Mr. Nuthan went to market." is wrongly split into two sentences: i) "Mr" and ii) "Nuthan went to market". To handle such ambiguities, a rule-based sentence segmentation tool is proposed.
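A rule-based segmenter of this kind can be sketched with two simple rules. The rules, the function name `split_sentences`, and the abbreviation list below are assumptions chosen to cover the ambiguities mentioned above; they are not the actual rule set of the proposed tool.

```python
import re
from typing import List

# Hypothetical abbreviation list; a Kannada lexicon would be needed in practice.
ABBREVIATIONS = {"Mr", "Mrs", "Dr", "Prof"}


def split_sentences(text: str) -> List[str]:
    """Split text at '.', '?' and '!' unless a rule marks the period as internal."""
    sentences, start = [], 0
    for match in re.finditer(r"[.?!]", text):
        end = match.end()
        if match.group() == ".":
            preceding = text[start:match.start()].split()
            last_word = preceding[-1] if preceding else ""
            next_char = text[end:end + 1]
            # Rule 1: a period after a listed abbreviation is not a boundary.
            if last_word in ABBREVIATIONS:
                continue
            # Rule 2: a period immediately followed by a non-space character
            # (decimal point, email address, dotted acronym) is not a boundary.
            if next_char and not next_char.isspace():
                continue
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:                        # keep trailing text without a terminator
        sentences.append(tail)
    return sentences


for sentence in split_sentences(
    "Mr. Nuthan went to market. He paid 12.50 rupees. Mail a.b@example.com today!"
):
    print(sentence)
```

With these rules, "Mr. Nuthan went to market." stays a single sentence. A dotted acronym followed by a space would still be split after its last period, so a fuller rule set would need an acronym rule as well.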

IV. PERFORMANCE EVALUATION AND RESULT ANALYSIS
No standard Kannada data set is publicly available for research purposes. However, the EMILLE corpus is distributed free of cost for use in non-profit-making research. We have chosen 50 and 25 documents from the stories and novels categories of the EMILLE corpus, respectively. These documents contain punctuation marks, numbers, special symbols, words, acronyms, and abbreviations in Kannada. The results obtained by the proposed text normalizer on the chosen Kannada EMILLE documents are shown in Table II. The performance of the proposed text normalizer is calculated using the following formulae.