|
ABSTRACT
Title |
: |
A Novel Approach for English to South Dravidian Language Statistical Machine Translation System |
Authors |
: |
Unnikrishnan P, Antony P J, Dr. Soman K P |
Keywords |
: |
SMT; Dravidian languages; parsing; morphology;
inflections |
Issue Date |
: |
November 2010 |
Abstract |
: |
Developing a full-fledged bilingual machine
translation (MT) system for any two natural languages with
limited electronic resources and tools is a challenging and
demanding task. This paper presents the development of a
statistical machine translation (SMT) system for English to South
Dravidian languages like Malayalam and Kannada by
incorporating syntactic and morphological information. SMT is
a data-oriented statistical framework for translating text
from one natural language to another based on knowledge
extracted from a bilingual corpus. Although there have been efforts
towards building such an English to South Dravidian translation
system, no efficient translation system is available
so far. The first and most important step in SMT is
creating a well-aligned parallel corpus for training the system.
Experimental research shows that the existing methodology for
bilingual parallel corpus creation is not efficient for English to
South Dravidian language SMT. To improve the performance of the
translation system, we have introduced a new approach to
creating the parallel corpus. The main
ideas that we have implemented and shown to be very effective for
English to South Dravidian language SMT are: (i)
reordering the English source sentence according to Dravidian
syntax, (ii) applying root-suffix separation to both English
and Dravidian words, and (iii) using morphological information,
which substantially reduces the corpus size required for training
the system. Because full-fledged parsing and
morphological tools are unavailable for Malayalam and Kannada,
sentence synthesis was done both manually and with an existing
morphological analyzer created by Amrita University. From the
experiments we found that our systems perform well
and achieve very competitive accuracy on small bilingual
corpora. The proposed ideas can be applied directly to other South
Dravidian languages such as Tamil and Telugu with some minor
changes.
|
Page(s) |
: |
2749-2759 |
ISSN |
: |
0975-3397 |
Source |
: |
Vol. 2, Issue 8 |
|
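The first two preprocessing ideas named in the abstract, reordering the English source sentence towards Dravidian word order and root-suffix separation, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names, the tiny suffix list, and the pre-chunked (subject, verb, object) input are hypothetical and are not the paper's actual rules or tools. It relies only on the fact that English is typically subject-verb-object while Malayalam and Kannada are typically subject-object-verb.

```python
# Toy illustration only: the suffix list and the (subject, verb, object)
# chunking are assumptions, not the paper's actual rules or tools.
ENGLISH_SUFFIXES = ("ing", "ed", "s")  # hypothetical, deliberately tiny


def split_root_suffix(word, suffixes=ENGLISH_SUFFIXES):
    """Split a word into root and suffix tokens, e.g. 'books' -> ['book', '+s']."""
    for suffix in suffixes:
        # Require a root of at least three characters so short words stay whole.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return [word[: -len(suffix)], "+" + suffix]
    return [word]


def reorder_svo_to_sov(subject, verb, obj):
    """Move the verb chunk to the end: SVO (English) -> SOV (Dravidian order)."""
    return subject + obj + verb


def preprocess(subject, verb, obj):
    """Reorder the chunks, then apply root-suffix separation to every token."""
    tokens = []
    for word in reorder_svo_to_sov(subject, verb, obj):
        tokens.extend(split_root_suffix(word.lower()))
    return tokens


print(preprocess(["She"], ["reads"], ["books"]))
# ['she', 'book', '+s', 'read', '+s']
```

A real system would obtain the chunks from a syntactic parser and the suffixes from a morphological analyzer rather than from a fixed list; the sketch only shows how the source side of a parallel corpus might look after such preprocessing.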