Science Fair Project Encyclopedia
Traduki
Traduki is an open-source machine translation program, developed with the Lua programming language and released under the GNU General Public License. It is being developed to give free speech and translation to everyone. Traduki means "to translate" in Esperanto.
Development was suspended in mid-2002 but restarted in 2003.
Machine translation is a complex task. The following are preliminary ideas.
Input
Input is the reading of the original English text. It can come from a simple console, GUI, or web interface, or from more complex sources such as OCR, handwriting recognition, or speech recognition.
Tokenization
Tokenization is the division of the text into sentences and of sentences into words and punctuation. The text can be divided into sentences using "!", "?" and "." as separators. However, "." is also used in numbers (e.g. 10.233), abbreviations (e.g. Dr.) and initials (e.g. A. C. Doyle), so not every "." ends a sentence. The punctuation marks ",", ";", ":", quotation marks ("" and »«), () and [] can also be used to separate semi-independent parts of a sentence.
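The splitting rules above can be illustrated with a minimal Python sketch. It is not Traduki's code: the abbreviation list and the function names `split_sentences` and `split_words` are assumptions made for this example, and a real tokenizer would need far more exceptions.

```python
import re

# Hypothetical abbreviation list; a real system would need a much larger one.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "etc."}


def split_sentences(text):
    """Split text at '.', '!' and '?', skipping abbreviations and initials."""
    sentences = []
    current = []
    for token in text.split():
        current.append(token)
        if token.endswith(("!", "?")):
            sentences.append(" ".join(current))
            current = []
        elif token.endswith("."):
            is_abbreviation = token.lower() in ABBREVIATIONS
            is_initial = re.fullmatch(r"[A-Z]\.", token) is not None
            # A number such as 10.233 never reaches this branch,
            # because its token does not end with ".".
            if not (is_abbreviation or is_initial):
                sentences.append(" ".join(current))
                current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


def split_words(sentence):
    """Split a sentence into numbers, words and single punctuation marks."""
    return re.findall(r"\d+(?:\.\d+)*|\w+(?:'\w+)?|[^\w\s]", sentence)
```

In this sketch a sentence that genuinely ends with an abbreviation or an initial is not split, which shows why tokenization needs more than a list of separators.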
The article "What is a word, What is a sentence? Problems of Tokenization" is a good discussion of tokenization problems.
Morphological analysis
Each word must be analyzed to identify derived words. Dictionaries used in machine translation list only root words, not the words derived from them, so derived words must be identified by the program itself. Verb forms and plurals are the most common derived words.
The Natural Language Toolkit project[1] has some Python code that could be reused in Traduki. However, the Natural Language Toolkit is released under the IBM Common Public License 0.5, so it is an open question whether that code can be used in Traduki.
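One simple way to identify derived words is to strip common suffixes and check the result against the root dictionary. The Python sketch below assumes a tiny hypothetical dictionary (`ROOTS`) and a handful of English suffix rules. It covers only regular forms and is meant as an illustration, not as the method Traduki uses.

```python
# Hypothetical root dictionary; only root words are listed.
ROOTS = {"walk", "translate", "boy", "box", "city", "eat"}

# (suffix, replacement, feature) rules, tried in order.
SUFFIX_RULES = [
    ("ies", "y", "plural or third person"),
    ("es", "", "plural or third person"),
    ("s", "", "plural or third person"),
    ("ing", "", "present participle"),
    ("ing", "e", "present participle"),
    ("ed", "", "past"),
    ("ed", "e", "past"),
]


def analyze(word):
    """Return (root, feature) for a word, or None if no root is found."""
    word = word.lower()
    if word in ROOTS:
        return (word, "root")
    for suffix, replacement, feature in SUFFIX_RULES:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            # Accept the analysis only if the candidate is a known root.
            if candidate in ROOTS:
                return (candidate, feature)
    return None
```

Irregular forms such as "ate" or "children" would need an explicit exception table in addition to rules like these.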
Syntactical analysis
Syntactical analysis is the determination of the syntactic function of each word. The program should determine whether a word is, for example, a verb or a noun. This requires a dictionary with the syntactic classification of all root words. WordNet[2] is a good source of data for building an English dictionary.
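As a rough illustration, the dictionary lookup could look like the following Python sketch. The `LEXICON` contents and the `lookup` function are hypothetical; a real dictionary would be built from a source such as WordNet.

```python
# Hypothetical lexicon: each root word maps to its possible syntactic categories.
LEXICON = {
    "the": {"determiner"},
    "fat": {"adjective", "noun"},
    "boy": {"noun"},
    "eat": {"verb"},
    "hamburger": {"noun"},
}


def lookup(root):
    """Return a copy of the set of categories for a root (empty if unknown)."""
    return set(LEXICON.get(root.lower(), ()))
```

A word such as "fat" comes back with more than one category, which is exactly the ambiguity that a later step has to resolve.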
Disambiguation
A word can have more than one syntactic function. For example, "fat" can be an adjective ("The fat boy eats hamburgers") and can be a noun ("Hamburgers have lots of fat"). So, how do we know that "fat" in the sentence "Hamburgers have lots of fat" is a noun? There are two methods:
- Statistical methods use large annotated corpora. An annotated corpus could tell us that "lots of" is always followed by a noun. Traduki should not use this method because all useful annotated corpora are proprietary.
- Constraint Grammar methods use grammar rules to exclude invalid combinations of syntactic functions. For example, "the" is never followed by a verb. There are more than 1000 rules that can be used to disambiguate a sentence.
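The example rule from the list above ("the" is never followed by a verb) can be sketched in Python as follows. The data layout (a list of word and category-set pairs) and the function name are assumptions for this example. The sketch also follows the usual Constraint Grammar convention that a rule never removes the last remaining reading of a word.

```python
def remove_verb_after_the(readings):
    """Apply one rule: a word directly after 'the' cannot be a verb.

    `readings` is a list of (word, set_of_categories) pairs.
    A new list is returned; the input is not modified.
    """
    result = []
    previous_word = None
    for word, categories in readings:
        remaining = set(categories)
        # Never discard the last remaining reading of a word.
        if previous_word == "the" and len(remaining) > 1:
            remaining.discard("verb")
        result.append((word, remaining))
        previous_word = word.lower()
    return result
```

A full system would apply many such rules repeatedly until no rule removes any further reading.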
Semantic Disambiguation
Sometimes, some ambiguity may remain after the application of the methods described above. Semantic information may be used to solve the problem. That's why a good dictionary must have some semantic information. For example, words related to music should be marked as such.
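A minimal Python sketch of this idea follows, assuming each sense and each context word carries a hypothetical domain tag such as "music". The word lists, tags and function name are invented for illustration only.

```python
from collections import Counter

# Hypothetical sense inventory: word -> list of (sense, domain tag).
SENSES = {
    "bass": [("bass (instrument)", "music"), ("bass (fish)", "fishing")],
}

# Hypothetical domain tags for context words.
DOMAINS = {
    "guitar": "music",
    "concert": "music",
    "river": "fishing",
    "boat": "fishing",
}


def choose_sense(word, context_words):
    """Pick the sense whose domain tag is most frequent in the context."""
    senses = SENSES.get(word.lower(), [])
    if not senses:
        return None
    domain_counts = Counter(
        DOMAINS[w.lower()] for w in context_words if w.lower() in DOMAINS
    )
    # On a tie (or with no tagged context), the first listed sense wins.
    return max(senses, key=lambda sense: domain_counts[sense[1]])[0]
```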
Translation to an interlanguage
All the syntactic, morphological and semantic information should be encoded in an interlanguage. All the source language root words should be translated to interlanguage root words. Esperanto is often used as an intermediate language (including in Traduki) because 99% of Esperanto words have only one sense and because Esperanto is already something of an interlanguage.
Ergane, a free-to-use multilingual dictionary that uses Esperanto as an interlanguage, can be useful for Traduki.
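The following Python sketch shows how a root word plus its morphological features might be encoded as an Esperanto word form. It relies on standard Esperanto endings (-o for nouns, -j for plurals, -i, -as, -is and -os for verbs). The stem table and the function `to_esperanto` are assumptions for this example and do not describe Traduki's actual data format.

```python
# Hypothetical English root -> Esperanto stem table.
STEMS = {"boy": "knab", "dog": "hund", "eat": "manĝ", "translate": "traduk"}

# Regular Esperanto verb endings.
VERB_ENDINGS = {"infinitive": "i", "present": "as", "past": "is", "future": "os"}


def to_esperanto(root, category, plural=False, tense="present"):
    """Build an Esperanto word form from an English root and its features."""
    stem = STEMS[root]
    if category == "noun":
        return stem + "o" + ("j" if plural else "")
    if category == "verb":
        return stem + VERB_ENDINGS[tense]
    raise ValueError("unsupported category: " + category)
```

Because Esperanto endings are fully regular, the features identified during morphological analysis map directly onto the word form.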
Destination language synthesis
The synthesis of the destination language from the interlanguage is an easy step. There are, however, some problems:
- a verb conjugator is needed
- a plural generator is needed
- translation from Esperanto to the destination language can be ambiguous because there can be more than one destination word for each Esperanto word. Semantic information from the source text can be used to disambiguate.
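To give an idea of what a plural generator and a verb conjugator involve, here is a small Python sketch. The article does not fix a destination language, so English is used purely as an assumed example, and only regular forms are handled. All function names are hypothetical.

```python
def add_s_suffix(word):
    """Add the regular English -s/-es/-ies suffix to a word."""
    if word.endswith(("s", "x", "z", "ch", "sh")):
        return word + "es"
    if word.endswith("y") and len(word) > 1 and word[-2] not in "aeiou":
        return word[:-1] + "ies"
    return word + "s"


def pluralize(noun):
    """Generate the regular plural of an English noun."""
    return add_s_suffix(noun)


def conjugate_present(verb, third_person_singular=False):
    """Conjugate a regular English verb in the present tense."""
    return add_s_suffix(verb) if third_person_singular else verb
```

Irregular nouns and verbs would need lookup tables, and other destination languages would need their own rules.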
External links and references
Useful resources for the Traduki project
- Traduki page on SourceForge
- Pytalk: English parser and spellchecker
- WordNet - A Lexical Database for English
- official page: http://www.cogsci.princeton.edu/~wn/
- A Python interface to the WordNet lexical database: http://www.cs.brandeis.edu/~steele/sources/python.html
- Natural Language Toolkit
- vortaro
- linguaphile
- The VISL Constraint Grammar Compiler is a natural language parser generator. It is an implementation of Pasi Tapanainen's CG-2 constraint grammar formalism.
- The VISL Phrase Structure Grammar Compiler is an implementation of a parser generator for ambiguous context-free grammars, ambiguous input, and ambiguous output.
Online articles
- A PhD thesis: "The present project has as its goal to incorporate a semantic component into an English Constraint Grammar parser so as to augment parser's performance."
- Should I use machine translation?
- Why Can't a Computer Translate More Like a Person?
- "Types of Semantic Information Necessary in a Machine Translation Lexicon": http://talana.linguist.jussieu.fr/taln99/ps/A77/A77.pdf (PDF file)
Books
- Constraint Grammar : A Language-Independent System for Parsing Unrestricted Text (Natural Language Processing, No 4) ISBN 3110141795
- books
The contents of this article are licensed from www.wikipedia.org under the GNU Free Documentation License.


