For a long time now, machine translation (MT) has been the holy grail of language understanding, ultimately seeking to provide high-quality, idiomatic translation between any pair of languages. Its roots go back to the early days of the Cold War, when the promise of automatic translation led to substantial government sponsorship, and with it, the genesis of NLP itself.
Today, practical translation systems exist for particular pairs of languages, and some are integrated into web search engines. However, these systems have some serious shortcomings. We can explore them with the help of NLTK's "babelizer" (which is automatically loaded when you import this chapter's materials using from nltk.book import *). This program submits a sentence for translation into a specified language, then submits the resulting sentence for translation back into English. It stops after 12 iterations, or if it receives a translation that was produced already (indicating a loop):
>>> babelize_shell()
NLTK Babelizer: type 'help' for a list of commands.
Babel> how long before the next flight to Alice Springs?
Babel> german
Babel> run
0> how long before the next flight to Alice Springs?
1> wie lang vor dem folgenden Flug zu Alice Springs?
2> how long before the following flight to Alice jump?
3> wie lang vor dem folgenden Flug zu Alice springen Sie?
4> how long before the following flight to Alice do you jump?
5> wie lang, bevor der folgende Flug zu Alice tun, Sie springen?
6> how long, before the following flight to Alice does, do you jump?
7> wie lang bevor der folgende Flug zu Alice tut, tun Sie springen?
8> how long before the following flight to Alice does, do you jump?
9> wie lang, bevor der folgende Flug zu Alice tut, tun Sie springen?
10> how long, before the following flight does to Alice, do do you jump?
11> wie lang bevor der folgende Flug zu Alice tut, Sie tun Sprung?
12> how long before the following flight does leap to Alice, does you?
Observe that the system correctly translates Alice Springs from English to German (in the line starting 1>), but on the way back to English, this ends up as Alice jump (line 2). The preposition before is initially translated into the corresponding German preposition vor, but later into the conjunction bevor (line 5). After line 5 the sentences become nonsensical (but notice the various phrasings indicated by the commas, and the change from jump to leap). The translation system did not recognize when a word was part of a proper name, and it misinterpreted the grammatical structure. The grammatical problems are more obvious in the following example. Did John find the pig, or did the pig find John?
>>> babelize_shell()
Babel> The pig that John found looked happy
Babel> german
Babel> run
0> The pig that John found looked happy
1> Das Schwein, das John fand, schaute glücklich
2> The pig, which found John, looked happy
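The round-trip procedure that the babelizer follows (translate out, translate back, stop after 12 iterations or when a translation repeats) can be sketched in a few lines. This is a minimal sketch under stated assumptions, not NLTK's implementation: round_trip, word_for_word, and the two tiny word tables are hypothetical names invented for illustration, and the word-for-word "translators" are deliberately crude stand-ins for a real translation service.

```python
def round_trip(sentence, to_foreign, to_english, max_iterations=12):
    """Translate back and forth until a translation repeats or the limit is hit.

    to_foreign and to_english are hypothetical callables (str -> str) that
    stand in for whatever translation service is available.
    """
    history = [sentence]
    seen = {sentence}
    current = sentence
    for step in range(max_iterations):
        # Even steps go into the foreign language, odd steps come back.
        translator = to_foreign if step % 2 == 0 else to_english
        current = translator(current)
        # Assumption: the repeated translation is recorded before stopping.
        history.append(current)
        if current in seen:
            break  # a translation we have produced before: we are in a loop
        seen.add(current)
    return history


def word_for_word(table):
    """Build a toy translator that replaces each word using a lookup table."""
    def translate(text):
        return " ".join(table.get(word, word) for word in text.split())
    return translate


# Invented, deliberately lossy word tables (illustration only).
EN_TO_DE = {"old": "alte", "springs": "springen", "jump": "springen"}
DE_TO_EN = {"alte": "old", "springen": "jump"}

history = round_trip("old springs",
                     word_for_word(EN_TO_DE),
                     word_for_word(DE_TO_EN))
for number, text in enumerate(history):
    print(f"{number}> {text}")
```

With these toy tables the loop halts at the fourth entry because the German string repeats. By then springs has already drifted to jump, which mirrors what happened to Alice Springs above.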
Machine translation is difficult because a given word can have several possible translations (depending on its meaning), and because word order must be changed in keeping with the grammatical structure of the target language. Today these difficulties are addressed by collecting massive quantities of parallel texts from news and government websites that publish documents in two or more languages. Given a document in German and English, and possibly a bilingual dictionary, we can automatically pair up the sentences, a process called text alignment. Once we have a million or more sentence pairs, we can detect corresponding words and phrases, and build a model that can be used for translating new text.
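To make the idea of detecting corresponding words concrete, here is a minimal sketch that scores word pairs by how often they co-occur in aligned sentence pairs, using the Dice coefficient. This is one simple association measure chosen for illustration, not the method used by any particular translation system. The function names dice_scores and best_match and the four sentence pairs are invented for this example. A real system would work from far more data.

```python
from collections import Counter
from itertools import product


def dice_scores(sentence_pairs):
    """Score each (source word, target word) pair by the Dice coefficient.

    Dice = 2 * (pairs containing both words) /
           (pairs containing the source word + pairs containing the target word)
    """
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src, tgt in sentence_pairs:
        # Use sets so each word is counted at most once per sentence pair.
        src_words = set(src.lower().split())
        tgt_words = set(tgt.lower().split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        pair_count.update(product(src_words, tgt_words))
    return {
        (s, t): 2 * n / (src_count[s] + tgt_count[t])
        for (s, t), n in pair_count.items()
    }


def best_match(word, scores):
    """Return the highest-scoring target word for a source word, or None."""
    candidates = {t: score for (s, t), score in scores.items() if s == word}
    return max(candidates, key=candidates.get, default=None)


# Invented toy parallel corpus, already aligned sentence by sentence.
PAIRS = [
    ("das haus ist klein", "the house is small"),
    ("das haus ist alt", "the house is old"),
    ("das buch ist klein", "the book is small"),
    ("ein buch", "a book"),
]

scores = dice_scores(PAIRS)
for german in ("haus", "buch", "klein"):
    print(german, "->", best_match(german, scores))
```

On this toy corpus the content words haus, buch, and klein pair up with house, book, and small. The function words das and ist always appear together here, so co-occurrence alone cannot tell which of the and is each one corresponds to. That is one reason real systems need large amounts of parallel text.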