WASHINGTON, Oct. 31 (Xinhua) -- American researchers have developed a novel language translation model that works without human annotations or guidance and could lead to fast, efficient computer-based translations of far more languages.
Current translation systems from Google and Facebook have to look for patterns in millions of documents that have been translated into various languages by humans. Those data are difficult to gather and simply may not exist for many of the 7,000 languages spoken worldwide.
In a paper being presented this week at the Conference on Empirical Methods in Natural Language Processing, researchers from the Massachusetts Institute of Technology (MIT) described a model that performed as accurately as state-of-the-art monolingual models but far more quickly, using only a fraction of the computational power.
The model leverages a statistical metric that essentially measures distances between points in one computational space and matches them to similarly distanced points in another space, according to the study.
The researchers applied that technique to the "word embeddings" of two languages. A word embedding represents each word as a vector, an array of numbers, with words of similar meanings clustered closer together.
In doing so, the model quickly aligns the words in the two embeddings whose relative distances correlate most closely, meaning they are likely to be direct translations.
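The idea can be illustrated with a minimal sketch. The toy vectors, word lists and distance-profile matching below are illustrative assumptions, not the paper's actual algorithm: the sketch simply compares the pattern of distances each word has to its neighbors within its own language, then pairs up words whose patterns agree.

```python
import numpy as np

# Toy "embeddings" for three words in two hypothetical languages.
# The absolute positions differ completely, but the relative
# distances between words are the same in both spaces.
en = np.array([[0.0, 0.0],    # "father"
               [1.0, 0.0],    # "mother"
               [5.0, 5.0]])   # "car"
fr = np.array([[10.0, 10.0],  # "pere"
               [10.0, 11.0],  # "mere"
               [15.0, 15.0]]) # "voiture"

def pairwise(X):
    # Intra-space distance matrix: distance between every pair of words.
    return np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

D_en, D_fr = pairwise(en), pairwise(fr)

# Each word's "distance profile": its sorted distances to all words
# in its own language. Matching profiles suggests a translation pair.
profile_en = np.sort(D_en, axis=1)
profile_fr = np.sort(D_fr, axis=1)
cost = np.linalg.norm(profile_en[:, None, :] - profile_fr[None, :, :], axis=-1)
match = cost.argmin(axis=1)
print(match)  # prints [0 1 2]: each English word paired with its counterpart
```

No dictionary or parallel text is used anywhere: only distances measured inside each language's own space drive the matching, which is the core of the unsupervised approach the article describes.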
"If you don't have any data that matches two languages," said the paper's first author David Alvarez-Melis of MIT, "you can map two languages and, using these distance measurements, align them."
For instance, the vector for "father" may fall in completely different areas in two matrices. But vectors for "father" and "mother" will most likely always be close together.
"By looking at distance, and not the absolute positions of vectors, then you can skip the alignment and go directly to matching the correspondences between vectors," said Alvarez-Melis.
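The point about distances versus absolute positions can be checked directly: if one embedding space is a rotated copy of another, every vector lands somewhere different, yet all pairwise distances survive intact. The snippet below is a hypothetical demonstration of that property, not code from the study.

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(4, 3))  # four word vectors in a 3-D space

# Build a random rotation matrix via QR decomposition, then rotate.
# Every absolute position changes completely...
q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
rotated = vectors @ q

def pairwise(X):
    return np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

# ...but every pairwise distance is preserved.
print(np.allclose(pairwise(vectors), pairwise(rotated)))  # prints True
```

This is why the "father"/"mother" relationship stays recognizable even when the two languages' vectors occupy entirely different regions of their spaces.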
Alvarez-Melis called it a "soft translation," "because instead of just returning a single word translation, it tells you 'this vector, or word, has a strong correspondence with this word, or words, in the other language.'"
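A "soft translation" in this sense returns a weighted set of candidates rather than a single word. One common way to turn matching costs into such weights is a softmax; the cost values below are invented for illustration and do not come from the paper.

```python
import numpy as np

# Hypothetical alignment costs between 2 English words and 3 French words
# (lower cost = stronger correspondence; values are illustrative only).
cost = np.array([[0.1, 2.0, 3.0],
                 [2.5, 0.2, 0.4]])

# Softmax over negative cost: each English word gets a probability
# distribution over candidate French words instead of one hard match.
weights = np.exp(-cost)
soft = weights / weights.sum(axis=1, keepdims=True)
print(soft.round(2))  # each row sums to 1; the second word splits its
                      # probability between two plausible candidates
```

Reading a row of `soft` is exactly the statement in the quote: this word has a strong correspondence with this word, or words, in the other language.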
The work represents a step toward fully unsupervised word alignment, one of the major goals of machine translation, according to the researchers.