There are many reasons that current state-of-the-art machine translation (MT) does not live up to its promise. We examine those issues in relation to Google Translate in this article: http://kamusi.org/google_translate.
MT can do better – a lot better. The systems we are developing at Kamusi can produce a quantum leap in translation technology, creating much more accuracy across many more languages than today’s best efforts. The argument against using dictionary data as a launching point for MT always boils down to one thing: we don’t have the data. Kamusi sees the collection of high-quality data about every word in every language as a challenge, not a barrier. As the data comes online, we can put it in the service of MT and other language technologies.
Here are ten ways that Kamusi technology can fundamentally transform machine translation:
Efficiency: Standard statistical machine translation (SMT) and the newly fashionable neural networks (NN) work by comparing large bodies of parallel text between languages. For English to French, NN and SMT need a big set of comparable documents; for English to Spanish, a different set; and for Spanish to French, yet another. Lessons learned for a language from one set do not transfer to another. Kamusi’s approach begins on a per-language basis – we focus on the specific meanings of terms in one language, and match them directly with the comparable term in the translation language. This process reduces the burden of pair-by-pair analysis of large volumes of parallel text, and can bring a pair without a parallel corpus (more than 99.999% of all 25,000,000 language pairs) quite some distance down the path. Instead of searching millions of lines of text to predict possible matches from one language to another, Kamusi performs a dictionary search that cuts directly to sense-specified options for each word. We anticipate that Kamusi’s attention to the lexicon of each language will lead to reduced effort and higher accuracy when integrated with other approaches to MT.
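The scale of the pair-by-pair problem can be checked with simple arithmetic. The sketch below assumes a round figure of 7,000 languages, which is a commonly cited estimate and not a number stated in this article; the function name is illustrative only.

```python
from math import comb


def language_pairs(n_languages: int) -> int:
    """Number of unordered language pairs among n_languages."""
    return comb(n_languages, 2)


# ASSUMPTION: roughly 7,000 languages (a round figure, not from the article).
ASSUMED_LANGUAGES = 7_000

pairs = language_pairs(ASSUMED_LANGUAGES)
print(f"{ASSUMED_LANGUAGES:,} languages -> {pairs:,} pairs")
# A pairwise approach needs one parallel corpus per pair;
# a per-language approach needs one lexicon per language.
print(f"per-language lexicons needed: {ASSUMED_LANGUAGES:,}")
```

Under that assumption the pair count comes to 24,496,500, in line with the roughly 25,000,000 pairs mentioned above, while the number of per-language lexicons grows only linearly.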
Scalability: Adding new languages is entirely modular – any language can be plugged into the system and grown at the pace of its contributors. When a language comes into Kamusi, its terms are immediately available for translation to all other languages. Launching a new language involves about an hour of configuration, upon which all of the project’s tools are available for experts and the crowd to develop and use the data.
Expandability: New terms, senses, and translations can be added as they are discovered, in nearly real time. Imagine that you notice that the Kamusi results for “order” do not include this sense: “A request for something to be made, supplied, or served”. You use a simple tool to submit the new sense. Once a moderator approves it, your sense is immediately published for use by the public or in downstream technologies like MT. Moreover, the concept is put before the contributors for other languages, so it can gain meticulously curated translations virtually overnight.
Clarity: MT has two major chores: vocabulary and grammar. Kamusi is designed to get the vocabulary right every time. A word like “light” is treated as many different concepts – not heavy, not dark, not serious, etc. – each with its own entry. Those entries are each paired with terms for equivalent concepts in other languages. There are therefore clear relations at the level of the concept. This human-cultivated concept set solves part of the problem inherent within MT, knowing that if a particular sense of l-i-g-h-t appears in the source document, it should be translated with a particular term in the target language. It does not eliminate the task of figuring out which sense is intended on the source side; for that, Kamusi is building a pre-disambiguation interface for users to select the original sense from the defined dictionary entries, ranked in relation to computational word sense disambiguation (WSD) techniques. Used in combination, WSD and lexicon-based sense matching can produce precise vocabulary choices that NN and SMT never will.
Elasticity: MT often faces the problem of determining whether consecutive terms are different words, or should be translated as a single unit. For example, is an African fish eagle an African, a fish, and an eagle, or is it one bird with a long name? Kamusi puts party terms, also known as multi-word expressions (MWEs), in the dictionary as independent entries, with defined meanings. These entries are then lexicalized concepts that can be translated across languages. When MT encounters a series of words that appear together as a unit in the dictionary, it can translate the unit rather than the component parts.
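One common way to use such entries is greedy longest-match segmentation. This is a sketch of that general technique and not necessarily how Kamusi does it; the `MWE_ENTRIES` set and the `segment` function are hypothetical names.

```python
# Hypothetical set of multi-word dictionary entries, stored as lowercase tuples.
MWE_ENTRIES = {("african", "fish", "eagle")}


def segment(tokens: list[str], entries: set[tuple[str, ...]] = MWE_ENTRIES) -> list[str]:
    """Group tokens into translation units, preferring the longest dictionary match."""
    max_len = max((len(e) for e in entries), default=1)
    units = []
    i = 0
    while i < len(tokens):
        # Try the longest possible span first, down to two tokens.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + n])
            if candidate in entries:
                units.append(" ".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No multi-word entry starts here: keep the single token.
            units.append(tokens[i])
            i += 1
    return units


print(segment("An African fish eagle circled overhead".split()))
# ['An', 'African fish eagle', 'circled', 'overhead']
```

Each resulting unit can then be looked up as a single concept and not word by word.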
Separability: Many MWEs can be broken apart, which usually breaks NN, and throws SMT entirely off the trail. For example, drive crazy can be separated: Your perfume drives me and my furry pet hamster crazy. Using Kamusi data, we can tell when words in a lexicalized MWE might have been broken apart, and in the future we may be able to predict the range of terms that can go in between.
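Detecting a separated MWE can be sketched as a bounded search for the second part after the first. The entry format, the `max_gap` limit, and the function name below are assumptions made for illustration.

```python
# Hypothetical separable entries: inflected forms of the first part, then the second part.
SEPARABLE = {
    "drive crazy": ({"drive", "drives", "drove", "driven", "driving"}, "crazy"),
}


def find_separated(tokens: list[str], max_gap: int = 8) -> list[tuple[str, list[str]]]:
    """Return (entry, intervening tokens) for each separable MWE found."""
    lowered = [t.lower().strip(".,!?") for t in tokens]
    hits = []
    for entry, (head_forms, tail) in SEPARABLE.items():
        for i, tok in enumerate(lowered):
            if tok not in head_forms:
                continue
            # Look for the tail within max_gap tokens after the head.
            window = lowered[i + 1:i + 2 + max_gap]
            if tail in window:
                j = i + 1 + window.index(tail)
                hits.append((entry, tokens[i + 1:j]))
    return hits


sentence = "Your perfume drives me and my furry pet hamster crazy."
print(find_separated(sentence.split()))
# [('drive crazy', ['me', 'and', 'my', 'furry', 'pet', 'hamster'])]
```

The intervening tokens returned here are the raw material for the harder future task of predicting which terms may legitimately appear in the gap.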
Variability: When it comes to grammar, a term may take many forms. The verb “see”, for example, has the inflections sees, saw, seen, and seeing. In Kamusi, each entry is a container for many types of data, including these variations. When we configure a language, we figure out that language’s categories and forms, and produce customized interfaces to catalog those elements. Those tailored word forms can then be mapped across languages, with conjugations, contractions, and other transformations tied to appropriate translations.
Transitivity: We can predict translations even if we are not sure about them. While human-confirmed translations are our goal, transitive links across concepts are our starting point. If we know based on human confirmation that a term in Language A is equivalent to one in Language B, and the term in Language B is equivalent to one in Language C, then we can have high confidence about the match between Language A and Language C – but we won’t set it in stone until a person who knows A and C can confirm or reject it. In the meantime, the provisional vocabulary postulates can be used within MT, though they should be taken with a grain of salt. In all cases, this method will produce more precise results than the disastrous method contemporary MT employs for going between languages that are not directly paired: statistical guesses to go from Language A to English, followed by another round of statistical guesses from English to Language C.
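The transitive step can be sketched as a one-hop inference over human-confirmed links. The data structures and names below are assumptions for illustration, and the terms are placeholders, not real vocabulary.

```python
# A term is (language, word); a link is a pair of terms confirmed by a human.
Term = tuple[str, str]


def provisional_links(confirmed: list[tuple[Term, Term]]) -> set[frozenset]:
    """Infer A-C links from confirmed A-B and B-C links (one hop only)."""
    confirmed_set = {frozenset(link) for link in confirmed}
    neighbors: dict[Term, set[Term]] = {}
    for x, y in confirmed:
        neighbors.setdefault(x, set()).add(y)
        neighbors.setdefault(y, set()).add(x)

    inferred = set()
    for linked in neighbors.values():
        # Any two terms sharing a confirmed pivot are candidate equivalents.
        for x in linked:
            for y in linked:
                different_language = x[0] != y[0]
                if different_language and frozenset({x, y}) not in confirmed_set:
                    inferred.add(frozenset({x, y}))
    return inferred


confirmed = [
    (("A", "term_a"), ("B", "term_b")),
    (("B", "term_b"), ("C", "term_c")),
]
for link in provisional_links(confirmed):
    print("provisional, awaiting human review:", sorted(link))
```

Inferred links stay in a separate set from confirmed ones, so downstream MT can weight them lower until a speaker of both languages confirms or rejects them.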
Non-equivalence: Sometimes one language has a term that does not exist in another, or is expressed in a very different way. Kamusi has methods for producing explanatory translations when direct equivalents do not exist, and for showing bridges between different modes of expression. This information, never before modeled or documented, can be extracted from Kamusi in ways that are friendly to MT processes.
Topical terminology: Many domains have specialized terms with meanings that differ from daily language. In sailing, for example, beam, beat, bend, and block all have meanings related to boats and wind. Terms in Kamusi can be designated as belonging to terminology sets for particular domains, making it possible to identify the vocabulary that should be preferred for particular documents.
These aspirations for next-generation MT are built into Kamusi’s design. The current state of the art offers translations that range from awful to adequate, depending on the language pair, the complexity of the text, and the user’s expectations. We contend that SMT is reaching the limits of its potential, NN will quickly max out, and radical progress in MT will only come from approaches that focus on how vocabularies and grammars interact, within and across languages. The Kamusi structure is crafted to support the fine-grained data needed for a quantum leap in translation technology, the jump from adequate to excellent. In our effort to produce a global online living dictionary, we have embarked on collecting rich data for many languages – data that will serve as the bedrock of excellent universal translation.