Differences with Wiktionary

People often ask, "What is the difference between Kamusi and other projects that deal with words in many languages?" Here, we answer that question with regard to Wiktionary.

The English Wiktionary describes itself as:

"A collaborative project to produce a free-content multilingual dictionary. It aims to describe all words of all languages using definitions and descriptions in English. ... We aim to include not only the definition of a word, but also enough information to really understand it. Thus etymologies, pronunciations, sample quotations, synonyms, antonyms and translations are included."

These goals are very similar to Kamusi's, and Wiktionary is far ahead of us in terms of the amount of existing information - in fact, we sometimes adopt great definitions from Wiktionary, with attribution. The design and methods of the two projects, though, are radically different, as are some of the goals. We believe that the Kamusi approach will eventually lead to much more comprehensive and accurate dictionaries for many more languages, and that it will be easier for people to use and more helpful for their everyday language needs.

Kamusi-Wiktionary Comparisons:

Design: How entries are organized and how languages are linked

How entries are organized makes an enormous difference in the way they can be used, like the difference between designing a transport system around roads versus rails. Wiktionary is arranged around spellings. Kamusi is organized around concepts. This is the single biggest reason why Wiktionary is incapable of developing into a satisfying multilingual dictionary. Consider a word like spring. On Wiktionary, spring is a single page, with 33 different definitions. On Kamusi, a search for spring will give you a list of terms spelled s-p-r-i-n-g (which should and will be systematically arranged, when the code is working again), but each of those matches is an individual entry.

Since Kamusi's terms are arranged by concept, each individual Kamusi entry can be treated completely independently. Spring the season has a life that is completely separate from spring the water source. Each of those entries is its own container in Kamusi, which can be filled with lots of information, from definitions to synonyms to images to multiple usage examples. Anybody looking for information about the season can find everything we know about that sense of the word on one page, while all our data about the water source is in a location dedicated to the second concept.
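
As a rough illustration of the difference, the concept-based layout can be sketched in a few lines of Python. This is a minimal sketch, not Kamusi's actual schema: the class names, field names, entry IDs, and language codes below are all hypothetical, and the season definition is our own paraphrase.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """One concept. Field names are illustrative assumptions."""
    entry_id: str
    language: str
    spelling: str
    definition: str
    synonyms: list = field(default_factory=list)


class ConceptDictionary:
    """Stores one container per concept; spellings are only a search index."""

    def __init__(self):
        self._by_id = {}
        self._by_spelling = {}

    def add(self, entry):
        self._by_id[entry.entry_id] = entry
        key = (entry.language, entry.spelling)
        self._by_spelling.setdefault(key, []).append(entry.entry_id)

    def search(self, language, spelling):
        """Return every separate entry that shares this spelling."""
        ids = self._by_spelling.get((language, spelling), [])
        return [self._by_id[i] for i in ids]


# Hypothetical demo data: two independent entries with the same spelling.
kamusi_like = ConceptDictionary()
kamusi_like.add(Entry("eng-spring-season", "eng", "spring",
                      "the season between winter and summer"))
kamusi_like.add(Entry("eng-spring-water", "eng", "spring",
                      "a place where water emerges from the ground"))

matches = kamusi_like.search("eng", "spring")
```

Here a search on the spelling returns two separate entries, and anything attached to one of them (synonyms, images, examples) never touches the other.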

This way of organizing Kamusi entries has huge implications for the potential of a multilingual dictionary.

How languages are linked. Because Kamusi organizes entries based on individual concepts, we can match those ideas to terms in other languages, with a great deal of confidence that we are talking about essentially the same thing. If you ask someone who speaks English and French, "What is spring?", chances are they'll tell you "printemps", the season. When you clarify, "I mean, the source of water," they will then reply with the French word "source". Wiktionary is a bit better - below all of the definitions for the different senses in English you will find summaries of those meanings, and clicking on the "translations" link there will bring you to a list of languages and a term in each of those languages that is said to match that idea. In Kamusi, you are always only one click away from having an assured translation; if we have paired "spring" and "printemps", then you know at a glance that you are dealing with roughly the same concept in both languages.

We are able to do a number of things with the Kamusi linking methodology that are simply not possible with Wiktionary:

  1. The language chain. If we have a link between English spring and French source, and then we add in a Japanese equivalent matched to the English (いずみ, 泉, izumi), then presto, we have a link between Japanese and French. If we add a link between the Japanese and Zulu, we can also immediately show Zulu-French and Zulu-English. The more languages that link together, the longer the chain - with a fair amount of reliability, because we are basing our links on the underlying concept, not some coincidence of spelling. In Wiktionary, where translations for individual senses of a spelling can be added with no quality checks and almost no way to revise later, the set of translations is not carried over to the other languages. Fortunately, this means that an erroneous Wiktionary claim that the season spring is masika in Swahili does not get automatically propagated to French printemps, but it also means that people looking up the concept in French do not have access to the perfectly valid Romanian primăvară, and people coming from the Swahili or Romanian Wiktionaries have no translations of the concept in any language.
  2. Degrees of separation. When Language A is linked to Language B is linked to Language C, and on, it is too easy to introduce noise into the language chain, like the game of telephone - how confident can you be in the relationship between Language A and Language G? Kamusi resolves this problem by showing the distance in the chain between concepts. If two languages are directly linked, they are one degree apart. If they share an intermediary language (printemps is linked to English and primăvară is also linked to English), then they are separated by two degrees. Kamusi can show a chain of any length, which allows readers to make their own assessments of the likely value of the translation. They can also easily confirm a direct translation link between terms in two languages they know, making those terms a first degree pair. Wiktionary has no similar system or intent.
  3. Degrees of equivalence. There is a dirty little secret in the dictionary world: concepts in one language often do not line up neatly with those in another. Sometimes, the alignment is perfect: English spring and French printemps and Romanian primăvară all refer to the same warming season. Often, the alignment is only approximate. For example, English hand refers to the part of the body from the wrist to the tips of the fingers, while Swahili mkono refers to the part from the fingertips all the way to the shoulder. And many concepts exist in one language but not in another - for example, most African languages do not have a term to match spring because weather patterns in Africa differ from those in Europe. Somehow, a dictionary should show when pairs are close but not exact, and speakers of a language like Swahili will want to understand concepts like spring from other languages (for example, a Tanzanian student who is applying to a US college that refers to a "spring deadline"). Kamusi resolves this problem by showing terms as parallel, similar, or explanatory - hand and mkono are similar, while msimu katika Ulaya baada ya baridi na kabla ya joto is a short, functional Swahili explanation of the English concept, literally "season in Europe after the cold and before the hot". Instead of pretending, like Wiktionary and most other dictionaries do, that ideas are all parallel from one language to the next, Kamusi confronts non-equivalence head-on.
  4. Own-language definitions and definition translations. There are actually many different Wiktionaries, each for its own language. Those Wiktionaries have definitions for their own terms in their own languages. On the other hand, to find a translation of the definition, you have to look up the term in the language you speak, and search to see if an entry has been provided for the term in the original language: first look up primavera in the Italian Wiktionary to find out what it means in Italian, and then look up primavera in the English Wiktionary to find English information about the Italian concept. By contrast, in Kamusi, if you look up primavera in Italian, you will see both the Italian definition, and a translation of that definition into your preferred languages (if that translation has been written). Look up mkono in Swahili, and, if your primary language is English, you can read the definition translation in English and immediately understand why this concept is similar, not parallel, to English hand. This multilingual definition translation layer is unique to Kamusi - lexicographers have no term to describe this feature because it was never previously imagined.
  5. Machine applications. Having this level of nuance and specificity for each sense of each term, it will be possible to use Kamusi data as the basis for many human language technologies and natural language processing applications. For example, you could easily mark the sense of spring that you intend in your English document, and submit it to a translation service with near-certainty that the choice of vocabulary will match your original idea. Wiktionary, on the other hand, adheres to the 19th century design notion that dictionary entries will be read as stories, yielding definitions that would be entirely mysterious when read alone, such as, "Any similar dispersion". Because Wiktionary data for a single spelling is lumped together in an essentially unparsable block, even when the translations are reliable, the embedded information cannot be used for sophisticated downstream technologies.
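
The language chain and degrees of separation described in points 1 and 2 can be pictured as a graph, with concepts as nodes and confirmed translation pairs as edges. The sketch below is only an illustration under that assumption, not Kamusi's implementation: the function name, the node labels, and the language codes are hypothetical, and the links are the examples mentioned above.

```python
from collections import deque


def degrees_of_separation(links, start, goal):
    """Length of the shortest chain of links between two concepts, or None."""
    # Build an undirected graph from the translation pairs.
    graph = {}
    for a, b in links:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    if start not in graph or goal not in graph:
        return None
    if start == goal:
        return 0
    # Breadth-first search finds the shortest chain first.
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in graph[node]:
            if neighbor == goal:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


# Direct (first-degree) links taken from the examples in the text.
LINKS = [
    ("eng:spring (season)", "fra:printemps"),
    ("eng:spring (season)", "ron:primăvară"),
    ("eng:spring (water source)", "fra:source"),
    ("eng:spring (water source)", "jpn:泉"),
]

japanese_french = degrees_of_separation(LINKS, "jpn:泉", "fra:source")
french_romanian = degrees_of_separation(LINKS, "fra:printemps", "ron:primăvară")
season_vs_water = degrees_of_separation(LINKS, "fra:printemps", "fra:source")
```

Because the links attach to concepts rather than spellings, the season and the water source never end up in the same chain, even though both are spelled spring in English. Confirming a direct pair between two second-degree terms would simply add one more edge to the list.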

Methods - Wiki markup and the Wild West, versus Kamusi editing, crowdsourcing, and moderation

Wiktionary and Kamusi both invite public contributions to the growth of the lexicons. The methods for interacting with contributors, though, are notably different.

Wiki markup and the Wild West. All Wiki projects require contributors to understand something called Wiki markup. This markup language requires users to work with a variety of obscure codes in order to make text achieve certain effects, such as formatting or linking. Wiktionary gives a quick introduction to basic markup, and then hopes that contributors will read its 8500-word "Entry layout explained" in order to understand how to write an actual entry. The learning curve is very steep and presents a massive barrier for ordinary people to begin contributing. On the other hand, the system is as open as the Wild West to people who are both clever and mischievous. For example, on 6 February, 2006, an anonymous user decided it would be funny to claim that one sense of English spring is "an erection". As of 6 October, 2014, that spam is not only still in place, but has been expanded upon and propagated to many other corners of the Internet that use Wiktionary as a source. As a worst-case example, in the Malagasy Wiktionary, a robot spammer has proudly generated an ever-expanding collection of gibberish, three-and-a-half million pages and counting, that the organization has been unable to limit or get rid of. In sum, the wiki method makes Wiktionary both difficult to edit and difficult to control.

Kamusi methods: the edit engine, crowdsourcing, and moderation. Kamusi tools are designed to be easy to use and hard to abuse. The primary interface is the edit engine. If you can book a flight online, you can use the edit engine. No markup is required, just filling in information or selecting options - for example, one simply has the option to type the forms of see (sees, seeing, saw, seen) in the boxes configured for a particular part of speech in a particular language (in this case, English verbs), rather than this exact sequence of characters for Wiktionary:

::'''to see''' (''third-person singular simple present'' '''[[sees]]''', ''present participle'' '''[[seeing]]''', ''simple past'' '''[[saw]]''', ''past participle'' '''[[seen]]''')

Some of Kamusi's underlying lexicographic concepts are not intuitive, such as how to write a definition, or the difference between a definition, a translation, and a definition translation, so we provide help text in simple language to explain the various fields. We are also introducing crowdsourcing features, through which we elicit very specific types of data from Kamusi users, such as simple translations of terms from one of the user's languages to another. Crowd members also rate the submissions of other contributors, in a way that allows us to consider data valid when it passes a certain threshold. Finally, contributions from both the Edit Engine and the crowd are subject to review by a moderator who has demonstrated expertise in the language. In this way, we can keep pending submissions hidden until they pass moderator review, we can present crowd-validated data as tentative, and we can have confidence in data that has been approved by a moderator.
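
The three levels of trust described above can be summarized as a small decision rule. The following is a sketch under stated assumptions rather than the actual Kamusi logic: the function name, the +1/-1 vote encoding, and the threshold of 3 are all hypothetical, since the real threshold is not specified here.

```python
def submission_status(crowd_votes, moderator_approved, threshold=3):
    """Classify a submission as 'approved', 'tentative', or 'pending'.

    crowd_votes: list of +1 / -1 ratings from other contributors (assumed encoding).
    threshold: net score needed for crowd validation (assumed value).
    """
    if moderator_approved:
        # Moderator review is what makes the data trusted.
        return "approved"
    if sum(crowd_votes) >= threshold:
        # Crowd-validated data can be shown, but flagged as tentative.
        return "tentative"
    # Everything else stays hidden.
    return "pending"


new_submission = submission_status([], moderator_approved=False)
crowd_validated = submission_status([1, 1, 1, -1, 1], moderator_approved=False)
moderated = submission_status([1], moderator_approved=True)
```

Keeping the moderator check first reflects the idea that expert review outranks any crowd score.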

Outcomes - Random data covering certain languages? Or comprehensive interoperable data in languages big and small?

Wiktionary's results to date have been impressive, with at least 72 languages claiming more than 10,000 pages (including the bogus Malagasy pages mentioned above and many other robot creations for other languages, e.g. the 194,425 articles in Cherokee are mostly pap). Far be it from us to minimize Wiktionary's accomplishments, given that Kamusi's five-year goal is ten thousand parallel terms in 100 languages, and given that we make bold statements about what we can achieve with a slender data set to date and many features still under construction. Yet, Wiktionary appears headed for a plateau where it reaches a steady state of useful data in major languages, without a coherent strategy for addressing languages with few volunteers.

Whether a concept appears in Wiktionary in a given language depends on whether it catches the attention of the language's volunteer editors, while Kamusi is developing systems to harvest all known lexical items (from PanLex and other sources) and to elicit undocumented concepts from each language's speakers. We currently have a priority list of about 20,000 English terms that we use to sequence the concepts we request from people working on other languages (with future programming planned to make a language's queue responsive to priorities in related languages in order to reduce the cultural imperialism inherent in an English-centric approach). By having a list of concepts of general concern that are treated across languages, Kamusi is building a core vocabulary that speakers of any language will know they can use to communicate with speakers of any other.

Further, Wiktionary's data is not designed to be interoperable. That is, Wiktionary does not play well with others, and does not even play nicely from one language to the next within the project. A large obstacle is that the data is unstable; e.g., in April 2014, one sense of spring appears as "6. (countable) A place where water emerges from the ground", but in April of 2012 that same sense was numbered "5" and in April of 2009 it was shown as

"# {{context|countable|lang=en}} A place where [[water]] emerges from the ground."

Because there are no fixed reference points and no fixed fields for indicating different types of information (such as example sentences), there is no way to link permanently to any data element - something like numbering that seems trivial to a human reader is vital for computer systems to work with the data. Nonetheless, because many Wiktionaries contain a lot of good data, some people have written parsers that are able to extract certain types of data for some European languages, for use in natural language processing and other applications. In other words, it is possible for clever systems to snake their ways through some Wiktionaries and extract some information that can be used for other human language technologies, but doing so is difficult and gives limited results.
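
To see why positional numbering makes a poor reference point, consider the small sketch below. It is illustrative only: the sense IDs and the placeholder senses are hypothetical, and only the water-source definition is taken from the example above.

```python
WATER = "A place where water emerges from the ground"


def position_of(senses, definition):
    """1-based position of a definition in an ordered list of (id, text) pairs."""
    return [text for _, text in senses].index(definition) + 1


def find_by_id(senses, sense_id):
    """Look a sense up by a fixed identifier instead of by position."""
    return dict(senses)[sense_id]


# Placeholder senses stand in for the real ones; the IDs are invented.
older = [("s%d" % i, "placeholder sense %d" % i) for i in range(1, 5)]
older.append(("water-source", WATER))

# A later revision inserts one new sense above the water-source sense.
newer = older[:2] + [("s-new", "placeholder sense added later")] + older[2:]

old_number = position_of(older, WATER)
new_number = position_of(newer, WATER)
same_sense = find_by_id(older, "water-source") == find_by_id(newer, "water-source")
```

A link that pointed to "sense 5" silently breaks when the sense becomes number 6, while a link to a fixed identifier still reaches the same data.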

By contrast, the entire Kamusi architecture is geared toward collecting comprehensive data for languages big and small, and making it interoperable. One approach is to mine data from existing sources, such as old bilingual dictionaries, and match the translation senses to entries within Kamusi. Another approach is the crowdsourcing we talked about above, which will include games and social activities to encourage widespread participation. Rather than stating that the project is open to anyone, and then seeing what happens, Kamusi is committed to working with teams that emerge for many languages, and to recruiting volunteers and professionals for many more. Crucially, we insist that data must be confirmed by a qualified moderator for a language before we advance to the point where we can declare it fully trustworthy. The data that emerges is designed to interact extremely well with other language technology projects in order to produce better tools and resources not just for languages like English and French, but for undersupported, minority and endangered languages worldwide.

Wiktionary and Kamusi both have the goal of producing data for all words in all languages. Wiktionary is farther down that stream, but, because of core differences in design and methods, we propose that Kamusi - with a framework that re-engineers the dictionary in part by learning from the mistakes of Wiktionary and other projects - is the ship that will prove more seaworthy in the long voyage toward that goal.