Big Data Beta

Archived Page

This is a page from the Kamusi archives. The information below may be out of date, and the links may no longer be valid. Please visit kamusi.org for current information. If you know of links or information on this page that can be updated, please let us know.

1.2 Million New Records. Good, but not good enough.

Now you can find your words with confidence across a large and growing collection of languages.

We are extremely excited to announce a big leap in the language resources we offer you from the Kamusi Project – 1.2 million new records that link over 100,000 concepts among more than 20 languages. This is a stepping-stone, not an end point. Let us explain what we've done, and where this puts us on the path to completely new possibilities in building language knowledge and advanced communication tools for you.

What's New: The data we just imported comes from the Open Multilingual Wordnet (OMW), which provides the results of the tremendous efforts of numerous people to produce equivalents in their languages for the concepts in the path-breaking Princeton Wordnet of English (PWN). Here is what we have done that is new:

  1. Improved how data is aligned among languages. PWN shows how languages match with English, and OMW shows how they match with each other through the assumed accuracy of the English bridge. We show the degree of separation among terms, so that you can know whether a translation has been derived by a computer or confirmed directly by a human, among 210 language pairs*.
  2. Split the data into individual terms. Wordnet is organized by "synsets", clusters of words that have similar meanings, such as {cloth, fabric, material, textile}. Clustering is helpful for making broad comparisons among words, but defeats the ability to treat the nuances between words. We show the clusters in English and we show linked synsets in other languages, but we also make it possible to view extended data for each term in its own right.
  3. Built user-focused search. PWN has a buried English-only search tool that relies on your back-arrow to move among entries, and presents results in a way that is impenetrable to many users. The OMW search tool leads to full information only if you know to click on the synset reference number after conducting a primary search. In Kamusi, search brings you directly to all relevant information for results in one language and for translations with all the other languages in our system. Further, you can move easily from a search result to the detailed entry for a linked term, and then onward to intra-language contextual searches, which is not possible in other systems.
  4. Made the data editable. Wordnet data is very basic – the canonical form of a word, a definition and perhaps an example (in English), and information about relationships among synsets. The true potential of the Kamusi data model will be realized through the accumulation of rich data for each entry, such as word forms (see/ sees/ seen/ saw/ seeing), regional pronunciations, history, usage examples, illustrations, video clips, and more, described at kamusi.org/molecular_lexicography. We have systems for specialists and the general public to add much more information about each term.
  5. Made the data linkable at the sense/ spelling intersection. In Wordnet-RDF, you can link to a synset, which is the general sense of a cluster of words. Conversely, in most electronic dictionaries, you can link to the headword, which is often a cluster of different senses that happen to share a spelling. With Kamusi, you can pinpoint each exact sense of a particular spelling. Down the road, this will offer numerous advantages for bridging among language technology activities. Linkability also means that we can readily bring in data from the many other projects that link to PWN, such as Wordnets for other languages that have not yet opened themselves to OMW.
  6. Made the data expandable to more languages. Our goal is every word in every language. The Wordnet data provides a set of 100,000 concepts, for which Kamusi has developed systems to elicit equivalent terms and extended data across all 7000 known languages**.
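Items 2 and 5 above can be pictured with a small sketch. This is only an illustration, not Kamusi's actual data model: the `TermEntry` class, the `synset_id` field, and the identifier `concept-0001` are hypothetical names invented here. It shows one synset being split into individually addressable entries, each sitting at the intersection of one spelling and one sense, and it counts unordered language pairs. If the 210 pairs mentioned in item 1 are unordered pairs, that figure matches 21 languages (21 × 20 / 2); that reading is an assumption.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class TermEntry:
    """One spelling in one sense: the unit that can be linked and edited."""
    language: str
    spelling: str
    synset_id: str  # hypothetical identifier for the shared concept


def split_synset(synset_id, language, spellings):
    """Turn one cluster of near-synonyms into individual term entries."""
    return [TermEntry(language, s, synset_id) for s in spellings]


def language_pairs(languages):
    """Return every unordered pair of languages that could be aligned."""
    return list(combinations(sorted(languages), 2))


# The example synset from the text, split into four separate entries.
entries = split_synset(
    "concept-0001", "eng", ["cloth", "fabric", "material", "textile"]
)
for entry in entries:
    print(entry)

# Unordered pairs grow as n * (n - 1) / 2.
placeholder_languages = [f"lang{i:02d}" for i in range(21)]
print(len(language_pairs(placeholder_languages)))
```

Because each entry carries both its own spelling and the shared concept identifier, extended data can attach to "fabric" alone without being applied to "cloth" or "textile".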

What's Needed: Wordnet was never intended to be a definitive dictionary, for English or any other language. The data is excellent seed, but there is room for growth and improvement in many directions:

A poor WordNet working definition leads to wrong translation matches (above, what lawyers do rather than an office of lawyers, in 3 languages). Kamusi enables improved definitions, and subsequent updates in linked languages.
  1. Improved English definitions. Many PWN definitions are quick sketches that give the broad idea, but are far from polished lexicographical gems. For example, an elevator car is defined as "where passengers ride up and down". Such definitions are functional for working purposes, but eventually need to be upgraded. Kamusi has designed a game within Facebook, which we will launch soon, that challenges players to zero in on great definitions for each Wordnet sense.
  2. Definitions in other languages. The English definitions for equivalents of the 73,350 terms we have in Thai are only useful to those Thai speakers who also have a good command of English. What Thai speakers would really benefit from are Thai definitions of Thai words, while Basque speakers want Basque definitions of Basque words, and Persian speakers need Persian definitions of Persian words. We will soon open KamusiGames to players to produce definitions for the terms that we have in each language.
  3. Additional information about each word. Now that we have a good set of base forms, we can use our games, our mobile app, and our expert interface to collect much more detailed information for each term.
  4. More languages. Many more languages. With a large set of concepts now specified, we will soon be open for contributions of parallel terms for thousands of languages, from Afrikaans to Zulu.
  5. More Wordnets. For various reasons, Wordnets have been produced for many languages that are not included in OMW (http://globalwordnet.org/wordnets-in-the-world). We hope that the developers of those projects that currently have copyright restrictions are inspired to contribute their data to the unified system, and that we can find the funding resources to import the data that is already available with an open license.
  6. Wordnet relational data. PWN has a lot of information about relationships among synsets that we have not yet imported, along the lines that a ship is a type of vessel, or that a horn is a part of a car. Similar information is available for some other languages. However, we did not want to overly complicate our import process, so we saved this task for a later day when we can invest the time to do it right. Look forward to seeing this as part of our as-yet-unfunded transition from a relational database to a more efficient graph database structure.
  7. Inter-language confirmations beyond English. Because all of the other Wordnets are matched to PWN by bilingual people, we can say with confidence that terms in one language are roughly equivalent to the words in the English synset with which they are matched. However, we can only predict relationships between terms in other languages at a second-degree level of confidence. For example, both Finnish ajoneuvo and Spanish turismo link to English motorcar, so Kamusi posits a link, but a Finnish-Spanish bilingual might reject the bridge because the former is any sort of wheeled vehicle while the latter is specifically a family automobile. As bilingual speakers join the project for diverse language pairs, we have the ability to move from speculation to confirmation of proposed translations for many thousands of language combinations.
  8. Merging with other Kamusi data. Before the Wordnet import, we already had about 60,000 English terms that are linked to equivalents in Swahili or other languages. We are on the cusp of bringing in tens of thousands of terms for Vietnamese, have large data sets ready to import for several African languages, and anticipate merging existing data sets from other sources for many more languages. We have systems to merge items that are the same concept without mixing up other senses, such as merging the Wordnet-derived entry for a vessel for transportation with the Swahili-linked entry for the same idea from the original Kamusi, and separately aligning and merging the entries for a vessel for food or a vessel for blood.
  9. Difference explanations. There are usually subtle differences between terms that are clustered within a synset, which is why a language has more than a single term to begin with. For example, what is the difference between a talk and a lecture? We will soon introduce a field for explaining such nuances, for conceptual differences between terms that are equated within and between languages.
  10. Fewer words. (a) There are many words in PWN that have no equivalent or are of no interest to speakers of other languages, such as the term snick as it pertains to the game of cricket. Our system for adding terms across languages will soon include programming that recognizes when people consistently skip a term, and lowers the priority with which we present unpopular items to other communities. (b) Some of the words in OMW are simply wrong. The French data seems particularly filled with errors, such as {aimer, amoureux, amour, proche} instead of "zéro" for the tennis score love. Kamusi enables bad data to be fixed or eliminated, but, because so many external language projects build from Wordnet, breadcrumbs must be left pointing to the original, erroneous data.
  11. More words, more senses. Now that we have the Wordnet data in front of us, we can see what is missing. For example, the sense of light as a traffic signal is absent from Wordnet, as are hundreds of thousands of other words, such as intergovernmental. We have methods such as corpus analysis to locate those missing terms and senses, for English and for other languages, that we can now deploy to expand the global lexicon beyond the concepts that have already been identified in Wordnet.
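The bridging described in item 7 can be sketched as follows. This is a minimal illustration under stated assumptions, not Kamusi's implementation: the function names `propose_bridges` and `review`, the status labels, and the choice to number a confirmed link as degree 1 and an inferred link as degree 2 are all hypothetical. Only the Finnish and Spanish example terms come from the text.

```python
from collections import defaultdict

# Links to English made by bilingual people (treated here as degree 1).
links_to_english = [
    ("fin", "ajoneuvo", "motorcar"),
    ("spa", "turismo", "motorcar"),
]


def propose_bridges(links):
    """Pair terms from different languages that share an English pivot."""
    by_pivot = defaultdict(list)
    for language, term, english in links:
        by_pivot[english].append((language, term))

    proposals = {}
    for english, members in by_pivot.items():
        for i, first in enumerate(members):
            for second in members[i + 1:]:
                if first[0] == second[0]:
                    continue  # same language: not a translation pair
                proposals[frozenset((first, second))] = {
                    "via": english,
                    "degree": 2,  # inferred by computer through English
                    "status": "proposed",
                }
    return proposals


def review(proposals, first, second, accepted):
    """Record a bilingual speaker's verdict on one proposed bridge."""
    record = proposals[frozenset((first, second))]
    if accepted:
        record["degree"] = 1  # now confirmed directly by a human
        record["status"] = "confirmed"
    else:
        record["status"] = "rejected"
    return record


proposals = propose_bridges(links_to_english)
verdict = review(
    proposals, ("fin", "ajoneuvo"), ("spa", "turismo"), accepted=False
)
print(verdict)
```

Keeping the pivot word in each record means a rejected bridge can still be traced back to the English synset that suggested it.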

In sum, we are pleased to present you with this new and unique resource, but we cannot rest here. In 2013 we demonstrated our proof-of-concept with 100 parallel terms in 20 languages. The million+ Big Data beta we now announce shows the system can scale. As we reach toward our long-term goal, every word in every language, this new data opens the gates for much more work ahead. We do not have nearly all the information you need about the terms we have already imported. We do not have a fully comprehensive set of terms for English or any of the other languages we are making available today. And we are still missing most of the world's languages; we've now harvested generously available open, digitized data for many developed languages, but seek to gather similar knowledge for thousands of languages spoken by billions of people. With your support, we look forward to using the systems we have developed to fill in all of these gaps – and meanwhile, we invite you to explore the window into the world's languages that Kamusi now opens for you.


* The data is more complete for some languages than for others, e.g. we have about 130,000 terms in Finnish, but only 4,500 in Danish. We are not including our 61,000 Swahili terms in the count because they have not yet been aligned with the other data, nor our pilot data in 15 additional languages.

** We have designed methods to accommodate languages that are spoken but do not have a formal or agreed writing system. Devising methods to index the various sign languages developed by deaf communities around the world is high on our agenda.