Every Word in Every Language

Our motto is our goal, "Every Word in Every Language". Completely fulfilling this mission is impossible, but it sets the target toward which we aim. We aspire to merge all that is known and all that is knowable about language, in one data system that is free to everyone.

More than 7000 different languages are spoken around the world. Some languages have hundreds of millions of speakers and many dialects. Hundreds of endangered languages, meanwhile, might disappear in a generation. Fully documenting every expression from every language variation would require thousands of researchers and decades of work.

At Kamusi, we concentrate on ways to collect as much data for as many languages as possible, arrange that data precisely, and put the information at the service of students, the public, and the technology we use to communicate. We have games and other tools for people worldwide to systematically share what they know about their language, and we have a global partner network for experts to contribute their knowledge. We also integrate existing datasets whenever possible, though most languages do not have any meaningful available digital resources.

The Kamusi architecture supports, in principle, a complete matrix of human expression across time and space. This is a mission to Mars, an ambition to be attempted by pulling together people and resources in a concerted international collaboration. It is a hard, hard road ahead. We seek every word in every language, but we cannot even really say what is a word or what is a language, much less how we can pay for it all. However, we have the core systems in place, or ready to release when we can afford to manage the data influx, and additional components that are fully specified and just need funds to implement.

Please join us in turning our Quixotic quest into a reality that creates and unites as much information as possible for the world's languages! We greatly appreciate your financial contributions, and eagerly anticipate your participation as we roll out our language games to reach for the stars, one word at a time, toward every word in every language.

/info/every_word

Navigating Kamusi

KamusiGOLD.org is different from other websites, with a big design to serve you a lot of information. Watch our 52-second slideshow for an insider's guide to discovering everything you need!

/info/navigation

Kamusi GOLD Languages

Privacy | Terms of Service | Copyright | Sources | Contact

The Kamusi 🌎 Global Online Living Dictionary is bringing together millions of terms in dozens of languages. As each language is added to our new graph 🔢 database system, the search box above makes it available to you. These are the languages for which we have datasets that we are actively working toward putting online. Languages that are Active for you to search are marked with "A" in the list below.

Key

•A = Active language, aligned and searchable
•c = Data 🔢 elicited through the Comparative African Word List
•d = Data from independent sources that Kamusi participants align playing 🐥📊 DUCKS
•e = Data from the 🎮 games you can play on 😂🌎🤖 EmojiWorldBot
•P = Pending language, data in queue for alignment
•w = Data from 🔠🕸 WordNet teams

Select a language marked with "I" to read Info about it. A = Active, P = Pending, Ø = InfoBox not yet created.

We have additional 🔢 datasets for many more languages, including dozens of 👅🔫 endangered and minority languages in Africa, Asia, Australia, and both North and South America, as well as methods to work with communities to bring in hundreds more. However, without funding, we cannot promise forward motion on those languages. We will add them to the Pending or Active lists when we have the resources in sight to move them to the action list. /info/gold_languages_info

Software and Systems

Kamusi is all about sharing. The 🔢 data we have, we share with you, and in return we ask that you share with us what you know about your language. We are actively creating new software for you to make use of and contribute to the 🎓 knowledge we are bringing together. Learn about software that is ready for you to download or in development, and the unique data systems we are putting in place for advanced language learning and technology:



/info/software_info

Articles and Information

Kamusi has many elements. With these articles, you can read the details that interest you:


/info/articles_info

Themes

Our homepage has information about everything we are working on. To help guide you to the topics most important to you, the information is clustered loosely around common themes that you can glide to from the list below.

Videos and Slideshows

Some of what you need to know about Kamusi can best be understood visually. Our 📽 videos are not professional, but we hope you find them useful:



/info/videos_info

Partners

Our work is made possible through partnerships with extraordinary individuals and organizations around the 🌍 world who are devoted to building resources for the 🎓 knowledge of language and the technology to make language serve society. We have many active partners, and dozens who are committed to joining the Human Languages Project when we can gather major support. Our partners - past, present, and future - include:

/info/partners_info

Hack Kamusi

We have a long set of coding and 🔢 data tasks that will result in a range of tools to use language in ways that have never before been possible. Many of these tasks can be implemented immediately by talented coders. We welcome volunteer hackers, or students seeking exciting research projects, to contact us to discuss how you can jump in.

Of course, much of our work would go much more quickly and reliably if we could pay professional developers. Please help us locate 💛😇 GOLD Angels who can support infrastructure development.

Here are some of the work elements you can help do or fund:



/info/plans_info

Theory of Kamusi

Kamusi is different from other dictionaries you have encountered. Select a link below to learn about the principles that guide the project's unique approach to lexicography and public service.


/info/theory_info

Commentary

Discussion items about language, technology, and society, from the Kamusi editor and others. This box is growing. To help develop or fund the project, please contact us!

/info/commentary_info

Press Coverage

Kamusi in the news: Reports by journalists and bloggers about our work in newspapers, television, radio, and online.



/info/press_info

Frequently Asked Questions

Answers to general questions you might have about Kamusi services.

We are building this page around real questions from members of the Kamusi community. Send us a question that you think will help other visitors to the site, and frequently we will place the answer here.

Try it: Ask a "FAQ"!

/info/faq_info

GOLD Theme: Financing

Help us find a visionary financier!

American foundations do not invest in languages other than English. Europe only funds languages with potential for European businesses. Kamusi is developing services for the billions of 👪 people who do not speak those languages - but still deserve equitable access to 🎓 knowledge and the tools for prosperity. We have a well-developed network of partners, tools, and techniques. Can you help us reach out to individuals or organizations with the financial muscle to push our services forward, and the interest in doing so? Please contact us!

/info/visionary

💛😇 GOLD Angels for Languages

To provide our services to the public for 🆓 free, we need support to maintain, serve, and develop our resources. We are now seeking 💛😇 GOLD Angels for each language - local companies, agencies, or philanthropists who can underwrite the monthly cost of providing Kamusi to their language community. These supporters will be recognized on the information and results pages for the languages they sponsor. For example, sponsor Chinese, and you will be acknowledged every time we answer a Chinese search! If you can sponsor a language, or know an organization that might, please contact us.

/info/gold_sponsors

Startup Ideology

In Europe, startups have seized the imagination of policy-makers as the source of innovation and progress. But what happens if a "startup" is a non-profit organization that is devoted to building resources for the public good? If a library had to have a business plan, would we have any libraries?

As an example, the Swiss Commission for Technology and Innovation (CTI) offers the "Swiss Technology Award" at the "Swiss Innovation Forum". The award has been billed as "Switzerland's leading technology prize" for 28 years, and CTI says:

This coveted distinction, awarded in the three categories of «Inventors», «Start-ups» and «Innovation Leaders», crowns outstanding innovations and developments that exhibit higher-than-average market potential and major opportunities for growth.
Obviously, the profit motive spurs many great technological advances. However, the equation of INNOVATION = MARKETABILITY is false and dangerous. The 🌍 world faces real problems where the profitable solution is not the best solution, or may even be the cause. For example, we are overheating our planet because it is cheaper to pump fossil fuels out of the ground than to install a renewable energy infrastructure. The ecosystem that supports startups emphasizes finding a "unique selling proposition" that will attract both clients and "business angels" 🏦😇, not finding a unique social need that will improve lives. Case in point, you can see why a startup could make money selling $250 "mass customized, perfectly fitting jeans with online 3D visualization technologies", but nobody would argue that the enterprise contributes to the greater good.

Without a path to profitability, a technology enterprise has no chance to attract the public or private support aimed at startups that could push innovative and important 💭 ideas to fruition. At Kamusi, we face a double-whammy: First, as a non-profit organization, we have no legal way to offer a financial 🎰 return on investment. Second, our obsession with providing services to 👪 people whose languages are marginalized - precisely because they have no money to invest - makes our business plan entirely illogical to a 🌍 world that measures success in terms of market valuation.

In our experience, startup fever has taken all the oxygen in the European funding environment for technology initiatives, leaving no space for innovation for the public good. If your market is Pokemon Go players who can't keep their mobile batteries charged, you can get startup funding. If your market involves language services for 😷 doctors to communicate with pregnant women in Africa? Yeah, no. Not being viable as a startup, we have not found a major funder in Europe who will answer the 📞 call.

Meanwhile, if you are tempted by the 👖 jeans shown above, won't you please consider donating that money to support our hopelessly non-profit language technology innovations instead?

/info/startups

🏇 PonyUp!


When you donate through our website, we get 👛 every penny if you send a 💵 check or make a bank transfer. If you pay by 💳 credit card, PayPal keeps 2.9% + $0.30, and we get the rest. If you prefer, we also have ongoing campaigns through Global Giving and GoFundMe - they keep a larger percentage of your 🎁 donation to fund their costs, but they also provide services that help us reach a larger donor audience.
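
As a rough illustration of the card fee quoted above, here is a minimal sketch of what reaches us from a card donation. The 2.9% + $0.30 figures come from the paragraph above; the function name and the round-to-the-nearest-cent rule are assumptions made for the example, not a statement of how PayPal actually rounds.

```python
from decimal import Decimal, ROUND_HALF_UP

PERCENT_FEE = Decimal("0.029")  # 2.9%, as quoted in the text
FIXED_FEE = Decimal("0.30")     # $0.30 per transaction, as quoted in the text


def net_from_card(donation: str) -> Decimal:
    """Return the amount left after the card fee (hypothetical helper).

    Assumes the fee is rounded to the nearest cent, which is a guess.
    """
    amount = Decimal(donation)
    fee = (amount * PERCENT_FEE + FIXED_FEE).quantize(
        Decimal("0.01"), rounding=ROUND_HALF_UP
    )
    return amount - fee


print(net_from_card("100.00"))  # 96.80
print(net_from_card("10.00"))   # 9.41
```

The fixed $0.30 weighs more heavily on small gifts: about 6% of a $10 donation goes to fees, versus about 3% of a $100 donation.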



/info/ponyup_info

Business Plan: Get Poor Quick

A reader on Quora asks, "What forms the cornerstone of a modern dictionary company's business model?" This is an excellent ❓ question, for which we have no good answer from our experience. We continue struggling to develop a successful business model, and in fact were completely offline for a year because we had a budget of approximately $0. Kamusi is the Swahili word for dictionary, and Kamusi is a non-profit organization dedicated to producing advanced dictionaries for as many of the 🌍 world's languages as possible. Our admittedly impossible aim is rich interlinked 🔢 data for "every word in every language" - in principle, that's the OED multiplied 7000 times over. And here's the kicker: all of the data is to be available to the public for 🆓 free.

As innovative as the dictionary model is, the business model is horrible. There are three basic problems:

  1. The average dictionary consumer cannot or will not pay the costs needed to get 🔢 data into their hands - a student in the Andean highlands would be very glad for a comprehensive resource for her Quechua language on her smartphone, but she's not going to ante up for the service, and nor should she be expected to.
  2. A good dictionary entry takes time to produce, and the only reliable way to get the necessary amount of a specialist's time is to pay them. Money. In our experience, a solid entry takes an average of 6 minutes to produce when you're cruising. So, 10 entries an hour, call it 70 entries a day, you could have a respectable 20,000 terms completed in about a year. How much does it cost to hire a language professional for an hour, or a year? That's an indication of what we need to raise, potentially for 7000 different languages.
  3. Because Kamusi is a non-profit organization (501(c)(3) in the US and similar status in Switzerland), there are many business opportunities we cannot pursue, legally. There is no possibility for outsiders to see a return on investment - no chance for venture capital or startup funding, were someone to see the goal of all the 🌍 world's language 🔢 data as holding a profit-making potential.
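
The arithmetic in point 2 can be checked with a short sketch. The 6 minutes, 70 entries a day, 20,000 terms, and 7000 languages all come from the text; the hourly rate is left as a parameter because the text does not give one, and the sample rate in the last lines is purely hypothetical.

```python
MINUTES_PER_ENTRY = 6     # "an average of 6 minutes"
ENTRIES_PER_DAY = 70      # the text's rounded daily figure
TARGET_TERMS = 20_000     # "a respectable 20,000 terms"
LANGUAGES = 7000          # "potentially for 7000 different languages"

entries_per_hour = 60 // MINUTES_PER_ENTRY                # 10
working_days = TARGET_TERMS / ENTRIES_PER_DAY             # about 285.7
specialist_hours = TARGET_TERMS * MINUTES_PER_ENTRY / 60  # 2000.0


def cost_per_language(hourly_rate: float) -> float:
    """Specialist cost for one 20,000-term dictionary at a given rate."""
    return specialist_hours * hourly_rate


print(entries_per_hour)               # 10
print(round(working_days))            # 286
print(specialist_hours)               # 2000.0
print(specialist_hours * LANGUAGES)   # 14000000.0

# Hypothetical rate, for illustration only; the text names no figure.
SAMPLE_RATE = 20.0
print(cost_per_language(SAMPLE_RATE))  # 40000.0
```

So "about a year" means roughly 286 working days, or 2,000 paid specialist hours per language, before any review or coordination costs.
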
So, we need to pay 👪 people to produce the product, but we cannot charge people to use it. We've developed a few solutions, some of which cannot be implemented until we complete some more programming, and some of which (variations on the theme of asking 👪 people for money) have yet to see a glimmer of success. The overall strategy is to raise money to pay for the production of data. We've got dozens of languages configured and 👪 people trained and standing by who we can pay per-entry when funds are available. Money in, 🔢 data out. Repeat, and add a dash of crowdsourcing. More specifically:
  1. The fastest way to pay for development of a dictionary in a particular language would be a grant, whether from a government agency, a philanthropy, or a 🏦🎁 corporate giving department. Grant funds from both the US and Canadian governments have gotten us to the point we are, but those streams are currently dry (note to self: add "sequester" to the dictionary). Finding grants takes a lot of time and labor, and most granting agencies don't start salivating when you talk about dictionaries. Also, the grants that we're hopeful will become available to us, now that we've proven the Kamusi concept for twenty languages, will probably be for the usual suspects, particularly for European languages from governments that understand the importance of addressing 👅👅👅 multilingualism - but that will leave economically weaker languages out in the cold. If we think of a dictionary as a public 🎓 knowledge resource, like a school or library or museum, then building the resource with education-oriented funds from public or private sources is a perfectly logical business model, if only those funds were available.
  2. We ask the public to join as members, for almost nothing a year. If a thousand 👪 people were to join, we would have some core revenue that would take care of a lot of basic needs. Have you joined? Didn't think so. We could probably make this work if we had a full-time fundraiser, but we don't, and the amount of effort needed to get the stray thirty bucks just doesn't pay.
  3. We sell some things through links on our pages, particularly a unique clock that tells time according to the Swahili system. We used to have Google Adwords as well, but ads really cheapen the look of the site, versus the revenue they generate. These are minor sources of petty cash, useful to pay for things like renewing domain name registrations, not for creating great 🔢 data.
  4. We have planned a system we call "Buy This Word". BTW will enable 👪 people to contribute the amount needed for a specialist to produce as many entries as the donor chooses to sponsor, in the language they wish to support. You could buy your dad 10 Xhosa words for Father's Day, for example, while a Korean bank could support 10,000 entries for their language. When it's developed, this will be a cool system - money in, 🔢 data out. You'll get credit and links to the completed entries you've sponsored, and you'll be able to see the results of your 🏦🎁 donations materialize in real time and chart the number of times those entries are accessed, in perpetuity. Maybe crowdfunding will be the magic bullet for sustainability. At least, we'll give it a try, as soon as we can pay for the coding.
  5. We have also planned a crowdsourcing system we call "Play to Pay". Essentially, we will change our model from 🆓 free to 🆓⭐ free(*), where you're the ⭐: in order to access the data, you'll have to contribute something that you know. There are a zillion ways that this could go bad, so the programming will be very intricate, with independent rating, commenting, and reversioning for just about every element of every entry, and ❓ questions that adjust to the 🎓 knowledge and skills of the particular user. The goal is a scalable, self-regulating system that will produce initial data of useable quality on the cheap, that can then be reviewed by specialists when we have money to pay them. The programming specification is complete, and we could implement in about a month if we had funds to pay the coders.
  6. In the long run, we hope to license the 🔢 data for commercial use. For example, the data model we have designed will enable 🎮 game-changing leaps in machine translation technology. Some 👪 people have a moral aversion to the 💭 idea that a technology or language service company would make money off of a resource they contribute to. We don't. If a big company wants to license the data we produce for, say, German to Portuguese translations, and those license fees help us develop resources for, say, an endangered language 👅🔫 in Namibia, then we're far ahead of where we'd be if we remained pure and untainted.
In fact, the business of producing dictionaries has been far too pure and untainted, at least as far as Kamusi's business model is concerned. The fundamental problem is that our goal is making dictionaries, not making money. Because money is the pathway to language 🔢 data, not the organization's mission, we are constantly banging our heads about how to pay for what we want to accomplish. It's a lousy business model but, lacking an 😇 angel donor, it's the best we've got. /info/business_plan

Your Name Here

To keep Kamusi growing as a "free" knowledge resource for the world's languages, we need the support of generous sponsors. We will be delighted to acknowledge your support for the languages or software you help fund as a 💛😇 GOLD Angel, with a box like this one that provides information about your organization. For example, if you sponsor our work on Armenian, your logo will be displayed every time a user searches for Armenian words, and we will be able to pay our Armenian partners to develop our resources for their language.

For some languages, this could be millions of searches a month. We get to move the project forward, the public gets better and better language resources, and you are gratefully recognized for making it possible.

Win-Win-Win!

Angels can choose the GOLD 💍 ring that fits their 👛 purse: Ruby (name on every result for a sponsored language), Sapphire (name and short message), Emerald (logo), and 💎 Diamond (logo and top positioning). To discuss how we can work together, please contact us!

Note: the logos pictured above are NOT actual Kamusi sponsors. We just nabbed the image off the web to show some samples of logos that could appear in an 😇 Angel infobox 😻. Guess where Kamusi's editor is from?

/info/angel

Pearl Supporters

We greatly appreciate the financial support of those generous Kamusi users who can help keep us free for the many people who cannot afford to contribute to the costs of the project. Building and serving our resources costs money - $1000 per developed-country language every month, which also pays for languages without the financial base to support themselves. To sustain our service for everyone, we need to string together a lot of contributions.

In appreciation of the pearls whose financial support makes our work possible, we gratefully acknowledge community members who donate $100/year or more, if they wish to be named, right on the pages of their preferred language.

To be listed as a pearl, please send us a note telling us your name as you would like it to appear, and the language you are sponsoring. We'll put you on the list of supporters, and we'll put your donation to work right away!

/info/pearls

GOLD Theme: Vocabulary

Words Words Words

What words belong in a dictionary? We want them all - every word, in every language. To reach anywhere close to that goal, we combine a number of strategies:
  1. Aligned open 🔢 data. Terms for about 50 languages have been matched to the English version of 🔠🕸 WordNet. Each English 💭 idea has a number, and each language in WordNet has matched its terms to the same number. We've assembled all those words in Kamusi Here!, but that is just a starting point because (a) most languages don't cover the full 🔠🕸 WordNet, (b) WordNet does not include nearly all concepts or terms in English, (c) every language has its own indigenous concepts that are not included in WordNet, and (d) 🔠🕸 WordNet only includes basic canonical forms, not the range of shapes that comprise a term. Additional data, such as images, that has been aligned to 🔠🕸 WordNet elsewhere can also be automatically imported.
  2. Non-aligned open 🔢 data. Many collections of words are 🆓 freely available, but are not already matched to enumerated concepts. If we have a 🔢 data point that something is equivalent to l-i-g-h-t, is that ⚖ (not heavy), 💡 (not dark), or 😆 (not serious)? We have designed 🐥📊 DUCKS for our visitors to match data they recognize to the data we already have, which both lines up equivalent concepts and reveals concepts we don't yet have, in English and indigenous to other languages.
  3. Legacy dictionaries. Most dictionaries were not conceived as "🔢 data", but rather little bricks of information about their terms. There is no consistency among old dictionaries, or even within one - for example, whether a comma is used to separate synonyms or two different senses. Such dictionaries need to be converted into operable data, either from formats such as Word (if we are 🍀 lucky), or scanned as PDFs and passed through OCR (which frequently fails with smudged 👴📃 old texts or unusual character sets). Moreover, copyrights © need to be respected, so we can only use very old dictionaries or those for which we can take months to find the owners and negotiate permission.
  4. Data 🔢 from 👪 people. Our list of concepts lets us know which 💭 ideas we do not have expressions for in any language. We display that missing data as gold boxes our visitors can fill in, and we can publish that information once it achieves consensus among a 👪🔊 speaker community. Over time, our crowd techniques aim to fill as many gaps as possible in collecting every word in every language.
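
The idea of numbered concepts that every language matches its terms to can be sketched in a few lines. This is a toy model, not our actual graph database: the concept ids, the dictionary layout, and the function names are all invented for the example, and the tiny dataset covers only the l-i-g-h-t case discussed above.

```python
from collections import defaultdict

# Toy data: each numbered concept holds terms per language.
# Ids and layout are hypothetical; "c3" has no French term yet,
# which stands in for a gap a visitor could fill.
CONCEPTS = {
    "c1": {"gloss": "not heavy", "terms": {"eng": ["light"], "fra": ["léger"]}},
    "c2": {"gloss": "not dark", "terms": {"eng": ["light"], "fra": ["clair"]}},
    "c3": {"gloss": "not serious", "terms": {"eng": ["light"]}},
}


def build_index(concepts):
    """Map (language, spelling) to every concept id that spelling can express."""
    index = defaultdict(list)
    for cid, entry in concepts.items():
        for lang, terms in entry["terms"].items():
            for term in terms:
                index[(lang, term)].append(cid)
    return index


INDEX = build_index(CONCEPTS)


def senses(lang, spelling):
    """List the candidate concepts a player would choose among."""
    return [(cid, CONCEPTS[cid]["gloss"]) for cid in INDEX.get((lang, spelling), [])]


def translate(cid, target_lang):
    """Terms for one concept in another language (empty if none yet)."""
    return CONCEPTS[cid]["terms"].get(target_lang, [])


def gaps(target_lang):
    """Concepts with no term in the target language."""
    return [cid for cid, e in CONCEPTS.items() if not e["terms"].get(target_lang)]


print(senses("eng", "light"))
print(translate("c2", "fra"))  # ['clair']
print(gaps("fra"))             # ['c3']
```

One spelling fans out to three concepts, translation happens concept by concept rather than spelling by spelling, and the missing French term for "c3" is exactly the kind of hole that strategy 4 is meant to fill.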
/info/words

What is a Word?

Does the photo show one mountain, or six? Does each peak on a mountain also have its own name? Where, exactly, is the bottom of the mountain? Lexicographers have similar problems framing the boundaries of words. What merits its own entry in a dictionary? What should be shown as additional information for an entry? What is so obvious, or cumbersome, that it should be left out? The conundrum has three faces.

First is meaning. Is swing, the dance music, a different word from swing, the playground equipment? At Kamusi, we say they are different things, so they get different entries. From a conceptual perspective, separating senses is straightforward, although it means that one spelling, such as t-a-k-e, can have dozens of different entries.

For cataloguing, shape is the more difficult face. We can define a sense of big as "being of a substantial size" - but do we include bigger and biggest as forms of that idea within that entry, or are they separate words? English verbs have five forms that we can easily list in an entry, and hope all our readers will know what we mean by past (took), third person present (takes), past participle (taken), and present participle (taking). Listing such inflections is much harder for a language like French, with 96 verb forms, many of which actually require conjugating a second verb and contracting with part of a pronoun ("I have read" = j'ai lu [je+avoir + lire]). Should each sense entry list all the possible forms, or can they refer to a common table that may not apply in all cases (e.g., swings for children have a plural form, but swing music does not)? And then there are the agglutinative languages. German can bind several independent words together to make a brand new compound word that everyone understands, sometimes with letters changing internally according to various rules - think of the way English creates words like supermarket, but do that on the fly, with any idea you can make up, such as Autobahnmarkierungsentfernungskomitee (a committee that is responsible for removing the road marks on highways). Hundreds of African languages can have an entire sentence in a single unit, such as "We are squeezing each other" in Swahili: tunabanana (tu=we + na=now + bana=squeeze + na=each other). Each Kinyarwanda verb can take 900,000,000 different forms. Obviously, a dictionary cannot list nearly a billion forms within an entry. Our strategy is computational, to find the rules that people use to agglutinate words in their language, and build parsers that locate the entries for their component parts.
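
To make the computational strategy concrete, here is a minimal sketch of a rule-based parser for the tunabanana example. It is a toy under loud assumptions: the four-slot template (subject + tense + root + optional extension) and the one-item morpheme tables are built only from the gloss given above, and a real parser for Swahili would need far larger tables and many more rules.

```python
# Toy morpheme tables taken from the gloss above (tu + na + bana + na).
# The slot template and table names are assumptions for illustration.
SUBJECTS = {"tu": "we"}
TENSES = {"na": "now"}
ROOTS = {"bana": "squeeze"}
EXTENSIONS = {"na": "each other"}


def parse(word):
    """Return every analysis of `word` as a list of (morpheme, gloss) pairs."""
    analyses = []
    for subj, subj_gloss in SUBJECTS.items():
        if not word.startswith(subj):
            continue
        after_subj = word[len(subj):]
        for tense, tense_gloss in TENSES.items():
            if not after_subj.startswith(tense):
                continue
            after_tense = after_subj[len(tense):]
            for root, root_gloss in ROOTS.items():
                if not after_tense.startswith(root):
                    continue
                tail = after_tense[len(root):]
                parts = [(subj, subj_gloss), (tense, tense_gloss), (root, root_gloss)]
                if tail == "":
                    analyses.append(parts)          # no extension
                elif tail in EXTENSIONS:
                    analyses.append(parts + [(tail, EXTENSIONS[tail])])
    return analyses


for analysis in parse("tunabanana"):
    print(" + ".join(f"{m}={g}" for m, g in analysis))
# tu=we + na=now + bana=squeeze + na=each other
```

Each recovered part points back to an ordinary dictionary entry, so the dictionary stores the roots and affixes once instead of listing every combined form.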

The third face is the most difficult for computer processing: multiword expressions. We call these party terms because they are composed of items that dance together. That is, the words in the expression each have their own meaning, but together they form a new idea - a head case is a disturbed person, for example, and a northern right whale dolphin is a type of dolphin that is neither a compass direction nor correct nor a whale. In Vietnamese, every syllable is separated by a space, so every multisyllabic word looks like a party term. Are party terms "words"? Kamusi treats them as entries, if the meaning cannot be gleaned from its parts. Party terms present additional challenges because they can be separated, can change shape, and can have multiple meanings. Fortunately, our unique architecture can handle all of that, and, using GOLDbox, a variety of games, and other techniques, we can build from the basic canonical lemmas that are the usual initial dataset or first-round user contributions for a language, toward a full set of forms associated with each concept.

Just don't ask us to tell you what exactly we mean by the word "word".

/info/what_is_a_word

🐥📊 DUCKS: Data Unified Conceptual Knowledge Sets

When a dictionary in one language says that a word means "hot" in English, do they mean it is very ♨ warm, very 🌶 spicy, or very popular? With 🐥📊 DUCKS, players can join a term in any language to its best meaning in English or another connected language, using a simple, 🎮 game-like setting. In this way, Kamusi participants can align terms from any language where 🔢 data is available - quickly merging new languages into our conceptual matrix with great accuracy. Open now, by invitation only (contact us).

/info/ducks

🔠🕸 Wordnet Definitions

Much of our early 🔢 data comes from a great collaborative resource called 🔠🕸 Wordnet. However, 🔠🕸 Wordnet falls short of the needs of a rich 👅👅👅 multilingual dictionary and data source, for a variety of reasons. We are identifying and improving definitions that don't pass muster, adding new terms and new languages, flagging false items, and offering the data back to the 🌎🔠🕸 Global Wordnet community.

Here is a more detailed paper about the relationship between Kamusi and WordNet data.

Problems and Procedures to Make Wordnet Data (Retro)Fit for a Multilingual Dictionary, presented at the Eighth Global WordNet Conference, Bucharest, Romania, 2016: The data compiled through many Wordnet projects can be a rich source of seed information for a multilingual dictionary. However, the original Princeton WordNet was not intended as a dictionary per se, and spawning other languages from it introduces inherent ambiguity that confounds precise inter-lingual linking. This paper discusses a new presentation of existing Wordnet data that displays joints (distance between predicted links) and substitution (degree of equivalence between confirmed pairs) as a two-tiered horizontal ontology. Improvements to make Wordnet data function as lexicography include term-specific English definitions where the topical synset glosses are inadequate, validation of mappings between each member of an English synset and each member of the synsets from other languages, removal of erroneous translation terms, creation of own-language definitions for the many languages where those are absent, and validation of predicted links between non-English pairs. The paper describes the current state and future directions of a system to crowdsource human review and expansion of Wordnet data, using gamification to build consensus validated, dictionary caliber data for languages now in the Global WordNet as well as new languages that do not have formal Wordnet projects of their own. /info/wordnet

Kamusi Here:

Instant translation among the languages in our system (arriving soon at 90+). Available now for 11 languages via the search box at the top of this page. /info/kamusi_here

🎉 Party Terms! 🎉

Multiword expressions like "give me a break" present insurmountable challenges for most dictionary and translation systems. Is the photo above an African, a 🐠 fish, an eagle, or an African fish eagle? Our solution is embedded in our new 🔢 data model and Pre:D. Party terms (two or more words that take on a unique meaning when they play together, also known as MWEs) are treated in Kamusi as regular dictionary entries, which can be given definitions, and linked to translations of the concept across languages. What's more, if a party term can be split across a sentence (e.g., "she drives me and all my friends up the wall"), we mark where it can be separated, so that we can find the different parts when they occur in a 📃 text and put them back together with their definitions and translations.

/info/party

Separability

The sentence "The teacher will bring all the loud and restless students to heel by threatening to fail everyone" includes the party term "bring to heel". In most systems that analyze 📃 text for natural language processing, the many words between "bring" and "to heel 👠" make it impossible to locate and treat such a multiword expression as a unit. Through our method of marking whether and where party terms can be separated, advanced technologies can locate meanings and translations, no matter how far the distance between the component words.
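
Here is a minimal sketch of how a separation mark can be used to find a split party term. It assumes a representation we made up for the example: each party term is stored as an ordered list of chunks, where words inside a chunk must stay together and any number of words may fall between chunks. The matcher takes the first occurrence of each chunk and does no inflection handling, so it would not yet catch "drives ... up the wall" from an entry stored as "drive".

```python
import re

# Hypothetical storage: a gap is allowed between chunks, never inside one.
PARTY_TERMS = {
    "bring to heel": [["bring"], ["to", "heel"]],
}

SENTENCE = (
    "The teacher will bring all the loud and restless students "
    "to heel by threatening to fail everyone"
)


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def find_chunk(tokens, chunk, start):
    """Index of the first contiguous occurrence of `chunk` at or after `start`."""
    n = len(chunk)
    for i in range(start, len(tokens) - n + 1):
        if tokens[i:i + n] == chunk:
            return i
    return -1


def locate(tokens, chunks):
    """Return (start, end) token spans for each chunk in order, or None."""
    spans = []
    pos = 0
    for chunk in chunks:
        i = find_chunk(tokens, chunk, pos)
        if i < 0:
            return None
        spans.append((i, i + len(chunk)))
        pos = i + len(chunk)
    return spans


tokens = tokenize(SENTENCE)
print(locate(tokens, PARTY_TERMS["bring to heel"]))  # [(3, 4), (10, 12)]
```

The six words between "bring" and "to heel" do not matter to the match, and the later "to" in "to fail" is ignored because it is not followed by "heel".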

/info/separability

Our market value: $7 Quadrillion

With an app centered around sending 1️⃣ one word, Yo was able to raise $2.5 million in venture funding, for a market capitalization estimate of up to $10 million. At that rate, what is the value of Kamusi, if we can fulfill our ambition to provide you every word in every language? You do the math, yo:
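
One way to do that math, as a sketch: take the $10 million upper estimate for one word, the $7 quadrillion in the title, and the 7000 languages we count elsewhere on this site, and work backward to the vocabulary size the headline implies. The per-language word count below is derived from those three numbers, not a figure we state anywhere.

```python
VALUE_PER_WORD = 10_000_000   # Yo's upper estimate, one word
HEADLINE_VALUE = 7 * 10**15   # $7 quadrillion, from the title
LANGUAGES = 7000              # "more than 7000" languages, rounded down

total_words = HEADLINE_VALUE // VALUE_PER_WORD   # words needed to hit the headline
words_per_language = total_words // LANGUAGES    # implied vocabulary per language

print(f"{total_words:,}")          # 700,000,000
print(f"{words_per_language:,}")   # 100,000
```

In other words, the headline works out to 100,000 words for each of 7000 languages at Yo's going rate.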

/info/yo

Kamusi Here! Global Dictionary App for Android

Our beautiful Android app puts the 🌍 world's most effective 👅👅👅 multilingual dictionary in the palm of your hand. We are happy to say that the last sentence in our 📽 video demo is now outdated: the app is now available for free on Google Play.

As of 2 December 2016, there are 11 languages in the system, and we are bringing in more as fast as we can get the missing ingredient to process them. If your favorite language isn't in the system yet, install the app now anyway, and we'll push you your language when it is ready.

The current data is mostly based on a unique graph of the links among WordNets, but we will be bringing in data from other sources soon, as well as modifying the existing data, so this won't remain a "pure" WordNet endeavor for long. Also, our data is only as extensive as the sources we've borrowed from, so if a WordNet does not have a term or a definition, we don't either.

There is a lot more functionality that we'd like to build into the app, but our budget for the first release was only €300 from a private donor - your support can help add features quickly to version 2.0. Send feedback, and enjoy!

IARC Rating Certificate

/info/here_android

GOLD Theme: Games

🎮 Gaming Theory

We are developing a number of 🎮 games for 👪 people to share what they know about their language. This information box, which will discuss our forays into 🎮💾 gamification, is under construction. Please visit again soon!

/info/gamification

🎮 Games Games Games 🎮

We have developed many 🎮 games for Facebook and other platforms, for 👪 people to help us improve our 🔢 data while having fun. The games are built around the premise that the authorities about a language are the people who speak it, and many of those people will share their knowledge if they are given easy tools and spiritual rewards. Sending researchers to the field to collect consistent linguistic data for thousands of languages is impossible, but it is relatively straightforward to configure all those languages for the same set of games that can fill our database with the words people know.

Many of our games are ready, and others are under development. There is a problem, however, in that we cannot afford to open the games to the public. Simply put, it costs money to manage the volumes of data that will stream in once people begin playing. We cannot magically produce a server that can handle the traffic, or developers who can fix inevitable bugs or make improvements based on user experience, or the experts who can perform quality control for their languages. We already offer more than we can pay for, so that you get a taste of what lies ahead, but flipping the switch on our games is too much to take on by ourselves.

We'd really appreciate it if you could help us put our games in play, or you can keep checking back to see if others have ponied up for you.

/info/games

WitchLake Studio

WitchLake Studio, a Swiss technology SME that develops Serious Games 🎮 to enhance learning and productivity through online fun, will be a partner in the Human Languages Project to increase the addictive capacity of our crowd-based elicitation tools through their experience with gamification 🎮💾.

/info/witchlake

Games & Markets

An article called "What languages to localize your game into?" opens a general window to the situation of languages within technology and many other markets. The take-away message is that companies invest in languages that have a high profit potential. That is a perfectly logical business plan. As a method of choosing languages for developing public knowledge resources, it is a disaster.

The case of electronic 🎮 games demonstrates that 🏭 industry intentionally avoids languages with few resources, both because the profit potential is relatively low, and because there are few current resources to make localization easy. When considering what languages to invest in, game makers are advised to stick with the usual suspects: English always, then some mix of French, Italian, German, and Spanish, and potentially Chinese, Japanese, and Korean. Charts show that several other markets have ROI potential, such as Russian, Dutch, and Portuguese.
"Do they have money to spend? There are languages with huge numbers of speakers, such as Chinese or Hindi, but the number of potential players who can afford to buy your game or would spend money for in-game purchases is often low. Do the high numbers make up for the lower value per user?"
By this measure, no African language merits industry interest. No Native American language is worthy of attention. Most Asian languages and their markets are too confusing. Even prosperous European languages like Catalan and Estonian do not rank, because the payout is too small.

This bias toward the wealthiest languages is embedded in how governments and organizations such as charitable foundations approach the development of language resources. But think about this for a minute. To address horribly unequal educational outcomes between rich neighborhoods and poor ones, would you spend all your money to expand the rich schools because they had a greater guarantee of success? To open literacy opportunities, would you build a library next to one of the world's great public knowledge centers at 42nd Street and 5th Avenue in New York, or would you seek out an unserved neighborhood in Cuzco or Nairobi? If the objective is financial reward, by all means avoid meager markets and take this advice:
"In summary, the mainstream European languages are a safe bet. French, German, Spanish, Portuguese, Italian are the traditional languages to start with. In Asia, China, Japan and Korea are the massive markets, but you need to do more than just translate. Latin America, Southeast Asia, Russia and India are growing fast, but the market sizes and average revenue per user are much smaller. When deciding on languages, remember to take not only the translation cost, but also all collateral expenses into consideration."
If you wish to invest in the human potential that is hidden behind language barriers, however, we will put your donations to good work right away.

/info/game_market

Authentication

Seamless, tight 🔐 security is essential to our site. We have to protect our 🔢 data from outside hackers and protect user privacy. At the same time, we must maintain each user's contributions and custom settings across numerous 📱 devices and applications. Enabling 👪 people to log in securely must be done right, which means it must be done by professionals, which means we can't get it done for 🆓 free. Proper authentication is also the main sticking point before we can release several of our applications that are otherwise ready to ship, such as WordUp! To help fund authentication as a 💛😇 GOLD Angel, please contact us!

/info/authentication

WordUp!

We have created WordUp! to expand our 🌎 global vocabulary with more words and more languages with community input using 📱 mobile devices. We provide words and definitions in English or another validated language, and the visitor provides an equivalent term in their language. Terms are arranged in categories that users can choose based on their interests - animals, for example, or medical terms. Once we see a consensus emerge for a term, we accept it provisionally, and queue it for final review when an expert is available.
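
The provisional acceptance step can be pictured with a small Python sketch. The source does not say how consensus is measured, so the vote count and agreement share used here (`min_votes`, `min_share`) are purely illustrative assumptions, as is the function name `provisional_term`.

```python
from collections import Counter

def provisional_term(answers, min_votes=5, min_share=0.6):
    """Return the consensus answer for one term, or None if there is none yet.

    `answers` is the list of equivalents submitted by players.
    Both thresholds are hypothetical values chosen for this sketch.
    """
    if len(answers) < min_votes:
        return None  # too few players have answered so far
    term, count = Counter(answers).most_common(1)[0]
    # Accept provisionally only if enough players agree on one answer.
    return term if count / len(answers) >= min_share else None

# Four players agree; one submits a variant spelling.
accepted = provisional_term(["ndege", "ndege", "ndege", "ndege", "ndegi"])
```

A term that passes this check would still wait in the queue for final expert review, as described above.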

The program is complicated because play is distributed asynchronously across numerous languages and networks. In order to minimize connection costs for speakers of non-privileged languages, players load their vocabulary sets when they are online, play offline, and then sync their accumulated answers the next time they have a network connection. The 🔢 data in play may thus shift substantially during the time a player is offline. When syncing, WordUp! not only needs to collect the player's new answers and deliver new term sets to the device, but also selectively remove from every set all of the terms that the community has completed for that language in the meantime.
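
One way to picture a single sync round is the Python sketch below. It is a simplified illustration under stated assumptions, not the WordUp! code: the function `sync` and all four data shapes (sets of term ids and dictionaries of answers) are hypothetical, and networking, authentication, and conflict handling are left out.

```python
def sync(local_terms, local_answers, server_open_terms, server_answers):
    """Run one sync round for a single language (hypothetical data shapes).

    local_terms: set of term ids currently loaded on the device
    local_answers: dict of term id -> answer the player gave offline
    server_open_terms: set of term ids the community has not yet completed
    server_answers: dict of term id -> list of submitted answers (mutated)
    Returns the set of term ids the device should hold after syncing.
    """
    # 1. Send the player's offline answers to the server.
    for term_id, answer in local_answers.items():
        server_answers.setdefault(term_id, []).append(answer)
    # 2. Drop terms the player answered or the community completed meanwhile.
    still_open = {t for t in local_terms
                  if t in server_open_terms and t not in local_answers}
    # 3. Top up with open terms the device has not seen yet.
    new_terms = server_open_terms - local_terms
    return still_open | new_terms

server_answers = {}
device_terms = sync(
    local_terms={"cat", "dog", "fish"},
    local_answers={"cat": "paka"},
    server_open_terms={"dog", "bird"},  # "fish" was completed while offline
    server_answers=server_answers,
)
```

In this example the device keeps "dog", receives "bird", and loses both the answered "cat" and the community-completed "fish".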

All of this now works smoothly except for user authentication. It is crucial that we can associate users with their contributions, so they can receive appropriate credit and so that we can detect malicious users and undo any damage they attempt. Authentication must be done well so that it works smoothly across all our software and platforms, but it is too expensive to implement on our current 0️⃣ zero budget.

/info/wordup

GOLD Theme: Emoji

😂🌎🤖 EmojiWorldBot

Dictionary between Emoji and 70+ languages. This 🤖 Bot currently works on the 🆓 free Telegram messaging platform. We are looking for hackers to extend it to Facebook, Skype, Slack, and others soon. Read all about it. Available for you now!

/info/emojiworldbot

University of Naples

We work with the Dipartimento di Studi Letterari, Linguistici e Comparati at "L'Orientale" on the development of 😂🌎🤖 EmojiWorldBot, 🎮💾 gamification in the acquisition of linguistic 🔢 data, and the iconographic representation of language.

/info/naples

Kamusi 😂🌎🤖 EmojiWorldBot in Naples!

We were on the big stage at Futuro Remoto in Naples on October 9 to showcase our 72-language 😂🌎🤖 EmojiWorldBot! Our presentation included a 🎮 game for festival attendees, but the prize we designed was too cool to keep there. We have a few more shirts to send as gifts to people who donate $120 or more. Seize the 🇮🇹 Italian spirit, and support 🎓 knowledge resource development for languages worldwide - we're making these shirts and mugs available to friends of Kamusi for a very limited time!

/info/emojishirt

The Theory of Emoji

People 👪 have been using images to tell stories since the time of cave paintings. Writing ✍ developed from 🖼 pictures. The Egyptians produced a standardized image set, hieroglyphs. Hieroglyphs were basically a single font so that everyone was carving "bird" the same way, with nearly 5000 total characters, and 24 hieroglyphs also functioning as a kind of alphabet based on their phonetic associations. Chinese ideograms 🈳 were similarly abstracted from pictograms that could be seen to represent physical objects. Pictures were a difficult basis for complex writing, however, first because you would need clear and consistent images for tens of thousands of different things (bird is easy, but how do you show the difference between a swallow and a sparrow?), and also because many concepts are too abstract or complex to draw. Most languages therefore cottoned on to the 💭 idea of alphabets, a limited set of characters that could represent the 🔊 sounds of their words, making it possible to communicate over time or distance with anyone else who knew the same language. Alphabets did not erase images as a means of conveying ideas - illuminated manuscripts are filled with drawings as well as writing - but 🔡 letters or ideograms instead of pictures have now carried most of the functional load of written communication for thousands of years.

And then, suddenly, there were emojis. More than the simple emoticons that can be composed on a keyboard, like ;), emojis are little 🖼 pictures that are supposed to show the same thing on different electronic 📱 devices. A 🕊 is a 🕊, irrespective of PC or Mac, and regardless of whether you speak Japanese or Zulu. More than 1300 emojis have been approved by Unicode, the 🖥🏭 electronics industry consortium that agrees on the characters that will be consistent for every language and every 📱 device around the 🌍 world (albeit with different artistic renderings allowed that can be confusing, and demonstrate the point that images are a delicate communications medium). With more than a thousand pictures on everyone's phones, 👪 people have begun using those images instead of written words in informal contexts, particularly text messaging.

This return to 🖼 pictures opens a lot of ❓ questions, both for the role of Emoji as an alternative representation system within a language like Italian, and the potential for Emoji as a communications bridge among 👪 people who speak different languages. With a limited picture set, is it possible, as is being investigated in an experiment conducted by Scritture Brevi, to form multi-emoji expressions so that 👪 people can distinguish their swallows from their sparrows? Can such combinations transcend language differences, or are they embedded in cultural or phonetic associations that are specific to certain groups? As more emojis are approved into the 🌎 global system, will they become a common feature of daily communication, or will people abandon them as too complicated to locate or interpret, versus just typing letters? To what extent will Emoji improve communications across languages - will the image set become established firmly enough for 👪 people who speak different languages to communicate complex 💬 thoughts with confidence? And, from a more technological standpoint, can we use emojis to elicit parallel terms among languages that are not currently paired by dictionaries? Can we work with the crowd to tag emojis for a range of communicative concepts, such as adding "sparrow" to the tags for 🕊? When we get good 🔢 data for a language that is not well supported in the technological realm, can we push that to 📱🏭 device manufacturers so that people everywhere can easily look up "bird" or "swallow" in their own language?

With these ❓ questions, you can see that we have put so much effort into the 😂🌎🤖 EmojiWorldBot project not because emojis are trendy and cute, but because they provide an important gateway to 👅 linguistic technology and 👅👅👅 multilingual communication for the future. Please join us by installing EmojiWorldBot on your computer or 📱 mobile device, available for you now.

/info/emoji_theory

Language: Emoji

Local name: 😂. As an overall communication system, we speak of Emoji with an uppercase E. When speaking of one specific character, we ✍ write with lowercase: "the taco emoji". When speaking of more than one character, we add an "s": "the monkey emojis"
Most spoken in: 📲 Text messaging
👪🔊: Installed on billions of 📱 devices
Kamusi records: 1,123 official UNICODE emojis
ISO 639(1) / (3): none / none
Links: Wikipedia | Emojipedia | Lexicon Valley: Are Emojis a Language?
Sources: Initial 🔢 data from the UNICODE CLDR Annotation Charts. New 🔢 data generated by participants playing our 🎮 games on 😂🌎🤖 EmojiWorldBot
💛😇 GOLD Angel: Nobody.
Pearls: Be the first Pearl for this language!
Join the Emoji Language Task Force to help plan development of robust resources for the language!

[This page is under permanent construction. We would like to make sure the page stays updated, and add new interesting links and information for our visitors. We invite you to help curate this page!!! Please contact us to volunteer to improve this infobox, or to translate the content.]

/info/emoji

Facebook Bot

After we developed 😂🌎🤖 EmojiWorldBot on the Telegram platform, Facebook opened Bot for Messenger. We would like to port our 🤖 Bot to Facebook, but we do not have someone in-house with enough time or experience. The bot would need to remain tightly synchronized with the other platforms, including coordination of new user-provided 🔢 data, and implementation of new features, including 🎮 games and dictionary searches. If you can work on or help fund this bot to spread Kamusi services to a billion Facebook users, contact us!

/info/facebook_bot

University of Macerata

We work with the Laboratorio di Fonetica e Scrittura (LaFoS) on the development of 😂🌎🤖 EmojiWorldBot and forthcoming projects on multi-emoji expressions and the extended use of images for cross-lingual communication.

/info/macerata

Language: Italian

Local name: Italiano
Most spoken in: 🇮🇹 Italy, Switzerland
👪🔊: 85 million
Kamusi records: 75,296
ISO 639(1) / (3): it / ita
Links: Omniglot | Wikipedia | Glottolog | Ethnologue ($)
Sources: Initial 🔢 data from the ✖🔠🕸 MultiWordNet, by Fondazione Bruno Kessler, Center for Communication and Information Technology, Human Language Technology Group, Trento, 🇮🇹 Italy. Additional data from Kamusi participants.
💛😇 GOLD Angel: Nobody.
Pearls: Be the first Pearl for this language!
Join the Italian Task Force to help plan development of robust resources for the language!

/info/ita

Microsoft / Skype Bot

Our 😂🌎🤖 EmojiWorldBot works nicely on the Telegram platform. Now we would like to extend the service to Skype and other systems that use the Microsoft Bot Framework, but we do not have the 🏇🔌 horsepower to port the code. The 🤖 bot would need to remain tightly synchronized with the other platforms, including coordination of new user-provided 🔢 data, and implementation of new features, including 🎮 games and dictionary searches. If you can work on or help fund this bot to spread Kamusi services to hundreds of millions of users of Skype and other Microsoft-friendly services, contact us!

/info/skype

GOLD Theme: USA

American Foundations: Indifference to Language

The United States has the world's most generous tradition of distributing private wealth for the public good. Successful individuals often vest much of their wealth in foundations, which have a legal mandate to fund charitable projects of one sort or another. American foundations are major drivers in combatting diseases and funding innovative social projects around the world. However, language is rarely on their screens, and may be seen as a hindrance. Getting through doors guarded by program officers who do not see language as part of an organization’s mission is almost impossible. Endangered languages do get bits of funding for sentimental reasons, but overall, excluded languages are treated as unimportant esoterica.

For big donors, language has yet to make a mark as an area of concern. Language barely makes a dent in the grants of the Ford Foundation, for example, with $145,000 spent in 2014 and 2015 on research and development for the emerging Sheng language of Kenya, a $190,000 grant for the Hawaiian language, and $150,000 for a multilingual voter registration platform for Nigeria – not half a million dollars, from an $800,000,000 portfolio. The Gates Foundation, "impatient optimists working to reduce inequity" who "believe that the path out of poverty begins when the next generation can access ... a great education", has even less interest in language, despite its well-documented effect on educational outcomes. Other than support for English, they have since 2013 granted $100,000 to develop local-language health materials in Burkina Faso, $175,000 for professional development for American teachers of foreign languages, and $100,000 to support language learning for the Makah Nation near their Seattle headquarters, with another $386,000 spent on non-English in prior decades. There is no way for prospective grantees to get in the door and make the case for supporting digital language diversity as a path toward the foundation's goals of overcoming inequity. For the Hewlett Foundation, language funding equates to English.

In an analysis of the grants database of the Foundation Directory, Jaumont and Klempay (2015) find that 88% of the roughly $4 billion granted by American philanthropies in Africa over a decade from 2003 went to Anglophone countries, almost entirely for programs conducted in English. Lack of concern for local languages can be further observed in eleemosynary institutions in Europe and elsewhere. Understandably, big donors want projects that can make an immediate, visible impact, whereas language projects have intangible results that might not be evident for decades (if there is ever a way to measure the effect that increased knowledge has on a society, beyond saying that X number of people have used Y resource that contains Z elements). Less benevolently, few philanthropies are amenable to the case that minority languages are worth even a moment of their consideration, and neither practitioners nor potential beneficiaries are in a position to demand otherwise.

The question of how to gain philanthropic support for projects that advance equity by advancing languages, particularly from US foundations, is one for which no answers are evident. No foundation currently expresses linguistic equity as a value, nor languages as a program area beyond limited support for endangered languages, and they do not entertain proposals that seek to convince them otherwise.

/info/foundations

Language: English

Most spoken in: First language of majority in USA, UK, Ireland, Australia, Canada, and many other countries. Official language in many countries of Africa and Asia. Widely learned as a second or school language.
👪🔊: More than 1/2 billion as first or second language.
Kamusi records: 207,235
ISO 639(1) / (3): en / eng
Links: Omniglot | Wikipedia | Glottolog | Ethnologue ($)
Sources: English 🔢 data in Kamusi has been mixed and revised from many sources. Roughly 100,000 initial definitions and associated data have been provided by the Princeton 🔠🕸 WordNet, a project maintained by Christiane Fellbaum at Princeton University. Additional 🔢 data from Kamusi participants.
💛😇 GOLD Angel: Nobody.
Pearls: Be the first Pearl for this language!
Join the English Language Task Force to help plan development of robust resources for the language!

/info/eng

US Residents Only:
Donate in US Dollars

Better yet, send a check and we are not charged a processing fee! Mail to:
Kamusi Project USA
For Deposit Only #0046 2577 6919
c/o Bank of America - Bank by Mail
PO Box 2966
Phoenix, AZ 85038-0966


Kamusi Project USA is a 501(c)(3) non-profit organization. All donations are tax deductible!



/info/us_residents

Buffalo Tongue

Buffalo Tongue is a small enterprise for Native American language research that will join the Human Languages Project to work on the Crow, Blackfoot, Cheyenne, and Lakota languages of the Great Plains.

/info/buffalo_tongue

Donate in British Pounds



/info/pounds

Bias in Dictionaries

Bias regarding language occurs in many forms beyond outright bigotry. Kamusi is affected by at least four kinds:
  1. The soft bigotry of low expectations. Most people do not expect high-quality resources for languages that do not have big economic footprints. Many feel that speakers of non-lucrative languages should darn-tootin' learn a language like English or French, or make do with any crumbs that fall their way. Why should we worry about Zapotec, for example, when the path to success in Mexico involves getting jobs in Spanish? This attitude is pervasive among policy-makers and foundations and among many average citizens of the wealthy world who would gladly donate to eradicate other inequities. It is sometimes held even by speakers of less-resourced languages themselves, who do not have reason to hope that they will ever find their languages viable in technological or global economic domains. If nobody expects that great things are possible, then nobody demands them - and it is therefore impossible to rouse the funds to bring them into existence.
  2. Concept bias. Kamusi is seeded by many open datasets, so our terms reflect the interests of their compilers. For example, WordNet includes Captain James Cook, but not Hawaii's King Kalaniʻōpuʻu, whom Cook died trying to seize. We are working toward including as many indigenous ideas as possible, but you have to know a concept exists in order to include it in the dictionary, and such terms are often absent from our seed data.
  3. Definition bias. The people of Hawaii would be surprised to hear that James Cook "discovered" them, but that view of history was what guided the author of the WordNet definition. We welcome your help finding and fixing questionable definitions.
  4. Ranking bias. People tend to assume that their top search result is the most relevant. If we have three meanings for "draw" (draw a picture, draw a sword, draw water), we have to list one first, but that arbitrary ranking has no relationship to your needs at a given moment. Teachers often complain that their students use the wrong term because it was listed first on Kamusi. Our response is to try to flesh out each sense with definitions and usage examples that make clear to the user whether they have the right term - a task that will take years, and needs your help.
/info/bias

GOLD Theme: Europe

Donate in Euros



/info/euro

ELRA

The European Language Resources Association will play a major role in the validation and evaluation of outputs from the Human Languages Project.

/info/elra

Universität Hamburg

The Asia-Africa Institute at the University of Hamburg is approaching the Human Languages Project through bridges among Mandarin Chinese, Cantonese, German, and Swahili.

/info/hamburg

Language: German

Local name: Deutsch
Most spoken in: Germany, Austria
👪🔊: 90 million as a mother tongue, up to 100 million as a second or school language
Kamusi records: We have so far failed to get permission to bootstrap German with the 142,814 lexical units produced by GermaNet. We have 17,000 records pending from a different 🔢 data source, but it is not 🔠🕸 WordNet-aligned with our existing data, so it will require an effort with 🐥📊 DUCKS when we can muster the resources.
ISO 639(1) / (3): de / deu
Links: Omniglot | Wikipedia | Glottolog | Ethnologue ($)
Sources: N/A
💛😇 GOLD Angel: Nobody.
Pearls: Be the first Pearl for this language!
Join the German Language Task Force to help plan development of robust resources for the language!

/info/deu

lexikographieblog

lexikographieblog
7 March 2013
By Michael Mann

A short time ago, a new internet dictionary that can be used monolingually or multilingually was launched: Kamusi (kamusi.org). The following video shows what the dictionary can do so far. It is nice that each expression comes with an explanation of its meaning (as in a monolingual dictionary), but also that each expression is assigned equivalents in several other languages (as in a polylingual dictionary); these can in turn be called up, and then have an explanation of meaning in "their" language. For homonyms, the individual sub-entries or senses are linked to the semantically correct equivalents in the foreign languages. That is quite nice, although one can probably get similarly far with Wiktionary. In addition, inflectional peculiarities of the languages involved are pointed out.

Note: No tricks were used in this video.

Continue to Full Article...

/info/lexikographieblog

University of Warsaw

The Faculty of Oriental Studies has been contributing to Swahili development in the Kamusi Project for more than twenty years, and stands ready to participate in Polish and Hausa when the work can be financed.

/info/warsaw

Donate in Polish Zloty



/info/zloty

A Dictionary for Exotic Languages?

When the science journalist for the Swiss newspaper 20 Minuten set out to ✍ write about Kamusi, her efforts were reduced to a tiny paragraph because the project deals with "exotic" languages, but not yet German. She was not even given the print space to point out that Kamusi is of significant benefit for European languages because it translates among all the languages in the system, not just to English. The notion that 👅👅👅 multilingual lexicography is akin to collecting esoteric butterflies has proven to be quite an obstacle to receiving support from foundations and government agencies in Europe and the USA. For one thing, Kamusi is actually working on many languages that are central to 🌎 global commerce and 🎓 knowledge systems, building tools that have never existed even for the most favored languages, English and the FIGS. For another, to the billions of 👪 people who speak most of the languages we strive to include, there is nothing "exotic" about their efforts to learn and prosper in their own 👅 linguistic ecosystems.

Unfortunately, American agencies see education in English as the main path worth supporting, even though that means perpetual marginalization for the vast majority of 👪 people who will never learn English in their lifetimes. European agencies, meanwhile, are ideologically fixated on benefitting European business interests - which do not include 🆓 free 🎓 knowledge resources for European students, much less for people in "exotic" locations.

A Dictionary for Exotic Languages: Languages like Swahili, Kirundi or Mampruli seem exotic to us - yet in Africa they are spoken by millions of 👪 people. However, no online dictionaries for such languages have existed so far. Now linguists at EPFL have developed one. It already translates words in several African languages into English. In the future, the scientists plan to add more languages, including non-African ones like Vietnamese or Finnish. (20 Minuten, 9 Oct 2015)


/info/exotic

Spaghetti Strap Shirt

Comfortable women's tank top, available in several colors

/info/spaghetti

Fidget Widget

Many 👪 people check their 📱 devices compulsively, looking for a distraction to fill spare moments. We have coded a mobile app, the Fidget Widget, that encourages people to share tiny bits of what they know about their language while they wait for a 🚌 or brew ☕. The general 💭 idea is to ask simple ❓ questions that can be answered in seconds, such as whether a particular word in their language, which our source matches to "light", means "not dark". Swipe right for yes, left for no, down for don't know, and your 📱 is unlocked... The microdata we harvest from this app will add up to lots of new language information for Kamusi users, while providing fulfillment for fidgeters with a way to feel useful when they check their phone every minute. We have not quite finished the app, however, because we did not have our new graph 🔢 database in service when we programmed the interface. Now we need to tie together the db, authentication, the initial code, and design work we've done on Here! and WordUp!. Please contact us to claim this project as a coder or 💛😇 GOLD Angel.
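
The swipe interaction described above amounts to turning one gesture into one small data record. The Python sketch below shows that mapping under stated assumptions: the record layout, the function name `record_swipe`, and the example question are all hypothetical, chosen only to illustrate the idea rather than to describe the unfinished app.

```python
# Swipe directions as described for the Fidget Widget.
SWIPE_TO_ANSWER = {"right": "yes", "left": "no", "down": "don't know"}

def record_swipe(question, direction):
    """Turn one swipe into a microdata record (hypothetical record shape)."""
    if direction not in SWIPE_TO_ANSWER:
        raise ValueError(f"unsupported swipe direction: {direction!r}")
    # Copy the question fields and attach the player's answer.
    return {**question, "answer": SWIPE_TO_ANSWER[direction]}

# Example question: does this word, matched to "light", mean "not dark"?
question = {"word": "nuru", "language": "swa",
            "english": "light", "sense": "not dark"}
record = record_swipe(question, "right")
```

Each such record is tiny, but many of them collected across players would add up to the language information described above.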

Meanwhile, if you are tempted by the Fidget Cube in the 📽 video above, won't you instead please impulsively send the dosh to Kamusi to fund the development of truly useful 🔢 data for languages 🌍 worldwide?

/info/fidgetwidget

GOLD Theme: Sign Languages

Stretch Productions

Stretch, a New Zealand 📽 film company, will bring a focus on visual resources for 🌎 global sign languages to the services we seek to integrate for 🙉 Deaf people and their languages.

/info/stretch

SignUp!

Visual dictionary to connect spoken and sign languages. A prototype is working within Kamusi Here!, and 20+ sign languages will be included when the 🔢 data has been aligned with 🐥📊 DUCKS. (Contact us to play this 🐥📊 DUCKS set; 🎓 knowledge of English is all that is required.)

/info/signup

Uganda School for the Deaf

We are committed to working with the Uganda School for the Deaf to bring Uganda Sign Language into the Uganda National Living Dictionary, if and when we can arrange funding support for the project.

/info/uganda_deaf

🐥📊 DUCKS for Signs

We have 🔢 data for more than 20 different Deaf sign languages that we need to align with spoken-language data. This box (under construction) will describe the process. To help develop or fund the project, please contact us!

/info/ducksigns

GearUp!

Swahili Time Wall Clock

Time in East Africa is counted with 🕖 "Saa Moja" (one o'clock) as the first hour after sunrise, because along the equator, the length of every day is so consistent you can set your ⌚ watch to it. Our clock is unique, showing the numbers the way they are said when Swahili speakers 👪🔊 tell time. The Kamusi clock is our most popular product, makes a fantastic 🎁 gift, and $15 from each sale goes directly to support our work.
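
For readers who want to see the arithmetic, here is a minimal Python sketch of the conversion, taking 7:00 on the international clock as "Saa Moja". The function name `swahili_hour` is an assumption for this example, and the sketch returns only the hour number, without the Swahili words for the time of day.

```python
def swahili_hour(hour_24):
    """Convert a 24-hour clock hour (0-23) to the hour number said in Swahili time."""
    if not 0 <= hour_24 <= 23:
        raise ValueError("hour must be between 0 and 23")
    # Swahili hours are offset by six from the international clock face:
    # 7 -> 1 (saa moja), 12 -> 6, 18 -> 12, and 19 -> 1 again for the night cycle.
    return (hour_24 - 7) % 12 + 1

morning = swahili_hour(7)   # first hour after sunrise
noon = swahili_hour(12)
evening = swahili_hour(19)  # first hour after sunset
```

This is why the numbers on the Kamusi clock sit opposite their usual positions on a standard dial.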

/info/swahili_clock

GOLD Theme: Terminology

Terminology

We have developed a comprehensive approach to producing terminology sets for any domain, for any language. Our terminology efforts are based on two premises:

  1. Communicating with 👪 people in their own language requires much less effort, and will be much more effective, than willing them to communicate in a foreign language that few understand. Put starkly, if HIV/AIDS messages are not understandable to a vulnerable person such as a sex worker or an adolescent, the opportunity to prevent transmission to that person is lost forever.
  2. Every language has the capacity for full communication in any domain, provided key principles are followed for terminology development. For example, twenty years ago English did not have ICT terms for 🕸 "web" or "browser," yet the productive capacities of the language were employed so that today most English speakers 👪🔊 understand exactly what is meant by that terminology.
Kamusi has created a unique participatory methodology for terminology development that maximizes the likelihood that a term set will be accepted and used by a language community. By using a combination of paid experts and community volunteers, the costs of this work are kept to a minimum.
  1. Subject specialists with expertise in a language are engaged for a preliminary translation of a domain-specific terminology set.
  2. The specialists are encouraged to leave ❓ question marks or multiple choices in cases of uncertainty.
  3. Then members of the public are invited to comment on the problem terms, cast non-binding votes for existing proposals, and propose their own suggestions or new ways of looking at a concept. For example, a Swahili ICT community member solved the problematic term "cache" by suggesting it be viewed as "temporary storage" rather than the English metaphor of a hiding place.
  4. While the experts are still called upon to make the final decision, this democratized input and review process results in term sets that are much more likely to have universal uptake than the traditional top-down approach to terminology development.
One recent example of Kamusi’s work is an IDRC-sponsored effort by the African Network for Localization to produce a glossary of information technology terms for 10 African languages. This provides translations and definitions across all 10 languages of 2500 terms, from “absolute path” to “zoom out.” This glossary is now central to software development projects in those languages.

We are seeking partners in development and humanitarian aid, to build language resources that will permanently improve the communications environment of those agencies. If terminology issues can be identified and addressed up front in the planning stage of a programme, the community interaction will be improved, as will the effectiveness of local agency staff.

/info/terminology

Infoterm

The International Information Centre for Terminology supports and coordinates international cooperation in the field of terminology.

/info/infoterm

Technical Terminologies

Beyond general meanings, we have millions of scientific and industrial terms at the ready in more than 24 languages. Furthermore, we have created a participatory system for experts and the public to build terminologies for their languages, for domains of local importance. For example, specific nursing vocabularies are essential to have in local languages, where neither the nurses nor their patients can communicate in the languages of most 💊📃 medical texts. Auto mechanics, sports, farming techniques - terminologies set the dictionary to work.

/info/terms

ISO

Kamusi is an official liaison organization to ISO/TC 37, Terminology and other language and content resources, in a consultative capacity for the International Organization for Standardization.

/info/iso

Taxonomy of Species

Plants and animals have local names in the languages of the places they are encountered. Usually, though, it is difficult to be sure whether names in different languages refer to the same species. Kamusi uses international taxonomic designations as a pivot for linking the terms for flora and fauna across languages. When our resources allow, we will incorporate the open Catalogue of Life, the most comprehensive and authoritative 🌎 global index of species currently available. The Catalogue holds essential information on the names, relationships and distributions of over 1.6 million species, which we will match against other 🔢 data sources that link to non-English species names, to seed development of a comprehensive 👅👅👅 multilingual taxonomic dictionary. As with other projects in our pipeline, we can push forward quickly to activate Taxonomy as soon as we can pay for the work.
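The pivot idea can be sketched as follows. This is a hedged illustration with a handful of hand-picked sample records; the record layout and function names are assumptions for the example and do not reflect the Catalogue of Life format or Kamusi's actual schema.

```python
from collections import defaultdict

# Illustrative records: (language code, local name, scientific name).
RECORDS = [
    ("eng", "lion", "Panthera leo"),
    ("swa", "simba", "Panthera leo"),
    ("fra", "lion", "Panthera leo"),
    ("eng", "leopard", "Panthera pardus"),
    ("swa", "chui", "Panthera pardus"),
]


def build_pivot(records):
    """Group local names under their scientific name, then by language."""
    pivot = defaultdict(lambda: defaultdict(set))
    for lang, name, species in records:
        pivot[species][lang].add(name)
    return pivot


def translate(pivot, name, source, target):
    """Find target-language names that share a species with the source name."""
    results = set()
    for names_by_lang in pivot.values():
        if name in names_by_lang.get(source, set()):
            results |= names_by_lang.get(target, set())
    return results


if __name__ == "__main__":
    pivot = build_pivot(RECORDS)
    print(translate(pivot, "simba", "swa", "eng"))
```

Because every link passes through the taxonomic designation, two local names are only paired when they are known to refer to the same species.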

/info/taxonomy

Language: Hebrew

Local name: עברית
Most spoken in: Israel
👪🔊: As many as 9 million speak Modern Hebrew, the language currently documented in Kamusi. Several million read or speak Classical Hebrew, the language of the Torah, to be undertaken in Kamusi when a 💛😇 GOLD Angel steps forward.
Kamusi records:
ISO 639(1) / (3): he / heb
Links: Omniglot | Wikipedia | Glottolog | Ethnologue ($)

Sources: Initial 🔢 data from the Hebrew 🔠🕸 WordNet, developed by the Computational Linguistics Group, Department of Computer Science, University of Haifa, Israel. Additional data from Kamusi participants.

💛😇 GOLD Angel: Nobody.
Pearls: Be the first Pearl for this language!
Join the Hebrew Language Task Force to help plan development of robust resources for the language!

[This page is under permanent construction. We would like to make sure the page stays updated, and add new interesting links and information for our visitors. We invite you to help curate this page!!! Please contact us to volunteer to improve this infobox, or to translate the content.]

/info/heb

Donate in Israeli Shekels



/info/shekels

Names Names Names

Consider Mwalimu Julius Kambarage Nyerere. He was an international statesman and considered the "father" of his country, yet his name is underlined as a red error in Microsoft Word. People 👪, places, and organizations are hard to identify and deal with in dictionaries and translation. The new Kamusi architecture stores and recognizes names, and provides translations when appropriate (for example, Geneva, Genève, Genf, and Ginevra). We have millions of named entities on hand, in dozens of languages, ready to share as soon as we have the muscle to serve the 🔢 data.

/info/names

GOLD Theme: Middle East

Arabize

We have a long-standing partnership with Arabize, an Egyptian SME, for producing localization and general 👅 linguistic resources for Arabic.

/info/arabize

Language: Persian (Farsi)

Local name: فارسی. When speaking in English, Persian is preferred
Most spoken in: Iran
👪🔊: Estimates between 60 and 110 million
Kamusi records: [data queued for upload]
ISO 639(1)/(3): fa/fas
Links: Omniglot | Wikipedia | Glottolog | Ethnologue ($)
Sources: Initial data from the Persian WordNet, University of Tehran, NLP Lab, Tehran, Iran, maintained by Mortaza Montazery. Additional data from Kamusi participants.
💛😇 GOLD Angel: Nobody.
Pearls: Be the first Pearl for this language!
Join the Persian Language Task Force to help plan development of robust resources for the language!

/info/fas

Language: Arabic

Local name: العَرَبِيَّة / عَرَبِيّة
Most spoken in: North Africa, Western Asia, and the Arabian Peninsula
👪🔊: More than 400 million speak one of 30 varieties of contemporary Arabic, and hundreds of millions are familiar with Classical Arabic, the language of the Koran. Modern Standard Arabic is the written version that is the first to be documented in Kamusi. With our innovative approach to dialects, we will be undertaking work on other variants beginning in October 2016.
Kamusi records: 37335
ISO 639(1) / (3): ar / ara
Links: Omniglot | Wikipedia | Glottolog | Ethnologue ($)
Sources: Initial 🔢 data from the Arabic 🔠🕸 WordNet, developed by the TALP Research Center at the Technical University of Catalunya (UPC) in Barcelona. Additional data from Kamusi participants.
💛😇 GOLD Angel: Nobody.
Pearls: Be the first Pearl for this language!
Join the Arabic Language Task Force to help plan development of robust resources for the language!

/info/ara

Association Culturelle Imedyazen

We are working closely with Imedyazen to incorporate their Dictionnaire de la Langue Amazighe within the 👅👅👅 multilingual Kamusi framework.

/info/imedyazen

Language: Berber

Local name: several varieties
Most spoken in: Morocco, Algeria
👪🔊: between 15 and 30 million
Kamusi records: We are incorporating 13000 records from the Dictionnaire de la Langue Amazighe, to be aligned via 🐥📊 DUCKS. This initial 🔢 data is from the Amazigh variety commonly known as Taqbaylit or Kabyle, spoken in northern Algeria.
ISO 639(2 & 5): ber | 639(3) kab
Links: Omniglot | Wikipedia | Glottolog | Ethnologue ($)
Sources: Initial 🔢 data from the Dictionnaire de la Langue Amazighe, maintained by Omar Mouffok.
💛😇 GOLD Angel: Nobody.
Pearls: Be the first Pearl for this language!
Join the Berber Language Task Force to help plan development of robust resources for the language!

/info/kab

Language: Darija (Moroccan Arabic)

Local name: الدارجة
Most spoken in: Morocco
👪🔊: 21 million
Kamusi records: [data queued for upload]
ISO 639(3): ary
Links: Omniglot | Wikipedia | Glottolog | Ethnologue ($)
Sources: Initial 🔢 data from Richard S. Harrell, A Dictionary of Moroccan Arabic: Moroccan-English, Georgetown University Press (1963), aligned with 🔠🕸 WordNet and maintained by Khalil Mrini at EPFL. Additional data from Kamusi participants.
💛😇 GOLD Angel: Nobody.
Pearls: Be the first Pearl for this language!
Join the Darija Language Task Force to help plan development of robust resources for the language!

/info/ary

GOLD Theme: Language Varieties

Language Names Switcher

The Swahili word for the Swahili language is "Kiswahili", the Swahili word for French is "Kifaransa", the French word for French is "français", and the French word for Swahili is "souahéli". Using 🔢 data we have available for several hundred languages, we can make selection tools for developers to install in their software, to ease language selection in any technology application. A 😷 doctor in Thailand, for example, might need to select the Thai word for Flemish to configure a medical app with terms for a patient visiting from Belgium. We need full lists for each language, and also a single list that only contains the term each language uses for itself, since someone who wants Chinese will often be looking for 中文, not the English or Swahili name for it. To help develop or fund the Language Names Switcher, please contact us!

/info/language_names
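The two kinds of list described for the Language Names Switcher can be sketched with a small lookup table. This is a minimal illustration using only the names quoted in this box; the table layout, the use of ISO 639-3 codes as keys, and the function names are assumptions for the example.

```python
# NAMES[(language being named, language doing the naming)] = name.
# Keys are ISO 639-3 codes; the entries are the examples quoted in this box.
NAMES = {
    ("swa", "swa"): "Kiswahili",
    ("fra", "swa"): "Kifaransa",
    ("fra", "fra"): "français",
    ("swa", "fra"): "souahéli",
    ("zho", "zho"): "中文",
}


def names_in(ui_lang):
    """Full list for one interface language: every language it has a name for."""
    return {named: name for (named, namer), name in NAMES.items() if namer == ui_lang}


def autonyms():
    """Single list holding only the name each language uses for itself."""
    return {named: name for (named, namer), name in NAMES.items() if named == namer}


if __name__ == "__main__":
    print(names_in("swa"))
    print(autonyms())
```

A language picker would typically show `autonyms()` by default, and switch to `names_in(...)` when the person configuring the software is not the person who will read it, as in the doctor example.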

A New Solution for Dialects

Language varieties, loosely known as "dialects", have long created enormous 👅 linguistic and political problems for lexicography. A dictionary that tries to propose one language variety as standard will be roundly rejected by all other groups. However, a dictionary that tries to include multiple varieties runs the risk of quickly becoming unusable. Our unique 🔢 data design overcomes the obstacles of defining boundaries between dialects, allowing terms to be associated with geographic ranges instead of fixed groups. The model should work equally for any language that has a diversity of written and spoken forms, from Swiss German to Berber to Arabic.
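One way to picture terms tied to geographic ranges rather than to named dialect groups is the sketch below. It is an assumption-laden illustration, not the Kamusi data design: ranges are modeled as crude latitude/longitude boxes, the box coordinates are rough, and all class and function names are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeoRange:
    """A rough bounding box in decimal degrees (illustrative only)."""
    south: float
    west: float
    north: float
    east: float

    def contains(self, lat: float, lon: float) -> bool:
        return self.south <= lat <= self.north and self.west <= lon <= self.east


@dataclass(frozen=True)
class GeoTerm:
    """A term for a concept, valid wherever one of its ranges applies."""
    concept: str
    term: str
    ranges: tuple


# Approximate boxes for Great Britain and the contiguous United States.
BRITAIN = GeoRange(49.9, -8.6, 60.9, 1.8)
USA = GeoRange(24.5, -125.0, 49.4, -66.9)

TERMS = [
    GeoTerm("lifting cabin in a building", "lift", (BRITAIN,)),
    GeoTerm("lifting cabin in a building", "elevator", (USA,)),
]


def terms_at(concept, lat, lon, terms=TERMS):
    """Return the terms for a concept that are in use at a given location."""
    return [
        t.term
        for t in terms
        if t.concept == concept and any(r.contains(lat, lon) for r in t.ranges)
    ]


if __name__ == "__main__":
    print(terms_at("lifting cabin in a building", 51.5, -0.13))   # London
    print(terms_at("lifting cabin in a building", 40.7, -74.0))   # New York
```

Since a term can carry several ranges and ranges can overlap, no hard boundary between "dialects" ever has to be declared.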

/info/dialects

Language Identification Wheels

About 7000 languages have official 🔤 three-letter designations, called ISO 639-3 codes. These codes are not easy for most 👪 people to use for identifying language settings on apps or websites. What does a code like eus mean to you? Many websites choose flags for their language selectors, but that is a horrible solution. Does 🇨🇭 mean German, French, or Italian? Does English get 🇺🇸, 🇬🇧, or maybe 🇦🇺? We propose a simple, universal solution. Dividing a circle into 5 sections, we can fill each wedge with one of 9 colors, yielding 59,049 (9^5) possible combinations, far more than the roughly 7000 we need. We can assign some languages manually (Chinese gets all red, Arabic gets solid green, English gets two reds, two blues, and a white...), while randomly generating 7000 distinct symbols for languages that do not have a technological identity. Yes, 👪 people will have to learn to recognize their wheels, but this is no harder than recognizing our national flags. This is a quick project that can solve a persistent problem for websites and their users 🌍 worldwide. Can you help code it or fund it? Contact us!
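A rough sketch of how the wheels could be counted and assigned is shown below. Counting each wedge position separately, 5 wedges and 9 colors give 9^5 = 59,049 distinct fills, comfortably more than 7000. The palette, the wedge order for English, the placeholder language codes, and the function names are all assumptions made for the example.

```python
import random

# Hypothetical palette of 9 colors; the actual palette is not specified here.
COLORS = ["red", "orange", "yellow", "green", "blue",
          "purple", "black", "white", "brown"]
WEDGES = 5

# Manual assignments mentioned in the text (wedge order for English is assumed).
RESERVED = {
    "zho": ("red",) * WEDGES,
    "ara": ("green",) * WEDGES,
    "eng": ("red", "red", "blue", "blue", "white"),
}


def total_wheels(wedges=WEDGES, colors=len(COLORS)):
    """Number of distinct fills when every wedge position is counted separately."""
    return colors ** wedges


def assign_wheels(codes, reserved=RESERVED, seed=0):
    """Give every code a unique wheel, keeping the manual assignments."""
    rng = random.Random(seed)
    assignment = dict(reserved)
    used = set(reserved.values())
    for code in codes:
        if code in assignment:
            continue
        # Draw random wheels until we hit one nobody else has.
        while True:
            wheel = tuple(rng.choice(COLORS) for _ in range(WEDGES))
            if wheel not in used:
                break
        used.add(wheel)
        assignment[code] = wheel
    return assignment


if __name__ == "__main__":
    placeholder_codes = [f"lang{i}" for i in range(7000)]
    wheels = assign_wheels(placeholder_codes)
    print(total_wheels(), len(wheels), len(set(wheels.values())))
```

If wheels that differ only by rotation should count as the same symbol, the usable total shrinks, so a real implementation would need to fix a starting wedge or deduplicate rotations.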

/info/wheels

Bisharat

We have worked for over a decade with Bisharat on research, advocacy, and networking relating to use of African languages in software and 🕸 web content.

/info/bisharat

Dictionaries Dictionaries Dictionaries

We are constantly adding new languages to the Kamusi Global Online Living Dictionary! Watch the video above to see how the number of dictionaries available to you increases for each new language we bring into the Kamusi system.

We also maintain a full list of active KamusiGOLD dictionaries. Be warned, this is a long list, and we do not actually recommend that experienced users visit the page. It is intended as a site index, a navigation beacon for search engine users who conduct a general search for a specific language or language pair. Since you have already found Kamusi, you will be happiest selecting your languages from the search bar on the page you are reading now, as your fastest way to the words you want.

To visit any Kamusi GOLD dictionary directly, you can enter the address in this format: https://kamusigold.org/dictionary/lang1/lang2
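For developers who want to link to a dictionary pair programmatically, the address format can be filled in as sketched below. The helper name is invented for the example, and the values "swa" and "eng" are placeholders: this page does not state what identifiers the site expects in place of lang1 and lang2.

```python
from urllib.parse import quote

BASE_URL = "https://kamusigold.org/dictionary"


def dictionary_url(lang1: str, lang2: str) -> str:
    """Fill the lang1/lang2 slots of the address format given above."""
    if not lang1 or not lang2:
        raise ValueError("both language identifiers are required")
    # Percent-encode each slot so unusual characters cannot break the path.
    return f"{BASE_URL}/{quote(lang1, safe='')}/{quote(lang2, safe='')}"


if __name__ == "__main__":
    # Placeholder identifiers, not confirmed values for the live site.
    print(dictionary_url("swa", "eng"))
```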

/dictionary_list

KamusiGOLD Bilingual Dictionaries

Our bilingual dictionaries usually provide the most precise equivalent terms between two languages that you can find, for any two languages that have NOT been paired by human editors. For a pair like Spanish-Portuguese, you should seek a professionally curated source. For most language pairs, however, Kamusi's concept-based alignment will give you the most accurate results available anywhere. This is because we start by lining up a term in Language 1 with a term that people have already determined to mean the same thing in English or another language in our system, and then bridge that to terms that speakers of Language 3 have associated with the same meaning. Contrast this method to Google Translate and other multilingual "translation" services, which make guesses about the meaning you are looking for between Language 1 and Language 3 based on the spelling of their favorite English translation from Language 1, and you should find that Kamusi's term-to-term results are much more reliable.
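The difference between bridging on meaning and bridging on spelling can be sketched with a toy dataset. This is an illustration only: the concept identifiers and function names are invented, and `spelling_pivot` is a deliberately naive stand-in for spelling-based bridging, not a description of how any particular translation service works.

```python
# Sense-tagged entries: (language, term, concept id). Concept ids are hypothetical.
ENTRIES = [
    ("eng", "light", "light.not_dark"),
    ("eng", "light", "light.not_heavy"),
    ("fra", "lumière", "light.not_dark"),
    ("fra", "léger", "light.not_heavy"),
    ("deu", "Licht", "light.not_dark"),
    ("deu", "leicht", "light.not_heavy"),
]


def concept_pivot(term, source, target, entries=ENTRIES):
    """Bridge through the shared concept, so the meaning is carried along."""
    concepts = {c for lang, t, c in entries if lang == source and t == term}
    return sorted({t for lang, t, c in entries if lang == target and c in concepts})


def spelling_pivot(term, source, target, bridge="eng", entries=ENTRIES):
    """Naive bridge: keep only the bridge-language spelling, dropping the sense."""
    bridge_terms = set(concept_pivot(term, source, bridge, entries))
    # Every sense of the bridge spelling is now in play, right or wrong.
    concepts = {c for lang, t, c in entries if lang == bridge and t in bridge_terms}
    return sorted({t for lang, t, c in entries if lang == target and c in concepts})


if __name__ == "__main__":
    print(concept_pivot("lumière", "fra", "deu"))
    print(spelling_pivot("lumière", "fra", "deu"))
```

In this toy data the concept bridge returns only the "not dark" term, while the spelling bridge also drags in the "not heavy" term because both senses share the English spelling "light".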

However, Kamusi is not perfect, and we advise you to use our bilingual dictionaries with caution. Specifically:

/info/bilingual

GOLD Theme: History

Hieroglyphs

After cave paintings came hieroglyphs, about 5000 🖼 images that Egyptians regularized so that they could be sure of communicating consistent messages on carvings and papyrus. Over 1000 of those have been standardized by Unicode so that you can use them across 📱 devices, if you have a supporting font. The original meanings of these images are listed, in English only, in an obscure online table. We will soon expand our 😂🌎🤖 EmojiWorldBot system to hieroglyphs, with four consequences:
  1. The original tags will make hieroglyphs more easily available for everyone to discover and use from English.
  2. As people play the translation 🎮 games in the 🤖 Bot, hieroglyphs will also open to speakers of many other languages.
  3. As people add tags through the 🤖🎮 Bot games, the hieroglyph images can take on larger communicative capacity. For example, might people tag and start using this hieroglyph, 𓋃 , as shorthand for McDonald's 🍽 restaurants?
  4. The original hieroglyphs will provide a basis for an in-built concept dictionary of Ancient Egyptian, an important language for historical 👅 linguistics.
We have developed all the systems to link the hieroglyph set through 🐥📊 DUCKS and adapt 😂🌎🤖 EmojiWorldBot for this character set. Implementation is on our task list, when we have the time and financial strength.

/info/hieroglyphs

Language: Latin

Local name: latīna
Most spoken in: Catholic Church. Official language of Vatican City.
👪🔊: The historical language of the Roman Empire, today studied for religious and scholarly purposes at educational institutions throughout the world.
Kamusi records: [data queued for upload]
ISO 639(1)/(3): la/lat
Links: Omniglot | Wikipedia | Glottolog | Ethnologue ($) | Lexicon Valley: Finding Life in a Dead Language

Sources: Initial data from Latin WordNet, maintained by Stefano Minozzi. Additional data from Kamusi participants.

💛😇 GOLD Angel: Nobody.
Pearls: Be the first Pearl for this language!
Join the Latin Language Task Force to help plan development of robust resources for the language!

/info/lat

Time

Language is constantly changing. Words take on new meanings, 🔊 sounds shift, old uses disappear. Eventually, new languages emerge - Afrikaans forked from Dutch, for example - and old languages fade away. Where written records exist, we can pinpoint the earliest sightings of a term, and mark that usage until it drops from view.

By approaching terms at the sense level, we can treat terms as they were used in different eras, such as when "light" could be a wanton woman (Shakespeare's time) or when it became a traffic signal (first confirmed usage 1938). One reason this information is important is that language change provides clues to human history, such as trade or conquest between 👪 peoples, the way tree rings reveal environmental events. The Kamusi architecture marks known times for a term to be used in a certain way, making it possible to trace flows, for example, from modern English back through Old English and Proto-Germanic to Proto-Indo-European, and from there across and forward to languages as diverse as Sanskrit, Greek, and Welsh.
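Marking known times at the sense level can be pictured with the sketch below. It is a minimal illustration, not the Kamusi data model: the class and function names are invented, and the 1590 and 1700 bounds for the older sense are placeholder years standing in for "Shakespeare's time", not attested dates. Only the 1938 date comes from the text above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Sense:
    """One meaning of a term, with the years it is known to have been in use."""
    term: str
    gloss: str
    first_seen: int
    last_seen: Optional[int] = None  # None means still in use


SENSES = [
    # Placeholder bounds for "Shakespeare's time"; not attested dates.
    Sense("light", "wanton woman", 1590, 1700),
    # First confirmed usage 1938, per the text above.
    Sense("light", "traffic signal", 1938),
]


def senses_in_year(term, year, senses=SENSES):
    """Return the glosses of a term that were in use in a given year."""
    return [
        s.gloss
        for s in senses
        if s.term == term
        and s.first_seen <= year
        and (s.last_seen is None or year <= s.last_seen)
    ]


if __name__ == "__main__":
    for year in (1600, 1800, 2000):
        print(year, senses_in_year("light", year))
```

Because each sense carries its own time span, a reader of a 1600 text and a reader of a 2000 text can be shown different meanings for the same spelling.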

We contend that our system will eventually open up much more 👅🎓 linguistic knowledge than can currently be gleaned from today's best references. This is how the Oxford English Dictionary treats the etymology of "light":

Old English léoht strong neuter (later lĕoht , Anglian lēht , early Middle English lĭht ) corresponds to Old Frisian liacht , Old Saxon lioht (Dutch licht ), Old High German lioht (Middle High German lieht , modern German licht ) < Old Germanic *leuhtom < pre-Germanic *leuktom (also *leukotom , whence Gothic liuhaþ ; for the suffix compare naked adj.), < Aryan root *leuk- to shine, be white.

Our design will make each of those ancestor terms an individual entry, pegged to meanings and other information in the language that was spoken at the time, with other relations linked as they are known. This has never been done; we propose that it will prove useful. While we do not have immediate plans to undertake detailed historical work, our platform can support it now. When sponsors and partners are ready to jump in, our time machine is ready to take them.

/info/time

Historical and Comparative Linguistics

Our 🔢 data structure opens up new possibilities for 👅🎓 linguistic scholarship. For comparative linguistics, Kamusi enables researchers to view equivalent terms across languages, for example comparing expressions among Romance languages along the Mediterranean. For historical linguistics, we aim to produce dictionaries of past languages, with each of their words as dictionary entries in their own right. Rather than merely displaying the etymology of a term, our system will let scholars trace a word's evolution, following its roots and branches through time and space. This is a long-term objective, beginning with aligned data from Latin as our first ancestor language, that we look forward to pursuing with research partners.

/info/linguistics

History of Kamusi

The Kamusi Project arose from a student's frustrations at learning the Swahili language. Martin Benjamin was learning Swahili to prepare for his Anthropology PhD research. While on a Fulbright intensive language studies program in Tanzania in 1993, he was regularly stymied by the old Swahili dictionaries then available, which were confusing and incomplete. He read about a project that had used something called “the Internet” to parcel out the work of breaking supposedly uncrackable cryptographic code, and thought a similar process could be applied to writing a new Swahili dictionary. He mentioned the idea to Ann Biersteker, his Swahili professor at Yale University, who encouraged him to write a proposal to the local branch of the Consortium for Language Teaching and Learning (CLTL). The Consortium approved the proposal in autumn 1994. In December of that year, in the same week that Netscape released the first “web browser,” the Kamusi Project was born.

The first step was for Benjamin to enter about three thousand terms into a spreadsheet, copied with permission from existing learners' glossaries. He then divided those terms into packs of 100 and put those files on a “gopher” server that people could access via a command line interface and dial-up modem. The intent was for volunteers to each expand one pack with new terms, and to keep subdividing the packs as contributions rolled in. That idea never really worked, however, because the process was too cumbersome and the number of Swahili enthusiasts using computers was too small. Instead, the project received copyright permission for a large out-of-print dictionary by Charles Rechenbach, was awarded a larger grant from the full CLTL, and concentrated on data entry and the development of a website (Yale's first in the social sciences or humanities) to distribute the results to the public.

In 1996, Dr. Biersteker was awarded funding for the project from the United States Department of Education's International Research and Studies program (IRS). This grant enabled the development of the “Edit Engine,” a tool that makes it possible for anyone to help edit dictionary entries. The Edit Engine went live in 1999, a year before Wikipedia began with a similar model (and with the important difference that all Kamusi changes must be approved by an editor before becoming public). At the same time, data became available through a searchable online database, rather than having to be downloaded as text or Excel files.

A second IRS grant in 2003 supported many additional features, such as a photo uploader for users to illustrate dictionary entries with appropriate images, a parser to return useable dictionary entries from conjugated verbs, and a grouping tool to organize entries according to priority and sense. By 2006, the Kamusi Project was being used about a million times a month by 60,000 unique visitors.

2007 marked a major transition for the project, which had run out of funding. Benjamin had left Yale, where the project was still housed, and moved to Lausanne, Switzerland for family reasons. Several interesting potential partnerships were emerging around the idea of expanding the Kamusi model to other African languages. However, these projects for the international public were better housed at an institution devoted specifically to the cause of language development. It was decided to move the project to the care of the non-profit World Language Documentation Centre, based in Wales, as an interim home while steps were taken to incorporate Kamusi independently. The online presence was established as kamusiproject.org, and then kamusi.org when that name was donated by its original registrant.

Incorporating Kamusi was completed in 2010. The organization is actually two legally independent non-profit entities: Kamusi Project USA for American-based activities and Swiss-based Kamusi Project International for projects with the rest of the world. Our US status makes it possible for Americans, historically Kamusi's most generous supporters, to continue contributing to our work. At the same time, Swiss incorporation facilitates work with partners throughout Africa, due to Switzerland's special open relations with most of the world. The two organizations have independent boards and completely separate accounting. Dr. Benjamin now serves as Executive Director of both NGOs.

Between 2007 and 2013, the Kamusi Project embarked on several exciting new initiatives:

In 2013, Kamusi joined the Distributed Information Systems Laboratory (LSIR) at EPFL in Switzerland. LSIR has provided a home for numerous exciting technical developments. However, financial viability has continued to elude us. In 2015, our move to a much larger and more intricate data model caused our server to crash, and we did not have the financial wherewithal to get back online for more than a year. In 2016, we stepped away from our limping big-data machine, and began offering the public a restricted service that provides the most accurate vocabulary translations available anywhere for numerous language pairs, while we seek a funding path that will enable us to offer the many services that have been developed behind the scenes at EPFL.

The history of the Kamusi Project has been one of both innovation and struggle. Funding resources for "exotic" languages are few and far between, and the project has found that it is very difficult to make progress unless key partners can be remunerated for their time. Nonetheless, the Kamusi Project has pressed forward and is now in a technical and regulatory position to provide advanced services for a great many languages spoken around the world. Many new and innovative projects are now in the pipeline, with partners from countries on every continent. The next chapters of this history are poised to be written.

Here are some annual highlights:

1993: Project conceived as a way to use collective resources to create new tools for learning Swahili.
1994: First proposal submitted, November. First glossary (3,000 words) begun, December.
1995: Gopher site established, January. Website established, April - first website in the social sciences or humanities at Yale. Wordlists incorporated from many remote contributors. 21,000 entry dictionary posted, September.
1996: Data entry to incorporate Rechenbach's Swahili-English Dictionary.
1997: Data editing.
1998: Programming work begins on Edit Engine. Swahili-Russian dictionary posted.
1999: 56,000 entry dictionary posted, Discussion Forum established, Africa Guide established.
2000: Revised dictionary posted, Edit Engine launched, April.
2001 - 2002: Project has no funding. Development work slows to a crawl, though Edit Engine submissions regularly incorporated into Kamusi lexicon.
2003: Renewed funding begins late July. Development work begins on Learning Guide.
2004: Move to faster, more secure server completed, March. Photo Upload feature introduced, May. Enabled search of plural forms, June. Begin formal collaboration with University of Dar es Salaam Department of Computer Science to establish a mirror server in Tanzania and incorporate computer terminology into the Kamusi lexicon, October. Launch complete site redesign, November. Introduce specialized vocabulary features, November. Continue work on Learning Center.
2005: Introduce the Grouping Tool to arrange dictionary entries. Add new data fields for terminology, dialect, taxonomy, derivation, related words, English definitions, and alternate spellings. Migrate to a more stable and flexible software platform. Improve search and display features. Add user conveniences, including more direct access to the Edit Engine.
2006: Funding runs out in January, project staff furloughed. Work continues with the help of private donations, including a generous grant from the Negaunee Foundation. The Kamusi Parser is introduced that allows users to search and evaluate conjugated Swahili verbs directly within the search engine.
2007: Project is moved from Yale to the World Language Documentation Centre and development work continues with the support of private donations.
2008: National Endowment for the Humanities grant to Grambling University to begin work within Kamusi for expanding the model to multiple languages, with a focus on Kinyarwanda. This grant was subsequently transferred directly to Kamusi after we completed our incorporation as a US legal non-profit corporation.
2009: Incorporation of Kamusi Project USA as a 501(c)(3) non-profit organization registered in Delaware, and Kamusi Project International as a non-governmental organization with the equivalent status registered in Geneva.
2010: Development of KamusiTERMS participatory terminology system and production of localization terminologies in 12 African languages, with the African Network for Localization, IT46, Translate House, and the support of IDRC in Canada. New logo unveiled.
2011: Begin work with University of Ngozi in Burundi on Kirundi language, in association with Universidad Politécnica de Madrid, with students receiving stipends in exchange for working on Kirundi entries.
2012: Programming of multilingual platform with Telamenta in South Africa.
2013: Launch of multilingual pilot, with 100 parallel terms defined in 20 languages, demonstrates that the new multilingual system works and has the potential to scale for unlimited additional languages. However, with no funding for continued language work, linguistic development grinds to a halt. In September, Kamusi joins the Distributed Information Systems Laboratory (LSIR) at EPFL in Switzerland, with support for certain technical development. In November, Kamusi is recognized as a launch partner in the White House Big Data Initiative.
2014: Focus on technical development, including games and mobile apps for engaging the public in the production of linguistic data.
2015: Our Big Data Beta introduces 1.2 million new interlinked records in more than 20 languages, proving Kamusi's capacity to scale. Work is launched on Vietnamese. Server crash in September knocked the site offline to the public for about a year.
2016: Public access moved to kamusigold.org while resources sought to restore full services on the main Kamusi site. Introduction of DUCKS shows the way Kamusi will align data across hundreds or thousands of languages. Launch of Kamusi Here! puts the world's most advanced multilingual dictionary search in the hands of users worldwide.

/info/history

GOLD Theme: Global Networks

UNESCO Multilingualism in Cyberspace

Kamusi participates in expert meetings with UNESCO and the organizations they have brought together for their initiative on Multilingualism 👅👅👅 in Cyberspace. In their words:

Language constitutes the foundation of communication and is fundamental to cultural and historical heritage.

Increasingly, information and 🎓 knowledge are key determinants of wealth creation, social transformation and human development. Language is the primary vector for communicating 🎓 knowledge and traditions, thus the opportunity to use one’s language on 🌎 global information networks such as the Internet will determine the extent to which one can participate in emerging 🎓 knowledge societies. However thousands of languages 🌍 worldwide are absent from Internet content and there are no tools for creating or translating information into these excluded tongues. The way how one accesses Internet sites through domain names is also principally limited to the use of Latin script.

Huge sections of the 🌍 world’s population are thus constrained in enjoying the full benefits of technological advances and obtaining information essential to their wellbeing and development. Unchecked, this will contribute to a loss of cultural diversity on information networks and a widening of existing socio-economic inequalities.

By supporting the development of 👅👅👅 multilingual cyberspace, UNESCO promotes wider and more equitable access to information networks and at the same time offers possibilities through ICT for the preservation of 👅🔫 endangered languages.



/info/unesco

The Long Now Foundation

Long Now has collected 25 million expressions, in 11,000 language varieties, in a system called PanLex. As part of the Human Languages Project, we intend to disambiguate the meanings of those expressions, and make them useful as 🔢 data for advanced language technology applications. When we can cover the costs, our task is to take the 1.3 billion PanLex crude data connections, which are based on spelling, and refine them into tens of billions of valid second-degree translations among languages.
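A second-degree translation can be pictured as a pair of expressions that are not linked directly but share a common neighbor. The Python sketch below is only an illustration under stated assumptions: the toy `links` data, the language codes, and the function name `second_degree_candidates` are hypothetical and are not taken from PanLex or Kamusi. It produces candidates only; because the underlying links are based on spelling, each candidate would still need disambiguation before it could be called valid.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical first-degree links between (language, expression) pairs.
links = [
    (("eng", "light"), ("fra", "lumière")),
    (("eng", "light"), ("swa", "nuru")),
    (("eng", "light"), ("fra", "léger")),
]

def second_degree_candidates(links):
    """Return (x, y, pivot) triples where x and y share the neighbor `pivot`
    but are not themselves directly linked."""
    neighbors = defaultdict(set)
    for a, b in links:
        neighbors[a].add(b)
        neighbors[b].add(a)
    candidates = set()
    for pivot, linked in neighbors.items():
        for x, y in combinations(sorted(linked), 2):
            # Skip same-language pairs and pairs that are already linked.
            if x[0] != y[0] and y not in neighbors[x]:
                candidates.add((x, y, pivot))
    return candidates

candidates = second_degree_candidates(links)
```

In this toy data the spelling "light" connects "nuru" to both "lumière" and "léger", so one of the two candidates is wrong. That is the kind of crude connection that sense-level refinement is meant to filter out.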

/info/longnow

Place

The "same" language often has different terms and 🔊 sounds across the regions where it is spoken. Most dictionaries attempt to paper over these variations by positing a "standard" form, usually that spoken by a particularly powerful segment of the society. We prefer to chronicle the richness of expression within each language, without making value judgments about purity and corruption. We contend that any expression that shares a communicative message for a significant number of 👪 people can be documented and geo-tagged. We do not say that a term in Egyptian Arabic is "right" while one in Moroccan Arabic is "wrong" - we show that one is right in Egypt, the other in Morocco, and perhaps both are wrong in Iraq. With such an approach, we can tailor future language technologies to local vocabularies. This has been impossible with previous constrained approaches to dialect and geographic variation, and thus, beyond a few broad strokes (UK vs. US English, Iberian vs. Brazilian Portuguese), has never before been attempted.

/info/place

Country Names Switcher

Problem: Every language has its own names for each country, but that information is not available to software developers for 👪 people to flip their local settings. Solution: We have access to country names for hundreds of languages. All we need to do is line up that 🔢 data and build some tools that developers can plug into their code, so that when users switch languages, they see the right list of countries. The tricky bit will be to keep the switcher up-to-date as new information comes into Kamusi. To help develop or fund the project, please contact us!
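As a rough idea of what such a plug-in tool might look like, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `COUNTRY_NAMES` sample data, the `country_list` function, and the fallback behavior are hypothetical and do not describe an existing Kamusi API.

```python
# Hypothetical sample data: ISO 3166-1 alpha-2 code -> {language code: name}.
COUNTRY_NAMES = {
    "DE": {"en": "Germany", "fr": "Allemagne", "sw": "Ujerumani"},
    "KE": {"en": "Kenya", "fr": "Kenya", "sw": "Kenya"},
    "CH": {"en": "Switzerland", "fr": "Suisse"},  # no "sw" entry yet
}

def country_list(language, fallback="en"):
    """Return (code, name) pairs for `language`, sorted by localized name.

    When a name is missing, fall back to `fallback`, then to the bare code.
    """
    rows = []
    for code, names in COUNTRY_NAMES.items():
        name = names.get(language) or names.get(fallback) or code
        rows.append((code, name))
    return sorted(rows, key=lambda row: row[1].casefold())

french = country_list("fr")
swahili = country_list("sw")
```

Keeping the names in a plain mapping keyed by country code means new data arriving from the dictionary could be merged in without touching the lookup logic, which is one possible way to approach the keep-it-up-to-date problem.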

/info/countries

MAAYA

We work with many partners within the 🌍👅 World Network for Linguistic Diversity, a multi-stakeholder network that involves civil society, governments and international organizations in the reduction of the 🎓➗ knowledge divide promulgated through drastically unequal access to linguistic resources.

/info/maaya

Non-US Residents:
Donate in US Dollars

Or, you can make a bank transfer in US Dollars:
Account Holder: Kamusi Project International
Bank: UBS SA
Address: 1202 Geneva, Switzerland
IBAN: CH46 0027 9279 2085 9060 X
BIC/ Swift: UBSWCHZH80A




/info/dollars_international

Long Term Infrastructure

High-quality digitized linguistic 🔢 data, of the caliber needed for learning or technology, does not exist for most languages. This is not because such data cannot be collected, but because nobody has taken the initiative to do so. Of course, none of this data existed 40 years ago in ways that could be used, or were even imagined, for English - 👪 people simply put in the effort to assemble resources for the language with the biggest immediate payoff.

Until twenty years ago, Bogota, Colombia, was snarled in traffic, like many cities in the world. Then they made a plan, made a network of special bus routes, and greatly reduced their transport problems. Nairobi never developed a twenty-year plan and lives in perpetual gridlock. Crafting the infrastructure for an excluded language needs long term planning, like crafting a fine whisky needs decades of patience.

As with effective mass transit, or starting a high-end distillery, developing high-quality linguistic resources for a language within the Kamusi system is more a matter of commitment than technology. We know the tools and we have the techniques; for example, rather than reinventing speech recognition technology in order for computers to recognize the spoken words of Fula, all that is needed is to collect the data about the language's terms and sounds, and then train existing technology using that data. When seen in a purely technical light, any language can join technology at the cutting edge, as long as good data can be collected in a compatible format. Creating infrastructure for one language or 7000 is a matter of planning and commitment to producing the 🔢 data over the course of years. To help develop or fund the project with an eye to the long term, please contact us!

/info/longterm

The Human Languages Project

A 🌎👅🔢🛣 global linguistic data infrastructure for societal and technological applications

The quest for Human Language Technology (HLT) development faces three large constrictions:

  1. Adequate 🔢 data does not exist in codified form for most languages.
  2. No system exists to effectively and consistently share existing 🔢 data.
  3. Technical tools often do not span languages or projects.
We are organizing a 🌎👪 global consortium to address the challenges of linguistic 👅🔢 data head on, to do for language what the Human Genome Project has done for genetics and the Human Brain Project is doing for neuroscience. The partners in this collaboration will produce 🔢 data systematically for dozens or hundreds of languages, and set that data to the service of 🎓 knowledge, communication, and HLT. The signatories seek to participate in the creation of a global infrastructure for linguistic data, either through direct 👅📃 language documentation or through the use of that 🔢 data for the development of advanced technologies and social services.

To date, we have more than 60 letters of intent from interested partners. In fact, we paused taking new letters because, while many professionals 🌍 worldwide recognize the vast potential advantages of joining this collaboration, we have been unable to find a visionary financier or institutional grant source that will support languages outside of the usual lucrative suspects.

You are invited to read the collaboration proposal. Linguists and language technology professionals who are interested in joining, please contact the organizer!

/info/hlp

African Academy of Languages

ACALAN is the official intergovernmental agency of the 52-nation African Union responsible for African language development and policy. Kamusi and ACALAN are working together to develop a technology platform that will serve the 2000 languages spoken around the continent.

/info/acalan

Toutes les langues de la planète

by Sara Sahli. Published in: La Côte, L'Express, L'Impartial, Le Nouvelliste, Le Journal du Jura. /info/toute

Baseball Jersey

Great shirt for fall and spring, also available in red and blue

/info/jersey

Digital Language Diversity: Seeking the Value Proposition

Full Article presented at the 2nd Workshop on Collaboration and Computing for Under-Resourced Languages, Portoroz, Slovenia, 23 May, 2016.

This article is a response to the CCURL workshop call for discussion about issues pertaining to the creation of an Alliance for Digital Language Diversity. As a global project, Kamusi has been building collaborative relationships with numerous organizations, becoming more familiar than most with global activities and the global funding situation for less-resourced languages. This paper reviews the experiences of many involved with creating or using digital resources for diverse languages, with an analysis of who finds such resources important, who does not, what brings such resources into existence, and what the barriers are to the wider development of inclusive language technology. It is seen that practitioners face obstacles to maximizing the effects of their own work and gaining from the advances of others due to a funding environment that does not recognize the value of linguistic resources for diverse languages, as either a social or economic good. Proposed solutions include the normalization of the expectation that digital services will be available in major local languages, international legal requirements for language provision on par with European regulations, involvement of speaker communities in the guided production of open linguistic resources, and the formation of a research consortium that can together build a common linguistic data infrastructure. /info/diversity

World's Only Swahili Watch ⌚

On the equator, 🕖 1:00 is the first hour after sunrise, because the length of every day is consistent all year long. Our unique ⌚ watch shows the numbers the way Swahili speakers 👪🔊 tell time. More colors available!
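The shift between the two ways of counting hours can be written as a small conversion. This Python sketch assumes that the count starts at 6:00 on a standard clock, so that 7:00 becomes hour 1; the function name `swahili_hour` is made up for this example.

```python
def swahili_hour(clock_hour):
    """Convert a standard clock hour (0-23) to the hour as counted in Swahili time.

    Assumption: counting restarts at 6:00 and 18:00, so 7:00 and 19:00 are hour 1.
    """
    if not 0 <= clock_hour <= 23:
        raise ValueError("clock_hour must be between 0 and 23")
    hour = (clock_hour - 6) % 12
    # Hour 0 of the cycle is shown as 12 on a watch face.
    return 12 if hour == 0 else hour
```

For instance, `swahili_hour(7)` gives 1 and `swahili_hour(12)` gives 6, matching a watch face whose numbers are rotated by six hours.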

/info/orange_watch

GOLD Theme: Emergencies

Kamusi Help!

When an emergency strikes, first responders who rush to the scene know what they need to say, but they never know what languages they will need when they get there. With Team Translation, Help! will provide vocabulary for numerous emergency situations, that will be available across numerous languages. Childbirth, poisoning, fire, earthquake... We can share Kamusi Help! for 🆓 free on the phones of emergency responders everywhere, if a 💛😇 GOLD Angel can help develop and serve the 🔢 data. The potential for our language tools to save lives: one of our highest wishlist priorities!

/info/emergencies

Translators Without Borders

Translators Without Borders works in emergency and development situations with numerous languages that are unserved or underserved by the technology resources that would ease their vital translation work. We are seeking the funds that will enable TWB to join the Human Languages Project, using their growing corpus of translations to seed advanced language resources needed in humanitarian situations.

/info/twb

Miranda Rights

"You have the right to remain silent". Variations of this warning are legal requirements within the justice procedure of many countries. Police officers might encounter 👪 people speaking any language, but they have no way to issue warnings in languages they don't know. We can fix that, and help make sure justice is served, with a simple app. An arresting officer chooses their jurisdiction (for the required wording) and the language of the suspect, and our Miranda app can show them the 📃 text or have a recording played by a native speaker 👪🔊. Miranda is an early instance of specific tools we can develop for controlled vocabularies, which will help us refine our methods for gathering specialized translations. To help develop or fund Miranda, please contact us!

/info/miranda

💊🌎 Sante Globale

Sante Globale 💊🌎 is a small Kinshasa-based non-governmental organization that will aid the Human Languages Project in health and development terminology for the languages of Congo.

/info/sante_globale

GOLD Theme: Africa - Great Lakes

University of Ngozi

The 👪 people of Burundi speak the Kirundi language, but almost no resources are available for their children to begin learning in the language they grow up with. This makes school very difficult, since they have to learn everything from math to business in a foreign language. For several years, Kamusi and students at University of Ngozi have been building tools for Kirundi, which will be available for free 🆓 to children throughout the country. The initial work was funded with support from the US National Endowment for the Humanities. We were able to produce a promising preliminary 🔢 dataset before we ran out of funds to pay the student stipends necessary to continue the work. Please join our special Together for Burundi campaign on 🌍🎁 Global Giving to help resume the collaboration, which will provide free language resources to an extremely underserved community of nearly 9 million people.

/info/ngozi

Language: Kirundi (Rundi)

Local name: íkiRǔndi
Most spoken in: Burundi
👪🔊: 8 to 9 million
Kamusi records: 1,029
ISO 639(1) / (3): rn / run
Links: Omniglot | Wikipedia | Glottolog | Ethnologue ($)
Sources: Data 🔢 from Kamusi participants, most especially students working with the project at University of Ngozi.
💛😇 GOLD Angel: Nobody.
Pearls: Be the first Pearl for this language!
Join the Kirundi Language Task Force to help plan development of robust resources for the language!

Our Kirundi 🔢 data is currently offline until we can sustain our system. Please join our special Together for Burundi campaign on 🌍🎁 Global Giving to help resurrect work on this 🆓 free language resource for a community of nearly 9 million 👪 people.

[This page is under permanent construction. We would like to make sure the page stays updated, and add new interesting links and information for our visitors. We invite you to help curate this page!!! Please contact us to volunteer to improve this infobox, or to translate the content.]

/info/run

🌍🎁 Global Giving - Together for Burundi

The 👪 people of Burundi speak the Kirundi language, but almost no resources are available for their children to begin learning in the language they grow up with. This makes school very difficult, since they have to learn everything from math to business in a foreign language. Kamusi, TEDECO, and students at University of Ngozi are building tools for Kirundi, which will be available for 🆓 free to children throughout the country. Your support will fund learning stipends and the server needed for the project.

We have a special campaign on 🌍🎁 Global Giving to support Ngozi students of Translation Studies in their efforts. Your donation to this campaign makes it possible for the students to attend university. In exchange, the students work on the dictionary, learning needed language and computer skills at the same time they produce a vital national resource.

YOU can pay the full year's stipend for a student at Ngozi, and contribute to the development of Kirundi within Kamusi, for only $300. But, we really cannot do it without you. Thank you for your generous contributions!!

/info/globalgiving

Kinyarwanda.net

Kinyarwanda.net will become integral to the Human Languages Project for the complex Kinyarwanda language spoken by millions throughout Rwanda.

/info/kinyarwanda

Kigali Today

Kigali Today
22 February 2013

As International Mother Language Day was celebrated on 21/02/2013, the Kamusi Project unveiled a Kinyarwanda dictionary that is part of the "Global Online Living Dictionary (KAMUSI GOLD)" project, which brings together dictionaries for about 20 languages.

After producing a Swahili and English dictionary that has been appreciated by many people around the world, the Kamusi Project, which is based in America, has now reached the stage of adding other languages to its website, with Kinyarwanda at the forefront of the languages that will receive special attention.

Continue to Full Article...

/info/kigalitoday

Excluded Linguistic Communities

Full Article: Excluded Linguistic Communities and the Production of an Inclusive Multilingual Digital Language Infrastructure

Presented at: 11th Language and Development Conference, New Delhi, India, November 18–20 2015
Abstract
The consequence of linguistic digital exclusion is the inability of billions of people to access vital knowledge and economic resources that contribute to prosperity in an era of globalization. However, rectifying linguistic inequity is mostly absent from development discourse and the agendas of governments and agencies that undertake development activities. Most efforts to produce content for excluded languages depend on the haphazard occurrence of a commercial, academic, or programmatic purpose for an activity in a given language at a particular moment. The Kamusi Project seeks to address the digital linguistic divide by engaging communities in the systematic collection of codified data for any language – linguistic information that can be used in many kinds of advanced knowledge and technology resources. This paper explores assumptions about participants’ motivations and behaviors that underlie the project’s methods, including participation in online games and interactive mobile apps intended to elicit speakers’ knowledge of their own languages in ways that can be shared by others. While the Kamusi system aims to welcome all, disparities may continue to exclude those without substantial time, network access, equipment, digital experience, or literacy, leaving international members of a diasporic language group as its most active contributors. Further, smaller and more remote languages have, by definition, fewer potential participants and less access for participation, thus perpetuating their inability to jump the digital divide. Without external support for the time and effort necessary to gather linguistic knowledge, even the most carefully constructed tools will fail for thousands of languages spoken by millions of people, including many languages near extinction. 
This paper raises, without definitively resolving, the social challenges of a multilingual digital infrastructure platform that has the technical capacity to document every word in every language, but can only approach accomplishing this objective through the involvement of those who have the least access to taking part. /info/excluded

TEDECO

TEDECO (TEcnología para el DEsarrollo y la COoperación) is a development cooperation group at the Universidad Politécnica de Madrid (UPM), made up of faculty and students of the UPM's Facultad de Informática and external collaborators.

As its name indicates, TEDECO is a group specialized in information and communication technologies (ICT), whose initial objective was to use new technologies to equip centers (mainly educational centers) in developing countries with the infrastructure they need to develop and improve their situation.

/info/tedeco

GOLD Theme: Canada

Donate in Canadian Dollars



/info/canada

CILLDI

The Canadian Indigenous Languages and Literacy Development Institute will coordinate work on First Nations' languages in the Human Languages Project.

/info/cilldi

University of British Columbia

The University of British Columbia Department of Linguistics will bring to the Human Languages Project a specialization in the development of front-end applications, including visualization and 🎮 games, to present lexical and educational resources in intuitive ways to language communities that are not normally immersed in technology.

/info/ubc

Kamusi's Boat Logo

Our logo took more than 2 years to design. Our main requirements were that the logo should:
  1. convey the notion of language and communication
  2. convey a positive image related to Africa, without resorting to archaic stereotypes
  3. be scalable, from a very small graphic all the way to something that would look good on a t-shirt or poster
  4. contain "kamusi project" or "kamusi.org" within the graphic design
Here's the full story...

/info/boat

Canvas beach bag

Attractive, sturdy bag for 🏬 shopping or carrying loose gear, and part of your purchase price supports the project! Also available in blue or pink

/info/beach_bag

GOLD Theme: Ontologies

Ontologies

Thoughts are often organized around logical systems that are intertwined with language. You may, for example, offer a guest pie when you really have 🎂 cake, because you associate both words with round baked desserts. Through WordNet now, and other ontological systems we gradually align, we chart these relationships. Beaks and wings are parts of a 🕊 bird, sparrows and swallows are types of 🕊 bird, birds are a type of animal. By fixing these connections to specific senses (wings that are flapped in the air, not eaten with beer), we can apply ontologies across languages. You will be able to find terms for specific car parts, or different types of 🚗 car, in Hindi or Portuguese.
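The bird example can be sketched as two small relation tables, one for "type of" and one for "part of". This Python fragment is a toy illustration only: the sense identifiers, table names, and functions are hypothetical, and it does not use WordNet's actual data or API.

```python
# Hypothetical sense identifiers; a parenthetical tag pins a word to one sense.
TYPE_OF = {
    "sparrow": "bird",
    "swallow(bird)": "bird",
    "bird": "animal",
}
PART_OF = {
    "beak": "bird",
    "wing(limb)": "bird",  # the wing that flaps, not the one eaten with beer
}

def ancestors(sense):
    """Follow type-of links upward and return the chain of broader senses."""
    chain = []
    while sense in TYPE_OF:
        sense = TYPE_OF[sense]
        chain.append(sense)
    return chain

def parts(whole):
    """Return the senses recorded as parts of `whole`, in sorted order."""
    return sorted(part for part, owner in PART_OF.items() if owner == whole)
```

Because the links attach to senses rather than spellings, the same tables could in principle be reused for any language whose terms are mapped to those senses.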

We most look forward to Chinese. No satisfactory system has yet emerged to organize Chinese dictionaries, because it is not possible to list the tens of thousands of ideograms in an accepted sequence. Chinese students take classes where they learn to find terms based on the strokes that characters are composed of, but that does not help them find words if they do not already know exactly how to write them. Using ontological connections - knowing that a 🦁 lion is a type of 😾 cat can bring you to the term for 🐆 leopard that you are looking for - we envision that Kamusi will result in a Logical Chinese Dictionary that will be uniquely useful to learners and speakers of the language. /info/ontologies

Relations

Words do not exist in isolation. They might have synonyms and antonyms. They could be part of an ontological set (a beak is a part of a bird and so is a feather, a swallow is a type of bird and so is a goose). They might have spawn (light begat lightsaber), or close relatives (operation and operator). Knowing how words are related enhances our understanding of how a language works, and thus makes it easier to put language to work.

Within the Kamusi architecture, we can pinpoint relationships at a level of granularity that has not previously been available. Because each individual meaning of a word is its own entry, we can know, for example, that heat and warmth could be substituted for each other as synonyms in the context of a chilly person near a fire, without being misled, as Google Translate is, into treating someone who is packing heat as putting warmth = chaleur = cordialité in a suitcase. These nuanced associations among 🔢 data elements will create opportunities for Natural Language Processing that can underlie next-generation advances in language technology.
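To make the idea of sense-level relations concrete, here is a minimal Python sketch in which every meaning is its own record and relations connect records rather than spellings. The identifiers, glosses, and the `related` helper are all hypothetical and chosen only for this example; they do not reflect Kamusi's actual schema.

```python
# Hypothetical sense records: each meaning of "heat" is a separate entry.
SENSES = {
    "heat#temperature": {"term": "heat", "lang": "eng"},
    "heat#firearm": {"term": "heat", "lang": "eng"},
    "warmth#temperature": {"term": "warmth", "lang": "eng"},
    "chaleur#temperature": {"term": "chaleur", "lang": "fra"},
}

# Relations are stored between sense identifiers, never between bare spellings.
SYNONYMS = {("heat#temperature", "warmth#temperature")}
TRANSLATIONS = {("heat#temperature", "chaleur#temperature")}

def related(sense_id, relation):
    """Return the sense ids linked to `sense_id` in `relation`, in sorted order."""
    linked = set()
    for a, b in relation:
        if a == sense_id:
            linked.add(b)
        elif b == sense_id:
            linked.add(a)
    return sorted(linked)
```

Looking up the firearm sense of "heat" returns no synonyms and no translations here, so nothing connects it to "warmth" or "chaleur" merely because the spelling matches.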

/info/relations

Condillac

The main domain of interest of the Condillac Research Group at the Université Savoie Mont Blanc, in France, is 👅👅👅🎓 multilingual knowledge sharing based on terminology and ontology - that is, elaborating conceptual links across languages, a core component of the Human Languages Project.

/info/condillac

GOLD Theme: East Asia

Donate in Hong Kong Dollars



/info/hongkong

Language: Chinese (Mandarin)

Local name: 汉语/漢語 (Hànyǔ) or 中文 (Zhōngwén)
Most spoken in: China
👪🔊: 1,000,000,000
Kamusi records: 97,673
ISO 639(1)/(3): zh/cmn (we use unofficial "qcn" for Traditional)

Links: Omniglot | Wikipedia | Glottolog | Ethnologue ($)

Sources: Chinese has two writing systems. For Simplified Chinese, used on the mainland, initial 🔢 data comes from the Chinese Open 🔠🕸 WordNet, maintained by the Division of Linguistics and Multilingual 👅👅👅 Studies, Nanyang Technological University, Singapore. For Traditional Chinese, used on Taiwan, initial data comes from the Chinese Wordnet, maintained by the Lab of Ontologies, Language Processing and e-Humanities, National Taiwan University, Taipei, Republic of China (Taiwan), and Academia Sinica, Taipei, Republic of China (Taiwan). Additional data from Kamusi participants.

💛😇 GOLD Angel: Nobody.

Pearls: Be the first Pearl for this language!

Join the Chinese Language Task Force to help plan development of robust resources for the language!

[This page is under permanent construction. We would like to make sure the page stays updated, and add new interesting links and information for our visitors. We invite you to help curate this page!!! Please contact us to volunteer to improve this infobox, or to translate the content.]

/info/cmn

Donate in Taiwanese Dollars



/info/taiwan_dollars

Language: Japanese

Local name: 日本語 (Nihongo)
Most spoken in: Japan
👪🔊: 125 million
Kamusi records: [data queued for upload]
ISO 639(1)/(3): ja/jpn
Links: Omniglot | Wikipedia | Glottolog | Ethnologue ($)
Sources: Initial 🔢 data from the Japanese 🔠🕸 WordNet, maintained by National Institute of Information and Communications Technology of Japan (NICT), Kyoto, Japan. Additional data from Kamusi participants.
💛😇 GOLD Angel: Nobody.
Pearls: Be the first Pearl for this language!
Join the Japanese Language Task Force to help plan development of robust resources for the language!

[This page is under permanent construction. We would like to make sure the page stays updated, and add new interesting links and information for our visitors. We invite you to help curate this page!!! Please contact us to volunteer to improve this infobox, or to translate the content.]

/info/jpn

Donate in Japanese Yen



/info/yen

GOLD Theme: Data Structure

The Tip of the Iceberg

The dictionaries we provide through the search bar on KamusiGOLD are only a small part of the project. However, due to our infinitesimal budget, we cannot currently afford to open our data to the public in the ways we would like.

Take, for example, the word "infinitesimal" from the previous sentence. We offer you the opportunity to search for that word in English, and to find its equivalents in all the languages for which we have corresponding data. You can enter any word in the search bar, and we'll happily share much of what we know about it. What we cannot open up is a permanent address for that term, which is the starting point so you can link to it, dig into any rich information we have about it, improve and expand it, or use it as data in other applications.

The problem is that when terms such as "infinitesimal" are all given fixed URLs, they transform from data points into Big Data. Each entry becomes not just a web page, but as many web pages as the word has senses; while "light" is one page on Merriam-Webster.com, concept-specificity makes it 48 different pages on Kamusi. Each web page requires all of the ornamentation that makes each page beautiful. Each entry has extended information that we either need to present on its web page, or provide links to within the code. Every term links to translations, ancestors, or other entries (e.g. infinity, finite, finish), with code that you don't see but your device does. The code should also contain a lengthy RDF description so the data can be pinpointed by other projects. While each word takes fairly infinitesimal resources when we serve it from our database, its associated web pages can be 🐘 elephantine. With millions of words, offering a web page for each demands a lot more resources than offering a simple query to our database.

Moreover, each static page is a target for the search engines that constantly crawl the web to index what's out there. This is really what overloaded our system and kept us offline for a year before we turned to today's query-only light-access solution. Not only Google and Bing, but also Baidu, Sogou, Yandex, and several others you've never heard of. Most websites have a few dozen pages, or maybe a few hundred. A dictionary with a healthy 100,000 terms would have 100,000 web pages, which is already a lot of 🐘🐘 for the crawlers. As a dictionary of many languages that is attempting to provide you with every word ever known to be spoken, our data contains millions of 🐘🐘🐘🐘, with links from one to the next that are irresistible for the crawlers to follow. It takes a fraction of a second to send each 🐘 back to you, and when a dozen search engines hit at the same time, our server has to give them the same attention. Unlike human users, though, the robots do not pause. As soon as we give them one 🐘, they ask for the next. And the next. And the next, following the chains through millions of links. At one second per entry, 10 million entries would take a single search engine nearly 4 months to crawl, by which point it is time to start all over again. With each 🐘 exposed, we spent all our time telling the robots what we had, with no power left over to serve the data to actual people.
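The crawl-time estimate above follows from simple arithmetic, which can be checked with a few lines of Python. The function name `crawl_days` and the average month length of 30.44 days are choices made for this sketch; the one-second-per-entry rate and the 10 million entries come from the text.

```python
SECONDS_PER_DAY = 24 * 60 * 60
AVERAGE_DAYS_PER_MONTH = 30.44  # assumption: 365.25 days / 12 months

def crawl_days(entries, seconds_per_entry=1.0):
    """Days a single crawler needs to request every entry back to back."""
    return entries * seconds_per_entry / SECONDS_PER_DAY

days = crawl_days(10_000_000)            # a little under 116 days
months = days / AVERAGE_DAYS_PER_MONTH   # just under 4 months
```

A dozen crawlers working at once would each need the same stretch of time, which is why the load never lets up.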

We do have ways to solve the problem, but they involve money to pay our developers. Basically, we need to create a very limited index page for each entry that the robots can read quickly, and require authenticated login to get at the real pages with the rich data. Preferably, we will be able to afford a multi-server solution, with robots crawling on one machine while confirmed humans enjoy full data access on hardware that is not inundated by automatons. When we do this, we will be able to open up many more services that take advantage of our precise concept-based data, but are contingent on fixed URLs.

Giving you everything we have won't take a miracle, but does demand adequate sponsorship to implement known solutions. Meanwhile, our query-only search is not exposed to search engines, so we are happy to offer it to you.

/info/iceberg

Molecular Lexicography

While we often start with basic word forms, our structure and our tools are designed to build much more complex information about each term, over time. This information is important both for the people who wish to learn from the dictionary, and the tools we build for which language data is needed for input (like voice commands) or output (such as translations). As the video expounds, terms are composed of several types of particles, interacting within and among languages in a manner similar to molecules. Terms have:

Our goal is to collect information about each term, and use that info to follow associations among terms and languages: to the greatest extent possible, a full matrix of human expression across time and space. /info/molecular
Artificial Intelligence is an ongoing quest for computer science - nowhere more so than in understanding and producing natural language. Great resources have been directed toward "machine learning", where computers learn through reactions to previous results. If 51% of 👪 people asking in English for "weather in Naples" click on the result for the city in Florida, the machine learns to provide the US answer by default. Someone using those results to pack for a trip to 🇮🇹 Italy will understand why this actual example of learning might not be so intelligent. Yet statistically-based computer inference remains the primary method upon which translation models are predicated.

At Kamusi, we aim to learn first directly from 👪 people, and only learn from patterns when we have confidence they apply. That is, if we come across the party term "bust a gut", and it isn't already in our 🔢 database, we don't blithely tell our users that it means "lacerate one's intestine". Rather, using Pre-D, the user indicates that the words play together to form a unique meaning. From the crowd, we can find out what that meaning is, and how the term is expressed in other languages. This human-derived 🎓 knowledge, validated by procedures to find community consensus, can then be fixed as fact that can serve as the basis for intelligent inferences. Machine learning can be applied to the repetitive aspects of language that resist nuance, such as "busted a gut" being the past tense of the phrase. Or we can use learning algorithms to make connections that are slated for human testing, such as whether a phrase in Hausa that has been matched to a phrase in German that has been matched to our English phrase produces a legitimate Hausa <-> English pair. Because 7000 languages potentially in Kamusi equates to 25 million translation pairs, it is impossible that we will find qualified bilinguals to verify direct connections among all of them, or to confirm every prediction we make about grammatical patterns. However, the more we learn directly from people, the more confident we can be in ensuing inferences. Our objective with machine learning, then, is to find potential facts that can be verified over time by real people, and then used for the rest of time as a confirmed 🎓 knowledge base and iterative seed for further learning. The more human intelligence we can pack into our language computation infrastructure, the less artificial will be the results of "artificial intelligence" - and the more likely we will be to bring the right wear for our 👅 linguistic travels. /info/machinelearning
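The figure of roughly 25 million translation pairs for 7000 languages, cited in the machine learning discussion above, can be checked by counting unordered pairs of languages. This short Python sketch assumes that each pair is counted once regardless of direction; the function name `language_pairs` is made up for the example.

```python
import math

def language_pairs(languages):
    """Number of unordered language pairs: n * (n - 1) / 2."""
    return math.comb(languages, 2)

pairs = language_pairs(7000)  # 24,496,500, i.e. roughly 25 million
```

Counting each direction separately would double the total, which only strengthens the point that direct human verification of every pair is out of reach.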

Tasks!

Government forms, online forms, checklists... These items have repeated vocabularies from agency to agency, website to website - date of birth, address information, and much more. With Tasks!, a form can be created in one language that will be properly understood by users in their own language. In an age where 👪 people move around the 🌍 world, this service will make it easier to pay taxes, enroll your kids in school, and also to use websites for booking tickets or buying books. This is another service that will grow from our existing tools if we can fund it.

/info/tasks

CERN Talk: Particles of Language

The Particles of Language: "The Dictionary" as elemental data for 7000 languages across time and space

Click here to watch the video.

How can we document detailed 🔢 data about all the 🌍 world's languages in a consistent, unified source, in a way that can serve 🎓 knowledge and technology needs for 👪 people and their machines around the globe? Dictionaries have historically presented selective information about words and their meanings within a language, or translation equivalents between languages, in idiosyncratic, incommensurable formats with little basis in data science. The Kamusi Project introduces a new approach, conceiving of language as a matrix of interrelated data elements. By documenting these elements within each language, and linking elements at conceptual and functional nodes across languages, Kamusi aims toward an elusive Big Data goal: "every word in every language." If successful, the results will run the gamut from preserving the human heritage embedded in 👅🔫 endangered languages, to providing international vocabularies for students to succeed in science, to a Star Trek-like universal translator embedded in your smart watch ⌚. In this talk, the project's founder discusses the nefarious complexities working against the creation of a universal language data platform, and the systems Kamusi has designed to collect, codify, and deploy quantum-level 👅🔢 linguistic data within one massive 🌎 global dictionary.

/info/cern

Digital Tech Gives Wordsmiths a Whole New World

TechNewsWorld
30 April 2014
by Vivian Wagner

...Another futuristic dictionary that's still in the formation stages is the Kamusi Project, which intends someday to include all words in all languages, allowing users to navigate freely between them.

"The long-term goal is to get as much data as possible about every term in every language and put it in one consistent place that is accessible to everyone, where all the data is accessible for people who want to use it,"...

Continue to Full Article...

/info/technewsworld

A Dictionary of Difference

In Chinese, a 🐑 sheep and a 🐐 goat are the same thing (羊). The lack of exact correspondences between languages creates a problem for most dictionaries. Kamusi has a unique solution to the problem that allows us to create a viable 👅👅👅🔢 multilingual dataset that does not get caught in a tangle of semantic drift. Translation equivalents are marked as either parallel, similar, or an explanation in one language of a concept that is indigenous to another (filling a lexical gap). "Similar" concepts are given a space for explication, so that a reader can know that 羊 is 1️⃣ one animal in China and 2️⃣ two in Australia.
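One way to picture the three kinds of translation link described above is as a small record with a relationship type and an optional explication field. The sketch below is an assumption about how such data could be modelled, not Kamusi's actual schema; the class, field, and function names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Equivalence(Enum):
    PARALLEL = "parallel"        # same concept in both languages
    SIMILAR = "similar"          # close, but the difference is worth explaining
    EXPLANATION = "explanation"  # one language explains a concept native to the other


@dataclass(frozen=True)
class TranslationLink:
    source_lang: str
    source_term: str
    target_lang: str
    target_term: str
    kind: Equivalence
    explication: str = ""  # free text describing the difference, if any


def awaiting_explication(links):
    """Return the 'similar' links whose explication field is still empty."""
    return [l for l in links if l.kind is Equivalence.SIMILAR and not l.explication]


links = [
    TranslationLink("zho", "羊", "eng", "sheep", Equivalence.SIMILAR,
                    "羊 covers both sheep and goats; English separates them."),
    TranslationLink("zho", "羊", "eng", "goat", Equivalence.SIMILAR),
]
print([l.target_term for l in awaiting_explication(links)])  # ['goat']
```

A list like the one returned by `awaiting_explication` could serve as a to-do queue for contributors writing explications.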

We apply the same principle to synonyms within a language. What is the difference between a 🛳 ship and a ⛵ boat? For terms that are labelled as synonyms, we provide a field to explain the difference. This task has never been done, by any dictionary in any language. Of course, producing this information will be a lot of work, so please join us in ✍ writing the explications and making a difference!

/info/difference

Differentiator

Languages often make subtle distinctions between similar concepts, such as a dual carriageway versus a motorway versus a thruway versus a freeway. We want to build a tool that can be configured to tease out these differences for conceptual groups. Items can be tagged for characteristics they must have, must not have, or might have (such as whether a type of roadway must, must not, or might have tolls), and where they occur geographically. Those concepts can then be matched to exact or near equivalents in other languages. It's complicated: for example, in both Kenya and Tanzania, the Swahili term for the administrative territory smaller than a nation is "mkoa", but the English term is "province" in Kenya and "region" in Tanzania. The Differentiator will make it possible to hone all sorts of minute distinctions within and between languages and places, from school levels to military ranks to cuts of meat. To help develop or fund the Differentiator, please contact us!
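The must / must not / might tagging could be represented as a simple mapping from characteristic to requirement, as in the sketch below. The names are hypothetical, the tag values are illustrative placeholders rather than verified claims about real road types, and treating an untagged characteristic as "might" is an assumption. Geographic tagging is left out for brevity.

```python
from enum import Enum


class Requirement(Enum):
    MUST = "must"
    MUST_NOT = "must not"
    MIGHT = "might"


def differing_traits(concept_a, concept_b):
    """Return {trait: (tag_a, tag_b)} for traits the two concepts tag differently.

    A trait missing from a concept is treated as MIGHT (an assumption).
    """
    result = {}
    for trait in sorted(set(concept_a) | set(concept_b)):
        tag_a = concept_a.get(trait, Requirement.MIGHT)
        tag_b = concept_b.get(trait, Requirement.MIGHT)
        if tag_a != tag_b:
            result[trait] = (tag_a, tag_b)
    return result


# Illustrative tags only; not verified claims about real roads.
road_type_a = {"limited access": Requirement.MUST, "tolls": Requirement.MUST_NOT}
road_type_b = {"limited access": Requirement.MUST, "tolls": Requirement.MIGHT}
print(differing_traits(road_type_a, road_type_b))
```

Only the characteristics that distinguish the two concepts are returned, which is the information a reader would need to choose between near-equivalents.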

/info/differentiator

GOLD Theme: Big Data

Mission to Mars

Sending people to Mars was once an impossible dream. Now it is a plan.

At Kamusi, we recognize that our goal of Every Word in Every Language sounds unattainable. Our architecture, though, is built toward that target, supporting a complete matrix of human expression across time and space. Like a mission to Mars, this is a project for the long term. We have the initial core systems in place and we have a global partner network that is working to launch the Human Languages Project. Developing the systems and data to link all of the world's languages together in a vast network of understanding is a massive task - but, if we succeed, a giant leap for humankind. Please join with us to set the elimination of language barriers as an objective we can achieve together: three, two, one, blast off!... /info/mars

Universal Concept Identifier

Data science is built on the ability to identify items precisely, using numbers. Books, for example, all have an ISBN so that particular editions can be found in bookstores and libraries worldwide. A great challenge for informatics is ascertaining when things are the same across systems - whether the goods leaving the supplier are the same as the goods arriving at the warehouse. Sometimes pieces of information can link databases effectively; "July 19, 2010" will always refer to the same moment in history, even if systems render it variously as "19/7/10" or "19 juillet 2010". Words, however, are clouds that do not have meanings fixed in a standard system. A 🐕 dog could be "a mammal, Canis lupus familiaris, that has been domesticated for thousands of years", or "a domestic mammal, related to wolves and foxes, that is often kept as a pet", making it impossible to automatically tie together information in different systems that try to reference the same concept.

The problem is further compounded when expressing the same basic 💭 concept in different languages, since the shapes of both the terms and the words used to define them are inherently different computationally, even if the idea is identical.
For example, 👂 is written in English as "ear", but the only fact the computer knows is the binary code for e-a-r, "01100101 01100001 01110010". In Romanian, 👂 is written as "ureche", but the computer sees "01110101 01110010 01100101 01100011 01101000 01100101". For the computer to know that two terms are equivalent, that 👂 in one language = 👂 in another, we need a set of digits that is the same for any term that expresses the same thought.
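The point can be checked directly. The sketch below prints the bits a computer stores for each spelling, then shows how a shared number lets a program recognize that the two terms express the same idea. The ID value and the lookup table are hypothetical placeholders, not real Kamusi identifiers.

```python
def to_bits(term: str) -> str:
    """Render the UTF-8 bytes of a term as space-separated 8-bit groups."""
    return " ".join(f"{byte:08b}" for byte in term.encode("utf-8"))


print(to_bits("ear"))     # 01100101 01100001 01110010
print(to_bits("ureche"))  # 01110101 01110010 01100101 01100011 01101000 01100101

# Hypothetical concept IDs: the number is a placeholder, not a real identifier.
CONCEPT_IDS = {("eng", "ear"): 1001, ("ron", "ureche"): 1001}


def same_concept(a, b) -> bool:
    """True when both (language, term) keys are known and share one concept ID."""
    return a in CONCEPT_IDS and b in CONCEPT_IDS and CONCEPT_IDS[a] == CONCEPT_IDS[b]


print(same_concept(("eng", "ear"), ("ron", "ureche")))  # True
```

The two bit strings have nothing in common, so the equivalence has to come from the shared identifier rather than from the spelling.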

Kamusi is implementing a Universal Concept Identifier, a single number that can be assigned to a given idea. Any term that matches that idea - 🐘, éléphant in French, ndovu in Swahili, ゾウ in Japanese - is joined to that ID. Using our Differentiator, we can split similar things with different numbers where appropriate, yet still show their close ontological relationships - a freeway and a thruway get different numbers, but are both linked through our graph architecture to the idea of limited access auto routes, and the German translation "Autobahn". We will use the json "synset" numbers WordNet has established as a starting point, and integrate with the IDs created in the limited expanded vocabulary set proposed for CILI, the Collaborative Interlingual Index.

UCIs will be extended through Kamusi to cover millions of additional concepts that fall outside of the 100,000 concepts identified by WordNet, including words other than nouns, verbs, adverbs, and adjectives; untreated forms such as gerunds; 650,000 additional concepts gleaned from Wiktionary; millions of named entities from the Joint Research Centre; 1.6 million species from the Catalogue of Life; 8 million domain-specific terms in 25 languages from IATE, as well as term sets from additional sources; and items indigenous to languages other than English. The UCI will thus be available to codify open data across numerous languages, projects, and data systems, with the intent that the world's linguistic data can play together in ways that are not currently possible when links can only be inferred by guesswork based on spelling.

/info/ucid

Our Server Problems, In 🖼 Pictures

We overloaded our server, and didn't have the money to get it running again. Here is a photo essay that explains the problem.

/info/overload

🎓 Knowledge from Crowds 👪👪

Most 👅🔢 linguistic data is in the heads of the 👪 people who speak each language, not in the zeros and ones of 🔢 digitized data. We have been developing new methods for learning language information directly from the public, for languages around the 🌍 world. Our approach to the crowd is discussed in

Full article: Crowdsourcing Microdata for Cost-Effective and Reliable Lexicography

Benjamin, Martin (2015)
Editors: Li, Lan; Mckeown, Jamie; Liu, Liming
Published in: Proceedings of AsiaLex 2015 Hong Kong, p. 213-221

Abstract
Lexicography has long faced the challenge of having too few specialists to document too many words in too many languages with too many linguistic features. Great dictionaries are invariably the product of many person-years of labor, whether the lifetime work of an individual or the lengthy collaboration of a team. Is it possible to use public contributions to vastly reduce the time and cost of producing a dictionary while ensuring high quality? Crowdsourcing, often seen as the solution for large-scale data acquisition or analysis, is fraught with problems in the context of lexicography. Language is not binary, so there may be no one right answer to say that a word “means” a particular definition, or that a word in one language “is” the same as a particular translation term. People may misinterpret instructions or misread terms or make typographical or conceptual errors. Some crowd members intentionally add bad data. Without a payment system, incentives for participation are slim; micro-payments introduce the incentive to maximize income over quality. Our project introduces a public interface that breaks lexicographic data collection into targeted microtasks, within a stimulating game environment on Facebook, phones, and the web. Players earn points for answers that win consensus. Validation is achieved by redundancy, while malicious users are detected through persistent deviations. Data can be collected for any language, in an integrated multilingual framework focused on the serial production of monolingual dictionaries linked at the concept level. Questions are sequential, first eliciting a lemma, then a definition, then other information, according to a prioritized concept list. The method can also be used to merge existing data sets. Intensive trials are currently underway in Vietnamese, with the inclusion of additional Asian languages an explicit objective. /info/crowds
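The abstract's two validation ideas, consensus through redundancy and detection of persistent deviation, can be illustrated with a short sketch. The thresholds, function names, and data layout here are assumptions made for the example; the paper itself should be consulted for the actual procedure.

```python
from collections import Counter


def consensus(answers, min_votes=3, min_share=0.6):
    """Return the winning answer if enough players agree, else None.

    answers maps player -> answer. Both thresholds are hypothetical.
    """
    if len(answers) < min_votes:
        return None
    answer, count = Counter(answers.values()).most_common(1)[0]
    return answer if count / len(answers) >= min_share else None


def deviation_rates(rounds):
    """Share of each player's answers that missed consensus, over decided rounds."""
    missed, played = Counter(), Counter()
    for answers in rounds:
        winner = consensus(answers)
        if winner is None:
            continue  # undecided rounds say nothing about any player
        for player, answer in answers.items():
            played[player] += 1
            if answer != winner:
                missed[player] += 1
    return {player: missed[player] / played[player] for player in played}


rounds = [
    {"p1": "a", "p2": "a", "p3": "a", "p4": "x"},
    {"p1": "b", "p2": "b", "p3": "b", "p4": "y"},
]
print(deviation_rates(rounds))  # {'p1': 0.0, 'p2': 0.0, 'p3': 0.0, 'p4': 1.0}
```

A player whose rate stays high across many rounds would be a candidate for review; a single miss would not be enough to draw that conclusion.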

1.3 Billion PanLex Connections

Our partners at Long Now are hard at work on PanLex, a repository of term-to-term translations from as many sources as they can collect. Our task is to take their crude data, which is based on spelling, and refine it into meaning.

That is to say, PanLex data currently shows that l-i-g-h-t in English matches 9658 terms in 1599 language varieties. From the data, we cannot tell whether a term in Language 2 means "light - not heavy" or "light - not dark" or "light - not serious". Nor can we tell that information for Language 3. This means the PanLex list gives us no way to infer which term in Language 2 that matches l-i-g-h-t is equivalent to a term in Language 3 that matches the same English spelling.

PanLex currently has 24.9 million expressions from 11,044 language varieties, resulting in some 1.3 billion direct spelling-based translations.

Kamusi is preparing to align these terms based on meaning, so that the term for "light - not dark" in Language 2 matches the term for that idea in Language 3, Language 4, and Language 1599, and never gets confused with "light - not heavy" or "light - not serious" when connecting any of those languages. We have already developed the core functionality in our DUCKS software (Data Unified Conceptual Knowledge Sets), for speakers to match the given term in their language with the right definition in English or another bridge language. When we have the financial horsepower, we can automate the system so that it arranges the PanLex data into 11,044 DUCKS sets, and, with proper authentication, sets players to work choosing the right meanings for their terms.
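The difference between spelling-based and meaning-based matching can be shown with a small sketch. It assumes speakers have already chosen a sense for each term, and it uses French and Spanish words for two senses of "light" as stand-ins for Language 2 and Language 3. The function names and data layout are hypothetical and do not describe the DUCKS software itself.

```python
from collections import defaultdict


def align_by_sense(choices):
    """Group terms by the sense speakers selected: sense -> language -> terms."""
    aligned = defaultdict(lambda: defaultdict(set))
    for lang, term, sense in choices:
        aligned[sense][lang].add(term)
    return aligned


def translations(aligned, sense, lang_a, lang_b):
    """Pairs of terms from two languages that were matched to the same sense."""
    return sorted((a, b) for a in aligned[sense][lang_a] for b in aligned[sense][lang_b])


# Each tuple records a speaker's choice of sense for a term matched to l-i-g-h-t.
choices = [
    ("fra", "léger", "light - not heavy"),
    ("fra", "clair", "light - not dark"),
    ("spa", "ligero", "light - not heavy"),
    ("spa", "claro", "light - not dark"),
]
aligned = align_by_sense(choices)
print(translations(aligned, "light - not dark", "fra", "spa"))  # [('clair', 'claro')]
```

Without the sense labels, all four terms would simply match the English spelling, and "léger" could be paired with "claro" by mistake.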

Using DUCKS and other computational techniques, we can in principle convert the 1.3 billion raw PanLex connections into tens of billions of valid translations among all the world's languages. We are ready to blast off technically, and are now seeking visionary funders who can light the fuse.

/info/panlex

White House Big Data Initiative 🔢


November 13, 2013: The White House highlighted several new and recent partnerships and collaborations focused on 🔢 data in the areas of urban policy, development, science, health and research to further the goals set by the Big Data Research and Development Initiative in 2012. The White House highlighted the recent partnership between DataKind, which connects non-profits facing 🔢 data analysis challenges with pro-bono data scientists, and Pivotal for Good, which will be contributing to those "data philanthropy efforts"; Google's and USAID's support of the World Resources Institute's 🌍🌲👀 Global Forest Watch 2.0 monitoring tool; and the Kamusi Project, which is working on establishing a dictionary of every word of every world language, with support from the National Endowment for the Humanities.

/info/whitehouse

GOLDbox Lexicography

The empty GOLDbox is central to the success of Kamusi. Most 🔢 data for most languages has never been digitized in ways that can be easily used by the public and their machines. Finding and codifying this missing 🔢 data is our quest. When we know there is something we don't know, we ask the 👪 people who speak that language to share what they know. If we don't have the Ukrainian term for "doll", for example, the GOLDbox invites Ukrainians to fill in that information. Once we know the basic form, we can make new GOLDboxes to find the plural, the gender, and many more molecular details.

GOLDbox must work sequentially, so that the completion of one element through the validation process triggers the opening of the next step. An interface must be designed so each element in each language can be independently configured (e.g., languages have different numbers of noun plurals). The software must also play nicely with our system-wide authentication, and of course be fully intertwined with the Neo4j 🔢 database. To help develop or fund GOLDbox, please contact us!
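The sequential behaviour described above, where validating one element opens the next, might look something like the following sketch. It is an illustration under stated assumptions: the class name, the step names, and the placeholder values are hypothetical, and the real software would also need authentication and database integration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class GoldboxSequence:
    """Ordered elements to collect for one concept in one language."""
    steps: list                      # configured per language, e.g. plural forms differ
    validated: dict = field(default_factory=dict)

    def open_step(self) -> Optional[str]:
        """The first element not yet validated, or None when all are done."""
        for step in self.steps:
            if step not in self.validated:
                return step
        return None

    def record_validated(self, step: str, value: str) -> None:
        """Store a value that passed validation; only the open step is accepted."""
        if step != self.open_step():
            raise ValueError(f"'{step}' is not the currently open element")
        self.validated[step] = value


# Step names and values are placeholders for illustration.
box = GoldboxSequence(steps=["lemma", "plural", "gender"])
print(box.open_step())                       # lemma
box.record_validated("lemma", "<validated lemma>")
print(box.open_step())                       # plural
```

Because the list of steps is passed in, each language can be given its own sequence without changing the logic.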

/info/goldbox

Wiktionary

Wiktionary is an open dictionary that is not readily usable as a 🔢 data source for natural language processing or human language technologies. We will soon open a special version of 🐥📊 DUCKS that finds the common senses between the English Wiktionary and 🔠🕸 WordNet, and makes the "missing" Wiktionary senses available for a plethora of technological extensions.

/info/wiktionary

Donate in Swedish Kronor



/info/sweden

Canvas messenger bag

Carry all your basics around town - this sturdy bag fits a medium-sized laptop, and part of your purchase price supports the project!

/info/messenger_bag

Search De-Optimization

Conundrum: we want 👪 people to find our 🔢 data, but we don't want the search 🤖 robots to kill us with their crawls. We have millions of entries for myriad languages. Many entries have multiple data elements (plural forms, definitions, etc.). Each element is wrapped in HTML for display, and RDF for sharing as open data. Many entries have hyperlinks to their equivalents in languages around the globe. This is Big Data, and it is too much to expose in its entirety to big search engines. We could put all the data behind search walls, but then the search engines would never be able to direct people to the unique language information they are looking for. For example, with our 🔢 data hidden in our database rather than having 🌍📖 world-readable URLs for each term, you cannot go to specific entries, only to dictionary lookup results. Instead, we need some clever engineering that exposes the very basic information to the 🤖 robots, but reserves serving the good stuff to real 👪 people. To help fine-tune how we are searched, or to fund de-optimization, please contact us!

/info/search

GOLD Theme: Switzerland

Universität Basel

Zentrum für Afrikastudien Basel will join forces in the Human Languages Project to coordinate partners and research activities for African languages, including the social implications of using advanced technology with diasporic populations for the benefit of their homelands.

/info/casb

Modulo Language Automation

Modulo is a small Swiss enterprise that specializes in translation quality control, an important element in building our crowd participation systems.

/info/modulo

A Plan for Swiss Languages

Switzerland speaks German, French, Italian, and Romansh. We have developed approaches to integrate all these languages in the Kamusi system. You may be surprised that German is the most difficult to document - so difficult, in fact, that no reference resource exists for popular use. The problem is that Swiss German is generally a spoken language, with more than 100 dialects and sub-dialects. For reading, the Swiss use "Standard" or "High" German, which they call Schriftdeutsch (written German). In certain technological situations, though, such as voice interactions and informal messaging, they would greatly prefer to interact in their own language variety. Unfortunately, services cannot be provided in Swiss German because the 🔢 data is not available, and the data will continue to be unavailable because nobody will agree to a standard. Fortunately, our novel data structure for dialects will make it possible to produce all the 🔢 data necessary to preserve 🎓 knowledge about Swiss German, starting with 5 major varieties, and put it to use in advanced technology applications. We have great partners, and we can start as soon as we find a 💛😇 GOLD Angel.

/info/swiss

Language: French

Local name: français
Most spoken in: France, Quebec, Belgium, language of government and higher education in many African countries
👪🔊: 80 million speak French as a mother tongue, and another 200 million or so speak it as a second or school language.
Kamusi records: Unfortunately, previous attempts to build a French 🔠🕸 WordNet have not produced results of a high enough quality for Kamusi. We therefore do not have aligned 🔢 data for bootstrapping. We do have unaligned data that we are gradually processing through 🐥📊 DUCKS, including 1700 items from CAWL, about 2000 from Emoji, about 7000 aligned through Songhay, and about 13000 aligned through Berber.
ISO 639(1)/(3): fr/fra
Links: Omniglot | Wikipedia | Glottolog | Ethnologue ($)
Sources: N/A
💛😇 GOLD Angel: Nobody.
Pearls: Be the first Pearl for this language!
Join the French Language Task Force to help plan development of robust resources for the language!

[This page is under permanent construction. We would like to make sure the page stays updated, and add new interesting links and information for our visitors. We invite you to help curate this page!!! Please contact us to volunteer to improve this infobox, or to translate the content.]

/info/fra

Maternity Shirt

The baby who our model had in her belly for this photo is now fluent in three languages. A great 🎁 gift for expectant moms!

/info/maternity

A Multilingual Dictionary Accessible to All

IC | Computer and Communication Sciences
27 January 2014
by Alexandra Walther

A multilingual dictionary, built with contributions from internet users and available free on the web in all the world's languages: that is Kamusi's goal. Developing such a tool, however, is a real technological challenge, because each language is thought to have an average of 100,000 words, often with several meanings. Given that up to 7,000 living languages exist in the world today, the correspondences among all the words of all the languages represent a data network of almost infinite possibilities. How can such a challenge be met? Through the meeting of an anthropologist with a passion for languages and a computer scientist with a passion for Big Data.

Continue to Full Article...

/info/kamusi_epfl_fr

Donate in Swiss Francs

Or, you can make a bank transfer in CHF:
Account Holder: Kamusi Project International
Bank: UBS SA
Address: 1202 Geneva, Switzerland
IBAN: CH46 0027 9279 2085 9001 V
BIC/ Swift: UBSWCHZH80A



/info/swissfrancs

University of Geneva

The Faculty of Translation and Interpreting at the University of Geneva is a leading research group in 👅👅👅 multilingual Switzerland, poised to join the Human Languages Project consortium.

/info/geneva

GOLD Theme: Australasia

Donate in Australian Dollars



/info/australian_dollars

iTalk

iTalk Studios, based in Alice Springs, Australia, is an SME that develops educational tools and resources for Aboriginal languages. We are seeking sponsors to bring Arrernte and several other languages into Kamusi, but so far have not located donors in Australia who have concern for their country's marginalized languages. Please join the Australia Task Force to help strategize about how we can push forward.

/info/italk

Donate in New Zealand Dollars



/info/nz

GOLD Theme: Images

Visual Dictionary

A 🖼 picture is worth a thousand words, and there are thousands of open images that we can use to enhance the entries for millions of words. If we have a picture of a "lemon", that image can illustrate the concept in any linked language. However, does that picture attach to a fruit or to a defective car? Computer algorithms will usually revolve around the most common meaning of a term. We need to build special visual 🐥📊 DUCKS to align image sets with the correct meanings of the terms they are tagged with. For 📱 mobile devices, users should be able to upload 🖼 pictures of things they see to the relevant entries, on the spot, with geo-tags; this functionality needs to be embedded as an extension of our existing mobile app. We also need a permissions and review mechanism for 👪 people who share their own images, either from their computers or their phones, to grant us usage rights and vet that the 🖼 pictures are appropriate. To help develop or fund our visual dictionary features, please contact us!

/info/visual

Swahili Watch ⌚ in Six Colors

On the equator, 🕖 1:00 is the first hour after sunrise, because the length of every day is consistent all year long. Our unique ⌚ watch shows the numbers the way Swahili speakers 👪🔊 tell time. Available in classic black, modern white, hot pink, flaming orange, dark blue, and baby blue!
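The six-hour shift between the conventional clock and the way the watch face counts hours can be expressed as simple arithmetic. The sketch below handles whole hours only, and the function name is hypothetical.

```python
def swahili_hour(hour_24: int) -> int:
    """Convert a 24-hour clock hour to the hour counted from 6:00.

    7:00 becomes 1 (the first hour after sunrise) and 19:00 becomes 1 again
    (the first hour after sunset).
    """
    if not 0 <= hour_24 <= 23:
        raise ValueError("hour_24 must be between 0 and 23")
    shifted = (hour_24 - 6) % 12
    return 12 if shifted == 0 else shifted


print(swahili_hour(7))   # 1
print(swahili_hour(12))  # 6
print(swahili_hour(19))  # 1
```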

/info/watch

Image Squad

Almost every phone is a camera these days, and for Kamusi users every phone is a dictionary as well. We plan to combine our visitors' interests in words and photographs, with tools to upload images of their physical world. More than giving firm substance to concepts that can only be roughly described in words, images submitted from around the 🌍 world will show cultural distinctions that enhance our understanding of what the same term might mean to different 👪 people. For example, a stool in Tanzania evokes this notion that might not be shared in Latvia:

We need to build in several features for users to share their photographs as illustrations for specific dictionary entries.


To help develop or fund our photography tools, please contact us!

/info/images

Shape

Words have "canonical" forms (lemmas), those that are usually shown in dictionaries. Other forms can be easy to include in an entry, such as plurals or gender variations, or the five shapes an English verb can take (see/ sees/ saw/ seen/ seeing). Even simple shapes can be complicated - spelled or spelt? - but our flexible architecture makes it easy to catalogue variations and show where they are used. Multiple alphabets, or spellings with and without tones, are charted as 🔢 data elements within an entry.

More difficult are variations caused by "agglutination", where prefixes, infixes, suffixes, or entire words can be glued together, often causing further changes internally. We have a parser for Swahili that filters words through all 300-odd grammatical rules, revealing the underlying terms. We hope to develop similar routines for other languages as student projects at partner universities, and expect to apply existing tools to parse German when we have dedicated resources. While agglutinative terms are complex, they necessarily adhere to a finite set of rules that speakers 👪🔊, and therefore computers, can apply in each new construction.
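To give a flavor of rule-based parsing, here is a toy sketch that splits a Swahili verb into subject prefix, tense marker, and stem. It covers only a handful of affixes, far fewer than the 300-odd rules mentioned above, does not check the stem against a lexicon, and is not the Kamusi parser. The function name and table layout are hypothetical.

```python
# A few Swahili subject prefixes and tense markers; the real rule set is much larger.
SUBJECT_PREFIXES = {"ni": "I", "u": "you (sg.)", "a": "he/she",
                    "tu": "we", "m": "you (pl.)", "wa": "they"}
TENSE_MARKERS = {"na": "present", "li": "past", "ta": "future"}


def parse_verb(word: str):
    """Return every (subject, tense, stem) split the toy rules allow."""
    parses = []
    for subject in SUBJECT_PREFIXES:
        if not word.startswith(subject):
            continue
        rest = word[len(subject):]
        for tense in TENSE_MARKERS:
            stem = rest[len(tense):]
            if rest.startswith(tense) and stem:
                parses.append((subject, tense, stem))
    return parses


print(parse_verb("ninasoma"))  # [('ni', 'na', 'soma')]
print(parse_verb("walisoma"))  # [('wa', 'li', 'soma')]
```

Returning a list rather than a single answer leaves room for words that the rules can split in more than one way, which a fuller parser would need to resolve against known stems.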

The final objective is to figure out the components of any 📃 text that a user searches for, whether it is stored in the database as a form of a single word, a known party term, or an on-the-fly compound that would never appear in a standard catalogue of terms.

/info/shape

GOLD Theme: Fonts and Orthography

Evertype

Evertype will contribute to the development of fonts, scripts, and orthographies in the Human Languages Project.

/info/evertype

Localization

Beyond the terms we catalogue, all of the navigation and informative content on Kamusi should be available in every language. Translating such a large and growing amount of content is well beyond our budget and personnel capacity. We therefore need to implement a localization backbone, using the open source Mojito Continuous Localization Platform, and then recruit and maintain a network of volunteer translators to work through our existing content and keep their languages updated.

To help develop or fund localization, please contact us!

/info/localization

Athinkra

We work with Athinkra in the areas of font design & development, programming, and script consulting to meet Unicode and ISO-10646 standards, particularly for West African languages.

/info/athrinka

All the Words of All Languages in a Single Dictionary

EPFL Mediacom
19 September 2015
by Cécilia Carron

In pursuit of its goal of translating online every sense of every word into all the world's languages, the universal dictionary project Kamusi has just added 1.2 million terms from several databases. Three African languages and 200,000 Vietnamese words will be integrated soon.

Kamusi, which means dictionary in Swahili, aims to translate each sense of a word from each of the planet's 7000 languages into all the others, with definitions and usage examples. This vast project, begun some twenty years ago, is now growing exponentially. Some languages, such as English and Swahili, are already largely available...

Continue to Full Article...

/info/epfl_mediacom_fr

Fonts for All Languages

Google has been developing a font family called Noto, which aims to support all languages with a harmonious look and feel. Noto is Google’s answer to "tofu", the little ☐☐☐ boxes that appear when your 📱 device doesn't have a font to properly display the 📃 text in a language. The name Noto is meant to convey the 💭 idea that Google’s goal is to see “no more tofu”. Noto has multiple styles and weights, and is 🆓 freely available to all. Because many Kamusi languages use alphabets and scripts that are not supported by default on many systems, we recommend that you install Noto now so you can read any content we provide. Download Noto here! /info/noto_fonts

Language: Hindi

Local name: हिन्दी
Most spoken in: India
👪🔊: More than 1/2 billion
Kamusi records:
ISO 639 (1)/(3): hi/hin
Links: Omniglot | Wikipedia | Glottolog | Ethnologue ($)
Sources: Initial 🔢 data from the Hindi 🔠🕸 Wordnet, by the Centre for Indian Language Technology (CFILT), Computer Science and Engineering Department, IIT Bombay, Mumbai. Additional data from Kamusi participants.
💛😇 GOLD Angel: Nobody.
Pearls: Be the first Pearl for this language!
Join the Hindi Language Task Force to help plan development of robust resources for the language!

[This page is under permanent construction. We would like to make sure the page stays updated, and add new interesting links and information for our visitors. We invite you to help curate this page!!! Please contact us to volunteer to improve this infobox, or to translate the content.]

/info/hin

GOLD Theme: Translation

Multilingual dictionary keeps humans in the loop

New Scientist
26 February 2013
by Jim Giles

There's an old dream shared by artificial intelligence researchers: an algorithm that can create perfect translations of any text, in any language, at the press of a button.

And then there is today's automated translation technology. Services like Google Translate can provide the gist of a passage of text. But if you're a newspaper publisher seeking foreign readers or a public health expert wanting to educate speakers of another language, human translation remains the only option.

Continue to Full Article...

/info/newscientist

Where No Corpus Has Gone Before

Much work in computational 👅 linguistics depends on digitized corpora for a given language, and machine translation is largely based on parallel corpora that have close alignment between languages. A corpus is a compilation of written texts that, together, contain a great many of the words and word forms used in a language, usually produced by native speakers. The best corpora (or corpuses) contain tens or hundreds of millions of words. A corpus can be monolingual, such as the Corpus of Contemporary American English, or can contain parallel translations of the same documents aligned sentence by sentence between languages, such as the proceedings of the European Parliament.

For a well-digitized language, a clever researcher can use a corpus to reveal many things about a language. For example, each word that appears in the corpus can be listed, ranked by frequency, and often analyzed as to its part of speech. This gives the lexicographer a launching point to know what words must appear in a dictionary.
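Building such a frequency list is straightforward once the texts are digitized. The sketch below uses a deliberately naive tokenizer (lowercased runs of word characters), which is an assumption that would not suit every language; the function name is hypothetical and part-of-speech analysis is not attempted.

```python
import re
from collections import Counter


def frequency_list(texts):
    """Count word tokens across texts and return them ranked by frequency."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"\w+", text.lower()))
    return counts.most_common()


sample = ["The shoe fits.", "The other shoe dropped.", "A boot is not a shoe."]
for word, count in frequency_list(sample)[:3]:
    print(word, count)
```

With a real corpus, the top of this list gives the lexicographer the launching point described above.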

A parallel corpus makes it possible to see what terms in one language appear in exactly the same context as terms in another language, and are therefore probably translations of each other. That is, if "shoe" appears in 10 English sentences and "topánka" appears in parallel versions 10 times in Slovak, then the English term and the Slovak term probably both refer to the same idea of 👞. (Word nerds can have fun searching parallel corpora for 25 European languages, beautifully assembled by Linguee, to suss out translations for party terms like "an arm and a leg".) However, parallel data tends to be based on topics of interest to those willing to pay for high-quality public translation, and thus particularly useful for business and government affairs. No similar resource exists for many topics that matter to the general public. In sports, for example, even within the BBC, the English service and the Swahili service might report about the same marathon, but the articles are often divergent re-tellings of the event, not sentence-by-sentence reproductions of a single script, and therefore ineffective as parallel text, were the two stories to be automatically compared. Case in point, as a result of its training corpora, Google Translate generally thinks people run organizations, not races. Additionally, most published literature that has been professionally translated across languages remains under copyright, so a trove such as the Harry Potter series, in 64 languages, is not available for public use.
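The co-occurrence idea in the shoe example can be sketched as a simple count over sentence-aligned pairs. The sentence pairs below are made up, and tokens such as "w1" are placeholders rather than real Slovak words. The function name is hypothetical, and real systems use far more sophisticated alignment statistics.

```python
from collections import Counter


def cooccurrence(aligned_sentences, source_word):
    """Count target-side words in sentences aligned with ones containing source_word."""
    counts = Counter()
    for source, target in aligned_sentences:
        if source_word in source.lower().split():
            counts.update(set(target.lower().split()))
    return counts


# Made-up aligned pairs; "w1", "w2", "w3" are placeholder tokens.
pairs = [
    ("the shoe", "topánka w1"),
    ("a shoe", "topánka w2"),
    ("the hat", "w3 w1"),
]
print(cooccurrence(pairs, "shoe").most_common(1))  # [('topánka', 2)]
```

The word that keeps turning up alongside "shoe" is the most likely translation candidate, which is also why the approach fails when the two texts are not aligned sentence by sentence.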

Learning from corpora is the focus of intense international study. Many of our partners do incredible work with corpora, and with Transtechno we have developed a game (which needs a sponsor to put online) to find high-quality usage examples from the Helsinki Corpus of Swahili. However, corpora bang against several limitations for documenting most languages or finding translations between them. The common thread among these problems is lack of data. Even within English, we can easily tell that "break" occurs many thousands of times, but we cannot know how many of these refer to a rest versus a fracture. Parallel texts hit their limits at the edges of professional translations - sources such as Wikipedia have similar texts in numerous languages, but the articles do not line up at a sentence level and thus do not provide reliable data. Natural speech could be brought in via transcriptions of field recordings, which is probably the best strategy for many minority and endangered languages, but doing so is extremely expensive and time-consuming, and thus unlikely to occur for most languages even if the audio is in the archives.

The biggest problem for corpus lexicography, though, is the thousands of languages for which no corpus exists. Corpora are luxuries that are enjoyed by languages with a lot of published documents, and especially those spoken in countries that have a lot of official support from governments with resources to invest in knowledge about their languages. Africa's 2000 languages, and most of the thousands of indigenous languages of Asia, Australia, and the Americas, are basically left out in the cold. Essentially, the most significant lexicographic tool of the past half century is completely unavailable to most of the world's languages. Of course, building a corpus is not wizardry, and could be accomplished for any language in reasonable time with a scanner and good OCR, or a smart phone recorder and transcription software, but doing so requires time and budget that are not going to be invested by the powers of the purse for the great bulk of languages.

Because corpus lexicography is a useless technique without corpora, Kamusi's main methods do not rely on their existence. Instead, we focus on gathering knowledge directly from the public, either as a supplement to existing data sources, or as our primary source of information. Finding the people to play our games and share their knowledge via our apps is no easy task, but, we propose, will ultimately be the most effective way of documenting languages where no corpus has gone before.

/info/corpus

Doesn't Google Translate Already Do That?
No, They Don't
No, They Can't

Have you ever come to an intersection in a strange place and found yourself following the bulk of the cars, on the gamble that the most popular direction is the place you are most likely headed? That's a similar premise to statistical machine translation, and the Achilles' heel of today's 🏭 industry leader, Google Translate. In their words:
"Typically, when we produce a translation, our system searches through millions of possible translations, selecting the best -- that is, the most statistically likely -- translation."
Google Translate is useful for interpreting the general gist of a 📃 text, and can be quite good in certain circumstances for translations between English and a few lucrative languages. Were it called "Google Approximate", users would know that they are getting a best guess that has a high probability of choosing the wrong vocabulary. The likelihood of error is to a large degree a function of the number of senses a polysemous term has and the number of times that term appears in a parallel corpus with the translation language. Google passes all or most other-to-other translation tasks through English, thereby multiplying the polysemy error probability and eliminating the mitigating aid of parallel text. This makes non-English Google Translate pairs existential 🚂☠ train wrecks.
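
To make the compounding effect concrete, here is a small arithmetic sketch. The per-hop accuracy figures are hypothetical, not measurements of any system, and the calculation assumes the errors on the two hops are independent, which is a simplification.

```python
def pivot_accuracy(p_source_to_pivot, p_pivot_to_target):
    """Chance that both hops pick the right sense, assuming independent errors."""
    return p_source_to_pivot * p_pivot_to_target

# Hypothetical accuracy of choosing the right sense on a single hop.
single_hop = 0.80

# Routing through a pivot language means two hops instead of one.
via_pivot = pivot_accuracy(single_hop, single_hop)
print(f"one hop: {single_hop:.2f}, two hops via pivot: {via_pivot:.2f}")
```

Under these assumed numbers, a term that is right 80% of the time on each hop is right only about 64% of the time after passing through the pivot.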

We contend that overlaying Kamusi's 🎓 knowledge-based structure and methods will ultimately lead to much more accurate machine translations, among many more language pairs. Certainly Google and the other players in the MT 🏭 industry have made tremendous efforts that should be built upon. Kamusi is open to collaborating with anyone in MT who wishes to benefit from our sense-specific 👅👅👅🔢 multilingual data, including lexicalized 🎉 party terms that are marked for separability. We are currently programming our source side pre-disambiguation tool that will be your first opportunity to see our claims in action.

Google makes lots of claims questioned by experts about its leaps using neural networks, and there is no doubt their precision numbers can improve as they tweak their methodology. Any Swahili speaker, for example, who has tried to make it to their boarding gate using the embedded Google "translation" service in the Chicago O'Hare Airport website, knows they have nowhere to go but up. Machine translation is only as good as the underlying data that lets you know a term in one language has the equivalent meaning of a term in another. Google's method is to draw inferences from texts that they think line up between languages. Kamusi's method is to look at each concept, have people determine the links, and lock down that knowledge for machines to learn from. Of course, manually reviewing translation terms for every word is a very large task, which is much more labor intensive and takes a lot longer than setting computers to whir through the numbers. In the long run, however, we suggest that the effort to have people determine mappings across languages will lead to translations that get it right, in ways that statistical methods such as Google Translate have not done and cannot ever do.

/info/google_translate

10 Ways that Kamusi can Revolutionize Machine Translation

There are many reasons that current state-of-the-art machine translation (MT) does not live up to its promise. We examine those issues in relation to Google Translate in this article: http://kamusi.org/google_translate

MT can do better – a lot better. The systems we are developing at Kamusi can produce a quantum leap in translation technology, creating much more accuracy across many more languages than today’s best efforts. The argument against using dictionary data as a launching point for MT always boils down to one thing: we don’t have the data. Kamusi sees the collection of high-quality data about every word in every language as a challenge, not a barrier. As the data comes online, we can put it in the service of MT and other language technologies.

Here are ten ways that Kamusi technology can fundamentally transform machine translation:  
    1. Efficiency: Standard statistical machine translation (SMT) works by comparing large bodies of parallel text between languages. For English to French, SMT needs a big set of comparable documents. For English to Spanish, a different set, and another for Spanish to French. Lessons for a language from one set do not transfer to another. Kamusi’s approach begins on a per-language basis – we focus on the specific meanings of terms in one language, and match them directly with the comparable term in the translation language. This process reduces the burden of pair-by-pair analysis of large volumes of parallel text, and can bring a pair without a parallel corpus (>99.999% of all 25,000,000 language pairs) quite some distance down the path. Instead of searching millions of lines of text to predict possible matches from one language to another, Kamusi performs a dictionary search that cuts directly to sense-specified options for each word. We anticipate that Kamusi’s attention to the lexicon of each language will lead to reduced effort with higher accuracy when integrated with other approaches to MT.
    2. Scalability: Adding new languages is entirely modular – any language can be plugged into the system and grown at the pace of its contributors. When a language comes into Kamusi, its terms are immediately available for translation to all other languages. Launching a new language involves about an hour of configuration, upon which all of the project’s tools are available for experts and the crowd to develop and use the data.
    3. Expandability: New terms, senses, and translations can be added as they are discovered, in nearly real time. Imagine that you notice that the Kamusi results for order do not include this sense: “A request for something to be made, supplied, or served”. You use a simple tool to submit the new sense. Once a moderator approves, your sense is immediately published for use by the public or in downstream technologies like MT. Moreover, the concept is put before the contributors for other languages, so it can gain meticulously curated translations virtually overnight.
    4. Clarity: MT has two major chores: vocabulary and grammar. Kamusi is designed to get the vocabulary right every time. A word like “light” is treated as many different concepts – not heavy, not dark, not serious, etc. – each with its own entry. Those entries are each paired with terms for equivalent concepts in other languages. There are therefore clear relations at the level of the concept. This human-cultivated concept set solves part of the problem inherent within MT, knowing that if a particular sense of l-i-g-h-t appears in the source document, it should be translated with a particular term in the target language. It does not eliminate the task of figuring out which sense is intended on the source side; for that, Kamusi is building a pre-disambiguation interface for users to select the original sense from the defined dictionary entries, ranked in relation to computational word sense disambiguation (WSD) techniques. Used in combination, WSD and lexicon-based sense matching can produce precise vocabulary choices that SMT never will.
    5. Elasticity: MT often faces the problem of determining whether consecutive terms are different words, or should be translated as a single unit. For example, is an African fish eagle an African, a fish, and an eagle, or is it one bird with a long name? Kamusi puts party terms (multi-word expressions (MWEs)) in the dictionary as independent entries, with defined meanings. These entries are then lexicalized concepts that can be translated across languages. When MT encounters a series of words that appear together as a unit in the dictionary, it can translate the unit rather than the component parts.
    6. Separability: Many MWEs can be broken apart, which throws SMT entirely off the trail. For example, drive crazy can be separated: Your perfume drives me crazy. Using Kamusi data, we can tell when words in a lexicalized MWE might have been broken apart, and in the future we may be able to predict the range of terms that can go in between.
    7. Variability: When it comes to grammar, a term may take many forms. The verb “see”, for example, has the inflections sees, saw, seen, and seeing. In Kamusi, each entry is a container for many types of data, including these variations. When we configure a language, we figure out that language’s categories and forms, and produce customized interfaces to catalog those elements. Those tailored word forms can then be mapped across languages, with conjugations, contractions, and other transformations tied to appropriate translations.
    8. Transitivity: We can predict translations even if we are not sure about them. While human-confirmed translations are our goal, transitive links across concepts are our starting point. If we know based on human confirmation that a term in Language A is equivalent to one in Language B, and the term in Language B is equivalent to one in Language C, then we can have high confidence about the match between Language A and Language C – but we won’t lock it in stone until a person who knows A and C can confirm or reject it. In the meantime, the provisional vocabulary postulates can be used within MT, though taken with salt. In all cases, this method will produce more precise results than the disastrous method contemporary MT employs for going between languages that are not directly paired, using statistical guesses to go from Language A to English and another round of statistical guesses from English to Language C.
    9. Non-equivalence: Sometimes one language has a term that does not exist in another, or is expressed in a very different way. Kamusi has methods for producing explanatory translations when direct equivalents do not exist, and for showing bridges between different modes of expression. This information, never before modeled or documented, can be extracted from Kamusi in ways that are friendly to MT processes.
    10. Topical terminology: Many domains have specialized terms with meanings that differ from daily language. In sailing, for example, beam, beat, bend, and block all have meanings related to boats and wind. Terms in Kamusi can be designated as belonging to terminology sets for particular domains, making it possible to identify the vocabulary that should be preferred for particular documents.
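
Points 5 and 6 in the list above (elasticity and separability) can be illustrated with a short sketch. Everything here is hypothetical: the lexicon format, the function name `find_mwes`, and the `max_gap` parameter are assumptions for the example, not Kamusi's data model, and the input is assumed to be lemmatized already (so "drives" has become "drive").

```python
def find_mwes(lemmas, lexicon, max_gap=2):
    """Return (start, end, expression) spans for lexicon entries found in lemmas.

    lexicon maps a tuple of lemmas to True if the expression is separable.
    """
    found = []
    for start in range(len(lemmas)):
        for mwe, separable in lexicon.items():
            if lemmas[start] != mwe[0]:
                continue
            pos = start
            matched = True
            # Separable entries tolerate up to max_gap intervening tokens
            # before each later component; other entries must be contiguous.
            gap = max_gap if separable else 0
            for word in mwe[1:]:
                window = lemmas[pos + 1 : pos + 2 + gap]
                if word not in window:
                    matched = False
                    break
                pos = pos + 1 + window.index(word)
            if matched:
                found.append((start, pos, " ".join(mwe)))
    return found

# Hypothetical lexicon entries, using the examples from the list above.
lexicon = {
    ("african", "fish", "eagle"): False,  # must appear as one unbroken unit
    ("drive", "crazy"): True,             # marked as separable
}

print(find_mwes(["an", "african", "fish", "eagle"], lexicon))
print(find_mwes(["your", "perfume", "drive", "me", "crazy"], lexicon))
```

A fixed `max_gap` is a crude stand-in for predicting the range of terms that can go in between, which the text describes as future work.
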
These aspirations for next-generation MT are built into Kamusi’s design. The current state of the art offers translations that range from awful to adequate, depending on the language pair, the complexity of the text, and the user’s expectations. We contend that SMT is reaching the limits of its potential, and radical progress in MT will only come from approaches that focus on how vocabularies and grammars interact, within and across languages. The Kamusi structure is crafted to support the fine-grained data needed for a quantum leap in translation technology, the jump from adequate to excellent. In our effort to produce a global online living dictionary, we have embarked on collecting rich data for many languages – data that will serve as the bedrock of excellent universal translation.

/info/revolution

Institute for Specialised Communication and 👅👅👅 Multilingualism

ISCM, at the European Academy of Bolzano/Bozen in 🇮🇹 Italy, will bring to the Human Languages Project particular expertise on terminology, specialised translation, corpora of language varieties and learner corpora, and tools for the visualisation of linguistic 👅🔢 data.

/info/iscm

Alignment

Kamusi harvests data from many sources, including existing dictionaries, open data sets, and members of the public. The fundamental problem in putting all this data together is that, until Kamusi, there has never been a way to affirm that a term in any one source is equivalent to a term in another unaffiliated source. Even within a company such as Oxford Dictionaries, you would have a difficult mission to equate the various meanings of l-i-g-h-t (not heavy, not dark, not serious, not fattening...) with the terms for those different ideas in their English-Spanish and English-Arabic dictionaries, to figure out which Spanish term equates with the same concept in Arabic. Kamusi resolves this problem by aligning concepts, not spellings.

When a computer looks at l-i-g-h-t, this is what it sees: 0110110001101001011001110110100001110100, the binary code for a string of five letters. Our Basque dataset matches that language's term "argitasun" with the sequence of digits you see above, as does "afessas" in Berber. In fact, PanLex finds nearly 10,000 terms that match to that sequence from nearly 1600 language varieties. Without further context, this spelling match is the closest we can get to forming connections among languages. This is why multilingual "translation" services such as Google Translate frequently give catastrophic results. Unlike a computer, a bilingual Filipino-English speaker who looks at Charles Nigg's 1904 Tagalog-English and English-Tagalog Dictionary can instantly tell which Tagalog term matched to l-i-g-h-t corresponds to which English sense. The person faces a different hurdle: how would someone ever convey their individual knowledge into actionable data that can be shared on digital systems for others to use?
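
The bit string quoted above can be reproduced with a few lines of Python, which also shows why a spelling carries no sense information: the same bits come out no matter which meaning of the word was intended. The helper name `to_bits` is an assumption for this example.

```python
def to_bits(text):
    """Return the 8-bit binary form of each ASCII byte of text, concatenated."""
    return "".join(format(byte, "08b") for byte in text.encode("ascii"))

# The same five letters always yield the same 40 bits, whether the
# writer meant "not heavy", "not dark", or "not serious".
print(to_bits("light"))
```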

Kamusi has designed unique systems to match linguistic data (01100100011000010111010001100001) to language knowledge (what is in your head). From WordNet, we have a beginner set of about 100,000 concepts defined in English, soon to rise to XXX by aligning to Wiktionary. We show our defined terms in DUCKS (Data Unified Conceptual Knowledge Sets), and players drag the unaligned term from their dataset to the definition that matches. For the Wiktionary version of DUCKS, where we have a Wiktionary sense of l-i-g-h-t, a participant can eyeball the Kamusi sense that corresponds and tie the two together (with three goals, first to find missing senses, second to provide alternative definitions in case the WordNet description is inadequate, and third to bring in translations to many other languages that have been produced by Wiktionary volunteers). For Filipino, players are shown one of the terms in their dataset that matches to l-i-g-h-t, and they choose whether it means "not heavy", "not dark", "not serious", or "not fattening". When a consensus is achieved by a critical mass of players, we consider the alignment to be validated.
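
The consensus step might look something like the sketch below. The text does not say what counts as a critical mass, so the thresholds `min_votes` and `min_share`, like the function name `validated_sense`, are invented for illustration only.

```python
from collections import Counter

def validated_sense(votes, min_votes=5, min_share=0.8):
    """Return the sense most players chose if agreement is strong enough, else None."""
    if len(votes) < min_votes:
        return None  # not enough players have answered yet
    sense, count = Counter(votes).most_common(1)[0]
    return sense if count / len(votes) >= min_share else None

# Hypothetical votes from players shown one Filipino term matched to l-i-g-h-t.
votes = ["not heavy"] * 5 + ["not dark"]
print(validated_sense(votes))
```

With these assumed thresholds, five of six players agreeing is enough to validate the alignment, while an even split or too few answers leaves the term unaligned.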

Because each version of DUCKS connects to the same core concept set, we are able to make high-probability second-generation connections among languages. While we are insistent that our results show the English or other language we use as the pivot so that we do not make uncertified truth claims, data alignment means that we can confidently assert that Filipino "not heavy" is a likely match for Vietnamese "not heavy" and the term for the same idea in Amharic. (Aligned terms will advance to a game for people to verify proposed links between languages, but we certainly will not be able to find bilingual players for all 25 million language pairs.) When we have the financial resources to work with the 1.3 billion terms in PanLex, we will be able to align concepts across as many as 11,000 language varieties. By combining the computer's ability to process data with people's ability to understand it, our systems are geared to line up linguistic knowledge at the sense level across the world's languages.
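
A minimal sketch of these second-generation connections follows, assuming each validated alignment ties a (language, term) pair to a shared concept identifier. The identifiers, the placeholder terms, and the function name `proposed_links` are all hypothetical. The last line checks the pair count for 7,000 languages, which shows why bilingual players cannot be found for every pair.

```python
import math
from itertools import combinations

# Hypothetical validated alignments: (language code, term) -> concept ID.
# The terms are placeholders, not real dictionary data.
alignments = {
    ("fil", "fil_term_1"): "light/not-heavy",
    ("vie", "vie_term_1"): "light/not-heavy",
    ("amh", "amh_term_1"): "light/not-heavy",
    ("fil", "fil_term_2"): "light/not-dark",
}

def proposed_links(alignments):
    """Propose cross-language links between terms aligned to the same concept."""
    by_concept = {}
    for (lang, term), concept in alignments.items():
        by_concept.setdefault(concept, []).append((lang, term))
    links = []
    for concept, entries in by_concept.items():
        for a, b in combinations(entries, 2):
            if a[0] != b[0]:  # only link terms from different languages
                links.append((a, b, concept))
    return links

for link in proposed_links(alignments):
    print(link)

# Number of unordered pairs among 7,000 languages: roughly 25 million.
print(math.comb(7000, 2))
```

The proposed links are candidates only; as the text says, they stay provisional until people verify them.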

/info/alignment

Pre:D (that's short for source-side pre-disambiguation)

Pre:D slideshow Say What You Mean

We want you to find the right word for translation, every time. To help you, we are developing Pre:D, a complicated system that is described in this working paper for the European Association of e-Lexicography, presented in Brno, Czech Republic, 16 September 2016.

Full article: Kamusi Pre:D – Lexicon-based source-side predisambiguation for MT and other text processing applications

Abstract
Kamusi has been developing a system to analyze texts on the source side and present users with sense-specified dictionary options. Similarly to spellcheck, the user selects the intended meaning. We then use a multilingual lexical database to bridge to matching vocabulary in other languages. When paired with Freeling, additional pre-processing is possible for several languages. Integration with MT via Moses and Apertium is planned, but not yet undertaken. MWE treatment is important. An MWE is lexicalized in the Kamusi database and marked for separability, with a definition and translation equivalents (one or more words) in other languages. When the initial term of an MWE appears in the source text, Pre:D queries the database and scans the sentence for all MWEs that could follow. The user can select the relevant MWE rather than the component words. A user can submit a missing sense or MWE for inclusion in the lexicon. Named entities can also be identified from data sources or by users and rendered appropriately across languages. When users agree, we will also use sense-tagged sentences for machine learning. A prototype of the core system is already functional.

/info/pre-d

Meaning

A star is:
(A) a bright thing in the sky at night
(B) a celestial body that consists of a luminous sphere of plasma held together by its own gravitational field
(C) a fixed luminous point in the night sky which is a large, remote incandescent body like the sun
(D) an actor in a leading role

All of the definitions above appear formally online, and all are valid. The first three describe essentially the same thing, while the concept described for (D) is completely different. Looking more closely, (A) could include planets, and in (B) nighttime visibility plays no role. Dictionaries pretend that we sharply etch the scope of each word. In reality, every definition is an approximation of what is understood by a language's speakers 👪🔊 in a particular context. A single term can have multiple meanings, and a single sense can be expressed in many ways.

Kamusi is obsessed with distinctions of meaning. This obsession arises largely from the perils of matching concepts among languages. For good translation, it is imperative to have a fine sense of meaning, so that the most appropriate word can be selected in the other language. At the same time, we cannot be fixated on encapsulating a concept with a single perfect description. Our goal is to sculpt each concept's description, so it can satisfy the full range of potential readers, from primary school students to astrophysicists. For English, we have 🎮 games designed to improve the provisional definitions we have borrowed from 🔠🕸 WordNet, and to bring in more definitions from Wiktionary and other sources. Where senses align, we can show the definition from each source, expanding the user's ability to comprehend the overall concept. Crucially, this alignment will make it possible to create a universal identifier code for each concept, which can be used across language technology applications to know that particular strings of 🔡 letters refer to the same thing.

For most terms in most other languages, definitions have never been written. Over time, we intend to work with our partners and participants, using our tools to produce own-language definitions for every term. Those definitions can also be translated to English or other languages, but terms have an indigenous scope of meaning that is often only partially conveyed by words in other languages. With these own-language definitions, students of minority languages will for the first time have the sort of monolingual reference that speakers of favored languages use every day to enhance their communications and 🎓 knowledge.

/info/meaning

What is the Best Definition for a Term?

This is a re-post and modification from an answer that originally appeared on Quora.

Can you define "best"?

So much depends on your audience. Check out these two definitions for "star":
American Heritage Dictionary: A self-luminous celestial body consisting of a mass of gas held together by its own gravity in which the energy generated by nuclear reactions in the interior is balanced by the outflow of energy to the surface, and the inward-directed gravitational forces are balanced by the outward-directed gas and radiation pressures.
YourDictionary: A bright point of light in the sky.

The second definition leaves out a whole lot, doesn't it? For example, "a bright point of light" could include satellites, or airplanes flying at night. But if you are a student in Tashkent trying to teach yourself English, which definition gives you a better chance of grasping the concept?

A dictionary definition is an attempt to give the most possible useful information in the least possible space. The basic premise is that a definition should be able to be plugged into the place of the word in question, in context, in which case #2 would satisfy: Columbus navigated by the bright points of light in the sky.

However, oversimplification can let in wrong information, and then it depends on how persnickety you want to be. I once gave a stab at defining «heart» with reference to the left side of the body, until a colleague pointed out that a condition called dextrocardia puts the heart on the right side in 1 person out of 12,000. Whoops. We deleted the errant information, and still ended up with a definition more suited to a secondary student than to a physician: A muscular organ that pumps blood through the body.

Notice that we ended up with a better definition than one that referred to the heart's placement in the human body, because getting rid of the spatial reference also made the definition applicable to any animal. And now it gets complicated. We gave up on being original, and cribbed the definition from a pretty good attempt at Wiktionary. However, the original Wiktionary definition includes extra information that actually detracts from understanding the physiological sense: A muscular organ that pumps blood through the body, traditionally thought to be the seat of emotion. In my book, "the organ of the human body traditionally thought to be the seat of emotion" would be a separate sense. So, "best" is subjective, depending both on the person who is valiantly attempting to encapsulate a big idea in a tiny space, and the person who is trying to make use of the definition for their particular needs with their particular foreknowledge.

The lexicographer is one schmo, who happens to write dictionary definitions. Ask five other lexicographers, you'll get ten other opinions. Can we tell you which is best? I'm a frayed knot.

/info/best_definition

Transtechno

Transtechno, a small Finnish enterprise specializing in corpus planning for African languages, will bring decades of expertise to the Human Languages Project.

/info/transtechno

GOLD Theme: Controlled Vocabularies

EatUp!

Change how you eat when you travel. Restaurants usually only have the space on their menus for their local language, and perhaps an amusing attempt at English or another language. Automatic tools like Google Translate are tragically inept at menu items, because they do not have comparative source 🔢 data. We have found the key terms on thousands of menus, and are in the process of finding equivalent terms in dozens of languages. No matter the original language, you'll see the name of each dish, as well as translations of all the ingredients and descriptions, in the language you understand. In the case study above, the food is Lebanese, the menu is in German, and the guest of honor speaks French, Spanish, and several African languages. Our prototype is working, but we have more features to add to make this the restaurant app to drool over.

/info/eatup

Controlled Vocabularies

Kamusi's goal, "every word in every language", is too many words for many specific use contexts. People 👪 can often benefit from apps that restrict vocabulary to specialized needs instead - still with the goal of presenting those terms in as many languages as possible. For example, first responders need an app that can fit on their phones, with terms for any emergency they might encounter and any language their victims might be speaking, whereas the extra power of a full dictionary might end up getting in the way of their finding the type of term they need. At the same time, with millions of concepts floating in the air, controlled vocabularies can help focus data collection efforts on topics that are of immediate importance across borders.

We have several specific controlled vocabulary apps in mind. We need 👪 people to work on custom code for a few different use-case circumstances, and we need 👪 people to work on developing data sets. To help develop or fund the controlled vocabularies, please contact us!

/info/controlled

Restaurant Menus

We have partially coded EatUp! - the restaurateur input options work pretty well, but we don't yet have an app for restaurant visitors. We also need to work on the 🔢 data. We have thousands of menu-specific terms in English and French, and a good data source that will allow us to cross-pollinate with other languages. To finish the project, we need 👪 people for both the interface and 🔢 data sides. To help develop or fund EatUp!, please contact us!

/info/menus

Kamusi Mousepad

Thick, soft, durable mousepad - keep Kamusi close at hand!

/info/mousepad

All words from all languages in one dictionary

EPFL Mediacom
19 September 2015
by Cécilia Carron

The universal online dictionary Kamusi has just added 1.2 million terms from several databases in its quest to translate all the meanings of every word in all the world’s languages. Three African languages and 200,000 words of Vietnamese will soon follow.

Kamusi, which means dictionary in Swahili, aims to translate all the meanings of words from 7,000 languages from around the world into all other languages, and it will include definitions and usage examples. This vast project, which began twenty years ago, is growing at an exponential rate. Some languages, like English and Swahili, are already largely available...

Continue to Full Article...

/info/epfl_mediacom_en

QR Products

Product ingredients, usage, safety, and assembly instructions... Consumers are more likely to buy products if they can understand the labels and the instructions, but packaging space is too small to fit multiple languages, and translations are too difficult. Kamusi has a solution to this problem, making it possible to have accurate product information in tiny spaces for hundreds of languages at low cost. Put a QR code on a product and get Kamusi translations to any language. One more adaptation of our core technologies, waiting for a visionary financier.

/info/qr

CultureUp!

Museums, zoos, botanical gardens, and many other public spaces work hard to produce explanatory information for their visitors, but have no space or expertise to translate that information for the dozens of languages spoken by the tourists who flow through. We can work with translation teams to produce 📃 texts that visitors can read on their 📱 mobile devices as they explore the space around them. At the same time, each institution will be able to share its unique cultural content with online visitors 🌍 worldwide who will never be able to visit in person. This system will piggyback on EatUp! and add cool new interactive programs for language learners to participate. All we need to go forward is a 💛😇 GOLD Angel to underwrite development costs.

/info/culture

Public Spaces

Museums, zoos, and other public spaces receive visitors from around the 🌍 world, but they cannot produce their explanatory information in more than a few languages. We can prepare an app for institutions to translate their signs to a great many languages. The museum should be able to upload images of its signs. Using team translation, Kamusi participants will then settle on a translation for their language, learning about cultural items from far away at the same time they help members of their community who find themselves abroad. Institutions will also have the option to borrow from each other, so one zoo with a wallaby can use on-the-shelf translations of information about wallabies from other places. Visitors should in turn have an app that gives them information in their language as they move through a given public space; this app can be keyed to exhibit numbers or QR codes, and perhaps GPS location for outdoor exhibits. To help develop or fund CultureUp!, please contact us!

/info/museums

GOLD Theme: Portugal and Brazil

Universidade Nova de Lisboa

The Centro de Linguística da Universidade Nova de Lisboa brings to the Human Languages Project expertise on terminology and ontology, as well as the Bantu languages of Angola and the Kaboverdianu language of Cape Verde.

/info/lisboa

😂🌎🤖 EmojiWorldBot on Portuguese TV

Do you know what each of these emojis means?
Emojis are in fashion. There are already thousands of them, and the number keeps growing. To help us understand emojis, an international team is working on a Portuguese dictionary.

/info/portugal_tv

Language: Portuguese

Local name: Português
Most spoken in: Brazil, Portugal, Angola, Mozambique
👪🔊: More than 200 million as first language and 50 million as second language
Kamusi records: 21,262
ISO 639(1) / (3): pt / por
Links: Omniglot | Wikipedia | Glottolog | Ethnologue ($)
Sources: Initial 🔢 data from the Portuguese 🔠🕸 WordNet (PULO), developed by Alberto Simões at the University of Minho. Additional 🔢 data from Kamusi participants.
💛😇 GOLD Angel: Nobody.
Pearls: Be the first Pearl for this language!
Join the Portuguese Language Task Force to help plan development of robust resources for the language!

[This page is under permanent construction. We would like to make sure the page stays updated, and add new interesting links and information for our visitors. We invite you to help curate this page!!! Please contact us to volunteer to improve this infobox, or to translate the content.]

/info/por_pt

CIDLeS

The Interdisciplinary Centre for Social and 👅📃 Language Documentation, based in Minde, Portugal, will be a partner in the Human Languages Project for the 👅🔫 endangered languages of Europe and Mexico. Language endangerment is a hidden problem in Europe, where budget is given to the big official languages, while numerous smaller languages are thrust aside with almost no attention (and sometimes official hostility) to documenting and preserving the cultural patrimony they contain.

/info/cidles

🔊🎨 Sound Effects

We can embed 🔊 sounds from freesound, to add acoustic information to relevant senses of thousands of terms. These 🔊 sounds can follow the concept through Kamusi across languages, so you won't need to know the English word for "lighter" to know what a lighter sounds like. Making this work will take a little coding and a special 🐥📊 DUCKS module, including recruiting the 👪 people to play the 🐥📊 DUCKS through to completion. This is outside our main scope for the moment, but is a cool project we can task to the right person, or implement with some special funding from a 💛😇 GOLD Angel. Please contact us if you can help!

/info/sounds

Sound

Spoken languages begin with 🔊 sound. Hearing children learn language first by listening and speaking; ✍ writing is a later addition for individuals and societies. For interaction between humans and computers, machines must be able to decode utterances as 📃 text, and convert text to 🔊 auditory signals that make sense to the 👪 people hearing them. Much of the groundwork has been laid for English and a few other languages, but acoustic information for most people has yet to be captured in ways that can be used by technology. Computers at Bell Labs could recognize the spoken English words 1 through 10 in 1952; in the subsequent 64 years, this basic trick has been repeated for only about ten languages in all of Africa.

Our model provides space to record natural 🗣🔊 speech sounds and match them to shapes (wind [twist]/ whined/ wined) and places - not just "big", but "bigger" and "biggest", not just Parisian French but Vaudoise and Quebecoise and Ivorian. By collecting this 🔢 data in a dictionary linked by meaning, we can envision the day when a person speaking, say, a local variety of Swiss German could have their words recognized, translated, and output as comprehensible speech in a regional dialect of Cantonese. The processes necessary to do the acoustic modeling for any language are well established, but the digitized data for most languages is nonexistent. Within the Kamusi data framework, and working with partners in the Human Languages Project, where we can find sponsorship, we can gather and deploy the sounds needed for advanced voice technologies for any language.

/info/sound

Donate in Brazilian Real



/info/brazil_real

Kamusi Hoodie

Keep warm in this comfy sweatshirt. For men's sizes, click the 🖼 picture; for women's sizes, click here.

/info/hoodie

GOLD Theme: East Africa

🆓 Free Swahili Dictionary for 📱 Mobile Devices

This mobile app provides free 🆓 OFFLINE access to the English-Swahili and Swahili-English content of kamusi.org. It contains 32,000 translation pairs. Search quickly in either direction, without a network connection! Kamusi hii ni ya toleo kamusi.org kutumia nje ya mtandao, kupitia Kiingereza-Kiswahili na Kiswahili Kiingereza. Ina majozi 32,000. Tafuta chapuchapu Lugha hadi Lugha, bila na kuunga mtandao!

The 🔢 database is more than 26MB, and will be downloaded the first time you run the application. We recommend downloading with a Wi-Fi connection. Updates to the database will be available periodically. Hifadhidata 🔢 ni kubwa, zaidi ya MB 26, na kwa mara ya kwanza itapakuliwa pindi tu uanzapo kutumia programu hii. Hivyo basi, tunapendekeza uipakua kwenye Wi-Fi. Maboresho ya hifadhidata yatapatikana mara kwa mara.

/info/swahili_mobile

Bishop Barham University College

Bishop Barham approached Kamusi to create a Uganda National Living Dictionary that will extend our molecular lexicography to 10 of Uganda's regional languages, as research projects for students in their Master of Arts and Translation on Language Development program. The project can begin if and when we gain support from a funder who agrees that Ugandan students deserve resources for their languages on par with those available for Europeans.

/info/bbuc_uganda

Kiswahili Grammar Notes

The Kamusi Project wishes to thank Helen L. Erickson and Marianne Gustafsson for their kind permission to make their 📃 text available electronically. We hope you find this guide helpful.

Table of Contents

(i). Introduction
  1. Nouns
  2. Pronouns
  3. Interrogatives
  4. Adjectives
  5. Numbers
  6. Calendar Dates
  7. Telling Time
  8. Verbs
  9. The -A of Relationship
  10. The -O of Reference
  11. Prepositions
  12. Conjunctions
  13. Adverbs
  14. Interjections
  15. Expressions

Kiswahili Grammar Notes is Copyright 1984 and 1989, reproduced with permission. If you find this text valuable, the authors ask that you consider a small contribution to a charitable purpose.

/info/swahili_grammar

Defining Moment For Swahili

Kamusi Project Envisions A Unified African Dictionary Online And In Print
Hartford Courant
20 November 2005
by Adrian Brune

Coming of age during the American Revolution, Noah Webster believed fervently in the country's cultural independence and the role played by its American idiom, pronunciation and style. In 1806, America's first lexicographer published his "compendious dictionary of the English language." Nearly 200 years later, the Merriam Webster dictionary has become one of the world's best known, with more than 200,000 entries.

With more than 80 million speakers in East and Central Africa, Swahili is the most widely spoken language in Africa, though a fully updated dictionary of the language has not been produced for 30 years...

Continue to Full Article...

/info/courant

Africa Nazarene University

ANU will join the Human Languages Project to work on the Ekegusii, Gikuyu, Maasai, and Kamba languages of Kenya.

/info/nazarene

Language: Swahili



Download our Swahili-English dictionary as a 🆓 free mobile app for Android and iPhone!
Local name: Kiswahili. Using "Swahili" vs. "Kiswahili"
Most spoken in: Tanzania, Kenya
👪🔊: More than 100 million, including tens of millions who learn it from birth as or along with their mother tongue
Kamusi records: 60,000+
ISO 639(1) / (3): sw/swa
Pronunciation: Swahili Pronunciation Guide
Books and Music: Swahili Bookstore and Learning Bibliography
Bilingual Dictionaries: Russian
Links: Omniglot | Wikipedia | Glottolog | Ethnologue ($) | Swahili Grammar Notes | Kanga Writings and Sayings | Periodic Table of the Elements
Song Lyrics: Harambee | Jambo | Majengo | Malaika
Sources: Initial 🔢 data from the Swahili-English Dictionary (1968), edited by Charles Rechenbach and printed by The Catholic University of America Press (ISBN 978-0813204062), used with permission. Additional data from Kamusi participants. The sources for many of our Swahili Usage Examples can be found at http://kamusi.org/content/source-notes.

💛😇 GOLD Angel: Nobody.
Pearls: Be the first Pearl for this language!
Join the Swahili Language Task Force to help plan development of robust resources for the language!

Where is our Swahili 🔢 data? Although Kamusi started as a project for Swahili, that part of our site is currently offline due to our server/funding situation. We are striving to have the Swahili dictionary back online as soon as possible, with a 🐥📊 DUCKS game 🎮 to align it with all the new 👅👅👅 multilingual data. Your 🎁 donations will help speed us toward this goal!

Additional online resources from other sites that we recommend: [This page is under permanent construction. We would like to make sure the page stays updated, and add new interesting links and information for our visitors. We invite you to help curate this page!!! Please contact us to volunteer to improve this infobox, or to translate the content.]

/info/swa

Kamusi on the Colbert Report

Barack Obama's inability to remember what magazine he read when he was nine is all Stephen Colbert needs to call him a liar. Kamusi carries the punchline for the segment. May 22, 2007. (Only available to viewers in USA)

/info/colbert

Makerere

If and when we can obtain the resources to support their work, Makerere will lead the development of resources for Uganda's major language, Luganda.

/info/makerere

Redsea Online Cultural Foundation

We are seeking resources to build and manage 🐥📊 DUCKS that will integrate the fantastic Somali resources developed by Redsea within the 👅👅👅 multilingual Kamusi system.

/info/redsea

Boston University

The Boston University African Studies Center will join the Human Languages Project to work on several African languages.

/info/bostonu

GOLD Theme: Endangered Languages

Endangered Languages 👅🔫

Many of the 🌍 world's 7000 languages are in danger of disappearing within years or decades. When they go, priceless human heritage will be lost, unless their words and meanings can be documented first. Kamusi is working on tools and techniques to help in this effort. When we have sufficient sponsorship for implementation, our mobile elicitation app will be customized for field lexicography for languages that have few speakers 👪🔊, and usually no written form. We will concentrate on talking dictionaries, where speakers not only say the words in their language, but also provide spoken vignettes in lieu of written definitions. In conjunction with our concept priority list, we intend to produce a "research project in a box" that university and graduate students can use to undertake rapid field documentation of a chosen 👅🔫 endangered language.

/info/endangered

Talking Dictionary

We are inspired to take the 💭 idea of Talking Dictionaries to a new level. As pioneered by the Enduring Voices project, Talking Dictionaries provide a recording of the basic 🔊 sounds of individual words in 👅🔫 endangered languages. With the Kamusi architecture, it is possible to include spoken definitions of indigenous terms. These would not be formal lexicographers' definitions, but rather spoken vignettes that field researchers record with native speakers 👪🔊 on 📱 mobile devices. We need an app for a researcher to elicit a term (what is your word for "sun"?), record that word, then record a short explanation of that term in the indigenous language, and hopefully a follow-on explanation of the term in the contact language. The 📱 device should then synch nicely with the main Kamusi database when a network connection is available. This tool will be part of a research kit that we are planning, for graduate students to be able to undertake field research with a "project in a box" that they can quickly deploy for documenting 👅🔫 endangered and minority languages without having to struggle with setting up their own database, elicitation list, or dissemination system. To help develop or fund the project, please contact us!
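The record-then-synchronize workflow described above could be modelled roughly as follows. This is only a sketch under stated assumptions: the `Elicitation` and `FieldSession` classes, their field names, the file paths, the language code, and the `upload` callback are all hypothetical, standing in for whatever the planned app and the main Kamusi database would actually use.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Elicitation:
    """One elicited term, with optional spoken explanations (hypothetical schema)."""
    concept: str                           # prompt, e.g. "sun"
    language: str                          # code for the language being documented
    word_audio: str                        # path to the recording of the word
    vignette_audio: Optional[str] = None   # spoken explanation in the indigenous language
    contact_audio: Optional[str] = None    # follow-on explanation in the contact language
    synced: bool = False


@dataclass
class FieldSession:
    """Keeps recordings on the device until a network connection is available."""
    records: List[Elicitation] = field(default_factory=list)

    def record(self, entry: Elicitation) -> None:
        self.records.append(entry)

    def pending(self) -> List[Elicitation]:
        return [r for r in self.records if not r.synced]

    def sync(self, upload: Callable[[Elicitation], bool], online: bool) -> int:
        """Try to upload every pending record; return how many succeeded."""
        if not online:
            return 0
        sent = 0
        for entry in self.pending():
            if upload(entry):      # upload() stands in for the real server call
                entry.synced = True
                sent += 1
        return sent


session = FieldSession()
session.record(Elicitation("sun", "xxx", "audio/sun_word.wav", "audio/sun_vignette.wav"))
print(session.sync(upload=lambda e: True, online=False))  # no network: nothing is sent
print(len(session.pending()))                             # the record is still queued
```

Marking records as synced only after a successful upload means an interrupted connection leaves them queued for the next attempt, which matters for field sites with unreliable networks.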

/info/talking

University at Buffalo

The Department of Linguistics at the University at Buffalo will contribute to the development of tools, standards, and methods for gathering lexical and comparative 🔢 data on under-documented languages in the Human Languages Project, whether this work is to be done by speaker 👪🔊 community members or 👅 linguistic experts, as well as the development of models for long-term archiving of any data gathered under the aegis of the consortium.

/info/buffalo

Internet Equality

Does the Internet provide equal access to the tools for 🎓 knowledge and prosperity? Not if you are one of the billions of 👪 people who do not speak the few favored languages of the 🕸 Web. This information box (under construction) will discuss the issues. Please visit again soon!

/info/internet_equality

Crúbadán

Based at Saint Louis University in St. Louis, Missouri, Kevin Scannell's An Crúbadán project is central to the production of unique and essential resources for dozens of languages that are otherwise excluded from technology, and that will contribute to the success of the Human Languages Project.

/info/crubadan

Baseball Cap

Sport our great-looking canvas cap!

/info/cap

Why Should I Care?

Along with a generous $100 donation, a Friend-of-Kamusi sent the following provocative message, which this video tries to answer. The first half discusses what the project will do for you, and the second half talks about what it will do for others:

Why should I care? Why should I be motivated to be part of the project? I realize you are not selling a product to be consumed, but every purchase has some higher level motivation that drives someone to give you their money. If this global dictionary project is going to work, I think there is an opportunity to think about what will motivate a larger group of people to "buy" what you are "selling". Not only compel me to give you my money, but compel me to recommend your "product" to my friends. Again I want to say I am highly impressed with the project and wish you all the best.
/info/care

Donate in Danish Krone



/info/danish_krone

GOLD Theme: Latin America

Tawa Cultural Association

The Asociación Cultural Tawa, based in Cusco, Peru, is an indigenous Peruvian organization leading the efforts to launch Quechua within the Kamusi framework.

/info/+++

Language: Quechua

Local name: Quechua, runa simi
Most spoken in: Peru, Bolivia
👪🔊: 8 to 10 million
Kamusi records: [data queued for upload and 🐥📊 DUCKS processing]
ISO 639(1)/(3): /
Links: Omniglot | Wikipedia | Glottolog | Ethnologue ($)
Sources: Initial data is a 19,000-term Quechua-Spanish dictionary from A Basic Language Technology Toolkit for Quechua, University of Zurich 2015, by Annette Rios, which will be aligned with our current Spanish data using 🐥📊 DUCKS. Additional data from Kamusi participants.
💛😇 GOLD Angel: Nobody.
Pearls: Be the first Pearl for this language!
Join the Quechua Language Task Force to help plan development of robust resources for the language!

[This page is under permanent construction. We would like to make sure the page stays updated, and add new interesting links and information for our visitors. We invite you to help curate this page!!! Please contact us to volunteer to improve this infobox, or to translate the content.]

/info/que

Universidad EAFIT

The Research Network on International Studies at Universidad EAFIT, in Medellin, Colombia, works with Kamusi on 👅 linguistic diversity, protection, and conservation in Latin America.

/info/eafit

Donate in Mexican Pesos



/info/mexican_pesos

Economic Opportunity

In what ways do the resources available in your language affect your ability to benefit from 🌎 global systems of 🎓 knowledge and prosperity? In what ways are 👪 people without linguistic access condemned to miss out? This information box is under construction. Please visit again soon!

/info/economic_opportunity

Language: Spanish

Local name: Español, Castellano
Most spoken in: Spain, Mexico, and most countries of Central and South America
Speakers: More than 1/2 billion
Kamusi records: 58,618
ISO 639(1) / (3): es / spa
Links: Omniglot | Wikipedia | Glottolog | Ethnologue ($)
Sources: Initial 🔢 data from the Multilingual 👅👅👅 Central Repository, maintained by the University of the Basque Country, Department of Software, and the Technical University of Catalonia (UPC). Additional data from Kamusi participants.
💛😇 GOLD Angel: Nobody.
Pearls: Be the first Pearl for this language!
Join the Spanish Language Task Force to help plan development of robust resources for the language!

[This page is under permanent construction. We would like to make sure the page stays updated, and add new interesting links and information for our visitors. We invite you to help curate this page!!! Please contact us to volunteer to improve this infobox, or to translate the content.]

/info/spa

GOLD Theme: Spain

Language: Galician

Local name: Galego
Most spoken in: Spain
Speakers: 2.4 million
Kamusi records: 27,126
ISO 639(1) / (3): gl / glg
Links: Omniglot | Wikipedia | Glottolog | Ethnologue ($)
Sources: Initial 🔢 data from the Multilingual 👅👅👅 Central Repository, maintained by the University of the Basque Country, Department of Software, and the Technical University of Catalonia (UPC). Updated data (June 2017) from Grupo TALG (Tecnoloxías e Aplicacións da Lingua Galega), Universidade de Vigo, Vigo, Galiza, http://sli.uvigo.gal/galnet/
Additional data from Kamusi participants.
💛😇 GOLD Angel: Nobody.
Pearls: Be the first Pearl for this language!
Join the Galician Language Task Force to help plan development of robust resources for the language!

[This page is under permanent construction. We would like to make sure the page stays updated, and add new interesting links and information for our visitors. We invite you to help curate this page!!! Please contact us to volunteer to improve this infobox, or to translate the content.]

/info/glg

A dictionary can do that?!?

This video showcases some of the major features introduced in 2013 that set Kamusi apart from any dictionary you've seen before.

/info/possible

Language: Catalan

Local name: Català
Most spoken in: Spain, Andorra
👪🔊: 11 million
Kamusi records: 71,290
ISO 639(1) / (3): ca / cat
Links: Omniglot | Wikipedia | Glottolog | Ethnologue ($)
Sources: Initial 🔢 data from the Multilingual 👅👅👅 Central Repository, maintained by the University of the Basque Country, Department of Software, and the Technical University of Catalonia (UPC). Additional data from Kamusi participants.
💛😇 GOLD Angel: Nobody.
Pearls: Be the first Pearl for this language!
Join the Catalan Language Task Force to help plan development of robust resources for the language!

[This page is under permanent construction. We would like to make sure the page stays updated, and add new interesting links and information for our visitors. We invite you to help curate this page!!! Please contact us to volunteer to improve this infobox, or to translate the content.]

/info/cat

Our Homepage Design

The trend in 🕸 web design these days is to have sleek homepages with big 🖼 pictures and minimal information. This doesn't work for Kamusi. We have way too much going on that 👪 people need to be able to find out about - hundreds of languages, lots of software, dozens of partners, a plethora of sub-projects. Our design is intended to be a bit overwhelming, just like our scope. Using Masonry, we aim for a dynamic effect that melds the Zanzibari kikoi fabric background that has been a constant since our first website in 1995, a color palette that has evolved to match, abstract artistic concepts, and the shifting layout of sites like Pinterest.

Navigation: We recommend that you scroll through the homepage to get an overview of what is available to you at Kamusi, and to read items you come across that spark your interest. However, the homepage has far too much information for you to easily find what you need just by scrolling. We have 3 techniques for you to find information more precisely:
  1. Infoboxes at the top of the homepage will zap you around to articles, commentary, news, projects, theory, videos, and ways you can support Kamusi through donations and purchases.
  2. Themes. Boxes are roughly clustered into common topics. You can select a theme to move quickly to that section. Themes also have dedicated pages you can reach through our Glide-Through system.
  3. Glide-Through. The bottom of each Infobox has a link to a stand-alone page for that box. From any independent page, you can use the navigation boxes to hop around to the individual pages for all our content, while always having access to the dictionary search bar. This is the fastest and lightest way to glide through Kamusi, and the default option for visitors on mobile devices.
The user experience is modelled in some ways on contemporary airports. In many airports these days, all travellers are channelled through a glitzy shopping area between security and their gates. You are always free to ignore the items on display, but you cannot help but have a glimpse of the themes (chocolates, liquor, perfumes,...) and special items on offer. You can browse on your way through (stay on the homepage), or you can find your gate and then come back to the duty free when you are ready (glide-through). Of course, we hope that your visits to Kamusi will be a lot less jarring than a typical dash through an airport, and that we will provide you with a stimulating way to reach many new destinations!

/info/homepage

Language: Basque

Local name: Euskara
Most spoken in: Spain, France
👪🔊: 660,000
Kamusi records: 50,038
ISO 639(1)/(3): eu/eus
Links: Omniglot | Wikipedia | Glottolog | Ethnologue ($) | Basques Face the Future (BBC)
Sources: Initial 🔢 data from the Multilingual 👅👅👅 Central Repository, maintained by the University of the Basque Country, Department of Software, and the Technical University of Catalonia (UPC). Additional data from Kamusi participants.
💛😇 GOLD Angel: Nobody.
Pearls: Be the first Pearl for this language!
Join the Basque Language Task Force to help plan development of robust resources for the language!

[This page is under permanent construction. We would like to make sure the page stays updated, and add new interesting links and information for our visitors. We invite you to help curate this page!!! Please contact us to volunteer to improve this infobox, or to translate the content.]

/info/eus

Color Name Game 🎮

Cultures view colors differently, and express those views through language. For example, in the Mampruli language of Ghana, "zibi" is the range of hues from green to black. We can create a 🎮 game that captures color names. Where does green end and yellow begin, and is there a special term for chartreuse? We flash colors from across the spectrum, and our players have fun identifying them. In this way, a fashion designer in Milan will be able to confidently discuss shades of red with a weaver in Nepal. To help develop or fund this game, please contact us!
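To make the idea concrete, here is a minimal sketch of how answers from such a game might be tallied to find the most common name for each flashed color in each language. The function name, the `(language, hue, name)` response format, and the sample answers are assumptions made for illustration, not collected data or a specification of the game.

```python
from collections import Counter, defaultdict


def tally_color_names(responses):
    """Summarize player answers.

    responses: iterable of (language, hue_degrees, name) tuples.
    Returns {language: {hue: (most_common_name, share_of_votes)}}.
    """
    votes = defaultdict(lambda: defaultdict(Counter))
    for language, hue, name in responses:
        # Normalize spelling variants that differ only in case or spacing.
        votes[language][hue][name.strip().lower()] += 1

    summary = {}
    for language, by_hue in votes.items():
        summary[language] = {}
        for hue, counter in by_hue.items():
            name, count = counter.most_common(1)[0]
            summary[language][hue] = (name, count / sum(counter.values()))
    return summary


# Hypothetical answers from three English-speaking players shown the same two hues.
sample = [
    ("eng", 90, "Green"), ("eng", 90, "chartreuse"), ("eng", 90, "green"),
    ("eng", 60, "yellow"), ("eng", 60, "yellow"), ("eng", 60, "Yellow"),
]
print(tally_color_names(sample))
```

A low share of votes for the winning name at a given hue is itself informative in this sketch: it marks the boundary regions where speakers disagree about where one color ends and the next begins.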

/info/colors

GOLD Theme: Eastern Europe

Donate in Russian Roubles



/info/roubles

Language: Romanian

Local name: Română
Most spoken in: Romania, Moldova
👪🔊: 24 million
Kamusi records: 84,638
ISO 639(1) / (3): ro / ron
Links: Omniglot | Wikipedia | Glottolog | Ethnologue ($)
Sources: Initial 🔢 data from the Romanian 🔠🕸 WordNet, by the Institute for Artificial Intelligence, Romanian Academy, Bucharest. Additional data from Kamusi participants.
💛😇 GOLD Angel: Nobody.
Pearls: Be the first Pearl for this language!
Join the Romanian Language Task Force to help plan development of robust resources for the language!

[This page is under permanent construction. We would like to make sure the page stays updated, and add new interesting links and information for our visitors. We invite you to help curate this page!!! Please contact us to volunteer to improve this infobox, or to translate the content.]

/info/ron

Language: Bulgarian

Local name: български
Most spoken in: Bulgaria
👪🔊: 10 million
Kamusi records: 27,011
ISO 639(1)/(3): bg/bul

Links: Omniglot | Wikipedia | Glottolog | Ethnologue ($)

Sources: Initial data from the BulTreeBank WordNet, maintained by Kiril Simov and Petya Osenova. Additional data from Kamusi participants.
A more complete Bulgarian WordNet, BulNet 3.0 (Institute for Bulgarian Language (IBL), Bulgarian Academy of Sciences, Sofia, Bulgaria, maintained by Svetla Koeva), is available for purchase from ELRA, but unfortunately we have no budget to include it in Kamusi.
💛😇 GOLD Angel: Nobody.

Pearls: Be the first Pearl for this language!

Join the Bulgarian Language Task Force to help plan development of robust resources for the language!

[This page is under permanent construction. We would like to make sure the page stays updated, and add new interesting links and information for our visitors. We invite you to help curate this page!!! Please contact us to volunteer to improve this infobox, or to translate the content.]

/info/bul

University of Szeged

The FinUgRevita project, based in Szeged, Hungary, builds computational tools for endangered indigenous Finno-Ugric languages in Russia, including Mansi and Udmurt. If funding can be secured, these will be the first of Russia's many minority languages to be integrated within the Human Languages Project framework.

/info/szeged

Donate in Hungarian Forints



/info/hungary

HASRIL

The Research Institute for Linguistics of the Hungarian Academy of Sciences will join the Human Languages Project for Hungarian and larger issues in the production of 👅👅👅 multilingual 🔢 data that can serve both linguistics and language technology.

/info/hasril

Try a multilingual dictionary that's built around concepts

Is your automated translation a garbled string of words? It's probably because computers can't deal with homonyms.
8 March 2013
by Janet Fang

Oh homonyms. Automated translation services can provide the gist of a passage of text, but computers just can’t deal with words that are spelled the same but have different meanings.

Machines learn to translate by searching for correlations in texts that have been translated by humans. Now it's time to put humans back in the loop.

Continue to Full Article...

/info/zdnet

Language: Slovak

Local name: Slovenčina
Most spoken in: Slovakia
👪🔊: 5.6 million
Kamusi records: 44,029
ISO 639(1) / (3): sk / slk

Links: Omniglot | Wikipedia | Glottolog | Ethnologue ($)

Sources: Initial 🔢 data from the Slovak 🔠🕸 WordNet, by the Slovak National Corpus, Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences. Additional data from Kamusi participants.

💛😇 GOLD Angel: Nobody.

Pearls: Be the first Pearl for this language!

Join the Slovak Language Task Force to help plan development of robust resources for the language!

[This page is under permanent construction. We would like to make sure the page stays updated, and add new interesting links and information for our visitors. We invite you to help curate this page!!! Please contact us to volunteer to improve this infobox, or to translate the content.]

/info/slk

Donate in Czech Koruna



/info/czech

Language: Macedonian

Local name: македонски
Most spoken in: Macedonia
👪🔊: 3 million
Kamusi records: [data queued for upload]
ISO 639(1)/(3): mk/mkd
Links: Omniglot | Wikipedia | Glottolog | Ethnologue ($)
Sources: Initial data from the Macedonian WordNet, Ss. Cyril and Methodius University & Staffordshire University, maintained by Martin Saveski & Igor Trajkovski. Additional data from Kamusi participants.
💛😇 GOLD Angel: Nobody.
Pearls: Be the first Pearl for this language!
Join the Macedonian Language Task Force to help plan development of robust resources for the language!

[This page is under permanent construction. We would like to make sure the page stays updated, and add new interesting links and information for our visitors. We invite you to help curate this page!!! Please contact us to volunteer to improve this infobox, or to translate the content.]

/info/mkd

GOLD Theme: West Africa

CERDOTOLA

Centre International de Recherche et de Documentation sur les Traditions et les Langues Africaines will be a partner in the Human Languages Project for the languages of Cameroon.

/info/cerdotola

African Languages Technology Initiative

We have a long-standing partnership with Alt-i, a Nigerian SME, for producing localization and general 👅 linguistic resources for Nigerian languages, including Hausa, Igbo, and Yoruba.

/info/alt-i

Comparative African Word List

The Comparative African Word List contains 1700 concepts that are likely to occur in many languages. We have used 🐥📊 DUCKS to align relevant concepts to their specific senses in 🔠🕸 WordNet. The next step will be to hoover in 🔢 data that has been collected with CAWL. This will accurately launch several new languages in the 👅👅👅 multilingual system at a basic level.
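The import step described above might look something like the following sketch: once each CAWL concept number has been aligned to a sense, a word list collected with CAWL for any language can be attached to those senses directly. The function name, the record layout, the sense identifiers, the language code, and the sample terms are all hypothetical placeholders, not actual CAWL numbers, WordNet identifiers, or Kamusi data structures.

```python
def import_word_list(alignment, word_list, language):
    """Attach terms collected with a concept list to already-aligned senses.

    alignment: {concept_number: sense_id}, built once (for example through DUCKS).
    word_list: {concept_number: term}, collected for one language.
    Returns (entries, unaligned): entries are records ready to load, and
    unaligned lists the concept numbers that still lack a sense alignment.
    """
    entries, unaligned = [], []
    for number, term in sorted(word_list.items()):
        sense_id = alignment.get(number)
        if sense_id is None:
            unaligned.append(number)   # hold back rather than guess a sense
            continue
        entries.append({"language": language, "term": term, "sense": sense_id})
    return entries, unaligned


# Placeholder identifiers and terms, for illustration only.
alignment = {1: "sense-0001", 2: "sense-0002"}
collected = {1: "term_a", 2: "term_b", 3: "term_c"}
entries, unaligned = import_word_list(alignment, collected, "xxx")
print(len(entries), unaligned)
```

Because every imported term lands on a sense that is already linked to other languages, the alignment work is done once for the concept list and then reused for each new language, which is the reason a list of this kind can launch several languages at a basic level.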

/info/cawl

Académie Malienne des Langues

AMALAN will join the Human Languages Project to work on the Bambara, Songhay, and Fulani languages of Mali.

/info/amalan

Language: Bambara

Local name: Bamanankan
Most spoken in: Mali, Burkina Faso
👪🔊: 5 million first language, 10 million second language
Kamusi records: [data queued for upload and 🐥📊 DUCKS processing]
ISO 639(1)/(3): bm/bam
Links: Omniglot | Wikipedia | Glottolog | Ethnologue ($) | Bibliothèque | VOA | RFI
Sources: Initial data from Dictionnaire Bamadaba, Bailleul, Charles & Davydov, Artem & Erman, Anna & Maslinsky, Kirill & Méric, Jean Jacques & Vydrin, Valentin. Bamadaba : Dictionnaire électronique bambara-français, avec un index français-bambara. 2011–2014. Additional data from Kamusi participants.
💛😇 GOLD Angel: Nobody.
Pearls: Be the first Pearl for this language!
Join the Bambara Language Task Force to help plan development of robust resources for the language!

[This page is under permanent construction. We would like to make sure the page stays updated, and add new interesting links and information for our visitors. We invite you to help curate this page!!! Please contact us to volunteer to improve this infobox, or to translate the content.]

/info/bam

Pulaagu.com

We are seeking resources to work with the Pulaagu community to produce molecular-quality resources for the Pulaar/Fulfulde language continuum.

/info/pulaagu

What is a language?

It is a convenient fiction to say there are about 7000 living languages in the world, because this is the number of languages that have been singled out by people who know a lot about these things at the Ethnologue. However, this number elides serious questions. For example, Kirundi, spoken in Burundi, and Kinyarwanda, spoken in neighboring Rwanda, are considered two different languages even though you would need to see a passport, not a dictionary, to know who came from which side of the border. On the other hand, Fula is a blanket term that covers a continuum of varieties, problematically called dialects, 9 of which get their own official ISO 639-3 codes.

We could continue at length, but Stephen Anderson already wrote an excellent article on the subject, explaining many important considerations in the notion of "language". We'll quote the most relevant parts, and urge you to read the rest.
The late Max Weinreich used to say that “A language is a dialect with an army and a navy.” He was talking about the status of Yiddish, long considered a “dialect” because it was not identified with any politically significant entity. The distinction is still often implicit in talk about European “languages” vs. African “dialects.” What counts as a language rather than a “mere” dialect typically involves issues of statehood, economics, literary traditions and writing systems, and other trappings of power, authority and culture — with purely linguistic considerations playing a less significant role.

For instance, Chinese “dialects” such as Cantonese, Hakka, Shanghainese, etc. are just as different from one another (and from the dominant Mandarin) as Romance languages such as French, Spanish, Italian and Romanian. They are not mutually intelligible, but their status derives from their association with a single nation and a shared writing system, as well as from explicit government policy.

In contrast, Hindi and Urdu are essentially the same system (referred to in earlier times as “Hindustani”), but associated with different countries (India and Pakistan), different writing systems, and different religious orientations. Although varieties in use in India and Pakistan by well-educated speakers are somewhat more distinct than the local vernaculars, the differences are still minimal—far less significant than those separating Mandarin from Cantonese, for example.

For an extreme example of this phenomenon, consider the language formerly known as Serbo-Croatian, spoken over much of the territory of the former Yugoslavia and generally considered a single language with different local dialects and writing systems. Within this territory, Serbs (who are largely Orthodox) use a Cyrillic alphabet, while Croats (largely Roman Catholic) use the Latin alphabet. Within a period of only a few years after the breakup of Yugoslavia as a political entity, at least three new languages (Serbian, Croatian and Bosnian) had emerged, although the actual linguistic facts had not changed a bit.

PDF: How many languages are there in the world? by Stephen R. Anderson
/info/what_is_a_language

Language: Fula

Local name: Fula, Fulani, Fulfulde, Pulaar, Pular
Most spoken in: Mauritania, Senegal, Mali, Guinea, Burkina Faso, Niger, Nigeria, Cameroon, Gambia, Chad, Sierra Leone, Benin, Guinea-Bissau, Sudan, Central African Republic, Côte d'Ivoire, Ghana, Togo, Liberia and Gabon.
👪🔊: Estimates from 13 to 24 million
Kamusi records: [data queued for upload and 🐥📊 DUCKS processing]
ISO 639(1)/(3): ff/ful
Links: Omniglot | Wikipedia | Glottolog | Ethnologue ($)
Adlam alphabet

Sources: Initial data from Osborn, Donald Zhang, David J. Dwyer, and Joseph I. Donohoe, Jr. A Fulfulde (Maasina)--English--French lexicon : a root based compilation drawn from extant sources followed by English--Fulfulde and French--Fulfulde listings. East Lansing, MI: Michigan State University Press, 1993. Additional data from Kamusi participants.
💛😇 GOLD Angel: Nobody.
Pearls: Be the first Pearl for this language!
Join the Fula Language Task Force to help plan development of robust resources for the language!

[This page is under permanent construction. We would like to make sure the page stays updated, and add new interesting links and information for our visitors. We invite you to help curate this page!!! Please contact us to volunteer to improve this infobox, or to translate the content.]

/info/ful

Kasahorow

We have worked hand-in-glove since 2005 with Kasahorow, an independent language technology organization based in Ghana, on many projects for African language development.

/info/kasahorow

Llacan

Langage, Langues et Cultures d’Afrique Noire at CNRS/INaLCO (Le Centre national de la recherche scientifique/Institut national des langues et civilisations orientales) will lead development in the Human Languages Project for several West African languages.

/info/llacan

Language: Songhay

Local name: Songhay
Most spoken in: Mali, Niger
👪🔊: around 3 million
Kamusi records: around 7000 to be aligned via 🐥📊 DUCKS
ISO 639(2 & 5): son ("son" is not in 639(3))
Links: Wikipedia | Glottolog | Ethnologue ($)
Sources: We are working closely with Songhay.org to bring their 🔢 data into Kamusi.
💛😇 GOLD Angel: Nobody.
Pearls: Be the first Pearl for this language!
Join the Songhay Language Task Force to help plan development of robust resources for the language!

[This page is under permanent construction. We would like to make sure the page stays updated, and add new interesting links and information for our visitors. We invite you to help curate this page!!! Please contact us to volunteer to improve this infobox, or to translate the content.]

/info/son

Songhay.org

Songhay.org will integrate and expand the lexicons they have developed for the Songhay language spoken by millions in Mali and Niger, when funding for their work can be activated.

/info/songhay

GOLD Theme: Education

Educational Outcomes

For every 1 microg/dL increase in blood lead concentration, there was a 0.7-point decrement in mean arithmetic scores, an approximately 1-point decrement in mean reading scores, a 0.1-point decrement in mean scores on a measure of nonverbal reasoning, and a 0.5-point decrement in mean scores on a measure of short-term memory. (Lanphear, et al., 2000)
Exposure to lead causes lower school performance, and a lifetime of follow-on consequences for job opportunities and earning potential. Removing lead from the environment increases test scores significantly in both reading and math.

Education in a second language causes lower school performance, and a lifetime of follow-on consequences for job opportunities and earning potential. Teaching in a familiar language increases test scores significantly in both reading and math.

Study after study shows that children learn much better when taught in their own language. This is true at all levels. Whether the subject is reading or science, students are much faster at grasping concepts in a language they already understand. Foreign languages, which include the colonial souvenir languages used in government, are learned best as subjects for study; you take classes to learn English or French, but you don't use those languages as your entryway to business or chemistry.

Yet, in much of Africa, classes are conducted mostly in the foreign languages starting as early as the first day of primary school, and usually by the beginning of secondary school. Quite often, neither the teacher nor the student is proficient in the language of instruction. The result is that students don't understand what the teacher is saying, don't feel comfortable asking or answering ❓ questions, and do not move up to higher levels. (For much greater analysis, we recommend reading Optimising Learning, Education and Publishing in Africa: The Language Factor.)

Imagine what it would be like to walk into a school where you were expected to learn to read in a language you don't know. You don't know the words, you don't know the rules, you don't know the pronunciations. The teacher can drill you on the alphabet, but the words you assemble hold no meaning for you. Compare that to the 📽 video below, in which Nicole, who lives in the French part of Switzerland and speaks French among her native languages, can master reading the language - the 📃 text that you see her reading for the first time reveals meanings with which she has become familiar throughout her previous years of acquiring the language orally. As she continues through school, language will present no obstacles to her efforts to learn biology, law, or economics - unless she were forced to study all those subjects in German, instead of learning German as one of her courses.

As Fredua-Kwarteng and Ahia point out in examining why Ghanaian students scored 44th out of 45 countries in a grade eight mathematics test, "countries that top-performed in the mathematics test--- Taiwan, Malaysia, Latvia, Russia- used their own language to teach and learn mathematics," while Ghanaian students took the test in their non-native English. The effect of schooling in a foreign language can be clearly seen in this video that secondary students in Zanzibar made to show why the language of their education is causing them to fail school.

One of the barriers to mother tongue education is a lack of learning resources for most non-favored languages. Books and technology tools are rare or non-existent. Neither dictionaries that can define the concepts of subjects like physics or history, nor encyclopedias that can provide more extensive explanations, are available to most African students in any meaningful way. Nor, for the most part, has subject-specific terminology been developed to enable communication about school topics in students' languages. Yet, these resources have been produced for lucrative languages, and are readily available for children in wealthy countries to master their studies. Producing similar resources for hundreds of millions of children who do not speak privileged languages is a matter of applying the time and money to use our current and future tools - reducing the 👅🚪 linguistic barriers to education, 🆓 free to students, on 📱 devices accessible today to most families in Africa.

/info/educational_outcomes

FEM

The Foundation for Multidimensional Education, in Cartagena, Colombia, is a non-profit organization that will work with the Human Languages Project on educational resources for the indigenous languages of South America.

/info/fem

Return on Investment

Because of our non-profit mission, we cannot offer investors an ROI that can be quantified in monetary terms. The payback comes in the difference your investment can make in 👪 people's lives.

Teens Make Film In Broken English To Explain Why They'll Fail English

Watch 👀 this film 📽, and read the NPR article, for a perspective from students in Zanzibar about how inadequate language resources contribute to failing in school and lost opportunities for good jobs. We are working to produce terminology references for many topics that will be 🆓 free for students 🌍 worldwide to understand their school subjects in their own language.

/info/roi

Team Translation

People who try to learn a language online are often defeated by loneliness, or lack of material at higher levels. Our controlled vocabulary applications, starting especially with CultureUp!, will bring 👪 people together to learn a language by collaborating on interesting translations. In the early version, museums and other institutions will upload the 📃 texts from their exhibits. This text will be available for groups from any language, though the first step will ordinarily be producing good 📃 texts in pivot languages such as English. The system will be built on wiki-like collaborative editing, with each team member making improvements until everyone agrees the translation is perfect. There should also be a chat mechanism for team members to discuss the special terms and topics they encounter, including a way to contact the curator to clarify difficult items. The next application is Kamusi Help!, followed by other public interest services, such as school text books or non-profit websites. To help develop or fund Team Translation, please contact us!

/info/team_translate

Tuxpaint

Free 🆓, award-winning, multiplatform drawing program for children, available in 129 languages. We are proud that Kamusi members have contributed translations for several languages, including Songhay and Swahili. The original Swahili version was released on 23 December, 2004.

/info/tuxpaint

A multilingual dictionary accessible to all

IC | Computer and Communication Sciences
27 January 2014
by Alexandra Walther

A multilingual dictionary, benefitting from the contributions of Internet users, available for free on the Web in all of the world's languages: this is the aim of Kamusi. However, developing such a tool is a veritable technological challenge, because each language is said to have an average of at least 100,000 words that often have several meanings. Given that there are currently around 7,000 living languages around the world, finding equivalencies for all of the words in all of the languages amounts to searching for needles in haystacks piled as far as the eye can see. How to handle such a challenge? By having an anthropologist slash language enthusiast work with a computer scientist who is passionate about Big Data.

Continue to Full Article...

/info/kamusi_epfl

EDUC

EDUC (Ensemble developpons un usage des tic pour les communautes locales) is an NGO in Congo that develops information technology resources for local languages. Kamusi and EDUC have worked together on several projects for African language ICT.

/info/educ

Native Teaching Aids

Native Teaching Aids is a Montana-based small enterprise that develops, creates, and produces educationally entertaining Indigenous language material that incorporates culture and history for schools, communities and families.

/info/native_teaching

GOLD Theme: South East Asia

Donate in Philippine Pesos



/info/philipesos

VIoLE

Together with the Vietnam Institute of Lexicography and Encyclopedia, we have gathered a huge 🔢 dataset of about 190,000 Vietnamese terms. We are now seeking sponsorship to put these entries online and align them with the other languages in the Kamusi system.

/info/viole

Language: Vietnamese

Local name: tiếng Việt
Most spoken in: Vietnam
👪🔊: 82 million
Kamusi records: about 190,000 [queued for uploading]
ISO 639(1)/(3): vi/vie
Links: Omniglot | Wikipedia | Glottolog | Ethnologue ($)
Sources: Initial 🔢 data arranged through Vietnam Institute of Lexicography and Encyclopedia, by agreement. Additional data from Kamusi participants.
💛😇 GOLD Angel: Nobody.
Pearls: Be the first Pearl for this language!
Join the Vietnamese Language Task Force to help plan development of robust resources for the language!

[This page is under permanent construction. We would like to make sure the page stays updated, and add new interesting links and information for our visitors. We invite you to help curate this page!!! Please contact us to volunteer to improve this infobox, or to translate the content.]

/info/vie

Donate in Singapore Dollars



/info/singapore

Insulated Travel Mug

Hot stuff: Drink to your language!

/info/travel_mug

Donate in Thai Baht



/info/baht

Language: Thai

Local name: ภาษาไทย (Phasa Thai)
Most spoken in: Thailand
👪🔊: 65 million
Kamusi records: [data in upload queue]
ISO 639(1)/(3): th/tha
Links: Omniglot | Wikipedia | Glottolog | Ethnologue ($)
Sources: Initial 🔢 data from the Asian 🔠🕸 WordNet, maintained by the National Electronics and Computer Technology Center (NECTEC) – Thai Computational Linguistics Laboratory (TCL), NICT, Kyoto, Japan. Additional data from Kamusi participants.
💛😇 GOLD Angel: Nobody.
Pearls: Be the first Pearl for this language!
Join the Thai Language Task Force to help plan development of robust resources for the language!

[This page is under permanent construction. We would like to make sure the page stays updated, and add new interesting links and information for our visitors. We invite you to help curate this page!!! Please contact us to volunteer to improve this infobox, or to translate the content.]

/info/tha

GOLD Theme: User Information

Kamusi Project USA

Kamusi Project USA (KPUSA) is an independent 501(c)(3) non-profit, non-governmental organization registered in the State of Delaware, EIN: 26-3290364. KPUSA is responsible for all activities associated with the Kamusi Project in the United States. Activities include US-based fundraising and expenditures, and all project development that occurs on US soil or through contracts that originate in the US.

/info/kpusa

Funders


Currently, our only institutional funding is a limited amount of research support from EPFL.

In the past, we have received grant support at various periods from:

* Consortium for Language Teaching and Learning
* International Development Research Centre of Canada
* Negaunee Foundation
* Open Society Institute for Southern Africa
* Swiss State Secretariat for Education, Research, and Innovation
* United States Department of Education
* United States National Endowment for the Humanities

/info/funders

EPFL

Our academic home since 2013 has been the Distributed Information Systems Laboratory (LSIR) at the École Polytechnique Fédérale de Lausanne (EPFL), Switzerland, directed by Karl Aberer. LSIR support has been fundamental to our advances in the informatics aspects of 👅👅👅 multilingual lexicography, including our new graph 🔢 database structure, crowdsourcing, 🎮💾 gamification, and human-computer interaction with our raft of new mobile and 🕸 web applications.

/info/epfl

Privacy

We value your privacy as much as we cherish our own. We promise never to spam you, or sell or share personal details you wish to remain private. However, to maximize what you gain from Kamusi, we need to keep track of some user information. We may use it to contact you when:

  1. we have an individual ❓ question for you, for example to clarify a submission you've made
  2. we have information that affects all our visitors, such as resumption of service after a site outage
  3. you have subscribed to a newsletter or other service, such as notifications of when your submissions are approved.

Terms of Service

We make no guarantees about the 🔢 data we provide. By using Kamusi, you agree that we cannot be held liable for any damages resulting from your use of our data or services. Furthermore, you agree that any submissions you make to the project are 🆓 freely shared, in good faith, representing the best of your 🎓 knowledge, and with no expectation of ownership or financial reward.

/info/privacy

Our Cookies Policy

Of course we use cookies. A website without cookies is like a hotel room without sheets - you could stay there, but you would not enjoy the experience nearly as much. We comply with all European Union regulations about cookies. The purpose of using cookies is to remember your actions and preferences (such as login, language, font size and other display preferences) over a period of time, so you don’t have to keep re-entering them whenever you come back to the site or browse from one page to another. We do not sell or share any personal information that is stored in cookies.

You can control and/or delete cookies as you wish – for details, see aboutcookies.org. You can delete all cookies that are already on your computer and you can set most browsers to prevent them from being placed. If you do this, however, you may have to manually adjust some preferences every time you visit a site and some services and functionalities may not work.

The types of cookies that we use at Kamusi are "strictly necessary in order for the provider of an information society service explicitly required by the user to provide that service". We are therefore required to provide this notice for EU residents. By continuing to use our services, you implicitly agree that, like having sheets on your bed in a hotel room, our cookies are cool with you.

/info/cookies

Telamenta

Our programming is led by Greg McKeen at Telamenta, an independent coding house in South Africa. This SME has gone far beyond all expectations, donating thousands of hours to keep Kamusi alive after funding ran out for the work we contracted to them. They have already written the back end for the next generation of Kamusi, and it is fantastic. The new system enables us to link every word in every language in extremely rich and nuanced ways. However, it is financially unsustainable as a boutique project, so it must remain offline until we can find major support to fund implementation.

/info/telamenta

Contact Us

We welcome your comments and questions, and will try to respond quickly. You can contact us using this form. Make sure to use a real email address so we can send you a real reply!

Kamusi is managed by Dr. Martin Benjamin, Executive Director

/info/contact

© Copyright ©

The Kamusi Project dictionaries and the Kamusi Project databases are intellectual property protected by international copyright law, ©2007 through ©2016, under the joint ownership of Kamusi Project International and Kamusi Project USA. Further explanation may be found on our © Copyright page.

Our 🔢 data comes from many sources, some of which have different licenses than our own. We strive to credit our sources in the contexts where their data is used, and request that you honor their terms when you make further use of their contributions. External 🔢 data is subject to modification at any time by project participants, including additions, revisions, or deletions, without further indication, so please consult the original source if you want their original 🔢 data.

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

/info/copyright

Kamusi Project International

Kamusi Project International (KPI) is an independent non-profit, tax-exempt, non-governmental organization registered in the Canton of Geneva, Switzerland. KPI is responsible for all activities associated with the Kamusi Project worldwide 🌍, except for the United States. Activities include international fundraising and expenditures, and project development that occurs or originates in Africa, Europe, Asia, Australia, or any country in the Americas other than the United States.

/info/kpi