One path toward the Kamusi Project's goal of documenting every word in every language is to produce dictionary entries for a large number of concepts in as many languages as possible. For example, all languages have a term for "mother", so we can ask participants to produce that concept in each language, and end up with a global set of parallel entries for the concept of a woman who has given birth to a baby.
How do we figure out which terms to select, though? "Mother" is obvious, but is "subway" an important concept across languages, even for places that do not have underground trains? What about "momma", or "pathos", or "autopilot", or "cheetah"? And, once we figure out which terms to focus on, how do we prioritize so that we build entries for the most useful terms first? Which do we treat first - jazz or opera, businessman or pastor, infrastructure or mud?
The answer I've devised combines science, art, and a pinch of magic dust, to produce a ranked priority list of nearly 20,000 English terms that can serve as the basis for parallel dictionaries across languages. I combined eleven different English wordlists that were produced with different methods, for different purposes. I then totalled the number of wordlists that each term appears in; almost 5,000 terms appear in two or more lists, so I'm proposing to make those terms at least a little more of a priority than the 15,000 terms that only appear on a single list. Within each cluster of terms that share the same number of list appearances, I've then ordered the terms by their rank within a very large corpus of English documents; while "and" and "rain" both appear on ten lists, "and" is the third most frequent word in the English corpus, whereas "rain" comes in at 1559, so we'll make sure to tackle "and" before "rain".
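As a rough illustration, the ordering just described could be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the code actually used to build the Kamusi list: the function name `rank_terms`, the list names, and the toy data are all hypothetical, and the corpus rank shown for "mother" is invented. Only the ranks for "and" (3) and "rain" (1559), and the arbitrary rank of 6500 for terms missing from the corpus list, come from this page.

```python
from collections import Counter

def rank_terms(wordlists, corpus_rank, default_rank=6500):
    """Order terms by how many wordlists contain them (most first),
    then by corpus rank (smallest rank number first)."""
    counts = Counter()
    for terms in wordlists.values():
        # A term counts once per list, even if a list repeats it.
        counts.update(set(terms))
    return sorted(
        counts,
        # The term itself is the final key, so ties resolve alphabetically.
        key=lambda t: (-counts[t], corpus_rank.get(t, default_rank), t),
    )

# Toy data, invented for illustration only.
wordlists = {
    "list_a": ["and", "rain", "mother"],
    "list_b": ["and", "rain"],
    "list_c": ["rain", "and", "cassava"],
}
corpus_rank = {"and": 3, "rain": 1559, "mother": 300}  # "mother" rank is made up

priority = rank_terms(wordlists, corpus_rank)
print(priority)  # ['and', 'rain', 'mother', 'cassava']
```

In this toy run, "and" and "rain" each appear on three lists and so come first, with "and" ahead because of its smaller corpus rank; "cassava" has no corpus rank, so it falls back to the default.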
Why go to all of this trouble? Why not just use the ranked list of 20,000 terms derived from the English corpus? The answers to these questions are both practical and philosophical.
From a practical perspective, the words that appear in a set of English documents, even a very large corpus of tens of millions of terms, only reflect the concepts that people are writing about. When producing a dictionary, on the other hand, we need to include topics that people are thinking about, even if they don't rank high in the written record. Among the terms that do not appear in the corpus list, for example, are dollar, Africa, Europe, and percent - all of which should appear in even a basic dictionary.
From a philosophical perspective, there is no reason why the frequent use of a term in English should indicate the importance of that concept for speakers of other languages, in cultures not formed by settlers from the British Isles. The number one term in the English language, "the", does not even exist in many languages - African Bantu languages like Swahili, for example, do not have any definite articles, although they have dozens of terms to narrow down the concept of "this", "that", "these", and "those". "Baseball" appears in the corpus as term 1380, but what languages would pay it any attention outside of North America, the Caribbean, or Japan? On the other hand, coffee appears less often than baseball in the corpus (#1395), but is central to cultural life in many countries, is a cornerstone of many national economies, and appears on seven wordlists. And many concepts that are vital to people around the world, such as the food called cassava or manioc, do not appear at all in the English corpus list. Basing a dictionary on the frequency by which words appear in a set of documents from the English language privileges some ideas and excludes others in a way that has more than a slight whiff of linguistic imperialism.
Yet, if English frequency is not a sufficient basis for generating a starting list of concepts for a multilingual dictionary, how should such a list be generated? We cannot escape the fact that English is uniquely pivotal to global communication. Nor would I want to: the many tools that have been developed for English can serve as building blocks for other languages that have not benefitted from the same financial and scholarly resources. The ranked priority list I've crafted for Kamusi accepts that English will be the launching point for building a cross-cutting dictionary. In the long run, the "living" aspect of Kamusi provides open-ended space for languages to bring in their own additional concepts and priorities at any point, going far beyond the initial terms we propose for comparative purposes. More immediately, I deliberately selected wordlists that expand the range of concepts covered, among them two lists that arise from African languages, so that the initial terms in KamusiGOLD span a broad range of cultural and topical concerns.
The wordlists from which Kamusi has developed its priorities for the first 20,000 terms for each language are:
AWL - the Academic Word List contains 570 terms that appear frequently in a broad range of academic texts, but do not occur frequently enough to be included in the GSL. The intent of including AWL is to broaden our collection of core concepts to include those that will be most needed by dictionary users, including students and scholars in many disciplines.
Basic English - this list of over 2000 terms was compiled by Charles Kay Ogden for the purposes of teaching English as a second language. The list dates back to 1930, and has been subject to much justified criticism, but offers a valuable set of opinions about priorities when blended with other approaches. The terms on the list are those Ogden deemed to represent "what any learner should know."
CAWL - the SIL Comparative African Word List. SIL compiled 1700 terms relating to 12 main semantic domains. These terms are selected as elemental concepts across Africa. In addition to the focus on topical breadth, CAWL offers the potential to quickly import parallel data from many of the African languages documented by linguists working with SIL.
Clear English Choices has a list of 2126 words based on a frequency analysis of their appearance in forty American newspapers and popular magazines in the 1990s. The contemporary nature of the list provides a counterpoint to some of the lists that date to the early twentieth century.
DWL - the Dolch Word List contains 315 terms selected in 1936 by Edward Dolch to represent core terms that children would need when learning to read. The words have been selected from children's literature, so they offer a counterpoint to the academic terms in AWL.
GSL - the General Service List contains roughly 2000 words selected from an English corpus to represent the most frequent terms, with human revision based on semantics and morphology. The premise of GSL is that the terms on the list provide the basis for understanding as much as 95% of spoken English and 85% of written texts.
The Moe list and Corpus of Contemporary American English frequency list. Lexicographer Ron Moe has graciously provided the Kamusi Project with access to his work-in-progress analysis of the 20,000 most frequent English terms in the Corpus of Contemporary American English. The frequency rankings are available from COCA and are used with permission. The first 5000 terms are ranked in raw order of frequency, while the remaining terms are clustered in groups of 1000; in fact the difference in frequency between nearby words deep in the list cannot be seen as statistically significant. Moe has classified each term as a member of one or more semantic domains, which we hope to use for future development of the data. The COCA list disambiguates words by part of speech, providing different ranking for verbs, nouns, and other types of words that share the same spelling; for the Kamusi list, we have kept only the lowest ranking, and will treat all speech types concurrently for the same spelling. Terms that did not appear on the COCA list but did appear on at least one other list were assigned an arbitrary ranking of 6500, so that they would be treated within the top 10,000.
The Reading Teachers' Book of Lists includes a list that it claims represents the top 1000 English terms, ranked by frequency. Including this list along with the other corpus frequency lists might give too much weight to written English, but only at the top end, where, arguably, more universal concepts tend to lie.
Sereer-English Dictionary - compiled by volunteers for the US Peace Corps, this dictionary used the Sereer language as the starting point for its selection of words. Many of the 2000 terms on the list are therefore African priority concepts, rather than English concerns that have been rendered in Sereer. Some of the terms are too specific to the local culture to merit inclusion on a global priority list (not very many people, even elsewhere in Africa, need numerous terms relating to mangrove swamps), but others add important missing concepts to the base English terms we harvested from other sources. In the future, I would like to develop a cross-cultural priority list that culls many similar dictionaries.
Swadesh - the Swadesh list contains 200 terms that represent concepts deemed to exist in the largest number of languages. Because Swadesh lists have been prepared for many dozens of languages, its inclusion will promote rapid seeding of new languages within Kamusi.
VOA Special English - the Voice of America has a daily broadcast service called "Special English" that uses short sentences and a core vocabulary of 1500 terms. If the words on the list are adequate to deliver the news to a global audience every day, then they constitute a useful indicator of English priority concepts. In addition, the VOA list comes with open source definitions that we may choose to use for particular concepts.
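The two adjustments made to the COCA rankings in the list above (collapsing entries that share a spelling, and giving an arbitrary rank to terms that are absent from COCA) could be sketched as follows. This is a hedged illustration only: the function names and toy data are hypothetical, the part-of-speech labels and the rank of 4000 are invented, and the sketch assumes that "lowest ranking" means the smallest rank number, that is, the most frequent reading of a spelling.

```python
def collapse_by_spelling(pos_ranked):
    """Keep one rank per spelling: the smallest rank number among
    all part-of-speech entries that share that spelling (an assumption
    about what "lowest ranking" means)."""
    best = {}
    for (word, _pos), rank in pos_ranked.items():
        if word not in best or rank < best[word]:
            best[word] = rank
    return best

def with_default(terms, ranks, default_rank=6500):
    """Give every term a rank, using the arbitrary default for terms
    that the corpus list does not contain."""
    return {term: ranks.get(term, default_rank) for term in terms}

# Toy data, invented for illustration only.
pos_ranked = {
    ("rain", "noun"): 1559,
    ("rain", "verb"): 4000,  # made-up rank
    ("and", "conjunction"): 3,
}

collapsed = collapse_by_spelling(pos_ranked)
ranked = with_default(["and", "rain", "cassava"], collapsed)
print(ranked)  # {'and': 3, 'rain': 1559, 'cassava': 6500}
```

A default of 6500 places otherwise unranked terms inside the top 10,000, which matches the stated intent of the arbitrary ranking.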
The KamusiGOLD priority list compiled from all of these other lists is far from a definitive set of concepts that ought to be in a dictionary, but it does provide us with a good starting point. Over time, I expect to eliminate many of the words on the list, either because they are irrelevant as basic terms for a global dictionary, or because they are inflected forms of words that already appear ("did" is a form of "do"). We will also work in other lists that already have available multilingual data, such as the list of country and language names from CLDR that is available in more than 300 languages. I also plan to build additional lists that will either supplement or supplant the list under discussion. As Kamusi grows, we will be able to flag terms that are indigenous to a particular language - terms like "kanga" that have a complete entry in their language, but can only be given "explanatory translations" in English; these concepts can be collated into a graded compendium of non-English concepts to be cross-defined in other languages. We will also be keeping anonymous search records, so we will have millions of data points to show which terms people are searching for; such data will enable us to provide dictionary users with the concepts about which they are most curious.
For the moment, though, the priority list is a solid, rational starting point for documenting about 20,000 concepts across languages. It is not perfect, but it gets at a range of concepts that dictionary users will be searching for, with a realistic amount of work to put before our contributing partners for each language.