
Paldo Planning Meeting Chat Transcript

Archived Page

This is a page from the Kamusi archives. The information below may be out of date, and the links may no longer be valid. Please visit for current information. If you know of links or information on this page that can be updated, please let us know.

05:10:51 Martin Benjamin: Welcome to the PALDO planning session
05:18:22 Martin Benjamin: paa kwesi is welcoming us
05:18:28 Martin Benjamin: we are about to do the agenda
05:22:37 Martin Benjamin: introductions: paa kwesi from kasahorow, currently living in london
05:23:47 Martin Benjamin: henry addo from kasahorow, living here in accra
05:26:37 Martin Benjamin: arthur buliva is with us from nairobi
05:26:44 Martin Benjamin: sampson is with us from nigeria
05:26:47 henryaddo: Paa Kwesi introducing what PALDO and kasahorow are about to a journalist
05:27:23 Martin Benjamin: sylvanus nana kumi from the Daily Guide here in Ghana
05:28:39 henryaddo: Martin explaining what kamusiproject is all about.
05:29:29 mark: Ayekoo! Following the developments with interest.
05:30:17 henryaddo: Welcome Mark
05:32:06 henryaddo: this link explains all the edit engines available
05:32:26 henryaddo: Arthur giving his experience about how he got into the kamusiproject
05:32:55 henryaddo: A google search for a translation got him on to the project
05:36:42 henryaddo: You can follow PALDO's webcam session here
05:42:24 henryaddo: Languages proposed for PALDO are Kiswahili, Kinyarwanda, Lingala, Tswana, Akan, Wolof, Somali, Hausa/Yoruba, Arabic, English, French, Berber
05:43:54 henryaddo: six languages this year and six languages next year.
05:45:04 henryaddo: The goal is by April 1 next year we should have 2500 terms for Information Technology.
05:45:50 mark: what does "hausa/yoruba" mean? one of the two? both?
05:46:04 A12n: Jam waali! Don Osborn here - just got through the login issues.
05:46:38 Martin Benjamin: paa kwesi is explaining the PAL project
05:46:54 henryaddo: Paa Kwesi taking over from Martin
05:46:58 mark: Hi Don!
05:46:59 Martin Benjamin: welcome, Don, perhaps you could add a few words on PAL into the chat
05:47:13 Martin Benjamin: thanks for getting up early in Washington, Don
05:48:02 A12n: No problem. I'm on a Yahoo IM with Andrew Cunningham who asked how the languages were selected - maybe you're discussing that now
05:48:06 henryaddo: Mark, it means one of them would be supported; deliberation is ongoing as to which to take, Hausa or Yoruba
05:48:49 Muiris: I am here from Ohio, I will have to leave early though
05:49:17 henryaddo: Paa Kwesi now talking about kasahorow.
05:49:26 henryaddo: Welcome Muiris
05:49:37 A12n: Hi Mark, good ques. re Hausa & Yoruba. Thx Henry for the clarification.
05:50:09 andrewc: Yoruba would be more interesting and challenging from the internationalization perspective
05:50:58 mark: Andrew: why is that? Orthography?
05:51:30 Martin Benjamin: mark, muiris, andrewc, can you please introduce yourselves?
05:51:32 mark: Looking at no. of speakers and geographical spread. Hausa would trump Yoruba
05:51:50 Martin Benjamin: we are currently looking at
05:52:02 A12n: With Berber (Tamazight, Kabyle, Tamasheq, et al) there are questions both about variety & script
05:52:06 henryaddo: Paa Kwesi talking about some of the tools kasahorow has built
05:52:10 andrewc: Need to build in unicode normalization throughout, esp when people are searching, so much variation in character output via keyboards
05:52:19 Martin Benjamin: the Yoruba vs. Hausa issue is largely a question of availability of partners, from my perspective
05:52:54 andrewc: Also from user end, have font rendering/text layout issues
05:53:29 Martin Benjamin: the kasahorow dictionaries were intended as a demonstration, to show what can be done
05:54:08 Martin Benjamin: one of the tasks for PALDO will be to integrate the work already done on the kasahorow dictionaries, merge it with the Kamusi platform
05:54:26 mark: Martin Benjamin: I am Mark Dingemanse, MA (African languages & linguistics, Leiden), currently PhD at the Max Planck Institute for Psycholinguistics, Nijmegen. Working on Siwu (Akpafu-Lolobi), Volta Region, Ghana. Interested in i18n and in levelling the Digital Divide.
05:55:13 Martin Benjamin: yes, andrew, unicode issues are paramount
05:56:17 andrewc: at back end, public UI also has web i18n issues, esp bidi support, and various other things
05:56:46 A12n: Per intros, I'm Don Osborn, of Bisharat. I led the PanAfrican Localisation project (PAL) which is being followed by the PanAfrican Localisation Network (as of now)
05:56:54 Martin Benjamin: we've already done a lot of work on Yoruba orthographic issues, for the Edeyede Project (currently offline). That work needs to be updated, however
05:57:36 Martin Benjamin: we are working with the Yoruba community on a proposal to reactivate and merge the Yoruba dictionary with PALDO. That proposal is in draft form.

05:58:18 A12n: Scripts/orthographies & support, I've suggested elsewhere that Latin script has 4 categories - if you're seeking to move beyond demo to prove the concept, it would be good to cover those
05:59:03 andrewc: I'm Andrew Cunningham, State Library of Victoria, Melbourne, Australia.
06:00:43 arthurbuliva: why is the canadian government sponsoring this?
06:00:57 arthurbuliva: why not any other African government?
06:01:17 A12n: (1) ASCII (e.g., Swahili, English); (2) Latin-1 (many accented characters as in Sango, west Eur. lang's); (3) extended Latin (e.g., Hausa, Wolof); (4) ext. Latin + combining diacritics, usually for tone (Yoruba is one of the more demanding in this regard)
06:02:22 A12n: Hi Arthur, good question
06:05:02 A12n: I can't speak on anyone's behalf but see the area of localization as emerging. Few Afr. govt's have invested much in this aspect - yet. The PAL project helped research the issues in general and under the new project...
06:05:31 A12n: ... we'll have a sub-project looking into these policy issues
06:06:39 A12n: Another missing actor, if you will, is the foreign donor/development community with a few exceptions like IDRC
06:07:57 A12n: On the other hand, the African Academy of Languages ACALAN (which is part of the African Union) is taking an active interest in this area. Not only as part of the new project but in other ways (such as in collab with UNESCO)
06:08:18 Martin Benjamin: we are also working on a proposal for Hausa. But those Hausa and Yoruba proposals are for full dictionaries, not the IT component we are currently talking about
06:08:34 henryaddo: IDRC is the funder of the PAL project. Most software available doesn't have translations in African languages, so the Canadian government decided to devote some money to supporting African languages so that more software gets translated into them. Currently most software is not translated into African languages because there are no tools (keyboards, software) for these languages to be supported
06:08:38 SabineCretella: I am Sabine Cretella, CCO of Vox Humanitatis, we work with less resourced languages and care about the development of OmegaWiki 2.0
06:09:29 A12n: So it's an evolving area. Local initiatives in Africa - a number of which are associated with this effort already - are developing the civil society support for localisation
06:10:53 Martin Benjamin: paa kwesi is noting the parallels with radio in Ghana. that until Peace FM started broadcasting in Twi, every broadcaster in Ghana only used English
06:11:42 Martin Benjamin: remember that images are available on
06:12:05 Martin Benjamin: and Paa Kwesi says that now more than 50% of the 20+ radio stations in Ghana are broadcasting in local langs
06:12:53 henryaddo: Introduction over by Paa Kwesi
06:13:06 Martin Benjamin: ok, that's been a lot of preliminary explanations, to bring everyone up to speed
06:13:13 Martin Benjamin: now on with the agenda
06:14:11 Martin Benjamin: CHAT VISITORS, if you have other questions about the background of the project, please chime in
06:14:26 Martin Benjamin: it is very difficult to type and talk at the same time
06:15:32 henryaddo: Martin going to summarize the PALDO document
06:15:33 Martin Benjamin: henry and arthur will now do the scribing
06:15:42 mark: Not at the moment no. Could you copy/paste the agenda into the chatroom?
06:16:10 henryaddo: Martin will get the document available later
06:18:25 henryaddo: Agenda will be pasted in a minute
06:20:25 Bèrto ëd Sèra: Hi all :) I'm Bèrto 'd Sèra. Eager to listen. I worked all night so bear with my typos...
06:20:56 mark: The chat has some Unicode issues too :)
06:23:13 Bèrto ëd Sèra: LOL yes :) My name is good Unicode tester, usually :)
06:23:13 henryaddo: Morning: Dictionary design review
( Multilingual Database
Linking Tool
Editorial Interface
Language Specifications
Expanded search
Server optimization
kamusi impetrata
Funding mechanism
Custom dictionary output
User suggestion from online participants )

Afternoon: Discuss implementation details
technologies to be used.
Which server and its location

Evening: Iteration Planning
06:23:44 henryaddo: The agenda
06:24:02 A12n: Well this raises the question: Kamusi has worked mainly with ASCII (level 1 orthographies by my schema). I'm guessing that Kasahorow will help with the Unicode value-added for more complex Latin orthographies?
06:24:45 henryaddo: A12n; yeah that is considered
06:25:28 A12n: (My question followed Mark's comment, but relates also to the multilingual issues in the agenda)
06:26:03 thinfox: ascii won't be able to support any other language so unicode is the only way forward, i think
06:26:43 henryaddo: Welcome thinfox
06:26:47 andrewc: yep, but there are many aspects to unicode support, database used and scripting and programming languages will impact
06:27:08 A12n: Yes, agreed re Unicode. Just noting the support issues in the chat and Kamusi's previous ability to use ASCII
06:28:06 Bèrto ëd Sèra: how many languages do you plan to use? One of the weakest points with MySQL is sorting. You will have to build your own sorting algorithms
06:28:28 Bèrto ëd Sèra: unless you accept a sort that is foreign to the language
06:28:50 Bèrto ëd Sèra: like unicode binary
06:28:51 andrewc: normalisation, collation, case folding. making an extensible back end capable of supporting multiple languages in more than one script can be challenging
06:29:44 henryaddo: These are issues to be discussed during the implementation details session
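
The normalisation and case-folding points raised above can be illustrated with a minimal Python sketch. It is not taken from the Kamusi or kasahorow code; the function name `search_key` is hypothetical, and the sketch assumes that search comparison happens on a derived key rather than on the stored headword. Collation (language-specific sort order) is not covered here.

```python
import unicodedata

def search_key(text: str) -> str:
    """Build a comparison key for search (hypothetical helper).

    The text is NFC-normalized so that precomposed and decomposed
    spellings of the same letter compare equal, then case-folded,
    then normalized again because case folding can change the form.
    """
    normalized = unicodedata.normalize("NFC", text)
    return unicodedata.normalize("NFC", normalized.casefold())

# The same letter typed two ways by different keyboards:
precomposed = "\u1EB9"   # e with dot below as a single code point
decomposed = "e\u0323"   # plain e followed by combining dot below

# The raw strings differ, but their search keys match.
keys_match = search_key(precomposed) == search_key(decomposed)
```

The stored headword would stay as entered; only the derived key is used for matching, so no orthographic information is lost.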
06:31:36 mark: Other implementation details, then, include issues of data portability
06:31:58 mark: How will the database format be documented? What about XML output?
06:32:06 henryaddo: noted mark
06:32:19 mark: What about porting a dictionary in SFM format to "PALDO format", or vice versa?
06:34:05 mark: (I meant MDF instead of SFM)
06:35:17 henryaddo: These are issues to be discussed during the implementation details session

06:36:05 mark: henry: indeed (but I might not be there). Add this related one: it is worth taking a look at LIFT XML, see e.g.
06:37:08 henryaddo: Martin is giving a demo about the edit engine of kamusi
06:37:20 henryaddo: Thanks mark. Noted
06:38:55 henryaddo: for those who just join in. you can view live images from here
06:49:42 SabineCretella: is Martin's presentation being recorded and then available online afterwards? two of us don't get any video or sound (just some changing pictures) - or is that normal?
06:50:25 henryaddo: Martin explaining how the linking tool will work; there should be a way of linking words that have similar meaning in other languages
06:50:53 A12n: Hi Sabine - good question
06:53:01 henryaddo: no recording :(. just streaming live
06:53:48 henryaddo: sorry we have limited bandwidth
06:54:15 SabineCretella: hmmm ... are we the only ones having these probs? can somebody record what is coming in over the streaming connection?
06:54:15 henryaddo: but we're transcribing it
06:54:29 A12n: In addition to scripts/orthography support (and the range of technical issues there) there are also linguistic questions wrt "dialects" - your plan to work with Berber languages is an extreme case - a lot of variability. Fulfulde/Pulaar (not yet on your list) is another one, though less extreme. Even Kinyarwanda, which is so close to Kirundi, but not the same raises some questions
06:54:29 SabineCretella: that's a lot of work ... transcribing ...,
06:55:01 henryaddo: yeah it is, Sabine
06:55:50 henryaddo: Paa Kwesi explaining the goals of the Editor's interface
06:56:12 A12n: (You're streaming the audio? Hear nothing on the video page and see no links)
06:56:17 Martin Benjamin: ok, I'll try to type up what I just talked about, at the same time as typing up Paa Kwesi's talk
06:56:27 Martin Benjamin: henry will also type for Paa Kwesi
06:57:12 Martin Benjamin: the biggest challenge we are confronting right now is going from a bilingual model to multilingual
06:57:15 SabineCretella: just switch audacity on and record ... or just the ordinary voice recorder of the computer - that should help a lot
06:57:16 henryaddo: Current design decision is,
06:57:26 SabineCretella: even if audio quality will not be the best
06:57:40 henryaddo: new version must be able to handle non-linear variation
06:58:51 Martin Benjamin: thanks, sabine, I've turned on audacity. not sure how we'll get that online
06:58:57 A12n: (I'll continue with some comments as I'll have to move on in another hour or so). Another issue is how to handle languages that are organized differently. I would love to see Fulfulde/Pulaar in the mix for the reason that it is root-based which implies a different kind of optimal organization than that used for most other languages
06:59:04 henryaddo: there should be a way to capture variations by using the concept of cluster
06:59:20 SabineCretella: contact me later on - we can get it online
06:59:26 Martin Benjamin: in bilingual model, each entry is tied between english and swahili. eng school = swa shule, for example
06:59:35 henryaddo: the actual language is the cluster language which can get various sub langs
07:00:11 arthurbuliva: @Sabine audio is not available because of limited bandwidth
07:00:35 Martin Benjamin: in a multilingual dictionary, it becomes impossible to hard link each lang in a single entry
07:01:27 Martin Benjamin: for example, there are at least two ways to say secondary school in English, and 3 in Swahili, and 3 in French
07:01:39 SabineCretella: no probs - we can get it online later
07:02:12 henryaddo: there is going to be a single part-of-speech system, and it is going to be English's
07:02:15 Martin Benjamin: the mathematics would make it prohibitive to try to have each one of those as an individual record
07:03:32 Martin Benjamin: instead, we need to have each english term be its own record, and each swahili term be its own record, and each akan term be its own record
07:04:10 mark: in other words, a relational database (or that would seem to be the only viable solution to such problems)
07:04:17 A12n: I'm interested to know how the cluster arrangement will work. This is something that one might be able to do with other language families in other regions, but seems especially important in Africa at this time - so many "languages" are either really dialects of a larger language (or "macrolanguage") or are in clusters that share a lot. For all of these, ways to leverage resources for all, and apply research on one to others is possible. Then too - back to the orthography issues - there are variant forms often decided on by countries (when the language crosses borders) or local organizations
07:04:39 Martin Benjamin: and then we do some magic behind the scenes, in the database, that links a particular record from one lang with the relevant record or records in a second lang / third lang/ etc
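
The "one record per term, linked behind the scenes" model described above can be sketched as a small relational schema. This is only an illustration under stated assumptions: the table and column names (`term`, `link`, `term_a`, `term_b`) and the helper functions are hypothetical and are not the actual Kamusi schema, and the French entry is added purely as an example of a third language. The sketch uses Python's built-in sqlite3 module rather than MySQL so that it runs as-is.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One row per term per language (hypothetical schema).
    CREATE TABLE term (
        id       INTEGER PRIMARY KEY,
        lang     TEXT NOT NULL,
        headword TEXT NOT NULL
    );
    -- One row per confirmed link between two term records.
    CREATE TABLE link (
        term_a INTEGER NOT NULL REFERENCES term(id),
        term_b INTEGER NOT NULL REFERENCES term(id),
        PRIMARY KEY (term_a, term_b)
    );
""")

def add_term(lang, headword):
    """Insert a term record and return its id."""
    cur = conn.execute(
        "INSERT INTO term (lang, headword) VALUES (?, ?)", (lang, headword)
    )
    return cur.lastrowid

def link_terms(a, b):
    """Store an undirected link once, with the smaller id first."""
    low, high = sorted((a, b))
    conn.execute(
        "INSERT OR IGNORE INTO link (term_a, term_b) VALUES (?, ?)", (low, high)
    )

def translations(term_id, lang):
    """Return headwords in `lang` that are directly linked to `term_id`."""
    rows = conn.execute(
        """
        SELECT t.headword
        FROM link AS l
        JOIN term AS t
          ON t.id = CASE WHEN l.term_a = :id THEN l.term_b ELSE l.term_a END
        WHERE (l.term_a = :id OR l.term_b = :id) AND t.lang = :lang
        ORDER BY t.headword
        """,
        {"id": term_id, "lang": lang},
    ).fetchall()
    return [row[0] for row in rows]

school = add_term("eng", "school")
shule = add_term("swa", "shule")
ecole = add_term("fra", "école")   # illustrative third language
link_terms(school, shule)
link_terms(school, ecole)
```

With this layout, adding a language means adding rows, not columns, and existing Swahili-English pairs become two term rows plus one link row. Note that `translations(shule, "fra")` returns nothing until someone links those two records directly or a linking tool proposes the connection.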
07:07:15 Simon W: Linking between databases is tricky stuff! How many languages are you going to have linked together?
07:07:25 Martin Benjamin: we need to do this in such a way that we don't mess up our existing 70,000 Swahili-English records
07:09:01 henryaddo: explaining the cluster arrangement: Akan[ cluster ]
fanti - local( is only for akan )
akuapem part of speech( global )
Headword local

07:09:01 andrewc: what is the structure for existing records is there a database schema available?
07:10:09 henryaddo: andrewc yeah there is
07:10:13 mark: andrewc, good question - I'd be interested in that too
07:10:48 mark: henry, could you put it online, or is it already?
07:12:25 henryaddo: currently, there are two main tables, one contains all the words and meanings and their parts of speech, the second table is for grouping the words
07:12:55 henryaddo: mark; let me get the current schema online
07:14:23 andrewc: swahili and english in the same table or separate tables?
07:14:59 henryaddo: same tables

07:15:52 Martin Benjamin: the easiest way to see the schema, visually, is to look up any word at and click on "edit entry"
07:16:00 A12n: Relating what I suggested re roots previously, another way of relating words in a language like Fulfulde/Pulaar is specific to the language itself - the roots from which most lexical items are derived
07:16:22 Martin Benjamin: anyone on this chat also has privileges to log in to the edit engine
07:19:09 Martin Benjamin: [sorry, photo break before our journalist attendee leaves us to the technical discussion]
07:19:27 andrewc: would this model suit Arabic?
07:19:50 Martin Benjamin: yes, we are hoping to add arabic at some point
07:20:25 henryaddo: okay Martin gave, the alternate way of viewing the schema
07:20:46 A12n: Yes, Arabic of course has roots as well, and again that is language specific and an active part of morphology. In such languages, the ability to call up related terms by root would be helpful, though asymmetric with the way other languages in a multilingual resource are organized
07:22:18 A12n: (at least it seems like it is both essential for that language component [Arabic or Fulfulde/Pulaar] and an additional element that is not important for many other languages)
07:24:44 Martin Benjamin: we sometimes have Arabic in the existing kamusi, when we know the derived arabic root for a swahili word
07:25:38 Martin Benjamin: you can look up "workshop" to see how that works for Farsi
07:27:23 A12n: Morphology is one aspect (how the words are formed or what their history is). Productive roots are another aspect and this is the case for Fulfulde/Pulaar and I think Arabic...
07:29:23 Martin Benjamin: look up "rafadha" from swa to eng for an example of arabic currently in the Kamusi
07:30:39 A12n: You can look up a (more or less, since correspondences are not that simple, as you've already noted) 1 to 1 equivalence, but knowing the root and other related derivations (with extenders, as are found in some other languages of the region - Nguni languages I think for example) from it allows one to use it productively. It's almost like the roots are metawords. (Sorry for this long digression)
07:37:43 Martin Benjamin: thanks, don, that's good
07:38:04 Martin Benjamin: and I've been vocalizing again, so let me try to type backwards
07:38:28 Martin Benjamin: we talked about the linking tool
07:39:18 arthurbuliva: Download the first audio file at here. The file is about 1.5 MB in size and is in amr format. Apologies for the poor sound quality. Sorry also that the recording started in the middle of the session.
07:39:53 arthurbuliva: Sorry, but the recording is at
07:40:14 Martin Benjamin: basically, if you know that swa "simama" = eng "stand", and you know that akan "sori" = eng "stand", then the computer can predict that simama = sori
07:40:46 Martin Benjamin: HOWEVER, that is problematic, because that is only good to the extent that you are talking about the action of standing up
07:41:41 Martin Benjamin: there are other senses of "stand" that are completely different, in both swahili and akan. stand = tolerate, stand = a bunch of trees, stand = a kiosk
07:42:09 Martin Benjamin: so you need (a) the computer to predict connections, and (b) a person to confirm or reject those predictions
07:42:38 Martin Benjamin: the linking tool will provide that functionality
07:43:40 Martin Benjamin: also, every time you make a confirmed human connection between records for 2 langs, that confirmation can be used to improve the confidence of computer predictions between langs that are linked to those langs
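
The predict-then-confirm workflow described above can be sketched in Python. This is an assumption-laden illustration, not the linking tool itself: the names `confirmed`, `neighbours`, `predict` and `review` are hypothetical, terms are represented as `(language code, headword)` tuples, and the example works at headword level, which is exactly why the human review step is needed to separate senses such as "stand up" from "tolerate".

```python
# Confirmed links, each stored as an unordered pair of (lang, headword) terms.
confirmed = {
    frozenset({("swa", "simama"), ("eng", "stand")}),
    frozenset({("aka", "sori"), ("eng", "stand")}),
}

def neighbours(term):
    """All terms with a confirmed link to `term`."""
    return {other for pair in confirmed if term in pair for other in pair - {term}}

def predict(term, target_lang):
    """Suggest unconfirmed candidates in `target_lang`.

    A candidate is any term in the target language that shares a
    confirmed neighbour (a pivot) with `term` but is not yet linked to it.
    """
    direct = neighbours(term)
    candidates = set()
    for pivot in direct:
        for cand in neighbours(pivot):
            if cand != term and cand[0] == target_lang and cand not in direct:
                candidates.add(cand)
    return sorted(candidates)

def review(term, candidate, accept):
    """Record a human decision; only accepted predictions become links."""
    if accept:
        confirmed.add(frozenset({term, candidate}))
    return accept

simama = ("swa", "simama")
suggestions_before = predict(simama, "aka")
review(simama, ("aka", "sori"), accept=True)
suggestions_after = predict(simama, "aka")
```

Before the review, the sketch proposes `("aka", "sori")` for simama through the English pivot; once an editor accepts it, the pair is a direct link and is no longer proposed. A fuller design would attach a sense identifier and a confidence score to each link, which this sketch leaves out.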
07:44:06 arthurbuliva: Recording 2 is 1MB in size and is downloadable at here
07:44:20 Martin Benjamin: we are going to take a lunch break. back at 1:30 local time. CHAT PARTICIPANTS - please continue to send questions, even if we are offline
07:45:22 arthurbuliva: Pardon the typo
07:45:27 arthurbuliva: Recording 2 is 1MB in size and is downloadable at here
07:58:33 andrewc: will powerpoint presentations be made available?
08:00:31 Martin Benjamin: re pp, probably not
08:10:43 A12n: Too bad re ppt, but whatever other info can be made available online would be appreciated. I will have to sign off now. Look forward to hearing/reading more. Oon e tiyabu (praises on the work). Don
08:13:26 mark: Will the linking tool always need English to take part in the chain? Or will it be made such that lemmas in any two or more languages can be linked?
08:14:20 mark: The latter seems more advisable to me, obviously. From a lexicographical point of view it would be quite naive (and dangerously limiting) to presume that English always can play this mediating role.
08:18:47 A12n: (my impression was that English would be used as the networking language as a starter, but that the ultimate goal would be the approach you describe. Another aspect that doesn't get discussed as much but is part of Kamusi already is the monolingual one - i.e., how the word in a language is described in that language - monolingual dictionaries as we know are still relatively rare for African languages, although they are a key element in language development)
08:22:29 mark: Don, I agree. Anyway, that was my main question; I hope it can be addressed in my absence, for I, too, have to sign off now. Mìndo kàràbràlo (You are doing the work!), as we say in Siwu. Next time in Ghana I'll try to visit Kasahorow, I know the place in Accra and even saw the signboard last summer.
08:26:36 Martin Benjamin: don and mark, that is right. the idea is to have english "drop out" of the equation
08:27:48 Martin Benjamin: any language can serve as an "index" language, as long as a person working on the project is competent in both langs
08:28:42 Martin Benjamin: you could, for example, do a direct hausa-yoruba build, without ever using eng, because plenty of people are competent in both
08:29:12 Martin Benjamin: but it would be much harder to do a berber-malagasy dict without passing through at least one intermediary index lang
08:29:26 thinfox: one way would be to assign some sort of global identifier/word name to entries. equivalent words share the same word name/global identifier. english won't be needed in this setup
08:30:28 Martin Benjamin: ok, back from lunch
09:33:14 arthurbuliva: this second session is about implementation
09:33:21 henryaddo: second session starts. Martin talking about impetrata
09:33:48 henryaddo: impetrata means things that have been searched for and found
09:34:56 henryaddo: we keep track of every search that comes in. we know the ip, timestamp, what word was searched, what lang it was searched in, and whether the search returned a result
09:36:51 henryaddo: impetrata functionality is simply to give a report on say, how many words have been searched, when, etc
09:45:22 henryaddo: getting a frequency-based list of searched words
09:47:00 henryaddo: clean up slog table and remove most prepositions

09:48:27 Martin Benjamin: so we are able to say "love" is the most frequently searched term, and "carpet" is never searched for. that way, we can have editors for a particular language work on the terms that our search records already indicate are most important to our users
09:49:36 arthurbuliva: Recording 3 is 1.3 MB in size and is available for download at here
09:51:32 Martin Benjamin: the Impetrata will feed the terms to editors for each language in order, from most frequently searched toward least.
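
The frequency ordering described above can be sketched in a few lines of Python. This is a hedged illustration only: the log layout (dicts with `word`, `lang` and `found` keys), the `STOP_WORDS` set and the function name `editor_queue` are assumptions made for the example, not the actual slog table or Impetrata code.

```python
from collections import Counter

# Illustrative stop list standing in for "remove most prepositions".
STOP_WORDS = {"of", "in", "to", "on"}

def editor_queue(search_log, lang):
    """Return searched words for `lang`, most frequently searched first.

    Ties are broken alphabetically so the order is stable.
    """
    counts = Counter(
        entry["word"].lower()
        for entry in search_log
        if entry["lang"] == lang and entry["word"].lower() not in STOP_WORDS
    )
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked]

# Hypothetical sample log; real records would also carry ip and timestamp.
sample_log = [
    {"word": "love", "lang": "eng", "found": True},
    {"word": "school", "lang": "eng", "found": True},
    {"word": "Love", "lang": "eng", "found": True},
    {"word": "of", "lang": "eng", "found": True},
    {"word": "love", "lang": "eng", "found": True},
    {"word": "school", "lang": "eng", "found": False},
    {"word": "shule", "lang": "swa", "found": True},
]

queue = editor_queue(sample_log, "eng")
```

Here `queue` lists "love" ahead of "school", and "of" is dropped. Filtering on `found` being false would give a second queue of terms users looked for but the dictionary lacks.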
09:52:31 henryaddo: now to the languages specification
09:56:17 henryaddo: We need to work with linguists to give us the real specifications for the langs.
What is an entry in each lang?
What are we capturing? e.g. tone.
common vs high forms.

Dump the various details when it comes to a language specification.

09:57:31 Martin Benjamin: re impetrata, this gives us the ability to feed to editors a consistent set of terms across languages (preferencing the english search records, for simplicity).
09:58:08 henryaddo: These linguistic features will be dictated by the specific linguist per language. developer side, is to just think about the data
09:58:44 Martin Benjamin: and payment to partners can be made when specific targets have been reached - eg, when each tranche of 1000 terms is completed
09:59:36 Martin Benjamin: now we are talking about access points to PALDO
09:59:48 Martin Benjamin: eg, one web site for each lang? or one overall website?
10:02:13 SabineCretella: Part one of the recording is online here in mp3 format: here
10:02:33 SabineCretella: part 2 audio mp3
10:02:54 SabineCretella: part 2 - will go ahead with conversion of part 3 now
10:02:58 arthurbuliva: Thanks for the conversions Sabine
10:03:26 henryaddo: Martin throwing some ideas
10:04:20 SabineCretella: you"re welcome :-)
10:05:13 henryaddo: one website but with localized navigation
10:07:01 henryaddo: one backend but different frontends( skins )
10:07:29 henryaddo: and the third, different backend but shared access to the single db
10:10:49 SabineCretella: part 3 mp3
10:11:48 Martin Benjamin: the third option is basically a web service, where we provide data (db queries) that others can embed in their own websites
10:11:53 henryaddo: for initial implementation, we are going to use the first idea( one website but with localized navigation )
10:13:53 andrewc: model for i18n architecture?
10:14:24 henryaddo: yeah
10:15:21 henryaddo: now on domain.
10:15:28 Simon W: Listening to the audio on Sabine's site helps a lot. Shame we didn't get live audio. Something to think about for next time.
10:15:40 andrewc: ui localisation static or per user preference?
10:16:22 Simon W: The notes are good though!
10:16:28 henryaddo: yeah it's user preference
10:17:40 andrewc: one more variable to track independent of languages of dictionary items
10:18:44 henryaddo: so if someone's language is say Akan, he/she can choose which lang the ui should be in
10:19:32 andrewc: possibly ui interface may be in language other than dictionary item
10:21:46 henryaddo: exactly yeah
10:22:39 henryaddo: basically the ui lang is what lang you're using the site in and the lookup language could be anything other than the language you're viewing site in
10:22:49 andrewc: interesting issues for bidi support in ui design
10:23:20 andrewc: will it support ui mirroring?
10:24:27 henryaddo: yep
10:25:35 henryaddo: the index page could be something like this splashpage
10:26:22 mark: re henry's mentioning of the type of data captured: what about tone indeed? Indispensable for linguists and foreigners (pronunciation), mostly dispensable for native speakers (cf. the many non-tone marking orthographies in West Africa). So what do you do?
10:27:06 mark: It might be best to have two fields: orthography and pronunciation. In fact, there may be more than one field needed for orthography.
10:28:41 andrewc: pronunciation i'd assume would be ipa
10:28:47 mark: For database implementation of such issues, it might be instructive to look at the way SIL FieldWorks Language Explorer 5.2 tackles these issues.
10:29:04 arthurbuliva: Recording 4 is 1.9 MB in size and is available for download at here
10:29:22 mark: See Supports multiple writing systems, very niftily done.
10:29:36 arthurbuliva: Recording 5 is 1.1 MB in size and is available for download at here
10:29:37 andrewc: one approach would be adding tone and using filters to toggle between displays
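
The "store tone, filter for display" approach suggested above can be sketched in Python. This is a minimal illustration under assumptions: the function name `strip_tone` and the `TONE_MARKS` set are hypothetical, and treating grave, acute and macron as the tone marks is a simplification that would have to be decided per language by its linguists. The example string is Yoruba-style, with dot-below vowels that must survive the filter.

```python
import unicodedata

# Combining marks treated as tone marks in this sketch (an assumption):
# grave (U+0300), acute (U+0301), macron (U+0304).
TONE_MARKS = {"\u0300", "\u0301", "\u0304"}

def strip_tone(text: str) -> str:
    """Return `text` with tone marks removed and other diacritics kept.

    The text is decomposed (NFD) so every mark is a separate code point,
    the tone marks are dropped, and the result is recomposed (NFC).
    """
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", kept)

# o-with-dot-below plus grave, r, o-with-dot-below plus grave
with_tone = "\u1ECD\u0300r\u1ECD\u0300"
without_tone = strip_tone(with_tone)
```

The dot below is part of the letter and is preserved, while the grave accents are removed. Because acute and grave are ordinary accents in other orthographies (French, for instance), such a filter would need to be switched on per language rather than applied globally. The same function could also help match lay users' toneless input against tone-marked headwords.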
10:30:21 Martin Benjamin: mark, the point regarding tone is this: as far as PALDO is concerned, it is all just data. We can have 2 fields, 5 fields, 20 fields to show the distinctive features of each lang
10:30:49 mark: Martin Benjamin; point taken, thanks for clarifying.
10:31:05 Martin Benjamin: that is a discussion that will be had as we set up each lang, with the appropriate linguists, etc
10:32:02 mark: I was wondering about the user generated content side of this (if you're going that road, and it seems you are). Lay users (of which you'll have more than you'll have linguists, I suppose) will usually want to input without tone.

10:33:16 Martin Benjamin: what we need to do for PALDO is to have a standard jack into which we can plug each individual language. but the plug between the lang and PALDO doesn't affect what goes on in the lang, just like the plug in your kitchen doesn't care whether it is providing power to a microwave or a food processor
10:35:43 mark: (what is the chat doing to our single quotes?)
10:37:14 arthurbuliva: @mark that is a bug that is to be fixed
10:37:23 arthurbuliva: sorry about that issue
10:37:25 Martin Benjamin: now we are talking about how users can select (a) the lang they want for interface, and (b) the lang/s they want info about
10:40:44 henryaddo: we're talking about this page as an example of search box/ display lang selection on the left side frame
10:42:06 andrewc: actually analogy with plug is problematic, plug will need to do something, as minimum it will need to do normalisation, may also need to do case-folding
10:45:29 thinfox: it seems you need a different plug for each language that implements the unique features of that language, just like different geographic areas have their own plug and voltage output
11:08:40 SabineCretella: mp3 audio part 4
11:12:25 SabineCretella: mp3 audio part 5
12:40:08 thinfox: question about plural form of words: do we know enough about the selected languages to algorithmically derive the plural form of any word?
14:03:15 arthurbuliva: sorry guys but we completely lost internet connection
18:48:01 GMT 2008 arthurbuliva: the conversation after the communication blackout will be posted as soon as we're back up

The web connection went out in the building, sorry. Let me try to catch up on what we've been talking about.

We've been talking about what sorts of elements need to appear in the site.

1) user preferences, in terms of what lang they want the interface in, font size, etc

2) a block for search settings, so you can choose for example what elements of an entry you want (I want to see example sentences but not derivation info, eg)

3) do photos appear with entries by default? now they do not, so that we do not overload slow/ expensive connections in Africa. however, we can make a judgement about a user's connection speed, and upload photos by default when they appear to have fast connections.

4) Tools for each dictionary results page: email this page, share (eg delicious), print, export to pdf, add to personal list (eg flashcards)

the print page will be a stripped down version, no extraneous html elements. this can also be the basis for build-on-the-fly pdf pages

user export limits in place, so that we don't swamp the server with someone wanting to print a customized pdf file with all our data

weekly automated builds of particular dictionaries as pdfs (eg, swahili / kinyarwanda) that can be downloaded. done on a different machine than the primary server

discussing mirror and shadow servers, and backups. We want to have at least 2 servers. One is master, one (or more) slave. Master sends changed entries to slaves every X minutes, or when a daemon says that a change exists

edit engine: users should be able to edit their own edits (after submission), until the time that the entry has been reviewed by the editor


Project Management
Logic implementation - coding
Designing/ mocking/ theming
Hardware optimization
User experience engineering
Content editor - how to manuals, screenshots, etc
User support (account manager)
Business development
Database specialist (MySQL)

20 working days for Coding Marathon
Top priority is MULTILINGUAL features

discussing how to input and edit. in the first screen, the user edits their main language. Then, they will go to a second screen and input the term in Lang B. We then feed options from Lang B, like google's predictive search feature. If the word they are looking for exists in Lang B, they can click to select it. If it doesn't exist, they will need to create it and get it into an editor's loop for that lang.
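
A small sketch of the "feed options from Lang B" step, using a prefix lookup over a sorted word list. The function name, the limit of five suggestions, and the sample terms are assumptions for illustration; the list is assumed to be lowercase and already sorted.

```python
import bisect


def suggest(prefix, sorted_terms, limit=5):
    """Return up to `limit` Lang B terms that start with `prefix`."""
    prefix = prefix.lower()
    # Binary search jumps to the first term that could match the prefix.
    start = bisect.bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break  # sorted order: no later term can match either
        matches.append(term)
        if len(matches) == limit:
            break
    return matches


# Sample Lang B word list, for illustration only.
lang_b_terms = ["kamba", "kamusi", "kanga", "kitabu"]
options = suggest("kam", lang_b_terms)
```

An empty result is the signal for the second branch described above: the term does not exist in Lang B yet, so the user creates it and it goes into the editor's loop for that language.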

* functional multilingual capabilities
* decide on languages (get commitments from partners)

* Dataset cleanup
* Getting the Impetrata working
* Drupal cleanup

* Content translation module
* Drupal 6 upgrade
* All preparation work for Kinyarwanda

* First deliverable, July 1 = fully functional interface for editing Kinyarwanda dictionary
* July 15-31 (roughly), Kinyarwanda workshop/ training in Rwanda



© Copyright ©

The Kamusi Project dictionaries and the Kamusi Project databases are intellectual property protected by international copyright law, ©2007 through ©2018, under the joint ownership of Kamusi Project International and Kamusi Project USA. Further explanation may be found on our © Copyright page.

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

