Welcome to LexiRumah

LexiRumah (from Lexicon and Indonesian Rumah = house) is an online database containing lexical data for languages of Eastern Indonesia and Timor-Leste. The database was set up and is maintained by the NWO Vici project “Reconstructing the past through languages of the present: the Lesser Sunda Islands” at Leiden University.

Presentation of Data

Each word in LexiRumah belongs to a particular language and a particular concept. For data drawn from survey word lists, the concept a word is matched to is the prompt that elicited the word (in nearly all cases, this prompt was given in Indonesian/Malay). For data from published sources, the concept a word is matched to is the one judged most similar to the gloss or definition given in the source.

Each word is associated with three different transcriptions. Firstly, there is a standardised transcription (Form IPA) that uses the symbols of the International Phonetic Alphabet and follows the conventions of CLTS (Cross-Linguistic Transcription Systems). Depending on the nature of the source and on how well the language is known, this transcription is either phonemic or phonetic, in both cases within the conventions of CLTS. In addition, in nearly all cases a phonemic sequence of two identical vowels is transcribed as a single long vowel (/VV/ → <Vː>) in order to ease cross-linguistic comparison.

Secondly, each word has an orthographic transcription (Orthography). For languages with an established orthography that is used to some extent by speakers and/or linguists, this transcription follows the conventions of that orthography. For languages that do not yet have an orthography, this transcription mostly follows the conventions of Indonesian.

Finally, the word is also given as it is represented in the original source from which it is derived.

To take one simple example, the Ili'uun (eray1237) word that matches the concept ‘hide’ was given in the original source as <ladjōk>. This has been standardised to IPA /lad͡ʒoːk/ and is transcribed orthographically as lajook, in rough accordance with Indonesian orthography.
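The /VV/ → <Vː> convention described above can be illustrated with a short Python sketch. This is only an illustration under stated assumptions, not the code used to build LexiRumah: the function name `collapse_double_vowels`, the small vowel inventory, and the input string `lad͡ʒook` (an assumed phonemic form before normalisation) are all hypothetical. The database applies the convention "in nearly all cases", whereas the sketch applies it unconditionally.

```python
# Illustrative sketch only: rewrite a sequence of two identical vowels
# as a single long vowel (/VV/ -> Vː).
# The vowel inventory below is an assumption made for this example.
VOWELS = set("aeiouɛɔəɪʊæɑɨʉ")
LENGTH_MARK = "ː"


def collapse_double_vowels(ipa: str) -> str:
    """Return `ipa` with each pair of identical adjacent vowels written as Vː.

    Works on single code points, so vowels carrying combining
    diacritics are not handled by this simple version.
    """
    out = []
    i = 0
    while i < len(ipa):
        ch = ipa[i]
        if ch in VOWELS and i + 1 < len(ipa) and ipa[i + 1] == ch:
            out.append(ch + LENGTH_MARK)
            i += 2  # skip the second vowel of the pair
        else:
            out.append(ch)
            i += 1
    return "".join(out)


# Hypothetical pre-normalisation form of the Ili'uun word for 'hide'.
print(collapse_double_vowels("lad͡ʒook"))
```

A word without a double vowel, such as taʔa, passes through unchanged.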

Data Provenance

The data represented in LexiRumah comes from published sources as well as from survey work. All data is attributed and linked to the source from which it comes. In addition to word lists provided by individual linguists (as attributed), LexiRumah includes survey work carried out in the following projects:
  • NWO-VICI project “Reconstructing the past through languages of the present: the Lesser Sunda Islands”, Leiden University, 2014–2019
  • The European Science Foundation EuroBABEL Project “Alor-Pantar languages: Origins and theoretical impact”, Leiden University, University of Alaska Fairbanks, US, and University of Surrey, United Kingdom, 2009–2013
  • NWO-VIDI project “Linguistic Variation in Eastern Indonesia”, Leiden University, 2002–2007

How to cite LexiRumah

LexiRumah is a separate publication by Gereon A. Kaiping, Owen Edwards, and Marian Klamer. We recommend that you cite it as:

Kaiping, Gereon A., Owen Edwards, and Marian Klamer (eds.). 2019. LexiRumah 3.0.0. Leiden: Leiden University Centre for Linguistics. Available online at https://lexirumah.model-ling.eu/. Accessed on 2025-01-22.

It is important to cite the specific reference (printed source or data collector) that the data you cite comes from. Every item in the database is linked to its reference and should be cited accordingly. For instance, if you wish to cite the Hewa (sika1262-hewa) word taʔa ‘betel vine’, you would do so as follows: “The Hewa word for ‘betel vine’ is taʔa (Fricke 2014)”, with the following entry in your list of references:

Fricke, Hanna. 2014. Topics in the grammar of Hewa: A variety of Sika in Eastern Flores, Indonesia. München: Lincom Europa.

If you do not have access to the original source, as is often the case for survey word-list data, it is still insufficient to cite only LexiRumah. Instead, cite the original source and LexiRumah in the following way: “The Iha word for ‘thorn’ [ˈᵑg͡bɛm] contains a voiced pre-nasalised co-articulated bilabial velar plosive (Donohue 2010 in Kaiping, Edwards and Klamer 2019)”.

Your reference list will then contain one reference for LexiRumah and one for Donohue (2010). You may wish (or your publisher may require you) to treat this original source as being “in” LexiRumah, in which case you can treat it like a chapter in a book and cite it as follows:

Donohue, Mark. 2010. Bomberai survey word lists. In Kaiping, Gereon A., Owen Edwards, and Marian Klamer (eds.). 2019. LexiRumah 3.0.0. Leiden: Leiden University Centre for Linguistics. Available online at https://lexirumah.model-ling.eu/. Accessed on 2025-01-22.
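The two in-text citation patterns described above can be summarised in a small Python sketch. It is purely illustrative: the `Source` class, the `inline_citation` function and the `have_original` flag are hypothetical names introduced for this example and are not part of LexiRumah or of any citation tool.

```python
from dataclasses import dataclass

# Short form of the LexiRumah reference, as used in the in-text examples.
LEXIRUMAH_SHORT = "Kaiping, Edwards and Klamer 2019"


@dataclass
class Source:
    """Hypothetical minimal record for an original source."""
    author: str  # surname(s) as they appear in an in-text citation
    year: int


def inline_citation(source: Source, have_original: bool) -> str:
    """Build the in-text citation for a form taken from LexiRumah.

    If the original source was consulted, cite it directly;
    otherwise cite it as being "in" LexiRumah.
    """
    if have_original:
        return f"({source.author} {source.year})"
    return f"({source.author} {source.year} in {LEXIRUMAH_SHORT})"


print(inline_citation(Source("Fricke", 2014), have_original=True))
print(inline_citation(Source("Donohue", 2010), have_original=False))
```

In both cases the original source still needs its own entry in the reference list. In the second case LexiRumah needs an entry as well.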

Cognate sets

The Cognate sets section contains sets of words that are formally and semantically similar, and thus likely cognate (inherited from a single ancestral etymon). These cognate sets are detected automatically with LexStat, so they are only a rough approximation: they are not the result of applying the traditional comparative method, which identifies regular sound correspondences. The alignment column within a single cognate set shows which segments of each word correspond to one another, as determined by the same algorithm. The source for all such cognate sets is given as Automatic cognate coding with LexStat, 2019.

Each cognate set is named after a randomly selected term from all forms in the cognate class. This term is not necessarily a reconstructed proto-form or a particularly frequent or representative form.
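To give a feel for what automatic grouping by formal similarity and random naming involve, here is a deliberately simplified Python sketch. It is not LexStat and does not reproduce its scoring or clustering: it swaps in plain normalised edit distance with a greedy threshold, which is a much cruder stand-in. The function names, the threshold of 0.5 and the word forms are all made up for illustration and are not taken from LexiRumah.

```python
import random


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (per code point)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution or match
        prev = cur
    return prev[-1]


def normalised_distance(a: str, b: str) -> float:
    """Edit distance divided by the length of the longer form."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def cluster_forms(forms, threshold=0.5):
    """Greedily put each form into the first cluster that already
    contains a sufficiently similar form, else start a new cluster."""
    clusters = []
    for form in forms:
        for cluster in clusters:
            if any(normalised_distance(form, other) <= threshold
                   for other in cluster):
                cluster.append(form)
                break
        else:
            clusters.append([form])
    return clusters


def name_cluster(cluster, rng):
    """Name a set after a randomly selected member, so the name is
    neither a proto-form nor necessarily a representative form."""
    return rng.choice(cluster)


# Made-up forms for a single concept, for illustration only.
clusters = cluster_forms(["mata", "mate", "ina", "inak"])
rng = random.Random(0)
for cluster in clusters:
    print(name_cluster(cluster, rng), cluster)
```

Because the name is chosen at random, two runs with different seeds may label the same set differently. This is why the name should not be read as a reconstruction.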

Content

Lexical items: 117,986
Concepts: 607
Sources: 143
Cognate sets: 4,028
Languages: 357 from 11 families
  • 279 Austronesian
  • 51 Timor-Alor-Pantar
  • 12 South Bird's Head
  • 4 West Bomberai
  • 3 East Bird's Head
  • 2 Hatam-Mansim
  • 2 Konda-Yahadian
  • 1 Maybrat
  • 1 Mor
  • 1 Tambora
  • 1 Mpur