OLAC Language Vocabulary

Date issued: 2001-03-19
Status of document: Draft Recommendation. This is only a preliminary draft that is still under development; it has not yet been presented to the whole community for review.
This version: http://www.language-archives.org/REC/language-20010319.html
Latest version: http://www.language-archives.org/REC/language.html
Previous version: None.
Abstract:

This document specifies the controlled vocabulary of language identifiers used by OLAC.

Editors: Gary Simons, SIL International (mailto:gary_simons@sil.org)
Steven Bird, University of Pennsylvania (mailto:sb@ldc.upenn.edu)
Copyright © 2001 Gary Simons (SIL International) and Steven Bird (University of Pennsylvania). This material may be distributed and repurposed subject to the terms and conditions set forth in the Creative Commons Attribution-ShareAlike 2.5 License.

Table of contents

  1. Introduction
  2. Ethnologue codes
  3. Other RFC 1766 codes
  4. Languages for which codes are not assigned
  5. Mapping to unqualified Dublin Core
References

1. Introduction

Language identification is an important dimension of language resource classification. However, the character-string representation of language names is problematic for several reasons:

The sum of these facts taken together suggests that a standard based on names will not work. Rather, what is needed is a standard based on unique identifiers that do not change, combined with accessible documentation that clarifies the particular speech variety denoted by each identifier.

The information technology community has a standard for language identification, namely, ISO 639 [ISO-639]. Part 1 of this standard lists two-letter codes for identifying about 140 of the world's major languages; Part 2 lists three-letter codes for identifying about 400 languages. ISO 639 in turn forms the core of another standard, RFC 3066 [RFC-3066] (formerly RFC 1766 [RFC-1766]), which is the standard used for language identification in the xml:lang attribute of XML and in the language element of the Dublin Core Metadata Initiative. RFC 3066 provides a mechanism for users to register new language identification codes for languages not covered by ISO 639, but very few additional languages have been registered.

Unfortunately, the existing standard falls far short of meeting the needs of the language resources community: it fails to account for more than 90% of the world's languages, and it fails to adequately document what languages the codes refer to [Simons-2000]. However, SIL's Ethnologue [Ethnologue] provides a complete system of language identifiers which is openly available on the Web. OLAC will employ the RFC 3066 extension mechanism, which permits Ethnologue codes to be incorporated.

2. Ethnologue codes

The SIL Ethnologue [Ethnologue] provides some 6,800 three-letter codes, along with detailed information about language names, genetic affiliations and geographical locus, amongst other things.

There are at least three ways to determine the Ethnologue code for a given language:

  1. Use the form interface provided on the Ethnologue site [Ethnologue],

  2. Use the LDC's temporary Ethnologue query form, giving simpler output [LDC-Ethnologue],

  3. Download an ASCII table of language codes [Language-Codes], and load them into a relational database using the schema provided in [Simons-2000].

A three-letter Ethnologue code AAA will be represented as x-sil-AAA.
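The representation rule above can be sketched in a few lines of Python. This is only an illustration: the function name sil_tag is hypothetical, and the choices to validate the input as three ASCII letters and to upper-case it are assumptions, not requirements stated in this document.

```python
import re

# Assumption: an Ethnologue code is exactly three ASCII letters.
_THREE_LETTERS = re.compile(r"[A-Za-z]{3}")


def sil_tag(code: str) -> str:
    """Build the x-sil- identifier for a three-letter Ethnologue code.

    Hypothetical helper; upper-casing the code is an assumption made
    here to mirror the "x-sil-AAA" form shown in the text.
    """
    if not _THREE_LETTERS.fullmatch(code):
        raise ValueError(f"not a three-letter code: {code!r}")
    return "x-sil-" + code.upper()


# "AAA" is the placeholder code used in the text, not a real language.
print(sil_tag("AAA"))
```

Rejecting malformed input early keeps names or two-letter ISO 639 codes from being wrapped in the x-sil- prefix by mistake.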

3. Other RFC 1766 codes

Other RFC 1766 language codes, such as "en" (English) and "en-us" (US English), may be used; however, the Ethnologue codes are identified as OLAC Best Practice.

4. Languages for which codes are not assigned

The SIL Ethnologue only covers living and recently extinct languages, and no language codes currently exist for ancient languages (e.g. Akkadian), for proto-languages (e.g. Proto-Bantu) or more recent precursors of current languages (e.g. Middle English). Until a coding system is devised, these languages should be identified by their conventional name(s).

5. Mapping to unqualified Dublin Core

An OLAC data provider should support a standardized method for representing OLAC metadata in unqualified Dublin Core. For language identifiers, the procedure is as follows:

  1. Drop the language refinement of the subject element and prepend "Language: " to the content.

  2. If there is an identifier but no content, look up the language name using the controlled vocabulary server to get a human-readable string, and make that the content.

  3. Drop the identifier attribute and append its value, parenthesized, to the content.
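The three steps above can be sketched as a small Python function. It is a minimal illustration under stated assumptions: the function name, its parameters, and the lookup_name callback (standing in for the controlled vocabulary server) are all hypothetical, and the sample language name is invented. Step 2 is applied before step 1 so that the prefix is attached to content that is already filled in; the resulting string is the same either way.

```python
from typing import Callable, Optional


def to_unqualified_dc(
    content: str,
    identifier: Optional[str],
    lookup_name: Callable[[str], str],
) -> str:
    """Return the unqualified Dublin Core subject content.

    Hypothetical sketch: `lookup_name` stands in for a query to the
    controlled vocabulary server and is not a real OLAC API.
    """
    text = content.strip()

    # Step 2: an identifier with no content gets a human-readable name.
    if identifier and not text:
        text = lookup_name(identifier)

    # Step 1: the language refinement is dropped, so the content is
    # marked with a "Language: " prefix instead.
    text = "Language: " + text

    # Step 3: the identifier attribute is dropped and its value is
    # appended to the content in parentheses.
    if identifier:
        text = f"{text} ({identifier})"
    return text


# Invented lookup table for illustration only; "AAA" is a placeholder code.
names = {"x-sil-AAA": "Example Language"}
print(to_unqualified_dc("", "x-sil-AAA", lambda tag: names[tag]))
print(to_unqualified_dc("Akkadian", None, lambda tag: names[tag]))
```

A language that has no code, such as the ancient languages discussed in section 4, passes through with only the prefix added, since there is no identifier to append.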


To do

Create an official controlled vocabulary server for the Ethnologue.


References

[Country-Codes] Table of Ethnologue Country Codes
<http://www.language-archives.org/data/countrycodes.tab>
[Ethnologue] Ethnologue: Languages of the World
<http://www.sil.org/ethnologue/>
[ISO-3166] Codes for the representation of names of countries and their subdivisions--Part 1: Country codes
<http://www.din.de/gremien/nas/nabd/iso3166ma/>
[ISO-639] Codes for the Representation of Names of Languages--Part 2: Alpha-3 Code
<http://lcweb.loc.gov/standards/iso639-2/langhome.html>
[LDC-Ethnologue] LDC's Temporary Ethnologue Controlled Vocabulary Server
<http://wave.ldc.upenn.edu/OLAC/ethnologue/form.php3>
[Language-Codes] Table of Ethnologue Language Codes
<http://www.language-archives.org/data/languagecodes.tab>
[OLAC-MS] OLAC Metadata Set
<http://www.language-archives.org/OLAC/olacms.html>
[RFC-1766] Tags for the Identification of Languages
<http://www.ietf.org/rfc/rfc1766.txt>
[RFC-3066] Tags for the Identification of Languages (replaces 1766)
<ftp://ftp.isi.edu/in-notes/rfc3066.txt>
[Simons-2000] Language identification in metadata descriptions of language archive holdings
<http://www.ldc.upenn.edu/exploration/expl2000/papers/simons/simons.htm>