Date issued: | 2001-03-19 |
---|---|
Status of document: | Draft Recommendation. This is only a preliminary draft that is still under development; it has not yet been presented to the whole community for review. |
This version: | http://www.language-archives.org/REC/language-20010319.html |
Latest version: | http://www.language-archives.org/REC/language.html |
Previous version: | None. |
Abstract: | This document specifies the controlled vocabulary of language identifiers used by OLAC. |
Editors: | Steven Bird, University of Pennsylvania (mailto:sb@ldc.upenn.edu) |
Copyright © 2001 Gary Simons (SIL International) and Steven Bird (University of Pennsylvania). This material may be distributed and repurposed subject to the terms and conditions set forth in the Creative Commons Attribution-ShareAlike 2.5 License.
Language identification is an important dimension of language resource classification. However, the character-string representation of language names is problematic for several reasons:
Different languages (in different parts of the world) may have the same name.
The same language may have a different name in each country where it is spoken.
Within the same country, the preferred name for a language may change over time.
In the early history of discovering new languages (before names were standardized), different people referred to the same language by different names.
For languages having non-Roman orthographies, the language name may have several possible romanizations.
Taken together, these facts suggest that a standard based on names will not work. What is needed instead is a standard based on unique identifiers that do not change, combined with accessible documentation that clarifies the particular speech variety denoted by each identifier.
The information technology community has a standard for language identification, namely, ISO 639 [ISO-639]. Part 1 of this standard lists two-letter codes for identifying about 140 of the world's major languages; Part 2 lists three-letter codes for identifying about 400 languages. ISO 639 in turn forms the core of another standard, RFC 3066 [RFC-3066] (formerly RFC 1766 [RFC-1766]), which is the standard used for language identification in the xml:lang attribute of XML and in the language element of the Dublin Core Metadata Initiative. RFC 3066 provides a mechanism for users to register new language identification codes for languages not covered by ISO 639, but very few additional languages have been registered.
Unfortunately, the existing standard falls far short of meeting the needs of the language resources community: it fails to account for more than 90% of the world's languages, and it fails to adequately document which languages the codes refer to [Simons-2000]. However, SIL's Ethnologue [Ethnologue] provides a complete system of language identifiers that is openly available on the Web. OLAC will employ the RFC 3066 extension mechanism, which permits Ethnologue codes to be incorporated.
The SIL Ethnologue [Ethnologue] provides some 6,800 three-letter codes, along with detailed information about language names, genetic affiliations and geographical locus, amongst other things.
There are at least three ways to determine the Ethnologue code for a given language:
Use the form interface provided on the Ethnologue site [Ethnologue],
Use the LDC's temporary Ethnologue query form, giving simpler output [LDC-Ethnologue],
Download an ASCII table of language codes [Language-Codes], and load them into a relational database using the schema provided in [Simons-2000].
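The third option can be sketched as follows. This is only an illustration: the actual layout of the ASCII table [Language-Codes] and the schema given in [Simons-2000] are not reproduced here, so the tab-separated two-column format, the table name `language_code`, and the functions `load_codes` and `find_codes` are all assumptions made for the example.

```python
import csv
import io
import sqlite3


def load_codes(table_text, conn):
    """Load 'CODE<TAB>NAME' lines into a (hypothetical) language_code table.

    Assumption: one language per line, with the three-letter code in the
    first column and a language name in the second.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS language_code ("
        "code TEXT PRIMARY KEY, name TEXT NOT NULL)"
    )
    rows = [
        (row[0].strip().upper(), row[1].strip())
        for row in csv.reader(io.StringIO(table_text), delimiter="\t")
        if len(row) >= 2  # skip blank or malformed lines
    ]
    conn.executemany(
        "INSERT OR REPLACE INTO language_code (code, name) VALUES (?, ?)", rows
    )
    conn.commit()
    return len(rows)


def find_codes(conn, name_fragment):
    """Return the codes of all languages whose name contains the fragment."""
    cursor = conn.execute(
        "SELECT code FROM language_code WHERE name LIKE ? ORDER BY code",
        ("%" + name_fragment + "%",),
    )
    return [code for (code,) in cursor]
```

A caller would open a database with `sqlite3.connect(...)`, pass the downloaded table text to `load_codes`, and then search by name with `find_codes`. Because several languages may share a name, the search returns a list of candidate codes rather than a single code.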
A three-letter Ethnologue code AAA will be represented as x-sil-AAA.
Other RFC 1766 language codes, such as "en" (English) and "en-us" (US English), may be used; however, the Ethnologue codes are identified as OLAC Best Practice.
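The mapping between an Ethnologue code and its OLAC identifier is mechanical, and might be sketched as below. The function names are hypothetical, and normalizing the code to upper case is an assumption based on the pattern x-sil-AAA; this document does not state a case rule.

```python
import re

_THREE_LETTERS = re.compile(r"[A-Za-z]{3}")
_OLAC_SIL = re.compile(r"x-sil-([A-Za-z]{3})", re.IGNORECASE)


def to_olac_identifier(ethnologue_code):
    """Return the x-sil-AAA form of a three-letter Ethnologue code."""
    code = ethnologue_code.strip()
    if not _THREE_LETTERS.fullmatch(code):
        raise ValueError("not a three-letter code: %r" % ethnologue_code)
    # Assumption: the code is written in upper case, as in "x-sil-AAA".
    return "x-sil-" + code.upper()


def from_olac_identifier(identifier):
    """Return the Ethnologue code inside an x-sil-AAA identifier, else None.

    Other language codes such as "en" or "en-us" yield None.
    """
    match = _OLAC_SIL.fullmatch(identifier.strip())
    return match.group(1).upper() if match else None
```

For example, `to_olac_identifier("aaa")` gives `"x-sil-AAA"`, while `from_olac_identifier("en-us")` gives `None` because that code does not use the Ethnologue extension.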
The SIL Ethnologue only covers living and recently extinct languages, and no language codes currently exist for ancient languages (e.g. Akkadian), for proto-languages (e.g. Proto-Bantu) or more recent precursors of current languages (e.g. Middle English). Until a coding system is devised, these languages should be identified by their conventional name(s).
An OLAC data provider should support a standardized method for representing OLAC metadata in unqualified Dublin Core. For language identifiers, the procedure is as follows:
Drop the language refinement of the subject element and prepend "Language: " to the content.
If there is an identifier but no content, look up the language name using the controlled vocabulary server to get a human-readable string, and make that the content.
Drop the identifier attribute and append its value, parenthesized, to the content.
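The three steps above can be sketched as a single function. This is a minimal illustration under stated assumptions: the function name `to_unqualified_dc` is hypothetical, the controlled vocabulary server is stood in for by a caller-supplied `lookup_name` function, and the single space before the parenthesized identifier is a formatting choice not fixed by this document.

```python
def to_unqualified_dc(content, identifier, lookup_name):
    """Build the unqualified Dublin Core subject string for a language.

    content     -- element content (a language name), or None
    identifier  -- value of the identifier attribute, or None
    lookup_name -- hypothetical stand-in for the controlled vocabulary
                   server: maps an identifier to a name, or returns None
    """
    if not content and not identifier:
        raise ValueError("need a language name or an identifier")
    text = (content or "").strip()
    # An identifier with no content: look up a human-readable name.
    if not text and identifier:
        text = lookup_name(identifier) or ""
    # Drop the language refinement and prepend the "Language: " label.
    result = ("Language: " + text).rstrip()
    # Drop the identifier attribute; append its value in parentheses.
    if identifier:
        result += " (" + identifier + ")"
    return result
```

With a lookup table that maps the placeholder identifier x-sil-AAA to the placeholder name "Example Language", a record with an identifier but no content becomes `Language: Example Language (x-sil-AAA)`, and a record with only the conventional name "Middle English" becomes `Language: Middle English`.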
Create an official controlled vocabulary server for the Ethnologue.