A Survey of the State of the Art in
Digital Language Documentation and Description
Steven Bird and Gary Simons
Draft: 5 December 2000
About this document.
This document has been prepared in conjunction with the
workshop on Web-Based Language Documentation and Description,
held in Philadelphia on 12-15 December 2000.
It is a follow-up to the requirements document, helping to
assess the extent to which the requirements are met by the present
state of the art.
2004-03-30 NOTE: This document is no longer maintained and contains many
broken hyperlinks.
Whether one is collecting new language data,
searching a corpus for an instance of some linguistic phenomenon,
looking for dictionaries and texts from a particular language family,
converting data to work with a favorite tool, cataloging language
resources, or any of a host of similar tasks,
one is immediately confronted with a series of questions:
- What data is available?
- What tools are available?
- How adequate are these resources?
- Who is creating and using these resources?
- Where can I go for advice?
A more extensive list of such questions (with answers) is available
at the LTG Helpdesk FAQ.
1. What data is available?
In recent months, we have conducted a survey of language archives
[http://www.ldc.upenn.edu/exploration/survey.html].
Respondents were asked to answer the following questions:
1. Name and Location
  1    Please provide the archive name, URL, host institution, country,
       contact person and email address.
2. Catalog
  2.1  If the archive has a catalog in a standardized format, what fields does
       it contain? If not, what contextual information about the resources is
       collected? What other information would you like to collect if you
       could?
  2.2  If the electronic catalog conforms to some standard, please tell us the
       name of the standard.
  2.3  To what extent have the archived materials been cataloged
       electronically?
  2.4  If there is an online public access catalog, please give its URL.
3. Holdings
  3.1  What geographical regions and languages are covered?
  3.2  Please give impressionistic estimates of the archive holdings for each
       of the data types: Texts; Wordlists, Vocabularies, Lexicons,
       Dictionaries; Field Notes, Correspondence, Misc files; Descriptions
       (Grammars, Phonologies, etc); Audio Recordings; Video Recordings.
  3.3  Please list any other data types which are not included above, or any
       other comments on the archive holdings.
  3.4  What proportion of the holdings are unique to the archive and not
       available elsewhere?
4. Electronic Publication
  4.1  To what extent are the archive holdings published electronically, where
       "published" means that there is a well-defined procedure such that
       anyone at all can get a standard copy of the data, either on digital
       media or over the internet?
  4.2  To what extent are the archive holdings accessible over the web?
  4.3  Is permission required before materials can be accessed?
  4.4  Is there any fee for materials?
  4.5  How are author and/or editor defined for the electronic publications?
       Is there a bibliographical citation method?
  4.6  Do the electronic publications have ISBN numbers?
  4.7  What plans are there to expand the electronic publication of archive
       holdings?
5. General Issues
  5.1  Who is the legal owner of archived materials? The original collector or
       his/her estate? The language community? The archive or its host
       institution? Some combination of these?
  5.2  Beyond legal ownership, are there any asserted or perceived moral
       rights concerning archived materials? Do the holders of the archive see
       the original speakers or their representatives as controlling
       publication?
  5.3  In cases where no electronic publication is planned, why is this so?
       (e.g. funding, licensing, technical know-how, lack of interest).
  5.4  Is any of the data in a proprietary format (e.g. MS Word)? If so, are
       there plans to transfer it to an open standard (e.g., XML)?
6. Do you have any other comments about digital archives of language material,
   or on this survey?
Responses were received from some twenty archives, and the completed
survey forms are all available online
[http://www.ldc.upenn.edu/exploration/survey/].
The full set of archives which have digital catalogs and holdings, or
concrete plans for these, is listed below, with URLs and contact names.
- AILLA: Archive of Indigenous Languages of Latin America
[http://uts.cc.utexas.edu/~ailla/introeng.html]
Joel Sherzer, Anthony Woodbury, University of Texas, Austin
- ALMA: African Language Material Archive
[http://polyglot.lss.wisc.edu/afrst/wara.html]
Leigh Swigart, West African Research Association
- ANLC: Alaska Native Language Center Archives
[http://www.uaf.edu/anlc]
Gary Holton, University of Alaska
- APS: American Philosophical Society American Indian Manuscript Collections
[http://www.amphilsoc.org/library/guides/indians/]
Robert Cox, American Philosophical Society
- ASEDA: Aboriginal Studies Electronic Data Archive
[http://coombs.anu.edu.au/SpecialProj/ASEDA/ASEDA.html]
Patrick McConvell, Australian Institute of Aboriginal and Torres Strait
Islander Studies
- BAS: Bavarian Archive of Speech Signals
[http://www.phonetik.uni-muenchen.de/Bas/BasHomeeng.html]
Florian Schiel, University of Munich
- CDEL: Center for the Documentation of Endangered Languages
[http://php.indiana.edu/~aisri/lab/home.html]
Douglas Parks, Wally Hooper, Indiana University
- CHILDES: Child Language Data Exchange System
[http://childes.psy.cmu.edu]
Brian MacWhinney, Carnegie Mellon University
- Corpus Documentale Latinum Portugaliae
Antonio Emiliano, University of Lisbon
- CNNC: Charlotte Narrative and Conversation Collection
[http://www.uncc.edu/english/cnnc/]
Boyd Davis, Pat Ryckman, University of North Carolina, Charlotte
- Creolist Archives
[http://www.ling.su.se/Creole/Text_Collection.shtml]
Mikael Parkvall, University of Stockholm
- CDLI: Cuneiform Digital Library Initiative
[http://cdli.ucla.edu/]
Robert Englund, UCLA
- ELRA: European Language Resources Association
[http://www.icp.inpg.fr/ELRA/catalog.html]
Khalid Choukri, Paris
- LACITO Linguistic Data Archive
[http://195.83.92.32/index.html.en]
Boyd Michailovsky, CNRS, Paris
- Linguistic Data Consortium
[http://www.ldc.upenn.edu/Catalog/]
Mark Liberman, University of Pennsylvania
- LPCA: Language and Popular Culture in Africa Text Archives
[http://www.pscw.uva.nl/lpca/textarchives/toc.html]
Vincent De Rooij, University of Amsterdam
- Max Planck Institute Language Archive and DOBES Archive
Peter Wittenburg, Max Planck Institute
- NAA: National Anthropological Archives
[http://www.nmnh.si.edu/naa/]
Robert Leopold, Smithsonian Institution
- OTA: Oxford Text Archive
[http://ota.ahds.ac.uk/ota/]
Michael Popham, Oxford University
- SIL Language and Culture Archive
Joan Spanne, Summer Institute of Linguistics
- SIL-MEX: SIL Mexico Archive
[http://www.sil.org/mexico/]
Albert Bickford, Summer Institute of Linguistics
- Survey of California and Other Indian Languages
[http://linguistics.berkeley.edu/Survey/]
Leanne Hinton, University of California, Berkeley
- UHLCS: University of Helsinki Language Corpus Server
[http://www.ling.helsinki.fi/uhlcs/]
Pirkko Suihkonen, University of Helsinki
Most of these archives have a partial digital catalog, and
about 25% have a complete digital catalog. A couple of them
use MARC or TEI. The following is a list of catalog fields
which are used or proposed by the above archives.
- language id (for the resource and for its subject, ethnologue code,
RFC 1766, ISO 639-2, alternative language names, language group)
- title of the resource, transliterated title
- resource type (e.g. lexicon, text, signal, ...)
- modality (e.g. text, audio, video, physiological, ...)
- file format, sample rate, number of tracks, size
- media type, dimensions (of book) or number (of CD-ROMs)
- recording details (e.g. microphone type)
- genre (e.g. narrative, instructional, greeting, ...)
- thematic topics
- register (e.g. formal, informal, honorific, collaborative, ...)
- event type (e.g. interview, meeting, ceremony, announcement, ...)
- participant description (e.g. name, age, gender, education, ...)
- interviewer, recorder, transcriber, ...
- speech style (e.g. whisper, mutter, talk, sing, falsetto, ...)
- transcription type (e.g. phonetic, orthographic, gesture, musical, ...)
- translation type (e.g. morpheme-level, word-level, sentence-level, ...)
- date, location (e.g. of creation, encoding, publication)
- access rights, use restrictions, copyrights, licenses, price
- editor, series name, series number, publisher
- catalog number (local, ISBN, ...)
- project for which the resource was created
- technological applications of the resource (e.g. machine translation, ...)
- URL for an online version of the resource, or for documentation
- provenance of the resource (e.g. geographical origin)
- historical period covered by the resource
- thesis level, degree granting institution
- software version, platform
- contact person/institution, address
Each archive uses some subset of these elements, in a variety of formats.
For certain elements an archive has evidently adopted a controlled
vocabulary. At present there are no widely used standards for
the storage format or for the controlled vocabularies, so
the catalog information from different archives is not directly comparable.
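To make the comparability problem concrete, the following is a minimal sketch in Python of how records from two catalogs might be mapped onto shared field names and a shared resource-type vocabulary. Everything in it is an assumption made for illustration: the archive names, the local field names (`lang`, `kind`, `LanguageName`, `ResourceType`), the vocabulary table and the sample records are invented and do not describe any archive in the survey.

```python
# Hypothetical sketch: normalising catalog records from two archives
# onto common field names and a common resource-type vocabulary.
# All field names, terms and records are invented for illustration.

# Per-archive mapping from local field names to common field names.
FIELD_MAPS = {
    "archive_a": {"lang": "language", "kind": "resource_type", "name": "title"},
    "archive_b": {
        "LanguageName": "language",
        "ResourceType": "resource_type",
        "Title": "title",
    },
}

# Shared controlled vocabulary, keyed by lower-cased local terms.
RESOURCE_TYPES = {
    "dictionary": "lexicon",
    "wordlist": "lexicon",
    "lexicon": "lexicon",
    "narrative": "text",
    "text": "text",
    "recording": "signal",
}


def normalise(archive, record):
    """Return a copy of `record` using common field names and vocabulary."""
    field_map = FIELD_MAPS[archive]
    common = {}
    for local_name, value in record.items():
        # Fields with no known mapping are kept under their local name.
        common[field_map.get(local_name, local_name)] = value
    rtype = common.get("resource_type")
    if rtype is not None:
        # Unrecognised terms become "unknown" rather than being guessed.
        common["resource_type"] = RESOURCE_TYPES.get(rtype.lower(), "unknown")
    return common


record_a = {"lang": "Examplish", "kind": "Dictionary", "name": "Sample wordlist"}
record_b = {"LanguageName": "Examplish", "ResourceType": "lexicon", "Title": "Sample lexicon"}

normalised = [normalise("archive_a", record_a), normalise("archive_b", record_b)]
for rec in normalised:
    print(rec["language"], rec["resource_type"], rec["title"])
```

Once both records share the key `resource_type` and the value `lexicon`, they can be compared or searched together. The mapping tables are the part that a community standard would replace.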
About half of these archives have some materials in digital form,
and about 20% are completely digital. Digital materials are
stored in a variety of formats, including:
HTML, SGML, XML, PDF, TEI Lite, Filemaker, MS Access, MS Word,
and project-internal formats.
To find out what is available, it is necessary to consult the catalogs of each
archive independently, typically using different interfaces and vocabularies for
each one.
There are links pages, e.g.
Corpus Linguistics.
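The cost of consulting each catalog separately can be illustrated with a small Python sketch of the alternative: one adapter per archive behind a single search function. This is only an illustration under stated assumptions. The two in-memory catalogs, their record layouts and the function names (`search_catalog_a`, `search_catalog_b`, `search_all`) are hypothetical, and no archive in the survey is known to offer such an interface.

```python
# Hypothetical sketch: querying several archive catalogs through one function.
# The in-memory "catalogs" and their layouts are invented for illustration;
# real archives expose different interfaces and vocabularies.


def search_catalog_a(language):
    """Stand-in for archive A, whose records use the key 'lang'."""
    records = [
        {"lang": "Examplish", "title": "Field notes"},
        {"lang": "Otherish", "title": "Wordlist"},
    ]
    return [r["title"] for r in records if r["lang"] == language]


def search_catalog_b(language):
    """Stand-in for archive B, which lists (language, title) pairs."""
    records = [("Examplish", "Audio recordings"), ("Examplish", "Grammar sketch")]
    return [title for lang, title in records if lang == language]


# One adapter per archive hides the differences in interface.
ADAPTERS = {"archive_a": search_catalog_a, "archive_b": search_catalog_b}


def search_all(language):
    """Return {archive name: list of matching titles} across all adapters."""
    return {name: adapter(language) for name, adapter in ADAPTERS.items()}


results = search_all("Examplish")
for archive, titles in results.items():
    print(archive, titles)
```

Each new archive costs one more adapter, which is why shared catalog formats would scale better than per-archive code.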
2. What tools are available?
Available tools are listed on several links pages, including the
following:
For LinguistList and the CMU AI Repository,
the categorization of the tools is
by application domain (e.g. text analysis, morphology, fonts, ...).
For the Linguistic Annotation and Linguistic Exploration pages,
there is a key for the platform. In the other cases there is
no categorization.
The ACL/DFKI Natural Language Software Registry
The Natural Language Software Registry is a key community
resource initiated by the ACL and organized by DFKI in
Saarbrücken.
The registry uses a taxonomy based on:
State of the art in Language Technology
http://registry.dfki.de/
Hans Uszkoreit, Thierry Declerck
Categories:
- annotation tools
- evaluation tools
- resources: grammars, lexicons, multimodal corpora,
spoken language corpora, terminology, written language corpora
- multimodality
- NLP development aid: tools, formalisms, machine learning methods,
architectures, theories
- spoken language: signal analysis, signal editing, signal processing,
speaker recognition, speech analysis, speech editing, speech processing, speech
production, speech recognition, speech synthesis, spoken dialog systems, spoken
language generation, spoken language translation, spoken language
understanding, text-to-speech synthesis, voice analysis, voice processing
- written language: alignment tools, corpus analysis, deep generation,
deep syntactic analysis, document image analysis, grammar and style checkers,
handling controlled languages, information extraction, information retrieval,
language guesser, lemmatizer, lexicon management, morphological generation,
morphological analysis, optical character recognition, part-of-speech tagging,
partial parsing, processing mark-up languages, segmenter, semantic and
pragmatic analysis, shallow generation, shallow parsing, speech checkers,
stemmer, summarization, terminology extraction, terminology management, text
classification, tokenization, translation memory, written dialog systems,
written language translation, written language understanding
Search form, permitting search on the following fields:
name, abstract, description, license (free, to negotiate, commercial),
kind of license (academic, multiple user, commercial),
main section, operating system, supported language
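A fielded search of this kind can be sketched in a few lines of Python. The sketch below is not the registry's implementation. The `RegistryEntry` class, the `search` function and the three sample entries are hypothetical, and only a subset of the fields listed above is modelled.

```python
# Hypothetical sketch of a fielded search over software registry entries.
# The entries are invented; only the field names follow the search form
# described above, and some fields are omitted for brevity.
from dataclasses import dataclass


@dataclass
class RegistryEntry:
    name: str
    abstract: str
    license: str             # "free", "to negotiate", or "commercial"
    main_section: str        # e.g. "annotation tools", "written language"
    operating_system: str
    supported_language: str


def search(entries, **criteria):
    """Return entries whose fields contain every given value (case-insensitive)."""
    matches = []
    for entry in entries:
        if all(
            str(value).lower() in str(getattr(entry, field)).lower()
            for field, value in criteria.items()
        ):
            matches.append(entry)
    return matches


entries = [
    RegistryEntry("ToolA", "Part-of-speech tagger", "free",
                  "written language", "Unix", "English"),
    RegistryEntry("ToolB", "Speech synthesis engine", "commercial",
                  "spoken language", "Windows", "German"),
    RegistryEntry("ToolC", "Corpus annotation editor", "free",
                  "annotation tools", "Unix", "any"),
]

free_unix = search(entries, license="free", operating_system="unix")
print([e.name for e in free_unix])
```

Matching is by case-insensitive substring, so a query on `main_section="language"` would return both the written and the spoken language tools. An exact-match search would need a controlled list of section names.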
3. How adequate are these resources? (draft)
- learn by trial and error
- no systematic evaluation available
- just tools - no support for interoperability, standard formats, etc
- best practice recommendations exist (e.g. TEI, CES) - what
is the extent of their adoption?
4. Who is creating and using these resources?
The community is arranged into three main groups.
The first group is engaged in the core activity of generating and using
language resources. The second group provides the technical foundation for
this core activity, while the third group constitutes the administrative
umbrella.
1. CREATORS AND USERS OF LANGUAGE RESOURCES - THE CORE ACTIVITY
- Speakers: using and learning languages;
providing primary materials and commentary;
promoting language use and teaching.
- Descriptivists: linguists, sociolinguists, and linguistic anthropologists
documenting language structure and use.
- Educators: teaching specific languages,
and the linguistic structure of specific languages.
- Theorists: developing new models of the human language faculty.
- Technologists: developing new human language technologies.
2. IMMEDIATE INFRASTRUCTURE - THE TECHNICAL FOUNDATION
- Archivists: digital archivists and librarians
providing storage and access for language resources.
- Developers: computer scientists developing models, formats, architectures and
tools for creating and searching digital language data.
- Publishers: disseminating language resources in paper and digital form.
3. SPONSORS AND PROMOTERS - THE UMBRELLA
- Professional Associations: promoting language resources, and
the adoption of best-practices for digital archives.
- Government Funding Agencies: establishing funding priorities,
and evaluating and enabling language resources.
- Non-Governmental Organizations: promoting and funding language resources.
Table 1: The Language Resources Community
Some archives catalog/distribute the resources of others.
5. Where can I go for advice?
Creators, users and archivers of language resources are often
faced with a bewildering array of technological options, with
no obvious source for competent advice. The most popular way
to obtain advice is through the many electronic
mailing lists. On many of the following lists there is significant
exchange of information concerning best practices.
anthro-list | anthro-l@listserv.acsu.buffalo.edu
archives-list | archives@listserv.muohio.edu
corpora-list | corpora@hd.uib.no
diglib-list | diglib@infoserv.nlc-bnc.ca
elsnet-list | elsnet-list@let.ruu.nl
electronic-records-list | erecs-l@listserv.albany.edu
empiricists-list | empiricists@unagi.cis.upenn.edu
endangered-languages-list | endangered-languages-l@carmen.murdoch.edu.au
exploration-list | linguistic-exploration@listserv.linguistlist.org
language-culture-list | language-culture@cs.uchicago.edu
linganth-list | linganth@cc.rochester.edu
linguist-list | linguist@listserv.linguistlist.org
nl-kr-list | nl-kr@cs.rpi.edu
salt-request | salt-request@cstr.ed.ac.uk
saltmil | saltmil@egroups.com
Another source of advice is the
LTG Helpdesk.
This site represents a vision for a
repository / clearing house for best practice recommendations.
People needing advice typically resort to posting a query on one or more
lists, sorting through the responses, and possibly posting a summary of
responses back to the lists. However, it is often difficult to decide on a
good course of action when the primary information is an uncoordinated set
of suggestions originating from strangers on a mailing list. In a period
of rapidly evolving technology, a wrong choice can lead to a dead end,
leaving painstakingly collected data unusable. Numerous
experiences in this community attest to this reality.
So how can we make wise use of the new technological opportunities
before us?