OLAC Record: 2007 NIST Language Recognition Evaluation Supplemental Training Set

OLAC Record
oai:www.ldc.upenn.edu:LDC2009S05

Metadata

Title: 2007 NIST Language Recognition Evaluation Supplemental Training Set

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Martin, Alvin, et al. 2007 NIST Language Recognition Evaluation Supplemental Training Set LDC2009S05. Web Download. Philadelphia: Linguistic Data Consortium, 2009

Contributor: Martin, Alvin

Le, Audrey

Graff, David

van Santen, Jan

Date (W3CDTF): 2009

Date Issued (W3CDTF): 2009-11-20

Description: *Introduction*

2007 NIST Language Recognition Evaluation Supplemental Training Set was developed by the Linguistic Data Consortium (LDC) and the National Institute of Standards and Technology (NIST). It consists of 118 hours of conversational telephone speech segments in the following languages and dialects: Arabic (Egyptian colloquial), Bengali, Min Nan Chinese, Wu Chinese, Taiwan Mandarin, Cantonese, Russian, Mexican Spanish, Thai, Urdu, and Tamil.

The goal of NIST's Language Recognition Evaluation (LRE) is to establish the baseline of current performance capability for language recognition of conversational telephone speech and to lay the groundwork for further research efforts in the field. NIST conducted three previous language recognition evaluations, in 1996, 2003, and 2005. The most significant differences between those evaluations and the 2007 task were the increased number of languages and dialects, the greater emphasis on a basic detection task for evaluation, and the variety of evaluation conditions. Thus, in 2007, given a segment of speech and a language of interest to be detected (i.e., a target language), the task was to decide whether that target language was in fact spoken in the given telephone speech segment (yes or no), based on an automated analysis of the data contained in the segment.

*Data*

The supplemental training material in this release consists of the following:

* Approximately 53 hours of conversational telephone speech segments in Arabic (Egyptian colloquial), Bengali, Cantonese, Min Nan Chinese, Wu Chinese, Russian, Thai, and Urdu. This material is taken from LDC's CALLHOME, CALLFRIEND, and Mixer collections.
* Approximately 65 hours of full telephone conversations in Mandarin Chinese (Taiwan), Spanish (Mexican), and Tamil. This material was collected by Oregon Health and Science University (OHSU), Beaverton, Oregon. The test segments used in the 2005 NIST Language Recognition Evaluation (LDC2008S05) were derived from these full conversations.

In addition to the supplemental material contained in this release, the training data for the 2007 NIST Language Recognition Evaluation (LDC2009S04) consisted of data from previous LRE evaluation test sets, namely, 2003 NIST Language Recognition Evaluation (LDC2006S31) and 2005 NIST Language Recognition Evaluation (LDC2008S05).

LDC released other LREs as:

* 2003 NIST Language Recognition Evaluation (LDC2006S31)
* 2005 NIST Language Recognition Evaluation (LDC2008S05)
* 2007 NIST Language Recognition Evaluation (LDC2009S04)
* 2009 NIST Language Recognition Evaluation Test Set (LDC2014S06)
* 2011 NIST Language Recognition Evaluation Test Set (LDC2018S06)

*Samples*

For an example of the data in this corpus, please listen to this Egyptian Arabic sample (WAV) from the data set.

*Updates*

None at this time.
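The detection task described in the Introduction treats every (segment, target language) pair as an independent yes/no trial. The following is a minimal sketch of that decision step only, assuming a recognizer has already produced one numeric score per language for a segment. The function name `decide_trials`, the score dictionary, and the threshold value are hypothetical illustrations and are not part of the corpus or of NIST's evaluation tooling.

```python
from typing import Dict, Iterable


def decide_trials(
    segment_scores: Dict[str, float],
    target_languages: Iterable[str],
    threshold: float = 0.0,  # hypothetical operating point
) -> Dict[str, bool]:
    """Answer one yes/no detection trial per target language for a single segment.

    segment_scores maps a language code (e.g. "rus") to a score where higher
    means "more likely spoken". A target with no score is answered "no".
    """
    decisions = {}
    for lang in target_languages:
        score = segment_scores.get(lang, float("-inf"))
        decisions[lang] = score >= threshold
    return decisions


if __name__ == "__main__":
    # Hypothetical scores for one segment, keyed by ISO 639-3 code.
    scores = {"rus": 1.2, "tha": -0.7}
    print(decide_trials(scores, ["rus", "tha", "urd"]))
```

Each target language is answered separately, so a segment can in principle receive "yes" for several targets or for none; this differs from a closed-set identification setup that always picks exactly one language.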

Extent: Corpus size: 3323985 KB

Format: Sampling Rate: 8000

Sampling Format: 8 bit u-law
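As an illustration of the format stated above (8000 samples per second, 8-bit u-law), the sketch below expands u-law bytes to 16-bit linear PCM using the standard G.711 expansion and estimates duration from a payload size. It assumes a single-channel payload with any container header already stripped; the record does not state the file container or channel count, so those are assumptions, and the function names are illustrative.

```python
from typing import List

SAMPLE_RATE_HZ = 8000  # from the Format field of this record
ULAW_BIAS = 0x84       # bias constant used by G.711 u-law


def ulaw_to_linear(byte: int) -> int:
    """Expand one 8-bit u-law code to a signed 16-bit linear PCM value."""
    u = ~byte & 0xFF               # u-law codes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + ULAW_BIAS) << exponent) - ULAW_BIAS
    return -magnitude if sign else magnitude


def decode_ulaw(payload: bytes) -> List[int]:
    """Decode a headerless u-law payload into a list of PCM samples."""
    return [ulaw_to_linear(b) for b in payload]


def duration_seconds(payload_bytes: int, channels: int = 1) -> float:
    """One byte per sample per channel at 8000 Hz (channel count is an assumption)."""
    return payload_bytes / (SAMPLE_RATE_HZ * channels)


if __name__ == "__main__":
    print(decode_ulaw(bytes([0xFF, 0x80, 0x00])))
    print(duration_seconds(16000))
```

Under the same single-channel assumption, one hour of audio occupies 3600 × 8000 = 28,800,000 bytes, which gives a rough way to relate file sizes to durations.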

Identifier: LDC2009S05

https://catalog.ldc.upenn.edu/LDC2009S05

ISBN: 1-58563-530-8

ISLRN: 498-359-265-464-3

DOI: 10.35111/gqmf-6p19

Language: Yue Chinese

Wu Chinese

Urdu

Thai

Tamil

Spanish

Russian

Min Nan Chinese

Mandarin Chinese

Bengali

Egyptian Arabic

Language (ISO639): yue

wuu

urd

tha

tam

spa

rus

nan

cmn

ben

arz

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2009S05

Rights Holder: Portions © 2005 Oregon Health and Science University, © 1996, 2006, 2009 Trustees of the University of Pennsylvania

Type (DCMI): Sound

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2009S05

DateStamp: 2021-09-10

GetRecord: OAI-PMH request for simple DC format

Search Info
Citation: Martin, Alvin; Le, Audrey; Graff, David; van Santen, Jan. 2009. Linguistic Data Consortium.
Terms: area_Africa area_Asia area_Europe country_BD country_CN country_EG country_ES country_IN country_PK country_RU country_TH dcmi_Sound iso639_arz iso639_ben iso639_cmn iso639_nan iso639_rus iso639_spa iso639_tam iso639_tha iso639_urd iso639_wuu iso639_yue olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2009S05
Up-to-date as of: Wed Oct 29 7:01:10 EDT 2025
