OLAC Record: 2009 NIST Language Recognition Evaluation Test Set

OLAC Record
oai:www.ldc.upenn.edu:LDC2014S06

Metadata

Title: 2009 NIST Language Recognition Evaluation Test Set

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Martin, Alvin, et al. 2009 NIST Language Recognition Evaluation Test Set LDC2014S06. Web Download. Philadelphia: Linguistic Data Consortium, 2014

Contributor: Martin, Alvin

Greenberg, Craig

Graff, David

Walker, Kevin

Brandschain, Linda

Date (W3CDTF): 2014

Date Issued (W3CDTF): 2014-07-15

Description: *Introduction*

2009 NIST Language Recognition Evaluation Test Set was developed by the Linguistic Data Consortium (LDC) and the National Institute of Standards and Technology (NIST). It contains approximately 215 hours of conversational telephone speech and radio broadcast conversation collected by LDC in the following 23 languages and dialects: Amharic, Bosnian, Cantonese, Creole (Haitian), Croatian, Dari, English (American), English (Indian), Farsi, French, Georgian, Hausa, Hindi, Korean, Mandarin, Pashto, Portuguese, Russian, Spanish, Turkish, Ukrainian, Urdu, and Vietnamese.

The goal of NIST's Language Recognition Evaluation (LRE) is to establish the baseline of current performance capability for language recognition of conversational telephone speech and to lay the groundwork for further research efforts in the field. NIST conducted language recognition evaluations in 1996, 2003, 2005, and 2007. The 2009 evaluation increased the number of target languages. Most of the test data came from multilingual Voice of America (VOA) radio broadcasts assessed as being of telephone bandwidth; the remainder is conversational telephone speech. Further information regarding this evaluation can be found in the evaluation plan, which is included in the documentation for this release.

LDC released other LREs as:

* 2003 NIST Language Recognition Evaluation (LDC2006S31)
* 2005 NIST Language Recognition Evaluation (LDC2008S05)
* 2007 NIST Language Recognition Evaluation Test Set (LDC2009S04)
* 2007 NIST Language Recognition Evaluation Supplemental Training Set (LDC2009S05)
* 2011 NIST Language Recognition Evaluation Test Set (LDC2018S06)

*Data*

The VOA speech data was collected by LDC in 2000 and 2001 and constitutes approximately 75% of the test set. The telephone speech was taken from LDC's Mixer 3 collection, recorded between 2005 and 2007. All test speech segments are presented as a sampled data stream in standard 8-bit 8-kHz μ-law format. Each segment is stored separately in a single-channel SPHERE format file.

The test segments contain three nominal durations of speech: 3 seconds, 10 seconds, and 30 seconds. Actual speech durations vary, but were constrained to be within the ranges of 2-4 seconds, 7-13 seconds, and 23-35 seconds, respectively. Non-speech portions were retained so that each segment is a continuous sample of the source recording. A test segment may therefore be significantly longer than its speech duration, depending on how much non-speech was included.

*Samples*

For an example of the data in this corpus, please listen to this sample (WAV).

*Updates*

None at this time.
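The format described above (one 8-bit μ-law code per sample at 8 kHz, with speech durations falling in three nominal classes) can be illustrated with a minimal Python sketch. It is not part of the LDC release: the function names are hypothetical, the expansion follows the widely used G.711 μ-law decoding scheme, and it assumes the raw sample payload has already been extracted from the SPHERE file, which this sketch does not attempt.

```python
from typing import List, Optional

BIAS = 0x84            # bias constant used in G.711 mu-law expansion
SAMPLE_RATE_HZ = 8000  # from the record: 8-kHz sampling, one 8-bit code per sample

# Allowed speech-duration ranges in seconds, keyed by nominal duration (from the record).
DURATION_RANGES = {3: (2, 4), 10: (7, 13), 30: (23, 35)}


def ulaw_to_linear(code: int) -> int:
    """Expand one 8-bit mu-law code into a signed linear sample."""
    u = ~code & 0xFF                      # mu-law codes are stored bit-inverted
    mantissa = u & 0x0F
    exponent = (u & 0x70) >> 4
    magnitude = ((mantissa << 3) + BIAS) << exponent
    return BIAS - magnitude if u & 0x80 else magnitude - BIAS


def decode(payload: bytes) -> List[int]:
    """Decode a raw mu-law payload (header already removed) into linear samples."""
    return [ulaw_to_linear(b) for b in payload]


def segment_seconds(num_samples: int) -> float:
    """Total segment length, including any non-speech portions."""
    return num_samples / SAMPLE_RATE_HZ


def nominal_duration(speech_seconds: float) -> Optional[int]:
    """Map a measured speech duration to its nominal class, or None if out of range."""
    for nominal, (low, high) in DURATION_RANGES.items():
        if low <= speech_seconds <= high:
            return nominal
    return None
```

Because non-speech is retained in each segment, `segment_seconds` gives the file length rather than the speech duration; `nominal_duration` applies only to a speech duration obtained some other way, for example from the key files or a speech activity detector.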

Extent: Corpus size: 6390040 KB

Format: Sampling Rate: 8000

Sampling Format: ulaw

Identifier: LDC2014S06

https://catalog.ldc.upenn.edu/LDC2014S06

ISBN: 1-58563-682-7

ISLRN: 180-783-854-340-4

DOI: 10.35111/qv7y-5026

Language: Amharic

Haitian

English

French

Hindi

Spanish

Urdu

Bosnian

Croatian

Georgian

Korean

Portuguese

Turkish

Vietnamese

Yue Chinese

Dari

Persian

Hausa

Mandarin Chinese

Russian

Ukrainian

Pushto

Language (ISO639): amh

hat

eng

fra

hin

spa

urd

bos

hrv

kat

kor

por

tur

vie

yue

prs

fas

hau

cmn

rus

ukr

pus

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2014S06

Rights Holder: Portions © 2000, 2001, 2005-2007, 2014 Trustees of the University of Pennsylvania

Type (DCMI): Sound

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2014S06

DateStamp: 2021-09-10

GetRecord: OAI-PMH request for simple DC format
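
The GetRecord entries above refer to OAI-PMH requests for this record. As a rough sketch, such a request URL can be assembled from the OaiIdentifier and a metadata prefix. The `verb`, `identifier`, and `metadataPrefix` parameters come from the OAI-PMH protocol; the base URL below is a placeholder assumption, not the actual endpoint, and `get_record_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Placeholder endpoint: substitute the real OAI-PMH base URL of the repository.
BASE_URL = "https://example.org/oai"

OAI_IDENTIFIER = "oai:www.ldc.upenn.edu:LDC2014S06"  # from the OaiIdentifier field


def get_record_url(base_url: str, identifier: str, metadata_prefix: str) -> str:
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


# "oai_dc" is the simple Dublin Core prefix required by OAI-PMH;
# "olac" is assumed here to be the prefix for the OLAC format.
dc_url = get_record_url(BASE_URL, OAI_IDENTIFIER, "oai_dc")
olac_url = get_record_url(BASE_URL, OAI_IDENTIFIER, "olac")
```

Taking the base URL as a parameter keeps the sketch independent of any particular aggregator or repository endpoint.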

Search Info
Citation: Martin, Alvin; Greenberg, Craig; Graff, David; Walker, Kevin; Brandschain, Linda. 2014. Linguistic Data Consortium.
Terms: area_Africa area_Americas area_Asia area_Europe country_AF country_BA country_CN country_ES country_ET country_FR country_GB country_GE country_HR country_HT country_IN country_KR country_NG country_PK country_PT country_RU country_TR country_UA country_VN dcmi_Sound iso639_amh iso639_bos iso639_cmn iso639_eng iso639_fas iso639_fra iso639_hat iso639_hau iso639_hin iso639_hrv iso639_kat iso639_kor iso639_por iso639_prs iso639_pus iso639_rus iso639_spa iso639_tur iso639_ukr iso639_urd iso639_vie iso639_yue olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2014S06
Up-to-date as of: Wed Oct 29 7:01:27 EDT 2025
