OLAC Record: 2015 NIST Language Recognition Evaluation Test Set

OLAC Record
oai:www.ldc.upenn.edu:LDC2025S02

Metadata

Title: 2015 NIST Language Recognition Evaluation Test Set

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Greenberg, Craig, et al. 2015 NIST Language Recognition Evaluation Test Set LDC2025S02. Web Download. Philadelphia: Linguistic Data Consortium, 2025

Contributor: Greenberg, Craig

Sadjadi, Omid

Graff, David

Walker, Kevin

Jones, Karen

Caruso, Christopher

Strassel, Stephanie

Wright, Jonathan

Date (W3CDTF): 2025

Date Issued (W3CDTF): 2025-03-17

Description: *Introduction* 2015 NIST Language Recognition Evaluation Test Set was developed by the Linguistic Data Consortium (LDC) and the National Institute of Standards and Technology (NIST). It contains the evaluation test set for the 2015 NIST Language Recognition Evaluation, approximately 867 hours of conversational telephone speech (CTS) and broadcast narrowband speech (BNBS) collected by LDC in 20 languages, grouped into 6 clusters of related languages: Arabic (Egyptian, Iraqi, Levantine, Maghrebi, Modern Standard Arabic); Spanish (Caribbean, European, Latin American, Brazilian Portuguese); English (British, Indian, General American English); Chinese (Cantonese, Mandarin, Min Nan, Wu); Slavic (Polish, Russian); and French (West African, Haitian Creole).

The goal of NIST's Language Recognition Evaluation (LRE) is to establish the baseline of current performance capability for language recognition of conversational telephone speech and to lay the groundwork for further research efforts in the field. NIST conducted language recognition evaluations in 1996, 2003, 2005, 2007, 2009, 2011, 2015, 2017, and 2022. LRE15 expanded the range of test segment durations and added a test condition that allowed systems to make use of unrestricted training data when developing models. Further information about the 2015 evaluation can be found in the 2015 NIST Language Recognition Evaluation Plan.

*Data* The test segments in this release were drawn from the Multi-Language Speech Corpus (MLS14) (CTS and BNBS data) and designated Babel corpora (CTS data). For the MLS14 CTS collection, a small number of native speakers known as "claques" were recruited for each language to make single calls to multiple individuals in their social network. Calls lasted 8-15 minutes and speakers were free to discuss any topic. The BNBS data was collected by LDC from streaming and satellite radio programming, focusing on programs that included narrowband speech (e.g. call-ins to a talk show). Portions of the CTS callee call sides and portions of each broadcast recording were manually audited by native speakers to verify language and quality.

Additional test segments for two languages, Cantonese and Haitian Creole, were drawn from the IARPA Babel series, specifically CTS data collected in 2012-2013 from male and female speakers of a variety of ages using a range of phone types in diverse settings with varying noise conditions.

Test segments were extracted by NIST from MLS14 CTS callee call sides, narrowband portions of the MLS14 BNBS data, and designated Babel recordings. All test segments are presented in single channel, 16-bit 8 kHz linear PCM format with NIST SPHERE headers.

*Samples* SPHERE audio file

*Updates* None at this time.

Extent: Corpus size: 38160893 KB

Format: Sampling Rate: 8000

Sampling Format: linear pcm
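
As a rough check on what the stated audio format implies, the sketch below computes the uncompressed data rate for single channel, 16-bit, 8 kHz linear PCM and the payload size for approximately 867 hours of audio. The 1 KB = 1000 bytes convention is an assumption, and SPHERE header bytes are ignored. The record does not say whether the distributed files are compressed or how approximate the hour count is, so the result is not expected to match the Extent field exactly.

```python
# Back-of-the-envelope size of the audio payload implied by the record.
SAMPLE_RATE_HZ = 8000      # from "Sampling Rate: 8000"
BYTES_PER_SAMPLE = 2       # 16-bit linear PCM
CHANNELS = 1               # single channel
HOURS = 867                # "approximately 867 hours"

# Uncompressed data rate in bytes per second of audio.
bytes_per_second = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS

# Total uncompressed payload, ignoring SPHERE headers.
raw_bytes = bytes_per_second * HOURS * 3600

# Assumption: 1 KB = 1000 bytes (the record does not define its KB).
raw_kb = raw_bytes // 1000

print(f"{bytes_per_second} bytes/s, about {raw_kb} KB uncompressed")
```

Under these assumptions the uncompressed payload comes to roughly 49.9 million KB, compared with the 38160893 KB listed under Extent.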

Identifier: LDC2025S02

https://catalog.ldc.upenn.edu/LDC2025S02

ISLRN: 411-138-775-382-3

DOI: 4975-nz38

Language: Mesopotamian Arabic

North Levantine Arabic

Standard Arabic

Moroccan Arabic

Egyptian Arabic

English

Haitian

French

Portuguese

Spanish

Chinese

Wu Chinese

Yue Chinese

Min Dong Chinese

Polish

Russian

Language (ISO639): acm

apc

arb

ary

arz

eng

hat

fra

por

spa

zho

wuu

yue

cdo

pol

rus

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2025S02

Rights Holder: Portions © 2013 Agora Radio Group, © 2010 Al Arabiya Network, © 2010 Al Jazeera Media Network, © 2014 AM1430 Cantonese Radio Station, © 2013 BBC, © 2010 Beijing TV, © 2011 Bennett, Coleman & Company Limited, © 2013 BFBS, © 2010 Cable News Network. A Warner Bros. Discovery Company, © 2010 China Media Group, CCTV.com., © 2013 Foundation "BLAG", © 2013-2014 Global, © 2011 MSNBC Cable, L.L.C., © 2010 National Radio, © 2013 National State TV and Radio Company of the Republic of Belarus, © 2010 NTD, © 2010 Phoenix New Media Limited, © 2013 Radio Amistad 1090 AM, © 2013 Radio Station Pro., © 2013 radio.unal.edu.co, © 2013 Radio VIA, © 2013 RFI, © 2013 Spanish Radio and Television Corporation, © 2013 World Radio Network, Inc, © 2025 Trustees of the University of Pennsylvania

Type (DCMI): Sound

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2025S02

DateStamp: 2026-06-10

GetRecord: OAI-PMH request for simple DC format
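
The GetRecord entries above correspond to standard OAI-PMH requests, which take a `verb`, an `identifier`, and a `metadataPrefix`. A minimal sketch of building such request URLs follows. The base URL is a placeholder, since this record does not give the repository's OAI-PMH endpoint; `oai_dc` is the prefix OAI-PMH requires for simple Dublin Core, and `olac` is the prefix used for OLAC-format records.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: the record does not state the OAI-PMH base URL.
BASE_URL = "https://example.org/oai"


def get_record_url(identifier: str, metadata_prefix: str,
                   base_url: str = BASE_URL) -> str:
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


oai_id = "oai:www.ldc.upenn.edu:LDC2025S02"
olac_url = get_record_url(oai_id, "olac")    # OLAC format
dc_url = get_record_url(oai_id, "oai_dc")    # simple DC format
print(olac_url)
print(dc_url)
```

Replacing `BASE_URL` with the repository's actual endpoint is all that is needed to fetch the XML with any HTTP client.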

Search Info
Citation: Greenberg, Craig; Sadjadi, Omid; Graff, David; Walker, Kevin; Jones, Karen; Caruso, Christopher; Strassel, Stephanie; Wright, Jonathan. 2025. Linguistic Data Consortium.
Terms: area_Africa area_Americas area_Asia area_Europe country_CN country_EG country_ES country_FR country_GB country_HT country_IQ country_MA country_PL country_PT country_RU country_SA country_SY dcmi_Sound iso639_acm iso639_apc iso639_arb iso639_ary iso639_arz iso639_cdo iso639_eng iso639_fra iso639_hat iso639_pol iso639_por iso639_rus iso639_spa iso639_wuu iso639_yue iso639_zho
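
The search terms above follow a `prefix_value` pattern (`area_`, `country_`, `dcmi_`, `iso639_`). A minimal sketch of grouping them by prefix is shown below. The function name `group_terms` is hypothetical, and the input is a shortened excerpt of the Terms line rather than the full list.

```python
from collections import defaultdict

# Shortened excerpt of the Terms line, for illustration only.
terms_line = "area_Africa area_Asia country_CN country_EG dcmi_Sound iso639_acm iso639_yue"


def group_terms(line: str) -> dict[str, list[str]]:
    """Group whitespace-separated 'prefix_value' terms by their prefix."""
    groups: dict[str, list[str]] = defaultdict(list)
    for term in line.split():
        # Split on the first underscore only, so values keep any later ones.
        prefix, _, value = term.partition("_")
        groups[prefix].append(value)
    return dict(groups)


grouped = group_terms(terms_line)
print(grouped["iso639"])
print(grouped["country"])
```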

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2025S02
Up-to-date as of: Wed Jul 8 7:30:31 EDT 2026
