OLAC Record
oai:www.ldc.upenn.edu:LDC2025S02

Metadata
Title:2015 NIST Language Recognition Evaluation Test Set
Access Rights:Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining
Bibliographic Citation:Greenberg, Craig, et al. 2015 NIST Language Recognition Evaluation Test Set LDC2025S02. Web Download. Philadelphia: Linguistic Data Consortium, 2025
Contributor:Greenberg, Craig
Sadjadi, Omid
Graff, David
Walker, Kevin
Jones, Karen
Caruso, Christopher
Strassel, Stephanie
Wright, Jonathan
Date (W3CDTF):2025
Date Issued (W3CDTF):2025-03-17
Description:*Introduction* 2015 NIST Language Recognition Evaluation Test Set was developed by the Linguistic Data Consortium (LDC) and the National Institute of Standards and Technology (NIST). It contains the evaluation test set for the 2015 NIST Language Recognition Evaluation, approximately 867 hours of conversational telephone speech (CTS) and broadcast narrowband speech (BNBS) collected by LDC in 20 languages, over 6 clusters of related languages: Arabic (Egyptian, Iraqi, Levantine, Maghrebi, Modern Standard Arabic); Spanish (Caribbean, European, Latin American, Brazilian Portuguese); English (British, Indian, General American English); Chinese (Cantonese, Mandarin, Min Nan, Wu); Slavic (Polish, Russian); and French (West African, Haitian Creole). The goal of NIST's Language Recognition Evaluation (LRE) is to establish the baseline of current performance capability for language recognition of conversational telephone speech and to lay the groundwork for further research efforts in the field. NIST conducted language recognition evaluations in 1996, 2003, 2005, 2007, 2009, 2011, 2015, 2017, and 2022. LRE15 expanded the range of test segment durations and added a test condition that allowed systems to make use of unrestricted training data when developing models. Further information about the 2015 evaluation can be found in the 2015 NIST Language Recognition Evaluation Plan. *Data* The test segments in this release were drawn from the Multi-Language Speech Corpus (MLS14) (CTS and BNBS data) and designated Babel corpora (CTS data). For the MLS14 CTS collection, a small number of native speakers known as "claques" were recruited for each language to make single calls to multiple individuals in their social network. Calls lasted 8-15 minutes and speakers were free to discuss any topic. The BNBS data was collected by LDC from streaming and satellite radio programming, focusing on programs that included narrowband speech (e.g. call-ins to a talk show).
Portions of the CTS callee call sides and portions of each broadcast recording were manually audited by native speakers to verify language and quality. Additional test segments for two languages, Cantonese and Haitian Creole, were drawn from the IARPA Babel series, specifically, CTS data collected in 2012-2013 from male and female speakers of a variety of ages using a range of phone types in diverse settings with varying noise conditions. Test segments were extracted by NIST from MLS14 CTS callee call sides, narrowband portions of the MLS14 BNBS data, and from designated Babel recordings. All test segments are presented in single channel, 16-bit 8 kHz linear PCM format with NIST SPHERE headers. *Samples* SPHERE audio file *Updates* None at this time.
Extent:Corpus size: 38160893 KB
Format:Sampling Rate: 8000
Sampling Format: linear pcm
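The Description and Format fields state that the test segments are single channel, 16-bit, 8 kHz linear PCM with NIST SPHERE headers. As a rough illustration of what that means for a reader of the files, the sketch below parses a SPHERE header (an ASCII block that starts with "NIST_1A", gives its own size on the second line, lists "name -type value" fields, and closes with "end_head"). The function names and the demo header are hypothetical; the field names shown (channel_count, sample_rate, sample_n_bytes, sample_coding) are common SPHERE fields, but whether every file in this release carries exactly these fields is an assumption.

```python
import io


def parse_sphere_header(stream):
    """Parse a NIST SPHERE header from a binary stream positioned at byte 0.

    Returns (header_size_in_bytes, dict_of_fields).
    """
    magic = stream.readline().strip()
    if magic != b"NIST_1A":
        raise ValueError("not a NIST SPHERE file")
    # The second line holds the total header size in bytes (commonly 1024).
    header_size = int(stream.readline().strip())
    stream.seek(0)
    raw = stream.read(header_size).decode("ascii", errors="replace")

    fields = {}
    for line in raw.split("\n")[2:]:
        line = line.strip()
        if line == "end_head":
            break
        if not line or line.startswith(";"):
            continue  # skip padding and comment lines
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # ignore malformed lines in this sketch
        name, type_code, value = parts
        if type_code == "-i":
            fields[name] = int(value)
        elif type_code == "-r":
            fields[name] = float(value)
        else:  # "-sN": string of N characters
            fields[name] = value
    return header_size, fields


def build_demo_header():
    """Build a synthetic 1024-byte header matching the format described above.

    This is demo data only, not a header copied from the corpus.
    """
    lines = [
        "NIST_1A",
        "   1024",
        "channel_count -i 1",
        "sample_rate -i 8000",
        "sample_n_bytes -i 2",
        "sample_coding -s3 pcm",
        "end_head",
    ]
    text = "\n".join(lines) + "\n"
    return text.encode("ascii").ljust(1024, b" ")


if __name__ == "__main__":
    size, fields = parse_sphere_header(io.BytesIO(build_demo_header()))
    print(size, fields)
```

With a real file, the same function could be called on a handle opened in binary mode; the audio samples begin at the byte offset given by the returned header size.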
Identifier:LDC2025S02
https://catalog.ldc.upenn.edu/LDC2025S02
ISLRN: 411-138-775-382-3
DOI: 4975-nz38
Language:Mesopotamian Arabic
North Levantine Arabic
Standard Arabic
Moroccan Arabic
Egyptian Arabic
English
Haitian
French
Portuguese
Spanish
Chinese
Wu Chinese
Yue Chinese
Min Dong Chinese
Polish
Russian
Language (ISO639):acm
apc
arb
ary
arz
eng
hat
fra
por
spa
zho
wuu
yue
cdo
pol
rus
License:LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf
Medium:Distribution: Web Download
Publisher:Linguistic Data Consortium
Publisher (URI):https://www.ldc.upenn.edu
Relation (URI):https://catalog.ldc.upenn.edu/docs/LDC2025S02
Rights Holder:Portions © 2013 Agora Radio Group, © 2010 Al Arabiya Network, © 2010 Al Jazeera Media Network, © 2014 AM1430 Cantonese Radio Station, © 2013 BBC, © 2010 Beijing TV, © 2011 Bennett, Coleman & Company Limited, © 2013 BFBS, © 2010 Cable News Network. A Warner Bros. Discovery Company, © 2010 China Media Group, CCTV.com., © 2013 Foundation "BLAG", © 2013-2014 Global, © 2011 MSNBC Cable, L.L.C., © 2010 National Radio, © 2013 National State TV and Radio Company of the Republic of Belarus, © 2010 NTD, © 2010 Phoenix New Media Limited, © 2013 Radio Amistad 1090 AM, © 2013 Radio Station Pro., © 2013 radio.unal.edu.co, © 2013 Radio VIA, © 2013 RFI, © 2013 Spanish Radio and Television Corporation, © 2013 World Radio Network, Inc, © 2025 Trustees of the University of Pennsylvania

OLAC Info

Archive:  The LDC Corpus Catalog
Description:  http://www.language-archives.org/archive/www.ldc.upenn.edu
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:www.ldc.upenn.edu:LDC2025S02
DateStamp:  2025-03-17
GetRecord:  OAI-PMH request for simple DC format
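The GetRecord entries above refer to OAI-PMH requests for this record. As a minimal sketch of how such a request is formed, the snippet below assembles a GetRecord URL from the three arguments the protocol defines for that verb (verb, identifier, metadataPrefix). The base URL is a placeholder, since this record does not state the repository endpoint; "oai_dc" is the simple Dublin Core prefix defined by OAI-PMH, and "olac" is assumed here to be the prefix for the OLAC format.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical endpoint: the real repository base URL is not given in this record.
BASE_URL = "https://example.org/oai"

OAI_IDENTIFIER = "oai:www.ldc.upenn.edu:LDC2025S02"


def get_record_url(identifier, metadata_prefix, base_url=BASE_URL):
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


if __name__ == "__main__":
    # Simple Dublin Core and (assumed) OLAC variants of the same request.
    for prefix in ("oai_dc", "olac"):
        url = get_record_url(OAI_IDENTIFIER, prefix)
        print(url)
        print(parse_qs(urlsplit(url).query))
```

urlencode percent-encodes the colons in the identifier, which keeps the query string valid; parse_qs decodes them again on the receiving side.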

Search Info

Citation: Greenberg, Craig; Sadjadi, Omid; Graff, David; Walker, Kevin; Jones, Karen; Caruso, Christopher; Strassel, Stephanie; Wright, Jonathan. 2025. 2015 NIST Language Recognition Evaluation Test Set. Linguistic Data Consortium.
Terms: area_Africa area_Americas area_Asia area_Europe country_CN country_EG country_ES country_FR country_GB country_HT country_IQ country_MA country_PL country_PT country_RU country_SA country_SY iso639_acm iso639_apc iso639_arb iso639_ary iso639_arz iso639_cdo iso639_eng iso639_fra iso639_hat iso639_pol iso639_por iso639_rus iso639_spa iso639_wuu iso639_yue iso639_zho


http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2025S02
Up-to-date as of: Tue Mar 18 1:01:11 EDT 2025