OLAC Record: 2017 NIST Language Recognition Evaluation Training and Development Sets

OLAC Record
oai:www.ldc.upenn.edu:LDC2022S10

Metadata

Title: 2017 NIST Language Recognition Evaluation Training and Development Sets

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Greenberg, Craig, et al. 2017 NIST Language Recognition Evaluation Training and Development Sets LDC2022S10. Web Download. Philadelphia: Linguistic Data Consortium, 2022

Contributor: Greenberg, Craig

Sadjadi, Omid

Reynolds, Douglas

Singer, Elliot

Graff, David

Date (W3CDTF): 2022

Date Issued (W3CDTF): 2022-10-17

Description: *Introduction*

2017 NIST Language Recognition Evaluation Training and Development Sets contains training and development material for the 2017 NIST Language Recognition Evaluation. It consists of approximately 2,100 hours of conversational telephone speech, broadcast conversation, broadcast narrow band speech, and speech from video in the following 14 languages, dialects, and varieties: Arabic (Iraqi, Levantine, Maghrebi, Egyptian), English (British, American), Polish, Russian, Portuguese (Brazilian), Spanish (Caribbean, European, Latin American Continental), and Chinese (Mandarin, Min Nan).

The goal of the NIST (National Institute of Standards and Technology) Language Recognition Evaluation (LRE) is to establish the baseline of current performance capability for language recognition of conversational telephone speech and to lay the groundwork for further research efforts in the field. NIST conducted language recognition evaluations in 1996, 2003, 2005, 2007, 2009, 2011, and 2015.

The 2017 evaluation focused on differentiating closely related language pairs. In addition to conversational telephone speech, broadcast conversation, and broadcast narrow band speech, speech excerpts extracted from video data were used. Further information regarding this evaluation can be found in the evaluation plan, which is included in the documentation for this release.

LDC released the prior LREs as:

* 2003 NIST Language Recognition Evaluation (LDC2006S31)
* 2005 NIST Language Recognition Evaluation (LDC2008S05)
* 2007 NIST Language Recognition Evaluation Test Set (LDC2009S04)
* 2007 NIST Language Recognition Evaluation Supplemental Training Set (LDC2009S05)
* 2009 NIST Language Recognition Evaluation Test Set (LDC2014S06)
* 2011 NIST Language Recognition Evaluation Test Set (LDC2018S06)

*Data*

This release includes data from LDC's CALLFRIEND and Fisher telephone collections, the VAST video collection, various broadcast sources, and earlier NIST LRE test sets.

The training audio files are single-channel, with an 8 kHz sample rate, in NIST SPHERE format, encoded as mu-law, A-law, or 16-bit PCM. The development audio files are also single-channel but vary in format: either SPHERE or FLAC-compressed MSWAV (RIFF). All "*.flac" files are 16-bit PCM with a 44.1 kHz sample rate; the "*.sph" files are all 8 kHz, with either mu-law or 16-bit PCM samples.

*Samples*

Please view the following audio sample.

*Updates*

None at this time.
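The audio format described under *Data* can be checked against a file's SPHERE header. The following is a minimal sketch, not part of the release: it assumes the usual SPHERE layout (a "NIST_1A" line, a header-size line, then "name -type value" fields ending at "end_head") and the conventional field names sample_rate, channel_count, and sample_coding. The function names and the synthetic header are hypothetical illustrations.

```python
def parse_sphere_header(data: bytes) -> dict:
    """Parse the ASCII header found at the start of a NIST SPHERE file."""
    lines = data.split(b"\n", 2)
    if len(lines) < 3 or lines[0].strip() != b"NIST_1A":
        raise ValueError("not a NIST SPHERE header")
    header_size = int(lines[1].strip())  # total header length in bytes
    text = data[:header_size].decode("ascii", errors="replace")
    fields = {}
    for line in text.split("\n")[2:]:
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        name, kind, value = parts
        if kind == "-i":      # integer field
            fields[name] = int(value)
        elif kind == "-r":    # real-valued field
            fields[name] = float(value)
        else:                 # "-sN": string field of N bytes
            fields[name] = value
    return fields


def matches_training_spec(fields: dict) -> bool:
    """Check a header against the training-set description:
    single channel, 8 kHz, mu-law, A-law, or PCM samples."""
    # Assumption: a missing sample_coding field means PCM, and any
    # compression suffix after a comma can be ignored for this check.
    coding = fields.get("sample_coding", "pcm").split(",")[0]
    return (
        fields.get("channel_count") == 1
        and fields.get("sample_rate") == 8000
        and coding in {"ulaw", "alaw", "pcm"}
    )


# Synthetic header for illustration only; real files carry more fields.
sample_header = (
    b"NIST_1A\n   1024\n"
    b"channel_count -i 1\n"
    b"sample_rate -i 8000\n"
    b"sample_coding -s4 ulaw\n"
    b"end_head\n"
).ljust(1024)

info = parse_sphere_header(sample_header)
print(info["sample_rate"], matches_training_spec(info))
```

For a file on disk, reading the first 1024 bytes is normally enough to cover the header, though the size stated on the second line should be trusted over that assumption.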

Extent: Corpus size: 65699443 KB

Format: Sampling Rate: 8000, 44100

Sampling Format: PCM, u-law, a-law

Identifier: LDC2022S10

https://catalog.ldc.upenn.edu/LDC2022S10

ISBN: 1-58563-999-0

ISLRN: 854-427-979-036-7

DOI: 10.35111/awny-7397

Language: Arabic

English

Polish

Russian

Portuguese

Spanish

Mandarin Chinese

Min Nan Chinese

Language (ISO639): ara

eng

pol

rus

por

spa

cmn

nan

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2022S10

Rights Holder: Portions © 2013-2014 Agora Radio Group, © 2013 BBC, © 2013 Bethel Church of Redding, © 2013 BFBS, © 2013 Blago Foundation, © 2013 Brazil Communication Company, © 2010-2011 Cable News Network, LP, LLLP, © 2013 El Pando Zambrano.com, © 2013-2014 Global, © 2010-2011 New Tang Dynasty TV, © 2010-2011 Phoenix New Media Limited, © 2013 Radio Amistad, C.por A., © 2013 Radio UNAL, © 2013 Spanish Radio and Television Corporation, © 2013 The New Television of the South CA (TVSUR), © 2013 University of Puerto Rico Radio Network, © 2010 WorldNetCast/TVNET, © 2011-2018 You Tube, LLC, © 1996-1999, 2001-2011, 2013-2014, 2018, 2022 Trustees of the University of Pennsylvania

Type (DCMI): Sound

Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2022S10

DateStamp: 2026-06-10

GetRecord: OAI-PMH request for simple DC format
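
The GetRecord entries above correspond to OAI-PMH requests, which are ordinary HTTP queries carrying a verb, an identifier, and a metadataPrefix. As a rough sketch, such a request URL can be assembled as follows. The base URL here is a placeholder, not the archive's real endpoint; "oai_dc" is the prefix OAI-PMH defines for simple Dublin Core, while "olac" as the prefix for the OLAC format is an assumption.

```python
from urllib.parse import urlencode


def get_record_url(base_url: str, identifier: str, metadata_prefix: str) -> str:
    """Build an OAI-PMH GetRecord request URL for a single record."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


# Placeholder endpoint: substitute the repository's actual OAI-PMH base URL.
BASE_URL = "https://example.org/oai"
RECORD_ID = "oai:www.ldc.upenn.edu:LDC2022S10"

dc_url = get_record_url(BASE_URL, RECORD_ID, "oai_dc")   # simple DC format
olac_url = get_record_url(BASE_URL, RECORD_ID, "olac")   # assumed OLAC prefix
print(dc_url)
print(olac_url)
```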

Search Info
Citation: Greenberg, Craig; Sadjadi, Omid; Reynolds, Douglas; Singer, Elliot; Graff, David. 2022. Linguistic Data Consortium.
Terms: area_Asia area_Europe country_CN country_ES country_GB country_PL country_PT country_RU dcmi_Sound dcmi_Text iso639_ara iso639_cmn iso639_eng iso639_nan iso639_pol iso639_por iso639_rus iso639_spa olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2022S10
Up-to-date as of: Wed Jul 8 7:30:30 EDT 2026
