OLAC Record: 2006 NIST Spoken Term Detection Evaluation Set

OLAC Record
oai:www.ldc.upenn.edu:LDC2011S03

Metadata

Title: 2006 NIST Spoken Term Detection Evaluation Set

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: NIST Multimodal Information Group. 2006 NIST Spoken Term Detection Evaluation Set LDC2011S03. Web Download. Philadelphia: Linguistic Data Consortium, 2011

Contributor: NIST Multimodal Information Group

Date (W3CDTF): 2011

Date Issued (W3CDTF): 2011-07-15

Description: *Introduction* 2006 NIST Spoken Term Detection Evaluation Set, Linguistic Data Consortium (LDC) catalog number LDC2011S03 and ISBN 1-58563-584-7, was compiled by researchers at NIST (National Institute of Standards and Technology) and contains approximately eighteen hours of Arabic, Chinese and English broadcast news, English conversational telephone speech and English meeting room speech used in NIST's 2006 Spoken Term Detection (STD) evaluation. The STD initiative is designed to facilitate research and development of technology for retrieving information from archives of speech data with the goals of exploring promising new ideas in spoken term detection, developing advanced technology incorporating these ideas, measuring the performance of this technology and establishing a community for the exchange of research results and technical insights. The 2006 STD task was to find all of the occurrences of a specified term (a sequence of one or more words) in a given corpus of speech data. The evaluation was intended to develop technology for rapidly searching very large quantities of audio data. Although the evaluation used modest amounts of data, it was structured to simulate the very large data situation and to make it possible to extrapolate the speed measurements to much larger data sets. Therefore, systems were implemented in two phases: indexing and searching. In the indexing phase, the system processes the speech data without knowledge of the terms. In the searching phase, the system uses the terms, the index, and optionally the audio to detect term occurrences. The development data is available in 2006 NIST Spoken Term Detection Development Set LDC2011S02. *Data* The evaluation corpus consists of three data genres: broadcast news (BNews), conversational telephone speech (CTS) and conference room meetings (CONFMTG).
The broadcast news material was collected in 2003 and 2004 by LDC's broadcast collection system from the following sources: ABC (English), Aljazeera (Arabic), China Central TV (Chinese), CNN (English), CNBC (English), Dubai TV (Arabic), New Tang Dynasty TV (Chinese), Public Radio International (English) and Radio Free Asia (Chinese). The CTS data was taken from the Switchboard data sets (e.g., Switchboard-2 Phase 1 LDC98S75, Switchboard-2 Phase 2 LDC99S79) and the Fisher corpora (e.g., Fisher English Training Speech Part 1 LDC2004S13), also collected by LDC. The conference room meeting material consists of goal-oriented, small-group roundtable meetings and was collected in 2004 and 2005 by NIST, the International Computer Science Institute (Berkeley, California), Carnegie Mellon University (Pittsburgh, PA), TNO (The Netherlands) and Virginia Polytechnic Institute and State University (Blacksburg, VA) as part of the AMI corpus project. This evaluation corpus includes scoring software, which uses the inputs described in the STD Evaluation plan to complete the evaluation of a system. Each BNews recording is a 1-channel, PCM-encoded, 16 kHz, SPHERE-formatted file. CTS recordings are 2-channel, u-law encoded, 8 kHz, SPHERE-formatted files. The CONFMTG files contain a single recorded channel.
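
The two-phase structure described above (indexing without knowledge of the terms, then searching with the terms and the index) can be illustrated with a minimal Python sketch. It is not the method of any evaluated system and not the scoring software shipped with the corpus: it assumes time-aligned word transcripts are already available, whereas real STD systems typically index recognizer output. The names `build_index`, `search`, and the sample file identifiers are hypothetical.

```python
from collections import defaultdict

def build_index(transcripts):
    """Indexing phase: runs once over the data, with no knowledge of the terms.

    `transcripts` (assumed input) maps a file id to a list of
    (word, start_sec, end_sec) tuples in spoken order.
    """
    index = defaultdict(dict)
    for file_id, words in transcripts.items():
        for pos, (word, start, end) in enumerate(words):
            index[word.lower()][(file_id, pos)] = (start, end)
    return index

def search(index, term):
    """Searching phase: uses only the term and the index, not the audio.

    Returns sorted (file_id, start_sec, end_sec) hits for a term made of
    one or more consecutive words.
    """
    words = term.lower().split()
    if not words:
        return []
    hits = []
    for (file_id, pos), (start, _) in index.get(words[0], {}).items():
        end = None
        for offset, word in enumerate(words):
            span = index.get(word, {}).get((file_id, pos + offset))
            if span is None:
                break  # the next word of the term does not follow here
            end = span[1]
        else:
            hits.append((file_id, start, end))
    return sorted(hits)

# Toy data, invented for illustration only.
transcripts = {
    "bnews_01": [("spoken", 0.0, 0.4), ("term", 0.4, 0.7), ("detection", 0.7, 1.3)],
    "cts_01": [("term", 2.0, 2.3), ("limits", 2.3, 2.8)],
}
index = build_index(transcripts)
print(search(index, "term"))
print(search(index, "term detection"))
```

Because `search` touches only the index, its cost depends on the number of index entries for the term rather than on the duration of the audio, which is the property that lets speed measurements on a small corpus be extrapolated to larger ones.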

Extent: Corpus size: 1153433 KB

Format: Sampling Rate: 8000

Sampling Format: ulaw

Identifier: LDC2011S03

https://catalog.ldc.upenn.edu/LDC2011S03

ISBN: 1-58563-584-7

ISLRN: 244-296-223-213-3

DOI: 10.35111/8czt-kb94

Language: English

Mandarin Chinese

Standard Arabic

Arabic

Language (ISO639): eng

cmn

arb

ara

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2011S03

Rights Holder: Portions © 2003 American Broadcasting Corporation, © 2003 Aljazeera, © 2003 Cable News Network, LP, LLP, © 2004 China Central TV, © 2003 Dubai TV, © 2003 National Broadcasting Company, © 2004 New Tang Dynasty TV, © 2003 Public Radio International, © 1998, 1999, 2003, 2004, 2011 Trustees of the University of Pennsylvania

Type (DCMI): Sound

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2011S03

DateStamp: 2021-09-09

GetRecord: OAI-PMH request for simple DC format

Search Info
Citation: NIST Multimodal Information Group. 2011. Linguistic Data Consortium.
Terms: area_Asia area_Europe country_CN country_GB country_SA dcmi_Sound iso639_ara iso639_arb iso639_cmn iso639_eng olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2011S03
Up-to-date as of: Wed Oct 29 7:01:16 EDT 2025
