OLAC Record: BBN/AUB DARPA Babylon Levantine Arabic Speech and Transcripts

OLAC Record
oai:www.ldc.upenn.edu:LDC2005S08

Metadata

Title: BBN/AUB DARPA Babylon Levantine Arabic Speech and Transcripts

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: BBN Technologies (with American University of Beirut a subcontractor), et al. BBN/AUB DARPA Babylon Levantine Arabic Speech and Transcripts LDC2005S08. Web Download. Philadelphia: Linguistic Data Consortium, 2005

Contributor: BBN Technologies (with American University of Beirut a subcontractor)

Makhoul, John

Zawaydeh, Bushra

Choi, Frederick

Stallard, David

Date (W3CDTF): 2005

Date Issued (W3CDTF): 2005-01-15

Description: *Introduction* BBN/AUB DARPA Babylon Levantine Arabic Speech and Transcripts was developed by BBN Technologies and contains 60.6 hours of spontaneous speech recorded from subjects speaking Levantine colloquial Arabic, together with the associated transcripts. Levantine Arabic is the dialect of Arabic spoken in Lebanon, Jordan, Syria, and Palestine. It differs significantly from Modern Standard Arabic: it is a spoken rather than a written language, and its vocabulary and pronunciation differ from those of Modern Standard Arabic.

The corpus was developed with funding from the Defense Advanced Research Projects Agency (DARPA) as part of the Babylon program. The Babylon program was intended to advance the state of the art in speech-to-speech translation systems by creating new technology and by developing systems for field use. BBN was funded under Babylon to develop a limited English/Arabic refugee/medical speech translation system for a handheld computer, and it collected this corpus as part of that work. The corpus may be useful for speech recognition in Levantine colloquial Arabic, including for speech translation and spoken dialog systems.

Approximately 20% of the corpus was recorded by BBN using paid subjects recruited in the Boston area from May 2002 to September 2002. This portion of the corpus was the first to be collected. Subsequently, the remaining 80% was recorded by the American University of Beirut (AUB), under subcontract to BBN, from July 2002 to November 2002. AUB students and staff served as both experimenters and subjects. This portion of the corpus was recorded in Beirut, Lebanon, on the AUB campus.

*Data* For collection, 101 male and 63 female speakers were recorded responding to various prompts. Each response was saved as an individual file, one per utterance. The corpus contains both audio and transcription for 76,227 such utterances. The audio is stored as MS WAV files containing signed PCM samples, with a sampling rate of 16 kHz and 16-bit resolution. The transcriptions are saved as individual TXT files whose names match those of the audio files, and also as a single concatenated XML file. All transcriptions are Unicode Arabic, encoded in UTF-8. They omit the short-vowel diacritics, which are rarely written in Arabic.

*Samples* For an example of the data in this corpus, please listen to this audio sample (WAV) and view its transcription (TXT).

*Updates* None at this time.
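The file layout described under *Data* (one WAV file per utterance, with a TXT transcript of the same name encoded in UTF-8) can be checked with a short script. The following is a minimal sketch, not part of the corpus distribution. It assumes lowercase `.wav` and `.txt` extensions and that each transcript sits next to its audio file; the record states neither, and the function names `wav_matches_record` and `pair_utterances` are hypothetical.

```python
import wave
from pathlib import Path

EXPECTED_RATE_HZ = 16000    # 16 kHz sampling rate, as stated in the record
EXPECTED_SAMPLE_WIDTH = 2   # 16-bit resolution = 2 bytes per sample


def wav_matches_record(wav_path):
    """Return True if the WAV header reports 16 kHz, 16-bit PCM audio."""
    with wave.open(str(wav_path), "rb") as wav:
        return (wav.getframerate() == EXPECTED_RATE_HZ
                and wav.getsampwidth() == EXPECTED_SAMPLE_WIDTH)


def pair_utterances(root):
    """Yield (wav_path, transcript_text) for each WAV with a same-named TXT.

    Assumption: the transcript is in the same directory as the audio file
    and uses a lowercase .txt extension.
    """
    for wav_path in sorted(Path(root).rglob("*.wav")):
        txt_path = wav_path.with_suffix(".txt")
        if txt_path.exists():
            # Transcripts are Unicode Arabic encoded in UTF-8.
            yield wav_path, txt_path.read_text(encoding="utf-8").strip()
```

Calling `pair_utterances` on the directory that holds the unpacked corpus and passing each returned path to `wav_matches_record` gives a quick way to spot files that deviate from the stated format before training or evaluation.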

Format: Sampling Rate: 16000

Sampling Format: pcm

Identifier: LDC2005S08

https://catalog.ldc.upenn.edu/LDC2005S08

ISBN: 1-58563-296-1

ISLRN: 500-300-564-790-5

DOI: 10.35111/j1tn-v351

Language: Levantine Arabic

Arabic

Language (ISO639): ara

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2005S08

Rights Holder: Portions © 2003 BBNT Solutions LLC, © 2004, 2005 Trustees of the University of Pennsylvania

Type (DCMI): Sound

Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2005S08

DateStamp: 2022-01-31

GetRecord: OAI-PMH request for simple DC format
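
The GetRecord entries above refer to OAI-PMH requests for this record. As a rough sketch of what such a request looks like, the snippet below builds the query string from the OaiIdentifier shown in this record. The base URL is a placeholder, since the record does not state the repository endpoint. `oai_dc` is the metadata prefix OAI-PMH defines for simple Dublin Core; `olac` is assumed here to be the prefix the repository uses for the OLAC format. The helper name `get_record_url` is hypothetical.

```python
from urllib.parse import urlencode

# Placeholder endpoint: the record does not state the repository's base URL.
BASE_URL = "https://example.org/oai"


def get_record_url(identifier, metadata_prefix, base_url=BASE_URL):
    """Build an OAI-PMH GetRecord request URL for a single record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# Simple Dublin Core and (assumed prefix) OLAC versions of this record.
dc_url = get_record_url("oai:www.ldc.upenn.edu:LDC2005S08", "oai_dc")
olac_url = get_record_url("oai:www.ldc.upenn.edu:LDC2005S08", "olac")
```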

Search Info
Citation: BBN Technologies (with American University of Beirut a subcontractor); Makhoul, John; Zawaydeh, Bushra; Choi, Frederick; Stallard, David. 2005. Linguistic Data Consortium.
Terms: dcmi_Sound dcmi_Text iso639_ara olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2005S08
Up-to-date as of: Wed Oct 29 7:00:24 EDT 2025
