OLAC Record: Gulf Arabic Conversational Telephone Speech, Transcripts

OLAC Record
oai:www.ldc.upenn.edu:LDC2006T15

Metadata

Title: Gulf Arabic Conversational Telephone Speech, Transcripts

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Appen Pty Ltd. Gulf Arabic Conversational Telephone Speech, Transcripts LDC2006T15. Web Download. Philadelphia: Linguistic Data Consortium, 2006

Contributor: Appen Pty Ltd

Date (W3CDTF): 2006

Date Issued (W3CDTF): 2006-09-19

Description: *Introduction* Gulf Arabic Conversational Telephone Speech, Transcripts is a database developed by Appen Pty Ltd., Sydney, Australia, and contains transcripts of roughly 2,800 minutes of spontaneous telephone conversations in Colloquial Gulf Arabic. A total of 976 conversation sides from 975 Gulf Arabic speakers are provided (one speaker appears on two distinct calls). The average duration per side is about 5.7 minutes. The data was collected and transcribed in 2004 by Appen Pty Ltd. The corresponding speech files for these transcripts are available in Gulf Arabic Conversational Telephone Speech (LDC2006S43).

*Data* Each transcript file is a tab-delimited flat table in which each line contains information and text for a single contiguous utterance, presented via the following fields:

* Beginning time stamp in seconds, in square brackets ("[5.7189]")
* Ending time stamp in seconds, in square brackets
* Channel/speaker-ID ("A:" or "B:")
* "Consonant skeleton" orthography for the utterance, in UTF-8
* "Diacritized" orthography for the utterance, in ASCII

The ASCII field is the Buckwalter transliteration of the fully "vowelized" (pronunciation) form of the utterance. Within the fourth and fifth fields (the two orthography fields), word boundaries are marked by space characters in the normal way, following common Arabic orthographic convention (i.e. all definite articles and many conjunctions and prepositions are attached as prefixes to the following word).

Transcript tokens enclosed in single parentheses -- e.g. "(DHk)" -- are annotation marks for non-speech events or conditions, such as laughter or noise. Multi-token strings within single parentheses contain words in some other language (typically English) or some other Arabic dialect. Double parentheses, with or without tokens enclosed within them -- e.g. "(())", "((word))", or "((word1 word2))" -- mark regions where the transcriber could not determine with certainty what was said.

The "consonant skeleton" orthography is intended to reflect common orthographic practice in written Arabic (i.e. Modern Standard Arabic (MSA)), but without being bound strictly by the specific spellings of MSA words. That is, there may be novel (dialect-specific) words and changes of consonant quality (hence altered spelling) in words that are cognate between MSA and Gulf Arabic. The "vowelized" orthography is restricted to a character set that allows words to be rendered coherently in Arabic script (with all diacritics present as needed to represent short vowels, etc.), but is intended to reflect the perceived pronunciation of each token. As a result, a given word (type) that occurs several times in the text with identical "skeletal" spellings may have multiple distinct "vowelized" spellings. In some cases, these different spellings simply reflect pronunciation variants; in other cases, they represent distinct morphological forms (with distinct contextual meanings) where the semantic differences are conveyed solely by the short vowels (i.e. the diacritics).

*Samples* Please view this transcript sample (TXT).

*Updates* None at this time.
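The five-field line layout and the parenthesis conventions described above can be illustrated with a minimal Python sketch. It is not part of the corpus distribution: the names `Utterance`, `parse_line`, and `classify_spans` are hypothetical, the sample line is invented for illustration, and the single-token versus multi-token distinction inside single parentheses is applied as a simple heuristic based on the description.

```python
import re
from dataclasses import dataclass

# Time stamps appear as seconds in square brackets, e.g. "[5.7189]".
TIME_RE = re.compile(r"^\[(\d+(?:\.\d+)?)\]$")
# Double parentheses are tried first so "((word))" is not read as nested singles.
SPAN_RE = re.compile(r"\(\((.*?)\)\)|\(([^()]*)\)")


@dataclass
class Utterance:
    start: float      # beginning time stamp, seconds
    end: float        # ending time stamp, seconds
    channel: str      # "A" or "B"
    skeleton: str     # consonant-skeleton orthography (UTF-8 Arabic script)
    vowelized: str    # diacritized form in Buckwalter transliteration (ASCII)


def parse_line(line: str) -> Utterance:
    """Split one tab-delimited transcript line into its five fields."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 tab-separated fields, got {len(fields)}")
    start = TIME_RE.match(fields[0].strip())
    end = TIME_RE.match(fields[1].strip())
    if start is None or end is None:
        raise ValueError("time stamps must look like [5.7189]")
    channel = fields[2].strip().rstrip(":")
    if channel not in ("A", "B"):
        raise ValueError(f"unexpected channel label: {fields[2]!r}")
    return Utterance(float(start.group(1)), float(end.group(1)),
                     channel, fields[3], fields[4])


def classify_spans(text: str) -> list[tuple[str, str]]:
    """Label parenthesized spans as uncertain, foreign, or non_speech."""
    spans = []
    for match in SPAN_RE.finditer(text):
        if match.group(1) is not None:
            # "(())" or "((...))": transcriber was unsure of the content.
            spans.append(("uncertain", match.group(1).strip()))
        else:
            inner = match.group(2).strip()
            # Heuristic: several tokens suggest other-language material,
            # a single token suggests a non-speech annotation mark.
            kind = "foreign" if len(inner.split()) > 1 else "non_speech"
            spans.append((kind, inner))
    return spans


if __name__ == "__main__":
    # Invented sample line, not taken from the corpus.
    sample = "[5.7189]\t[7.2500]\tA:\tكتب\t(DHk) katab ((word1 word2))\n"
    utt = parse_line(sample)
    print(utt.channel, utt.start, utt.end)
    print(classify_spans(utt.vowelized))
```

Buckwalter transliteration does not use parentheses as letter symbols, which is why the sketch can treat every parenthesis in the ASCII field as markup.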

Extent: Corpus size: 11264 KB

Identifier: LDC2006T15

https://catalog.ldc.upenn.edu/LDC2006T15

ISBN: 1-58563-401-8

ISLRN: 647-896-139-023-9

DOI: 10.35111/6m9w-v698

Language: Gulf Arabic

Language (ISO639): afb

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2006T15

Rights Holder: Portions © 2006 Trustees of the University of Pennsylvania

Type (DCMI): Text

Type (OLAC): lexicon

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2006T15

DateStamp: 2023-01-11

GetRecord: OAI-PMH request for simple DC format
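The GetRecord entries above refer to OAI-PMH requests for this record. As a rough sketch of how such a request URL is assembled, the Python below builds the query string from the record's OAI identifier. The base URL is a placeholder assumption (the actual endpoint is not stated in this record), and `get_record_url` is a hypothetical helper; `verb`, `identifier`, and `metadataPrefix` are the standard OAI-PMH GetRecord arguments, with `olac` and `oai_dc` as the metadata prefixes conventionally used for the OLAC and simple DC formats.

```python
from urllib.parse import urlencode

# Placeholder endpoint: an assumption, not the archive's real base URL.
BASE_URL = "https://example.org/oai"


def get_record_url(identifier: str, metadata_prefix: str,
                   base_url: str = BASE_URL) -> str:
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


if __name__ == "__main__":
    oai_id = "oai:www.ldc.upenn.edu:LDC2006T15"
    print(get_record_url(oai_id, "olac"))    # OLAC format
    print(get_record_url(oai_id, "oai_dc"))  # simple Dublin Core format
```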

Search Info
Citation: Appen Pty Ltd. 2006. Linguistic Data Consortium.
Terms: area_Asia country_KW dcmi_Text iso639_afb olac_lexicon olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2006T15
Up-to-date as of: Wed Oct 29 7:00:06 EDT 2025
