OLAC Record: Arabic Speech Corpus

OLAC Record
oai:catalogue.elra.info:ELRA-S0384

Metadata

Title: Arabic Speech Corpus

Access Rights: Rights available for: commercialUse, attribution

Date Available (W3CDTF): 2016-08-19

Date Issued (W3CDTF): 2016-08-19

Date Modified (W3CDTF): 2017-07-05

Description: This speech corpus was developed as part of PhD work carried out by Nawar Halabi at the University of Southampton. The corpus was recorded in a professional studio with a Neumann TLM 103 studio microphone by one male speaker in South Levantine Arabic (Damascene accent). The transcript was collected from “Aljazeera Learn” (Aljazeera 2015), a language learning website chosen because it contains fully diacritised text, which makes it easier to phonetise. The transcript was split into utterances based on punctuation, to make it easier for the speaker during the recording sessions. Speech synthesized from this corpus has produced a high-quality, natural voice.
The corpus consists of 1813 utterances totalling 3.7 hours:
- 2.1 hours of normal utterances,
- 1.6 hours of nonsense utterances (utterances that are not semantically, orthographically or syntactically correct).
This package corresponds to version 2.0 of the corpus and includes:
- 1813 .wav files containing spoken utterances,
- 1813 .lab files containing text utterances,
- 1813 .TextGrid files containing the phoneme labels with time stamps of the boundaries where these occur in the .wav files. These files can be opened using Praat software (see http://www.fon.hum.uva.nl/praat/),
- phonetic transcriptions, gathered in a single text file with the form "[wav_filename]" "[Phoneme Sequence]" on every line,
- orthographic transcriptions, gathered in a single text file with the form "[wav_filename]" "[Orthographic Transcript]" on every line. Orthography is in Buckwalter format (see http://www.qamus.org/transliteration.htm), which is easier to handle in software that does not read Arabic script. It can be easily converted back to Arabic,
- an extra set of 18 minutes of fully annotated corpus, used to evaluate the corpus (separate from the above but with the same structure).
Arabic Speech Corpus by Nawar Halabi is licensed either under a Creative Commons Attribution License or under the ELRA VAR agreement for commercial use.

Identifier: ELRA-S0384

ISLRN: 866-568-447-697-8

Identifier (URI): https://catalog.elra.info/en-us/repository/browse/ELRA-S0384/

Language: Arabic

Language (ISO639): ara

Medium: Not specified

Publisher: ELRA (European Language Resources Association)

Type (DCMI): Sound

Type (OLAC): primary_text

OLAC Info

Archive: ELRA Catalogue of Language Resources

Description: http://www.language-archives.org/archive/catalogue.elra.info

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:catalogue.elra.info:ELRA-S0384

DateStamp: 2016-08-19

GetRecord: OAI-PMH request for simple DC format
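
The GetRecord requests listed above follow the general OAI-PMH URL pattern, which can be sketched as below. The base URL is a placeholder, since this record does not state the archive's OAI-PMH endpoint; "oai_dc" is the metadata prefix that OAI-PMH defines for simple Dublin Core, and "olac" is assumed here to be the prefix for the OLAC format. The function name is hypothetical.

```python
from urllib.parse import urlencode

# Placeholder endpoint (assumption): replace with the archive's real OAI-PMH base URL.
BASE_URL = "https://example.org/oai"


def get_record_url(base_url, identifier, metadata_prefix):
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


dc_url = get_record_url(BASE_URL, "oai:catalogue.elra.info:ELRA-S0384", "oai_dc")
olac_url = get_record_url(BASE_URL, "oai:catalogue.elra.info:ELRA-S0384", "olac")
```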

Search Info
Citation: n.a. 2016. ELRA (European Language Resources Association).
Terms: dcmi_Sound iso639_ara olac_primary_text

http://www.language-archives.org/item.php/oai:catalogue.elra.info:ELRA-S0384
Up-to-date as of: Wed Jul 15 7:05:19 EDT 2026
