OLAC Record: NEMLAR Speech Synthesis Corpus

OLAC Record
oai:catalogue.elra.info:ELRA-S0220

Metadata

Title: NEMLAR Speech Synthesis Corpus

Access Rights: Rights available for: nonCommercialUse, commercialUse

Date Available (W3CDTF): 2006-08-11

Date Issued (W3CDTF): 2006-08-11

Date Modified (W3CDTF): 2007-02-22

Description: This corpus was produced within the NEMLAR project (http://www.nemlar.org). Two other resources produced within the same project are also available: the NEMLAR Written Corpus (ELRA-W0042) and the NEMLAR Broadcast News Speech Corpus (ELRA-S0219).
The NEMLAR Speech Synthesis Corpus contains recordings of 2 native Egyptian Arabic speakers (one male, one female, aged 35 and 27 respectively), recorded in a studio over 2 channels (voice + laryngograph). The recordings comprise more than 10 hours of data with transcriptions.
Speech samples are stored at 96 kHz, 24 bit, as signed integers with the least significant byte first (“lohi” or Intel format).
The speaker read 2,032 prompted sentences covering approx. 42,000 words in three categories: transcribed speech (6,600 words - 20%), written text (16,500 words - 50%), and constructed phrases (10,300 words - 30%).
The transcribed speech consists of text from different domains, produced in the Broadcast News task. The written text consists of news excerpts, novels and short stories with short sentences. Each paragraph is presented on a separate prompt sheet.
Constructed phrases consist of frequently used phrases and diphone coverage sentences. The frequently used phrases are derived from written text (articles, newspapers, etc.) and are divided into six sub-domains:
• Frequently used colloquial expressions
• Sports/Games
• News
• Finance
• Culture/Entertainment
• Consumer Information
The diphone coverage sentences cover the diphones that are missing or rare in the rest of the data. These sentences were extracted from a large corpus of about 150,000 words.
The database is provided with orthographic, prosodic and phonetic transcriptions in SAMPA. All transcriptions are segmented at the utterance (sentence/command word) level, annotated at the word level and checked manually.
A pronunciation lexicon of 3,589 headwords with phonetics in SAMPA is also available.
The database is distributed on 3 ISO 9660 DVD-ROM volumes. It has been validated by an external partner and a validation report is provided.
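
The sample format stated above (24-bit signed integers, least significant byte first) can be illustrated with a minimal Python sketch. It assumes the bytes passed in are raw sample data with no file header; the record does not say whether the files carry a header or how the two channels are laid out, so the function name `decode_pcm24_le` and the headerless, single-stream input are assumptions made only for illustration.

```python
SAMPLE_RATE_HZ = 96_000   # sampling rate stated in the record
BYTES_PER_SAMPLE = 3      # 24 bit = 3 bytes per sample


def decode_pcm24_le(raw: bytes) -> list[int]:
    """Decode headerless 24-bit little-endian signed PCM into Python ints.

    Assumption: `raw` holds only sample bytes (no header) for one stream.
    """
    if len(raw) % BYTES_PER_SAMPLE != 0:
        raise ValueError("byte count is not a multiple of 3")
    return [
        # "little" = least significant byte first; signed=True sign-extends bit 23
        int.from_bytes(raw[i:i + BYTES_PER_SAMPLE], "little", signed=True)
        for i in range(0, len(raw), BYTES_PER_SAMPLE)
    ]


def duration_seconds(sample_count: int) -> float:
    """Duration of `sample_count` samples of one channel at 96 kHz."""
    return sample_count / SAMPLE_RATE_HZ
```

For example, the three bytes `00 00 80` decode to the most negative 24-bit value and `FF FF 7F` to the most positive one, which is a quick way to check that byte order and sign handling are correct.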

Identifier: ELRA-S0220

ISLRN: 361-216-121-305-9

Identifier (URI): https://catalog.elra.info/en-us/repository/browse/ELRA-S0220/

Language: Arabic

Language (ISO639): ara

Medium: Not specified

Publisher: ELRA (European Language Resources Association)

Type (DCMI): Sound

Type (OLAC): primary_text

OLAC Info

Archive: ELRA Catalogue of Language Resources

Description: http://www.language-archives.org/archive/catalogue.elra.info

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:catalogue.elra.info:ELRA-S0220

DateStamp: 2006-08-11

GetRecord: OAI-PMH request for simple DC format
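
The GetRecord entries refer to OAI-PMH requests for this record. As a rough sketch, such a request is an HTTP GET with the `verb`, `identifier` and `metadataPrefix` arguments defined by OAI-PMH. The base URL below is a placeholder, because the record does not state the repository's OAI-PMH endpoint, and the prefix names `olac` and `oai_dc` are assumed to correspond to the "OLAC format" and "simple DC format" mentioned here. The helper name `build_get_record_url` is hypothetical.

```python
from urllib.parse import urlencode

# Placeholder endpoint: the real OAI-PMH base URL is not given in this record.
BASE_URL = "https://example.org/oai"
RECORD_ID = "oai:catalogue.elra.info:ELRA-S0220"


def build_get_record_url(base_url: str, identifier: str, metadata_prefix: str) -> str:
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,  # assumed: "olac" or "oai_dc"
    })
    return f"{base_url}?{query}"


olac_url = build_get_record_url(BASE_URL, RECORD_ID, "olac")
dc_url = build_get_record_url(BASE_URL, RECORD_ID, "oai_dc")
```

The identifier contains colons, so it is percent-encoded by `urlencode` rather than concatenated into the URL by hand.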

Search Info
Citation: n.a. 2006. ELRA (European Language Resources Association).
Terms: dcmi_Sound iso639_ara olac_primary_text

http://www.language-archives.org/item.php/oai:catalogue.elra.info:ELRA-S0220
Up-to-date as of: Wed Jul 15 7:03:38 EDT 2026
