OLAC Record oai:catalogue.elra.info:ELRA-W0042 |
Metadata | ||
Title: | NEMLAR Written Corpus | |
Access Rights: | Rights available for: nonCommercialUse, commercialUse | |
Date Available (W3CDTF): | 2006-08-11 | |
Date Issued (W3CDTF): | 2006-08-11 | |
Date Modified (W3CDTF): | 2007-02-22 | |
Description: | This corpus was produced within the NEMLAR project (http://www.nemlar.org). Two other resources produced within the same project are also available: the NEMLAR Broadcast News Speech Corpus (ELRA-S0219) and the NEMLAR Speech Synthesis Corpus (ELRA-S0220).

The NEMLAR Written Corpus consists of about 500,000 words of Arabic text from 13 different categories, aiming to achieve a well-balanced corpus that represents the variety of syntactic, semantic and pragmatic features of the modern Arabic language. The categories are:
• Political news: 48,000 words
• Political debate: 30,000 words
• Islamic text (Preaching and others): 29,000 words
• Phrases of common words: 8,500 words
• Text from broadcast news: 5,500 words
• Business: 20,000 words
• Arabic literature: 30,000 words
• General news: 100,000 words
• Interviews: 56,000 words
• Scientific press: 50,000 words
• Sports press: 50,000 words
• Dictionary entries explanation: 52,000 words
• Legal domain text: 21,000 words

The data span the period from the late 1990s to 2005.

The corpus is provided in 4 different versions:
• Raw text
• Fully vowelized text
• Text with Arabic lexical analysis
• Text with Arabic POS tags

Diacritics, lexical analysis and POS tags were generated by RDI’s tool Fassieh©. The accuracy of the automatic analysis is around 95%. To reach the accuracy rate of about 99% defined for this corpus, linguists used the visual revision mode of Fassieh©, in which the linguist either approves the first, most likely analysis (most of the time) or selects another one manually (in about 4% of the cases).

The database is distributed on 1 ISO 9660 CD-ROM volume. It has been validated by an external partner and a validation report is provided. | |
Identifier: | ELRA-W0042 | |
ISLRN: 050-693-158-326-9 | ||
Identifier (URI): | https://catalog.elra.info/en-us/repository/browse/ELRA-W0042/ | |
Language: | Arabic | |
Language (ISO639): | ara | |
Medium: | Not specified | |
Publisher: | ELRA (European Language Resources Association) | |
Type (DCMI): | Text | |
Type (OLAC): | primary_text | |
OLAC Info |
||
Archive: | ELRA Catalogue of Language Resources | |
Description: | http://www.language-archives.org/archive/catalogue.elra.info | |
GetRecord: | OAI-PMH request for OLAC format | |
GetRecord: | Pre-generated XML file | |
OAI Info |
||
OaiIdentifier: | oai:catalogue.elra.info:ELRA-W0042 | |
DateStamp: | 2006-08-11 | |
GetRecord: | OAI-PMH request for simple DC format | |
Search Info | ||
Citation: | n.a. 2006. ELRA (European Language Resources Association). | |
Terms: | dcmi_Text iso639_ara olac_primary_text |
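
The category breakdown given in the Description field can be checked with a few lines of Python. This is only a minimal sketch: the word counts are copied from the record above, while the names `CATEGORY_WORDS` and `category_shares` are hypothetical and not part of any ELRA or OLAC tooling.

```python
# Word counts per category, copied from the Description field of this record.
# The dictionary name and the helper below are illustrative assumptions only.
CATEGORY_WORDS = {
    "Political news": 48_000,
    "Political debate": 30_000,
    "Islamic text (Preaching and others)": 29_000,
    "Phrases of common words": 8_500,
    "Text from broadcast news": 5_500,
    "Business": 20_000,
    "Arabic literature": 30_000,
    "General news": 100_000,
    "Interviews": 56_000,
    "Scientific press": 50_000,
    "Sports press": 50_000,
    "Dictionary entries explanation": 52_000,
    "Legal domain text": 21_000,
}


def category_shares(counts):
    """Return each category's share of the total word count as a fraction."""
    total = sum(counts.values())
    return {name: words / total for name, words in counts.items()}


total_words = sum(CATEGORY_WORDS.values())
shares = category_shares(CATEGORY_WORDS)

# Print the categories from largest to smallest, with their share of the corpus.
for name, words in sorted(CATEGORY_WORDS.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:40s} {words:>8,d} {shares[name]:6.1%}")
print(f"{'Total':40s} {total_words:>8,d}")
```

The 13 listed figures add up to 500,000 words, in line with the "about 500,000 words" stated in the description; General news is the largest category at one fifth of the total.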