OLAC Record
oai:www.ldc.upenn.edu:LDC2004T18

Metadata
Title:Arabic English Parallel News Part 1
Access Rights:Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining
Bibliographic Citation:Linguistic Data Consortium. Arabic English Parallel News Part 1 LDC2004T18. Web Download. Philadelphia: Linguistic Data Consortium, 2004
Contributor:Linguistic Data Consortium
Date (W3CDTF):2004
Date Issued (W3CDTF):2004-10-26
Description:*Introduction* Arabic English Parallel News Part 1 was developed by the Linguistic Data Consortium (LDC) and contains Arabic news stories and their English translations aligned at the sentence level, totaling approximately 2 million Arabic words and 2.5 million English words. *Data* LDC collected the data in this corpus via Ummah Press Service from January 2001 to September 2004. It totals 8,439 story pairs and 68,685 sentence pairs. All data files are SGML documents. Ummah Press Service publishes weekly digests. Each issue of the Ummah publication contains a series of articles from various Arabic newspapers (e.g., Al-Ahram, Al-Hayat, Asharq Al-Awsat, Al-Hakika, Al-Alam Al-Youm, Al-Gomhouria, Al-Ittihad) and their English translations. Ummah sent each weekly issue to LDC by email in CP1256 or UTF-8. The emails were decoded and reformatted, and the Arabic text was converted to UTF-8 where necessary. The data came aligned at the story level but not at the sentence level. The sentence alignment was done at LDC using Champollion v1.1. *Samples* For an example of the data in this corpus, please view this Arabic example (SGM) and this English example (SGM). *Updates* None at this time.
Identifier:LDC2004T18
https://catalog.ldc.upenn.edu/LDC2004T18
ISBN: 1-58563-310-0
ISLRN: 233-597-996-883-6
DOI: 10.35111/et6p-7264
Language:English
Standard Arabic
Language (ISO639):eng
arb
License:LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf
Medium:Distribution: Web Download
Publisher:Linguistic Data Consortium
Publisher (URI):https://www.ldc.upenn.edu
Rights Holder:Portions © 2001-2004 Ummah Press Service, © 2003-2004 Trustees of the University of Pennsylvania
Type (DCMI):Text
Type (OLAC):primary_text
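
The Description field says the incoming text arrived in CP1256 or UTF-8 and was converted to UTF-8 if necessary. The record does not say how LDC performed that conversion, so the following is only a minimal sketch of one common approach: try UTF-8 first and fall back to CP1256. The function name `to_utf8` is hypothetical and is not part of any LDC tool.

```python
def to_utf8(raw: bytes) -> bytes:
    """Return UTF-8 bytes for input that is either UTF-8 or CP1256.

    Assumption: the input uses one of these two encodings, as the
    record states. This is an illustration, not LDC's actual pipeline.
    """
    try:
        # Valid UTF-8 decodes cleanly; Arabic text in CP1256 almost
        # always contains byte sequences that are invalid as UTF-8.
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Fall back to the Windows Arabic code page.
        text = raw.decode("cp1256")
    return text.encode("utf-8")
```

Trying UTF-8 first matters because CP1256 assigns a character to every byte value, so decoding as CP1256 first would silently accept UTF-8 input and produce garbled text.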

OLAC Info

Archive:  The LDC Corpus Catalog
Description:  http://www.language-archives.org/archive/www.ldc.upenn.edu
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:www.ldc.upenn.edu:LDC2004T18
DateStamp:  2022-04-01
GetRecord:  OAI-PMH request for simple DC format
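
The GetRecord entries above refer to OAI-PMH requests. As a rough sketch, such a request is an HTTP GET with `verb`, `identifier`, and `metadataPrefix` parameters. The base URL below is a placeholder, since this record does not state the repository endpoint, and the helper name `get_record_url` is hypothetical. `oai_dc` is the standard OAI-PMH prefix for simple Dublin Core; `olac` is assumed to be the prefix for the OLAC format.

```python
from urllib.parse import urlencode


def get_record_url(base_url: str, identifier: str,
                   metadata_prefix: str = "oai_dc") -> str:
    """Build an OAI-PMH GetRecord request URL.

    base_url is whatever endpoint the repository exposes; it is an
    input here because the record itself does not list it.
    """
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# Placeholder endpoint, for illustration only.
url = get_record_url("https://example.org/oai",
                     "oai:www.ldc.upenn.edu:LDC2004T18")
```

Passing `metadata_prefix="olac"` instead would request the OLAC-format record, assuming the repository supports that prefix.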

Search Info

Citation: Linguistic Data Consortium. 2004. Linguistic Data Consortium.
Terms: area_Asia area_Europe country_GB country_SA dcmi_Text iso639_arb iso639_eng olac_primary_text


http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2004T18
Up-to-date as of: Fri Dec 6 7:46:56 EST 2024