OLAC Record: Arabic Newswire English Translation Collection

OLAC Record
oai:www.ldc.upenn.edu:LDC2009T22

Metadata

Title: Arabic Newswire English Translation Collection

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Ma, Xiaoyi, and Dalal Zakhary. Arabic Newswire English Translation Collection LDC2009T22. Web Download. Philadelphia: Linguistic Data Consortium, 2009

Contributor: Ma, Xiaoyi

Zakhary, Dalal

Date (W3CDTF): 2009

Date Issued (W3CDTF): 2009-08-18

Description: *Introduction*

Arabic Newswire English Translation Collection was developed by the Linguistic Data Consortium (LDC) and consists of approximately 550,000 words of Arabic newswire text and its English translation from Agence France Presse (France), An Nahar (Lebanon) and Assabah (Tunisia).

The source Arabic text in this release is contained in LDC's Arabic Treebank series, specifically Part 1 (Part 1 v. 2.0; Part 1 v. 3.0), Part 3 (Part 3 v. 1.0; Part 3 v. 2.0) and Part 4 (Part 4 v. 1.0). A subset of Agence France Presse (AFP) source text from Arabic Treebank: Part 1 v. 2.0 was previously translated and released by LDC in Arabic Treebank: Part 1 - 10K-word English Translation, LDC2003T07. Note that the 49 translations for this AFP subset are not included in this release, resulting in a total of 1,682 translations for the 1,731 source stories.

The English translations in this corpus were provided by translation agencies using LDC's Arabic Translation Guidelines. Although multiple translation agencies worked on both the An Nahar and Assabah sources, each document has a single translation.

*Data*

The number of stories and their epochs for each source are as follows:

Source      Stories   Epoch
AFP         734       July 2000 - November 2000
An Nahar    600       January 2002 - December 2002
Assabah     397       September 2004 - November 2004
Total       1,731

Word count of Arabic tokens by source is shown in the following table:

Source      Arabic tokens
AFP         102,564
An Nahar    299,681
Assabah     149,259
Total       551,504

The original source files used different encodings for the Arabic characters, including UTF-8 and ASMO. SGML tags were used for marking sentence and paragraph boundaries and for annotating other information about each story. All Arabic source data was converted to UTF and most SGML tags were removed or replaced by "plain text" markers.

*Samples*

* Arabic Source
* English Translation
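
The totals quoted in the Description can be checked with a few lines of arithmetic. The following is only an illustrative sketch: the variable names are made up for this example, and the figures are copied by hand from the Description rather than read from the corpus itself.

```python
# Per-source figures copied from the Description field (not read from the corpus).
stories = {"AFP": 734, "An Nahar": 600, "Assabah": 397}
arabic_tokens = {"AFP": 102_564, "An Nahar": 299_681, "Assabah": 149_259}

# The 49 AFP translations released earlier in LDC2003T07 are excluded here.
excluded_afp_translations = 49

total_stories = sum(stories.values())
total_tokens = sum(arabic_tokens.values())
translations_in_release = total_stories - excluded_afp_translations

print(total_stories, total_tokens, translations_in_release)
# prints: 1731 551504 1682
```

The three results agree with the story total, the Arabic token total, and the translation count stated in the record.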

Extent: Corpus size: 14028 KB

Identifier: LDC2009T22

https://catalog.ldc.upenn.edu/LDC2009T22

ISBN: 1-58563-521-9

ISLRN: 677-375-027-082-6

DOI: 10.35111/ehq4-xc75

Language: English

Standard Arabic

Arabic

Language (ISO639): eng

arb

ara

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2009T22

Rights Holder: Portions © 2000 Agence-France Presse, © 2002 An Nahar, © 2004 Assabah, © 2002-2005, 2009 Trustees of the University of Pennsylvania

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2009T22

DateStamp: 2020-11-30

GetRecord: OAI-PMH request for simple DC format

Search Info
Citation: Ma, Xiaoyi; Zakhary, Dalal. 2009. Linguistic Data Consortium.
Terms: area_Asia area_Europe country_GB country_SA dcmi_Text iso639_ara iso639_arb iso639_eng olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2009T22
Up-to-date as of: Wed Oct 29 7:01:09 EDT 2025
