OLAC Record: Arabic Gigaword Fifth Edition

OLAC Record
oai:www.ldc.upenn.edu:LDC2011T11

Metadata

Title: Arabic Gigaword Fifth Edition

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Parker, Robert, et al. Arabic Gigaword Fifth Edition LDC2011T11. Web Download. Philadelphia: Linguistic Data Consortium, 2011

Contributor: Parker, Robert

Contributor: Graff, David

Contributor: Chen, Ke

Contributor: Kong, Junbo

Contributor: Maeda, Kazuaki

Date (W3CDTF): 2011

Date Issued (W3CDTF): 2011-10-21

Description:

*Introduction*

Arabic Gigaword Fifth Edition, Linguistic Data Consortium (LDC) catalog number LDC2011T11 and ISBN 1-58563-595-2, was produced by LDC. It is a comprehensive archive of newswire text data acquired from Arabic news sources by LDC at the University of Pennsylvania.

Arabic Gigaword Fifth Edition includes all of the content of the fourth edition of Arabic Gigaword (LDC2009T30) plus new data covering the period from January 2009 through December 2010.

Nine distinct sources of Arabic newswire are represented here:

* Asharq Al-Awsat (aaw_arb)
* Agence France Presse (afp_arb)
* Al-Ahram (ahr_arb)
* Assabah (asb_arb)
* Al Hayat (hyt_arb)
* An Nahar (nhr_arb)
* Al-Quds Al-Arabi (qds_arb)
* Ummah Press (umh_arb)
* Xinhua News Agency (xin_arb)

The seven-character codes shown above serve both as the names of the directories where the data files are found and as the prefix that appears at the beginning of every file name. Each code consists of a three-character source ID and the three-character language code (arb), separated by an underscore (_) character. The language code conforms to the ISO 639-3 standard.

In addition to adding new data, the following updates were made:

* Repeated documents in Asharq Al-Awsat data from 2008 were removed.
* Document formatting and docid duplication problems were corrected in Agence France Presse (AFP) data.
* Significant duplication of content in 2007-2008 An Nahar data was detected, and the duplicated documents were removed.

More details about these changes can be found in the included readme file.

*Data*

All text data are presented in SGML form, using a very simple, minimal markup structure. Every opening tag (DOC, HEADLINE, DATELINE, TEXT, P) always has a corresponding closing tag. The attribute values in the DOC tag are always presented within double quotes. The id= attribute of DOC consists of the seven-character source abbreviation (in caps), an underscore character, an 8-digit date string representing the date of the story (YYYYMMDD), a period, and a 4-digit sequence number starting at 0001 for each date (e.g. XIN_ARB_200101.0001). In this way, every DOC in the corpus is uniquely identifiable by its id string.

For this release, all sources have received uniform treatment in terms of quality control, and we have applied a rudimentary (and _approximate_) categorization of DOC units into three distinct types. The classification is indicated by the type=string attribute that is included in each opening DOC tag. The three types are:

* story: This is by far the most frequent type, and it represents the most typical newswire item: a coherent report on a particular topic or event, consisting of paragraphs and full sentences.
* multi: This type of DOC contains a series of unrelated blurbs, each of which briefly describes a particular topic or event. It is typically applied to DOCs that contain summaries of today's news, news briefs in ... (some general area like finance or sports), and so on.
* other: This represents DOCs that clearly do not fall into any of the above types. In general, items of this type are intended for broad circulation (they are not advisories), they may be topically coherent (unlike multi type DOCs), and they typically do not contain paragraphs or sentences (they aren't really stories). These are things like lists of sports scores, stock prices, temperatures around the world, and so on.

Other Gigaword corpora (e.g., in English and Chinese) have a fourth category, advis (for advisory), which applies to DOCs that contain text intended solely for news service editors, not the news-reading public. The task of determining patterns for assigning non-story type labels was carried out by a native speaker of Arabic, and the advis category was determined to be inapplicable to the data.

Note that the markup was applied algorithmically, using logic that was based on less-than-complete knowledge of the data. For the most part, the HEADLINE, DATELINE and TEXT tags have their intended content, but due to the inherent variability (and the inevitable source errors) in the data, users may find occasional mishaps where the headline and/or dateline were not successfully identified (and hence show up within TEXT), or where an initial sentence or paragraph has been mistakenly tagged as the headline or dateline.

*Sample*

Please view this sample.

*Sponsorship*

This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or policy of the Government, and no official endorsement should be inferred.

*Updates*

None at this time.
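
The DOC markup and id format described above could be read with a few regular expressions. The following is a minimal Python sketch, not LDC tooling: the function names (parse_doc_id, iter_docs), the SAMPLE document and its placeholder text are hypothetical, and the id pattern follows the prose description (an 8-digit YYYYMMDD date). It assumes the file has already been decompressed and decoded to a string.

```python
import re
from datetime import date

# DOC id as described in the prose: source code in caps (e.g. XIN_ARB),
# underscore, YYYYMMDD, period, 4-digit sequence number.
DOC_ID_RE = re.compile(r"^([A-Z]{3}_ARB)_(\d{4})(\d{2})(\d{2})\.(\d{4})$")

# One DOC element; every opening tag has a closing tag.
DOC_RE = re.compile(r"<DOC\s+(?P<attrs>[^>]*)>(?P<body>.*?)</DOC>", re.DOTALL)
# Attribute values are always double-quoted.
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')
HEADLINE_RE = re.compile(r"<HEADLINE>(.*?)</HEADLINE>", re.DOTALL)
P_RE = re.compile(r"<P>(.*?)</P>", re.DOTALL)


def parse_doc_id(doc_id):
    """Split a DOC id into source code, story date and sequence number."""
    m = DOC_ID_RE.match(doc_id)
    if m is None:
        raise ValueError(f"unexpected DOC id: {doc_id!r}")
    source, year, month, day, seq = m.groups()
    return {
        "source": source.lower(),  # matches the directory/file prefix
        "date": date(int(year), int(month), int(day)),
        "sequence": int(seq),
    }


def iter_docs(sgml_text):
    """Yield one dict per DOC element found in the text."""
    for m in DOC_RE.finditer(sgml_text):
        attrs = dict(ATTR_RE.findall(m.group("attrs")))
        body = m.group("body")
        headline = HEADLINE_RE.search(body)
        yield {
            "id": attrs.get("id"),
            "type": attrs.get("type"),  # story, multi or other
            "headline": headline.group(1).strip() if headline else None,
            "paragraphs": [p.strip() for p in P_RE.findall(body)],
        }


# Hypothetical document, made up for illustration only.
SAMPLE = """<DOC id="XIN_ARB_20100101.0001" type="story">
<HEADLINE>
placeholder headline
</HEADLINE>
<DATELINE>
placeholder dateline
</DATELINE>
<TEXT>
<P>
first paragraph
</P>
<P>
second paragraph
</P>
</TEXT>
</DOC>
"""

for doc in iter_docs(SAMPLE):
    print(doc["id"], doc["type"], len(doc["paragraphs"]), parse_doc_id(doc["id"]))
```

Because the headline and dateline tagging is approximate, a reader of this kind should treat a missing HEADLINE as normal rather than as a parse error, which is why the sketch returns None in that case.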

Extent: Corpus size: 3286401 KB

Identifier: LDC2011T11

https://catalog.ldc.upenn.edu/LDC2011T11

ISBN: 1-58563-595-2

ISLRN: 494-144-988-211-3

DOI: 10.35111/p02g-rw14

Language: Standard Arabic

Language: Arabic

Language (ISO639): arb

Language (ISO639): ara

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2011T11

Rights Holder: Portions © 1994-2010 Agence France Presse, © 2006-2010 Al-Ahram, © 2006-2010 Al-Quds Al-Arabi, © 2006-2010 Asharq Al-Awsat, © 2004-2010 Assabah, © 1994-2003, 2005-2010 Al Hayat, © 1995-2010 An Nahar, © 2003-2010 Ummah Press, © 2001-2010 Xinhua News Agency, © 2003, 2006, 2007, 2009, 2011 Trustees of the University of Pennsylvania

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2011T11

DateStamp: 2020-11-30

GetRecord: OAI-PMH request for simple DC format

Search Info
Citation: Parker, Robert; Graff, David; Chen, Ke; Kong, Junbo; Maeda, Kazuaki. 2011. Linguistic Data Consortium.
Terms: area_Asia country_SA dcmi_Text iso639_ara iso639_arb olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2011T11
Up-to-date as of: Wed Oct 29 7:01:17 EDT 2025
