OLAC Record: Spanish Gigaword Third Edition

OLAC Record
oai:www.ldc.upenn.edu:LDC2011T12

Metadata

Title: Spanish Gigaword Third Edition

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Mendonça, Ângelo, et al. Spanish Gigaword Third Edition LDC2011T12. Web Download. Philadelphia: Linguistic Data Consortium, 2011

Contributor: Mendonça, Ângelo

Jaquette, Daniel

Graff, David

DiPersio, Denise

Date (W3CDTF): 2011

Date Issued (W3CDTF): 2011-10-21

Description: *Introduction*

Spanish Gigaword Third Edition is a comprehensive archive of Spanish newswire text data acquired by the Linguistic Data Consortium. It includes all of the content of the second edition (LDC2009T21) and adds data collected from January 1, 2009 through December 31, 2010. The three distinct international sources of Spanish newswire in this edition, and the time spans of collection covered for each, are as follows:

* Agence France-Presse, Spanish (afp_spa) May 1994 - Dec 2010
* Associated Press, Spanish (apw_spa) Nov 1993 - Dec 2010
* Xinhua News Agency, Spanish (xin_spa) Sep 2001 - Dec 2010

The seven-character codes in the parentheses above consist of a three-character source name abbreviation and the three-character language code (spa), separated by an underscore (_) character. The three-letter language code conforms to LDC's internal convention based on the ISO 639-3 standard.

*Data*

All text data are presented in SGML/XML form, using a very simple, minimal markup structure. All text consists of printable ASCII, whitespace, and printable code points in the Latin-1 Supplement character table, as defined by both ISO-8859-1 and the Unicode Standard (ISO 10646), for the accented characters used in Spanish. The Supplement/accented characters are rendered using UTF-8 encoding.

For all of the documents in this corpus, a rudimentary (and _approximate_) categorization of DOC units into four distinct types has been applied. The classification is indicated by the type="string" attribute that is included in each opening DOC tag. The four types are:

* story : This is by far the most frequent type, and it represents the most typical newswire item: a coherent report on a particular topic or event, consisting of paragraphs and full sentences.
* multi : This type of DOC contains a series of unrelated blurbs, each of which briefly describes a particular topic or event; this is typically applied to DOCs that contain summaries of today's news, news briefs in ... (some general area like finance or sports), and so on.
* advis : (short for advisory) These are DOCs which the news service addresses to news editors -- they are not intended for publication to the end users (the populations who read the news). This type contains formulaic, repetitive content (contact phone numbers, etc.).
* other : This represents DOCs that clearly do not fall into any of the above types -- in general, items of this type are intended for broad circulation (they are not advisories), they may be topically coherent (unlike multi type DOCs), and they typically do not contain paragraphs or sentences (they aren't really stories); these are things like lists of sports scores, stock prices, temperatures around the world, and so on.

*Sample*

Please view this sample.

*Updates*

An update to Spanish Gigaword Third Edition was issued to fix an issue with 26 consecutive months of data files from Xinhua Spanish: xin_spa_200601 through xin_spa_200802, i.e. all files from 2006 and 2007, plus the first two files from 2008. The problem was that all letters with diacritic marks had been omitted in the text data for that portion of the collection. For example, the word año was presented as ao (minus the n-with-tilde character), aspiración appeared as aspiracin, and similarly for all accented characters (UTF-8 letters outside the ASCII range). All copies of Spanish Gigaword Third Edition ordered after February 2013 will have this update included. More information is included in the readme associated with this update.
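The DOC typing and the diacritics problem described above can be illustrated with a minimal Python sketch. It is not part of the LDC distribution. It assumes that each opening DOC tag carries a double-quoted type attribute, and the sample text, including the id attribute values and the TEXT and P elements, is hypothetical and chosen only for illustration. The non-ASCII check is a rough heuristic for spotting text that may predate the update, not an official validation procedure.

```python
import re
from collections import Counter

# Matches an opening DOC tag and captures the value of its type attribute.
# Assumption: the attribute value is double-quoted, e.g. type="story".
DOC_TYPE_RE = re.compile(r'<DOC\b[^>]*\btype="([^"]+)"[^>]*>')


def count_doc_types(text):
    """Tally DOC units by the value of their type attribute."""
    return Counter(DOC_TYPE_RE.findall(text))


def has_non_ascii_letters(text):
    """Return True if the text contains any character outside ASCII.

    Heuristic only: a large Spanish text with no such characters may be
    a copy in which the accented letters were dropped.
    """
    return any(ord(ch) > 127 for ch in text)


# Hypothetical sample; the ids and inner markup are illustrative assumptions.
sample = """<DOC id="XIN_SPA_20060101.0001" type="story">
<TEXT>
<P>
El año nuevo comenzó con una aspiración común.
</P>
</TEXT>
</DOC>
<DOC id="XIN_SPA_20060101.0002" type="advis">
<TEXT>
<P>
Contacto: redaccion
</P>
</TEXT>
</DOC>
"""

counts = count_doc_types(sample)
for doc_type, n in sorted(counts.items()):
    print(doc_type, n)
print("contains non-ASCII letters:", has_non_ascii_letters(sample))
```

A regular expression is used here rather than an XML parser because the record describes the files only as SGML/XML with minimal markup, so well-formed XML with a single root element is not assumed.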

Extent: Corpus size: 2736871 KB

Identifier: LDC2011T12

https://catalog.ldc.upenn.edu/LDC2011T12

ISBN: 1-58563-596-0

ISLRN: 595-627-966-073-3

DOI: 10.35111/v6z1-cb91

Language: Spanish

Language (ISO639): spa

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2011T12

Rights Holder: Portions © 1994-2010 Agence France Presse, © 1993-2010 The Associated Press, © 2001-2010 Xinhua News Agency, © 2006, 2009, 2011, 2013 Trustees of the University of Pennsylvania

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2011T12

DateStamp: 2020-11-30

GetRecord: OAI-PMH request for simple DC format

Search Info
Citation: Mendonça, Ângelo; Jaquette, Daniel; Graff, David; DiPersio, Denise. 2011. Linguistic Data Consortium.
Terms: area_Europe country_ES dcmi_Text iso639_spa olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2011T12
Up-to-date as of: Wed Oct 29 7:01:17 EDT 2025
