OLAC Record oai:www.ldc.upenn.edu:LDC2009T28 |
Metadata | ||
Title: | French Gigaword Second Edition | |
Access Rights: | Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining | |
Bibliographic Citation: | Mendonça, Ângelo, David Graff, and Denise DiPersio. French Gigaword Second Edition LDC2009T28. Web Download. Philadelphia: Linguistic Data Consortium, 2009 | |
Contributor: | Mendonça, Ângelo | |
Graff, David | ||
DiPersio, Denise | ||
Date (W3CDTF): | 2009 | |
Date Issued (W3CDTF): | 2009-11-20 | |
Description: | *Introduction*

French Gigaword Second Edition is a comprehensive archive of newswire text data that LDC has acquired over several years. This second edition updates French Gigaword First Edition (LDC2006T17) and adds material collected from August 1, 2006 through December 31, 2008.

The two distinct international sources of French newswire in this edition, and the time spans of collection covered for each, are as follows:

* Agence France-Presse (afp_fre) May 1994 - Dec 2008
* Associated Press Worldstream, French (apw_fre) Nov 1994 - Dec 2008

The seven-character codes in parentheses consist of a three-character source name abbreviation and the three-character language code (fre), separated by an underscore (_) character. The three-letter language code conforms to LDC's internal convention based on the ISO 639-3 standard. These codes are used in the directory names where the data files are found and in the prefix that appears at the beginning of every data file name. They are also used (in all UPPER CASE) as the initial portion of the DOC id strings that uniquely identify each news story.

*Data*

The overall totals for each source are summarized below. The Totl-MB numbers show the amount of data obtained when the files are uncompressed (i.e., approximately 5.8 gigabytes in total); the Gzip-MB column shows totals for compressed file sizes; and the K-wrds numbers are the number of whitespace-separated tokens (of all types), in thousands, after all SGML tags are eliminated.

Source    #Files  Gzip-MB  Totl-MB  K-wrds    #DOCs
AFP_FRE      172     2408     4079  560000  2060803
APW_FRE      171     2280     1719  241324   872573
TOTAL        343     4688     5789  801324  2933376

The following tables present Text-MB, K-wrds and #DOCs broken down by source and DOC type. Text-MB represents the total number of characters (including whitespace) after SGML tags are eliminated.

Source    Text-MB  K-wrds  #DOCs

type=advis:
AFP_FRE        88   11788  48712
APW_FRE        14    2303   9235
TOTAL         103   14091  57947

type=multi:
AFP_FRE        59    8411  10269
APW_FRE       194   29828  52240
TOTAL         253   38239  62509

type=other:
AFP_FRE       178   58514   8411
APW_FRE        82  193981  29828
TOTAL         260   38239  38239

type=story:
AFP_FRE      1824  198440  27216
APW_FRE       729   87662  13006
TOTAL        2553  286102  40222

All of the data has undergone the same level of quality control to eliminate out-of-band content and other obvious forms of corruption. Since the source data is generated manually on a daily basis, there will be a small percentage of human errors common to all sources: missing whitespace, incorrect or variant spellings, badly formed sentences, and so on, as are normally seen in newspapers. No attempt has been made to address this property of the data.

*Samples*

For an example of the data in this corpus, please view this image of the text of French Gigaword. | |
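The description defines K-wrds as whitespace-separated tokens counted after SGML tags are eliminated, and Text-MB as the character count (including whitespace) of the same tag-free text. A minimal Python sketch of those two counts, and of recovering the source code from a DOC id prefix, might look like the following. The tag pattern, the function names, the `<TEXT>` element, and the sample DOC id are illustrative assumptions, not taken from the corpus documentation.

```python
import re

# Source codes as listed in the description (lower case in file and
# directory names, UPPER CASE at the start of DOC id strings).
SOURCE_CODES = ("afp_fre", "apw_fre")

# Assumption: an SGML tag is any run of characters between "<" and ">".
TAG_RE = re.compile(r"<[^>]*>")


def strip_tags(sgml: str) -> str:
    """Remove SGML tags, leaving all other characters untouched."""
    return TAG_RE.sub("", sgml)


def count_tokens(sgml: str) -> int:
    """Whitespace-separated tokens after tag removal (K-wrds is this / 1000)."""
    return len(strip_tags(sgml).split())


def count_chars(sgml: str) -> int:
    """Characters, including whitespace, after tag removal (basis of Text-MB)."""
    return len(strip_tags(sgml))


def source_of(doc_id: str):
    """Return the lower-case source code whose upper-case form prefixes doc_id."""
    for code in SOURCE_CODES:
        if doc_id.startswith(code.upper()):
            return code
    return None


# Hypothetical document: the id value and the <TEXT> element are placeholders.
sample = (
    '<DOC id="AFP_FRE_HYPOTHETICAL" type="story">\n'
    "<TEXT>\n"
    "Bonjour le monde.\n"
    "</TEXT>\n"
    "</DOC>"
)

print(count_tokens(sample), count_chars(sample), source_of("AFP_FRE_HYPOTHETICAL"))
```

Tags are replaced with the empty string rather than a space, so the character count reflects only the text that remains. Whether LDC's own tooling counted tokens and characters in exactly this way is not stated in the record.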
Extent: | Corpus size: 1782579 KB | |
Identifier: | LDC2009T28 | |
https://catalog.ldc.upenn.edu/LDC2009T28 | ||
ISBN: 1-58563-528-6 | ||
ISLRN: 739-169-067-045-4 | ||
DOI: 10.35111/5s4k-q428 | ||
Language: | French | |
Language (ISO639): | fra | |
License: | LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf | |
Medium: | Distribution: Web Download | |
Publisher: | Linguistic Data Consortium | |
Publisher (URI): | https://www.ldc.upenn.edu | |
Relation (URI): | https://catalog.ldc.upenn.edu/docs/LDC2009T28 | |
Rights Holder: | Portions © 1994-2008 Agence France-Presse, © 1994-2008 The Associated Press, © 2006, 2009 Trustees of the University of Pennsylvania | |
Type (DCMI): | Text | |
Type (OLAC): | primary_text | |
OLAC Info |
||
Archive: | The LDC Corpus Catalog | |
Description: | http://www.language-archives.org/archive/www.ldc.upenn.edu | |
GetRecord: | OAI-PMH request for OLAC format | |
GetRecord: | Pre-generated XML file | |
OAI Info |
||
OaiIdentifier: | oai:www.ldc.upenn.edu:LDC2009T28 | |
DateStamp: | 2020-11-30 | |
GetRecord: | OAI-PMH request for simple DC format | |
Search Info | ||
Citation: | Mendonça, Ângelo; Graff, David; DiPersio, Denise. 2009. Linguistic Data Consortium. | |
Terms: | area_Europe country_FR dcmi_Text iso639_fra olac_primary_text |