|  | OLAC Record oai:www.ldc.upenn.edu:LDC2006T17 | 
| Metadata | ||
| Title: | French Gigaword First Edition | |
| Access Rights: | Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining | |
| Bibliographic Citation: | Graff, David. French Gigaword First Edition LDC2006T17. Web Download. Philadelphia: Linguistic Data Consortium, 2006 | |
| Contributor: | Graff, David | |
| Date (W3CDTF): | 2006 | |
| Date Issued (W3CDTF): | 2006-11-17 | |
| Description: | *Introduction* French Gigaword First Edition is a comprehensive archive of newswire text data that has been acquired over several years by the Linguistic Data Consortium (LDC) and consists of over 650 million tokens spanning approximately 2.4 million documents. The two distinct international sources of French newswire in this edition, and the time spans of collection covered for each, are as follows: Agence France-Presse (afp_fre), May 1994 - July 2006; Associated Press French Service (apw_fre), Nov 1994 - July 2006. *Data* The overall totals for each source are summarized below. The "K-wrds" figures are simply the number, in thousands, of whitespace-separated tokens of all types after all SGML tags are eliminated. Source (K-wrds, #DOCs): AFP_FRE (482904, 1797139); APW_FRE (167405, 622740); TOTAL (650309, 2419879). All text data are presented in SGML form, using a very simple, minimal markup structure; all text consists of printable ASCII, whitespace, and printable code points in the "Latin1 Supplement" character table, as defined by the Unicode Standard (ISO 10646) for the "accented" characters used in French. The Supplement/accented characters are presented in UTF-8 encoding. Most of the text data (all of AFP_FRE, most of APW_FRE) were received at LDC via dedicated, 24-hour/day electronic feeds (leased phone lines in the case of APW_FRE, a local satellite dish for AFP_FRE). These 24-hour transmission services were all susceptible to "line noise" (occasional corruption of text content), as well as service outages both at the data source and at our receiving computers. Usually, the various disruptions of a newswire data stream would leave tell-tale evidence in the form of byte values falling outside the range of printable ASCII characters, or recognizable patterns of anomalous ASCII strings. The portion of APW_FRE data beginning with 200406 was received as bulk electronic text archives via internet retrieval. As such, these data were not susceptible to modem line-noise or related disruptions, though this does not guarantee that the source data are free of mishaps. More detailed information can be found in the 0readme.txt file in the associated documentation. *Samples* Please view this text sample (TXT). *Updates* None at this time. | |
| Extent: | Corpus size: 1572864 KB | |
| Identifier: | LDC2006T17 | |
| https://catalog.ldc.upenn.edu/LDC2006T17 | ||
| ISBN: 1-58563-405-0 | ||
| ISLRN: 351-085-945-382-6 | ||
| DOI: 10.35111/n8na-xw24 | ||
| Language: | French | |
| Language (ISO639): | fra | |
| License: | LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf | |
| Medium: | Distribution: Web Download | |
| Publisher: | Linguistic Data Consortium | |
| Publisher (URI): | https://www.ldc.upenn.edu | |
| Relation (URI): | https://catalog.ldc.upenn.edu/docs/LDC2006T17 | |
| Rights Holder: | Portions © 1994-2006 Agence France-Presse, © 1994-2006 The Associated Press, © 2006 Trustees of the University of Pennsylvania | |
| Type (DCMI): | Text | |
| Type (OLAC): | primary_text | |
| OLAC Info | ||
| Archive: | The LDC Corpus Catalog | |
| Description: | http://www.language-archives.org/archive/www.ldc.upenn.edu | |
| GetRecord: | OAI-PMH request for OLAC format | |
| GetRecord: | Pre-generated XML file | |
| OAI Info | ||
| OaiIdentifier: | oai:www.ldc.upenn.edu:LDC2006T17 | |
| DateStamp: | 2021-02-12 | |
| GetRecord: | OAI-PMH request for simple DC format | |
| Search Info | ||
| Citation: | Graff, David. 2006. Linguistic Data Consortium. | |
| Terms: | area_Europe country_FR dcmi_Text iso639_fra olac_primary_text | |
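
The "K-wrds" figures in the Description are defined as thousands of whitespace-separated tokens remaining after SGML tags are eliminated. A minimal sketch of that count follows. It assumes that any `<...>` span is a tag, that each tag is replaced by a space before splitting, and that the thousands figure is truncated rather than rounded; none of these details is stated in the record, and the sample document, its `id` value, and the function names are hypothetical. The 0readme.txt file in the corpus documentation remains the authority on the actual markup and counting procedure.

```python
import re

# Assumption: every "<...>" span is an SGML tag. The real markup is
# described in the corpus's 0readme.txt, not in this record.
TAG_RE = re.compile(r"<[^>]*>")


def count_tokens(sgml_text: str) -> int:
    """Count whitespace-separated tokens after removing tags."""
    # Replace each tag with a space so words on either side of a tag
    # are not fused together (an assumption of this sketch).
    stripped = TAG_RE.sub(" ", sgml_text)
    return len(stripped.split())


def k_words(token_count: int) -> int:
    """Express a token count in thousands, truncating (assumed)."""
    return token_count // 1000


# Hypothetical sample document; the id and layout are illustrative only.
sample = (
    '<DOC id="EXAMPLE_0001" type="story">\n'
    "<TEXT>\n<P>\n"
    "Le Premier ministre a annoncé une réforme.\n"
    "</P>\n</TEXT>\n</DOC>"
)

print(count_tokens(sample))
```

Summing `count_tokens` over every document from one source and passing the total to `k_words` would give a figure comparable in kind to the per-source totals listed in the Description, subject to the assumptions above.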