OLAC Record oai:www.ldc.upenn.edu:LDC2016T05 |
Metadata | ||
Title: | BOLT Chinese Discussion Forums | |
Access Rights: | Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining | |
Bibliographic Citation: | Tracey, Jennifer, et al. BOLT Chinese Discussion Forums LDC2016T05. Web Download. Philadelphia: Linguistic Data Consortium, 2016 | |
Contributor: | Tracey, Jennifer | |
Lee, Haejoong | ||
Strassel, Stephanie | ||
Chen, Song | ||
Date (W3CDTF): | 2016 | |
Date Issued (W3CDTF): | 2016-02-15 | |
Description: | *Introduction* BOLT Chinese Discussion Forums was developed by the Linguistic Data Consortium (LDC) and consists of 1,597,500 discussion forum threads in Chinese harvested from the Internet using a combination of manual and automatic processes. The DARPA BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. LDC supported the BOLT program by collecting informal data sources -- discussion forums, text messaging and chat -- in Chinese, Egyptian Arabic and English. The material in this release represents the unannotated Chinese source data in the discussion forum genre. The data was subsequently translated and annotated for various tasks in the BOLT program including word alignment, treebanking, propbanking and co-reference. *Data* Collection was seeded based on the results of manual data scouting by native speaker annotators. Scouts were instructed to seek content in Mandarin Chinese that was original, interactive and informal. Upon locating an appropriate thread, scouts submitted the URL and some simple judgments about it to a database, via a web browser plug-in. When multiple threads from a forum were submitted, the entire forum was automatically harvested and added to the collection. The scale of the collection precluded manual review of all data. Only a small portion of the threads included in this release were manually reviewed, and it is expected that there may be some offensive or otherwise undesired content as well as some threads that contain a large amount of non-Chinese content. Language identification was performed on all threads in this corpus (using CLD2), and threads for which the results indicated a high probability of largely non-Chinese content are listed in cmn_suspect_LID.txt in the docs directory of this package. The corpus consists of HTML and XML files. 
Each HTML file is the raw HTML downloaded from a discussion thread. If the thread spanned multiple URLs, it was stored as a concatenation of the downloaded HTML files. The XML files were converted from the raw HTML. *Acknowledgement* This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-11-C-0145. The content does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred. *Samples* Please view this HTML sample and XML sample. *Updates* None at this time. | |
Extent: | Corpus size: 33506184 KB | |
Identifier: | LDC2016T05 | |
https://catalog.ldc.upenn.edu/LDC2016T05 | ||
ISBN: 1-58563-743-2 | ||
ISLRN: 682-988-480-192-1 | ||
DOI: 10.35111/3vm2-t509 | ||
Language: | Mandarin Chinese | |
Chinese | ||
Language (ISO639): | cmn | |
zho | ||
License: | LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf | |
Medium: | Distribution: Web Download | |
Provenance: | Collected by the Linguistic Data Consortium (LDC) in Philadelphia, PA, USA. | |
Publisher: | Linguistic Data Consortium | |
Publisher (URI): | https://www.ldc.upenn.edu | |
Relation (URI): | https://catalog.ldc.upenn.edu/docs/LDC2016T05 | |
Rights Holder: | Portions © 2016 Trustees of the University of Pennsylvania | |
Type (DCMI): | Text | |
Type (OLAC): | primary_text | |
OLAC Info |
||
Archive: | The LDC Corpus Catalog | |
Description: | http://www.language-archives.org/archive/www.ldc.upenn.edu | |
GetRecord: | OAI-PMH request for OLAC format | |
GetRecord: | Pre-generated XML file | |
OAI Info |
||
OaiIdentifier: | oai:www.ldc.upenn.edu:LDC2016T05 | |
DateStamp: | 2020-11-30 | |
GetRecord: | OAI-PMH request for simple DC format | |
Search Info | ||
Citation: | Tracey, Jennifer; Lee, Haejoong; Strassel, Stephanie; Chen, Song. 2016. Linguistic Data Consortium. | |
Terms: | area_Asia country_CN dcmi_Text iso639_cmn iso639_zho olac_primary_text |
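
The GetRecord entries above refer to OAI-PMH requests for this record in OLAC and simple DC formats. As a rough sketch of how such a request URL is typically formed, the Python below builds a GetRecord query from the OaiIdentifier listed in this record. The base URL is a hypothetical placeholder (this record does not state the endpoint), and the metadata prefixes `olac` and `oai_dc` are assumed to be the ones the repository advertises.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: the record above does not give the OAI-PMH base URL.
BASE_URL = "https://example.org/oai"

# Taken from the OaiIdentifier field of this record.
OAI_IDENTIFIER = "oai:www.ldc.upenn.edu:LDC2016T05"


def get_record_url(identifier: str, metadata_prefix: str,
                   base_url: str = BASE_URL) -> str:
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# Assumed prefixes: "olac" for OLAC format, "oai_dc" for simple Dublin Core.
olac_url = get_record_url(OAI_IDENTIFIER, "olac")
dc_url = get_record_url(OAI_IDENTIFIER, "oai_dc")

print(olac_url)
print(dc_url)
```

Replacing `BASE_URL` with the archive's actual endpoint would be needed before sending either request; the sketch only assembles the query string and performs no network access.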