OLAC Record oai:www.ldc.upenn.edu:LDC2016T13 |
Metadata | ||
Title: | Chinese Treebank 9.0 | |
Access Rights: | Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining | |
Bibliographic Citation: | Xue, Nianwen, et al. Chinese Treebank 9.0 LDC2016T13. Web Download. Philadelphia: Linguistic Data Consortium, 2016 | |
Contributor: | Xue, Nianwen | |
Zhang, Xiuhong | ||
Jiang, Zixin | ||
Palmer, Martha | ||
Xia, Fei | ||
Chiou, Fu-Dong | ||
Chang, Meiyu | ||
Date (W3CDTF): | 2016 | |
Date Issued (W3CDTF): | 2016-06-15 | |
Description: | *Introduction* Chinese Treebank 9.0 consists of approximately two million words of annotated and parsed text from Chinese newswire, government documents, magazine articles, various broadcast news and broadcast conversation programs, web newsgroups, weblogs, discussion forums, chat messages and transcribed conversational telephone speech. The Chinese Treebank project began at the University of Pennsylvania in 1998, continued at the University of Colorado and then moved to Brandeis University. The project's goal is to provide a large, part-of-speech tagged and fully bracketed Chinese language corpus. The first delivery, Chinese Treebank 1.0, contained 100,000 syntactically annotated words from Xinhua News Agency newswire. It was later corrected and released in 2001 as Chinese Treebank 2.0 (LDC2001T11), which consisted of approximately 100,000 words. LDC released Chinese Treebank 4.0 (LDC2004T05), an updated version containing roughly 400,000 words, in 2004. A year later, LDC published the 500,000-word Chinese Treebank 5.0 (LDC2005T01). Chinese Treebank 6.0 (LDC2007T36), released in 2007, consisted of 780,000 words. Chinese Treebank 7.0 (LDC2010T07), released in 2010, added new annotated newswire data, broadcast material and web text, bringing the total to approximately one million words. Chinese Treebank 8.0 (LDC2013T21) included new annotated data from newswire, magazine articles and government documents. Chinese Treebank 9.0 adds more annotated web data and two new genres: chat messages and transcribed conversational telephone speech. *Data* There are 3,726 text files in this release, containing 132,076 sentences, 2,084,387 words, and 3,247,331 characters (hanzi or foreign). The data is provided in the UTF-8 encoding, and the annotation uses Penn Treebank-style labeled brackets. Details of the annotation standard can be found in the enclosed segmentation, POS-tagging and bracketing guidelines. 
The data is provided in four formats: raw text, word-segmented, POS-tagged, and syntactically bracketed. All files were automatically verified and manually checked. *Samples* Please view the following samples: * POSTagged * Raw * Segmented * Bracketed *Acknowledgement* This work was supported in part by the Defense Advanced Research Projects Agency DOD MDA902-97-C-0307, DARPA TIDES N66001-00-1-8915, DARPA GALE HR0011-06-0022, and DARPA BOLT HR0011-11-C-0145. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred. *Updates* None at this time. | |
Extent: | Corpus size: 188784 KB | |
Identifier: | LDC2016T13 | |
https://catalog.ldc.upenn.edu/LDC2016T13 | ||
ISBN: 1-58563-757-2 | ||
ISLRN: 219-696-236-485-2 | ||
DOI: 10.35111/gvd0-xk91 | ||
Language: | Chinese | |
Mandarin Chinese | ||
Language (ISO639): | zho | |
cmn | ||
License: | LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf | |
Medium: | Distribution: Web Download | |
Publisher: | Linguistic Data Consortium | |
Publisher (URI): | https://www.ldc.upenn.edu | |
Relation (URI): | https://catalog.ldc.upenn.edu/docs/LDC2016T13 | |
Rights Holder: | Portions © 2006 Agence France Presse, © 2006 Anhui TV, © 2005 Cable News Network, LP, LLLP, © 2000-2001 China Broadcasting System, © 2000-2001, 2005-2006 China Central TV, © 2000-2001 China National Radio, © 2006 Chinanews.com, © 2000-2001 China Television System, © 2006 Guangming Daily, © 2006 National Broadcasting Company, Inc., © 2006 New Tang Dynasty TV, © 2006 Peoples Daily Online, © 2005-2006 Phoenix TV, © 1996-2001 Sinorama Magazine, © 1997 The Government of the Hong Kong Special Administrative Region, © 1994-1998, 2006 Xinhua News Agency, © 1996, 2001, 2004, 2005, 2007, 2009, 2010, 2013, 2016 Trustees of the University of Pennsylvania | |
Type (DCMI): | Text | |
Type (OLAC): | primary_text | |
OLAC Info |
||
Archive: | The LDC Corpus Catalog | |
Description: | http://www.language-archives.org/archive/www.ldc.upenn.edu | |
GetRecord: | OAI-PMH request for OLAC format | |
GetRecord: | Pre-generated XML file | |
OAI Info |
||
OaiIdentifier: | oai:www.ldc.upenn.edu:LDC2016T13 | |
DateStamp: | 2020-11-30 | |
GetRecord: | OAI-PMH request for simple DC format | |
Search Info | ||
Citation: | Xue, Nianwen; Zhang, Xiuhong; Jiang, Zixin; Palmer, Martha; Xia, Fei; Chiou, Fu-Dong; Chang, Meiyu. 2016. Linguistic Data Consortium. | |
Terms: | area_Asia country_CN dcmi_Text iso639_cmn iso639_zho olac_primary_text |
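
The Description field above says the annotation uses Penn Treebank-style labeled brackets. As a rough illustration of what reading that format involves, here is a minimal Python sketch that parses one bracketed tree and extracts its (word, POS tag) pairs. The sample sentence is invented for this example and is not taken from the corpus; the function names `parse_brackets` and `tagged_words` are hypothetical; and the actual release files may carry extra wrapper markup that this sketch does not handle.

```python
import re


def parse_brackets(text):
    """Parse one Penn Treebank-style bracketed tree into nested lists.

    A node "(LABEL child ...)" becomes ["LABEL", child, ...]; leaves are strings.
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        if tokens[pos] != "(":
            leaf = tokens[pos]
            pos += 1
            return leaf
        pos += 1  # consume "("
        node = []
        while tokens[pos] != ")":
            node.append(read())
        pos += 1  # consume ")"
        return node

    return read()


def tagged_words(tree):
    """Yield (word, tag) for every preterminal node of the form [tag, word]."""
    if len(tree) == 2 and isinstance(tree[0], str) and isinstance(tree[1], str):
        yield tree[1], tree[0]
        return
    for child in tree:
        if isinstance(child, list):
            yield from tagged_words(child)


# Invented sample tree (an assumption for illustration, not corpus data).
sample = "(IP (NP-SBJ (NN 经济)) (VP (VV 发展)))"
pairs = list(tagged_words(parse_brackets(sample)))
print(pairs)
```

The same traversal yields the word-segmented view (the words alone) and the POS-tagged view (word and tag together), which mirrors how the bracketed format subsumes the other annotated formats listed in the record.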