OLAC Record
oai:www.ldc.upenn.edu:LDC2009T14

Metadata
Title:Tagged Chinese Gigaword Version 2.0
Access Rights:Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining
Bibliographic Citation:Huang, Chu-Ren. Tagged Chinese Gigaword Version 2.0 LDC2009T14. Web Download. Philadelphia: Linguistic Data Consortium, 2009
Contributor:Huang, Chu-Ren
Date (W3CDTF):2009
Date Issued (W3CDTF):2009-06-18
Description:*Introduction*
Tagged Chinese Gigaword Version 2.0, created by scholars at Academia Sinica, Taipei, Taiwan, is a part-of-speech tagged version of LDC's Chinese Gigaword Second Edition (LDC2005T14). Like the original release, Version 2.0 contains all of the data in Chinese Gigaword Second Edition -- from Central News Agency, Xinhua News Agency and Lianhe Zaobao -- annotated with full part-of-speech tags. In addition, this release removes residual noise from the original and improves tagging accuracy by incorporating lexica of unknown words.
The changes represented in Version 2.0 include the following:
* A single-width space is used consistently between two segmented words.
* The position of the newline character remains fixed, better reflecting the source files from Chinese Gigaword Second Edition (LDC2005T14).
* The original coding of partial Latin letters or Arabic numerals is preserved.
* 1,192 documents from Central News Agency (Taiwan) and 13 documents from Xinhua News Agency that were missing from the first publication are included.
* A set of heuristics for building out-of-vocabulary dictionaries to improve annotation quality of very large corpora is incorporated.
Documents in the corpus were assigned one of the following categories:
* story: This type of DOC represents a coherent report on a particular topic or event, consisting of paragraphs and full sentences.
* multi: This type of DOC contains a series of unrelated "blurbs," each of which briefly describes a particular topic or event; examples include "summaries of today's news," "news briefs in ..." (some general area like finance or sports), and so on.
* advis: These are DOCs which the news service addresses to news editors; they are not intended for publication to the "end users."
* other: These DOCs clearly do not fall into any of the above types; they include items such as lists of sports scores, stock prices, temperatures around the world, and so on.
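The single-space rule listed among the Version 2.0 changes makes a tagged line easy to split into tokens. The following is a minimal Python sketch, not the corpus authors' tooling. It assumes each token has the form `word(TAG)`; that token format is an assumption and is not stated in this record, and the sample line and the name `parse_line` are made up for illustration.

```python
import re

# ASSUMPTION: each token looks like word(TAG). This record does not
# specify the token format, so adjust the pattern to the real files.
TOKEN = re.compile(r"^(?P<word>.+)\((?P<tag>[A-Za-z0-9_]+)\)$")


def parse_line(line):
    """Split one segmented line on single spaces into (word, tag) pairs."""
    pairs = []
    for token in line.rstrip("\n").split(" "):
        if not token:
            continue  # tolerate stray empty fields
        match = TOKEN.match(token)
        if match is None:
            raise ValueError(f"unexpected token format: {token!r}")
        pairs.append((match.group("word"), match.group("tag")))
    return pairs


# Hypothetical sample line, not taken from the corpus.
sample = "我(Nh) 有(V_2) 書(Na)"
pairs = parse_line(sample)
print(pairs)
```

Splitting on a literal single space, rather than on arbitrary whitespace, mirrors the consistency guarantee described for this release.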
*Data*
Basic statistics of data from each source are summarized below.
Source   No. Files  Compressed Size (MB)  Total Size (MB)  No. Words (thousands)  No. Documents
CNA_CMN  168        1520                  6136             501456                 1769953
XIN_CMN  168        898                   3755             311660                 992261
ZBN_CMN  10         55                    214              18632                  41418
TOTAL    346        2473                  10105            831748                 2803632
The POS tags and their corresponding explanations are listed below:
Tag   Explanation_Chinese      Explanation_English
A     非謂形容詞               Non-predicative adjective
Caa   對等連接詞,如:和、跟   Conjunctive conjunction
Cab   連接詞,如:等等         Conjunction, e.g. deng3deng3
Cba   連接詞,如:的話         Conjunction, e.g. de5hua4
Cbb   關聯連接詞               Correlative Conjunction
D     副詞                     Adverb
Da    數量副詞                 Quantitative Adverb
DE    的, 之, 得, 地           Particle DE and its functional equivalents
Dfa   動詞前程度副詞           Pre-verbal Adverb of degree
Dfb   動詞後程度副詞           Post-verbal Adverb of degree
Di    時態標記                 Aspectual Adverb
Dk    句副詞                   Sentential Adverb
FW    外文標記                 Foreign Word
I     感嘆詞                   Interjection
Na    普通名詞                 Common Noun
Nb    專有名稱                 Proper Noun
Nc    地方詞                   Place Noun
Ncd   位置詞                   Localizer
Nd    時間詞                   Time Noun
Nep   指代定詞                 Demonstrative Determinatives
Neqa  數量定詞                 Quantitative Determinatives
Neqb  後置數量定詞             Post-quantitative Determinatives
Nes   特指定詞                 Specific Determinatives
Neu   數詞定詞                 Numeral Determinatives
Nf    量詞                     Measure
Ng    後置詞                   Postposition
Nh    代名詞                   Pronoun
P     介詞                     Preposition
SHI   是                       shi4 (to be)
T     語助詞                   Particle
VA    動作不及物動詞           Active Intransitive Verb
VAC   動作使動動詞             Active Causative Verb
VB    動作類及物動詞           Active Pseudo-transitive Verb
VC    動作及物動詞             Active Transitive Verb
VCL   動作接地方賓語動詞       Active Verb with a Locative Object
VD    雙賓動詞                 Ditransitive Verb
VE    動作句賓動詞             Active Verb with a Sentential Object
VF    動作謂賓動詞             Active Verb with a Verbal Object
VG    分類動詞                 Classificatory Verb
VH    狀態不及物動詞           Stative Intransitive Verb
VHC   狀態使動動詞             Stative Causative Verb
VI    狀態類及物動詞           Stative Pseudo-transitive Verb
VJ    狀態及物動詞             Stative Transitive Verb
VK    狀態句賓動詞             Stative Verb with a Sentential Object
VL    狀態謂賓動詞             Stative Verb with a Verbal Object
V_2   有                       you3 (to have)
Because neither manual checking nor automatic checking against a gold standard is feasible for gigaword-size corpora, the authors proposed a quality assurance method for automatic annotation of very large corpora based on the heterogeneous CKIP and ICTCLAS tagging systems (Huang et al., 2008). By comparison with word lists generated from an ICTCLAS-tagged version of the Xinhua portion of Chinese Gigaword, a set of heuristics for building out-of-vocabulary dictionaries to improve annotation quality was proposed. Randomly selected texts were manually checked to evaluate the effect of these out-of-vocabulary dictionaries. Experimental results indicate that 30,562 of the 31,421 tested words (about 97.3%) were correct.
The quality control test result follows:
Corpora  Thousands of Words  No. Test Words  No. Correct Words
CNA      501459              42,695          41,449
XIN      311718              28,744          27,967
ZBN      18632               22,825          22,270
Total    831809              31,421          30,562
*Samples*
Please view this sample.
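The quality control figures in the description above can be turned into per-source accuracy rates. The Python sketch below uses only the counts given in this record; the names `QC`, `REPORTED_TOTAL`, and `accuracy` are illustrative. It also sums the per-source counts, which suggests that the reported Total row for test and correct words is closer to a per-source average than to a sum.

```python
# Counts copied from the quality control table in this record:
# source -> (number of test words, number of correct words)
QC = {
    "CNA": (42695, 41449),
    "XIN": (28744, 27967),
    "ZBN": (22825, 22270),
}
REPORTED_TOTAL = (31421, 30562)  # the "Total" row as published


def accuracy(tested, correct):
    """Return the percentage of tested words judged correct."""
    return 100.0 * correct / tested


per_source = {src: round(accuracy(t, c), 1) for src, (t, c) in QC.items()}
reported = round(accuracy(*REPORTED_TOTAL), 1)

# Column sums over the three sources, for comparison with the Total row.
summed = tuple(sum(col) for col in zip(*QC.values()))
pooled = round(accuracy(*summed), 1)

print(per_source)  # {'CNA': 97.1, 'XIN': 97.3, 'ZBN': 97.6}
print(reported)    # 97.3
print(summed)      # (94264, 91686)
print(pooled)      # 97.3
```

Either way of aggregating gives about 97.3%, consistent with the figure quoted in the description.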
Extent:Corpus size: 2453667 KB
Identifier:LDC2009T14
https://catalog.ldc.upenn.edu/LDC2009T14
ISBN: 1-58563-516-2
ISLRN: 247-043-830-464-8
DOI: 10.35111/9bhh-2s82
Language:Mandarin Chinese
Chinese
Language (ISO639):cmn
zho
License:LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf
Medium:Distribution: Web Download
Publisher:Linguistic Data Consortium
Publisher (URI):https://www.ldc.upenn.edu
Relation (URI):https://catalog.ldc.upenn.edu/docs/LDC2009T14
Rights Holder:Portions © 2005-2009 Academia Sinica, © 1991-1994 Central News Agency (Taiwan), © 2000-2003 SPH AsiaOne, Ltd., © 1990-2004 Xinhua News Agency, © 2005, 2007, 2009 Trustees of the University of Pennsylvania
Type (DCMI):Text
Type (OLAC):primary_text

OLAC Info

Archive:  The LDC Corpus Catalog
Description:  http://www.language-archives.org/archive/www.ldc.upenn.edu
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:www.ldc.upenn.edu:LDC2009T14
DateStamp:  2020-11-30
GetRecord:  OAI-PMH request for simple DC format
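The GetRecord entries in this record correspond to OAI-PMH requests, which take a `verb`, an `identifier`, and a `metadataPrefix` as query parameters. The Python sketch below only builds the request URLs and sends nothing over the network. The base URL is an assumption (this record does not state the endpoint), as is the use of the `olac` prefix for the OLAC format; `oai_dc` is the standard OAI-PMH prefix for simple Dublin Core. The function name `get_record_url` is illustrative.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# ASSUMPTION: endpoint of the OLAC aggregator; not given in this record.
BASE_URL = "http://www.language-archives.org/cgi-bin/olaca3.pl"
OAI_IDENTIFIER = "oai:www.ldc.upenn.edu:LDC2009T14"  # from this record


def get_record_url(identifier, metadata_prefix, base_url=BASE_URL):
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


dc_url = get_record_url(OAI_IDENTIFIER, "oai_dc")   # simple DC format
olac_url = get_record_url(OAI_IDENTIFIER, "olac")   # assumed OLAC prefix
print(dc_url)
print(olac_url)
```

The colons in the identifier are percent-encoded by `urlencode`, so the resulting URL is safe to pass to any HTTP client.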

Search Info

Citation: Huang, Chu-Ren. 2009. Linguistic Data Consortium.
Terms: area_Asia country_CN dcmi_Text iso639_cmn iso639_zho olac_primary_text


http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2009T14
Up-to-date as of: Thu Oct 24 7:30:25 EDT 2024