OLAC Record: 2009 CoNLL Shared Task Part 2

OLAC Record
oai:www.ldc.upenn.edu:LDC2012T04

Metadata

Title: 2009 CoNLL Shared Task Part 2

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Hajič, Jan, et al. 2009 CoNLL Shared Task Part 2 LDC2012T04. Web Download. Philadelphia: Linguistic Data Consortium, 2012.

Contributor: Hajič, Jan

Ciaramita, Massimiliano

Johansson, Richard

Meyers, Adam

Štěpánek, Jan

Nivre, Joakim

Straňák, Pavel

Surdeanu, Mihai

Xue, Nianwen (Bert)

Zhang, Yi

Date (W3CDTF): 2012

Date Issued (W3CDTF): 2012-04-20

Description: *Introduction*

2009 CoNLL Shared Task Part 2, LDC Catalog Number LDC2012T04 and ISBN 1-58563-611-8, contains the Chinese and English trial corpora, training corpora, and development and test data for the 2009 CoNLL (Conference on Computational Natural Language Learning) Shared Task Evaluation. The 2009 Shared Task developed syntactic dependency annotations, including semantic dependencies that model the roles of both verbal and nominal predicates.

The Conference on Computational Natural Language Learning (CoNLL) is accompanied every year by a shared task intended to promote natural language processing applications and evaluate them in a standard setting. The 2004 and 2005 CoNLL shared tasks were dedicated to semantic role labeling (SRL) in a monolingual setting (English). In 2006 and 2007, the shared tasks were devoted to the parsing of syntactic dependencies and used corpora from up to thirteen languages. In 2008, the shared task focused on English, employed a unified dependency-based formalism, and merged the task of syntactic dependency parsing with the task of identifying semantic arguments and labeling them with semantic roles; that data has been released by LDC as 2008 CoNLL Shared Task Data.

The 2009 task extended the 2008 task to several languages (English plus Catalan, Chinese, Czech, German, Japanese and Spanish). Among the new features were a comparison of time and space complexity based on participants' input, and a learning curve comparison for languages with large datasets.

The 2009 shared task was divided into two subtasks:
* parsing syntactic dependencies
* identification of arguments and assignment of semantic roles for each predicate

2009 CoNLL Shared Task Part 1 (LDC2012T03) contains the Catalan, Czech, German and Spanish task data and is also available through LDC.

LDC has also released the following CoNLL Shared Task data sets:
* 2006 CoNLL Shared Task - Ten Languages (LDC2015T11)
* 2006 CoNLL Shared Task - Arabic & Czech (LDC2015T12)
* 2008 CoNLL Shared Task Data (LDC2009T12)
* 2015-2016 CoNLL Shared Task (LDC2017T13)

*Data*

The materials in this release consist of excerpts from the following corpora:
* Treebank-2 (LDC95T7) (English): over one million words of annotated English newswire and other text developed by the University of Pennsylvania
* PropBank (LDC2004T14) (English): semantic annotation of newswire text from Treebank-2 developed by the University of Pennsylvania
* NomBank (LDC2008T23) (English): argument structure for instances of common nouns in Treebank-2 and Treebank-3 (LDC99T42) texts developed by New York University
* Chinese Treebank 6.0 (LDC2007T36) (Chinese): 780,000 words (over 1.28 million characters) of annotated Chinese newswire, magazine and administrative texts and transcripts from various broadcast news programs developed by the University of Pennsylvania and the University of Colorado
* Chinese Proposition Bank 2.0 (LDC2008T07) (Chinese): predicate-argument annotation on 500,000 words from Chinese Treebank 6.0 developed by the University of Pennsylvania and the University of Colorado

In addition, an archive of all of the data uploaded by the participants is included in the eval-data folder. Users should note that not all data indicated in the individual READMEs is included in this release, and neither are some of the corresponding DTDs for the XML. Additionally, all data is presented in its uncompressed form for ease of use. Within the user eval-data folder, the two folders marked bad contain references to data from languages included in Part 1 of this release as well as to Japanese data. Japanese data is not included in this release.

*Samples*

For samples of documents from each language use the links below:
* Chinese
* English

*Updates*

None at this time.

Extent: Corpus size: 342062 KB

Identifier: LDC2012T04

https://catalog.ldc.upenn.edu/LDC2012T04

ISBN: 1-58563-611-8

ISLRN: 088-658-711-565-5

DOI: 10.35111/gd8z-qp80

Language: English

Mandarin Chinese

Chinese

Language (ISO639): eng

cmn

zho

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2012T04

Rights Holder: Portions © 2000-2001 China Broadcasting System, © 2000-2001 China Central TV, © 2000-2001 China National Radio, © 2000-2001 China Television System, © 1987-1989 Dow Jones & Company, Inc., © 1997 The Government of the Hong Kong Special Administrative Region, © 1996-2001 Sinorama Magazine, © 1994-1998 Xinhua News Agency, © 1995, 1999, 2001, 2004, 2005, 2007, 2008, 2012 Trustees of the University of Pennsylvania

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2012T04

DateStamp: 2020-11-30

GetRecord: OAI-PMH request for simple DC format

Search Info
Citation: Hajič, Jan; Ciaramita, Massimiliano; Johansson, Richard; Meyers, Adam; Štěpánek, Jan; Nivre, Joakim; Straňák, Pavel; Surdeanu, Mihai; Xue, Nianwen (Bert); Zhang, Yi. 2012. Linguistic Data Consortium.
Terms: area_Asia area_Europe country_CN country_GB dcmi_Text iso639_cmn iso639_eng iso639_zho olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2012T04
Up-to-date as of: Wed Oct 29 7:01:19 EDT 2025
