OLAC Record: ACE 2005 Multilingual Training Corpus

OLAC Record
oai:www.ldc.upenn.edu:LDC2006T06

Metadata

Title: ACE 2005 Multilingual Training Corpus

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Walker, Christopher, et al. ACE 2005 Multilingual Training Corpus LDC2006T06. Web Download. Philadelphia: Linguistic Data Consortium, 2006

Contributor: Walker, Christopher

Strassel, Stephanie

Medero, Julie

Maeda, Kazuaki

Date (W3CDTF): 2006

Date Issued (W3CDTF): 2006-02-15

Description: *Introduction*

ACE 2005 Multilingual Training Corpus was developed by the Linguistic Data Consortium (LDC) and contains approximately 1,800 files of mixed genre text in English, Arabic, and Chinese annotated for entities, relations, and events. This represents the complete set of training data in those languages for the 2005 Automatic Content Extraction (ACE) technology evaluation. The genres include newswire, broadcast news, broadcast conversation, weblog, discussion forums, and conversational telephone speech. The data was annotated by LDC with support from the ACE Program and additional assistance from LDC.

The objective of the ACE program was to develop automatic content extraction technology to support automatic processing of human language in text form. In November 2005, sites were evaluated on system performance in five primary areas: the recognition of entities, values, temporal expressions, relations, and events. Entity, relation, and event mention detection were also offered as diagnostic tasks. All tasks with the exception of event tasks were performed for three languages, English, Chinese, and Arabic. Events tasks were evaluated in English and Chinese only. This release comprises the official training data for these evaluation tasks.

For more information about linguistic resources for the ACE Program, including annotation guidelines, task definitions and other documentation, see LDC's ACE website.

*Data*

Below is information about the amount of data in this release and its annotation status. Further information such as breakdown of genres and formats can be found in the associated README file.

* 1P: data subject to first pass (complete) annotation
* DUAL: data also subject to dual first pass (complete) annotation
* ADJ: data also subject to discrepancy resolution/adjudication
* NORM: data also subject to TIMEX2 normalization

English
            1P       DUAL     ADJ      NORM
words       303833   297185   216545   259889
files       666      650      535      599

Chinese
Note: Chinese data expressed in terms of characters. We assume a correspondence of roughly 1.5 characters/word.
            1P       DUAL     ADJ
chars       334121   325834   307991
files       687      671      633

Arabic
            1P       DUAL     ADJ
words       112233   103504   100114
files       433      409      403

*Samples*

For examples of the data in this publication, please review the following samples:

* Arabic (XML)
* English (XML)
* Chinese (XML)

*Updates*

None at this time.
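To compare the Chinese figures with the English and Arabic word counts, the stated ratio of roughly 1.5 characters per word can be applied to the character counts. The following is a minimal sketch of that arithmetic; the function and variable names are illustrative only, and the results are rough estimates, not counts published in this record.

```python
# Character counts for the Chinese data, copied from the Description field.
chinese_chars = {"1P": 334121, "DUAL": 325834, "ADJ": 307991}

# Ratio stated in the record ("roughly 1.5 characters/word"); an approximation.
CHARS_PER_WORD = 1.5


def estimate_words(char_count, chars_per_word=CHARS_PER_WORD):
    """Return an approximate word count for a given character count."""
    return round(char_count / chars_per_word)


# Estimated word-equivalents per annotation stage (illustrative only).
chinese_words = {stage: estimate_words(n) for stage, n in chinese_chars.items()}

for stage, words in chinese_words.items():
    print(f"{stage}: about {words} words")
```

Under this assumption the first-pass Chinese data corresponds to roughly 223,000 words, which places it between the Arabic and English first-pass totals.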

Extent: Corpus size: 1572864 KB

Identifier: LDC2006T06

https://catalog.ldc.upenn.edu/LDC2006T06

ISBN: 1-58563-376-3

ISLRN: 458-031-085-383-4

DOI: 10.35111/mwxc-vh88

Language: Mandarin Chinese

Standard Arabic

English

Language (ISO639): cmn

arb

eng

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2006T06

Rights Holder: Portions © 2000-2003 Agence France Presse, © 2003 The Associated Press, © 2003 New York Times, © 2000-2001, 2003 Xinhua News Agency, © 2003 Cable News Network LP, LLLP, © 2000-2001 SPH AsiaOne Ltd, © 2000-2001 China Broadcasting System, © 2000-2001 China National Radio, © 2000-2001 China Television System, © 2000-2001 China Central TV, © 2000-2001 Al Hayat, © 2000-2001 An-Nahar, © 2000-2001 Nile TV, © 2005, 2006 Trustees of the University of Pennsylvania

Subject (OLAC): computational_linguistics

Type (DCMI): Text

Type (Discourse): dialogue

report

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2006T06

DateStamp: 2024-04-29

GetRecord: OAI-PMH request for simple DC format
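The GetRecord entries above refer to OAI-PMH requests for this record. As a hedged sketch, such a request is an HTTP GET whose query string carries the verb, the record identifier, and a metadata prefix. The base URL below is a hypothetical placeholder, since this record does not state the repository endpoint; `oai_dc` is the standard OAI-PMH prefix for simple Dublin Core, and `olac` is assumed here to be the prefix for the OLAC format.

```python
from urllib.parse import urlencode

# Hypothetical placeholder: the actual OAI-PMH base URL is not given in this record.
BASE_URL = "https://example.org/oai"

# Identifier taken from the OaiIdentifier field of this record.
OAI_IDENTIFIER = "oai:www.ldc.upenn.edu:LDC2006T06"


def get_record_url(identifier, metadata_prefix, base_url=BASE_URL):
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


# Simple Dublin Core request, and an OLAC request (prefix assumed to be "olac").
dc_url = get_record_url(OAI_IDENTIFIER, "oai_dc")
olac_url = get_record_url(OAI_IDENTIFIER, "olac")
print(dc_url)
print(olac_url)
```

The sketch only builds the URLs; fetching and parsing the XML response is left out because the endpoint is not known from this record.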

Search Info
Citation: Walker, Christopher; Strassel, Stephanie; Medero, Julie; Maeda, Kazuaki. 2006. Linguistic Data Consortium.
Terms: area_Asia area_Europe country_CN country_GB country_SA dcmi_Text iso639_arb iso639_cmn iso639_eng olac_computational_linguistics olac_dialogue olac_primary_text olac_report

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2006T06
Up-to-date as of: Wed Oct 29 7:00:54 EDT 2025
