OLAC Record: REFLEX Entity Translation Training/DevTest

OLAC Record
oai:www.ldc.upenn.edu:LDC2009T11

Metadata

Title: REFLEX Entity Translation Training/DevTest

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Walker, Christopher, et al. REFLEX Entity Translation Training/DevTest LDC2009T11. Web Download. Philadelphia: Linguistic Data Consortium, 2009

Contributor: Walker, Christopher

Chen, Song

Strassel, Stephanie

Medero, Julie

Maeda, Kazuaki

Date (W3CDTF): 2009

Date Issued (W3CDTF): 2009-03-17

Description: *Introduction* REFLEX Entity Translation Training/DevTest was developed by the Linguistic Data Consortium for the Automatic Content Extraction (ACE) program. This release constitutes the complete set of training data and development test data for the 2007 REFLEX Entity Translation evaluation sponsored by the National Institute of Standards and Technology (NIST) and consists of approximately 67.5k words of newswire and weblog text for each of three languages: English, Chinese and Arabic. The data set is made up of 22.5k words of English data, 22.5k words of Chinese data, and 22.5k words of Arabic data, each translated into the other two languages and annotated for entities and TIMEX2 extents and normalization.

*Entity Annotation* The annotations identify seven types of entities: Person, Organization, Location, Facility, Weapon, Vehicle and GeoPolitical Entity. Each type is further divided into subtypes (for instance, Person subtypes include Individual, Group and Indefinite). Annotators tagged all mentions of each entity within a document, whether named, nominal or pronominal. For every mention, the annotator identified the maximal extent of the string that represents the entity and labeled the head of each mention. Nested mentions were also captured. Each entity was classified according to its type and subtype. Each entity mention was further tagged according to its class, such as specific, generic, attributive, negatively quantified or underspecified. Annotators also reviewed the entire document to group mentions of the same entity together; they also labeled cases of metonymy, where the name of one entity is used to refer to another entity (or entities) related to it.

*TIMEX2 Annotation* TIMEX2 annotation of events and temporal relations fulfills two objectives. The first is the interpretation of expressions that refer to time. Such expressions tell when something happened, how long something lasted, or how often something occurs. Understanding them often requires knowledge of the temporal context. The second objective is the normalization of temporal expressions, which facilitates interoperability between systems. Problems occur, for example, when a programmer in France encodes "October sixteenth 1962" as "1962.10.16" and one in the U.S. encodes it as "10/16/1962": it will appear as if two different dates are being referenced. The standards presented here require that the same meaning is always encoded in the same way.

*Sample* Please use this link for a sample.
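The normalization objective described under TIMEX2 Annotation can be illustrated with a minimal Python sketch. It is not the TIMEX2 tooling or the corpus format: the function name `normalize_date`, the list `KNOWN_FORMATS`, and the choice of an ISO 8601 style calendar date as the shared form are all assumptions made for illustration. It only shows how the two encodings from the example collapse to one value once both are mapped to the same representation.

```python
from datetime import datetime

# Hypothetical set of input encodings, taken from the example in the
# description; a real normalizer would need far more than two patterns.
KNOWN_FORMATS = ("%Y.%m.%d", "%m/%d/%Y")


def normalize_date(text: str) -> str:
    """Return a YYYY-MM-DD string for a date written in a known format."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            # This pattern did not match; try the next one.
            continue
    raise ValueError(f"unrecognized date format: {text!r}")


french = normalize_date("1962.10.16")
american = normalize_date("10/16/1962")

# Both encodings now compare equal because they share one representation.
print(french, american, french == american)
```

Trying each pattern in turn keeps the sketch short, but it would misread ambiguous strings such as "10/11/1962", which is one reason a single agreed encoding is preferable to guessing at parse time.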

Extent: Corpus size: 331776 KB

Identifier: LDC2009T11

https://catalog.ldc.upenn.edu/LDC2009T11

ISBN: 1-58563-514-6

ISLRN: 364-559-117-639-5

DOI: 10.35111/eqds-c312

Language: English

Mandarin Chinese

Standard Arabic

Arabic

Language (ISO639): eng

cmn

arb

ara

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2009T11

Rights Holder: Portions © 2000, 2003 Agence France-Presse, © 2003 The Associated Press, © 2000 Al Hayat, © 2000, 2002 An Nahar, © 1994-1998, 2000, 2003 Xinhua News Agency, © 1994-2009 Trustees of University of Pennsylvania

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2009T11

DateStamp: 2020-11-30

GetRecord: OAI-PMH request for simple DC format

Search Info
Citation: Walker, Christopher; Chen, Song; Strassel, Stephanie; Medero, Julie; Maeda, Kazuaki. 2009. Linguistic Data Consortium.
Terms: area_Asia area_Europe country_CN country_GB country_SA dcmi_Text iso639_ara iso639_arb iso639_cmn iso639_eng olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2009T11
Up-to-date as of: Wed Oct 29 7:01:07 EDT 2025
