OLAC Record
oai:www.ldc.upenn.edu:LDC2004T09

Metadata
Title:TIDES Extraction (ACE) 2003 Multilingual Training Data
Access Rights:Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining
Bibliographic Citation:Mitchell, Alexis, et al. TIDES Extraction (ACE) 2003 Multilingual Training Data LDC2004T09. Web Download. Philadelphia: Linguistic Data Consortium, 2004
Contributor:Mitchell, Alexis
Strassel, Stephanie
Przybocki, Mark
Davis, JK
Doddington, George R.
Grishman, Ralph
Meyers, Adam
Brunstein, Ada
Ferro, Lisa
Sundheim, Beth
Date (W3CDTF):2004
Date Issued (W3CDTF):2004-02-16
Description:*Introduction*
TIDES Extraction (ACE) 2003 Multilingual Training Data was produced by the Linguistic Data Consortium (LDC) and contains approximately 231,000 words of broadcast news and newswire text in Arabic, Chinese, and English annotated for entities and relations. This corpus was created and previously distributed by Linguistic Data Consortium as an e-corpus (catalog number LDC2003E18) to support the September 2003 TIDES Extraction (ACE) program evaluation. For more information about ACE annotation and ongoing ACE corpus development, including annotation guidelines, task definitions, annotation tools and other project documentation, please visit LDC's ACE Project page.

*Data*
The source material for this corpus consists of broadcast and newswire data drawn from October 2000 through the end of December 2000. The sources are listed below with details and whether they include both Entity Detection and Tracking (EDT) and Relation Detection and Characterization (RDC).

Language             Genre           Source / Program                                  Words  Files
Arabic (EDT)         Newswire        Agence France-Presse                             11,154     66
                                     Al-Hayat                                          7,437     20
                                     An-Nahar                                          7,734     20
                     Broadcast News  Voice of America Arabic news programs             8,360     57
                                     Nile TV                                           7,512     43
                                     Totals                                           42,197    206
Chinese (EDT) (RDC)  Newswire        Xinhua                                           28,157     57
                                     Zaobao                                           25,591     42
                     Broadcast News  China National Radio                              4,758     21
                                     China Television System                           7,160     22
                                     Voice of America Chinese news programs           18,160     42
                                     China TV Program Agency                           6,017     18
                                     China Broadcasting System                         8,130     19
                                     Totals                                           97,973    221
English (EDT) (RDC)  Newswire        New York Times                                   18,983     24
                                     Associated Press Worldstream                     38,222     81
                     Broadcast News  Cable News Network "Headline News"                5,706     54
                                     American Broadcasting Co. "World News Tonight"    4,453     15
                                     Public Radio International "The World"            9,785     27
                                     Voice of America English news programs            4,203     28
                                     MSNBC "The News With Brian Williams"              4,356      8
                                     National Broadcasting Company "Nightly News"      4,976     15
                                     Totals                                           90,684    252
Grand Totals                                                                         230,854    679

This publication includes both the source data files in .sgm format and the annotation files in ACE Pilot Format (.apf.xml), as well as the ACE DTD and supporting documentation. The data files for each language are divided by source type (bnews, nwire). For Chinese, the annotation files (.apf.xml) are encoded in UTF-8. We have included source files (.sgm) in both GB and UTF-8 encoding.

*Samples*
Please view these samples:
* Arabic Source (sgm)
* Arabic Annotation (apf)
* Chinese Source (sgm)
* Chinese Annotation (apf)
* English Source (sgm)
* English Annotation (apf)

*Updates*
There are no updates available at this time.

© 2000 American Broadcasting Corporation © 2000 Cable News Network, Inc. © 2000 Press Association, Inc. © 2000 New York Times © 2000 National Broadcasting Company, Inc. © 2000 Public Radio International © 2000 Agency France Press © 2000 Al Hayat © 2000 An-Nahar © 2000 Nile TV © 2000 Xinhua News © 2000 SPH AsiaOne Ltd. © 2000 China National Radio © 2000 China Television System © 2000 China TV Program Agency © 2000 China Broadcasting System
Extent:Corpus size: 28672 KB
Identifier:LDC2004T09
https://catalog.ldc.upenn.edu/LDC2004T09
ISBN: 1-58563-292-9
ISLRN: 685-740-491-198-0
DOI: 10.35111/7xtm-ys65
Language:English
Standard Arabic
Mandarin Chinese
Language (ISO639):eng
arb
cmn
License:LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf
Medium:Distribution: Web Download
Publisher:Linguistic Data Consortium
Publisher (URI):https://www.ldc.upenn.edu
Relation (URI):https://catalog.ldc.upenn.edu/docs/LDC2004T09
Rights Holder: "The World" is a co-production of Public Radio International and the British Broadcasting Corporation and is produced at WGBH Boston. © 2000 American Broadcasting Corporation © 2000 Cable News Network, Inc. © 2000 Press Association, Inc. © 2000 New York Times © 2000 National Broadcasting Company, Inc. © 2000 Public Radio International © 2000 Agency France Press © 2000 Al Hayat © 2000 An-Nahar © 2000 Nile TV © 2000 Xinhua News © 2000 SPH AsiaOne Ltd. © 2000 China National Radio © 2000 China Television System © 2000 China TV Program Agency © 2000 China Broadcasting System
Type (DCMI):Text
Type (OLAC):primary_text

OLAC Info

Archive:  The LDC Corpus Catalog
Description:  http://www.language-archives.org/archive/www.ldc.upenn.edu
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:www.ldc.upenn.edu:LDC2004T09
DateStamp:  2024-03-12
GetRecord:  OAI-PMH request for simple DC format
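
The GetRecord entries in this record are links to OAI-PMH requests whose URLs are not preserved here. As a rough sketch of how such a request is typically formed, the Python below builds GetRecord URLs from the OaiIdentifier above. The base URL is a hypothetical placeholder (this record does not state the repository endpoint), and the metadata prefixes `olac` and `oai_dc` are assumed to correspond to the "OLAC format" and "simple DC format" links.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: the record does not give the repository base URL.
BASE_URL = "https://example.org/oai"

# Identifier taken from the OaiIdentifier field of this record.
OAI_IDENTIFIER = "oai:www.ldc.upenn.edu:LDC2004T09"


def get_record_url(identifier: str, metadata_prefix: str,
                   base_url: str = BASE_URL) -> str:
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode({
        "verb": "GetRecord",                # OAI-PMH verb for a single record
        "identifier": identifier,           # the record's OAI identifier
        "metadataPrefix": metadata_prefix,  # requested metadata format
    })
    return f"{base_url}?{query}"


# Assumed prefixes: "olac" for OLAC format, "oai_dc" for simple Dublin Core.
olac_url = get_record_url(OAI_IDENTIFIER, "olac")
dc_url = get_record_url(OAI_IDENTIFIER, "oai_dc")

print(olac_url)
print(dc_url)
```

Replacing `BASE_URL` with the repository's actual endpoint should yield requests that can be fetched with any HTTP client; the colons in the identifier are percent-encoded by `urlencode`.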

Search Info

Citation: Mitchell, Alexis; Strassel, Stephanie; Przybocki, Mark; Davis, JK; Doddington, George R.; Grishman, Ralph; Meyers, Adam; Brunstein, Ada; Ferro, Lisa; Sundheim, Beth. 2004. TIDES Extraction (ACE) 2003 Multilingual Training Data. Linguistic Data Consortium.
Terms: area_Asia area_Europe country_CN country_GB country_SA dcmi_Text iso639_arb iso639_cmn iso639_eng olac_primary_text


http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2004T09
Up-to-date as of: Fri Dec 6 7:46:54 EST 2024