OLAC Record: OntoNotes Release 3.0

OLAC Record
oai:www.ldc.upenn.edu:LDC2009T24

Metadata

Title: OntoNotes Release 3.0

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Weischedel, Ralph, et al. OntoNotes Release 3.0 LDC2009T24. Web Download. Philadelphia: Linguistic Data Consortium, 2009

Contributor: Weischedel, Ralph

Pradhan, Sameer

Ramshaw, Lance

Kaufman

Franchini, Michelle

El-Bachouti, Mohammed

Xue, Nianwen

Palmer, Martha

Marcus, Mitchell

Taylor, Ann

Greenberg, Craig

Hovy, Eduard

Belvin, Robert

Houston, Ann

Date (W3CDTF): 2009

Date Issued (W3CDTF): 2009-10-20

Description: *Introduction* The OntoNotes project is a collaborative effort between BBN Technologies, the University of Colorado, the University of Pennsylvania, and the University of Southern California's Information Sciences Institute. The goal of the project is to annotate a large corpus comprising various genres of text (news, conversational telephone speech, weblogs, Usenet, broadcast, talk shows) in three languages (English, Chinese, and Arabic) with structural information (syntax and predicate-argument structure) and shallow semantics (word sense linked to an ontology and coreference). OntoNotes Release 3.0 is a continuation of the OntoNotes project and is supported by the Defense Advanced Research Projects Agency, GALE Program Contract No. HR0011-06-C-0022. OntoNotes Release 1.0 (LDC2007T21) contains 400k words of Chinese newswire data (from Xinhua News Agency and Sinorama Magazine) and 300k words of English newswire data (from the Wall Street Journal). OntoNotes Release 2.0 (LDC2008T04) added the following to the corpus: 274k words of Chinese broadcast news data (from China Broadcasting System, China Central TV, China National Radio, China Television System and Voice of America); and 200k words of English broadcast news data (from ABC, CNN, NBC, Public Radio International and Voice of America).
OntoNotes Release 3.0 incorporates the following new material: 250k words of English newswire data (from the Wall Street Journal and Xinhua News Agency); 200k words of English broadcast news data (from ABC, CNN, NBC, Public Radio International and Voice of America); 200k words of English broadcast conversation material (translated from China Central TV and Phoenix TV); 250k words of Chinese newswire data (from Xinhua News Agency and Sinorama Magazine); 250k words of Chinese broadcast news material (from China Broadcasting System, China Central TV, China National Radio, China Television System and Voice of America); 150k words of Chinese broadcast conversation data (from China Central TV and Phoenix TV); and 200k words of Arabic newswire material (from An Nahar). Natural language applications like machine translation, question answering and summarization are currently forced to depend on impoverished text models like bags of words or n-grams, while the decisions that they make ought to be based on the meanings of those words in context. That lack of semantics causes problems throughout the applications. Misinterpreting the meaning of an ambiguous word results in failing to extract data, incorrect alignments for translation, and ambiguous language models. Incorrect coreference resolution results in missed information (because a connection is not made) or incorrectly conflated information (due to false connections). OntoNotes builds on two time-tested resources, following the Penn Treebank for syntax and the Penn PropBank for predicate-argument structure. Its semantic representation will include word sense disambiguation for nouns and verbs, with each word sense connected to an ontology, and coreference. The current goals call for annotation of over a million words each of English and Chinese, and half a million words of Arabic over five years.
*Data* Each data directory has been stored as a GNU zipped tar file (.tgz) due to the complexity and depth of each directory and the limitations of the ISO 9660 file system for CD and DVD media. These directories may be easily unpacked using the Unix command line or using utilities such as StuffIt or WinZip under Windows. *Samples* * Arabic * Chinese * English *Sponsorship* This work is supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or policy of the Government, and no official endorsement should be inferred. The World is a co-production of Public Radio International and the British Broadcasting Corporation and is produced at WGBH Boston.
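
The unpacking step described under *Data* can also be scripted. The following is a minimal sketch using Python's standard `tarfile` module, not a procedure taken from the corpus documentation. The function name `unpack_tgz` and the archive and destination paths in the usage comment are hypothetical, since this record does not list the actual archive file names.

```python
import tarfile
from pathlib import Path


def unpack_tgz(archive, dest):
    """Extract a gzip-compressed tar archive into dest; return member names."""
    dest = Path(dest).resolve()
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, mode="r:gz") as tar:
        members = tar.getmembers()
        # Refuse members whose paths would land outside the destination.
        for member in members:
            target = (dest / member.name).resolve()
            if target != dest and dest not in target.parents:
                raise ValueError(f"unsafe path in archive: {member.name}")
        tar.extractall(dest, members=members)
    return [member.name for member in members]


# Hypothetical usage; the real archive names come with the LDC download:
# names = unpack_tgz("some-data-directory.tgz", "ontonotes-3.0")
```

The path check is a precaution for archives of unknown origin; it is an addition of this sketch and not something the record itself requires.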

Extent: Corpus size: 445440 KB

Identifier: LDC2009T24

https://catalog.ldc.upenn.edu/LDC2009T24

ISBN: 1-58563-524-3

ISLRN: 591-792-796-939-8

DOI: 10.35111/hrvd-0t12

Language: English

Mandarin Chinese

Standard Arabic

Chinese

Arabic

Language (ISO639): eng

cmn

arb

zho

ara

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2009T24

Rights Holder: Portions © 2000-2001 American Broadcasting Company, © 2002 An Nahar, © 2000-2001 Cable News Network, LP, LLLP, © 2000-2001 China Broadcasting System, © 2000-2001, 2005-2009 China Central TV, © 2000-2001 China National Radio, © 2000-2001 China Television System, © 1989 Dow Jones & Company, Inc., © 2000-2001 National Broadcasting Company, Inc., © 2005-2009 Phoenix TV, © 2000-2001 Public Radio International, © 1996-2001 Sinorama Magazine, © 1994-1998 Xinhua News Agency, © 1995, 2005, 2006, 2007, 2008, 2009 Trustees of the University of Pennsylvania

The World is a co-production of Public Radio International and the British Broadcasting Corporation and is produced at WGBH Boston.

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2009T24

DateStamp: 2020-11-30

GetRecord: OAI-PMH request for simple DC format
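
The GetRecord entries above refer to OAI-PMH requests whose link targets are not preserved in this record. As a rough sketch of how such a request URL is put together, the code below assembles the standard OAI-PMH query parameters (`verb`, `identifier`, `metadataPrefix`). The base URL is a hypothetical placeholder, not the archive's real endpoint, and the use of `olac` as the prefix for the OLAC format is an assumption; `oai_dc` is the prefix the OAI-PMH specification defines for simple Dublin Core.

```python
from urllib.parse import urlencode

# Hypothetical placeholder: this record does not state the real endpoint.
BASE_URL = "https://example.org/oai"


def get_record_url(identifier, metadata_prefix, base_url=BASE_URL):
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


oai_id = "oai:www.ldc.upenn.edu:LDC2009T24"
olac_url = get_record_url(oai_id, "olac")    # OLAC format (assumed prefix)
dc_url = get_record_url(oai_id, "oai_dc")    # simple DC format
```

Fetching either URL from a real endpoint would be expected to return an XML document that wraps the record's metadata.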

Search Info
Citation: Weischedel, Ralph; Pradhan, Sameer; Ramshaw, Lance; Kaufman; Franchini, Michelle; El-Bachouti, Mohammed; Xue, Nianwen; Palmer, Martha; Marcus, Mitchell; Taylor, Ann; Greenberg, Craig; Hovy, Eduard; Belvin, Robert; Houston, Ann. 2009. Linguistic Data Consortium.
Terms: area_Asia area_Europe country_CN country_GB country_SA dcmi_Text iso639_ara iso639_arb iso639_cmn iso639_eng iso639_zho olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2009T24
Up-to-date as of: Wed Oct 29 7:01:09 EDT 2025
