OLAC Record: NIST 2008-2012 Open Machine Translation (OpenMT) Progress Test Sets

OLAC Record
oai:www.ldc.upenn.edu:LDC2013T07

Metadata

Title: NIST 2008-2012 Open Machine Translation (OpenMT) Progress Test Sets

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: NIST Multimodal Information Group. NIST 2008-2012 Open Machine Translation (OpenMT) Progress Test Sets LDC2013T07. Web Download. Philadelphia: Linguistic Data Consortium, 2013

Contributor: NIST Multimodal Information Group

Date (W3CDTF): 2013

Date Issued (W3CDTF): 2013-04-15

Description: *Introduction* NIST 2008-2012 Open Machine Translation (OpenMT) Progress Test Sets was developed by the NIST Multimodal Information Group. This release contains the evaluation sets (source data and human reference translations), DTD, scoring software, and evaluation plans for the Arabic-to-English and Chinese-to-English progress test sets for the NIST OpenMT 2008, 2009, and 2012 evaluations. The test data remained unseen between evaluations and was reused unchanged each time. The package was compiled, and the scoring software was developed, at NIST, making use of Chinese and Arabic newswire and web data and reference translations collected and developed by the Linguistic Data Consortium (LDC).

The objective of the OpenMT evaluation series is to support research in, and help advance the state of the art of, machine translation (MT) technologies -- technologies that translate text between human languages. Input may include all forms of text. The goal is for the output to be an adequate and fluent translation of the original.

The MT evaluation series started in 2001 as part of the DARPA TIDES (Translingual Information Detection, Extraction and Summarization) program. Beginning with the 2006 evaluation, the evaluations have been driven and coordinated by NIST as NIST OpenMT. These evaluations provide an important contribution to the direction of research efforts and the calibration of technical capabilities in MT. The OpenMT evaluations are intended to be of interest to all researchers working on the general problem of automatic translation between human languages. To this end, they are designed to be simple, to focus on core technology issues and to be fully supported. For more general information about the NIST OpenMT evaluations, please refer to the NIST OpenMT website.

This evaluation kit includes a single Perl script (mteval-v13a.pl) that may be used to produce a translation quality score for one or more MT systems.
The script works by comparing the system output translation with a set of (expert) reference translations of the same source text. Comparison is based on finding sequences of words in the reference translations that match word sequences in the system output translation.

LDC has released the following associated corpora:

* NIST 2008 Open Machine Translation (OpenMT) Evaluation (LDC2010T21)
* NIST 2009 Open Machine Translation (OpenMT) Evaluation (LDC2010T23)
* NIST 2012 Open Machine Translation (OpenMT) Evaluation (LDC2013T03)

*Data* This release contains 2,748 documents with corresponding source and reference files, the latter of which contain four independent human reference translations of the source data. The source data consists of Arabic and Chinese newswire and web data collected by LDC in 2007. The table below displays statistics by source, genre, documents, segments and source tokens.

Source    Genre      Documents  Segments  Source Tokens
Arabic    Newswire   84         784       20039
Arabic    Web Data   51         594       14793
Chinese   Newswire   82         688       26923
Chinese   Web Data   40         682       19112

The token counts for Chinese data are character counts, which were obtained by counting tokens matching the Unicode-based regular expression \w. The Python re module was used to obtain those counts.

The data in this package are in XML format compliant with the included DTD.

*Samples* Please consult the following source sample and translation sample.

*Updates* None at this time.
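The Description says the scoring script compares word sequences in the system output against the reference translations. The Python sketch below illustrates that general idea only; it is not the code of mteval-v13a.pl. The function names, the whitespace tokenization, and the rule of clipping each match at the highest count seen in any single reference are all assumptions made for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def matched_ngrams(hypothesis, references, n):
    """Return (matched, total) n-gram counts for one system output segment.

    Assumption: tokens are split on whitespace, and a hypothesis n-gram is
    credited at most as many times as it occurs in any single reference.
    """
    hyp_counts = ngrams(hypothesis.split(), n)

    # Highest count of each n-gram across the individual references.
    max_ref_counts = Counter()
    for reference in references:
        for gram, count in ngrams(reference.split(), n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)

    matched = sum(min(count, max_ref_counts[gram])
                  for gram, count in hyp_counts.items())
    return matched, sum(hyp_counts.values())


# Hypothetical segment and references, not taken from the corpus.
system_output = "the cat sat on a mat"
reference_set = ["the cat is on the mat", "a cat sat on the mat"]
print(matched_ngrams(system_output, reference_set, 1))
print(matched_ngrams(system_output, reference_set, 2))
```

A real scorer applies its own tokenization and combines the counts for several n-gram lengths into one score, so these raw counts are not comparable to the script's output.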
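The Chinese character counts described above can be approximated with a few lines of Python. This is a minimal sketch that assumes the text has already been extracted from the XML files; the function name and the sample string are hypothetical, and the exact script and preprocessing used to produce the published figures are not given in this record.

```python
import re

# Unicode-aware \w: letters, digits, CJK ideographs and the underscore.
# Python 3 str patterns are Unicode-aware by default; the flag is passed
# explicitly only to mirror the "UNICODE-based" wording of the Description.
WORD_CHARACTER = re.compile(r"\w", re.UNICODE)


def count_word_characters(text):
    """Return the number of single characters in text that match \\w."""
    return len(WORD_CHARACTER.findall(text))


# Hypothetical sample: 6 ideographs + 4 digits + 1 ideograph.
# The full-width comma and full stop are punctuation and are not counted.
sample = "新华社北京电，2007年。"
print(count_word_characters(sample))  # prints 11
```

Because \w also matches Latin letters, digits and the underscore, any such characters embedded in the Chinese source text would be included in a count taken this way.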

Extent: Corpus size: 4384 KB

Identifier: LDC2013T07

https://catalog.ldc.upenn.edu/LDC2013T07

ISBN: 1-58563-640-1

ISLRN: 112-444-010-598-0

DOI: 10.35111/xh7k-8m27

Language: English

Mandarin Chinese

Arabic

Chinese

Language (ISO639): eng

cmn

ara

zho

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2013T07

Rights Holder: Portions © 2007 Agence France Presse, Al-Ahram, Al Hayat, Al-Quds Al-Arabi, Asharq Al-Awsat, An Nahar, Assabah, China Military Online, Chinanews.com, Guangming Daily, Xinhua News Agency, © 2007, 2013 Trustees of the University of Pennsylvania

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2013T07

DateStamp: 2020-11-30

GetRecord: OAI-PMH request for simple DC format

Search Info
Citation: NIST Multimodal Information Group. 2013. Linguistic Data Consortium.
Terms: area_Asia area_Europe country_CN country_GB dcmi_Text iso639_ara iso639_cmn iso639_eng iso639_zho olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2013T07
Up-to-date as of: Wed Oct 29 7:00:42 EDT 2025
