OLAC Record: 1993-2007 United Nations Parallel Text

OLAC Record
oai:www.ldc.upenn.edu:LDC2013T06

Metadata

Title: 1993-2007 United Nations Parallel Text

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Franz, Alex, Shankar Kumar, and Thorsten Brants. 1993-2007 United Nations Parallel Text LDC2013T06. Web Download. Philadelphia: Linguistic Data Consortium, 2013

Contributor: Franz, Alex

Kumar, Shankar

Brants, Thorsten

Date (W3CDTF): 2013

Date Issued (W3CDTF): 2013-03-15

Description: *Introduction* 1993-2007 United Nations Parallel Text was developed by Google Research. It consists of United Nations (UN) parliamentary documents from 1993 through 2007 in the official languages of the UN: Arabic, Chinese, English, French, Russian, and Spanish. There are 673,670 raw text documents and 520,283 word alignment documents. UN parliamentary documents are available from the UN Official Document System (UN ODS) at http://ods.un.org/. UN ODS, in its main UNDOC database, contains the full text of all types of UN parliamentary documents. It has complete coverage dating from 1993 and variable coverage before that. Each document exists in one or more of the six official languages. UN ODS also contains a large number of German documents, marked with the language "other", but these are not included in this dataset. For more information, see the UN ODS documentation at http://documents.un.org/help_E.htm. For more details of the UN bibliographic systems, see http://www.un.org/depts/dhl/unbisref_manual/. LDC has also released parallel UN parliamentary documents in English, French, and Spanish spanning the period 1988-1993 as UN Parallel Text (Complete) (LDC94T4A).

*Data* The data is presented as raw text and word-aligned text. The raw text is very close to what was extracted from the original word processing documents in UN ODS (e.g., Word, WordPerfect, PDF), converted to UTF-8 encoding. The word-aligned text was normalized, tokenized, aligned at the sentence level, further broken into sub-sentential chunk pairs, and then aligned at the word level. The sentence, chunk, and word alignment operations were performed separately for each individual language pair. The files are packaged as tar archives and compressed with the bzip2 utility, which is standard in most Linux releases. Windows users have a variety of decompression options; 7-Zip, for example, handles both the tar and bzip2 formats. Note that in the data/aligned folder, the en-zh-1993.tar.bz2 and en-zh-1994.tar.bz2 archives decompress into empty folders. This is intentional, as there is no Chinese aligned data for those two years.

*Samples* Please view this raw English sample, raw French sample, and aligned English-French sample.

*Updates* None at this time.
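As an alternative to a command-line or GUI tool, the tar and bzip2 archives described above can also be unpacked with Python's standard library. The following is a minimal sketch, not part of the corpus distribution: the function name `extract_archive` and the destination directory are illustrative assumptions, and only the archive name in the usage comment comes from the description. It returns the regular files found in the archive, so an empty list is the expected result for en-zh-1993.tar.bz2 and en-zh-1994.tar.bz2.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Unpack a .tar.bz2 archive into dest_dir and return the names of
    the regular files it contained (directories are not listed)."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, mode="r:bz2") as tar:
        members = tar.getmembers()
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions: refuse unsafe paths and special files.
            tar.extractall(dest, filter="data")
        else:
            tar.extractall(dest)
    return [m.name for m in members if m.isfile()]


# Usage (assumes the archive has already been downloaded; "unpacked" is
# a hypothetical destination directory):
#   files = extract_archive("data/aligned/en-zh-1993.tar.bz2", "unpacked")
#   if not files:
#       print("archive contains no files")
```

Returning the file list rather than printing it makes it easy to tell the intentionally empty archives apart from a failed extraction, which would raise an exception instead.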

Extent: Corpus size: 11810328 KB

Identifier: LDC2013T06

https://catalog.ldc.upenn.edu/LDC2013T06

ISBN: 1-58563-638-X

ISLRN: 375-727-871-052-5

DOI: 10.35111/2ntv-xb56

Language: Spanish

Russian

French

English

Mandarin Chinese

Arabic

Chinese

Language (ISO639): spa

rus

fra

eng

cmn

ara

zho

License: UN Parallel Text Agreement: https://catalog.ldc.upenn.edu/license/un-parallel-text-license.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2013T06

Rights Holder: Portions © 2012 Google Inc., © 1993-2007 United Nations, © 2013 Trustees of the University of Pennsylvania

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2013T06

DateStamp: 2020-11-30

GetRecord: OAI-PMH request for simple DC format
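The GetRecord entries above correspond to OAI-PMH requests, which take a `verb`, an `identifier`, and a `metadataPrefix` as query parameters. The sketch below shows how such a request URL could be assembled. The base URL is a placeholder, since this record does not state the repository endpoint, and the helper name `get_record_url` is hypothetical. The prefix `oai_dc` is the simple Dublin Core format defined by OAI-PMH; `olac` is assumed here to be the prefix for the OLAC format.

```python
from urllib.parse import urlencode

# Placeholder endpoint: the record does not give the real base URL.
BASE_URL = "https://example.org/oai"


def get_record_url(identifier, metadata_prefix, base_url=BASE_URL):
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# Usage with the OAI identifier from this record:
#   get_record_url("oai:www.ldc.upenn.edu:LDC2013T06", "oai_dc")
#   get_record_url("oai:www.ldc.upenn.edu:LDC2013T06", "olac")
```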

Search Info
Citation: Franz, Alex; Kumar, Shankar; Brants, Thorsten. 2013. Linguistic Data Consortium.
Terms: area_Asia area_Europe country_CN country_ES country_FR country_GB country_RU dcmi_Text iso639_ara iso639_cmn iso639_eng iso639_fra iso639_rus iso639_spa iso639_zho olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2013T06
Up-to-date as of: Wed Oct 29 7:01:23 EDT 2025
