OLAC Record: CETEMpublico

OLAC Record
oai:www.ldc.upenn.edu:LDC2001T62

Metadata

Title: CETEMpublico

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Santos, Diana, and Paulo Rocha. CETEMpublico LDC2001T62. Web Download. Philadelphia: Linguistic Data Consortium, 2001

Contributor: Santos, Diana

Contributor: Rocha, Paulo

Date (W3CDTF): 2001

Description: *Introduction* CETEMPublico Version 1.7 (Corpus de Extractos de Textos Electronicos MCT/Publico), produced by the Linguistic Data Consortium (LDC) as catalog number LDC2001T62 with ISBN 1-58563-206-6, is a corpus of texts from the Portuguese daily newspaper Publico. It was compiled for research and development in natural language processing (NLP) by the Computational Processing of Portuguese Project, under an agreement between Publico and the Portuguese Ministry of Science and Technology (MCT).

*Data* The corpus includes the text of approximately 2,600 editions of Publico, published between 1991 and 1998, amounting to approximately 180 million words. CETEMPublico Version 1.7 contains 1,504,258 extracts (Version 1.0 had 1,567,625). Version 1.7 was created in Oslo on August 6, 2001, and uses SGML tagging. The corpus is distributed as 196 compressed text files named cetemXXX.gz, from cetem001.gz to cetem196.gz.

The corpus was designed to assist researchers who develop programs that process Portuguese and who need raw material for their work. The authors also intended it to be useful to anyone who studies the Portuguese language and wishes to test hypotheses against previously organized text material. The online and CQP versions are meant for such users, who are also welcome to obtain the corpus on CD and process it locally with the corpus processing system of their choice. More detailed information is available at http://www.linguateca.pt/cetempublico.

*Updates* There are no updates at this time.

Extent: Corpus size: 466944 KB

Identifier: LDC2001T62

https://catalog.ldc.upenn.edu/LDC2001T62

ISBN: 1-58563-201-5

ISLRN: 544-982-311-455-3

DOI: 10.35111/4sr4-3r57

Language: Portuguese

Language (ISO639): por

License: CETEMPúblico Agreement: https://catalog.ldc.upenn.edu/license/cetempublico-user-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2001T62

Type (DCMI): Text

Type (OLAC): primary_text
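
The Description above states that the corpus ships as 196 compressed files named cetem001.gz through cetem196.gz. The following is a minimal Python sketch of how those names could be enumerated and a file read line by line. The function names, the `latin-1` default encoding, and the local directory layout are assumptions made for illustration; none of them come from the record or from the corpus documentation.

```python
import gzip
from pathlib import Path

# Number of compressed files, as stated in the record's Description.
FILE_COUNT = 196


def cetem_filenames(count=FILE_COUNT):
    """Return the names cetem001.gz .. cetemNNN.gz with a 3-digit index."""
    return [f"cetem{i:03d}.gz" for i in range(1, count + 1)]


def iter_lines(path, encoding="latin-1"):
    """Yield decoded lines from one gzip-compressed corpus file.

    The encoding is an assumption; check the corpus documentation
    for the actual character set before relying on it.
    """
    with gzip.open(path, "rt", encoding=encoding, errors="replace") as handle:
        for line in handle:
            yield line.rstrip("\n")


if __name__ == "__main__":
    names = cetem_filenames()
    print(len(names), names[0], names[-1])

    # Hypothetical local directory holding the downloaded files.
    corpus_dir = Path("cetempublico")
    first = corpus_dir / names[0]
    if first.exists():
        for number, line in enumerate(iter_lines(first)):
            print(line)
            if number >= 4:  # show only the first five lines
                break
```

Reading through `gzip.open` in text mode avoids decompressing the files to disk first, which may matter given the corpus size listed under Extent.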

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2001T62

DateStamp: 2020-11-30

GetRecord: OAI-PMH request for simple DC format
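
The GetRecord entries above refer to OAI-PMH requests for this record in OLAC and simple Dublin Core formats. As a rough sketch, such a request URL can be assembled from the OaiIdentifier with the standard OAI-PMH `verb`, `identifier`, and `metadataPrefix` arguments. The base URL below is a placeholder, not the repository's real endpoint, and the `olac` prefix is assumed to be the one the repository advertises; `oai_dc` is the prefix OAI-PMH defines for simple Dublin Core.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Placeholder endpoint (assumption): replace with the repository's
# actual OAI-PMH base URL.
BASE_URL = "https://example.org/oai"

# Identifier taken from the OaiIdentifier field of this record.
OAI_IDENTIFIER = "oai:www.ldc.upenn.edu:LDC2001T62"


def get_record_url(identifier, metadata_prefix, base_url=BASE_URL):
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


if __name__ == "__main__":
    # "olac" is assumed here; "oai_dc" is simple Dublin Core.
    for prefix in ("olac", "oai_dc"):
        url = get_record_url(OAI_IDENTIFIER, prefix)
        print(url)
        # Round-trip check: the query string decodes back to the inputs.
        print(parse_qs(urlsplit(url).query))
```

`urlencode` percent-encodes the colons in the identifier, which keeps the URL valid without changing the value the server receives.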

Search Info
Citation: Santos, Diana; Rocha, Paulo. 2001. Linguistic Data Consortium.
Terms: area_Europe country_PT dcmi_Text iso639_por olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2001T62
Up-to-date as of: Wed Oct 29 7:00:10 EDT 2025
