OLAC Record
oai:lindat.mff.cuni.cz:11858/00-097C-0000-000E-011B-8

Metadata
Title:Corpus of contemporary blogs
Bibliographic Citation:http://hdl.handle.net/11858/00-097C-0000-000E-011B-8
Creator:Grác, Marek
Date (W3CDTF):2013-02-26T13:40:06Z
Date Available:2013-02-26T13:40:06Z
Description:At the NLP Centre, text is currently divided into sentences by a tool that uses a rule-based system. To produce enough training data for machine learning, annotators manually split the corpus of contemporary text CBB.blog (1 million tokens) into sentences. Each file contains one hundredth of the whole corpus, and all data were processed in parallel by two annotators. The corpus was created from ten contemporary blogs: hintzu.otaku.cz, modnipeklo.cz, bloc.cz, aleneprokopova.blogspot.com, blog.aktualne.cz, fuchsova.blog.onaidnes.cz, havlik.blog.idnes.cz, blog.aktualne.centrum.cz, klusak.blogspot.cz, myego.cz/welldone
Identifier (URI):http://hdl.handle.net/11858/00-097C-0000-000E-011B-8
Language:Czech
Language (ISO639):ces
Publisher:Masaryk University, NLP Centre
Rights:Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0)
http://creativecommons.org/licenses/by-nc-nd/3.0/
Subject:corpus; blogs; annotation; annotators; sentences; machine learning
Type:corpus
Type (DCMI):Text
Type (OLAC):primary_text

OLAC Info

Archive:  LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University
Description:  http://www.language-archives.org/archive/lindat.mff.cuni.cz
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:lindat.mff.cuni.cz:11858/00-097C-0000-000E-011B-8
DateStamp:  2021-06-29
GetRecord:  OAI-PMH request for simple DC format
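The GetRecord entries above refer to standard OAI-PMH requests. As a rough sketch of how such a request URL is put together: the record does not state the repository's OAI-PMH base URL, so BASE_URL below is a hypothetical placeholder, and get_record_url is an illustrative helper, not part of any library. The prefix "oai_dc" is the OAI-PMH name for simple Dublin Core; the OLAC format is assumed to be exposed under the prefix "olac".

```python
from urllib.parse import urlencode

# Hypothetical endpoint: the record does not give the repository's
# OAI-PMH base URL, so replace this with the real one before use.
BASE_URL = "https://example.org/oai/request"


def get_record_url(identifier: str,
                   metadata_prefix: str = "oai_dc",
                   base_url: str = BASE_URL) -> str:
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode({
        "verb": "GetRecord",                 # OAI-PMH verb for a single record
        "identifier": identifier,            # the OaiIdentifier of the record
        "metadataPrefix": metadata_prefix,   # "oai_dc" = simple DC; "olac" assumed for OLAC
    })
    return f"{base_url}?{query}"


# Request URL for this record in simple DC format.
url = get_record_url("oai:lindat.mff.cuni.cz:11858/00-097C-0000-000E-011B-8")
print(url)
```

Passing metadata_prefix="olac" would yield the corresponding OLAC-format request, provided the repository supports that prefix.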

Search Info

Citation: Grác, Marek. 2013. Masaryk University, NLP Centre.
Terms: area_Europe country_CZ dcmi_Text iso639_ces olac_primary_text


http://www.language-archives.org/item.php/oai:lindat.mff.cuni.cz:11858/00-097C-0000-000E-011B-8
Up-to-date as of: Thu Oct 5 0:38:54 EDT 2023