OLAC Record oai:lindat.mff.cuni.cz:11858/00-097C-0000-000E-011B-8 |
Metadata | ||
Title: | Corpus of contemporary blogs | |
Bibliographic Citation: | http://hdl.handle.net/11858/00-097C-0000-000E-011B-8 | |
Creator: | Grác, Marek | |
Date (W3CDTF): | 2013-02-26T13:40:06Z | |
Date Available: | 2013-02-26T13:40:06Z | |
Description: | At the NLP Centre, text is currently divided into sentences by a tool that uses a rule-based system. To produce enough training data for machine learning, annotators manually split the corpus of contemporary text CBB.blog (1 million tokens) into sentences. Each file contains one hundredth of the whole corpus, and all data were processed in parallel by two annotators. The corpus was created from ten contemporary blogs: hintzu.otaku.cz, modnipeklo.cz, bloc.cz, aleneprokopova.blogspot.com, blog.aktualne.cz, fuchsova.blog.onaidnes.cz, havlik.blog.idnes.cz, blog.aktualne.centrum.cz, klusak.blogspot.cz, myego.cz/welldone | |
Identifier (URI): | http://hdl.handle.net/11858/00-097C-0000-000E-011B-8 | |
Language: | Czech | |
Language (ISO639): | ces | |
Publisher: | Masaryk University, NLP Centre | |
Rights: | Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0) | |
http://creativecommons.org/licenses/by-nc-nd/3.0/ | ||
Subject: | corpus | |
blogs | ||
annotation | ||
annotators | ||
sentences | ||
machine learning | ||
Type: | corpus | |
Type (DCMI): | Text | |
Type (OLAC): | primary_text | |
OLAC Info | ||
Archive: | LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University | |
Description: | http://www.language-archives.org/archive/lindat.mff.cuni.cz | |
GetRecord: | OAI-PMH request for OLAC format | |
GetRecord: | Pre-generated XML file | |
OAI Info | ||
OaiIdentifier: | oai:lindat.mff.cuni.cz:11858/00-097C-0000-000E-011B-8 | |
DateStamp: | 2021-06-29 | |
GetRecord: | OAI-PMH request for simple DC format | |
Search Info | ||
Citation: | Grác, Marek. 2013. Masaryk University, NLP Centre. | |
Terms: | area_Europe country_CZ dcmi_Text iso639_ces olac_primary_text |