OLAC Record oai:lindat.mff.cuni.cz:11234/1-4835 |
Metadata | ||
Title: | Czech Web Corpus 2017 (csTenTen17) | |
Bibliographic Citation: | http://hdl.handle.net/11234/1-4835 | |
Creator: | Suchomel, Vít | |
Date (W3CDTF): | 2022-09-15T14:18:30Z | |
Date Available: | 2022-09-15T14:18:30Z | |
Description: | The Czech Web Corpus 2017 (csTenTen17) is a Czech corpus made up of texts collected from the Internet, mostly from the Czech national top-level domain ".cz". The data was crawled by the web crawler SpiderLing (https://corpus.tools/wiki/SpiderLing).
The data was cleaned by removing boilerplate (using https://corpus.tools/wiki/Justext), removing near-duplicate paragraphs (by https://corpus.tools/wiki/Onion) and discarding paragraphs not in the target language.
The corpus was POS-annotated by the morphological analyser Majka using the following tagset: https://www.sketchengine.eu/tagset-reference-for-czech/.
Text sources: General web, Wikipedia.
Time span of crawling: May, October and November 2017; October and November 2016; October and November 2015. The Czech Wikipedia part was downloaded in November 2017.
Data format: Plain text, vertical (one token per line), gzip compressed. There are the following structures in the vertical: Documents ( to | |
Identifier (URI): | http://hdl.handle.net/11234/1-4835 | |
Language: | Czech | |
Language (ISO639): | ces | |
Publisher: | Masaryk University, NLP Centre | |
Lexical Computing CZ s.r.o. | ||
Rights: | NLP Centre Web Corpus License | |
https://lindat.mff.cuni.cz/repository/xmlui/page/license-NLPC-WeC | ||
Subject: | Web corpus | |
Type: | corpus | |
Type (DCMI): | Text | |
Type (OLAC): | primary_text | |
OLAC Info | ||
Archive: | LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University | |
Description: | http://www.language-archives.org/archive/lindat.mff.cuni.cz | |
GetRecord: | OAI-PMH request for OLAC format | |
GetRecord: | Pre-generated XML file | |
OAI Info | ||
OaiIdentifier: | oai:lindat.mff.cuni.cz:11234/1-4835 | |
DateStamp: | 2023-01-10 | |
GetRecord: | OAI-PMH request for simple DC format | |
Search Info | ||
Citation: | Suchomel, Vít. 2022. Czech Web Corpus 2017 (csTenTen17). Masaryk University, NLP Centre. | |
Terms: | area_Europe country_CZ dcmi_Text iso639_ces olac_primary_text |
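
The Description above states that the data is distributed as gzip-compressed plain text in vertical format (one token per line). The following is a minimal sketch of how such a file might be read in Python. Several details are assumptions, not facts taken from this record: that token attributes are tab-separated, that structure lines are the ones that start with "<", end with ">" and contain no tab, and that the file is UTF-8 encoded. The function name read_vertical is hypothetical.

```python
import gzip
from typing import Iterator, List, Tuple


def read_vertical(path: str) -> Iterator[Tuple[str, List[str]]]:
    """Stream a gzip-compressed vertical file line by line.

    Yields ("tag", [line]) for structure lines and ("token", columns)
    for token lines. Assumptions (not confirmed by the record):
    columns are tab-separated and the encoding is UTF-8.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for raw in handle:
            line = raw.rstrip("\n")
            if not line:
                continue  # skip blank lines
            # Heuristic: a structure line looks like markup and has no
            # tab; a token such as "<" with attributes contains tabs.
            is_tag = (
                line.startswith("<")
                and line.endswith(">")
                and "\t" not in line
            )
            if is_tag:
                yield ("tag", [line])
            else:
                yield ("token", line.split("\t"))
```

For example, sum(1 for kind, _ in read_vertical(path) if kind == "token") would count the tokens in a file at a given path without loading the whole corpus into memory. Streaming is used here because web corpora of this kind tend to be large.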