OLAC Record
oai:lindat.mff.cuni.cz:11372/LRT-2638

Metadata
Title: CEHugeWebCorpus
Bibliographic Citation: http://hdl.handle.net/11372/LRT-2638
Creator: Rüdiger, Jan Oliver
Date (W3CDTF): 2020-01-21T08:05:49Z
Date Available: 2020-01-21T08:05:49Z
Description: This corpus was originally created for performance testing of the CorpusExplorer server infrastructure (see diskurslinguistik.net / diskursmonitor.de). It contains the German-only filtered data from CommonCrawl (as of March 2018). First, the URLs were filtered by top-level domain (de, at, ch). The texts were then classified with NTextCat, and only texts identified unambiguously as German were included in the corpus. Finally, the texts were annotated with TreeTagger (token, lemma, part-of-speech). Size: 2.58 million documents, 232.87 million sentences, 3.021 billion tokens. CorpusExplorer (http://hdl.handle.net/11234/1-2634) can convert this data into various other corpus formats (XML, JSON, WebLicht, TXM and many more).
Identifier (URI): http://hdl.handle.net/11372/LRT-2638
Language: German
Language (ISO639): deu
Publisher: Rüdiger, Jan Oliver
Rights: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
http://creativecommons.org/licenses/by-nc-sa/4.0/
Subject: corpus
German
Germanistik
Web corpus
web corpus
web corpora
CorpusExplorer
Type: corpus
Type (DCMI): Text
Type (OLAC): primary_text
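
Note on the Description: the first filtering step it mentions (keeping only URLs under the de, at and ch top-level domains) could look roughly like the sketch below. This is an illustration only, not the code used to build the corpus. The names has_allowed_tld and filter_urls are hypothetical, and the later steps (language classification with NTextCat, annotation with TreeTagger) are not shown.

```python
from urllib.parse import urlsplit

# Top-level domains named in the Description (Germany, Austria, Switzerland).
ALLOWED_TLDS = {"de", "at", "ch"}


def has_allowed_tld(url: str) -> bool:
    """Return True if the URL's host ends in one of the allowed TLDs.

    Hypothetical helper: URLs without a scheme/host (or malformed ones)
    are rejected rather than guessed at.
    """
    try:
        host = urlsplit(url).hostname  # None when there is no network location
    except ValueError:
        return False  # malformed URL, e.g. unbalanced IPv6 brackets
    if not host:
        return False
    # Take the last dot-separated label of the host name as the TLD.
    tld = host.rstrip(".").rsplit(".", 1)[-1].lower()
    return tld in ALLOWED_TLDS


def filter_urls(urls):
    """Keep only URLs whose host is under an allowed TLD, preserving order."""
    return [u for u in urls if has_allowed_tld(u)]


if __name__ == "__main__":
    sample = [
        "https://www.example.de/seite",
        "https://example.com/de/",
        "http://example.at",
    ]
    for url in filter_urls(sample):
        print(url)
```

Only the host name is inspected, so a path segment such as /de/ on a .com site does not count as a match.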

OLAC Info

Archive:  LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University
Description:  http://www.language-archives.org/archive/lindat.mff.cuni.cz
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:lindat.mff.cuni.cz:11372/LRT-2638
DateStamp:  2021-06-29
GetRecord:  OAI-PMH request for simple DC format

Search Info

Citation: Rüdiger, Jan Oliver. 2020. CEHugeWebCorpus. Rüdiger, Jan Oliver.
Terms: area_Europe country_DE dcmi_Text iso639_deu olac_primary_text


http://www.language-archives.org/item.php/oai:lindat.mff.cuni.cz:11372/LRT-2638
Up-to-date as of: Thu Oct 5 0:40:52 EDT 2023