OLAC Record
oai:lindat.mff.cuni.cz:11234/1-2615

Metadata
Title:SumeCzech
Bibliographic Citation:http://hdl.handle.net/11234/1-2615
Creator:Straka, Milan
Mediankin, Nikita
Kocmi, Tom
Žabokrtský, Zdeněk
Hudeček, Vojtěch
Hajič, Jan
Date (W3CDTF):2020-01-10T09:44:46Z
Date Available:2020-01-10T09:44:46Z
Description:This entry contains the SumeCzech dataset and the metric RougeRAW used for evaluation. Both the dataset and the metric are described in the paper "SumeCzech: Large Czech News-Based Summarization Dataset" by Milan Straka et al. The dataset is distributed as a set of Python scripts which download the raw HTML pages from CommonCrawl and then process them into the required format. The MPL 2.0 license applies to the scripts downloading the dataset and to the RougeRAW implementation. Note: sumeczech-1.0-update-230225.zip is the updated release of the SumeCzech download script, including the original RougeRAW evaluation metric. The download script was modified to use the updated CommonCrawl download URL and to support Python 3.10 and Python 3.11. The downloaded dataset itself is unchanged. The original archive sumeczech-1.0.zip was renamed to sumeczech-1.0-obsolete-180213.zip and is kept for reference.
Identifier (URI):http://hdl.handle.net/11234/1-2615
Language:Czech
Language (ISO639):ces
Publisher:Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Rights:Mozilla Public License 2.0
http://opensource.org/licenses/MPL-2.0
Subject:summarization
SumeCzech
Rouge
Type:corpus
Type (DCMI):Text
Type (OLAC):primary_text

OLAC Info

Archive:  LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University
Description:  http://www.language-archives.org/archive/lindat.mff.cuni.cz
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:lindat.mff.cuni.cz:11234/1-2615
DateStamp:  2023-02-27
GetRecord:  OAI-PMH request for simple DC format
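
The GetRecord entries above refer to OAI-PMH requests for this record. As a rough sketch only, such a request URL could be assembled as shown below. The query parameters (verb, identifier, metadataPrefix) come from the OAI-PMH protocol, and "olac" and "oai_dc" are the usual prefixes for the OLAC and simple DC formats. The base URL is a hypothetical placeholder, since this record does not state the repository's actual OAI-PMH endpoint, and the function name get_record_url is likewise illustrative.

```python
from urllib.parse import urlencode

# Hypothetical placeholder: the record does not give the repository's
# real OAI-PMH endpoint, so substitute the actual base URL before use.
BASE_URL = "https://example.org/oai/request"

# Identifier taken from the OaiIdentifier field of this record.
OAI_IDENTIFIER = "oai:lindat.mff.cuni.cz:11234/1-2615"


def get_record_url(identifier: str, metadata_prefix: str,
                   base_url: str = BASE_URL) -> str:
    """Build an OAI-PMH GetRecord request URL for a single record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# One request per metadata format listed in this record.
olac_url = get_record_url(OAI_IDENTIFIER, "olac")
dc_url = get_record_url(OAI_IDENTIFIER, "oai_dc")

print(olac_url)
print(dc_url)
```

Only the URL is built here; fetching it and parsing the returned XML is left out because the endpoint is an assumption.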

Search Info

Citation: Straka, Milan; Mediankin, Nikita; Kocmi, Tom; Žabokrtský, Zdeněk; Hudeček, Vojtěch; Hajič, Jan. 2020. SumeCzech. Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL).
Terms: area_Europe country_CZ dcmi_Text iso639_ces olac_primary_text


http://www.language-archives.org/item.php/oai:lindat.mff.cuni.cz:11234/1-2615
Up-to-date as of: Thu Oct 5 0:40:51 EDT 2023