OLAC Record oai:lindat.mff.cuni.cz:11234/1-3505 |
Metadata | ||
Title: | SumeCzech-NER | |
Bibliographic Citation: | http://hdl.handle.net/11234/1-3505 | |
Creator: | Marek, Petr | |
Müller, Štěpán | ||
Date (W3CDTF): | 2021-02-03T08:30:28Z | |
Date Available: | 2021-02-03T08:30:28Z | |
Description: | SumeCzech-NER
SumeCzech-NER contains named entity annotations of SumeCzech 1.0 (Straka et al. 2018, SumeCzech: Large Czech News-Based Summarization Dataset).
Format
The dataset is split into four files in jsonl format, with one JSON object on each line. The most important fields of the JSON objects are:
- dataset: train, dev, test, oodtest
- ne_abstract: list of named entity annotations of the article's abstract
- ne_headline: list of named entity annotations of the article's headline
- ne_text: list of named entity annotations of the article's text
- url: the article's URL, which can be used to match articles across SumeCzech and SumeCzech-NER
Annotations
We used SpaCy's NER model trained on CoNLL-based extended CNEC 2.0. The model achieved a 78.45 F-Score on the dataset's testing set. The annotations are in IOB2 format. The entity types are: Numbers in addresses, Geographical names, Institutions, Media names, Artifact names, Personal names, and Time expressions.
Tokenization
We used the following Python code for tokenization:
    from typing import List
    from nltk.tokenize import word_tokenize

    def tokenize(text: str) -> List[str]:
        for mark in ('.', ',', '?', '!', '-', '–', '/'):
            text = text.replace(mark, f' {mark} ')
        tokens = word_tokenize(text)
        return tokens
| |
Identifier (URI): | http://hdl.handle.net/11234/1-3505 | |
Language: | Czech | |
Language (ISO639): | ces | |
Publisher: | Czech Technical University in Prague | |
Rights: | Mozilla Public License 2.0 | |
http://opensource.org/licenses/MPL-2.0 | ||
Subject: | SumeCzech | |
named entity recognition | ||
named entity corpus | ||
summarization | ||
Type: | corpus | |
Type (DCMI): | Text | |
Type (OLAC): | primary_text | |
OLAC Info |
||
Archive: | LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University | |
Description: | http://www.language-archives.org/archive/lindat.mff.cuni.cz | |
GetRecord: | OAI-PMH request for OLAC format | |
GetRecord: | Pre-generated XML file | |
OAI Info |
||
OaiIdentifier: | oai:lindat.mff.cuni.cz:11234/1-3505 | |
DateStamp: | 2021-06-29 | |
GetRecord: | OAI-PMH request for simple DC format | |
Search Info | ||
Citation: | Marek, Petr; Müller, Štěpán. 2021. Czech Technical University in Prague. | |
Terms: | area_Europe country_CZ dcmi_Text iso639_ces olac_primary_text |
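Usage sketch (not part of the original record): the description above says each file holds one JSON object per line and that annotations use IOB2. The following is a minimal, hedged example of reading such lines and grouping IOB2 tags into entity spans. It assumes that each ne_* field is a flat list of tag strings aligned with the tokens, which the record does not confirm. The tag labels (PER, GEO), the sample URL, and the helper names read_jsonl and iob2_to_spans are hypothetical and chosen only for illustration.

```python
import json
from typing import Dict, Iterable, Iterator, List, Optional, Tuple

# (start token index, end token index exclusive, entity label)
Span = Tuple[int, int, str]


def read_jsonl(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield one parsed JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def iob2_to_spans(tags: List[str]) -> List[Span]:
    """Group IOB2 tags (B-X, I-X, O) into (start, end, label) spans."""
    spans: List[Span] = []
    start: Optional[int] = None
    label: Optional[str] = None
    for i, tag in enumerate(tags):
        # The current span continues only on an I- tag with the same label.
        continues = tag.startswith("I-") and label == tag[2:]
        if not continues and label is not None:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        elif tag.startswith("I-") and label is None:
            # Lenient handling: a stray I- tag opens a new span.
            start, label = i, tag[2:]
    if label is not None:
        spans.append((start, len(tags), label))
    return spans


# Hypothetical record; the real field layout and labels may differ.
sample = (
    '{"dataset": "dev", "url": "https://example.com/article", '
    '"ne_headline": ["B-PER", "I-PER", "O", "B-GEO"]}'
)
records = list(read_jsonl([sample]))
spans = iob2_to_spans(records[0]["ne_headline"])
print(spans)  # [(0, 2, 'PER'), (3, 4, 'GEO')]
```

With a real file, passing the open file handle to read_jsonl instead of the one-element list would stream the records line by line. The url field can then serve as the join key against the corresponding SumeCzech article.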