OLAC Record
oai:lindat.mff.cuni.cz:11234/1-2607

Metadata
Title:Corpus for training and evaluating diacritics restoration systems
Bibliographic Citation:http://hdl.handle.net/11234/1-2607
Creator:Náplava, Jakub
Straka, Milan
Hajič, Jan
Straňák, Pavel
Date (W3CDTF):2018-03-05T14:37:18Z
Date Available:2018-03-05T14:37:18Z
Description:Corpus of texts in 12 languages. For each language, we provide one training, one development, and one testing set acquired from Wikipedia articles. In addition, each language dataset contains a substantially larger training set collected from general Web texts. All sets are disjoint, except that the Wikipedia and Web training sets may contain similar sentences. Data are segmented into sentences, which are further tokenized into words. All data in the corpus contain diacritics. To strip diacritics from them, use the Python script diacritization_stripping.py contained in the attached stripping_diacritics.zip. The script has two modes; we generally recommend the method called uninames, which behaves better for some languages. The code for training a recurrent neural network based model for diacritics restoration is located at https://github.com/arahusky/diacritics_restoration.
Identifier (URI):http://hdl.handle.net/11234/1-2607
Language:Czech
Vietnamese
Romanian
Polish
Slovak
Spanish
Croatian
Irish
Latvian
Hungarian
French
Turkish
Language (ISO639):ces
vie
ron
pol
slk
spa
hrv
gle
lav
hun
fra
tur
Publisher:Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Rights:Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
http://creativecommons.org/licenses/by-nc-sa/4.0/
Subject:diacritical marks generation
natural language correction
Type:corpus
Type (DCMI):Text
Type (OLAC):primary_text
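
Note on stripping diacritics: the Description says the stripping script has two modes and that the one called uninames behaves better for some languages. The sketch below is not the attached script. It is a minimal illustration, assuming that "uninames" refers to Unicode character names, of why a name-based approach can differ from plain Unicode decomposition. The function names strip_by_decomposition and strip_by_unicode_name are hypothetical and chosen only for this example.

```python
import unicodedata


def strip_by_decomposition(text: str) -> str:
    """Decompose to NFD and drop combining marks.

    Letters without a canonical decomposition (for example Polish "ł"
    or Vietnamese/Croatian "đ") pass through unchanged.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def strip_by_unicode_name(text: str) -> str:
    """Replace each character by the base letter named in its Unicode name.

    Assumption for this sketch: the input uses precomposed characters.
    "LATIN SMALL LETTER L WITH STROKE" becomes "LATIN SMALL LETTER L".
    """
    out = []
    for ch in text:
        name = unicodedata.name(ch, "")
        base_name, sep, _ = name.partition(" WITH ")
        if sep:
            try:
                ch = unicodedata.lookup(base_name)
            except KeyError:
                pass  # no character with the shortened name; keep the original
        out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    for word in ["Łódź", "Việt", "đường"]:
        print(word, strip_by_decomposition(word), strip_by_unicode_name(word))
```

For the Polish word "Łódź", the decomposition variant leaves the "Ł" in place, while the name-based variant maps it to "L". This may be the kind of difference the Description has in mind, but the attached script remains the authoritative reference.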

OLAC Info

Archive:  LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University
Description:  http://www.language-archives.org/archive/lindat.mff.cuni.cz
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:lindat.mff.cuni.cz:11234/1-2607
DateStamp:  2021-06-29
GetRecord:  OAI-PMH request for simple DC format
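
Note on the GetRecord entries: an OAI-PMH GetRecord request is an HTTP query with the arguments verb, identifier and metadataPrefix. The sketch below shows how such a URL could be assembled for this record. The base URL is a placeholder, since the repository endpoint is not given in this record; "olac" and "oai_dc" are the metadata prefixes conventionally used for the OLAC and simple DC formats, and get_record_url is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Placeholder endpoint (assumption): substitute the repository's real OAI-PMH base URL.
BASE_URL = "https://example.org/oai"


def get_record_url(identifier: str, metadata_prefix: str, base_url: str = BASE_URL) -> str:
    """Build an OAI-PMH GetRecord URL for one record in one metadata format."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


if __name__ == "__main__":
    record_id = "oai:lindat.mff.cuni.cz:11234/1-2607"
    print(get_record_url(record_id, "olac"))    # OLAC format
    print(get_record_url(record_id, "oai_dc"))  # simple DC format
```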

Search Info

Citation: Náplava, Jakub; Straka, Milan; Hajič, Jan; Straňák, Pavel. 2018. Corpus for training and evaluating diacritics restoration systems. Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL).
Terms: area_Asia area_Europe country_CZ country_ES country_FR country_HR country_HU country_IE country_PL country_RO country_SK country_TR country_VN dcmi_Text iso639_ces iso639_fra iso639_gle iso639_hrv iso639_hun iso639_lav iso639_pol iso639_ron iso639_slk iso639_spa iso639_tur iso639_vie olac_primary_text


http://www.language-archives.org/item.php/oai:lindat.mff.cuni.cz:11234/1-2607
Up-to-date as of: Thu Oct 5 0:40:51 EDT 2023