OLAC Record oai:lindat.mff.cuni.cz:11234/1-2607 |
Metadata | ||
Title: | Corpus for training and evaluating diacritics restoration systems | |
Bibliographic Citation: | http://hdl.handle.net/11234/1-2607 | |
Creator: | Náplava, Jakub | |
Straka, Milan | ||
Hajič, Jan | ||
Straňák, Pavel | ||
Date (W3CDTF): | 2018-03-05T14:37:18Z | |
Date Available: | 2018-03-05T14:37:18Z | |
Description: | Corpus of texts in 12 languages. For each language, we provide one training, one development, and one test set acquired from Wikipedia articles. In addition, each language dataset contains a substantially larger training set collected from general Web texts. All sets are disjoint, except that the Wikipedia and Web training sets may contain similar sentences. Data are segmented into sentences, which are further tokenized into words. All data in the corpus contain diacritics. To strip diacritics from them, use the Python script diacritization_stripping.py contained in the attached stripping_diacritics.zip. The script has two modes; we generally recommend the method called uninames, which behaves better for some languages. The code for training a recurrent-neural-network-based model for diacritics restoration is located at https://github.com/arahusky/diacritics_restoration. | |
Identifier (URI): | http://hdl.handle.net/11234/1-2607 | |
Language: | Czech | |
Vietnamese | ||
Romanian | ||
Polish | ||
Slovak | ||
Spanish | ||
Croatian | ||
Irish | ||
Latvian | ||
Hungarian | ||
French | ||
Turkish | ||
Language (ISO639): | ces | |
vie | ||
ron | ||
pol | ||
slk | ||
spa | ||
hrv | ||
gle | ||
lav | ||
hun | ||
fra | ||
tur | ||
Publisher: | Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL) | |
Rights: | Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) | |
http://creativecommons.org/licenses/by-nc-sa/4.0/ | ||
Subject: | diacritical marks generation | |
natural language correction | ||
Type: | corpus | |
Type (DCMI): | Text | |
Type (OLAC): | primary_text | |
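
The Description field above recommends a stripping method called uninames. The script itself is not reproduced in this record, so the following is only a minimal sketch of how a name-based approach might work, assuming it relies on Unicode character names: each character's name is cut at " WITH " and the shortened name is looked up again. The function names `strip_by_names` and `strip_by_decomposition` are hypothetical and are not taken from diacritization_stripping.py. The second function, shown for contrast, is the common Unicode-decomposition technique; it is not claimed to be the script's other mode.

```python
import unicodedata


def strip_by_names(text: str) -> str:
    """Strip diacritics by truncating each character's Unicode name at ' WITH '.

    Hypothetical illustration only; not the code shipped with the corpus.
    """
    out = []
    for ch in text:
        name = unicodedata.name(ch, "")  # "" for characters without a name
        base_name, sep, _ = name.partition(" WITH ")
        if sep:
            try:
                # e.g. "LATIN SMALL LETTER A WITH ACUTE" -> "LATIN SMALL LETTER A"
                ch = unicodedata.lookup(base_name)
            except KeyError:
                pass  # no character carries the shortened name; keep the original
        out.append(ch)
    return "".join(out)


def strip_by_decomposition(text: str) -> str:
    """Strip diacritics by NFD decomposition and removal of nonspacing marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")


if __name__ == "__main__":
    for word in ["Łódź", "Việt", "đường"]:
        print(word, strip_by_names(word), strip_by_decomposition(word))
```

One reason a name-based approach can behave better for some languages: letters such as Polish "ł" or Vietnamese "đ" have no canonical decomposition in Unicode, so the decomposition-based function leaves them unchanged, while the name-based one maps them to "l" and "d".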
OLAC Info | ||
Archive: | LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University | |
Description: | http://www.language-archives.org/archive/lindat.mff.cuni.cz | |
GetRecord: | OAI-PMH request for OLAC format | |
GetRecord: | Pre-generated XML file | |
OAI Info | ||
OaiIdentifier: | oai:lindat.mff.cuni.cz:11234/1-2607 | |
DateStamp: | 2021-06-29 | |
GetRecord: | OAI-PMH request for simple DC format | |
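
The GetRecord entries above refer to OAI-PMH requests whose link targets are not preserved in this record. As a rough sketch, such a request is an HTTP GET with the standard OAI-PMH arguments `verb`, `identifier`, and `metadataPrefix`. The base URL below is a hypothetical placeholder, not the repository's real endpoint, and the helper name `get_record_url` is an illustration. The `oai_dc` prefix is the simple Dublin Core format defined by OAI-PMH; the `olac` prefix for the OLAC format is an assumption.

```python
from urllib.parse import urlencode

# Hypothetical placeholder: the repository's real OAI-PMH base URL is not given here.
BASE_URL = "https://example.org/oai"


def get_record_url(identifier: str,
                   metadata_prefix: str = "oai_dc",
                   base_url: str = BASE_URL) -> str:
    """Build an OAI-PMH GetRecord request URL for a single record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


if __name__ == "__main__":
    record_id = "oai:lindat.mff.cuni.cz:11234/1-2607"
    print(get_record_url(record_id))          # simple DC format
    print(get_record_url(record_id, "olac"))  # OLAC format (assumed prefix)
```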
Search Info | ||
Citation: | Náplava, Jakub; Straka, Milan; Hajič, Jan; Straňák, Pavel. 2018. Corpus for training and evaluating diacritics restoration systems. Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL). | |
Terms: | area_Asia area_Europe country_CZ country_ES country_FR country_HR country_HU country_IE country_PL country_RO country_SK country_TR country_VN dcmi_Text iso639_ces iso639_fra iso639_gle iso639_hrv iso639_hun iso639_lav iso639_pol iso639_ron iso639_slk iso639_spa iso639_tur iso639_vie olac_primary_text |