OLAC Record
oai:lindat.mff.cuni.cz:11234/1-5478

Metadata
Title:Coreference in Universal Dependencies 1.2 (CorefUD 1.2)
Bibliographic Citation:http://hdl.handle.net/11234/1-5478
Creator:Popel, Martin
Novák, Michal
Žabokrtský, Zdeněk
Zeman, Daniel
Nedoluzhko, Anna
Acar, Kutay
Bamman, David
Bourgonje, Peter
Cinková, Silvie
Eckhoff, Hanne
Cebiroğlu Eryiğit, Gülşen
Hajič, Jan
Hardmeier, Christian
Haug, Dag
Jørgensen, Tollef
Kåsen, Andre
Krielke, Pauline
Landragin, Frédéric
Lapshinova-Koltunski, Ekaterina
Mæhlum, Petter
Martí, M. Antònia
Mikulová, Marie
Nøklestad, Anders
Ogrodniczuk, Maciej
Øvrelid, Lilja
Pamay Arslan, Tuğba
Recasens, Marta
Solberg, Per Erik
Stede, Manfred
Straka, Milan
Swanson, Daniel
Toldova, Svetlana
Vadász, Noémi
Velldal, Erik
Vincze, Veronika
Zeldes, Amir
Žitkus, Voldemaras
Date (W3CDTF):2024-04-02T12:48:43Z
Date Available:2024-04-02T12:48:43Z
Description:CorefUD is a collection of previously existing datasets annotated with coreference, which we converted into a common annotation scheme. In total, CorefUD in its current version 1.2 consists of 25 datasets for 16 languages. The datasets are enriched with automatic morphological and syntactic annotations that are fully compliant with the standards of the Universal Dependencies project. All the datasets are stored in the CoNLL-U format, with coreference- and bridging-specific information captured by attribute-value pairs located in the MISC column. The collection is divided into a public edition and a non-public (ÚFAL-internal) edition. The publicly available edition is distributed via LINDAT-CLARIAH-CZ and contains 21 datasets for 15 languages (1 dataset for Ancient Greek, 1 for Ancient Hebrew, 1 for Catalan, 2 for Czech, 3 for English, 1 for French, 2 for German, 2 for Hungarian, 1 for Lithuanian, 2 for Norwegian, 1 for Old Church Slavonic, 1 for Polish, 1 for Russian, 1 for Spanish, and 1 for Turkish), excluding the test data. The non-public edition is available internally to ÚFAL members and contains 4 additional datasets for 2 languages (1 dataset for Dutch and 3 for English), which we are not allowed to distribute due to their original license limitations. It also contains the test data portions for all datasets. When using any of the harmonized datasets, please read its license (placed in the same directory as the data) and also cite the original data resource. Compared to the previous version 1.1, version 1.2 adds new languages and corpora, namely Ancient_Greek-PROIEL, Ancient_Hebrew-PTNK, English-LitBank, and Old_Church_Slavonic-PROIEL.
In addition, English-GUM and Turkish-ITCC have been updated to newer versions, the conversion of zeros in Polish-PCC has been improved, and the conversion pipelines for multiple other datasets have been refined (a list of all changes in each dataset can be found in the corresponding README file).
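The attribute-value layout mentioned in the description can be illustrated with a minimal Python sketch. It relies only on the general CoNLL-U conventions that a token line has ten tab-separated columns, that MISC is the last of them, and that MISC holds `Key=Value` pairs joined by `|` (or `_` when empty). The sample token line, the `Entity` value in it, and the function names are made up for illustration and are not taken from the data; the README of each dataset describes the attributes actually used.

```python
def parse_misc(misc: str) -> dict[str, str]:
    """Split a CoNLL-U MISC value into a dict of attribute-value pairs."""
    if not misc or misc == "_":
        return {}  # "_" marks an empty column in CoNLL-U
    pairs = {}
    for item in misc.split("|"):
        # Split on the first "=" only, so values may themselves contain "=".
        key, sep, value = item.partition("=")
        pairs[key] = value if sep else ""
    return pairs


def misc_of_token_line(line: str) -> dict[str, str]:
    """Return the parsed MISC column (10th field) of one CoNLL-U token line."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 10:
        raise ValueError(f"expected 10 columns, got {len(columns)}")
    return parse_misc(columns[9])


# Hypothetical token line; the Entity value is invented for illustration.
SAMPLE = "1\tMary\tMary\tPROPN\tNNP\tNumber=Sing\t2\tnsubj\t_\tEntity=(e1)|SpaceAfter=No"

if __name__ == "__main__":
    print(misc_of_token_line(SAMPLE))
```

Comment lines (starting with `#`) and blank sentence separators are not token lines, so a real reader would skip them before calling `misc_of_token_line`.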
Identifier (URI):http://hdl.handle.net/11234/1-5478
Language:Ancient Greek (to 1453)
Ancient Hebrew
Catalan
Czech
English
French
German
Hungarian
Lithuanian
Norwegian
Church Slavic
Polish
Russian
Spanish
Turkish
Language (ISO639):grc
hbo
cat
ces
eng
fra
deu
hun
lit
nor
chu
pol
rus
spa
tur
Publisher:Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Replaces (URI):http://hdl.handle.net/11234/1-5053
Rights:Licence CorefUD v1.2
https://lindat.mff.cuni.cz/repository/xmlui/page/license-corefud-1.2
Subject:coreference
bridging relations
harmonized annotation
dependency
treebank
Type:corpus
Type (DCMI):Text
Type (OLAC):primary_text

OLAC Info

Archive:  LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University
Description:  http://www.language-archives.org/archive/lindat.mff.cuni.cz
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:lindat.mff.cuni.cz:11234/1-5478
DateStamp:  2024-10-30
GetRecord:  OAI-PMH request for simple DC format
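The GetRecord requests listed for this record can be sketched as follows. In OAI-PMH, a GetRecord request carries the arguments `verb`, `identifier`, and `metadataPrefix`; `oai_dc` is the standard prefix for simple Dublin Core, and `olac` is assumed here to be the prefix for the OLAC format. The endpoint address does not appear in this record, so `BASE_URL` below is a placeholder, not the real address.

```python
from urllib.parse import urlencode


def get_record_url(base_url: str, identifier: str, metadata_prefix: str) -> str:
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


BASE_URL = "https://example.org/oai"  # placeholder; the real endpoint is not given here
IDENTIFIER = "oai:lindat.mff.cuni.cz:11234/1-5478"  # OaiIdentifier of this record

if __name__ == "__main__":
    print(get_record_url(BASE_URL, IDENTIFIER, "oai_dc"))  # simple DC format
    print(get_record_url(BASE_URL, IDENTIFIER, "olac"))    # OLAC format (assumed prefix)
```

`urlencode` percent-encodes the colons and the slash in the identifier, which keeps the identifier intact as a single query argument.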

Search Info

Citation: Popel, Martin; Novák, Michal; Žabokrtský, Zdeněk; Zeman, Daniel; Nedoluzhko, Anna; Acar, Kutay; Bamman, David; Bourgonje, Peter; Cinková, Silvie; Eckhoff, Hanne; Cebiroğlu Eryiğit, Gülşen; Hajič, Jan; Hardmeier, Christian; Haug, Dag; Jørgensen, Tollef; Kåsen, Andre; Krielke, Pauline; Landragin, Frédéric; Lapshinova-Koltunski, Ekaterina; Mæhlum, Petter; Martí, M. Antònia; Mikulová, Marie; Nøklestad, Anders; Ogrodniczuk, Maciej; Øvrelid, Lilja; Pamay Arslan, Tuğba; Recasens, Marta; Solberg, Per Erik; Stede, Manfred; Straka, Milan; Swanson, Daniel; Toldova, Svetlana; Vadász, Noémi; Velldal, Erik; Vincze, Veronika; Zeldes, Amir; Žitkus, Voldemaras. 2024. Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL).
Terms: area_Asia area_Europe country_CZ country_DE country_ES country_FR country_GB country_GR country_HU country_IL country_LT country_NO country_PL country_RU country_TR dcmi_Text iso639_cat iso639_ces iso639_chu iso639_deu iso639_eng iso639_fra iso639_grc iso639_hbo iso639_hun iso639_lit iso639_nor iso639_pol iso639_rus iso639_spa iso639_tur olac_primary_text


http://www.language-archives.org/item.php/oai:lindat.mff.cuni.cz:11234/1-5478
Up-to-date as of: Wed Mar 5 0:42:36 EST 2025