OLAC Record
oai:lindat.mff.cuni.cz:11234/1-1743

Metadata
Title:Deltacorpus 1.1
Bibliographic Citation:http://hdl.handle.net/11234/1-1743
Creator:Mareček, David
Yu, Zhiwei
Zeman, Daniel
Žabokrtský, Zdeněk
Date (W3CDTF):2016-06-27T12:27:25Z
Date Available:2016-06-27T12:27:25Z
Description:Texts in 107 languages from the W2C corpus (http://hdl.handle.net/11858/00-097C-0000-0022-6133-9), the first 1,000,000 tokens per language, tagged by the delexicalized tagger described in Yu et al. (2016, LREC, Portorož, Slovenia). Changes in version 1.1: 1. The Universal Dependencies tagset is used instead of the older and smaller Google Universal POS tagset. 2. The SVM classifier is trained on Universal Dependencies 1.2 instead of HamleDT 2.0. 3. Balto-Slavic, Germanic and Romance languages were each tagged by a classifier trained only on the respective group of languages; all other languages were tagged by a classifier trained on all available languages. The "c7" combination from version 1.0 is no longer used.
Identifier (URI):http://hdl.handle.net/11234/1-1743
Language:Belarusian
Bosnian
Bulgarian
Czech
Serbo-Croatian
Croatian
Upper Sorbian
Macedonian
Polish
Russian
Slovak
Slovenian
Serbian
Ukrainian
Latvian
Lithuanian
Afrikaans
Danish
German
English
Faroese
Western Frisian
Swiss German
Icelandic
Limburgan
Luxembourgish
Low German
Dutch
Norwegian Nynorsk
Norwegian
Scots
Swedish
Yiddish
Aragonese
Asturian
Catalan
French
Galician
Haitian
Italian
Latin
Lombard
Neapolitan
Piemontese
Portuguese
Romanian
Spanish
Venetian
Walloon
Breton
Welsh
Scottish Gaelic
Irish
Modern Greek (1453-)
Armenian
Albanian
Dimli (individual language)
Persian
Gilaki
Kurdish
Tajik
Bengali
Bishnupriya
Gujarati
Fiji Hindi
Hindi
Marathi
Nepali (macrolanguage)
Urdu
Amharic
Arabic
Egyptian Arabic
Hebrew
Estonian
Finnish
Hungarian
Basque
Georgian
Chuvash
Azerbaijani
Turkish
Uzbek
Kazakh
Tatar
Yakut
Korean
Mongolian
Telugu
Kannada
Malayalam
Tamil
Newari
Vietnamese
Indonesian
Javanese
Malagasy
Maori
Malay (macrolanguage)
Pampanga
Sundanese
Tagalog
Waray (Philippines)
Swahili (macrolanguage)
Esperanto
Ido
Interlingua (International Auxiliary Language Association)
Volapük
Language (ISO639):bel
bos
bul
ces
hbs
hrv
hsb
mkd
pol
rus
slk
slv
srp
ukr
lav
lit
afr
dan
deu
eng
fao
fry
gsw
isl
lim
ltz
nds
nld
nno
nor
sco
swe
yid
arg
ast
cat
fra
glg
hat
ita
lat
lmo
nap
pms
por
ron
spa
vec
wln
bre
cym
gla
gle
ell
hye
sqi
diq
fas
glk
kur
tgk
ben
bpy
guj
hif
hin
mar
nep
urd
amh
ara
arz
heb
est
fin
hun
eus
kat
chv
aze
tur
uzb
kaz
tat
sah
kor
mon
tel
kan
mal
tam
new
vie
ind
jav
mlg
mri
msa
pam
sun
tgl
war
swa
epo
ido
ina
vol
Publisher:Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Replaces (URI):http://hdl.handle.net/11234/1-1662
Rights:Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
http://creativecommons.org/licenses/by-sa/4.0/
Subject:part of speech
tagging
semi-supervised
cross-language
Type:corpus
Type (DCMI):Text
Type (OLAC):primary_text
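
The Description field says that Balto-Slavic, Germanic and Romance languages were each tagged by a classifier trained on their own group, and every other language by a classifier trained on all available languages. The sketch below illustrates only that routing rule. The group names, the function name `classifier_for`, and the small sets of ISO 639-3 codes are illustrative assumptions; the record does not list which languages belong to which group, and this is not the authors' code.

```python
# Illustrative subsets only: the record does not state group membership,
# so these sets are an assumption made for the sake of the example.
GROUP_MEMBERS = {
    "balto-slavic": {"ces", "pol", "rus", "lit", "lav"},
    "germanic": {"deu", "eng", "swe", "nld"},
    "romance": {"fra", "ita", "spa", "por"},
}

# Hypothetical label for the classifier trained on all available languages.
FALLBACK = "all-languages"


def classifier_for(iso_code: str) -> str:
    """Return the label of the classifier that would tag this language."""
    code = iso_code.lower()
    for group, members in GROUP_MEMBERS.items():
        if code in members:
            return group
    return FALLBACK
```

With these assumed sets, `classifier_for("ces")` selects the Balto-Slavic classifier, while a language outside the three groups, such as `"hun"`, falls back to the classifier trained on all languages.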

OLAC Info

Archive:  LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University
Description:  http://www.language-archives.org/archive/lindat.mff.cuni.cz
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:lindat.mff.cuni.cz:11234/1-1743
DateStamp:  2021-06-29
GetRecord:  OAI-PMH request for simple DC format
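
The GetRecord entries above refer to OAI-PMH requests for this record. As a rough sketch, such a request is an HTTP GET with the arguments `verb`, `identifier` and `metadataPrefix`. The base URL in the usage note is a placeholder, since the record does not give the repository endpoint, and the prefixes `olac` and `oai_dc` are the conventional names for the OLAC and simple DC formats, assumed here rather than confirmed by this record. The function name `get_record_url` is hypothetical.

```python
from urllib.parse import urlencode


def get_record_url(base_url: str, identifier: str, metadata_prefix: str) -> str:
    """Build an OAI-PMH GetRecord request URL (no network access is made)."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"
```

For example, `get_record_url("https://example.org/oai", "oai:lindat.mff.cuni.cz:11234/1-1743", "olac")` builds the request against the placeholder endpoint; swapping the last argument for `"oai_dc"` would ask for the simple DC format instead.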

Search Info

Citation: Mareček, David; Yu, Zhiwei; Zeman, Daniel; Žabokrtský, Zdeněk. 2016. Deltacorpus 1.1. Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL).
Terms: area_Africa area_Americas area_Asia area_Europe area_Pacific country_AM country_BA country_BD country_BE country_BG country_BY country_CH country_CZ country_DE country_DK country_EG country_ES country_ET country_FI country_FJ country_FR country_GB country_GE country_GR country_HR country_HT country_HU country_ID country_IE country_IL country_IN country_IR country_IS country_IT country_KR country_KZ country_LT country_LU country_MK country_NL country_NO country_NP country_NZ country_PH country_PK country_PL country_PT country_RO country_RS country_RU country_SE country_SI country_SK country_TJ country_TR country_UA country_VA country_VN country_ZA dcmi_Text iso639_afr iso639_amh iso639_ara iso639_arg iso639_arz iso639_ast iso639_aze iso639_bel iso639_ben iso639_bos iso639_bpy iso639_bre iso639_bul iso639_cat iso639_ces iso639_chv iso639_cym iso639_dan iso639_deu iso639_diq iso639_ell iso639_eng iso639_epo iso639_est iso639_eus iso639_fao iso639_fas iso639_fin iso639_fra iso639_fry iso639_gla iso639_gle iso639_glg iso639_glk iso639_gsw iso639_guj iso639_hat iso639_hbs iso639_heb iso639_hif iso639_hin iso639_hrv iso639_hsb iso639_hun iso639_hye iso639_ido iso639_ina iso639_ind iso639_isl iso639_ita iso639_jav iso639_kan iso639_kat iso639_kaz iso639_kor iso639_kur iso639_lat iso639_lav iso639_lim iso639_lit iso639_lmo iso639_ltz iso639_mal iso639_mar iso639_mkd iso639_mlg iso639_mon iso639_mri iso639_msa iso639_nap iso639_nds iso639_nep iso639_new iso639_nld iso639_nno iso639_nor iso639_pam iso639_pms iso639_pol iso639_por iso639_ron iso639_rus iso639_sah iso639_sco iso639_slk iso639_slv iso639_spa iso639_sqi iso639_srp iso639_sun iso639_swa iso639_swe iso639_tam iso639_tat iso639_tel iso639_tgk iso639_tgl iso639_tur iso639_ukr iso639_urd iso639_uzb iso639_vec iso639_vie iso639_vol iso639_war iso639_wln iso639_yid olac_primary_text


http://www.language-archives.org/item.php/oai:lindat.mff.cuni.cz:11234/1-1743
Up-to-date as of: Thu Oct 5 0:40:29 EDT 2023