OLAC Record
oai:lindat.mff.cuni.cz:11234/1-5687

Metadata
Title:ORTOFON v3: corpus of informal spoken Czech with multi-tier transcription (transcriptions)
Bibliographic Citation:http://hdl.handle.net/11234/1-5687
Creator:Lukeš, David
Kopřivová, Marie
Laubeová, Zuzana
Poukarová, Petra
Horký, Václav
Jelínek, Tomáš
Křivan, Jan
Waclawičová, Martina
Benešová, Lucie
Škarpová, Marie
Date (W3CDTF):2024-10-10T10:40:18Z
Date Available:2024-10-10T10:40:18Z
Description:ORTOFON v3 is a corpus of authentic spoken Czech used in informal situations (private environment, spontaneity, unpreparedness etc.) that covers the whole Czech Republic. The corpus is composed of 697 recordings from 2012–2020 and contains 2 445 793 orthographic words (i.e. a total of 2 976 742 tokens including punctuation); a total of 1 121 different speakers appear in the recordings. ORTOFON v3 is partially balanced with respect to the basic sociolinguistic speaker categories (gender, age group, level of education and region of childhood residence). The transcription is linked to the corresponding audio track. Unlike the ORAL-series corpora, the transcription was carried out on two main tiers, orthographic and phonetic, supplemented by an additional metalanguage tier. ORTOFON v3 is lemmatized and morphologically tagged according to the SYN2020 standard. This was done with special attention to the specifics of informal spoken Czech and also involved spoken training data. The (anonymized) corpus is provided in the (semi-XML) vertical format used as input to the Manatee query engine. The data thus correspond to the corpus available via the KonText query engine to registered users of the CNC at http://www.korpus.cz. Please note: this item includes only the transcriptions; the audio (and the transcripts in their original format) are available under a more restrictive non-CC license at http://hdl.handle.net/11234/1-5686
Identifier (URI):http://hdl.handle.net/11234/1-5687
Language:Czech
Language (ISO639):ces
Publisher:Charles University, Faculty of Arts, Institute of the Czech National Corpus
Replaces (URI):http://hdl.handle.net/11234/1-2580
Rights:Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
http://creativecommons.org/licenses/by-nc-sa/4.0/
Subject:spoken language
informal language
Type:corpus
Type (DCMI):Text
Type (OLAC):primary_text
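
The Description above says the corpus is distributed in a (semi-XML) vertical format for Manatee. The following is only a rough sketch of how such a file might be read, assuming the common vertical layout of one token per line with tab-separated attributes and structure tags on lines of their own. The sample lines, the number of columns and their meaning are hypothetical and are not taken from ORTOFON v3; the actual attribute set should be checked against the corpus documentation.

```python
from typing import Iterable, Iterator, List, Tuple

def read_vertical(lines: Iterable[str]) -> Iterator[Tuple[str, List[str]]]:
    """Yield ("tag", [raw_tag]) for structure lines and ("token", columns) for token lines.

    Simplification (assumption): any line that starts with "<" and ends with ">"
    is treated as a structure tag; everything else is a tab-separated token line.
    """
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue  # skip empty lines
        if line.startswith("<") and line.endswith(">"):
            yield ("tag", [line])
        else:
            yield ("token", line.split("\t"))

def count_tokens(lines: Iterable[str]) -> int:
    """Count token lines, ignoring structure tags."""
    return sum(1 for kind, _ in read_vertical(lines) if kind == "token")

# Hypothetical sample, NOT real ORTOFON v3 data: word form and lemma columns.
SAMPLE = [
    '<doc id="example">\n',
    "dobrý\tdobrý\n",
    "den\tden\n",
    "</doc>\n",
]

if __name__ == "__main__":
    for kind, fields in read_vertical(SAMPLE):
        print(kind, fields)
    print("tokens:", count_tokens(SAMPLE))
```

Streaming line by line keeps memory use flat, which matters for a corpus with roughly three million tokens.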

OLAC Info

Archive:  LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University
Description:  http://www.language-archives.org/archive/lindat.mff.cuni.cz
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:lindat.mff.cuni.cz:11234/1-5687
DateStamp:  2024-10-10
GetRecord:  OAI-PMH request for simple DC format
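
The GetRecord entries above refer to OAI-PMH requests for this record. As a hedged sketch of how such a request URL is put together: the verb, identifier and metadataPrefix parameters come from the OAI-PMH protocol, and the identifier is the OaiIdentifier listed above. The base URL below is a placeholder assumption, since this record does not state the repository's OAI-PMH endpoint, and the helper name is hypothetical.

```python
from urllib.parse import urlencode

# Placeholder endpoint (assumption): replace with the repository's real OAI-PMH base URL.
BASE_URL = "https://example.org/oai/request"

# Taken from the OaiIdentifier field of this record.
OAI_IDENTIFIER = "oai:lindat.mff.cuni.cz:11234/1-5687"

def build_getrecord_url(base_url: str, identifier: str, metadata_prefix: str) -> str:
    """Build an OAI-PMH GetRecord request URL for one record and one metadata format."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"

if __name__ == "__main__":
    # "oai_dc" is the simple Dublin Core format; "olac" is the OLAC format.
    for prefix in ("oai_dc", "olac"):
        print(build_getrecord_url(BASE_URL, OAI_IDENTIFIER, prefix))
```

The response to such a request is an XML document, so any further processing would go through an XML parser rather than string matching.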

Search Info

Citation: Lukeš, David; Kopřivová, Marie; Laubeová, Zuzana; Poukarová, Petra; Horký, Václav; Jelínek, Tomáš; Křivan, Jan; Waclawičová, Martina; Benešová, Lucie; Škarpová, Marie. 2024. ORTOFON v3: corpus of informal spoken Czech with multi-tier transcription (transcriptions). Charles University, Faculty of Arts, Institute of the Czech National Corpus.
Terms: area_Europe country_CZ dcmi_Text iso639_ces olac_primary_text


http://www.language-archives.org/item.php/oai:lindat.mff.cuni.cz:11234/1-5687
Up-to-date as of: Wed Mar 5 0:42:41 EST 2025