OLAC Record oai:www.ldc.upenn.edu:LDC2025S01 |
Metadata | ||
Title: | MATERIAL Georgian-English Language Pack | |
Access Rights: | Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining | |
Bibliographic Citation: | Asatiani, Sandro, et al. MATERIAL Georgian-English Language Pack LDC2025S01. Web Download. Philadelphia: Linguistic Data Consortium, 2025 | |
Contributor: | Asatiani, Sandro | |
Bills, Aric | ||
Brunckhorst, Rachael | ||
Chouder, Sarra | ||
Corey, Cassian | ||
Dubinski, Eyal | ||
Ellis, Corinna | ||
Gibby, Paul | ||
Kalkhitashvili, Tamar | ||
Kazi, Michael | ||
Tong, Audrey | ||
Lam, Julie | ||
Le, Hanh | ||
Malyska, Nicolas | ||
Marcucci, Giorgia | ||
Marvi, Sarah | ||
McConnell, Sara | ||
Melot, Jennifer | ||
Mensch, Alyssa | ||
Morrison, Michelle | ||
Paget, Shelley | ||
Richardson, Frederick | ||
Roberts, Annette | ||
Rubino, Carl | ||
Samushia, Lela | ||
Date (W3CDTF): | 2025 | |
Date Issued (W3CDTF): | 2025-02-17 | |
Description: | *Introduction* MATERIAL Georgian-English Language Pack was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) MATERIAL (Machine Translation for English Retrieval of Information in Any Language) program. It contains approximately 79 hours of Georgian conversational telephone speech, transcripts, English translations, annotations and queries. The MATERIAL program focused on underserved languages, with the ultimate goal of building cross-language information retrieval systems that find speech and text content using English search queries. *Data* The Georgian speech in this release represents that spoken in the Eastern and Western dialect regions of Georgia. The gender distribution among speakers is approximately equal; speakers' ages range from 16 to 75 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments, including the street, a home or office, a public place, and inside a vehicle. Transcripts cover approximately half of the speech data, and approximately 3% of the speech data was translated into English. Further information about transcription and translation methodologies is contained in the documentation accompanying this release. MATERIAL Georgian-English Language Pack also includes English queries and their relevance annotations. Annotators marked transcripts by query (simple, conceptual, hybrid) and by their relevance to query search terms. Speech data is presented mostly as two-channel WAV or single-channel SPHERE files, both in 8 kHz A-law format. Some WAV files are 48 kHz PCM. All text data is UTF-8 encoded. *Samples* * Georgian Transcription Sample (TXT) * English Translation Sample (TXT) * Audio Sample (WAV) *Updates* None at this time. | |
Extent: | Corpus size: 10012254 KB | |
Format: | Sampling Rate: 8000 | |
Sampling Format: alaw | ||
Identifier: | LDC2025S01 | |
https://catalog.ldc.upenn.edu/LDC2025S01 | ||
ISLRN: 518-912-923-506-5 | ||
DOI: 10.35111/a8jn-8696 | ||
Language: | Georgian | |
English | ||
Language (ISO639): | kat | |
eng | ||
License: | MATERIAL Georgian-English Agreement (For-Profit): https://catalog.ldc.upenn.edu/license/material-georgian-english-agreement-for-profit.pdf | |
MATERIAL Georgian-English Agreement (Non-Member): https://catalog.ldc.upenn.edu/license/material-georgian-english-agreement-non-member.pdf | ||
MATERIAL Georgian-English Agreement (Not-For-Profit): https://catalog.ldc.upenn.edu/license/material-georgian-english-agreement-not-for-profit.pdf | ||
Medium: | Distribution: Web Download | |
Publisher: | Linguistic Data Consortium | |
Publisher (URI): | https://www.ldc.upenn.edu | |
Relation (URI): | https://catalog.ldc.upenn.edu/docs/LDC2025S01 | |
Rights Holder: | Portions © 2025 U.S. Government, © 2025 Trustees of the University of Pennsylvania | |
Type (DCMI): | Sound | |
Text | ||
Type (OLAC): | primary_text | |
OLAC Info | ||
Archive: | The LDC Corpus Catalog | |
Description: | http://www.language-archives.org/archive/www.ldc.upenn.edu | |
GetRecord: | OAI-PMH request for OLAC format | |
GetRecord: | Pre-generated XML file | |
OAI Info | ||
OaiIdentifier: | oai:www.ldc.upenn.edu:LDC2025S01 | |
DateStamp: | 2025-02-18 | |
GetRecord: | OAI-PMH request for simple DC format | |
Search Info | ||
Citation: | Asatiani, Sandro; Bills, Aric; Brunckhorst, Rachael; Chouder, Sarra; Corey, Cassian; Dubinski, Eyal; Ellis, Corinna; Gibby, Paul; Kalkhitashvili, Tamar; Kazi, Michael; Tong, Audrey; Lam, Julie; Le, Hanh; Malyska, Nicolas; Marcucci, Giorgia; Marvi, Sarah; McConnell, Sara; Melot, Jennifer; Mensch, Alyssa; Morrison, Michelle; Paget, Shelley; Richardson, Frederick; Roberts, Annette; Rubino, Carl; Samushia, Lela. 2025. Linguistic Data Consortium. | |
Terms: | area_Asia area_Europe country_GB country_GE dcmi_Sound dcmi_Text iso639_eng iso639_kat olac_primary_text |
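
The Description field notes that most WAV files in this release are 8 kHz A-law while some are 48 kHz PCM. As a minimal sketch of how the two could be told apart, the Python below reads the `fmt ` chunk of a RIFF/WAVE header and reports the format tag, channel count and sampling rate. The function names `read_wav_format` and `classify` and the labels they return are illustrative assumptions, not part of the corpus or its documentation, and the sketch does not handle SPHERE files.

```python
import struct

# Format tags defined by the RIFF/WAVE format
WAVE_FORMAT_PCM = 0x0001
WAVE_FORMAT_ALAW = 0x0006


def read_wav_format(data: bytes) -> dict:
    """Return format tag, channels, sample rate and bit depth from WAVE bytes."""
    if len(data) < 12 or data[0:4] != b"RIFF" or data[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    pos = 12
    # Walk the chunk list until the 'fmt ' chunk is found
    while pos + 8 <= len(data):
        chunk_id = data[pos:pos + 4]
        (chunk_size,) = struct.unpack_from("<I", data, pos + 4)
        body = pos + 8
        if chunk_id == b"fmt ":
            if chunk_size < 16 or body + 16 > len(data):
                raise ValueError("truncated fmt chunk")
            tag, channels, rate, _byte_rate, _align, bits = struct.unpack_from(
                "<HHIIHH", data, body
            )
            return {"tag": tag, "channels": channels, "rate": rate, "bits": bits}
        # Chunk bodies are padded to an even number of bytes
        pos = body + chunk_size + (chunk_size & 1)
    raise ValueError("no fmt chunk found")


def classify(data: bytes) -> str:
    """Label a file with hypothetical names for the two formats the record mentions."""
    fmt = read_wav_format(data)
    if fmt["tag"] == WAVE_FORMAT_ALAW and fmt["rate"] == 8000:
        return "alaw-8k"
    if fmt["tag"] == WAVE_FORMAT_PCM and fmt["rate"] == 48000:
        return "pcm-48k"
    return "other"
```

Because the `fmt ` chunk normally sits near the start of a file, passing only the first few kilobytes (for example, `open(path, "rb").read(4096)` with a path of your choosing) is usually enough to classify a file without loading the audio.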