OLAC Record
oai:www.ldc.upenn.edu:LDC2025S01

Metadata
Title:MATERIAL Georgian-English Language Pack
Access Rights:Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining
Bibliographic Citation:Asatiani, Sandro, et al. MATERIAL Georgian-English Language Pack LDC2025S01. Web Download. Philadelphia: Linguistic Data Consortium, 2025
Contributor:Asatiani, Sandro
Bills, Aric
Brunckhorst, Rachael
Chouder, Sarra
Corey, Cassian
Dubinski, Eyal
Ellis, Corinna
Gibby, Paul
Kalkhitashvili, Tamar
Kazi, Michael
Tong, Audrey
Lam, Julie
Le, Hanh
Malyska, Nicolas
Marcucci, Giorgia
Marvi, Sarah
McConnell, Sara
Melot, Jennifer
Mensch, Alyssa
Morrison, Michelle
Paget, Shelley
Richardson, Frederick
Roberts, Annette
Rubino, Carl
Samushia, Lela
Date (W3CDTF):2025
Date Issued (W3CDTF):2025-02-17
Description:*Introduction*
MATERIAL Georgian-English Language Pack was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) MATERIAL (Machine Translation for English Retrieval of Information in Any Language) program. It contains approximately 79 hours of Georgian conversational telephone speech, transcripts, English translations, annotations and queries. The MATERIAL program focused on underserved languages, with the ultimate goal of building cross-language information retrieval systems that find speech and text content using English search queries.
*Data*
The Georgian speech in this release represents that spoken in the Eastern and Western dialect regions of Georgia. The gender distribution among speakers is approximately equal; speakers' ages range from 16 to 75 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments, including the street, a home or office, a public place, and inside a vehicle. Transcripts cover approximately half of the speech data, and approximately 3% of the speech data was translated into English. Further information about transcription and translation methodologies is contained in the documentation accompanying this release. The language pack also includes English queries and their relevance annotations. Annotators marked transcripts by query type (simple, conceptual, hybrid) and by their relevance to query search terms. Speech data is presented mostly as two-channel WAV or single-channel SPHERE files, both in 8 kHz A-law format. Some WAV files are 48 kHz PCM. All text data is UTF-8 encoded.
*Samples*
* Georgian Transcription Sample (TXT)
* English Translation Sample (TXT)
* Audio Sample (WAV)
*Updates*
None at this time.
Extent:Corpus size: 10012254 KB
Format:Sampling Rate: 8000
Sampling Format: alaw
Identifier:LDC2025S01
https://catalog.ldc.upenn.edu/LDC2025S01
ISLRN: 518-912-923-506-5
DOI: 10.35111/a8jn-8696
Language:Georgian
English
Language (ISO639):kat
eng
License:MATERIAL Georgian-English Agreement (For-Profit): https://catalog.ldc.upenn.edu/license/material-georgian-english-agreement-for-profit.pdf
MATERIAL Georgian-English Agreement (Non-Member): https://catalog.ldc.upenn.edu/license/material-georgian-english-agreement-non-member.pdf
MATERIAL Georgian-English Agreement (Not-For-Profit): https://catalog.ldc.upenn.edu/license/material-georgian-english-agreement-not-for-profit.pdf
Medium:Distribution: Web Download
Publisher:Linguistic Data Consortium
Publisher (URI):https://www.ldc.upenn.edu
Relation (URI):https://catalog.ldc.upenn.edu/docs/LDC2025S01
Rights Holder:Portions © 2025 U.S. Government, © 2025 Trustees of the University of Pennsylvania
Type (DCMI):Sound
Text
Type (OLAC):primary_text
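
Because the Description field notes that most WAV files are 8 kHz A-law while some are 48 kHz PCM, it may help to check each file's header before processing. The following is a minimal sketch, not part of the LDC release: the function names `read_wav_format` and `make_header` are hypothetical, and the code assumes standard RIFF/WAVE headers (format tag 1 for PCM, 6 for A-law). It parses the header directly rather than relying on an audio library, since not every library accepts A-law WAV files.

```python
import struct

# Format tags defined by the RIFF/WAVE specification.
FORMAT_NAMES = {1: "PCM", 6: "A-law", 7: "mu-law"}


def read_wav_format(data: bytes) -> dict:
    """Return encoding, channel count, and sample rate from RIFF/WAVE bytes."""
    if data[:4] != b"RIFF" or data[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    pos = 12
    # Walk the chunk list until the "fmt " chunk is found.
    while pos + 8 <= len(data):
        chunk_id, size = struct.unpack_from("<4sI", data, pos)
        if chunk_id == b"fmt ":
            tag, channels, rate = struct.unpack_from("<HHI", data, pos + 8)
            return {
                "encoding": FORMAT_NAMES.get(tag, f"tag {tag}"),
                "channels": channels,
                "sample_rate": rate,
            }
        pos += 8 + size + (size & 1)  # chunks are padded to an even length
    raise ValueError("no fmt chunk found")


def make_header(tag: int, channels: int, rate: int, bits: int) -> bytes:
    """Build a header-only WAV (no audio samples) for demonstration."""
    block_align = channels * bits // 8
    fmt = struct.pack("<HHIIHH", tag, channels, rate,
                      rate * block_align, block_align, bits)
    body = (b"WAVE" + b"fmt " + struct.pack("<I", len(fmt)) + fmt
            + b"data" + struct.pack("<I", 0))
    return b"RIFF" + struct.pack("<I", len(body)) + body


if __name__ == "__main__":
    # Synthetic headers standing in for the two kinds of WAV file described above.
    print(read_wav_format(make_header(6, 2, 8000, 8)))
    print(read_wav_format(make_header(1, 2, 48000, 16)))
```

For a real file, the first few kilobytes read in binary mode would normally be enough input for `read_wav_format`. SPHERE files use a different, text-based header and are not handled by this sketch.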

OLAC Info

Archive:  The LDC Corpus Catalog
Description:  http://www.language-archives.org/archive/www.ldc.upenn.edu
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:www.ldc.upenn.edu:LDC2025S01
DateStamp:  2025-02-18
GetRecord:  OAI-PMH request for simple DC format
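
The GetRecord entries in the OLAC Info and OAI Info sections refer to OAI-PMH requests for this record. As a rough sketch of how such a request URL is formed, the example below builds the query string from the `verb`, `identifier`, and `metadataPrefix` arguments defined by OAI-PMH. The base URL `https://example.org/oai` is a placeholder, not the archive's actual endpoint, and the helper name `get_record_url` is hypothetical. The prefix `oai_dc` is the standard one for simple DC; `olac` is assumed here as the prefix for the OLAC format.

```python
from urllib.parse import urlencode

# Placeholder endpoint: substitute the archive's real OAI-PMH base URL.
BASE_URL = "https://example.org/oai"


def get_record_url(identifier: str, metadata_prefix: str,
                   base_url: str = BASE_URL) -> str:
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


if __name__ == "__main__":
    record_id = "oai:www.ldc.upenn.edu:LDC2025S01"
    print(get_record_url(record_id, "olac"))    # OLAC format
    print(get_record_url(record_id, "oai_dc"))  # simple DC format
```

The colons in the identifier are percent-encoded by `urlencode`, which is the expected form for a URL query parameter.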

Search Info

Citation: Asatiani, Sandro; Bills, Aric; Brunckhorst, Rachael; Chouder, Sarra; Corey, Cassian; Dubinski, Eyal; Ellis, Corinna; Gibby, Paul; Kalkhitashvili, Tamar; Kazi, Michael; Tong, Audrey; Lam, Julie; Le, Hanh; Malyska, Nicolas; Marcucci, Giorgia; Marvi, Sarah; McConnell, Sara; Melot, Jennifer; Mensch, Alyssa; Morrison, Michelle; Paget, Shelley; Richardson, Frederick; Roberts, Annette; Rubino, Carl; Samushia, Lela. 2025. Linguistic Data Consortium.
Terms: area_Asia area_Europe country_GB country_GE dcmi_Sound dcmi_Text iso639_eng iso639_kat olac_primary_text


http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2025S01
Up-to-date as of: Wed Feb 19 6:32:52 EST 2025