OLAC Record
oai:www.ldc.upenn.edu:LDC2024S10

Metadata
Title:MATERIAL Somali-English Language Pack
Access Rights:Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining
Bibliographic Citation:Abdi, Zeinab, et al. MATERIAL Somali-English Language Pack LDC2024S10. Web Download. Philadelphia: Linguistic Data Consortium, 2024
Contributor:Abdi, Zeinab
Ali, Zahra
Bills, Aric
Bishop, Judith
Boyle, Anne
Chouder, Sarra
Clair, Nathaniel
Conners, Tom
Corey, Cassian
Dubinski, Eyal
Ellis, Corinna
Fernando, Jess
Gibby, Paul
Abdi, Farah H
Hammond, Simon
Hubert, Maxime
Kaiser-Schatzlein, Alice
Kazi, Michael
Lam, Julie
Lazar, Rosie
Le, Hanh
Levot, Michael
Malyska, Nicolas
Melot, Jennifer
Mensch, Alyssa
Omar, Abdulkadir Arale
Paget, Shelley
Richardson, Frederick
Rubino, Carl
Samko, Bern
Sanders, Gregory
Soh, Stephanie
Strahan, Tania E.
Taylor, Jonathan
Thompson, Brian
Tong, Audrey
Tong, Richard
Yelle, Julie
Yu, Jennifer
Zavorin, Ilya
Date (W3CDTF):2024
Date Issued (W3CDTF):2024-09-16
Description:*Introduction* MATERIAL Somali-English Language Pack was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) MATERIAL (Machine Translation for English Retrieval of Information in Any Language) program. It contains approximately 80 hours of Somali conversational telephone speech, transcripts, English translations, annotations and queries. The MATERIAL program focused on underserved languages, with the ultimate goal of building cross-language information retrieval systems that find speech and text content using English search queries.

*Data* The Somali speech in this release represents that spoken in the Northern and Benaadir dialect regions of Somalia. The gender distribution among speakers is approximately equal; speakers' ages range from 16 to 60 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments, including the street, a home or office, a public place, and inside a vehicle. Transcripts cover approximately 10% of the speech data, and approximately 4% of the speech data was translated into English. Further information about transcription and translation methodologies is contained in the documentation accompanying this release. Somali-English Language Pack also includes domain annotations, English queries and their relevance annotations. Annotators marked transcripts by domain (e.g., lifestyle, business-and-commerce, sports, education), by query type (simple, conceptual, hybrid) and by their relevance to query search terms. Speech data is presented as either two-channel wav or single-channel sphere files, predominately in 8kHz A-law format, with some wav files at a sample rate of 48kHz. All text data is UTF-8 encoded.

*Samples* Please view the following samples:
* Audio Sample (WAV)
* Transcript Sample (TXT)
* Translation Sample (TXT)

*Updates* None at this time.
Extent:Corpus size: 13076233 KB
Format:Sampling Rate: 8000
Sampling Format: alaw
Identifier:LDC2024S10
https://catalog.ldc.upenn.edu/LDC2024S10
ISLRN: 462-281-226-328-3
DOI: 10.35111/5550-f323
Language:Somali
English
Language (ISO639):som
eng
License:MATERIAL Somali-English Agreement (For-Profit): https://catalog.ldc.upenn.edu/license/material-somali-english-agreement-for-profit.pdf
MATERIAL Somali-English Agreement (Non-Member): https://catalog.ldc.upenn.edu/license/material-somali-english-agreement-non-member.pdf
MATERIAL Somali-English Agreement (Not-For-Profit): https://catalog.ldc.upenn.edu/license/material-somali-english-agreement-not-for-profit.pdf
Medium:Distribution: Web Download
Publisher:Linguistic Data Consortium
Publisher (URI):https://www.ldc.upenn.edu
Relation (URI):https://catalog.ldc.upenn.edu/docs/LDC2024S10
Rights Holder:Portions © 2024 U.S. Government, © 2024 Trustees of the University of Pennsylvania

The U.S. Government acquired this data from Appen, which assigned the copyright in the data to the U.S. Government.
Type (DCMI):Sound
Text
Type (OLAC):primary_text

OLAC Info

Archive:  The LDC Corpus Catalog
Description:  http://www.language-archives.org/archive/www.ldc.upenn.edu
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:www.ldc.upenn.edu:LDC2024S10
DateStamp:  2024-11-19
GetRecord:  OAI-PMH request for simple DC format
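
The GetRecord entries in this record correspond to OAI-PMH requests for the identifier shown above. As a minimal sketch of how such a request URL is put together: `verb`, `identifier` and `metadataPrefix` are the standard OAI-PMH GetRecord arguments, and `oai_dc` is the standard prefix for simple DC. The base URL below is a hypothetical placeholder, not the archive's actual endpoint, and the helper name `build_get_record_url` is introduced only for illustration. The prefix `olac` for the OLAC format is likewise an assumption to be checked against the repository's ListMetadataFormats response.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: replace with the real OAI-PMH base URL of the repository.
BASE_URL = "https://example.org/oai"

# Identifier taken from the OaiIdentifier field of this record.
OAI_IDENTIFIER = "oai:www.ldc.upenn.edu:LDC2024S10"


def build_get_record_url(base_url: str, identifier: str, metadata_prefix: str) -> str:
    """Return an OAI-PMH GetRecord URL for one record in one metadata format."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


# Simple DC uses the standard "oai_dc" prefix.
dc_url = build_get_record_url(BASE_URL, OAI_IDENTIFIER, "oai_dc")

# "olac" is assumed here as the prefix for the OLAC format.
olac_url = build_get_record_url(BASE_URL, OAI_IDENTIFIER, "olac")

print(dc_url)
print(olac_url)
```

The colons in the identifier are percent-encoded by `urlencode`, which keeps the query string unambiguous; the server decodes them back to the original identifier.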

Search Info

Citation: Abdi, Zeinab; Ali, Zahra; Bills, Aric; Bishop, Judith; Boyle, Anne; Chouder, Sarra; Clair, Nathaniel; Conners, Tom; Corey, Cassian; Dubinski, Eyal; Ellis, Corinna; Fernando, Jess; Gibby, Paul; Abdi, Farah H; Hammond, Simon; Hubert, Maxime; Kaiser-Schatzlein, Alice; Kazi, Michael; Lam, Julie; Lazar, Rosie; Le, Hanh; Levot, Michael; Malyska, Nicolas; Melot, Jennifer; Mensch, Alyssa; Omar, Abdulkadir Arale; Paget, Shelley; Richardson, Frederick; Rubino, Carl; Samko, Bern; Sanders, Gregory; Soh, Stephanie; Strahan, Tania E.; Taylor, Jonathan; Thompson, Brian; Tong, Audrey; Tong, Richard; Yelle, Julie; Yu, Jennifer; Zavorin, Ilya. 2024. Linguistic Data Consortium.
Terms: area_Africa area_Europe country_GB country_SO dcmi_Sound dcmi_Text iso639_eng iso639_som olac_primary_text


http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2024S10
Up-to-date as of: Fri Dec 6 7:49:18 EST 2024