OLAC Record: MATERIAL Farsi-English Language Pack

OLAC Record
oai:www.ldc.upenn.edu:LDC2024S13

Metadata

Title: MATERIAL Farsi-English Language Pack

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Bills, Aric, et al. MATERIAL Farsi-English Language Pack LDC2024S13. Web Download. Philadelphia: Linguistic Data Consortium, 2024

Contributor: Bills, Aric

Chouder, Sarra

Corey, Cassian

Davoodian, Marjan

Dubinski, Eyal

Ellis, Corinna

Farnam, Reza

Gibby, Paul

Hartwig, Luke

Kalnins, Dagmara

Kazi, Michael

Lam, Julie

Le, Hanh

Malyska, Nicolas

Marvi, Sarah

McConnell, Sara

Melot, Jennifer

Mensch, Alyssa

Moore, Alex

Morrison, Michelle

Paget, Shelley

Richardson, Frederick

Roberts, Annette

Rubino, Carl

Moaddel, Marjan Sadeghi

Date (W3CDTF): 2024

Date Issued (W3CDTF): 2024-12-16

Description: *Introduction*

MATERIAL Farsi-English Language Pack was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) MATERIAL (Machine Translation for English Retrieval of Information in Any Language) program. It contains approximately 61 hours of Farsi conversational telephone speech, transcripts, English translations, annotations and queries. The MATERIAL program focused on underserved languages, with the ultimate goal of building cross-language information retrieval systems that find speech and text content using English search queries.

*Data*

The Farsi speech in this release represents that spoken in the Greater Tehran, Central/Southwest, Northeast, and Northwest dialect regions of Iran, as well as a standard formal dialect in use throughout the country. The gender distribution among speakers is approximately equal; speakers' ages range from 16 to 67 years. Calls were made using different telephones (e.g., mobile, landline) from a variety of environments, including the street, a home or office, a public place, and inside a vehicle.

Transcripts cover approximately a third of the speech data, and approximately 3% of the speech data was translated into English. Further information about transcription and translation methodologies is contained in the documentation accompanying this release.

The language pack also includes English queries and their relevance annotations. Annotators marked transcripts by query type (simple, conceptual, hybrid) and by their relevance to query search terms.

Speech data is presented either as two-channel WAV or single-channel SPHERE files, both in 8 kHz A-law format. All text data is UTF-8 encoded.

*Samples*

Please view the following samples:

* Audio Sample (WAV)
* Transcript Sample (TXT)
* Translation Sample (TXT)

*Updates*

None at this time.
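The description states that the speech is stored as 8 kHz A-law, with two channels in the WAV files. As a rough illustration of what that encoding implies for a consumer of the data, the sketch below expands raw A-law bytes to 16-bit linear PCM using the standard G.711 A-law expansion and separates the two channels. It is not taken from the corpus documentation: the function names are hypothetical, and it assumes the input is the raw audio payload (any file header already removed) with the two channels interleaved sample by sample.

```python
def alaw_to_pcm16(code: int) -> int:
    """Expand one 8-bit G.711 A-law code to a signed 16-bit linear sample."""
    code ^= 0x55                      # undo the even-bit inversion applied by the encoder
    magnitude = (code & 0x0F) << 4    # 4-bit mantissa
    segment = (code & 0x70) >> 4      # 3-bit segment (exponent)
    if segment == 0:
        magnitude += 8
    else:
        magnitude += 0x108
        magnitude <<= segment - 1     # shift is zero for segment 1
    # After the XOR, a set sign bit marks a positive sample.
    return magnitude if code & 0x80 else -magnitude


def split_stereo_alaw(payload: bytes) -> tuple[list[int], list[int]]:
    """Decode interleaved two-channel A-law bytes into two PCM sample lists.

    Assumption: `payload` holds only audio bytes, one byte per sample,
    alternating channel 0, channel 1, channel 0, ...
    """
    samples = [alaw_to_pcm16(b) for b in payload]
    return samples[0::2], samples[1::2]


# Four bytes: two samples per channel.
left, right = split_stereo_alaw(bytes([0xD5, 0x55, 0xAA, 0x2A]))
print(left, right)
```

At 8000 samples per second and one byte per sample, each channel occupies 8000 bytes per second of audio before decoding.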

Extent: Corpus size: 2664892 KB

Format: Sampling Rate: 8000

Sampling Format: alaw

Identifier: LDC2024S13

https://catalog.ldc.upenn.edu/LDC2024S13

ISLRN: 202-347-751-598-9

DOI: 10.35111/7dhe-8213

Language: English

Persian

Language (ISO639): eng

fas

License: MATERIAL Farsi-English Agreement (For-Profit): https://catalog.ldc.upenn.edu/license/material-farsi-english-agreement-for-profit.pdf

MATERIAL Farsi-English Agreement (Non-Member): https://catalog.ldc.upenn.edu/license/material-farsi-english-agreement-non-member.pdf

MATERIAL Farsi-English Agreement (Not-For-Profit): https://catalog.ldc.upenn.edu/license/material-farsi-english-agreement-not-for-profit.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2024S13

Rights Holder: Portions © 2024 U.S. Government, © 2024 Trustees of the University of Pennsylvania. The U.S. Government acquired this data from Appen, which assigned the copyright to the data to the U.S. Government.

Type (DCMI): Sound

Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2024S13

DateStamp: 2025-01-29

GetRecord: OAI-PMH request for simple DC format
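The GetRecord entries above refer to OAI-PMH requests for this record. As a minimal sketch of how such a request is formed, the code below builds GetRecord URLs from the OaiIdentifier listed in this record. The `verb`, `identifier`, and `metadataPrefix` arguments come from the OAI-PMH protocol, and `oai_dc` is its simple Dublin Core prefix. The base URL is a placeholder, not the archive's actual endpoint, and `olac` as the prefix for the OLAC format is an assumption; the helper name is hypothetical.

```python
from urllib.parse import urlencode


def get_record_url(base_url: str, identifier: str, metadata_prefix: str) -> str:
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# Placeholder endpoint: substitute the repository's real OAI-PMH base URL.
BASE_URL = "https://example.org/oai"
RECORD_ID = "oai:www.ldc.upenn.edu:LDC2024S13"

dc_url = get_record_url(BASE_URL, RECORD_ID, "oai_dc")   # simple DC format
olac_url = get_record_url(BASE_URL, RECORD_ID, "olac")   # OLAC format (assumed prefix)
print(dc_url)
print(olac_url)
```

The identifier's colons are percent-encoded in the query string, which is how reserved characters in request arguments are expected to be sent.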

Search Info
Citation: Bills, Aric; Chouder, Sarra; Corey, Cassian; Davoodian, Marjan; Dubinski, Eyal; Ellis, Corinna; Farnam, Reza; Gibby, Paul; Hartwig, Luke; Kalnins, Dagmara; Kazi, Michael; Lam, Julie; Le, Hanh; Malyska, Nicolas; Marvi, Sarah; McConnell, Sara; Melot, Jennifer; Mensch, Alyssa; Moore, Alex; Morrison, Michelle; Paget, Shelley; Richardson, Frederick; Roberts, Annette; Rubino, Carl; Moaddel, Marjan Sadeghi. 2024. Linguistic Data Consortium.
Terms: area_Europe country_GB dcmi_Sound dcmi_Text iso639_eng iso639_fas olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2024S13
Up-to-date as of: Wed Oct 29 7:02:18 EDT 2025
