OLAC Record: Mixer 6 - CHiME 8 Transcribed Calls and Interviews

OLAC Record
oai:www.ldc.upenn.edu:LDC2025S07

Metadata

Title: Mixer 6 - CHiME 8 Transcribed Calls and Interviews

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Wiesner, Matthew, et al. Mixer 6 - CHiME 8 Transcribed Calls and Interviews LDC2025S07. Web Download. Philadelphia: Linguistic Data Consortium, 2025

Contributor: Wiesner, Matthew; Raj, Desh; Maciejewski, Matthew; Haviland, Chloe; Cornell, Samuele; Chodroff, Eleanor; Khudanpur, Sanjeev; Godfrey, Jack

Date (W3CDTF): 2025

Date Issued (W3CDTF): 2025-08-15

Description: *Introduction*

Mixer 6 - CHiME 8 Transcribed Calls and Interviews was developed for the 7th and 8th CHiME (Computational Hearing in Multisource Environments) challenges. It contains 80 hours of English interviews and telephone speech from Mixer 6 Speech (LDC2013S03), with transcripts developed for the CHiME challenges and divided into training, development, and test sets. This data was used in CHiME 7 Task 1 and CHiME 8 Task 1, both of which focused on transcription and segmentation across varied recording conditions such as interviews, meetings, and dinner parties, with an emphasis on generalization across recording device types and array topologies.

Mixer 6 Speech was developed by the Linguistic Data Consortium (LDC) and comprises 15,863 hours of audio recordings of interviews, transcript readings, and conversational telephone speech involving 594 distinct native English speakers recorded over 14 channels. This material was collected by LDC in 2009 and 2010 as part of the Mixer project, specifically phase 6, which focused on native American English speakers local to the Philadelphia area.

*Data*

The data includes audio from Mixer 6 Speech recorded on 13 microphones, for a total of 1063 hours of audio corresponding to 80 hours of speech. The development and test splits are speaker-disjoint from the training data and consist of fully transcribed, multi-microphone interviews.

The transcripts were developed in three phases: (1) manual transcription, segmentation, and automatic alignment with speech; (2) splitting sessions into sets; and (3) splitting certain sessions from the training set. Each segment is labeled with the speaker, the uttered text, and the start and end times of the segment in seconds.

Audio data is provided as 16-bit FLAC files sampled at 16 kHz. Transcripts are released as UTF-8 encoded JSON files.

*Samples*

Please view the following samples:
* Speech Audio (FLAC)
* Transcripts (JSON)

*Updates*

No updates at this time.
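As an illustration of the segment layout described under *Data* (speaker, uttered text, start and end times in seconds), the following is a minimal sketch of reading such a transcript and totalling speech time per speaker. The record does not document the JSON schema, so the key names `speaker`, `words`, `start_time`, and `end_time`, the top-level list layout, and the sample values are all assumptions made for this example; check them against the released files before relying on them.

```python
import json
from collections import defaultdict

# Hypothetical transcript content. The key names and the top-level list
# layout are assumptions for illustration, not the documented schema.
SAMPLE_TRANSCRIPT = """
[
  {"speaker": "spk_A", "words": "hello there", "start_time": "0.50", "end_time": "2.00"},
  {"speaker": "spk_B", "words": "hi", "start_time": "2.25", "end_time": "3.00"},
  {"speaker": "spk_A", "words": "how are you", "start_time": "3.00", "end_time": "4.50"}
]
"""

SAMPLE_RATE_HZ = 16000  # the record states the audio is sampled at 16 kHz


def speech_seconds_by_speaker(json_text):
    """Sum segment durations (end minus start, in seconds) for each speaker."""
    totals = defaultdict(float)
    for segment in json.loads(json_text):
        start = float(segment["start_time"])
        end = float(segment["end_time"])
        if end < start:
            raise ValueError(f"segment ends before it starts: {segment}")
        totals[segment["speaker"]] += end - start
    return dict(totals)


def to_sample_index(seconds, rate=SAMPLE_RATE_HZ):
    """Convert a time in seconds to the nearest audio sample index."""
    return int(round(seconds * rate))


totals = speech_seconds_by_speaker(SAMPLE_TRANSCRIPT)
first_segment_start = to_sample_index(0.50)
```

The sample-index helper shows how segment times in seconds would map onto the 16 kHz audio if a segment needs to be cut out of a decoded FLAC file.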

Extent: Corpus size: 108000000 KB

Format: Sampling Rate: 16000

Format: Sampling Format: 16-bit FLAC

Identifier: LDC2025S07

Identifier: https://catalog.ldc.upenn.edu/LDC2025S07

Identifier: ISLRN: 017-424-674-662-6

Identifier: DOI: 10.35111/pk0y-qp29

Language: English

Language (ISO639): eng

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2025S07

Rights Holder: Portions © 2009-2010, 2013, 2025 Trustees of the University of Pennsylvania

Subject: English language

Subject (ISO639): eng

Subject (OLAC): text_and_corpus_linguistics

Type (DCMI): Sound

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2025S07

DateStamp: 2026-01-01

GetRecord: OAI-PMH request for simple DC format
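The GetRecord entries in this record refer to OAI-PMH requests. A minimal sketch of how such a request URL is typically assembled is shown below. The `verb`, `identifier`, and `metadataPrefix` parameters are standard OAI-PMH arguments and `oai_dc` is the standard prefix for simple Dublin Core; the `olac` prefix and, in particular, the base URL are assumptions here, since the record does not state the archive's OAI-PMH endpoint. `BASE_URL` is a placeholder, not a real endpoint.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Placeholder endpoint: the record does not give the archive's OAI-PMH
# base URL, so this value is hypothetical and must be replaced.
BASE_URL = "https://example.org/oai"

OAI_IDENTIFIER = "oai:www.ldc.upenn.edu:LDC2025S07"


def get_record_url(identifier, metadata_prefix, base_url=BASE_URL):
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# "oai_dc" is the simple Dublin Core format; "olac" is assumed to be
# the prefix the archive uses for the OLAC format.
dc_url = get_record_url(OAI_IDENTIFIER, "oai_dc")
olac_url = get_record_url(OAI_IDENTIFIER, "olac")
```

The identifier is percent-encoded by `urlencode`, so the colons in the OAI identifier are safe to pass as a query value.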

Search Info
Citation: Wiesner, Matthew; Raj, Desh; Maciejewski, Matthew; Haviland, Chloe; Cornell, Samuele; Chodroff, Eleanor; Khudanpur, Sanjeev; Godfrey, Jack. 2025. Linguistic Data Consortium.
Terms: area_Europe country_GB dcmi_Sound dcmi_Text iso639_eng olac_primary_text olac_text_and_corpus_linguistics

Inferred Metadata
Country: United Kingdom
Area: Europe

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2025S07
Up-to-date as of: Wed Jul 8 7:30:33 EDT 2026
