OLAC Record: Corpus of Spontaneous Japanese (CSJ)

OLAC Record
oai:catalogue.elra.info:ELRA-S0488

Metadata

Title: Corpus of Spontaneous Japanese (CSJ)

Access Rights: Rights available for: nonCommercialUse, commercialUse

Date Available (W3CDTF): 2023-09-26

Date Issued (W3CDTF): 2023-09-26

Description: The "Corpus of Spontaneous Japanese" (CSJ) is a database containing a large collection of Japanese spoken-language data and associated information for use in linguistic research. Jointly developed by NINJAL, NICT and the Tokyo Institute of Technology, the CSJ is world-class in both the quantity and quality of the available data. The corpus has been used for a wide variety of research purposes such as spoken language processing, natural language processing, phonetics, psychology, sociology, Japanese education, and dictionary compilation.
The whole CSJ contains about 650 hours of spontaneous speech, corresponding to about 7,000k (7 million) words. All speech materials were recorded using head-worn close-talking microphones and DAT, and down-sampled to 16 kHz with 16-bit resolution. The speech material is transcribed using a two-way transcription scheme designed especially for the CSJ. In addition, POS (part-of-speech) analysis based on two different kinds of "word" is applied to the whole corpus.
Recorded speech is transcribed in two different ways, orthographic and phonetic:
- In "orthographic" transcription, speech is transcribed using Kanji (Chinese logographs) and Kana (Japanese syllabary) just like ordinary Japanese text. Unlike ordinary Japanese writing, however, the orthographic transcription has rigorous rules about the usage of Kanji and Kana. In ordinary text, for example, there are more than five ways of writing the phonemic string /hanasiai/ ("meeting") using Kanji and Kana, but in the CSJ orthographic transcription only one is allowed.
- "Phonetic" transcription is written exclusively in Kana so that the phonetic details of the utterance being transcribed can be traced.
There is a proper subset of the CSJ, called the Core, which contains about 500k words or 45 hours of speech. The Core is the part of the CSJ on which the annotation effort is concentrated. In addition to the two-way transcription and two-way POS analysis, segment labels, intonation labels, and other miscellaneous annotations are provided for the Core.

Identifier: ELRA-S0488

ISLRN: 280-594-494-328-0

Identifier (URI): https://catalog.elra.info/en-us/repository/browse/ELRA-S0488/

Language: Japanese

Language (ISO639): jpn

Medium: Not specified

Publisher: ELRA (European Language Resources Association)

Type (DCMI): Sound

Type (OLAC): primary_text
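
The figures quoted in the Description (650 hours and about 7,000k words for the whole corpus, 45 hours and about 500k words for the Core, audio at 16 kHz and 16 bits) can be turned into rough working estimates. The sketch below is only a back-of-the-envelope illustration: the function names are hypothetical, and it assumes mono, uncompressed PCM audio, which the record does not state. The resulting sizes are therefore not the actual distribution size.

```python
# Rough estimates derived from the numbers in the Description field.
# Assumption (not stated in the record): mono, uncompressed PCM audio.

SAMPLE_RATE_HZ = 16_000   # "down-sampled to 16 kHz"
BYTES_PER_SAMPLE = 2      # 16-bit samples
CHANNELS = 1              # assumed mono


def raw_pcm_bytes(hours: float) -> int:
    """Size in bytes of `hours` of audio under the assumptions above."""
    seconds = hours * 3600
    return int(seconds * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS)


def words_per_hour(words: int, hours: float) -> float:
    """Average number of transcribed words per hour of speech."""
    return words / hours


if __name__ == "__main__":
    # Approximate figures as quoted in the record ("about 650 hours", etc.).
    parts = {
        "whole CSJ": {"words": 7_000_000, "hours": 650},
        "Core": {"words": 500_000, "hours": 45},
    }
    for name, p in parts.items():
        size_gb = raw_pcm_bytes(p["hours"]) / 1e9
        rate = words_per_hour(p["words"], p["hours"])
        print(f"{name}: ~{size_gb:.2f} GB raw PCM, ~{rate:.0f} words/hour")
```

Under these assumptions the Core accounts for roughly 7% of the corpus by both word count and duration, which is consistent with its description as a small, heavily annotated subset.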

OLAC Info

Archive: ELRA Catalogue of Language Resources

Description: http://www.language-archives.org/archive/catalogue.elra.info

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:catalogue.elra.info:ELRA-S0488

DateStamp: 2023-09-26

GetRecord: OAI-PMH request for simple DC format

Search Info
Citation: n.a. 2023. ELRA (European Language Resources Association).
Terms: area_Asia country_JP dcmi_Sound iso639_jpn olac_primary_text

http://www.language-archives.org/item.php/oai:catalogue.elra.info:ELRA-S0488
Up-to-date as of: Wed Oct 1 0:58:20 EDT 2025
