OLAC Record: Japanese Web N-gram Version 1

OLAC Record
oai:www.ldc.upenn.edu:LDC2009T08

Metadata

Title: Japanese Web N-gram Version 1

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Kudo, Taku, and Hideto Kazawa. Japanese Web N-gram Version 1 LDC2009T08. Web Download. Philadelphia: Linguistic Data Consortium, 2009

Contributor: Kudo, Taku

Contributor: Kazawa, Hideto

Date (W3CDTF): 2009

Date Issued (W3CDTF): 2009-04-16

Description: *Introduction* Japanese Web N-gram Version 1, Linguistic Data Consortium (LDC) catalog number LDC2009T08 and ISBN 1-58563-510-3, was created by Google Inc. It consists of Japanese "word" n-grams and their observed frequency counts generated from over 255 billion tokens of text. The length of the n-grams ranges from unigrams to seven-grams.

The n-grams were extracted from publicly accessible web pages that were crawled by Google in July 2007. This data set contains only n-grams that appear at least 20 times in the processed sentences. Less frequent n-grams were discarded. Web pages requiring user authentication, pages containing "noarchive" or "noindex" meta tags, and pages under other special restrictions were excluded from the final release. While the aim was to process only Japanese pages, the corpus may contain some pages in other languages due to language detection errors.

This dataset will be useful for research in areas such as statistical machine translation, language modeling and speech recognition, among others.

*Data* Before the n-grams were collected, the web pages were converted into UTF-8 encoding, normalized into Unicode Normalization Form KC (see below), and split into sentences. Ill-formed sentences were filtered out, and the remaining sentences were segmented into "words".

All strings were normalized into Unicode Normalization Form KC (NFKC), which is described in http://www.unicode.org/unicode/reports/tr15/. Japanese strings were normalized according to the following rules:

* Full-width letters/digits were converted to ASCII letters/digits
* Half-width katakana were converted to full-width katakana
* Glyphs for Roman digits were converted to ASCII characters
* Certain Japanese-specific symbols were converted

The vocabulary was restricted to "words" that appeared at least 50 times in the processed sentences.
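As an illustration of the NFKC normalization described above, the following is a minimal sketch using Python's standard `unicodedata` module. It is not the pipeline used to build the corpus; the function name `normalize_nfkc` and the sample strings are assumptions chosen for illustration. It covers the first three rules only, since the record does not say which Japanese-specific symbols were converted.

```python
import unicodedata


def normalize_nfkc(text: str) -> str:
    """Return text in Unicode Normalization Form KC (hypothetical helper)."""
    return unicodedata.normalize("NFKC", text)


# Sample strings chosen for illustration; they are not taken from the corpus.
samples = [
    "ＡＢＣ１２３",  # full-width letters/digits
    "ｶﾀｶﾅ",          # half-width katakana
    "Ⅳ",             # single-glyph Roman numeral
]

for s in samples:
    print(repr(s), "->", repr(normalize_nfkc(s)))
```

Normalizing query strings the same way before looking them up in the n-gram tables helps avoid misses caused by width variants.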
Statistical information about the corpus is set forth in the following table:

Data size: The total compressed data size is about 26GB.

Number of tokens:            255,198,240,937
Number of sentences:          20,036,793,177
Number of unique unigrams:         2,565,424
Number of unique bigrams:         80,513,289
Number of unique trigrams:       394,482,216
Number of unique 4-grams:        707,787,333
Number of unique 5-grams:        776,378,943
Number of unique 6-grams:        688,782,933
Number of unique 7-grams:        570,204,252

*Samples* Japanese Bigram Japanese Trigram
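The two frequency cutoffs mentioned in the description (words seen at least 50 times, n-grams seen at least 20 times) can be sketched roughly as follows. This is a toy in-memory illustration, not Google's actual procedure: the function name `count_ngrams`, its parameters, and the `<UNK>` placeholder for out-of-vocabulary words are all assumptions, and the record does not state how out-of-vocabulary words were treated.

```python
from collections import Counter

UNK = "<UNK>"  # assumption: placeholder for words below the vocabulary cutoff


def count_ngrams(sentences, max_n=7, min_word_count=50, min_ngram_count=20):
    """Count 1..max_n-grams over pre-segmented sentences, applying both cutoffs."""
    sentences = [list(s) for s in sentences]

    # Pass 1: build the vocabulary from raw word frequencies.
    word_counts = Counter(w for s in sentences for w in s)
    vocab = {w for w, c in word_counts.items() if c >= min_word_count}

    # Pass 2: count n-grams after mapping rare words to the placeholder.
    counts = Counter()
    for s in sentences:
        mapped = [w if w in vocab else UNK for w in s]
        for n in range(1, max_n + 1):
            for i in range(len(mapped) - n + 1):
                counts[tuple(mapped[i:i + n])] += 1

    # Discard n-grams below the frequency cutoff.
    return {g: c for g, c in counts.items() if c >= min_ngram_count}


# Tiny usage example with lowered thresholds so the toy data survives the cutoffs.
toy = [["a", "b", "c"]] * 3 + [["a", "d"]]
kept = count_ngrams(toy, max_n=2, min_word_count=2, min_ngram_count=3)
print(kept)
```

At the scale reported in the table, counting would need to be distributed rather than held in a single in-memory counter; the sketch only shows the order in which the two cutoffs apply.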

Extent: Corpus size: 26214400 KB

Identifier: LDC2009T08

https://catalog.ldc.upenn.edu/LDC2009T08

ISBN: 1-58563-510-3

ISLRN: 380-138-081-238-9

DOI: 10.35111/gs5s-gg06

Language: Japanese

Language (ISO639): jpn

License: Japanese Web N-gram Version 1 Agreement: https://catalog.ldc.upenn.edu/license/japanese-web-n-gram-version-1-ldc2009t08.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2009T08

Rights Holder: Portions © 2007 Google Inc., © 2009 Trustees of the University of Pennsylvania

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2009T08

DateStamp: 2020-11-30

GetRecord: OAI-PMH request for simple DC format

Search Info
Citation: Kudo, Taku; Kazawa, Hideto. 2009. Linguistic Data Consortium.
Terms: area_Asia country_JP dcmi_Text iso639_jpn olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2009T08
Up-to-date as of: Wed Oct 29 7:01:06 EDT 2025
