OLAC Record: Klex: Finite-State Lexical Transducer for Korean

OLAC Record
oai:www.ldc.upenn.edu:LDC2004L01

Metadata

Title: Klex: Finite-State Lexical Transducer for Korean

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Han, Na-Rae. Klex: Finite-State Lexical Transducer for Korean LDC2004L01. Web Download. Philadelphia: Linguistic Data Consortium, 2004

Contributor: Han, Na-Rae

Date (W3CDTF): 2004

Date Issued (W3CDTF): 2004-03-15

Description:

*Introduction*

Klex: Finite-State Lexical Transducer for Korean was produced by the Linguistic Data Consortium (LDC) and contains a set of script and text files comprising a tool for morphological analysis and generation of Korean text. Klex is a lexical transducer with the lexical string in the upper section and the inflected surface string in the lower section. Klex was developed on XFST (Xerox Finite State Tool), a software platform developed and distributed by the Xerox Corporation. The most common application for such lexical transducers is morphological analysis and generation.

*Data*

The distribution consists of approximately 7.8 MB of data. Characters in Hangul (Korean alphabet) can be displayed by selecting Korean encoding in your browser.

A lexicon in the form of a transducer has the following basic structure:

    fly/VV+s/ECS | flies
    돕/VV+었/EPF+다/EFN | 도왔다

A sequence of morphemes, each with its part-of-speech tag, constitutes the upper string; a fully lexicalized form constitutes the lower string. A transducer network as a whole consists of all such possible morpheme sequence / word pairs in the language. Given the lower lexicalized form, the transducer can produce the analyzed morpheme sequence (the process of "looking-up"); conversely, the transducer can be used to produce the fully inflected surface form of a grammatical sequence of morphemes (the opposite of "looking-up," hence Xerox's terminology of "looking-down"). These two operations, morphological analysis and generation, are the most typical applications of such lexical transducers.

The output of Klex when used as a morphological analyzer is compatible with the Morphologically Annotated Korean Text (LDC2004T03) corpus. It also conforms to the Korean Treebank POS annotation standards, with slight variation.

The Korean morphological grammar employed by Klex was constructed by Na-Rae Han, under the guidance of Ken Beesley, Lauri Karttunen, and Martha Palmer. The lexicon was fine-tuned by testing against various corpora, fixing undesirable outputs, and adding missing lexical entries. Klex was partially supported by the Korean Treebank Project, whose result was published in 2002 as the Korean English Treebank Annotations (LDC2002T26).

*Samples*

Please view these examples of the analyzer:

* Small text from "Little Prince"
* Tokenized text
* Tokenized and romanized text
* Sample output of morphological analyzer

*Updates*

There are no updates available at this time.

*Sponsorship*

The Klex corpus was funded in part through a five-year grant (BCS-998009, KDI, SBE) from the National Science Foundation via TalkBank, an interdisciplinary project to foster research and development in communicative behavior by providing tools and standards for analysis and distribution of language data. Additional funding was provided by the Linguistic Data Consortium.

*Note*

The cost of the first 50 copies of this publication (not counting the copies distributed to LDC members) is covered by NSF Grant Number BCS-998009, and these copies are therefore free of charge. After these first 50 copies are distributed, additional copies will be available for $2000.
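
The "looking-up" and "looking-down" operations described under Data can be illustrated with a toy model. The sketch below is not Klex and does not use XFST: it assumes a transducer can be approximated by a finite list of (upper, lower) string pairs, using only the two example pairs from the description. The names `PAIRS`, `ToyTransducer`, `look_up`, and `look_down` are hypothetical and chosen for illustration only.

```python
from collections import defaultdict

# The two upper/lower pairs quoted in the description.
# A real lexical transducer encodes such pairs as a network; this list is a toy stand-in.
PAIRS = [
    ("fly/VV+s/ECS", "flies"),
    ("돕/VV+었/EPF+다/EFN", "도왔다"),
]


class ToyTransducer:
    """Hypothetical dictionary-backed stand-in for a lexical transducer."""

    def __init__(self, pairs):
        self._analyses = defaultdict(list)  # lower (surface) string -> upper strings
        self._surfaces = defaultdict(list)  # upper (analysis) string -> lower strings
        for upper, lower in pairs:
            self._analyses[lower].append(upper)
            self._surfaces[upper].append(lower)

    def look_up(self, surface):
        """Morphological analysis: surface form -> morpheme/POS sequences."""
        return list(self._analyses.get(surface, []))

    def look_down(self, analysis):
        """Morphological generation: morpheme/POS sequence -> surface forms."""
        return list(self._surfaces.get(analysis, []))


klex_toy = ToyTransducer(PAIRS)
print(klex_toy.look_up(PAIRS[1][1]))    # analyses of the Korean surface form
print(klex_toy.look_down(PAIRS[0][0]))  # surface forms for the English analysis
```

Both methods return lists because, in general, a surface form may have several analyses and an analysis may have several surface realizations; an unknown input yields an empty list.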

Identifier: LDC2004L01

https://catalog.ldc.upenn.edu/LDC2004L01

ISBN: 1-58563-283-x

ISLRN: 031-806-130-080-1

DOI: 10.35111/jtsj-py44

Language: Korean

Language (ISO639): kor

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2004L01

Rights Holder: Portions © 2004 Trustees of the University of Pennsylvania

Subject: Korean language

Subject (ISO639): kor

Type (DCMI): Text

Type (OLAC): lexicon

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2004L01

DateStamp: 2024-04-03

GetRecord: OAI-PMH request for simple DC format
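
The GetRecord entries above correspond to OAI-PMH requests. As a rough sketch, such a request URL could be assembled as follows. The base URL is a placeholder, since this record does not state the repository endpoint, and `get_record_url` is a hypothetical helper. The `verb`, `identifier`, and `metadataPrefix` arguments are those defined by OAI-PMH; `oai_dc` is the standard prefix for simple Dublin Core, and `olac` is assumed here to be the prefix for the OLAC format.

```python
from urllib.parse import urlencode

# Placeholder endpoint (assumption): the record does not give the real base URL.
BASE_URL = "https://example.org/oai"

# Identifier taken from the OaiIdentifier field of this record.
OAI_IDENTIFIER = "oai:www.ldc.upenn.edu:LDC2004L01"


def get_record_url(base_url, identifier, metadata_prefix):
    """Build an OAI-PMH GetRecord request URL (hypothetical helper)."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


dc_url = get_record_url(BASE_URL, OAI_IDENTIFIER, "oai_dc")    # simple DC format
olac_url = get_record_url(BASE_URL, OAI_IDENTIFIER, "olac")    # OLAC format (assumed prefix)
print(dc_url)
print(olac_url)
```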

Search Info
Citation: Han, Na-Rae. 2004. Linguistic Data Consortium.
Terms: area_Asia country_KR dcmi_Text iso639_kor olac_lexicon

Inferred Metadata
Country: South Korea
Area: Asia

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2004L01
Up-to-date as of: Wed Oct 29 7:00:17 EDT 2025
