OLAC Record: Chared

OLAC Record
oai:lindat.mff.cuni.cz:11858/00-097C-0000-000D-F67A-9

Metadata

Title: Chared

Bibliographic Citation: http://hdl.handle.net/11858/00-097C-0000-000D-F67A-9

Creator: Pomikálek, Jan

Date (W3CDTF): 2013-02-01T16:32:21Z

Date Available: 2013-02-01T16:32:21Z

Description: Chared is a software tool that detects the character encoding of a text document, provided the language of the document is known. The language of the text has to be specified as an input parameter so that the corresponding language model can be used. The package contains models for a wide range of languages (currently 57, covering all major languages). Furthermore, it provides a training script to learn models for additional languages from a set of user-supplied sample HTML pages in the given language. The detection algorithm is based on determining the similarity of byte trigram vectors. In general, chared should be more accurate than character encoding detection tools that place no constraint on the language. This is an important advantage, allowing the precise character decoding needed for building large textual corpora. The tool has been used for building corpora in American Spanish, Arabic, Czech, French, Japanese, Russian, Tajik, and six Turkic languages, consisting of 70 billion tokens altogether. Chared is open-source software, licensed under the New BSD License and available for download (including the source code) at http://code.google.com/p/chared/. The research leading to this software was published in: POMIKÁLEK, Jan and Vít SUCHOMEL. chared: Character Encoding Detection with a Known Language. In Aleš Horák, Pavel Rychlý. RASLAN 2011. 5th ed. Brno, Czech Republic: Tribun EU, 2011. pp. 125-129, 5 pp. ISBN 978-80-263-0077-9.
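The idea of comparing byte trigram vectors, as named in the description, can be illustrated with a minimal Python sketch. This is not chared's actual code or API: the function names, the use of cosine similarity as the similarity measure, the Czech sample sentence, and the list of candidate encodings are all assumptions made for illustration. A model for each candidate encoding is built by encoding known-language sample text, and the input bytes are assigned to the encoding whose model they resemble most.

```python
from collections import Counter
from math import sqrt


def trigram_vector(data: bytes) -> Counter:
    """Count overlapping byte trigrams in a byte string."""
    return Counter(data[i:i + 3] for i in range(len(data) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two sparse count vectors (assumed measure)."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_models(sample_text: str, encodings: list[str]) -> dict[str, Counter]:
    """Encode known-language sample text in each candidate encoding."""
    return {enc: trigram_vector(sample_text.encode(enc)) for enc in encodings}


def detect(data: bytes, models: dict[str, Counter]) -> str:
    """Return the encoding whose model is most similar to the input bytes."""
    vec = trigram_vector(data)
    return max(models, key=lambda enc: cosine(vec, models[enc]))


# Hypothetical Czech training sample and candidate encodings.
SAMPLE = "Příliš žluťoučký kůň úpěl ďábelské ódy. Žluťoučký kůň příliš úpěl."
MODELS = build_models(SAMPLE, ["utf-8", "iso-8859-2", "cp1250"])
guess = detect("žluťoučký kůň".encode("utf-8"), MODELS)
```

Because the models are built only from text in the stated language, encodings that share most byte values (such as ISO-8859-2 and Windows-1250 for Czech) can still be separated by the few trigrams in which they differ. A real model would be trained on far more text than this single sentence.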

Description: PRESEMT, Lexical Computing Ltd

Identifier (URI): http://hdl.handle.net/11858/00-097C-0000-000D-F67A-9

Language: English

Language (ISO639): eng

Publisher: Masaryk University, NLP Centre

Rights: BSD 3-Clause "New" or "Revised" license

Rights: http://opensource.org/licenses/BSD-3-Clause

Subject: character encoding

character encoding detection

charset

unicode

Type: toolService

Type (DCMI): Software

OLAC Info

Archive: LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University

Description: http://www.language-archives.org/archive/lindat.mff.cuni.cz

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:lindat.mff.cuni.cz:11858/00-097C-0000-000D-F67A-9

DateStamp: 2021-06-29

GetRecord: OAI-PMH request for simple DC format

Search Info
Citation: Pomikálek, Jan. 2013. Masaryk University, NLP Centre.
Terms: area_Europe country_GB dcmi_Software iso639_eng

http://www.language-archives.org/item.php/oai:lindat.mff.cuni.cz:11858/00-097C-0000-000D-F67A-9
Up-to-date as of: Sat Mar 22 1:01:42 EDT 2025
