OLAC Record: onion

OLAC Record
oai:lindat.mff.cuni.cz:11858/00-097C-0000-000D-F67B-7

Metadata

Title: onion

Bibliographic Citation: http://hdl.handle.net/11858/00-097C-0000-000D-F67B-7

Creator: Pomikálek, Jan

Date (W3CDTF): 2013-02-01T16:34:32Z

Date Available: 2013-02-01T16:34:32Z

Description: onion (ONe Instance ONly) is a tool for removing duplicate parts from large collections of texts. The tool is implemented in Python, licensed under the New BSD License, and released as open-source software (available for download, including the source code, at http://code.google.com/p/onion/). It is being successfully used for cleaning large textual corpora at the Natural Language Processing Centre at the Faculty of Informatics, Masaryk University, Brno, and by its industry partners. The research leading to this software was published in the author's Ph.D. thesis "Removing Boilerplate and Duplicate Content from Web Corpora". The deduplication algorithm is based on comparing n-grams of words in the text. The author's algorithm has been shown to be more suitable for deduplicating textual corpora than competing algorithms (Broder, Charikar): in addition to detecting identical or very similar (95%) duplicates, it can detect even partially similar duplicates (50%) while still achieving great performance (further described in the author's Ph.D. thesis). The unique deduplication capabilities and scalability of the algorithm were demonstrated while building corpora of American Spanish, Arabic, Czech, French, Japanese, Russian, Tajik, and six Turkic languages: several TB of text documents were deduplicated, resulting in corpora of 70 billion tokens altogether.
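To illustrate the general idea of deduplication by comparing word n-grams, here is a minimal Python sketch. It is not onion's actual code or interface: the function names `word_ngrams` and `deduplicate`, the default n-gram length, the in-memory set of seen n-grams, and the rule of dropping a paragraph once the share of already-seen n-grams reaches a threshold are all assumptions made for this example.

```python
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def word_ngrams(text: str, n: int) -> List[NGram]:
    """Return all word n-grams of `text`, using whitespace tokenization."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def deduplicate(paragraphs: Iterable[str], n: int = 5,
                threshold: float = 0.5) -> List[str]:
    """Keep paragraphs whose share of already-seen n-grams is below `threshold`.

    Hypothetical illustration only; the thresholds quoted in the record
    (95 %, 50 %) are used here merely as example values.
    """
    seen: Set[NGram] = set()
    kept: List[str] = []
    for paragraph in paragraphs:
        grams = word_ngrams(paragraph, n)
        if grams:
            duplicate_share = sum(g in seen for g in grams) / len(grams)
            if duplicate_share >= threshold:
                continue  # treat as a (partial) duplicate and drop it
        kept.append(paragraph)
        seen.update(grams)  # only kept paragraphs contribute n-grams
    return kept


docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy cat",
    "an entirely different sentence about corpus building tools",
]
unique_docs = deduplicate(docs, n=3, threshold=0.5)
```

With these inputs the second paragraph shares six of its seven trigrams with the first, so it is dropped and `unique_docs` holds the first and third paragraphs. Lowering `threshold` makes the filter more aggressive about partial duplicates; a tool working on several TB of text would need a far more memory-efficient n-gram store than a Python set.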

Description: PRESEMT, Lexical Computing Ltd

Identifier (URI): http://hdl.handle.net/11858/00-097C-0000-000D-F67B-7

Language: English

Language (ISO639): eng

Publisher: Masaryk University, NLP Centre

Rights: BSD 3-Clause "New" or "Revised" license

Rights: http://opensource.org/licenses/BSD-3-Clause

Subject: deduplication

corpus

text deduplication

n-gram deduplication

n-gram model

Type: toolService

Type (DCMI): Software

OLAC Info

Archive: LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University

Description: http://www.language-archives.org/archive/lindat.mff.cuni.cz

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:lindat.mff.cuni.cz:11858/00-097C-0000-000D-F67B-7

DateStamp: 2021-06-29

GetRecord: OAI-PMH request for simple DC format
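
The GetRecord entries in this record refer to OAI-PMH requests. As a rough sketch, such a request is a URL with the `verb`, `identifier`, and `metadataPrefix` query parameters defined by the OAI-PMH protocol. The helper name `get_record_url` and the base URL `https://example.org/oai` are placeholders assumed for this example; the real endpoint of the archive is not given in this record.

```python
from urllib.parse import urlencode


def get_record_url(base_url: str, identifier: str, metadata_prefix: str) -> str:
    """Build an OAI-PMH GetRecord request URL (hypothetical helper)."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# Placeholder endpoint: replace with the archive's actual OAI-PMH base URL.
BASE_URL = "https://example.org/oai"
OAI_ID = "oai:lindat.mff.cuni.cz:11858/00-097C-0000-000D-F67B-7"

olac_url = get_record_url(BASE_URL, OAI_ID, "olac")   # OLAC format
dc_url = get_record_url(BASE_URL, OAI_ID, "oai_dc")   # simple Dublin Core
```

`urlencode` percent-encodes the colons and the slash in the identifier, which keeps the query string valid.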

Search Info
Citation: Pomikálek, Jan. 2013. Masaryk University, NLP Centre.
Terms: area_Europe country_GB dcmi_Software iso639_eng

http://www.language-archives.org/item.php/oai:lindat.mff.cuni.cz:11858/00-097C-0000-000D-F67B-7
Up-to-date as of: Sat Mar 22 1:01:43 EDT 2025
