OLAC Record: CAMIO Transcription Languages

OLAC Record
oai:www.ldc.upenn.edu:LDC2022T07

Metadata

Title: CAMIO Transcription Languages

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Arrigo, Michael, Stephanie Strassel, and Christopher Caruso. CAMIO Transcription Languages LDC2022T07. Web Download. Philadelphia: Linguistic Data Consortium, 2022

Contributor: Arrigo, Michael

Strassel, Stephanie

Caruso, Christopher

Date (W3CDTF): 2022

Date Issued (W3CDTF): 2022-12-15

Description: *Introduction* CAMIO Transcription Languages was developed by the Linguistic Data Consortium and contains nearly 70,000 images of machine-printed text with corresponding annotations and transcripts in the following 13 languages: Arabic, Chinese, English, Farsi, Hindi, Japanese, Kannada, Korean, Russian, Tamil, Thai, Urdu, and Vietnamese. This corpus is a subset of data created for a broader effort to support the development and evaluation of optical character recognition (OCR) and related technologies for 35 languages across 24 unique script types. The CAMIO (Corpus of Annotated Multilingual Images for OCR) collection was designed to address gaps in language and script coverage from existing corpora and to support future evaluation of OCR capabilities through a systematically constructed data set. *Data* Most images were annotated for text localization, resulting in over 2.3M line-level bounding boxes. For the 13 languages represented in this release, 1,250 images per language were also annotated with orthographic transcriptions of each line plus specification of reading order, yielding over 2.4M tokens of transcribed text. The resulting annotations are represented in a comprehensive XML output format defined for this corpus. The script for each language is indicated in parentheses: Arabic (Arabic), Chinese (Simplified), English (Latin), Farsi (Arabic), Hindi (Devanagari), Japanese (Japanese), Kannada (Kannada), Korean (Hangul), Russian (Cyrillic), Tamil (Tamil), Thai (Thai), Urdu (Arabic), and Vietnamese (Latin). Data for each language is partitioned into test, train, and validation sets. *Samples* Please view these samples: * Image Sample (png.ldcc) * Annotation Sample (xml) *Updates* None at this time.
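The counts quoted in the Description can be combined into a few derived figures. The short Python sketch below is only a back-of-the-envelope check: the constant names are illustrative, and because the record says "nearly 70,000" images and "over 2.4M" tokens, the share and the per-image average are bounds rather than exact values.

```python
# Figures quoted in the Description field above.
LANGUAGES = 13
TRANSCRIBED_IMAGES_PER_LANGUAGE = 1250
TOTAL_IMAGES_APPROX = 70_000        # "nearly 70,000", so treated as an upper bound
TRANSCRIBED_TOKENS_MIN = 2_400_000  # "over 2.4M", so treated as a lower bound

# Number of images that carry line-level transcriptions.
transcribed_images = LANGUAGES * TRANSCRIBED_IMAGES_PER_LANGUAGE

# Lower bounds that follow from the rounded figures.
transcribed_share = transcribed_images / TOTAL_IMAGES_APPROX
tokens_per_image = TRANSCRIBED_TOKENS_MIN / transcribed_images

print(f"transcribed images: {transcribed_images}")
print(f"share of all images: more than {transcribed_share:.1%}")
print(f"tokens per transcribed image: more than {int(tokens_per_image)}")
```

The remaining images carry text-localization annotations only, which is consistent with the statement that most images were annotated with bounding boxes while a fixed 1250 per language were transcribed.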

Extent: Corpus size: 12595989 KB

Identifier: LDC2022T07

https://catalog.ldc.upenn.edu/LDC2022T07

ISLRN: 014-810-264-834-8

DOI: 10.35111/r7ds-gy89

Language: English

Arabic

Persian

Hindi

Japanese

Kannada

Korean

Russian

Tamil

Thai

Urdu

Vietnamese

Mandarin Chinese

Language (ISO639): eng

ara

fas

hin

jpn

kan

kor

rus

tam

tha

urd

vie

cmn

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2022T07

Rights Holder: Portions © 2007, 2015, 2017-2020 1399 picofiles, © 2015-2019 65tes-habeshamusic.com, © 2019-2020 Accessify.com, © 2019-2020 Adobe, © 2013, 2019-2020 Alamy Ltd., © 2010-2011, 2019-2020 Amazon.com, Inc. or its affiliates, © 2008, 2018-2019 ambebi.ge, © 2000, 2019-2020 A Medium Corporation, © 2019-2020 App Annie, © 2019 AppKiwi, © 2014, 2019 Armenian News - Tert.am, © 2012-2014, 2018-2019 ARMENPRESS, © 2002, 2006, 2008, 2010-2011, 2013-2014, 2019-2020 Assimba.org, © 2011-2019 Atv - Eritrean Satellite Television, © 2016-2017 AtYourService.pk, © 2018-2019 Aysor, © 2019 Bag, © 2002-2003, 2009-2019 Baidu, © 2017-2019 Bangla sms bengali shayari, © 2019 bbcode0.com, © 2014, 2019-2020 Benawa Network, © 2002, 2012-2019 Bennett Coleman & Co. Ltd., © 2013-2019 Best TV, © 2000, 2019-2020 BigCommerce Pty. Ltd., © 2019 Bnet Technologies, © 2017 BONDHU2U, © 2011, 2015-2018 BuzzFeed, Inc., © 2016-2017, 2019 CBSEPORTAL.COM, © 2019-2020 cinejosh.com, © 2019 Civic Network OPORA, © 2010, 2018-2019 Clipart.com, a division of Vital Imagery Ltd., © 2000, 2015, 2019-2020 Cloudinary, © 2012, 2019 CMS, © 2013, 2018-2019 COUNTRY.ua, © 2014-2019 CyberAgent, Inc., © 2019 Daily Hunt, © 2013-2020 Dehai.org, © 2015, 2019 Deutsche Welle, © 2016-2019 DF Marketplace Company Limited, © 2019 DigitalOcean, LLC, © 2017, 2019 DocPlayer.hu, © 2019 Dreamstime, © 2000, 2007-2019 DuckDuckGo Blog, © 2012-2014, 2016-2019 DVB Multimedia Group, © 2018-2019 DYODEKA SA, © 2012-2019 EastAFRO.com, © 2018-2020 eBay Inc., © 2017-2019 Electronic Database of Cultural Values, © 2019-2020 ePapersland.com, © 2010-2019 Eritrea-Chat.Com, © 2019 Ethiopian Press Agency, © 2019 Etsy, Inc., © 2015-2016, 2019-2020 Exotic India, © 2017-2018 Ezinemart, © 2019-2020 F5, Inc., © 2019 Fine Arts Department. Ministry of Culture, © 2013-2019 Free 4 Reader, © 2016-2019 FRESH NEWS, © 2016-2019 Global Publishers, © 2000-2001, 2003, 2005-2006, 2011, 2015-2020 Google Inc., © 2019-2020 Google LLC, © 2013-2018 Goolgule, © 2013-2014, 2016, 2019 Hetq, © 2010-2014, 2016 Himalayabon.com, © 2011, 2019-2020 Holding "Labyrinth", © 2019 Houshamadyan - Houshamadyan e.V., © 2014, 2019 HRAPARAK, © 2019-2020 Imgur, Inc, © 2016, 2019 Institute for Development of Freedom of Information, © 2013-2015, 2017-2019 IRAVABAN.NET, © 2018-2019 Islam land, © 2019 Jagran Prakashan Ltd, © 2013, 2019 Jofogas, © 2005-2006, 2019 Kapruka.com, © 2019 Kerala Niyamasabha, © 2019 Kesari Weekly, © 2019 Khamsat.com, subsidiary of Hsoub, © 2019 Kidzpark, © 2019 LEPL LEGISLATIVE HERALD OF GEORGIA, © 2019 LLC "Infourok", © 2012, 2019-2020 LLC "Yandex", © 2014, 2019 Magzter Inc, © 2003, 2019 Mahibere Kidusan, © 2006, 2017-2020 Mashreq News, © 2012, 2016-2019 Matichon Public Co., Ltd., © 2016-2018 MemeBuster, © 2019 Mereb Inc., © 2019 Microsoft, © 2014-2019 MillardAyo.com, © 2019 Minhaj-ul-Quran International, © 2015-2016, 2018-2019 MJ Innovations (Pvt) Ltd, © 2003, 2008, 2019-2020 Mohalla Tech Pvt. Ltd., © 2019 Mohsensoft, © 2019 MyShared Inc., © 2018 Nai, © 2014, 2017-2019 Newsroom Ltd., © 2019 Nikand, © 2002, 2019 nplg.gov.ge, © 2016, 2019 nuaodisha.com, © 2012, 2015-2016, 2018-2019 OdiaWeb, © 2019 Omedia Studio, © 2015-2019 online auction auction.ru, © 2019-2020 Owler Inc., © 2019 Oxford University Press, © 2019 "Paste.Pics", © 2010, 2014-2019 People's Daily Online, © 2017, 2019 Pinterest, © 2019 Prom.ua, © 2019 Qurango, © 2011, 2013, 2019-2020 ResearchGate GmbH, © 2019-2020 Reddit Inc, © 2014, 2016-2017, 2019-2020 RFE/RL, Inc., © 2019 Rozetka Online Store, © 2012, 2018-2019 Sambad, © 2016-2019 Satenaw News/Breaking News, © 2019 Scribd Inc., © 2018-2019 Semayat Book Store, © 2018-2019 Shant TV, © 2002, 2008, 2013, 2019 Share Your Essays, © 2018-2019 Shutterstock, Inc., © 2019 Simon Ager, © 2008, 2010-2013, 2019-2020 SlidePlayer.com Inc., © 2017, 2019-2020 SlideServe, © 2019 Slide-Share, © 2016-2017 Smart Doc Posters, © 2009, 2019-2020 SmugMug, Inc., © 2019 spotidoc.com, © 2009, 2019 Squarespace, © 2007, 2019 svitppt Inc., © 2013-2014, 2017, 2019 Tabula, © 2019 Teachers Pay Teachers|Teacher Synergy LLC, © 2019-2020 TAMIL TEXTBOOKS, © 2019 Tanzania Educational Publishers Ltd, © 2019-2020 TeluguOne.com, © 2010, 2019 Text Book Centre Ltd, © 2016, 2018-2019 The Hankyoreh, © 2019 The News Minute, © 2019 The Samaja Epaper, © 2000, 2019 The University of Chicago, © 2010, 2018-2019 Tibet News, © 2015, 2017 Tibetan Community Health Network, © 2017-2019 Tigray Communication Affairs Bureau, © 2019-2020 TripAdvisor LLC, © 2015-2019 Tsanpo.com, © 2015-2019 Tsem Rinpoche, © 2017-2019 University of South-East Asia, © 2004-2007, 2011-2014, 2016-2019 Upali Newspapers (Pvt) Ltd., © 2019 VietTouch, © 2019 Vindad, © 2019 Vinh Phuc Newspaper, © 2012, 2019 Wasabi Technologies, Inc., © 2019-2020 WatKhemaraRatanaram.org, © 2011, 2016, 2018-2020 Wonder Idea Technology Co., Ltd., © 2019 WorthPoint Corporation, © 2013, 2019 www.Dek-D.com, © 2019 Yakaboo, © 2009, 2011, 2019 yeddyurappa.in, © 2001, 2003-2004, 2009, 2012-2013, 2016, 2019-2020 Yumpu.com, © 2011-2019 ZeHabesha, © 2020, 2022 Trustees of the University of Pennsylvania

Type (DCMI): StillImage

Text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2022T07

DateStamp: 2023-12-05

GetRecord: OAI-PMH request for simple DC format
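
The GetRecord entries above refer to OAI-PMH requests for this record in two metadata formats. As a minimal sketch, such a request is an HTTP GET with a `verb`, an `identifier`, and a `metadataPrefix` parameter. The base URL below is a placeholder, since this record does not state the repository endpoint; `oai_dc` is the simple Dublin Core prefix that OAI-PMH requires of every repository, and `olac` is assumed here to be the prefix for the OLAC format.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: the record does not give the repository's base URL.
BASE_URL = "https://example.org/oai"


def get_record_url(identifier: str, metadata_prefix: str,
                   base_url: str = BASE_URL) -> str:
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# Identifier taken from the OaiIdentifier field of this record.
OAI_ID = "oai:www.ldc.upenn.edu:LDC2022T07"

olac_url = get_record_url(OAI_ID, "olac")    # OLAC format (assumed prefix)
dc_url = get_record_url(OAI_ID, "oai_dc")    # simple DC format

print(olac_url)
print(dc_url)
```

`urlencode` percent-encodes the colons in the identifier, which is what the protocol expects for reserved characters in query values.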

Search Info
Citation: Arrigo, Michael; Strassel, Stephanie; Caruso, Christopher. 2022. Linguistic Data Consortium.
Terms: area_Asia area_Europe country_CN country_GB country_IN country_JP country_KR country_PK country_RU country_TH country_VN dcmi_StillImage dcmi_Text iso639_ara iso639_cmn iso639_eng iso639_fas iso639_hin iso639_jpn iso639_kan iso639_kor iso639_rus iso639_tam iso639_tha iso639_urd iso639_vie
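
The Terms line appears to encode search facets as `prefix_value` tokens separated by spaces. That convention is inferred from the line itself rather than from any stated specification, so the following Python sketch is illustrative only; `group_terms` is a hypothetical helper name.

```python
from collections import defaultdict

# Terms line copied from the Search Info section of this record.
TERMS = (
    "area_Asia area_Europe country_CN country_GB country_IN country_JP "
    "country_KR country_PK country_RU country_TH country_VN dcmi_StillImage "
    "dcmi_Text iso639_ara iso639_cmn iso639_eng iso639_fas iso639_hin "
    "iso639_jpn iso639_kan iso639_kor iso639_rus iso639_tam iso639_tha "
    "iso639_urd iso639_vie"
)


def group_terms(terms: str) -> dict[str, list[str]]:
    """Group 'prefix_value' tokens by prefix, keeping their original order."""
    facets: dict[str, list[str]] = defaultdict(list)
    for term in terms.split():
        # Split on the first underscore only, in case a value contains one.
        facet, _, value = term.partition("_")
        facets[facet].append(value)
    return dict(facets)


facets = group_terms(TERMS)
for facet, values in facets.items():
    print(f"{facet}: {', '.join(values)}")
```

Grouped this way, the `iso639` facet lists the same 13 codes as the Language (ISO639) field, which makes a quick consistency check between the two fields possible.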

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2022T07
Up-to-date as of: Wed Oct 29 7:02:10 EDT 2025
