OLAC Record
oai:www.ldc.upenn.edu:LDC2015T03

Metadata
Title:Avocado Research Email Collection
Access Rights:Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining
Bibliographic Citation:Oard, Douglas, et al. Avocado Research Email Collection LDC2015T03. Web Download. Philadelphia: Linguistic Data Consortium, 2015
Contributor:Oard, Douglas
Webber, William
Kirsch, David A.
Golitsynskiy, Sergey
Date (W3CDTF):2015
Date Issued (W3CDTF):2015-02-16
Description:*Introduction* Avocado Research Email Collection consists of emails and attachments taken from 279 accounts of a defunct information technology company referred to as "Avocado". Most of the accounts are those of Avocado employees; the remainder represent shared accounts such as "Leads", or system accounts such as "Conference Room Upper Canada". The collection consists of the processed personal folders of these accounts with metadata describing folder structure, email characteristics and contacts, among others. It is expected to be useful for social network analysis, e-discovery and related fields. Users should be aware that malicious code (e.g., viruses) may be present in any email collection, including in the Avocado Research Email Collection. One user of the Avocado collection has reported the presence of the loveletter virus in about 27 of the messages in the collection. Users of the Avocado collection should avoid opening messages or attachments in an execution environment that might execute malicious code. Further information about the loveletter virus can be found here. *Data* The source data for the collection consisted of Personal Storage Table (PST) files for 282 accounts. A PST file is used by MS Outlook to store emails, calendar entries, contact details, and related information. Data was extracted from the PST files using libpst version 0.6.54. Three files produced no output and are not included in the collection. Each account is referred to as a "custodian" although some of the accounts do not correspond to humans. The collection is divided into metadata and text. The metadata is represented in XML, with a single top-level XML file listing the custodians, and then one XML file per custodian listing all items extracted from that custodian's PST files. The full XML tree can be read by loading the top-level file with an XML parser that handles directives. All XML metadata files are encoded in UTF-8.
The text contains the extracted text of the items in the custodians' folders, with the extracted text for each item being held in a separate file. The text files are then zipped into a zip file per custodian. *Licensing* Users are required to sign two license agreements in order to access this corpus, the Avocado Collection Organizational License Agreement and the Avocado Collection End User Agreement. Those agreements can be viewed in the License field of this catalog entry. *Updates* None at this time.
Extent:Corpus size: 4407648 KB
Identifier:LDC2015T03
https://catalog.ldc.upenn.edu/LDC2015T03
ISBN: 1-58563-704-1
ISLRN: 102-408-869-995-0
DOI: 10.35111/wqt6-jg60
Language:English
Language (ISO639):eng
License:Avocado Collection - Individual Agreement: https://catalog.ldc.upenn.edu/license/avocado-collection-individual-agreement.pdf
Avocado Collection - Organization Agreement: https://catalog.ldc.upenn.edu/license/avocado-collection-organization-agreement.pdf
Medium:Distribution: Web Download
Publisher:Linguistic Data Consortium
Publisher (URI):https://www.ldc.upenn.edu
Relation (URI):https://catalog.ldc.upenn.edu/docs/LDC2015T03
Rights Holder:Portions © 2015 Sherwood Partners, © 2015 Trustees of the University of Pennsylvania
Type (DCMI):Text
Type (OLAC):primary_text
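
The Description above says that the extracted text of each item is held in a separate file and that the files are zipped into one archive per custodian. A minimal Python sketch of reading such an archive as plain text follows. The member names in the demo archive, the function names, and the UTF-8 decoding of the text files are all assumptions made for illustration; the record itself only states that the XML metadata is UTF-8. The sketch decodes bytes and never opens or executes a member, in line with the malicious-code warning.

```python
import io
import zipfile
from typing import BinaryIO, Iterator, Tuple, Union


def iter_item_texts(archive: Union[str, BinaryIO]) -> Iterator[Tuple[str, str]]:
    """Yield (member_name, text) for every file in one custodian's zip archive.

    The bytes are only decoded, never executed or handed to another program.
    UTF-8 is an assumption; undecodable bytes are replaced rather than raising.
    """
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # skip directory entries, keep only item files
            raw = zf.read(info)
            yield info.filename, raw.decode("utf-8", errors="replace")


def build_demo_archive() -> io.BytesIO:
    """Build a tiny in-memory archive with hypothetical member names."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("demo/item-0001.txt", "Subject: hello\n")
        zf.writestr("demo/item-0002.txt", "Subject: world\n")
    buf.seek(0)
    return buf


if __name__ == "__main__":
    for name, text in iter_item_texts(build_demo_archive()):
        print(name, len(text))
```

For a real archive, the same function can be given a file path instead of the in-memory buffer.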

OLAC Info

Archive:  The LDC Corpus Catalog
Description:  http://www.language-archives.org/archive/www.ldc.upenn.edu
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:www.ldc.upenn.edu:LDC2015T03
DateStamp:  2025-10-23
GetRecord:  OAI-PMH request for simple DC format
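
The GetRecord entries above refer to OAI-PMH requests for this record. As a rough sketch, such a request is an HTTP GET with the `verb`, `identifier`, and `metadataPrefix` arguments defined by the OAI-PMH protocol. The base URL below is a placeholder, not the archive's actual endpoint, and the function name is hypothetical. `oai_dc` is the protocol's standard prefix for simple DC; `olac` is assumed here as the prefix for the OLAC format.

```python
from urllib.parse import urlencode

# Placeholder endpoint: substitute the repository's real OAI-PMH base URL.
BASE_URL = "https://example.org/oai"


def get_record_url(identifier: str, metadata_prefix: str,
                   base_url: str = BASE_URL) -> str:
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


if __name__ == "__main__":
    oai_id = "oai:www.ldc.upenn.edu:LDC2015T03"
    print(get_record_url(oai_id, "oai_dc"))  # simple DC format
    print(get_record_url(oai_id, "olac"))    # OLAC format (assumed prefix)
```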

Search Info

Citation: Oard, Douglas; Webber, William; Kirsch, David A.; Golitsynskiy, Sergey. 2015. Linguistic Data Consortium.
Terms: area_Europe country_GB dcmi_Text iso639_eng olac_primary_text


http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2015T03
Up-to-date as of: Fri Oct 24 6:55:29 EDT 2025