OLAC Record
oai:www.ldc.upenn.edu:LDC2018T10

Metadata
Title:BOLT Arabic Discussion Forums
Access Rights:Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining
Bibliographic Citation:Tracey, Jennifer, et al. BOLT Arabic Discussion Forums LDC2018T10. Web Download. Philadelphia: Linguistic Data Consortium, 2018
Contributor:Tracey, Jennifer
Lee, Haejoong
Strassel, Stephanie
Ismael, Safa
Date (W3CDTF):2018
Date Issued (W3CDTF):2018-03-15
Description:*Introduction* BOLT Arabic Discussion Forums was developed by the Linguistic Data Consortium (LDC) and consists of 813,080 discussion forum threads in Egyptian Arabic harvested from the Internet using a combination of manual and automatic processes. The DARPA BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. LDC supported the BOLT program by collecting informal data sources (discussion forums, text messaging and chat) in Chinese, Egyptian Arabic and English. The collected data was translated and annotated for various tasks including word alignment, treebanking, propbanking and co-reference. The material in this release represents the unannotated Arabic source data in the discussion forum genre.
*Data* Collection was seeded based on the results of manual data scouting by native speaker annotators. Scouts were instructed to seek content in Egyptian Arabic that was original, interactive and informal. Upon locating an appropriate thread, scouts submitted the URL and some simple judgments about it to a database, via a web browser plug-in. When multiple threads from a forum were submitted, the entire forum was automatically harvested and added to the collection. The scale of the collection precluded manual review of all data. Only a small portion of the threads included in this release were manually reviewed, so the release may contain some offensive or otherwise undesired content, as well as some threads with a large amount of non-Arabic content. Language identification was performed on all threads in this corpus (using CLD2), and threads for which the results indicate a high probability of largely non-Arabic content are listed in arz_suspect_LID.txt in the docs directory of this package. Many threads, even among those that are primarily Arabic, may also contain a mixture of Egyptian and other varieties of Arabic. The corpus consists of zipped HTML and XML files. Each HTML file is the raw HTML downloaded from a discussion thread. If the thread spanned multiple URLs, it was stored as a concatenation of the downloaded HTML files. The XML files were converted from the raw HTML.
*Acknowledgement* This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-11-C-0145. The content does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
*Updates* None at this time.
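The description above says that threads flagged by language identification are listed in arz_suspect_LID.txt. The following is a minimal sketch of how such a list might be used to exclude flagged threads before processing. The layout of the file is not documented in this record, so the sketch assumes one entry per line with the thread identifier in the first whitespace-separated field; the function names and the sample identifiers are hypothetical.

```python
from typing import Iterable, List, Set


def load_suspect_ids(lines: Iterable[str]) -> Set[str]:
    """Collect thread identifiers from the lines of a suspect-LID list.

    Assumption (not confirmed by the record): one entry per line, with the
    identifier in the first whitespace-separated field. Blank lines are skipped.
    """
    suspects = set()
    for line in lines:
        fields = line.split()
        if fields:
            suspects.add(fields[0])
    return suspects


def drop_suspect_threads(thread_ids: Iterable[str], suspects: Set[str]) -> List[str]:
    """Return the thread identifiers that are not flagged as suspect."""
    return [tid for tid in thread_ids if tid not in suspects]


# Hypothetical identifiers, for illustration only.
sample_list = ["thread_0002 0.91", "", "thread_0005"]
suspects = load_suspect_ids(sample_list)
kept = drop_suspect_threads(["thread_0001", "thread_0002", "thread_0003"], suspects)
print(kept)
```

With the actual package, the lines would come from opening arz_suspect_LID.txt in the docs directory rather than from an in-memory list, and the parsing would need to be adjusted to the file's real layout.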
Extent:Corpus size: 31595416 KB
Identifier:LDC2018T10
https://catalog.ldc.upenn.edu/LDC2018T10
ISBN: 1-58563-839-0
ISLRN: 663-919-074-680-5
DOI: 10.35111/9fjp-2a75
Language:Egyptian Arabic
Language (ISO639):arz
License:LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf
Medium:Distribution: Web Download
Publisher:Linguistic Data Consortium
Publisher (URI):https://www.ldc.upenn.edu
Relation (URI):https://catalog.ldc.upenn.edu/docs/LDC2018T10
Rights Holder:Portions © 2018 Trustees of the University of Pennsylvania
Type (DCMI):Text
Type (OLAC):primary_text

OLAC Info

Archive:  The LDC Corpus Catalog
Description:  http://www.language-archives.org/archive/www.ldc.upenn.edu
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:www.ldc.upenn.edu:LDC2018T10
DateStamp:  2020-11-30
GetRecord:  OAI-PMH request for simple DC format
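The GetRecord entries above refer to OAI-PMH requests. As a rough sketch, such a request is an HTTP GET with the `verb`, `identifier` and `metadataPrefix` arguments defined by OAI-PMH; `olac` and `oai_dc` are the usual prefixes for the OLAC and simple DC formats. The base URL of the repository is not given in this record, so `BASE_URL` below is a placeholder, and `build_get_record_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode


def build_get_record_url(base_url: str, identifier: str, metadata_prefix: str) -> str:
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


# Placeholder endpoint: the repository's real OAI-PMH base URL is not stated here.
BASE_URL = "https://example.org/oai"
OAI_IDENTIFIER = "oai:www.ldc.upenn.edu:LDC2018T10"

olac_url = build_get_record_url(BASE_URL, OAI_IDENTIFIER, "olac")
dc_url = build_get_record_url(BASE_URL, OAI_IDENTIFIER, "oai_dc")
print(olac_url)
print(dc_url)
```

The identifier is percent-encoded by `urlencode`, so the colons in the OAI identifier appear as `%3A` in the resulting URL.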

Search Info

Citation: Tracey, Jennifer; Lee, Haejoong; Strassel, Stephanie; Ismael, Safa. 2018. Linguistic Data Consortium.
Terms: area_Africa country_EG dcmi_Text iso639_arz olac_primary_text


http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2018T10
Up-to-date as of: Thu Oct 24 7:31:04 EDT 2024