OLAC Record: Annotated tweet corpus in Arabizi, French and English

OLAC Record
oai:catalogue.elra.info:ELRA-W0323

Metadata

Title: Annotated tweet corpus in Arabizi, French and English

Access Rights: Rights available for: nonCommercialUse, commercialUse

Date Available (W3CDTF): 2022-04-05

Date Issued (W3CDTF): 2022-04-05

Description: The annotated tweet corpus in Arabizi, French and English was built by ELDA on behalf of INSA Rouen Normandie (Normandie Université, LITIS team), within the framework of the SAPhIRS project (System for the Analysis of Information Propagation in Social Networks), funded by the DGE (Direction Générale des Entreprises, France) through the RAPID programme (2017-2020). The project aimed to study the mechanisms of information and opinion propagation within social networks: identifying influential leaders and detecting channels for disseminating information and opinion. The purpose of the corpus, completed in 2020, was to collect and annotate tweets in 3 languages (Arabizi, French and English) for 3 predefined themes (Hooliganism, Racism, Terrorism).

For the collection, a tool was developed in Python (based on the “GetOldTweets3” library) which used information such as the language (EN/FR) and a keyword list as input. With this tool, a maximum of 10,000 tweets per (keyword, language) pair were collected for English and French. For Arabizi, a specific process was set up. A vocabulary list in Arabizi was created from a corpus of Arabizi SMS (for Moroccan and Tunisian) and from the resource “Training and test data for Arabizi detection and transliteration” (available from ELRA under reference ELRA-W0126, ISLRN ID: 986-364-744-303-9) by selecting the 1000 most frequent words. The tweets containing each word from this vocabulary and keyword list were then downloaded (places = Morocco, Tunisia, Algeria). The tweets that were kept had to contain at least 5 words in Arabizi.

For the annotation, a tool running on Django was developed in order to provide the following annotations for each tweet in a given sequence:
• Theme: 5 possible annotations (Hooliganism, Racism, Terrorism, Others, Incomprehensible)
• Topic: the annotator can add a new topic if it does not exist in the proposed list
• Opinion: 3 possible annotations (Negative, Neutral, Positive)

In total, 17,103 sequences were annotated from 585,163 tweets (196,374 in English, 254,748 in French and 134,041 in Arabizi), including the themes “Others” and “Incomprehensible”. Among these, 4,578 sequences having at least 20 tweets annotated with the 3 predefined themes (Hooliganism, Racism and Terrorism) were obtained, including 1,866 sequences with an opinion change. They are distributed as follows: 2,141 sequences in English (57,655 tweets), 1,942 sequences in French (48,854 tweets) and 495 sequences in Arabizi (21,216 tweets). A sub-corpus of 8,733 tweets (1,209 in English, 3,938 in French and 3,585 in Arabizi) annotated as “hateful”, according to topic/opinion annotations and by selecting tweets that contained insults, is also provided. The data are provided in CSV format.

Remark: this corpus includes only tweet IDs and corresponding annotations. Original tweets may be obtained by using the Twitter API.
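The Arabizi selection described above (a vocabulary of the 1000 most frequent words, then keeping tweets with at least 5 Arabizi words) can be illustrated with a minimal Python sketch. This is not the tool used to build the corpus: the function names, the tokenisation rule and the choice to count every token occurrence (rather than distinct words) are assumptions made for illustration only.

```python
import re
from collections import Counter

# Assumption: a token is a run of letters and digits, so Arabizi forms
# that use digits as letters (e.g. "3lik") stay in one piece.
TOKEN_RE = re.compile(r"[^\W_]+")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())


def build_vocabulary(texts, size=1000):
    """Return the `size` most frequent tokens found in the source texts."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    return {word for word, _ in counts.most_common(size)}


def count_vocab_words(tweet, vocab):
    """Count the tokens of a tweet that belong to the vocabulary."""
    return sum(1 for tok in tokenize(tweet) if tok in vocab)


def keep_tweet(tweet, vocab, min_words=5):
    """Keep a tweet only if it has at least `min_words` vocabulary tokens."""
    return count_vocab_words(tweet, vocab) >= min_words
```

In this sketch, `build_vocabulary` would be run once over the source texts (SMS messages, for instance), and `keep_tweet` would then be applied to each downloaded tweet.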

Identifier: ELRA-W0323

ISLRN: 482-848-308-105-6

Identifier (URI): https://catalog.elra.info/en-us/repository/browse/ELRA-W0323/

Language: English, Arabic, French

Language (ISO639): eng, ara, fra

Medium: Not specified

Publisher: ELRA (European Language Resources Association)

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: ELRA Catalogue of Language Resources

Description: http://www.language-archives.org/archive/catalogue.elra.info

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:catalogue.elra.info:ELRA-W0323

DateStamp: 2022-04-05

GetRecord: OAI-PMH request for simple DC format
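The GetRecord entries listed in this record correspond to OAI-PMH requests. As a minimal sketch, such a request URL can be assembled from the standard OAI-PMH parameters `verb`, `identifier` and `metadataPrefix`. The base URL below is a hypothetical placeholder, not the actual endpoint of the archive, and the function name is illustrative.

```python
from urllib.parse import urlencode

# Hypothetical placeholder: replace with the real OAI-PMH base URL.
BASE_URL = "https://example.org/oai"

# Identifier taken from this record.
OAI_IDENTIFIER = "oai:catalogue.elra.info:ELRA-W0323"


def get_record_url(base_url, identifier, metadata_prefix):
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


# "olac" requests the OLAC format, "oai_dc" the simple Dublin Core format.
olac_url = get_record_url(BASE_URL, OAI_IDENTIFIER, "olac")
dc_url = get_record_url(BASE_URL, OAI_IDENTIFIER, "oai_dc")
```

The two calls differ only in the `metadataPrefix` value, which selects the format in which the record is returned.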

Search Info
Citation: n.a. 2022. ELRA (European Language Resources Association).
Terms: area_Europe country_FR country_GB dcmi_Text iso639_ara iso639_eng iso639_fra olac_primary_text

http://www.language-archives.org/item.php/oai:catalogue.elra.info:ELRA-W0323
Up-to-date as of: Wed Jul 15 7:05:33 EDT 2026
