OLAC Record: MLASK: Multimodal Summarization of Video-based News Articles

OLAC Record
oai:lindat.mff.cuni.cz:11234/1-5135

Metadata

Title: MLASK: Multimodal Summarization of Video-based News Articles

Bibliographic Citation: http://hdl.handle.net/11234/1-5135

Creator: Krubiński, Mateusz

Pecina, Pavel

Date (W3CDTF): 2023-11-02T15:01:55Z

Date Available: 2023-11-02T15:01:55Z

Description: The MLASK corpus consists of 41,243 multimodal documents – video-based news articles in the Czech language – collected from Novinky.cz (https://www.novinky.cz/) and Seznam Zprávy (https://www.seznamzpravy.cz/). It was introduced in "MLASK: Multimodal Summarization of Video-based News Articles" (Krubiński & Pecina, EACL 2023). The articles' publication dates range from September 2016 to February 2022.

The intended use case of the dataset is to model the task of multimodal summarization with multimodal output: based on a pair of a textual article and a short video, a textual summary is generated, and a single frame from the video is chosen as a pictorial summary.

Each document consists of the following:
- a .mp4 video
- a single image (cover picture)
- the article's text
- the article's summary
- the article's title
- the article's publication date

All of the videos are re-sampled to 25 fps and resized to the same resolution of 1280x720. The longest video is 5 minutes and the shortest is 7 seconds. The average video duration is 86 seconds.

The quantitative statistics of the lengths of titles, abstracts, and full texts (measured in the number of tokens) are below. Q1 and Q3 denote the first and third quartiles, respectively.

| - | Mean | Q1 | Median | Q3 |
| Title | 11.16 ± 2.78 | 9 | 11 | 13 |
| Abstract | 33.40 ± 13.86 | 22 | 32 | 43 |
| Article | 276.96 ± 191.74 | 154 | 231 | 343 |

The proposed training/dev/test split follows the chronological ordering based on publication date. We use the articles published in the first half (Jan-Jun) of 2021 for validation (2,482 instances) and the ones published in the second half (Jul-Dec) of 2021 and the beginning (Jan-Feb) of 2022 for testing (2,652 instances). The remaining data is used for training (36,109 instances).

The textual data is shared as a single .tsv file. The visual data (video+image) is shared as a single archive each for the validation and test splits; the visual data for the training split is partitioned based on the publication date.
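The chronological split described above can be sketched as a small date-to-split mapping. This is only an illustration under stated assumptions: the function name `assign_split` and the split labels are hypothetical, and the record does not specify how publication dates are stored in the .tsv file, so the sketch works on already-parsed `date` objects.

```python
from datetime import date

# Instance counts as stated in the record's description.
SPLIT_SIZES = {"train": 36_109, "dev": 2_482, "test": 2_652}


def assign_split(published: date) -> str:
    """Map a publication date to a split name (hypothetical helper).

    Follows the description: Jan-Jun 2021 is validation, Jul 2021
    through Feb 2022 is test, and everything else is training.
    """
    year_month = (published.year, published.month)
    if (2021, 1) <= year_month <= (2021, 6):
        return "dev"
    if (2021, 7) <= year_month <= (2022, 2):
        return "test"
    return "train"


# The three split sizes add up to the stated corpus size.
total_documents = sum(SPLIT_SIZES.values())
```

Comparing `(year, month)` tuples avoids any assumption about day-level boundaries, since the description only gives the split ranges at month granularity.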

Identifier (URI): http://hdl.handle.net/11234/1-5135

Language: Czech

Language (ISO639): ces

Publisher: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)

Rights: Seznam Dataset Licence

https://lindat.mff.cuni.cz/repository/xmlui/page/szn-dataset-licence

Subject: Multimodal Summarization

Summarization

Video

Image

Type: corpus

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University

Description: http://www.language-archives.org/archive/lindat.mff.cuni.cz

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:lindat.mff.cuni.cz:11234/1-5135

DateStamp: 2023-11-07

GetRecord: OAI-PMH request for simple DC format
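The GetRecord entries above refer to OAI-PMH requests whose target links are not preserved in this record. As a rough sketch, such a request URL could be assembled as follows. The `verb`, `identifier`, and `metadataPrefix` parameters are standard OAI-PMH request arguments, but `BASE_URL` is a placeholder and the helper name `get_record_url` is hypothetical; the actual endpoint is an assumption that must be replaced with the repository's real OAI-PMH base URL.

```python
from urllib.parse import urlencode

# Placeholder only: the real OAI-PMH endpoint is not given in this record.
BASE_URL = "https://example.org/oai"

OAI_IDENTIFIER = "oai:lindat.mff.cuni.cz:11234/1-5135"


def get_record_url(base_url: str, identifier: str, metadata_prefix: str) -> str:
    """Build an OAI-PMH GetRecord request URL (hypothetical helper)."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


# "oai_dc" requests simple Dublin Core; "olac" requests the OLAC format.
dc_url = get_record_url(BASE_URL, OAI_IDENTIFIER, "oai_dc")
olac_url = get_record_url(BASE_URL, OAI_IDENTIFIER, "olac")
```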

Search Info
Citation: Krubiński, Mateusz; Pecina, Pavel. 2023. Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL).
Terms: area_Europe country_CZ dcmi_Text iso639_ces olac_primary_text

http://www.language-archives.org/item.php/oai:lindat.mff.cuni.cz:11234/1-5135
Up-to-date as of: Sun May 4 0:13:55 EDT 2025
