Authors:
(1) Rui Cao, Singapore Management University;
(2) Ming Shan Hee, Singapore University of Technology and Design;
(3) Adriel Kuek, DSO National Laboratories;
(4) Wen-Haw Chong, Singapore Management University;
(5) Roy Ka-Wei Lee, Singapore University of Technology and Design;
(6) Jing Jiang, Singapore Management University.
2 RELATED WORK
Memes, typically intended to be humorous or sarcastic, are increasingly being exploited to proliferate hateful content, giving rise to the challenging task of online hateful meme detection [5, 12, 27]. To combat the spread of hateful memes, one line of work regards hateful meme detection as a multimodal classification task: researchers apply pre-trained vision-language models (PVLMs) and fine-tune them on meme detection data [20, 26, 34, 37], in some cases ensembling models to improve performance [20, 26, 34]. Another line of work combines pre-trained models (e.g., BERT [4] and CLIP [29]) with task-specific architectures and tunes them end-to-end [13, 14, 28]. Recently, the authors of [2] converted all meme information into text and prompted language models so as to better leverage their contextual background knowledge. This approach achieves state-of-the-art results on two hateful meme detection benchmarks. However, it describes the image with generic image captioning, often ignoring factors that matter for hateful meme detection. In this work, we address this issue through probe-based captioning: we prompt pre-trained vision-language models with hateful content-centric questions in a zero-shot visual question answering (VQA) manner.
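As a rough illustration of the probe-based captioning idea, the sketch below prompts an off-the-shelf vision-language model with hateful content-centric questions in a zero-shot VQA fashion and concatenates the answers into a content-centric caption. The BLIP-VQA checkpoint, the image path, and the probing questions are illustrative assumptions, not the exact setup used in this paper.

```python
# Minimal sketch of probe-based captioning via zero-shot VQA.
# Assumptions: BLIP-VQA as the pre-trained vision-language model and an
# illustrative set of probing questions; the paper's actual model and
# question set may differ.
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

# Hateful content-centric probing questions (hypothetical examples).
PROBE_QUESTIONS = [
    "who is shown in the image?",
    "what is the race of the person in the image?",
    "what is the gender of the person in the image?",
    "what is the religion of the person in the image?",
]

def probe_caption(image_path: str) -> str:
    """Answer each probing question and join the answers into a
    content-centric caption that can later be fed to a language model."""
    image = Image.open(image_path).convert("RGB")
    answers = []
    for question in PROBE_QUESTIONS:
        inputs = processor(image, question, return_tensors="pt")
        output_ids = model.generate(**inputs)
        answers.append(processor.decode(output_ids[0], skip_special_tokens=True))
    return " ".join(answers)

print(probe_caption("meme.png"))  # "meme.png" is a placeholder path
```

The design choice here is that each question targets an attribute relevant to hateful content (e.g., the target's identity), so the resulting caption surfaces information that a generic captioner would typically omit.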
This paper is available on arXiv under a CC 4.0 license.