Overview of Memotion 3: Sentiment & Emotion Analysis of Codemixed Hinglish - Results

21 Feb 2024

This paper is available on arXiv under a CC 4.0 license.

Authors:

(1) Shreyash Mishra, IIIT Sri City, India (equal contribution);

(2) S Suryavardan, IIIT Sri City, India (equal contribution);

(3) Megha Chakraborty, University of South Carolina, USA;

(4) Parth Patwa, UCLA, USA;

(5) Anku Rani, University of South Carolina, USA;

(6) Aman Chadha, Stanford University, USA, and Amazon AI, USA (work does not relate to position at Amazon);

(7) Aishwarya Reganti, CMU, USA;

(8) Amitava Das, University of South Carolina, USA;

(9) Amit Sheth, University of South Carolina, USA;

(10) Manoj Chinnakotla, Microsoft, USA;

(11) Asif Ekbal, IIT Patna, India;

(12) Srijan Kumar, Georgia Tech, USA.

Abstract & Introduction

Related Work

Task Details

Participating systems

Results

Conclusion and Future Work and References

5. Results

Performance on Task A (sentiment classification) is presented in Table 1. Our proposed baseline achieved an F1 score of 33.28%. Of the five final submissions, four surpassed the baseline. The top two teams, NUAA-QMUL-AIIT [73] and NYCU_TWO [71], improved on the baseline by 1.1% and 0.9% respectively. The difference in F1 scores for this task is minimal across all participants.

Table 2 shows the leaderboard for Task B. Three teams outperformed the baseline for this task, and the top-performing team, wentaorub [68], improved on the baseline's final score by 5.4%. Team CUFE scored 3.4% lower than the baseline; however, they achieved the highest score on the Offensive sub-task and on the Sarcasm sub-task, jointly with NYCU_TWO [71]. wentaorub [68], NYCU_TWO, and NUAA-QMUL-AIIT achieved the same score on Motivation, which is 4% higher than the baseline. NYCU_TWO also scored the highest in the Humor sub-task, improving on the baseline by 4.5%. Based on these F1 scores, we can deduce that the Motivation class is the easiest to detect, similar to Memotion 2.0 [59], despite some teams performing poorly on this class. However, in this iteration, the Offensive class is the hardest to detect, instead of the Sarcasm class. No single team performs best on all the classes.

Table 2

Leaderboard of teams on Task B: Emotion analysis {H:Humor, S:Sarcasm, O:Offense, M:Motivation}. All scores are in percentage. The teams are ranked by their average F1 scores (overall) across all four emotions. The Motivation emotion is the easiest to detect, while Offense is the most difficult to detect.
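The ranking rule stated for Task B (teams ordered by the average of their four per-emotion F1 scores) can be sketched as below. This is a minimal illustration, not the organizers' evaluation script: the function names, the team names, and every score in `example_scores` are hypothetical and are not taken from the leaderboard. How each per-emotion F1 is computed is not shown here and is assumed to happen upstream.

```python
from statistics import mean

# Emotion keys follow the table legend: Humor, Sarcasm, Offense, Motivation.
EMOTIONS = ("H", "S", "O", "M")


def overall_score(f1_by_emotion):
    """Average the four per-emotion F1 scores (given in percent)."""
    return mean(f1_by_emotion[e] for e in EMOTIONS)


def rank_teams(scores):
    """Return (team, overall) pairs sorted from best to worst overall score."""
    overall = {team: overall_score(f1) for team, f1 in scores.items()}
    return sorted(overall.items(), key=lambda item: item[1], reverse=True)


# Hypothetical scores for illustration only; NOT the actual leaderboard values.
example_scores = {
    "team_a": {"H": 80.0, "S": 70.0, "O": 50.0, "M": 90.0},
    "team_b": {"H": 85.0, "S": 75.0, "O": 45.0, "M": 95.0},
    "baseline": {"H": 78.0, "S": 72.0, "O": 48.0, "M": 88.0},
}

ranking = rank_teams(example_scores)
for position, (team, score) in enumerate(ranking, start=1):
    print(f"{position}. {team}: {score:.2f}")
```

In this made-up example, `team_b` ranks first overall even though it has the lowest Offense score, which mirrors the observation that no single team needs to lead every class to top the averaged leaderboard.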

Table 3

Leaderboard of teams on Task C: Emotion intensity detection {H:Humor, S:Sarcasm, O:Offense, M:Motivation}. All scores are in percentage. The teams are ranked by average F1 scores (overall) across all the four emotions. Intensity of sarcasm is by far the most difficult to detect. Motivation has only two intensities as opposed to four intensity levels for other emotions.

As shown in the leaderboard for Task C in Table 3, all the participating teams outperform the baseline in this task. The minimum improvement in F1 score is 0.8% by team CUFE and the maximum improvement in the final score is 7.5% by wentaorub [68]. The submission by wentaorub achieves the highest score in the Humor, Offensive and Motivation sub-tasks. NUAA-QMUL-AIIT [73] matched wentaorub's highest score in the Motivation category, while CUFE achieved the highest score in the Sarcasm sub-task. The Motivation class has significantly higher F1 scores than the other classes because it has only 2 intensity levels, whereas the other classes have 4 each. The best overall score is only 59.82%, which shows the difficulty of the task.

Figure 1: Some samples for Task A from the 121 memes that were predicted incorrectly by all participating teams. Over half of the mis-classified memes have a true label of negative sentiment.

Figure 2: 242 memes from the test set were predicted incorrectly by all participants. Most of these memes are code-mixed and related to humor or sarcasm.

A major observation when comparing performance with the previous iteration of the task [7] is that performance on Task A is much worse in Memotion 3. Overall scores in Tasks B and C have remained about the same as in Memotion 2.

For each task, we analyze the test-set memes that every participant predicted incorrectly. For Task A, there are 121 memes that all participants mis-classified; 66 of these have a true label of negative sentiment, followed closely by memes whose true label is positive. For Task B, there are 231 such memes, the majority of which belong to the "humor" and "sarcasm" classes. Finally, for Task C, 421 memes are mis-classified by all the systems; most of them are labelled "Very Funny", "Very Sarcastic", or "Slightly Offensive". Some examples for Tasks A, B and C are shown in Figures 1, 2, and 3 respectively. Further, we note that most of these difficult examples have code-mixed text.
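The error analysis above amounts to finding the memes on which every submission disagrees with the gold label, and then counting those memes by their true label. A minimal sketch follows, assuming each team's predictions are available as a mapping from meme ID to label. The function name, meme IDs, team names, and labels are all hypothetical, and treating a missing prediction as an error is an assumption of this sketch.

```python
from collections import Counter


def missed_by_all(gold, predictions_by_team):
    """Return the IDs of memes that every team labelled incorrectly.

    A meme with no prediction from a team counts as wrong for that team
    (an assumption of this sketch).
    """
    return {
        meme_id
        for meme_id, true_label in gold.items()
        if all(
            preds.get(meme_id) != true_label
            for preds in predictions_by_team.values()
        )
    }


# Hypothetical gold labels and predictions, for illustration only.
gold = {"m1": "negative", "m2": "positive", "m3": "neutral", "m4": "negative"}
predictions_by_team = {
    "team_a": {"m1": "positive", "m2": "positive", "m3": "positive", "m4": "neutral"},
    "team_b": {"m1": "neutral", "m2": "neutral", "m3": "neutral", "m4": "neutral"},
}

hard = missed_by_all(gold, predictions_by_team)

# Break the commonly missed memes down by their true label.
label_counts = Counter(gold[meme_id] for meme_id in hard)

print(sorted(hard))
print(label_counts.most_common())
```

The same routine could be applied per emotion for Tasks B and C by running it once for each label column.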

Figure 3: For Task C, none of the teams made correct predictions on over 400 memes from the test set. A large share of these memes is code-mixed, which indicates the added challenge that code-mixing brings to the task.

As for the overall performances, only two teams - NUAA-QMUL-AIIT and NYCU_TWO [73, 71] - perform better than the baseline in all tasks.