Abstract

We present a multimodal problem: ordering the text within a comic’s pages. We modified two state-of-the-art models to accept multimodal inputs. The first is a transformer-based model; the second learns relative orders, from which a full ordering is recovered using topological sort. The models are tested on Japanese Wikipedia and the Manga109 dataset. We show that a multimodal model performs better than a text-only model, and verify that relative order-based models outperform transformer models in the sentence ordering task.
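
To illustrate how relative orders can be turned into a full ordering, here is a minimal sketch of topological sort (Kahn's algorithm) over pairwise predictions. It assumes the predictions are already available as `(a, b)` tuples meaning "item `a` comes before item `b`". The function name `order_from_pairs` and the sample pairs are hypothetical and are not taken from the project's code.

```python
import heapq
from typing import List, Tuple


def order_from_pairs(n: int, before: List[Tuple[int, int]]) -> List[int]:
    """Order items 0..n-1 so that every (a, b) pair has a placed before b."""
    successors = [[] for _ in range(n)]
    indegree = [0] * n
    for a, b in before:
        successors[a].append(b)
        indegree[b] += 1

    # Kahn's algorithm. A heap keeps tie-breaking deterministic
    # (the smallest available index is emitted first).
    ready = [i for i in range(n) if indegree[i] == 0]
    heapq.heapify(ready)
    order = []
    while ready:
        node = heapq.heappop(ready)
        order.append(node)
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                heapq.heappush(ready, nxt)

    # Inconsistent pairwise predictions form a cycle and leave nodes unplaced.
    if len(order) != n:
        raise ValueError("pairwise predictions contain a cycle")
    return order


# Hypothetical pairwise predictions for four text boxes.
pairs = [(2, 0), (0, 3), (3, 1), (2, 3), (0, 1), (2, 1)]
print(order_from_pairs(4, pairs))  # [2, 0, 3, 1]
```

This sketch raises an error when the predicted pairs contradict each other. A real pairwise model can produce such cycles, so a practical system would need some strategy for resolving them; which one the project uses is not stated here.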

This was a class project for CS291A: Deep Learning. Check out my repository for the project and the paper I wrote for the class.