This paper presents multimodal markup document models (MarkupDM) that can generate both markup language and images within interleaved multimodal documents. Unlike existing vision-and-language multimodal models, our MarkupDM tackles unique challenges critical to graphic design tasks: generating partial images that contribute to the overall appearance, often involving transparency and varying sizes, and understanding the syntax and semantics of markup languages, which play a fundamental role as a representational format of graphic designs. To address these challenges, we design an image quantizer to tokenize images of diverse sizes with transparency and modify a code language model to process markup languages and incorporate image modalities. We provide in-depth evaluations of our approach on three graphic design completion tasks: generating missing attribute values, images, and texts in graphic design templates. Results corroborate the effectiveness of our MarkupDM for graphic design tasks. We also discuss the strengths and weaknesses in detail, providing insights for future research on multimodal document generation.
(a) We first train an image quantizer and decoder designed for RGBA images of varying sizes, using a reconstruction loss. (b) We then train MarkupDM, which extends a base code LLM with an embedding module and a prediction head for images. Input images are encoded into discrete tokens with the pre-trained quantizer, and the output image tokens and sizes are decoded back into images with the pre-trained decoder.
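To make the tokenization step concrete, below is a minimal toy sketch of the idea: an RGBA image of arbitrary (patch-aligned) size is mapped to discrete tokens by nearest-neighbor lookup in a fixed codebook, prefixed by size tokens so the decoder can recover the original dimensions. All names and the codebook scheme here are illustrative assumptions, not the paper's actual quantizer, which is learned with a reconstruction loss.

```python
import numpy as np

PATCH = 2           # patch side length (toy value)
CODEBOOK_SIZE = 16  # number of discrete image tokens (toy value)

rng = np.random.default_rng(0)
# Each codebook entry is a flattened RGBA patch in [0, 1].
# A real quantizer learns these; here they are random for illustration.
codebook = rng.random((CODEBOOK_SIZE, PATCH * PATCH * 4))

def tokenize(image: np.ndarray) -> list[int]:
    """Map an (H, W, 4) RGBA array to [H, W, tok_0, tok_1, ...]."""
    h, w, c = image.shape
    assert c == 4 and h % PATCH == 0 and w % PATCH == 0
    tokens = [h, w]  # size tokens come first, so decoding knows the shape
    for y in range(0, h, PATCH):
        for x in range(0, w, PATCH):
            patch = image[y:y + PATCH, x:x + PATCH].reshape(-1)
            # nearest codebook entry becomes the patch's discrete token
            dists = np.linalg.norm(codebook - patch, axis=1)
            tokens.append(int(np.argmin(dists)))
    return tokens

def detokenize(tokens: list[int]) -> np.ndarray:
    """Inverse mapping: rebuild an (H, W, 4) RGBA array from tokens."""
    h, w, *ids = tokens
    image = np.zeros((h, w, 4))
    cols = w // PATCH
    for i, tok in enumerate(ids):
        y, x = (i // cols) * PATCH, (i % cols) * PATCH
        image[y:y + PATCH, x:x + PATCH] = codebook[tok].reshape(PATCH, PATCH, 4)
    return image

img = rng.random((4, 6, 4))      # small RGBA image (alpha channel included)
toks = tokenize(img)
rec = detokenize(toks)
print(toks[:2], len(toks) - 2)   # size tokens, then one token per patch
```

In this sketch the token sequence is self-describing: the leading size tokens let images of different dimensions share one vocabulary, which is the property the caption highlights for interleaving images with markup tokens.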
Each triplet shows the input, the predicted completion, and the original design from left to right or top to bottom. The [MASK] or [M] indicates the missing part to be completed. MarkupDM can generate plausible images by copying from other images based on repetition patterns and symmetry, and successfully creates typical decorations and harmonious background images.
The green arrows point to the missing text, and the green boxes indicate the zoomed-in areas. MarkupDM can generate grammatically correct text that connects with preceding or following lines and matches surrounding texts. Even in cases with weak textual context, it successfully generates text with a typical role, using the position of the target element and visual decoration as hints.
MarkupDM struggles to generate images when context is scarce, and sometimes fails to complete text due to image-understanding errors. It also has difficulty visually harmonizing generated images and text with the surrounding decorations.
@misc{kikuchi2024,
  title  = {Multimodal Markup Document Models for Graphic Design Completion},
  author = {Kotaro Kikuchi and Naoto Inoue and Mayu Otani and Edgar Simo-Serra and Kota Yamaguchi},
  year   = {2024},
  url    = {https://arxiv.org/abs/2409.19051}
}