This study proposes MG-Gen, a framework that generates motion graphics directly from a single raster image while preserving the input content and adding dynamic text motion. MG-Gen decomposes the raster image into a layered structure represented as HTML, generates an animation script for each layer, and then renders the result into a video. We first decompose the input image into layer elements such as text, objects, and background by performing text detection, object detection, object segmentation, and image inpainting, and then reconstruct the input image as layered HTML data. Subsequently, we generate an executable animation script for this HTML data using a Large Multimodal Model (LMM). MG-Gen fundamentally differs from general image-to-video methods and is naturally suited for generating motion graphics, which require text readability and consistency with the input. We confirmed that MG-Gen generates dynamic motion graphics while preserving text readability and fidelity to the input conditions, whereas state-of-the-art image-to-video generation methods struggle to do so.
MG-Gen first decomposes an input image into layer-wise components using OCR, object detection, object segmentation, and image inpainting, then reconstructs the extracted layers as HTML data that renders identically to the input image. Subsequently, it generates an executable animation script for the HTML using a large multimodal model, specifically Gemini, which offers strong coding capability coupled with advanced reasoning. Layer decomposition proceeds as follows. First, text layers are extracted using an OCR model and a text stroke segmentation model, Hi-SAM. Next, non-text layers are extracted using an object detection model, YOLO, and a segmentation model, SAM. Finally, an image inpainting model generates the background behind the extracted components, producing a separate background layer. After all layers are decomposed, we reconstruct the layer-decomposed data into HTML.
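As a rough illustration of this stage, the Python sketch below shows how per-layer extraction and HTML reconstruction could be wired together. The helper functions (`run_ocr`, `segment_text_strokes`, `detect_objects`, `segment_object`, `inpaint`, `union_of_masks`, `crop_rgba`) are hypothetical placeholders for the models named above, not the authors' actual interfaces.

```python
# Sketch of MG-Gen's layer decomposition and HTML reconstruction.
# All helper functions are hypothetical wrappers around the models named in
# the text (OCR, Hi-SAM, YOLO, SAM, inpainting); this is not the authors' code.
from dataclasses import dataclass

@dataclass
class Layer:
    kind: str        # "text", "object", or "background"
    bbox: tuple      # (x, y, w, h) in pixels
    src: str         # path to the cropped RGBA image for this layer
    text: str = ""   # recognized string, for text layers only

def decompose(image):
    layers = []
    # 1) Text layers: OCR boxes refined by a text-stroke segmentation model.
    for box, string in run_ocr(image):                 # OCR (hypothetical wrapper)
        mask = segment_text_strokes(image, box)        # e.g. Hi-SAM
        layers.append(Layer("text", box, crop_rgba(image, mask), string))
    # 2) Non-text layers: object detection followed by segmentation.
    for box in detect_objects(image):                  # e.g. YOLO
        mask = segment_object(image, box)              # e.g. SAM
        layers.append(Layer("object", box, crop_rgba(image, mask)))
    # 3) Background: inpaint the regions uncovered by removing the layers above.
    bg = inpaint(image, union_of_masks(layers))
    layers.append(Layer("background", (0, 0, *image.size), bg))
    return layers

def to_html(layers):
    """Reassemble the layers as absolutely positioned elements so the page
    looks identical to the input image."""
    elems = []
    for i, layer in enumerate(reversed(layers)):       # background first
        x, y, w, h = layer.bbox
        elems.append(
            f'<img id="layer{i}" class="{layer.kind}" src="{layer.src}" '
            f'style="position:absolute; left:{x}px; top:{y}px; '
            f'width:{w}px; height:{h}px;">'
        )
    return '<div style="position:relative">' + "\n".join(elems) + "</div>"
```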
MG-Gen then generates an executable JavaScript animation script for the reconstructed HTML data using an LMM. The animation script generation pipeline consists of three main processes: layer grouping, animation planning, and script coding. Grouping: the model first organizes the input HTML's layers into distinct clusters to understand its structure. Planning: the model then creates a detailed animation plan for each group, outlining the sequence and style of animations. Coding: finally, the model generates an executable JavaScript animation script using Anime.js based on the plan. Once the animation script is generated, the script and HTML are rendered in a web browser to obtain a video file.
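The sketch below illustrates this grouping, planning, and coding flow. `call_lmm` and `render_to_video` are hypothetical helpers standing in for the Gemini API call and the headless-browser rendering step, and the prompts are illustrative rather than the authors' actual prompts.

```python
# Sketch of the three-stage animation script generation.
# call_lmm and render_to_video are hypothetical helpers; the prompts are
# illustrative, not the authors' actual prompts.

def generate_animation_script(layered_html: str) -> str:
    # Grouping: cluster related layers so the model understands the structure.
    groups = call_lmm(
        "Group the layers in this HTML into semantically related clusters "
        "(e.g. headline text, product photo, background):\n" + layered_html)
    # Planning: decide the order, style, and timing of animations per group.
    plan = call_lmm(
        "Write an animation plan for each group, specifying entrance order, "
        "easing, and duration:\n" + groups)
    # Coding: emit an executable Anime.js script that implements the plan.
    return call_lmm(
        "Generate a JavaScript animation script using Anime.js that implements "
        "this plan for the given HTML:\n" + plan + "\n" + layered_html)

# Usage (assuming `layered_html` from the decomposition sketch above):
script = generate_animation_script(layered_html)
page = layered_html + f"<script>{script}</script>"
render_to_video(page, "motion_graphic.mp4")   # e.g. headless browser + frame capture
```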
Some examples of motion graphics generated by MG-Gen. These examples demonstrate that MG-Gen produces motion graphics with active text and object motion while preserving text readability and avoiding object distortion.
Visual comparisons between MG-Gen and general image-to-video generation methods. These examples demonstrate that MG-Gen produces superior motion graphics.
We also integrate MG-Gen with a general image-to-video generation model, Runway Gen-3. The results demonstrate that, by integrating MG-Gen with Gen-3, we can move text more dynamically while preserving readability and animate backgrounds and objects with arbitrary motion. Moreover, by first performing layer decomposition and animating each layer individually, we achieve more dynamic motion for each object, as sketched below.
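The sketch below shows one way this layer-wise integration could be structured. `image_to_video` and `composite` are hypothetical stand-ins for the Gen-3 call and the video compositing step; this is not the authors' implementation.

```python
# Sketch of layer-wise integration with a general image-to-video model.
# image_to_video and composite are hypothetical helpers, not Runway's API.

def animate_layers_with_i2v(layers, seconds=4, fps=30):
    clips = []
    for layer in layers:
        if layer.kind == "text":
            continue   # text motion stays script-based to preserve readability
        # Animating each decomposed layer on its own lets the model move it
        # freely, instead of being constrained by the full composition.
        clips.append(image_to_video(layer.src, prompt="natural motion",
                                    duration=seconds, fps=fps))
    # Composite the per-layer clips back in stacking order; the script-animated
    # text layers are overlaid on top of the result.
    return composite(clips)
```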
(Comparison videos, two examples; each row shows: Input Image | Ours with Gen-3 | Gen-3 only.)
@article{shirakawa2025mg,
  title={MG-Gen: Single Image to Motion Graphics Generation},
  author={Shirakawa, Takahiro and Suzuki, Tomoyuki and Narumoto, Takuto and Haraguchi, Daichi},
  journal={arXiv preprint arXiv:2504.02361},
  year={2025}
}