Significant advancements in large language models (LLMs) have inspired the development of multimodal large language models (MLLMs). Early MLLM efforts, such as LLaVA, MiniGPT-4, and InstructBLIP, demonstrate notable multimodal understanding capabilities. To integrate LLMs into multimodal domains, these studies explored projecting features from a pre-trained modality-specific encoder, such as CLIP, into the input space of LLMs, enabling multimodal understanding and reasoning within the transformer backbone. Although there are various design choices for MLLMs, such as vision encoders, feature alignment adapters, and datasets, the training for most of these models adheres to the autoregressive generation paradigm, which has proven effective for text generation in LLMs. Despite their strong multimodal understanding capabilities, these models primarily focus on visual perception and lack the ability to generate multimodal outputs beyond text.
Transformer models have demonstrated great success in autoregressive modeling in natural language processing. Inspired by such progress, previous studies have directly applied the same autoregressive modeling to learn the dependency of image pixels for image and video generation. For instance, VideoPoet employs a decoder-only transformer architecture to synthesize high-quality videos from multimodal inputs. More recently, LlamaGen has shown that a large language model architecture like Llama can autoregressively model image tokens, achieving decent performance in class-conditional image generation.
In this article, we will discuss Show-O, a unified transformer that integrates multimodal understanding and generation. Unlike fully autoregressive models, Show-O unifies autoregressive and discrete diffusion modeling to adaptively handle inputs and outputs of various and mixed modalities. The unified model flexibly supports a wide range of vision-language tasks, including visual question answering, text-to-image generation, text-guided inpainting/extrapolation, and mixed-modality generation. Across various benchmarks, Show-O demonstrates comparable or superior performance to existing individual models with an equivalent or larger number of parameters, highlighting its potential as a next-generation foundation model.
Denoising diffusion can be formulated in continuous or discrete spaces. In the continuous formulation (as in DDPMs), the model is tasked with predicting the Gaussian noise added to continuous latent representations. In contrast, models like D3PM, Mask-predict, ARDM, and MaskGIT use a discrete corruption process as an alternative to Gaussian diffusion. Specifically, an image is represented as a sequence of discrete tokens using an image tokenizer, with each token associated with a categorical label. The token-wise distribution is transformed into a uniform distribution through a stochastic corruption process. During training, a portion of these tokens is randomly masked, and the model is trained to predict the original values of the masked tokens. In this work, Show-O adopts discrete diffusion modeling for visual generation.
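As a concrete illustration of this mask-and-predict objective, the sketch below corrupts a sequence of image token ids and computes a cross-entropy loss only on the masked positions. The tiny stand-in network, the mask-ratio range, and the use of an extra [MASK] id appended to the codebook are illustrative assumptions, not the actual Show-O implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, mask_id, seq_len, batch = 8192, 8192, 256, 2   # an extra [MASK] id is appended to the codebook
backbone = nn.Sequential(                                  # tiny stand-in for the transformer backbone
    nn.Embedding(vocab_size + 1, 64),
    nn.TransformerEncoder(nn.TransformerEncoderLayer(64, 4, batch_first=True), 2),
    nn.Linear(64, vocab_size),
)

tokens = torch.randint(0, vocab_size, (batch, seq_len))    # discrete image tokens from the tokenizer
mask_ratio = torch.empty(batch, 1).uniform_(0.5, 1.0)      # per-sample corruption level
mask = torch.rand(batch, seq_len) < mask_ratio             # randomly mask a portion of the tokens
corrupted = tokens.masked_fill(mask, mask_id)

logits = backbone(corrupted)                               # predict the original id at every position
loss = F.cross_entropy(logits[mask], tokens[mask])         # supervise only the masked positions
loss.backward()
```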
Over the past few years, significant advancements have emerged in the two key pillars of multimodal intelligence: understanding and generation. For multimodal understanding, Multimodal Large Language Models (MLLMs) like LLaVA have demonstrated exceptional capabilities in vision-language tasks such as visual question-answering (VQA). For visual generation, denoising diffusion probabilistic models (DDPMs) have revolutionized traditional generative paradigms, achieving unprecedented performance in text-to-image/video generation.
Given these achievements in individual fields, it is natural to explore the potential of connecting them. Recent works have tried to assemble expert models from these two different domains to form a unified system that can handle both multimodal understanding and generation. However, existing attempts often involve separate models for understanding and generation. For instance, NExT-GPT employs a base language model for multimodal understanding but requires an additional pre-trained diffusion model for image generation. This raises the question: can one single transformer handle both multimodal understanding and generation?
Recently, Chameleon demonstrated that this is possible. Specifically, Chameleon enables the fusion of different modalities to generate both text and image tokens through autoregressive modeling. While it makes sense to model text tokens autoregressively, it is less clear whether modeling image patches or pixels in the same way is optimal. A key bottleneck of autoregressively predicting an image is the large number of sampling steps required, especially when dealing with higher resolution images. Continuous diffusion models have shown superior performance in visual generation compared to autoregressive ones.
This leads us to explore whether a single transformer can integrate both autoregressive and diffusion modeling. Show-O envisions a new paradigm where text is represented as discrete tokens and modeled autoregressively, while continuous image pixels are modeled using denoising diffusion. However, integrating these two distinct techniques into a single network is non-trivial due to the differences between discrete text tokens and continuous image representations. Additionally, diffusion models typically rely on two distinct models: a text encoder and a denoising network.
To address this, Show-O introduces a novel unified model capable of handling both multimodal understanding and generation tasks using mixed autoregressive and diffusion modeling. Show-O is built upon a pre-trained LLM and leverages its autoregressive modeling capabilities for text-based reasoning. Inspired by other works, Show-O employs discrete denoising diffusion to model image tokens instead of continuous representations. Moreover, Show-O inherently encodes text conditional information, eliminating the need for additional text encoders. By utilizing text and image tokenizers, Show-O can process diverse input data and tasks, providing answers autoregressively for vision-language tasks and generating images using discrete denoising diffusion.
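To make this concrete, here is a minimal sketch of such a mixed objective, assuming the unified transformer produces one logit vector per position over a shared text-and-image vocabulary and that the two losses are simply combined with a weighting factor alpha. The function and its arguments are illustrative, not Show-O's published training code.

```python
import torch.nn.functional as F

def mixed_loss(logits, text_targets, text_positions, image_targets, masked_positions, alpha=1.0):
    """Sum of next-token prediction (text) and mask-token prediction (image) losses."""
    # logits: (batch, seq_len, vocab); the two position arguments are boolean masks of shape (batch, seq_len)
    ntp = F.cross_entropy(logits[text_positions], text_targets)     # text_targets: shifted next-token ids
    mtp = F.cross_entropy(logits[masked_positions], image_targets)  # image_targets: original ids of masked tokens
    return mtp + alpha * ntp
```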
Show-O demonstrates comparable, and in some cases better, performance than individual models with an equivalent or larger number of parameters across various benchmarks. Unlike autoregressive image generation, the Show-O framework requires about 20 times fewer sampling steps, making it inherently faster. Additionally, the Show-O framework supports downstream applications like text-guided inpainting and extrapolation without requiring fine-tuning, as demonstrated in the following image.
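To see where the step reduction comes from, below is a minimal sketch of iterative parallel decoding in the spirit of MaskGIT-style discrete diffusion: all 256 image-token positions start masked, and many are revealed per step, so roughly 16 steps can suffice instead of 256 token-by-token autoregressive steps. The model stub, greedy sampling, cosine schedule, and step count are illustrative assumptions rather than Show-O's exact sampler.

```python
import math
import torch

@torch.no_grad()
def parallel_decode(model, num_tokens=256, mask_id=8192, steps=16):
    tokens = torch.full((1, num_tokens), mask_id)                    # start from an all-masked 16x16 grid
    for t in range(steps):
        logits = model(tokens)                                       # (1, num_tokens, vocab_size)
        probs = logits.softmax(dim=-1)
        sampled = probs.argmax(dim=-1)                               # greedy fill-in for simplicity
        conf = probs.gather(-1, sampled.unsqueeze(-1)).squeeze(-1)   # confidence of each prediction
        still_masked = tokens == mask_id
        conf = conf.masked_fill(~still_masked, -1.0)                 # never overwrite already-filled slots
        keep_masked = int(num_tokens * math.cos(math.pi / 2 * (t + 1) / steps))  # cosine mask schedule
        num_fill = int(still_masked.sum()) - keep_masked             # how many tokens to reveal this step
        if num_fill > 0:
            idx = conf.topk(num_fill, dim=-1).indices
            tokens.scatter_(1, idx, sampled.gather(1, idx))
    return tokens                                                    # ~16 steps vs. 256 autoregressive steps
```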
Show-O also has the potential for mixed-modality generation, such as interleaved video keyframe generation with text descriptions, showing promise for long-form video generation. Furthermore, the Show-O framework investigates the impact of discrete and continuous image representations on multimodal understanding, offering insights for future unified model designs.
The following figure presents a comparison of model characteristics between the Show-O framework and existing methods across various domains. Show-O stands out as a unified model that integrates advanced techniques for both multimodal understanding and generation.
In summary, the main contributions of this paper are as follows:
- Show-O is a unified model that integrates multimodal understanding and generation using a single transformer.
- Show-O unifies autoregressive and discrete diffusion modeling within one transformer, handling both text and images effectively.
- The Show-O framework outperforms or matches individual baseline models with an equivalent or larger number of parameters across multimodal understanding and generation benchmarks.
- Show-O supports downstream applications like text-based inpainting and extrapolation without fine-tuning and demonstrates potential for mixed-modality generation.
- Show-O explores the impact of different types of representations, providing valuable insights for improving multimodal understanding in unified models.
In recent years, an increasing number of studies have focused on unified multimodal language models capable of both comprehension and generation. Some efforts use continuous representations interleaved with text tokens for autoregressive modeling to generate images. SEED-X proposes a unified and versatile foundation system capable of handling both multimodal understanding and generation tasks. In this approach, continuous image representations from the CLIP ViT encoder are combined with text tokens and fed into a large language model (LLM) to perform next-word prediction and image representation regression. Chameleon introduces a family of token-based mixed-modal models capable of both comprehending and generating images. This approach represents all modalities as discrete tokens, utilizing a unified transformer-based architecture and training the model from scratch in an end-to-end manner. In comparison, Show-O also adopts discrete tokens to represent all modalities but utilizes a discrete diffusion process instead of autoregressive modeling for visual generation.
SHOW-O: Methodology and Architecture
The primary objective behind the Show-O framework is to develop a unified model that integrates autoregressive and diffusion modeling for joint multimodal understanding and generation. Developing such a unified model poses significant challenges, with core issues revolving around: i) defining the model’s input/output space; ii) unifying various types of input data from different modalities; iii) integrating both autoregressive and diffusion modeling into a single transformer; and iv) effectively training such a unified model.
Show-O addresses these challenges with the following solutions:
- Show-O constructs the input/output space by tokenizing text and image data into discrete tokens.
- Show-O introduces its default architecture and a unified prompting strategy to structure input data and modalities.
- Show-O demonstrates how to incorporate both autoregressive and diffusion modeling within a single transformer.
- Show-O presents a three-stage training pipeline to effectively train the unified model.
Tokenization
Given that the proposed Show-O is built upon pre-trained LLMs, it is natural to perform unified learning in the discrete space. By maintaining a unified vocabulary that includes discrete text and image tokens, Show-O is tasked with the same learning objective: predicting discrete tokens.
Text Tokenization
Show-O is based on a pre-trained LLM, and the same tokenizer is used for text data tokenization without any modifications.
Image Tokenization
Following MAGVIT-v2, Show-O trains a lookup-free quantizer on around 35M images. The quantizer maintains a codebook of size 8,192 and encodes images of 256×256 resolution into 16×16 grids of discrete tokens. MAGVIT-v2 is chosen for its ease of fine-tuning, making it suitable as a video tokenizer with temporal compression capability, an aspect Show-O plans to explore in the future. An alternative approach is to use different tokenizers for understanding and generation, respectively. Inspired by existing studies, Show-O also extracts continuous image representations from the pre-trained MAGVIT-v2 and CLIP-ViT encoders to explore improvements in multimodal understanding capabilities. In the following sections, the default Show-O employs discrete image tokens as input for both multimodal understanding and generation. For simplicity, the methodology sections elaborate only on this default configuration.
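To make the numbers concrete, below is a toy sketch of the lookup-free quantization idea: with a codebook of size 8,192 = 2^13, a 13-channel latent is binarized per channel and the resulting bit pattern is read off directly as the token id, so a 256×256 image becomes a 16×16 grid of ids. The random placeholder latent stands in for the encoder output; this is not the MAGVIT-v2 network itself.

```python
import torch

def lfq_quantize(latent):                        # latent: (batch, 13, 16, 16) continuous features
    bits = (latent > 0).long()                   # binarize each channel to {0, 1}
    weights = 2 ** torch.arange(latent.shape[1])             # read the bit pattern as a binary number
    indices = (bits * weights.view(1, -1, 1, 1)).sum(dim=1)  # (batch, 16, 16) token ids in [0, 8191]
    return indices

latent = torch.randn(1, 13, 16, 16)              # e.g. a 256x256 image downsampled 16x by the encoder
tokens = lfq_quantize(latent)                    # 16x16 = 256 discrete tokens per image
```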
Architecture
Show-O inherits the architecture of existing LLMs without any architecture modifications, except for prepending a QK-Norm operation to each attention layer. Show-O is initialized with the weights of a pre-trained LLM and expands the size of the embedding layer by incorporating 8,192 new learnable embeddings for discrete image tokens. Unlike state-of-the-art diffusion models that require an additional text encoder, Show-O inherently encodes text conditional information for text-to-image generation.
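As a rough sketch of the embedding expansion (sizes and the initialization of the new rows are illustrative assumptions, not Show-O's released weights): the pre-trained text embeddings are kept, and 8,192 fresh rows are appended for the image codes.

```python
import torch
import torch.nn as nn

text_vocab, hidden = 50_304, 2048                  # illustrative sizes for a small pre-trained LLM
old_emb = nn.Embedding(text_vocab, hidden)         # stands in for the pre-trained embedding layer

new_emb = nn.Embedding(text_vocab + 8192, hidden)  # text vocabulary + 8,192 image codes
with torch.no_grad():
    new_emb.weight[:text_vocab] = old_emb.weight   # keep the pre-trained text embeddings
    nn.init.normal_(new_emb.weight[text_vocab:], std=0.02)  # fresh, learnable image-token embeddings
```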
Unified Prompting
To perform unified learning on multimodal understanding and generation, Show-O utilizes a unified prompting strategy to format various kinds of input data. Given an image-text pair (x, y), it is first tokenized into M image tokens and N text tokens by the image and text tokenizers, respectively. The tokens are then formed into an input sequence according to the task type, as illustrated in the following figure.
By employing this prompt design, Show-O can effectively encode various input data for multimodal understanding, text-to-image generation, and mixed-modality generation as sequential data. This setup enables unified learning to operate seamlessly across sequences for these various tasks. Once trained, Show-O can be prompted to handle a wide range of vision-language tasks, including visual question answering and text-to-image generation.
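As an illustration, a sequence builder along these lines might look as follows; the special-token names and their ordering are assumptions for the sketch, not necessarily Show-O's exact prompt format.

```python
def build_sequence(task, text_ids, image_ids, tok):
    if task == "mmu":    # multimodal understanding: image first, then the question
        return [tok["[MMU]"], tok["[SOI]"], *image_ids, tok["[EOI]"],
                tok["[SOT]"], *text_ids, tok["[EOT]"]]
    if task == "t2i":    # text-to-image generation: prompt first, then (initially masked) image tokens
        return [tok["[T2I]"], tok["[SOT]"], *text_ids, tok["[EOT]"],
                tok["[SOI]"], *image_ids, tok["[EOI]"]]
    raise ValueError(task)

tok = {name: i for i, name in enumerate(["[MMU]", "[T2I]", "[SOI]", "[EOI]", "[SOT]", "[EOT]"])}
seq = build_sequence("t2i", text_ids=[10, 11, 12], image_ids=[100] * 256, tok=tok)
```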
Omni-Attention Mechanism
Unlike existing works that model sequences autoregressively only, Show-O introduces an omni-attention mechanism, enabling it to model various types of signals in distinct ways. This comprehensive attention mechanism adaptively switches between causal and full attention based on the format of the input sequence. The following figure illustrates examples of omni-attention for different input sequences.
Specifically, Show-O processes text tokens within the sequence via causal attention, while image tokens are handled using full attention, allowing each token to comprehensively interact with all others. In multimodal understanding, text tokens can attend to all previous image tokens, while in text-to-image generation, image tokens can interact with all preceding text tokens. Omni-attention retains the text reasoning knowledge from the pre-trained LLM and enhances the efficiency of image generation by reducing sampling steps. Furthermore, it supports various downstream applications, such as inpainting and extrapolation, without requiring fine-tuning. When given only text tokens, the mechanism defaults to causal attention.
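As an illustration, one way such a mask could be constructed is sketched below, assuming a single image block per sequence; the function name and the boolean-mask convention (True means attention is allowed) are illustrative, not Show-O's released code.

```python
import torch

def omni_attention_mask(modality):     # modality: (seq_len,) bool, True where the token is an image token
    n = modality.shape[0]
    mask = torch.tril(torch.ones(n, n, dtype=torch.bool))   # causal base: attend to self and earlier tokens
    image = modality.view(-1, 1) & modality.view(1, -1)     # image-to-image pairs
    mask |= image                                           # full attention within the image block
    return mask

modality = torch.tensor([False] * 4 + [True] * 8)           # e.g. 4 text tokens followed by 8 image tokens
mask = omni_attention_mask(modality)
```

With only text tokens in the sequence, no positions are marked as image tokens and the mask reduces to plain causal attention, matching the behavior described above.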
SHOW-O: Experiments and Results
The following table presents the multimodal understanding capability of Show-O on public benchmarks, such as image captioning and visual question-answering tasks.
The current version of Show-O is built upon Phi-1.5, and therefore, Show-O’s understanding-only counterpart, LLaVA-v1.5-Phi-1.5, serves as the direct baseline. Show-O exhibits comparable performance in all evaluation metrics to the baseline LLaVA-v1.5-Phi-1.5, which is dedicated solely to multimodal understanding. This demonstrates the great potential of the Show-O framework to unify multimodal understanding and generation within a single transformer. When compared to understanding-only models like InstructBLIP, Qwen-VL-Chat, and mPLUG-Owl2, Show-O, despite having a much smaller model size, achieves competitive performance on the POPE, MME, Flickr30k, and VQAv2 benchmarks, and performs better on the GQA benchmark. When compared to unified models with significantly more parameters, such as NExT-GPT-13B and Chameleon-34B, Show-O also achieves strong performance on the Flickr30k benchmark and performs much better on the VQAv2 benchmark.
Given these promising results, Show-O is envisioned as a potential next-generation foundation model for unifying understanding and generation. These results also demonstrate the potential of scaling Show-O to achieve state-of-the-art performance.
Qualitative Comparisons
We present qualitative comparisons with the diffusion-based models SDv1.5 and SDXL, the autoregressive model LlamaGen, and the unified models LWM and SEED-X, as shown in the following figure.
Show-O demonstrates the ability to generate realistic images with consistent content described in both short and long text prompts. Compared to SDv1.5 and LlamaGen, Show-O exhibits better visual quality and stronger image-text alignment. For instance, in the second column, both SDv1.5 and LlamaGen fail to fully comprehend the text prompt and miss attributes such as “sunset” and “blue domes” in the generated images. In comparison to SDXL, Show-O provides comparable visual quality and alignment, as seen in examples like “a rally car race” and “stunning contrast against the vibrant sunset.”
Text-Guided Inpainting and Extrapolation
Show-O naturally supports text-based inpainting and extrapolation without requiring any fine-tuning. The following figure illustrates several examples.
At the top of the figure, given an input image and an inpainting mask, Show-O can transform a red trolley car into a blue sports car with sleek curves and tinted windows based on a user-provided text prompt. Show-O can also extrapolate the original image horizontally or vertically based on the given text prompt. For instance, in the second row, Show-O extrapolates an image by adding new objects, like “red wildflowers.” The pixels in both the inpainted and extrapolated regions remain consistent with the original image. These examples clearly demonstrate the inherent advantages of Show-O over autoregressive models for downstream applications.
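A hedged sketch of why this works without fine-tuning: the region to edit is simply re-masked at the token level and handed back to the same mask-prediction sampler, conditioned on the new text prompt. The helper and grid sizes below are illustrative.

```python
import torch

def inpaint_tokens(image_tokens, region_mask, mask_id=8192):
    # image_tokens: (16, 16) ids from the tokenizer; region_mask: (16, 16) bool, True where the user painted
    corrupted = image_tokens.masked_fill(region_mask, mask_id)
    return corrupted  # feed this, together with the text prompt, to the usual sampling loop

image_tokens = torch.randint(0, 8192, (16, 16))
region_mask = torch.zeros(16, 16, dtype=torch.bool)
region_mask[4:12, 4:12] = True                    # e.g. the region containing the trolley car
masked = inpaint_tokens(image_tokens, region_mask)
```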
Final Thoughts
In this article, we have discussed Show-O, a unified transformer that integrates multimodal understanding and generation. Unlike fully autoregressive models, Show-O unifies autoregressive and discrete diffusion modeling to adaptively handle inputs and outputs of various and mixed modalities, and it flexibly supports a wide range of vision-language tasks, including visual question answering, text-to-image generation, text-guided inpainting/extrapolation, and mixed-modality generation. Show-O is the first model to unify autoregressive and discrete diffusion modeling within a single transformer, enabling it to handle different modalities in distinct ways. Extensive experimental results demonstrate that Show-O is comparable to, or even better than, individual expert models with an equivalent or larger number of parameters across a wide range of vision-language tasks, highlighting its potential as a next-generation foundation model.