
EvalAlign: Supervised Fine-Tuning Multimodal LLMs with Human-Aligned Data for Evaluating Text-to-Image Models

🌈 Abstract

The recent advancements in text-to-image generative models have been remarkable, but the field suffers from a lack of evaluation metrics that accurately reflect the performance of these models, particularly fine-grained metrics that can guide model optimization. This paper proposes EvalAlign, a metric characterized by its accuracy, stability, and fine granularity. The approach leverages the capabilities of Multimodal Large Language Models (MLLMs) pre-trained on extensive datasets. The authors develop evaluation protocols that focus on two key dimensions: image faithfulness and text-image alignment, with detailed, fine-grained instructions linked to specific scoring options, enabling precise manual scoring of the generated images. They then apply Supervised Fine-Tuning (SFT) to align the MLLM closely with human evaluative judgments, resulting in a robust evaluation model. Comprehensive tests across 24 text-to-image generation models demonstrate that EvalAlign not only provides superior metric stability but also aligns more closely with human preferences than existing metrics.

🙋 Q&A

[01] Introduction

1. What are the key limitations of existing benchmarks for text-to-image generation models? The key limitations of existing benchmarks include:

  • Limited model parameters: Current evaluation models have too few parameters, restricting their ability to accurately represent images and leading to significant discrepancies compared to human evaluations.
  • Training data limitations: Some evaluation methods use models that have not been trained on synthesized images, which may introduce training bias and flaw the evaluation.
  • High annotation costs: Some methods rely heavily on extensive human annotations, significantly increasing the cost of labeling.
  • Lack of detailed evaluation metrics: The evaluation metrics do not provide fine-grained interpretability, preventing them from guiding model optimization effectively.
  • Computational inefficiency: The evaluation models require substantial computational resources, making them inefficient.

2. How does the proposed EvalAlign metric address these limitations? EvalAlign offers low-cost, accurate, and efficient model evaluations while providing fine-grained, interpretable metrics. It leverages the capabilities of Multimodal Large Language Models (MLLMs) pre-trained on extensive datasets and employs Supervised Fine-Tuning (SFT) to align the MLLM with human annotations for text-to-image generation.

[02] Related Work

1. What are the key limitations of existing text-to-image evaluation methods? Existing text-to-image evaluation methods contain various limitations, including:

  • Inconsistency with human perception (e.g., IS, FID, CLIPScore)
  • Coarse, general scoring that cannot guide model improvement (e.g., HPS series, PickScore, ImageReward)
  • Heavy reliance on human labor, limiting their application within budget-limited research groups (e.g., HEIM)

2. How does the proposed EvalAlign approach differ from related work? EvalAlign takes both text-image alignment and image faithfulness into consideration, while existing approaches like TIFA, Gecko, and LLMScore mainly focus on text-image alignment. Additionally, the evaluation of LLMScore requires an object detection stage, which introduces significant extra inference latency to the evaluation pipeline.

[03] EvalAlign Dataset Construction

1. What are the key steps in the construction of the EvalAlign dataset? The EvalAlign dataset construction process includes:

  1. Prompt collection: Collecting, filtering, and cleaning prompts from existing evaluation datasets, plus generating additional prompts with an LLM.
  2. Image generation: Generating a diverse set of images with various text-to-image models using the collected prompts.
  3. Data annotation: Annotating the prompts for text-image alignment and the images for image faithfulness.
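The three construction steps above can be captured in a simple record structure. A minimal sketch follows; all field names and values here are illustrative assumptions, not the paper's actual schema:

```python
# Illustrative sketch of a single EvalAlign annotation record.
# Field names and example values are assumptions; the paper's
# actual data format may differ.

def make_annotation_record(prompt, image_path, model_name,
                           faithfulness_answers, alignment_answers):
    """Bundle one generated image with its human annotations.

    faithfulness_answers / alignment_answers map each fine-grained
    question to the option the human annotator selected.
    """
    return {
        "prompt": prompt,                          # cleaned text prompt
        "image": image_path,                       # image generated from the prompt
        "generator": model_name,                   # text-to-image model that produced it
        "image_faithfulness": faithfulness_answers,
        "text_image_alignment": alignment_answers,
    }

record = make_annotation_record(
    prompt="a red bicycle leaning against a brick wall",
    image_path="images/sample_0001.png",
    model_name="model_A",
    faithfulness_answers={"Are the object shapes distorted?": "No distortion"},
    alignment_answers={"Is the bicycle red?": "Yes"},
)
```

Grouping images from many different generators under one record format is what lets the same annotation questions be reused across models.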

2. How does the dataset enable the training and evaluation of the MLLM? The dataset includes a variety of images generated by different text-to-image models, which allows for detailed human annotation. This diversity not only tests the MLLM's generalization capabilities but also aids in developing a model with broader applicability.

[04] Training and Evaluation Methods

1. How does the Supervised Fine-Tuning (SFT) process work? The SFT training sample is a triplet consisting of a question about a fine-grained aspect of the generated image, the multimodal input (mainly the image and necessary textual description), and the human-annotated answer. The optimization objective is the original autoregressive loss function, calculated only on the question-answer pair.
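A common way to restrict the autoregressive loss to the question-answer pair is label masking: positions belonging to the multimodal context receive an ignore index so they contribute nothing to the loss. The sketch below assumes the PyTorch-style convention of `-100` as the ignore index and uses toy token IDs:

```python
# Sketch of loss masking for the SFT objective: only the
# question-answer tokens keep real labels; the multimodal context
# (image tokens, surrounding text) is masked with the ignore index.
# The -100 convention and the toy token IDs are assumptions.

IGNORE_INDEX = -100

def build_labels(context_token_ids, qa_token_ids):
    """Return labels where only question-answer positions are supervised."""
    labels = [IGNORE_INDEX] * len(context_token_ids) + list(qa_token_ids)
    return labels

context_ids = [101, 2054, 2003]   # image/context placeholder tokens (toy)
qa_ids = [7592, 102]              # question-answer tokens (toy)
labels = build_labels(context_ids, qa_ids)
# Positions with IGNORE_INDEX are skipped by the cross-entropy loss.
```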

2. How is the EvalAlign metric calculated? During inference, the MLLM generates a response to the given question in an autoregressive way. A rule-based filtering and regular expression are used to extract the chosen option and its corresponding score, which are then summed to obtain the final EvalAlign metric.
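The extraction-and-sum step can be sketched as a small regex pass over the model's responses. The option letters and the score table below are hypothetical stand-ins, not the paper's exact rubric:

```python
import re

# Sketch of the rule-based extraction step: parse each free-form MLLM
# response for a chosen option, map options to scores, and sum across
# questions to get the final metric. OPTION_SCORES is an assumption.

OPTION_SCORES = {"A": 2, "B": 1, "C": 0}  # hypothetical scoring options

def extract_score(response: str) -> int:
    """Return the score for the option mentioned in one model response."""
    match = re.search(r"\b([ABC])\b", response)
    if match is None:
        return 0  # fall back to the lowest score if no option is found
    return OPTION_SCORES[match.group(1)]

responses = ["The answer is A.", "Option B best describes the image."]
total = sum(extract_score(r) for r in responses)  # summed EvalAlign score
```

Anchoring each answer to a fixed option letter is what makes the free-form generation parseable by a simple rule.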

[05] Experimental Results

1. How does EvalAlign perform compared to existing evaluation methods? Extensive experiments demonstrate that EvalAlign outperforms other methods in evaluating text-to-image model performance. The rankings of the top and bottom 10 models by both EvalAlign and human evaluation scores show remarkable consistency, confirming that the EvalAlign metric closely mirrors human evaluation.
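One standard way to quantify this kind of ranking consistency (the paper reports consistency but this particular statistic is our assumption) is Spearman's rank correlation, sketched here with made-up per-model scores:

```python
# Sketch: measuring agreement between EvalAlign and human rankings
# with Spearman's rho. All scores below are illustrative, not from
# the paper.

def ranks(scores):
    """Rank scores from highest (rank 1) to lowest; ties not handled."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    r = [0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        r[idx] = rank
    return r

def spearman_rho(xs, ys):
    """Spearman correlation via the classic 1 - 6*sum(d^2)/(n(n^2-1)) formula."""
    n = len(xs)
    rx, ry = ranks(xs), ranks(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

evalalign = [0.91, 0.85, 0.78, 0.60]  # hypothetical per-model scores
human     = [0.88, 0.86, 0.75, 0.58]
rho = spearman_rho(evalalign, human)  # identical rankings give rho = 1.0
```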

2. What are the key findings from the ablation and analysis experiments? The ablation studies show that:

  • SFT significantly enhances the performance of MLLMs on evaluation tasks, closely aligning their predictions with human evaluations.
  • Increasing the MLLM model size from 7B to 34B results in notable improvements in performance.
  • Training with a small amount of annotated data (500 samples) nearly maximizes accuracy, highlighting the cost-effectiveness of the approach.