T2IScoreScore (TS2): Objectively assessing text-to-image prompt faithfulness metrics

The teaser figure for TS2.

T2IScoreScore (TS2) assesses how well a T2I metric can correctly order and separate clusters of images with different error counts with respect to a single prompt using a semantic error graph.

Surprisingly, we find that the state-of-the-art LM-based metrics (in terms of human preference correlation) struggle to outperform simple feature-space metrics like CLIPScore at ordering or separating closely related sets of images. This result is contrary to claims in many papers presenting novel metrics: the benefits of using more complicated systems become much less pronounced in more challenging and realistic evaluation settings such as ours.

Our benchmark dataset consists of 165 prompts, each with a connected Semantic Error Graph (SEG) of 4 to 76 images arranged by their specific semantic errors relative to the prompt (see the image above for an example). Evaluating text-to-image prompt faithfulness metrics this way is more rigorous than human preference evaluations, which don't consider fine-grained, near-neighbor images for the same prompt (precisely what T2I metrics are supposed to measure). TS2 can evaluate any T2I faithfulness metric as a black box.
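To make the black-box setup concrete, here is a minimal sketch of how an SEG and a pluggable metric could be represented. The class and field names are our own illustration and do not correspond to the actual TS2 codebase.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical representation for illustration only; the real TS2 code may differ.
# A metric is any black-box function mapping (prompt, image_path) -> score.
MetricFn = Callable[[str, str], float]

@dataclass
class SEGNode:
    """One node in a Semantic Error Graph: images sharing the same set of errors."""
    error_count: int                                   # semantic errors w.r.t. the prompt
    image_paths: List[str] = field(default_factory=list)

@dataclass
class SemanticErrorGraph:
    """A prompt plus its error-graph nodes and the walks (paths) through them."""
    prompt: str
    nodes: Dict[str, SEGNode]                          # node id -> node
    walks: List[List[str]]                             # each walk: node ids, head (0 errors) first

def score_graph(seg: SemanticErrorGraph, metric: MetricFn) -> Dict[str, List[float]]:
    """Score every image in every node with the black-box metric."""
    return {
        node_id: [metric(seg.prompt, path) for path in node.image_paths]
        for node_id, node in seg.nodes.items()
    }
```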

 

Check out our Interactive Leaderboard!

Click on any column title to sort the table by that variable. Deeper method details follow the table. Explanation of columns:

  • Ord is a measure of how well a metric correctly orders images by semantic error count, based on Spearman's rank correlation coefficient.
  • Sep is a measure of how well a metric separates pairs of clusters that differ in semantic errors based on the two-sample Kolmogorov–Smirnov statistic.
  • *Avg is the average of measure * over all Semantic Error Graphs (SEGs) in TS2.
  • *Synth is the average of measure * over our artificially designed SEGs. These are the easiest types of errors to detect.
  • *Nat is the average of measure * over SEGs using real stock photos rather than generated images, with artificially constructed "prompts" to place them in an error graph.
  • *Real is the average over SEGs containing natural errors from real text-to-image model generation attempts. This is both the hardest to score and the most construct-valid.
  • Because we tested several different backend VQA models for DSG and TIFA, each metric/model combination appears as its own row.
(Column groups: Avg = average over all images; Synth = synthetic errors; Nat = natural images; Real = natural errors.)

| Method | OrdAvg | SepAvg | OrdSynth | SepSynth | OrdNat | SepNat | OrdReal | SepReal |
|---|---|---|---|---|---|---|---|---|
| DSG + InstructBLIP | 0.802 | 0.843 | 0.861 | 0.888 | 0.702 | 0.815 | 0.658 | 0.689 |
| DSG + LLaVA-1.5 | 0.800 | 0.825 | 0.838 | 0.855 | 0.749 | 0.751 | 0.696 | 0.768 |
| DSG + BLIP1 | 0.769 | 0.806 | 0.817 | 0.841 | 0.710 | 0.751 | 0.628 | 0.714 |
| TIFA + InstructBLIP | 0.765 | 0.850 | 0.802 | 0.867 | 0.651 | 0.828 | 0.716 | 0.805 |
| DSG + LLaVA-1.5 (w/prompt eng) | 0.756 | 0.805 | 0.821 | 0.838 | 0.689 | 0.772 | 0.559 | 0.706 |
| TIFA + LLaVA-1.5 | 0.745 | 0.843 | 0.792 | 0.875 | 0.628 | 0.834 | 0.667 | 0.727 |
| TIFA + LLaVA-1.5 (w/prompt eng) | 0.744 | 0.819 | 0.792 | 0.852 | 0.640 | 0.756 | 0.645 | 0.744 |
| ALIGNScore | 0.739 | 0.928 | 0.776 | 0.941 | 0.702 | 0.926 | 0.626 | 0.879 |
| TIFA + BLIP1 | 0.738 | 0.818 | 0.788 | 0.841 | 0.622 | 0.779 | 0.640 | 0.764 |
| CLIPScore | 0.714 | 0.907 | 0.750 | 0.905 | 0.580 | 0.915 | 0.693 | 0.903 |
| TIFA + MPlug | 0.710 | 0.806 | 0.726 | 0.806 | 0.669 | 0.842 | 0.682 | 0.774 |
| DSG + MPlug | 0.688 | 0.755 | 0.735 | 0.771 | 0.619 | 0.706 | 0.564 | 0.731 |
| LLMScore Over | 0.577 | 0.735 | 0.616 | 0.728 | 0.444 | 0.767 | 0.541 | 0.736 |
| LLMScore EC | 0.488 | 0.736 | 0.502 | 0.711 | 0.362 | 0.805 | 0.544 | 0.773 |
| TIFA + Fuyu | 0.387 | 0.672 | 0.445 | 0.673 | 0.235 | 0.757 | 0.297 | 0.593 |
| VIEScore + LLaVA-1.5 | 0.378 | 0.518 | 0.425 | 0.537 | 0.224 | 0.445 | 0.332 | 0.507 |
| DSG + Fuyu | 0.358 | 0.660 | 0.455 | 0.687 | 0.215 | 0.710 | 0.100 | 0.508 |

We evaluate the CLIPScore, ALIGNScore, TIFA, DSG, LLMScore, and VIEScore metrics, using a variety of VQA and VLM models where appropriate. On all separation scores, the simplest metrics, CLIPScore and ALIGNScore, outperform the more complex LLM-based metrics. On the hardest partition for the ordering score, OrdReal, CLIPScore comes within 2% of the performance of the best metric.
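To illustrate the black-box interface with the simplest baseline, here is a rough CLIPScore sketch using the Hugging Face transformers CLIP model, following the common 2.5 * max(cos, 0) formulation; the exact model and settings behind the leaderboard numbers may differ.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# A sketch of CLIPScore as a black-box (prompt, image_path) -> score metric.
# Model choice and the 2.5 * max(cos, 0) scaling follow the original CLIPScore
# paper; this is illustrative, not the exact leaderboard implementation.
_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clipscore(prompt: str, image_path: str) -> float:
    image = Image.open(image_path).convert("RGB")
    inputs = _processor(text=[prompt], images=image, return_tensors="pt",
                        padding=True, truncation=True)
    outputs = _model(**inputs)
    # Cosine similarity between the (normalized) text and image embeddings.
    text_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
    image_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
    cos = (text_emb * image_emb).sum().item()
    return 2.5 * max(cos, 0.0)
```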

Details on the TS2 Semantic Error Graphs (SEGs)

Images in TS2 are collected from a variety of sources, including multiple Stable Diffusion variants, DALL-E 2, and the free stock image website Pexels. Prompts that weren't hand-written were sampled from MS COCO and PartiPrompts. The SEGs contain a variety of errors, including missing objects, incorrect object properties, composition errors, and verb errors. The distribution of error types across the SEGs is depicted below:

The distribution of errors in our dataset.

As mentioned above, TS2 contains three partitions of SEGs: Synthetic Errors, Natural Images, and Natural Errors. They primarily differ in the order in which the images, prompt, and error graph were produced. For the Synth group of SEGs, an error graph was designed from an initial prompt, and images were generated to fill the nodes of the graph (left side of the figure below). For the Nat group of SEGs, images were collected from stock photo websites and prompts were designed to place them in a constructed error graph (middle of the figure below). For the Real group of SEGs, a single prompt was used to produce many images, including real T2I generation errors, which were then sorted into an error graph (right of the figure below).

  How the SEGs in each partition were constructed (left: Synthetic Errors; middle: Natural Images; right: Natural Errors).

 
 

Details on the TS2 meta-metrics

If you really want to know how the meta-metrics work, you should probably read the paper! But in short: a metric scores every image along every walk down each error graph. Within each walk, Spearman's rank correlation coefficient between the metric's scores and the expected error counts assesses correctness of ordering, and the two-sample Kolmogorov–Smirnov statistic between every adjacent pair of nodes assesses how well the metric separates semantically different populations of images relative to a specific prompt.
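Below is a minimal sketch of both meta-metrics for a single walk, using SciPy; the function and variable names are illustrative, and the aggregation across walks and SEGs is simplified relative to the paper.

```python
from scipy.stats import spearmanr, ks_2samp

def walk_ordering_score(error_counts, scores):
    """Ord for one walk (simplified): rank correlation between error counts and negated scores.

    error_counts[i] is the number of semantic errors for image i along the walk;
    scores[i] is the metric's score for that image. A good metric assigns lower
    scores to images with more errors, so we correlate against -score.
    """
    rho, _ = spearmanr(error_counts, [-s for s in scores])
    return rho

def walk_separation_score(node_scores):
    """Sep for one walk (simplified): mean KS statistic between adjacent nodes' score sets.

    node_scores is a list of lists: the metric scores for each node's images,
    ordered along the walk (fewest errors first).
    """
    stats = [ks_2samp(a, b).statistic for a, b in zip(node_scores, node_scores[1:])]
    return sum(stats) / len(stats) if stats else 0.0
```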

Both meta-metrics range from 0 to 1 (in the paper we scale this to 100 for readability). Although the scores may already look high, in principle this should be a very easy task. The semantic errors we mark are "objective", but that objectivity is implicitly a human judgement, so human performance on this task is near 100%. Capturing these objective errors is the most important thing for a quality T2I faithfulness metric to do.

Related Work

Our work is fundamentally about evaluating other groups' metrics. SeeTrue is the closest work to ours in philosophy, with its similar focus on using near-neighbor images to assess T2I models, its investigation of both multi-image-to-single-prompt and multi-prompt-to-single-image settings, and its use of natural images. However, they focus more on using these examples to build a faithfulness metric that can be applied directly to VNLI and VQA tasks, rather than using them to evaluate T2I metrics themselves.

The main metrics we analyzed, TIFA and DSG (from Yushi Hu and Jaemin Cho, respectively), are both very influential. We hope that TS2-guided evaluation will enable better metrics inspired by their work.

The captioning-based metrics we evaluated, LLMScore and VIEScore, both stand to benefit greatly from advancements in VLMs and captioning models.

 
 

Publication

Who Evaluates the Evaluations? Objectively Scoring Text-to-Image Prompt Coherence Metrics with T2IScoreScore (TS2)

Michael Saxon, Fatima Jahara, Mahsa Khoshnoodi, Yujie Lu, Aditya Sharma, William Yang Wang
@misc{saxon2024evaluates,
      title={Who Evaluates the Evaluations? Objectively Scoring Text-to-Image Prompt Coherence Metrics with T2IScoreScore (TS2)}, 
      author={Michael Saxon and Fatima Jahara and Mahsa Khoshnoodi and Yujie Lu and Aditya Sharma and William Yang Wang},
      year={2024},
      eprint={2404.04251},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2404.04251}
}