AcT2I: Evaluating and Improving Action Depiction in Text-to-Image Models

Arizona State University

TL;DR

Text-to-Image models excel at 1-hop composition but often miss who-does-what-to-whom. We make two contributions:
(1) AcT2I, an action-centric benchmark of 125 prompts spanning 25 actions and 100 animals that stresses relational reasoning, and on which the best model (SD-3.5 Large) reaches only 48% human acceptance.
(2) LLM-guided Knowledge Distillation, a training-free technique that augments prompts with temporal, spatial, and emotional cues, yielding large gains (e.g., +72% for SD-3.5 Large).

Action depiction performance across models on AcT2I
Current SOTA performance on AcT2I — best 48% acceptance; substantial headroom.
Knowledge Distillation teaser
LLM-guided Knowledge Distillation (training-free) — use it to stress-test and improve generations.

Abstract

Text-to-Image (T2I) models have recently achieved remarkable success in generating images from textual descriptions. However, challenges persist in accurately rendering complex scenes where actions and interactions form the primary semantic focus. Our key observation in this work is that T2I models frequently struggle to capture the nuanced and often implicit attributes inherent in action depiction, leading to images that lack key contextual details. To enable systematic evaluation, we introduce AcT2I, a benchmark designed to evaluate the performance of T2I models in generating images from action-centric prompts. We experimentally validate that leading T2I models do not fare well on AcT2I. We further hypothesize that this shortcoming arises from the incomplete representation of the inherent attributes and contextual dependencies in the training corpora of existing T2I models. We build upon this by developing a training-free knowledge distillation technique that uses Large Language Models to address this limitation. Specifically, we enhance prompts with dense information across three dimensions and observe that injecting temporal details significantly improves image generation accuracy, with our best model achieving an increase of 72%. Our findings highlight the limitations of current T2I methods in generating images that require complex reasoning, and demonstrate that systematically integrating linguistic knowledge can notably advance the generation of nuanced and contextually accurate images.

Contributions

  • We develop the AcT2I benchmark to evaluate action depiction in T2I models across 25 actions and 100 animals (125 prompts total).
  • Across 5 state-of-the-art T2I models, we find significant limitations in generating accurate and realistic action depictions.
  • We propose a training-free, LLM-guided knowledge distillation that injects spatial, temporal, and emotional cues, yielding large gains (e.g., +72% for SD-3.5 Large).

Schema & Labels

  • Schema: [animal] [action] [animal] (two entities, one action)
  • Labels: rarity, emotion, spatial topology, temporal extent; Shannon entropy > 0.82 indicates balanced label distributions
  • Why animals: high action affordance, fewer human-image artifacts; isolates action depiction
  • Evaluation: 3 raters/image; acceptance = majority “Yes” (a minimal sketch of this metric follows the axis list below)
Axes at a glance (levels, why they matter, one-line examples):

  • Rarity (Frequent / Rare / Very Rare): stress-tests the long tail beyond memorized patterns. Examples: Frequent: a Snake attacking a Possum; Rare: a Moose attacking a Duck; Very Rare: an Iguana eating a Cougar.
  • Emotional (Aggressive / Defensive / Affiliative / Communicative): affects pose, gaze, and tension cues. Examples: Aggressive: a Camel eating a Swan; Defensive: a Fox fleeing from a Cougar; Affiliative: a Rat playing with a Possum; Communicative: a Dog barking at a Cat.
  • Spatial (Proximal-contact / Pursuit/Avoid / Distal): contact and pursuit change scene layouts. Examples: Proximal: a Moose attacking a Duck; Pursuit/Avoid: a Kangaroo chasing a Raccoon; Distal: a Dog barking at a Cat.
  • Temporal (Instantaneous / Extended): motion and action phase disambiguate the depicted moment. Examples: Instantaneous: a Goose pecking at a Hamster; Extended: a Monkey interacting with an Otter.
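
To make the schema and the acceptance metric concrete, below is a minimal Python sketch. It is illustrative only: the animal/action lists are small subsets, the a/an heuristic and function names are ours, and we assume the > 0.82 balance check uses normalized entropy; none of this is the benchmark's actual construction code.

from collections import Counter
import math

# Illustrative subsets; AcT2I curates 125 prompts spanning
# 25 actions and 100 animals (not a full cross product).
ANIMALS = ["Snake", "Possum", "Moose", "Duck", "Iguana", "Cougar"]
ACTIONS = ["attacking", "eating", "chasing", "fleeing from"]

def article(noun: str) -> str:
    # Naive a/an choice; sufficient for the animal names used here.
    return "an" if noun[0].lower() in "aeiou" else "a"

def make_prompt(agent: str, action: str, patient: str) -> str:
    # Instantiate the [animal] [action] [animal] schema.
    return f"{article(agent)} {agent} {action} {article(patient)} {patient}"

def accepted(ratings: list) -> bool:
    # An image is accepted when a majority of its 3 raters answer "Yes".
    return Counter(ratings)["Yes"] >= 2

def normalized_entropy(labels: list) -> float:
    # Normalized Shannon entropy of a label distribution (1.0 = perfectly
    # balanced). We assume the > 0.82 balance check refers to this form.
    counts = Counter(labels).values()
    n, k = sum(counts), len(counts)
    if k < 2:
        return 0.0
    h = -sum((c / n) * math.log2(c / n) for c in counts)
    return h / math.log2(k)

print(make_prompt("Iguana", "eating", "Cougar"))   # an Iguana eating a Cougar
print(accepted(["Yes", "No", "Yes"]))              # True
print(round(normalized_entropy(["Frequent", "Rare", "Very Rare", "Rare"]), 2))  # 0.95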

Performance of SOTA T2I models on the AcT2I benchmark

  • Best overall acceptance: 48% (SD-3.5 Large); no model exceeds 50%.
  • Models show complementary niche strengths; prompts involving reptiles are the hardest.
  • Under Knowledge Distillation, the temporal dimension delivers the largest gains.

Frequent Error Patterns

Incomplete depictions

a Beaver grooming a Dog

Hybridization

a Draco camouflaging near a Cow

Role/context errors

a Buffalo competing for dominance with a Hippopotamus

Spatial/scale errors

a Boar retaliating against a Giraffe

Emotional errors

a Turkey pecking at a Cougar

Temporal errors

a Badger retaliating against an Alpaca

LLM-guided Knowledge Distillation

User preference win rates by knowledge distillation dimension and model
Results of our technique: Temporal > Emotional > Spatial.

Emotional

  • Facial/body expressions to convey intent
  • Improves close-range and high-tension scenes

Spatial

  • Relative position, scale, and depth
  • Helps stationary or vantage-point actions

Temporal

  • Freeze-frame selection and motion cues
  • Largest gains on competitive and dynamic actions
Word clouds for Knowledge Distillation prompts: Emotional, Spatial, Temporal
Semantic cues emphasized across dimensions.
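
For illustration, here is a minimal sketch of the enrichment step. The instruction templates are our paraphrase of the three dimensions, not the paper's exact prompts, and call_llm / t2i_model are hypothetical placeholders for any LLM and T2I backend.

DISTILL_INSTRUCTIONS = {
    "temporal": ("Rewrite this image prompt as a single freeze-frame: pick the "
                 "most recognizable instant of the action and add motion cues."),
    "spatial": ("Rewrite this image prompt with explicit relative position, "
                "scale, and depth between the two animals."),
    "emotional": ("Rewrite this image prompt with facial and body expressions "
                  "that convey each animal's intent."),
}

def build_distillation_request(base_prompt: str, dimension: str) -> str:
    # Compose the instruction sent to the LLM; the LLM's rewritten, denser
    # prompt then replaces the original before image generation.
    return f"{DISTILL_INSTRUCTIONS[dimension]}\nPrompt: {base_prompt}"

request = build_distillation_request("a Goose pecking at a Hamster", "temporal")
# enriched = call_llm(request)   # hypothetical placeholder for any LLM API
# image = t2i_model(enriched)    # e.g., SD-3.5 Large; placeholder

Only the prompt changes, which is what keeps the technique training-free.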

BibTeX

@article{malaviya2025act2i,
  author    = {Malaviya, Vatsal and Chatterjee, Agneet and Patel, Maitreya and Yang, Yezhou and Baral, Chitta},
  title     = {AcT2I: Evaluating and Improving Action Depiction in Text-to-Image Models},
  journal   = {arXiv preprint arXiv:2509.16141},
  year      = {2025}
}