AcT2I: Evaluating and Improving Action Depiction in Text-to-Image Models

Arizona State University

TL;DR

Text-to-Image models excel at 1-hop composition but often miss who-does-what-to-whom. We make two contributions:
(1) AcT2I, an action-centric benchmark of 125 prompts spanning 25 actions and 100 animals that stresses relational reasoning, and on which the best model (SD-3.5 Large) reaches only 48% human acceptance.
(2) LLM-guided Knowledge Distillation, a training-free technique that augments prompts with temporal, spatial, and emotional cues, yielding large gains (e.g., +72% for SD-3.5 Large).

Action depiction performance across models on AcT2I
Current SOTA performance on AcT2I — best 48% acceptance; substantial headroom.
Knowledge Distillation teaser
LLM-guided Knowledge Distillation (training-free) — use it to stress-test and improve generations.

Abstract

Text-to-Image (T2I) models have recently achieved remarkable success in generating images from textual descriptions. However, challenges persist in accurately rendering complex scenes where actions and interactions form the primary semantic focus. Our key observation in this work is that T2I models frequently struggle to capture the nuanced and often implicit attributes inherent in action depiction, leading to images that lack key contextual details. To enable systematic evaluation, we introduce AcT2I, a benchmark designed to evaluate the performance of T2I models in generating images from action-centric prompts. We experimentally validate that leading T2I models do not fare well on AcT2I. We further hypothesize that this shortcoming arises from the incomplete representation of the inherent attributes and contextual dependencies in the training corpora of existing T2I models. We build upon this by developing a training-free knowledge distillation technique that uses Large Language Models to address this limitation. Specifically, we enhance prompts with dense information across three dimensions and observe that injecting temporal details significantly improves image generation accuracy, with our best model achieving an increase of 72%. Our findings highlight the limitations of current T2I methods in generating images that require complex reasoning, and demonstrate that systematically integrating linguistic knowledge can notably advance the generation of nuanced and contextually accurate images.

Contributions

  • We develop the AcT2I benchmark to evaluate action depiction in T2I models across 25 actions and 100 animals (125 prompts total).
  • Across 5 state-of-the-art T2I models, we find significant limitations in generating accurate and realistic action depictions.
  • We propose a training-free, LLM-guided knowledge distillation that injects spatial, temporal, and emotional cues, yielding large gains (e.g., +72% for SD-3.5 Large).

Schema & Labels

  • Schema: [animal] [action] [animal] (two entities, one action)
  • Labels: rarity, emotion, spatial topology, temporal extent; Shannon entropy > 0.82 indicates balanced label distributions
  • Why animals: high action affordance, fewer human-image artifacts; isolates action depiction
  • Evaluation: 3 raters/image; acceptance = majority “Yes” (a minimal sketch of this metric follows the axis list below)
Axes at a glance (levels, why they matter, one-line examples):

  • Rarity (Frequent / Rare / Very Rare): stress-tests the long tail beyond memorized patterns. Examples: Frequent: a Snake attacking a Possum; Rare: a Moose attacking a Duck; Very Rare: an Iguana eating a Cougar.
  • Emotional (Aggressive / Defensive / Affiliative / Communicative): affects pose, gaze, and tension cues. Examples: Aggressive: a Camel eating a Swan; Defensive: a Fox fleeing from a Cougar; Affiliative: a Rat playing with a Possum; Communicative: a Dog barking at a Cat.
  • Spatial (Proximal-contact / Pursuit/Avoid / Distal): contact and pursuit change scene layouts. Examples: Proximal: a Moose attacking a Duck; Pursuit/Avoid: a Kangaroo chasing a Raccoon; Distal: a Dog barking at a Cat.
  • Temporal (Instantaneous / Extended): motion and action phase disambiguate the depicted moment. Examples: Instantaneous: a Goose pecking at a Hamster; Extended: a Monkey interacting with an Otter.
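
To make the schema and the acceptance metric concrete, below is a minimal Python sketch. It is illustrative only: the animal/action lists are small subsets, the a/an heuristic and function names are ours, and we assume the > 0.82 balance check uses normalized entropy; none of this is the benchmark's actual construction code.

from collections import Counter
import math

# Illustrative subsets; AcT2I curates 125 prompts spanning
# 25 actions and 100 animals (not a full cross product).
ANIMALS = ["Snake", "Possum", "Moose", "Duck", "Iguana", "Cougar"]
ACTIONS = ["attacking", "eating", "chasing", "fleeing from"]

def article(noun: str) -> str:
    # Naive a/an choice; sufficient for the animal names used here.
    return "an" if noun[0].lower() in "aeiou" else "a"

def make_prompt(agent: str, action: str, patient: str) -> str:
    # Instantiate the [animal] [action] [animal] schema.
    return f"{article(agent)} {agent} {action} {article(patient)} {patient}"

def accepted(ratings: list) -> bool:
    # An image is accepted when a majority of its 3 raters answer "Yes".
    return Counter(ratings)["Yes"] >= 2

def normalized_entropy(labels: list) -> float:
    # Normalized Shannon entropy of a label distribution (1.0 = perfectly
    # balanced). We assume the > 0.82 balance check refers to this form.
    counts = Counter(labels).values()
    n, k = sum(counts), len(counts)
    if k < 2:
        return 0.0
    h = -sum((c / n) * math.log2(c / n) for c in counts)
    return h / math.log2(k)

print(make_prompt("Iguana", "eating", "Cougar"))   # an Iguana eating a Cougar
print(accepted(["Yes", "No", "Yes"]))              # True
print(round(normalized_entropy(["Frequent", "Rare", "Very Rare", "Rare"]), 2))  # 0.95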

Performance of SOTA T2I models on the AcT2I benchmark

  • Best overall acceptance: 48% (SD-3.5 Large); no model exceeds 50%.
  • Models show complementary niche strengths; prompts involving reptiles are the hardest.
  • Under Knowledge Distillation, the temporal dimension delivers the largest gains.

Frequent Error Patterns

Incomplete depictions

a Beaver grooming a Dog

Hybridization

a Draco camouflaging near a Cow

Role/context errors

a Buffalo competing for dominance with a Hippopotamus

Spatial/scale errors

a Boar retaliating against a Giraffe

Emotional errors

a Turkey pecking at a Cougar

Temporal errors

a Badger retaliating against an Alpaca

LLM-guided Knowledge Distillation

User preference win rates by knowledge distillation dimension and model
Results of our technique: Temporal > Emotional > Spatial.

Emotional

  • Facial/body expressions to convey intent
  • Improves close-range and high-tension scenes

Spatial

  • Relative position, scale, and depth
  • Helps stationary or vantage-point actions

Temporal

  • Freeze-frame selection and motion cues
  • Largest gains on competitive and dynamic actions
Word clouds for Knowledge Distillation prompts: Emotional, Spatial, Temporal
Semantic cues emphasized across dimensions.
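
For illustration, here is a minimal sketch of the enrichment step. The instruction templates are our paraphrase of the three dimensions, not the paper's exact prompts, and call_llm / t2i_model are hypothetical placeholders for any LLM and T2I backend.

DISTILL_INSTRUCTIONS = {
    "temporal": ("Rewrite this image prompt as a single freeze-frame: pick the "
                 "most recognizable instant of the action and add motion cues."),
    "spatial": ("Rewrite this image prompt with explicit relative position, "
                "scale, and depth between the two animals."),
    "emotional": ("Rewrite this image prompt with facial and body expressions "
                  "that convey each animal's intent."),
}

def build_distillation_request(base_prompt: str, dimension: str) -> str:
    # Compose the instruction sent to the LLM; the LLM's rewritten, denser
    # prompt then replaces the original before image generation.
    return f"{DISTILL_INSTRUCTIONS[dimension]}\nPrompt: {base_prompt}"

request = build_distillation_request("a Goose pecking at a Hamster", "temporal")
# enriched = call_llm(request)   # hypothetical placeholder for any LLM API
# image = t2i_model(enriched)    # e.g., SD-3.5 Large; placeholder

Only the prompt changes, which is what keeps the technique training-free.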

BibTeX

@article{malaviya2025act2i,
  author    = {Malaviya, Vatsal and Chatterjee, Agneet and Patel, Maitreya and Yang, Yezhou and Baral, Chitta},
  title     = {AcT2I: Evaluating and Improving Action Depiction in Text-to-Image Models},
  journal   = {arXiv preprint arXiv:2509.16141},
  year      = {2025}
}