- We develop the AcT2I benchmark to evaluate action depiction in T2I models across 25 actions and 100 animals (125 prompts total).
- Across 5 state-of-the-art T2I models, we find significant limitations in generating accurate and realistic action depictions.
- We propose a training-free, LLM-guided knowledge distillation technique that injects spatial, temporal, and emotional cues, yielding large gains (e.g., +73% for SD-3.5 Large).
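
To make the third contribution concrete, the following is a minimal sketch of how LLM-generated cues might be injected without any training. It assumes the cues are appended to the text prompt before it is passed to an unchanged T2I model. The names `build_query`, `enrich_prompt`, and `stub_llm`, the query wording, and the example prompt are all hypothetical and are not taken from the benchmark or the method itself.

```python
from typing import Callable, Dict

# The three cue types named in the text.
DIMENSIONS = ("spatial", "temporal", "emotional")


def build_query(prompt: str, dimension: str) -> str:
    """Hypothetical instruction sent to the LLM for one cue type."""
    return (
        f"In one short sentence, give the {dimension} details needed to "
        f"depict this action in a single image: '{prompt}'"
    )


def enrich_prompt(prompt: str, llm: Callable[[str], str]) -> str:
    """Append one LLM-generated cue per dimension to the original prompt."""
    cues: Dict[str, str] = {
        dim: llm(build_query(prompt, dim)).strip().rstrip(".")
        for dim in DIMENSIONS
    }
    parts = [prompt.strip().rstrip(".")]
    # Skip dimensions for which the LLM returned nothing.
    parts += [cues[dim] for dim in DIMENSIONS if cues[dim]]
    return ". ".join(parts) + "."


def stub_llm(query: str) -> str:
    """Stand-in for a real LLM client; returns canned cues for the demo."""
    canned = {
        "spatial": "The cat is close behind the mouse, body low to the ground",
        "temporal": "The cat is mid-stride, just before catching the mouse",
        "emotional": "The cat looks focused and the mouse looks frightened",
    }
    for dim, cue in canned.items():
        if f"the {dim} details" in query:
            return cue
    return ""


# Illustrative prompt; the enriched string would be passed to the T2I model.
enriched = enrich_prompt("a cat chasing a mouse", stub_llm)
print(enriched)
```

Passing the LLM as a plain callable keeps the sketch independent of any particular provider: a real client can be swapped in for `stub_llm` without touching the enrichment logic, and the T2I model itself is never modified.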