Evaluating text-to-image models is hard, but here are some potentially interesting dimensions:
Depth and breadth of knowledge
- easier: Icelandic wedding
- harder: 1940s wedding in Reykjavík's oldest theater
Compositionality for rare or unseen combinations
- easier: a scarecrow juggling purple birds
- harder: remorseful blue lasagna on a leather pentagonal table with furry elephant legs
Coherence of complex outputs
- easier: hands, letters
- harder: full website mockups, wiring diagrams
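
One way to make these dimensions actionable is to hold them in a small prompt suite, so that every generated image can be tagged with the dimension and difficulty tier it is meant to probe. The sketch below is only an illustration: the names `EvalPrompt`, `PROMPT_SUITE`, and `flatten` are hypothetical, the dimension keys are shorthand for the headings above, and the coherence entries are subject categories rather than finished prompts. No image model or scoring method is assumed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalPrompt:
    """One prompt, tagged with the dimension and tier it probes (hypothetical type)."""
    dimension: str
    difficulty: str  # "easier" or "harder"
    prompt: str


# Hypothetical layout: dimension -> difficulty tier -> example prompts from the list above.
PROMPT_SUITE = {
    "knowledge": {
        "easier": ["Icelandic wedding"],
        "harder": ["1940s wedding in Reykjavík's oldest theater"],
    },
    "compositionality": {
        "easier": ["a scarecrow juggling purple birds"],
        "harder": [
            "remorseful blue lasagna on a leather pentagonal table "
            "with furry elephant legs"
        ],
    },
    "coherence": {
        # Subject categories, not full prompts; they would need to be expanded.
        "easier": ["hands", "letters"],
        "harder": ["full website mockups", "wiring diagrams"],
    },
}


def flatten(suite):
    """Turn the nested suite into a flat list of tagged prompts."""
    return [
        EvalPrompt(dimension, difficulty, prompt)
        for dimension, tiers in suite.items()
        for difficulty, prompts in tiers.items()
        for prompt in prompts
    ]


if __name__ == "__main__":
    for item in flatten(PROMPT_SUITE):
        print(f"{item.dimension:16} {item.difficulty:7} {item.prompt}")
```

Keeping the tier next to each prompt makes it possible to report results per dimension and per difficulty rather than as a single aggregate score.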