Images were extracted from 10 films by taking every Nth frame, giving ~100 images per film. The dataset was then duplicated: one copy was captioned with gemini-2.5-flash-lite, and the other uses empty captions. Training was limited to 500 steps.
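As a rough sketch of the sampling and duplication described above (not the actual script used), the frame stride and the two dataset copies could be computed like this. The function names, the default target of 100, and the file names are all hypothetical; decoding the frames themselves (e.g. with ffmpeg) is left out.

```python
def every_nth_indices(total_frames: int, target: int = 100) -> list[int]:
    """Pick roughly `target` frame indices by taking every Nth frame."""
    if total_frames <= 0 or target <= 0:
        return []
    # N is chosen so that stepping through the film yields about `target` frames.
    step = max(1, total_frames // target)
    return list(range(0, total_frames, step))[:target]


def duplicate_with_empty_captions(captions: dict[str, str]) -> list[tuple[str, str]]:
    """Return (image, caption) pairs: a captioned copy plus a copy with empty captions."""
    captioned = [(name, text) for name, text in captions.items()]
    uncaptioned = [(name, "") for name in captions]
    return captioned + uncaptioned


if __name__ == "__main__":
    # Assumed example: a 2-hour film at 24 fps has 172,800 frames.
    indices = every_nth_indices(172_800)
    print(len(indices), indices[:3])

    # Hypothetical file name and caption, for illustration only.
    pairs = duplicate_with_empty_captions({"film01_0000.png": "a street at night"})
    print(pairs)
```

A fixed stride like this is what makes smear frames and uneven coverage possible, since it picks frames blindly by position.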
Since little human effort went into this version, the model is not that great.
I need to re-extract the frames by hand so the selection is fairly even (and doesn't accidentally include a smear frame or similar), and edit the captions to fix any problems and include character names.
v1.0 was identical, but did not use dropout. v1.1 uses 5% dropout.