This version is the checkpoint saved after three epochs of training. It shows a lot less overfitting, but its output is more generic.
The model was trained on photorealistic images of fit, attractive women. It should work with all types of captions, but the training dataset did not contain any nudity.