GPT-4: Game-Changer for Image-to-Text?

GPT-4 tells fewer lies. For example, it correctly lists Google search operators. What I find most interesting at first glance, though, is its exceptional ability to interpret images based on URLs. There is no restriction on when the image was created. I gave it the image above and it produced a perfect output:

It would do face recognition, too, and tell you the person’s bio – try it on someone’s photo.

OK, it was wishful thinking for now! After a few colleagues’ comments and a few more tests, I realize that it is hallucinating based on the image URL. So what does multi-modal mean?
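To see how much a URL alone can give away, here is a minimal Python sketch that pulls the word-like tokens out of an image URL's path. The function name `url_hint_words` and the example URLs are hypothetical, made up for illustration; they are not the ones from my test. The point is only that a descriptive file name hands a text-only model plenty of material to "describe" an image it never saw.

```python
import re
from urllib.parse import urlparse, unquote


def url_hint_words(url: str) -> list[str]:
    """Return the lowercase word-like tokens found in a URL's path."""
    path = unquote(urlparse(url).path)
    # Drop a trailing file extension such as ".jpg" before tokenizing.
    path = re.sub(r"\.[A-Za-z0-9]{1,5}$", "", path)
    # Keep alphabetic runs only; digits and separators carry little meaning.
    return [word.lower() for word in re.findall(r"[A-Za-z]+", path)]


# Hypothetical URLs (assumptions for illustration only).
descriptive = "https://example.com/uploads/2023/03/wearing-glasses-and-mortarboard.jpg"
opaque = "https://example.com/img/12345.png"

print(url_hint_words(descriptive))
print(url_hint_words(opaque))
```

A simple follow-up check, under the same assumptions: host the same image under an opaque name and ask again. If the description collapses or changes completely, the model was reading the URL, not the picture.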

  1. I can’t see how that is the perfect output, given that there is no mortarboard in the image. In my experience, the word mortarboard is used in two contexts – it is the cap worn by students at graduation, and it is the piece of equipment used to hold cement when laying bricks. I don’t see either of those (although the “wearing glasses and a mortarboard” output suggests that ChatGPT4 thinks that it sees the cap). Is there another meaning?

