GPT-4: Game-Changer for Image-to-Text?

booleanstrings | Boolean | 3 Comments

GPT-4 tells fewer lies. For example, it correctly lists Google search operators. What I find most interesting at first glance, though, is its exceptional ability to interpret images based on URLs. There is no restriction on when the image was created. I gave it the image above and it produced a perfect output:

It would do face recognition, too, and tell you the person’s bio. Try it on someone’s photo.

OK, it was wishful thinking for now! After a few colleagues’ comments and a few more tests, I realize that it is hallucinating based on the image URL. So what does multi-modal mean?
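One way to sanity-check this kind of result is to look at what the URL alone gives away. Below is a minimal Python sketch, not anything GPT-4 itself runs: the function name `url_hint_words` and the example URL are made up for illustration. It lists the plain words a text-only model could read straight out of an image link without ever opening the image.

```python
import re
from urllib.parse import urlparse, unquote


def url_hint_words(url: str) -> list[str]:
    """Return the lowercase words visible in a URL's path (hypothetical helper)."""
    # Decode percent-escapes so "%20" and similar do not hide words.
    path = unquote(urlparse(url).path)
    # Drop a trailing file extension such as .jpg or .png.
    path = re.sub(r"\.[A-Za-z0-9]{2,5}$", "", path)
    # Split on anything that is not a letter; digits and punctuation act as separators.
    return [word.lower() for word in re.split(r"[^A-Za-z]+", path) if word]


# Hypothetical URL, used only to illustrate the idea.
example = "https://example.com/uploads/2023/03/woman-wearing-glasses.jpg"
print(url_hint_words(example))
```

If the model's description matches these words and little else, or if the same image under an opaque file name gets a very different description, that points to guessing from the URL rather than actually seeing the image.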

P.S. Please follow my Midjourney AI Art at my new page The Prompter. I use a ChatGPT-based prompt generator. 🙂

  1. I can’t see how that is the perfect output, given that there is no mortarboard in the image. In my experience, the word mortarboard is used in two contexts – it is the cap worn by students at graduation, and it is the piece of equipment used to hold cement when laying bricks. I don’t see either of those (although the “wearing glasses and a mortarboard” output suggests that ChatGPT4 thinks that it sees the cap). Is there another meaning?

  2. Hi Irina,

    Can you please explain in more detail how you were able to get the image analyzed? I tried to input the exact same URL that you used and got the following output from ChatGPT.

    “I’m sorry, but as an AI language model, I cannot provide a specific URL for an image depicting a spaceship landing on Mars without additional information such as the specific type of spaceship or any other details that can help narrow down the search. However, you can try using a search engine like Google and searching for “spaceship landing on Mars” or similar keywords to find relevant images.”

