ChatGPT Can Finally See

After months of testing, OpenAI rolled out an “Advanced Voice” mode for ChatGPT back in September. The feature lets you talk with ChatGPT in real time: you can interrupt the bot mid-“speech” to ask another question, and it picks up on your tone of voice, which it uses to inform both its answers and its intonation. (It’s very creepy when it laughs.)

One feature, however, has been missing from Advanced Voice Mode since launch: vision. When OpenAI first announced the feature back in May, it demonstrated how ChatGPT would be able to access your camera and “see” the world around you. While talking to the bot, you could point the camera at something, ask a question, and ChatGPT would answer as best it could. Seven months later, that capability has arrived, and it’s frighteningly impressive.

To access it, you’ll need a paid ChatGPT subscription: either Plus ($20 per month) or Pro ($200 per month). ChatGPT Team subscribers are also eligible. Even if you pay, the feature may not show up for you right away, as OpenAI is rolling it out over time.

Testing ChatGPT’s Advanced Voice Mode vision feature

Accessing the camera is quite easy once it appears in your account. You launch Advanced Voice Mode the same way you always do, via the waveform icon in the bottom right corner of the chat. From there you’ll see a new camera icon, which, naturally, starts a live feed from your camera. This doesn’t interrupt the chat: you can be in the middle of a conversation with ChatGPT, open the camera, and keep talking, only now the camera feed is part of the conversation.

The first time I used it, I pointed the camera at the Nintendo Switch box sitting next to me, with an iPhone cable and my Magic Trackpad resting on top, and asked, “What is this?” ChatGPT replied: “It looks like an OLED Nintendo Switch box with some cables and a laptop on top. Are you planning to set it up?” Two out of three correct, since it mistook my trackpad for a laptop, but hey, close enough. I then pointed at my water bottle and asked it to identify what I was highlighting: “It looks like a black Hydro Flask bottle. It’s great for keeping drinks cold or hot! Do you take it with you often?”

I followed up: “Do you know which Hydro Flask model this is?” ChatGPT: “I can’t be sure of the exact model, but it looks like one of their wide-mouth bottles, probably around 32 oz. It’s definitely designed to keep your drinks at the right temperature for hours.” That’s… essentially true. I’m also not entirely comfortable with ChatGPT correctly guessing the size.

I moved on to my keyboard, which ChatGPT correctly identified as an Apple Magic Keyboard. I asked which keys it could see, and it named several, but not all, of the keys in frame. Then I asked how many keys it could see, and it said “about 30,” even though there were 26. Close again.

It was able to identify the MagSafe port on my MacBook, as well as the two USB ports and the headphone jack to its right. It recognized the air vent in my ceiling and the specific type of boots sitting by my front door. All told, it recognized everything I tested it on, with the exception of the trackpad.

Vision in Advanced Voice Mode is fast

Beyond the recognition itself, though, what struck me most was the speed of these responses. You ask ChatGPT to identify something and it does, sometimes faster than a real person could. Sometimes the bot will stall for a moment with a filler phrase (for example, “I think this is…”), which is likely a trick to buy time while ChatGPT processes the rest of what it wants to say. I also noticed it hedging its first guesses: I pointed it at my Magic Mouse, and its first answer was simply that it was some kind of computer mouse. But when I asked what brand it was, it not only said Apple, it said it was the Apple Magic Mouse, known for its “sleek design” and “touch surface.”

All things considered, though, these responses are often nearly instantaneous, which speaks to how powerful OpenAI’s models are these days. I’m still largely an AI skeptic, but this is the first development in a while that has genuinely impressed me, and I’m torn about how I feel about that.

On the one hand, I can see this technology being used for good. Imagine how useful a tool like this could be for blind or visually impaired users, especially in a convenient form factor like smart glasses. Someone could ask their AI assistant which direction they’re facing, have it read a restaurant menu aloud, or ask whether it’s safe to cross the street. Technology like this could change search for the better and make it easy to learn new things about the world just by pointing your smartphone camera at something.

On the other hand, my thoughts turn to the negatives, especially since AI is still prone to hallucinations. As more and more people use this technology, they will inevitably run into the mistakes AI can make, and if they rely on a bot for help with tasks, especially ones involving their safety, those hallucinations could be dangerous. I didn’t experience any major errors, just some confusion over the trackpad, and Anderson Cooper found that the bot made a mistake on a geometry problem (again, not a huge deal). But it’s a good reminder that even as this technology rapidly improves, and as people come to rely on it more, its inherent flaws make failures more consequential.

That may be why every live camera session starts with a warning not to use the feature for anything safety-related.
