Microsoft Copilot Vision: A Promising but Flawed AI Assistant
Microsoft’s Copilot Vision for Windows aims to be an AI assistant that watches your screen and offers helpful suggestions as you work. The initial promise is compelling: a real-time, context-aware assistant capable of understanding your actions and providing relevant guidance. However, as with many emerging technologies, the reality is a mixed bag of potential and frustration. While Copilot Vision occasionally demonstrates flashes of brilliance, it frequently stumbles, highlighting the challenges of creating a truly useful and reliable AI assistant.
The announcement of Copilot Vision was a major highlight of Microsoft’s 50th-anniversary celebration. It’s a forward-thinking concept that grants Windows Copilot access to your screen, allowing it to interpret what you’re seeing and respond to your questions and requests. I had a brief hands-on experience with Copilot Vision at Microsoft’s headquarters, but the demonstrations were carefully curated. Now, through the Windows Insider program, users can experience it firsthand.
Currently, Copilot Vision is only available for testing. Despite Microsoft’s initial indication that it would be accessible across all beta channels, only two of my test laptops received the build – one on the Dev Channel and the other on the Canary Channel.
Performance varies significantly depending on the hardware. The Acer Swift Edge laptop, powered by a Ryzen 7840U, exhibited slow response times, sometimes taking up to half a minute to react. While the response time eventually improved to a few seconds, the experience was much smoother on a Surface Laptop 7 with a Qualcomm Snapdragon X Elite chip. The near-instantaneous responses on the Surface Laptop 7 were likely due to the more powerful Neural Processing Unit (NPU).
Using Copilot Vision is straightforward. After launching the Copilot app and tapping the "eyeglasses" icon, you’re presented with a list of apps to "share" with Copilot Vision. Once you grant access, it can only see and interact with that specific application.
To evaluate Copilot Vision’s capabilities, I subjected it to seven different scenarios: interpreting a PCWorld article, playing Balatro and Solitaire, reviewing a complaint letter, identifying photos, comparing airfares, and assisting with Adobe Photoshop. The results were inconsistent.
The most important thing to understand about Copilot Vision is that it only sees what you see. Unlike other AI models like Copilot, Google Gemini, or ChatGPT, it doesn’t automatically ingest entire documents or web pages. If you scroll down a page, it can "read" along, but it doesn’t retain information that’s not currently visible. This significantly limits its utility.
While it can answer conversational questions, it struggles to provide broader context. For example, when examining an article about tariffs, it could calculate price increases based on different tariff rates, but it wouldn’t offer insights into the current tariff situation.
The Minecraft demo that Microsoft used to showcase Copilot Vision’s capabilities felt suspiciously scripted. It appeared that the scenarios were carefully chosen to highlight Copilot Vision’s strengths and minimize its weaknesses.
I hoped that Copilot Vision would be more helpful with the indie game Balatro, a card game that combines elements of video poker with strategic twists. However, I learned that Copilot Vision wouldn’t spontaneously offer suggestions; it only responds when asked.
While Copilot Vision correctly identified that I was playing Balatro and displayed the available options, its analysis was flawed. It failed to recognize the cards I had and provided inaccurate advice. For instance, it incorrectly claimed I had a pair of queens when I didn’t and misidentified other cards.
I then tried a simpler game: Windows Solitaire, specifically FreeCell. I assumed Copilot Vision could understand the basic rules and provide assistance. Unfortunately, it suffered from the same object recognition problems as with Balatro. It repeatedly invented cards that weren’t on the board, even though it seemed to grasp the concept of moving cards between columns.
After failing miserably at Solitaire, Copilot offered some banter, stating, "Fair point! It’s all about having fun, though. If nothing else, I’ve got your back for the banter. Let’s keep playing and see where it goes. Ready for another move?"
Next, I asked Copilot to review a complaint letter drafted by Google Gemini. It deemed the initial version acceptable, but when I added an offensive line, Copilot Vision failed to flag the inappropriate language. This exposes its limits as a reviewer of professional writing, and by extension as a source of career advice.
I then tested Copilot Vision’s ability to identify actors from a promotional still from "The Breakfast Club." Initially, it refused to identify specific people, citing privacy concerns. However, after prompting it to acknowledge that they were famous figures, it correctly identified the five main cast members.
Copilot Vision seemed able to recognize images, especially when given context. Asked about a photo of Rodney Dangerfield, it named him correctly, noting that the window title mentioned "15 intriguing facts about Rodney Dangerfield."
Predictably, Copilot Vision struggled with comparing flights. Because it can only see what I can see on the screen, it couldn’t provide comprehensive recommendations based on price, stopovers, or personal preferences. The lack of support for full-page screenshots further hindered its ability to analyze flight options effectively.
Despite these shortcomings, I believe Copilot Vision has potential in assisting with tasks like photo editing. While it doesn’t visually highlight elements on the screen, it can provide verbal guidance. In Photoshop, it could assist with navigating the interface, even referring to the "Move tool" as a "four-point arrow."
Copilot Vision’s real-time assistance could be especially beneficial for users unfamiliar with complex software. Although it doesn’t replace existing tutorials, it offers the advantage of being constantly up-to-date with the latest software versions and interfaces.
The value of AI is debatable. Copilot Vision sometimes feels competent, but often it’s a waste of time. Its current state feels tentative.
While the potential is enormous, Microsoft appears cautious about entering the consumer space aggressively. The future might involve AI assistants like Google Gemini constantly monitoring Chromebook users. The competitive pressures between these AI assistants will hopefully lead to better, privacy-preserving tools that provide real-time assistance.