Microsoft Copilot Vision: A Promising But Flawed AI Assistant
Microsoft’s Copilot Vision for Windows aims to be an AI assistant that observes your screen and offers context-aware suggestions. The idea is compelling: a helpful AI companion that understands what you’re working on and offers guidance. In its current testing phase, however, Copilot Vision is a mixed bag of impressive potential and frustrating limitations.
The initial announcement of Copilot Vision was a high point during Microsoft’s 50th anniversary celebration. It embodies a visionary concept: granting Windows Copilot access to your screen for real-time interpretation and allowing you to interact with your PC through natural language. While hands-on demos at Microsoft HQ were promising, the true test lies in real-world usage.
Currently, Copilot Vision is available only to Windows Insiders on specific channels. Performance varies significantly with hardware: an Acer Swift Edge laptop with a Ryzen 7 7840U processor showed sluggish response times, while a Surface Laptop 7 with a Qualcomm Snapdragon X Elite chip responded near-instantly, likely thanks to its more powerful NPU.
Using Copilot Vision is straightforward. After launching the Copilot app, you select the "eyeglasses" icon and choose specific apps to share with Copilot Vision. This targeted approach ensures privacy, as Copilot Vision can only see the selected app.
To evaluate Copilot Vision’s capabilities, I subjected it to seven diverse scenarios: interpreting a PCWorld article, reviewing a draft letter, playing Balatro (a card game), playing Solitaire, identifying actors in photos, comparing airfares, and assisting with Adobe Photoshop. The results were inconsistent, highlighting both strengths and weaknesses.
One crucial aspect of Copilot Vision is that it only processes what’s visible on your screen. Unlike Copilot, Google Gemini, or ChatGPT in "research" mode, it doesn’t automatically ingest entire documents or web pages. If you scroll down, it can "read" along, but it doesn’t retain information from what it hasn’t seen. This limits its utility in tasks requiring a broader understanding of context. However, the ability to ask conversational questions, such as calculating tariff-adjusted prices, proved useful.
The Minecraft demo, where Copilot Vision offered specific assistance, raised suspicion. It seemed carefully scripted to showcase Copilot Vision’s usefulness. This prompted me to test Copilot Vision with Balatro, a popular indie game.
Copilot Vision doesn’t proactively offer suggestions; it needs to be prompted. While it could recognize that I was playing Balatro and identify the available choices, its card recognition was flawed. It failed to identify a missing pair of queens and misidentified other cards, rendering its advice inaccurate.
I then simplified the test by launching a game of Windows Solitaire (FreeCell). Unfortunately, Copilot Vision exhibited the same object recognition problems as with Balatro, inventing non-existent cards. Although it understood basic card movements, its suggested moves didn’t match what was actually on the screen. This proved extremely frustrating.
When I sarcastically remarked that Copilot wasn’t a great Solitaire player, it responded with lighthearted banter. While conversational, this wasn’t the helpful AI assistant I was hoping for.
Next, I tested Copilot Vision’s ability to identify potentially offensive content in a letter. After drafting a professional complaint letter using Google Gemini, I added an insulting line at the end. Copilot Vision either failed to recognize the problematic addition or simply didn’t care, raising concerns about its ability to provide reliable career advice.
I then attempted to use Copilot Vision to identify actors in a promotional still from "The Breakfast Club." Initially, it declined, citing a policy against identifying specific individuals unless they are famous. After further prompting, however, it successfully identified the five principal cast members, suggesting the initial refusal is a safeguard against "doxing" non-public figures rather than a hard limit. It was also able to identify Rodney Dangerfield, but only after I confirmed he was a famous person, drawing on the window title and his "recognizable look."
Predictably, Copilot Vision struggled with comparing airfares. Its inability to see the entire list of flights at once and its lack of understanding of personal preferences (price vs. stopovers) hindered its effectiveness. A "screenshot" feature that captures the entire webpage would be beneficial.
However, Copilot Vision showed promise in assisting with Adobe Photoshop. It understood the interface and provided guidance on specific tools, such as identifying the Move tool by its "four-point arrow" icon. While it doesn’t visually highlight elements on the screen, its conversational approach and real-time relevance made it a genuinely valuable aid — this is where it shines.
While Copilot Vision doesn’t replace existing Photoshop tutorials, it offers the advantage of being potentially up-to-date with the latest software versions and interfaces.
Overall, Copilot Vision feels like a tentative step towards a more useful AI future. While it demonstrates competence in certain areas, its inconsistencies and limitations can be frustrating. It has enormous potential, but Microsoft seems to be proceeding cautiously in the consumer space.
The long-term vision for AI assistants is compelling. While some may be hesitant to let an AI constantly monitor their work, the potential benefits of real-time assistance and context-aware guidance are undeniable. Competition between Microsoft and Google in this space should drive innovation toward better, privacy-preserving tools that enhance productivity and creativity. How Gemini evolves within ChromeOS, and Copilot Vision within Windows, will be worth watching closely.