OpenAI’s Agent Mode Tested: Mixed Results in Web Navigation

Key Points
- OpenAI’s Atlas agent was tested on six varied web‑based tasks.
- The agent successfully located specific content but often struggled with navigation.
- It spent several minutes hunting for a non‑existent filter even though its initial search had already narrowed the results.
- A looping behavior caused the test to be stopped after about ten minutes.
- The agent scored a median of 7.5 and a mean of 6.83 on a 10‑point evaluation scale.
- Session‑length limits and hesitation on ambiguous pages were major constraints.
- Potentially useful for simple, repetitive tasks that can be reviewed by humans.
- Not yet reliable enough for fully autonomous, long‑running automation.
OpenAI’s new Agent Mode, demonstrated through its Atlas agent, was put through a series of web‑based tasks to assess its ability to search, click, and retrieve information without human input. While the agent succeeded in locating specific content such as macOS game demos, it frequently struggled with navigation, fell into loops, and ran up against time limits, leaving several tasks unfinished. Overall, the evaluation shows that the technology can handle simple, repetitive actions but is not yet reliable enough for fully autonomous use.
Performance Overview
OpenAI’s Atlas agent was examined using a set of six varied web‑based tasks that required it to search for specific items, follow links, and identify relevant information. In one scenario, the agent began by searching for the term “demo.” It eventually reached a filtered results page for macOS games, but then spent several minutes trying to apply a non‑existent “has demo” filter, despite the initial search already narrowing the results.
The agent managed to click the top result, Project II: Silent Valley, yet hesitated when a prominent “Download Demo” link appeared, suspecting it had landed on the full‑game page rather than a demo page. It backtracked to the search results and repeated the process. After roughly ten minutes of this looping behavior, the test was stopped.
When scored on a 10‑point scale, the agent achieved a median of 7.5 and a mean of 6.83 across the six tasks. This suggests that while the system can interpret instructions and navigate simple menus, its speed and consistency are limited.
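For context on how a median of 7.5 can sit above a mean of 6.83, the short sketch below reproduces the arithmetic. The per‑task scores in it are hypothetical, chosen only so the aggregates match the reported figures; the evaluation did not publish individual task scores.

```python
from statistics import mean, median

# Hypothetical per-task scores (0-10), chosen only so the aggregates
# match the reported median (7.5) and mean (6.83) across six tasks.
task_scores = [3, 5, 7, 8, 9, 9]

print(f"median: {median(task_scores)}")    # 7.5
print(f"mean:   {mean(task_scores):.2f}")  # 6.83
```

One or two badly failed tasks, such as the looping demo search described above, are enough to pull the mean well below the median.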
Limitations
The primary constraints identified were technical session‑length limits, which capped most tasks to a few minutes, and the agent’s tendency to enter repetitive loops when faced with ambiguous navigation cues. These factors greatly reduced the utility of the system for longer or more complex workflows. The evaluation noted that a version capable of running indefinitely could score higher.
Additionally, the agent’s cautious behavior—such as questioning whether a page displayed a demo or the full product—illustrates a need for better context understanding. The system’s reliance on visual cues rather than deeper content analysis leads to hesitation and back‑tracking.
Potential Uses
Despite the shortcomings, the Agent Mode shows promise for automating simple, repetitive web tasks that can be spot‑checked by a human afterward. Scenarios such as gathering product links, checking availability, or performing routine searches could benefit from the tool’s ability to navigate menus and extract information without direct supervision.
Overall, the technology is not yet ready for “set it and forget it” automation but may serve as a time‑saving assistant for low‑complexity tasks, reducing the drudgery of manual web browsing.