Meet The AI Agent With Multiple Personalities | EUROtoday
In the coming years, agents are widely expected to take over more and more chores on behalf of humans, including using computers and smartphones. For now, though, they’re too error prone to be much use.
A new agent called S2, created by the startup Simular AI, combines frontier models with models specialized for using computers. The agent achieves state-of-the-art performance on tasks like using apps and manipulating files, and it suggests that turning to different models in different situations may help agents advance.
“Computer-using agents are different from large language models and different from coding,” says Ang Li, cofounder and CEO of Simular. “It’s a different type of problem.”
In Simular’s approach, a powerful general-purpose AI model, like OpenAI’s GPT-4o or Anthropic’s Claude 3.7, is used to reason about how best to complete the task at hand, while smaller open source models step in for tasks like interpreting web pages.
Li, who was a researcher at Google DeepMind before founding Simular in 2023, explains that large language models excel at planning but aren’t as good at recognizing the elements of a graphical user interface.
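To make that division of labor concrete, here is a minimal Python sketch, not Simular's actual code. It assumes a hypothetical `planner` that turns a goal into abstract steps and a hypothetical `grounder` that maps each step to screen coordinates. Both are toy stand-ins for real models, and every name in the block is invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Point = Tuple[int, int]
Screen = Dict[str, Point]  # hypothetical: UI element label -> (x, y) position

@dataclass
class Action:
    step: str        # abstract step produced by the planner
    click_at: Point  # concrete location chosen by the grounder

def run_agent(
    goal: str,
    screen: Screen,
    planner: Callable[[str], List[str]],
    grounder: Callable[[str, Screen], Point],
) -> List[Action]:
    """Ask the planner for steps, then let the grounder locate each one."""
    return [Action(step, grounder(step, screen)) for step in planner(goal)]

def toy_planner(goal: str) -> List[str]:
    # Stand-in for a general-purpose model that reasons about the task.
    return ["click the search box", "click submit"]

def toy_grounder(step: str, screen: Screen) -> Point:
    # Stand-in for a specialized model that recognizes GUI elements:
    # pick the first element whose label appears in the step text.
    for label, position in screen.items():
        if label in step:
            return position
    raise LookupError(f"no element on screen matches step: {step!r}")

screen = {"search box": (120, 40), "submit": (300, 40)}
actions = run_agent("find flights", screen, toy_planner, toy_grounder)
```

In this sketch the planner never sees pixel coordinates and the grounder never sees the overall goal, which is one way to read the split Li describes.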
S2 is designed to learn from experience with an external memory module that records actions and user feedback and uses those recordings to improve future actions.
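A memory of that kind could look something like the following Python sketch. It is an illustration under stated assumptions, not a description of S2's internals: the `Episode` and `ExperienceMemory` names are hypothetical, feedback is reduced to a numeric score, and past tasks are matched by exact string, which a real system would almost certainly replace with something fuzzier.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Episode:
    task: str
    actions: List[str]
    feedback: int  # hypothetical scale: positive = user approved, negative = not

@dataclass
class ExperienceMemory:
    episodes: List[Episode] = field(default_factory=list)

    def record(self, task: str, actions: List[str], feedback: int) -> None:
        """Store what the agent did and how the user rated it."""
        self.episodes.append(Episode(task, list(actions), feedback))

    def recall(self, task: str) -> Optional[List[str]]:
        """Return the best-rated approved action sequence for this task, if any."""
        approved = [e for e in self.episodes if e.task == task and e.feedback > 0]
        if not approved:
            return None
        return max(approved, key=lambda e: e.feedback).actions

memory = ExperienceMemory()
memory.record("book flight", ["open site", "retry login"], feedback=-1)
memory.record("book flight", ["open site", "search", "pay"], feedback=1)
hint = memory.recall("book flight")
```

Here `hint` holds the approved sequence, which an agent could hand to its planner as a starting point the next time the same task comes up.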
On particularly complex tasks, S2 performs better than any other model on OSWorld, a benchmark that measures an agent’s ability to use a computer operating system.
For example, S2 can complete 34.5 percent of tasks that involve 50 steps, beating OpenAI’s Operator, which can complete 32 percent. Similarly, S2 scores 50 percent on AndroidWorld, a benchmark for smartphone-using agents, while the next best agent scores 46 percent.
Victor Zhong, a computer scientist at the University of Waterloo in Canada and one of the creators of OSWorld, believes that future big AI models may incorporate training data that helps them understand the visual world and make sense of graphical user interfaces.
“This will help agents navigate GUIs with much higher precision,” Zhong says. “I think in the meantime, before such fundamental breakthroughs, state-of-the-art systems will resemble Simular in that they combine multiple models to patch the limitations of single models.”
To prepare for this column, I used Simular to book flights and scour Amazon for deals, and it seemed better than some of the open source agents I tried last year, including AutoGen and vimGPT.
But even the smartest AI agents are, it seems, still troubled by edge cases and occasionally exhibit odd behavior. In one instance, when I asked S2 to help find contact information for the researchers behind OSWorld, the agent got stuck in a loop hopping between the project page and the login for OSWorld’s Discord.
OSWorld’s benchmarks show why agents remain more hype than reality for now. While humans can complete 72 percent of OSWorld tasks, agents are foiled 38 percent of the time on complex tasks. That said, when the benchmark was introduced in April 2024, the best agent could complete only 12 percent of the tasks.
https://www.wired.com/story/simular-ai-agent-multiple-models-personalities/