MASSIVE Step Allowing Agents To Control Computers (macOS, Windows, Linux)

The OSWorld project aims to provide a consistent and thorough way to test AI agents’ performance in real computer environments, backed by research from multiple institutions. By offering a robust environment spanning multiple operating systems, together with tools like XLang for translating natural language instructions into executable code, OSWorld strengthens AI agents’ capabilities and enables effective benchmarking and evaluation.

A new project called OSWorld addresses the benchmarking problem for AI agents by providing a consistent and thorough way to test them. Backed by a research paper from the University of Hong Kong, CMU, Salesforce Research, and the University of Waterloo, the project offers an open-source solution for evaluating AI agents’ performance in real computer environments. OSWorld provides a robust environment spanning multiple operating systems, letting agents interact with real applications while their performance is measured.

The project emphasizes the importance of grounding in enabling AI agents to execute tasks successfully. By providing detailed instructions and a way for agents to interact with the environment, OSWorld aims to bridge the gap between task instructions and task execution: the agent must translate an instruction into concrete actions such as mouse movements, clicks, and keyboard input in order to complete tasks accurately, as sketched below.
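To make the idea of grounding concrete, here is a minimal sketch of the kind of action trace an agent might emit. It uses pyautogui, a common Python library for synthesizing mouse and keyboard events; the task and the screen coordinates are hypothetical.

```python
import pyautogui

# Hypothetical action trace for an instruction like
# "open the search field and look up 'quarterly report'".
pyautogui.moveTo(640, 360, duration=0.3)                # move cursor to the search field (illustrative coordinates)
pyautogui.click()                                       # click to give the field focus
pyautogui.typewrite("quarterly report", interval=0.05)  # type the query, one key at a time
pyautogui.press("enter")                                # submit
```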

OSWorld frames agents as operating within various environments, including the computer, mobile, data, and physical worlds. Through sensors such as cameras, screenshots, and ultrasonic radar, agents perceive their environment and act rationally upon it. The project defines an intelligent agent as one that perceives its environment via sensors and acts rationally on those perceptions, highlighting autonomy, reactivity, proactivity, goal-directedness, and interaction with other agents as key properties. In the computer world, this perceive-act cycle reduces to a simple loop, sketched below.
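As an illustration of that definition, here is a minimal perceive-act loop in Python. The `env` and `policy` objects are hypothetical stand-ins, not OSWorld’s actual API: `env` yields observations (e.g. a screenshot) and executes actions, while `policy` is whatever maps an observation to the next action, such as an LLM.

```python
from dataclasses import dataclass

@dataclass
class Observation:
    screenshot_png: bytes      # raw pixels from the screen "sensor"
    accessibility_tree: str    # structured view of the UI, when available

def run_agent(env, policy, max_steps=15):
    """Minimal perceive-act loop: sense the environment, pick an action,
    apply it, and repeat until the task finishes or the step budget runs out."""
    obs = env.reset()
    for _ in range(max_steps):
        action = policy(obs)           # e.g. an LLM that emits pyautogui code
        obs, done = env.step(action)   # environment executes the action
        if done:
            break
```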

The project introduces XLang, a framework for translating natural language instructions into executable code for interacting with the environment. By giving agents a way to interpret abstract user instructions, use tools, explore complex environments, and follow feedback for self-debugging, OSWorld enhances their capabilities. The team has published a series of related works on agent environments, including Instructor, Binder, Lemur, OpenAgents, Text2Reward, and OSWorld itself, to advance the field of AI agent testing and evaluation; a sketch of the core translation idea follows.
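To show the shape of that translation step, here is a hedged sketch of how a natural-language instruction might be turned into executable code with an LLM. The prompt, model name, and helper function are illustrative assumptions, not XLang’s actual implementation.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = (
    "You are a computer-use agent. Given a task instruction and a description "
    "of the current screen, reply with only executable pyautogui Python code."
)

def instruction_to_code(instruction: str, screen_state: str) -> str:
    """Hypothetical helper: translate an instruction into executable code,
    conditioned on a textual description of the current screen."""
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative choice; any capable model works
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Task: {instruction}\nScreen: {screen_state}"},
        ],
    )
    return response.choices[0].message.content
```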

In testing, OSWorld has been used to evaluate AI agents across different input modes, with GPT-4 generally outperforming the other models tested. The team found that using the accessibility tree, or a combination of screenshot and accessibility tree, yielded the best results, and that higher screenshot resolution led to improved performance. Overall, OSWorld offers a promising solution for benchmarking AI agents in real computer environments, providing a valuable tool for improving agent performance through consistent testing and evaluation.
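The input modes compared above differ only in how the observation handed to the model is assembled. A minimal sketch of that assembly step follows; the mode names and dictionary keys are illustrative assumptions, not OSWorld’s actual configuration.

```python
def build_observation(mode: str, screenshot_b64: str, a11y_tree: str) -> dict:
    """Assemble the model input for one of the input modes compared above.
    Mode names and keys here are hypothetical, for illustration only."""
    if mode == "screenshot":
        return {"image": screenshot_b64}
    if mode == "a11y_tree":
        return {"text": a11y_tree}
    if mode == "screenshot+a11y_tree":
        return {"image": screenshot_b64, "text": a11y_tree}
    raise ValueError(f"unknown input mode: {mode}")
```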