AI next steps, from words to actions
By Gemma Lara Savill
Published on January 25, 2025
2025 has started strong in the AI field, with the makers of the major models announcing a new capability: models will now be able to act.
While AI agents have existed for some time, their application has typically been limited to specific, smaller-scale tasks. The current shift is monumental because it's powered by the same large-scale, versatile Generative AI models we've become accustomed to using for information retrieval. Now, these models are getting ready to help us interact with the world through prompts, just as they already answer our questions.
An agent can be defined as "anything that can be viewed as perceiving its environment through sensors and acting upon that environment through actuators" (Russell and Norvig's classic definition). Think of it as AI with the ability to not just understand, but do.
OpenAI introduces Operator
OpenAI has recently announced its new agent, Operator. This innovative agent uses a cloud-based web browser, interacting with websites much like a human user. Instead of relying on APIs, Operator analyzes websites through screenshots, reading the raw pixels on the screen. It learns by observing the effects of its actions, creating a feedback loop of reading screenshots and performing actions until a task is complete. This allows it to perform complex tasks like booking a table online, or any other action a human can take on a website. Operator also displays the steps it takes, offering a window into its decision-making process.
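To make that feedback loop concrete, here is a purely illustrative Kotlin sketch of an observe-and-act cycle. The Browser, AgentModel and Action types are hypothetical placeholders, not OpenAI's actual API; the point is the structure of the loop: take a screenshot, let the model decide the next step, perform it, and repeat until the task is done.

```kotlin
// Hypothetical types sketching the screenshot -> decision -> action loop described above.
interface Browser {
    fun screenshot(): ByteArray          // raw pixels of the current page
    fun perform(action: Action)          // click, type, scroll...
}

sealed interface Action {
    data class Click(val x: Int, val y: Int) : Action
    data class Type(val text: String) : Action
    object Done : Action                 // the model signals the task is complete
}

interface AgentModel {
    // Decides the next action from the task description and the latest screenshot.
    fun nextAction(task: String, screenshot: ByteArray): Action
}

fun runTask(task: String, browser: Browser, model: AgentModel) {
    while (true) {
        val pixels = browser.screenshot()                 // perceive
        when (val action = model.nextAction(task, pixels)) {
            is Action.Done -> return                      // task finished
            else -> browser.perform(action)               // act, then observe again
        }
    }
}
```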
What about mobile apps? Since Operator works by reading pixels, it's conceivable that it could interact with mobile apps in a similar way. However, navigating the complexities of mobile operating systems and app ecosystems might present new challenges.
Gemini 2.0: "Enabling the agentic era"
Google has announced that Gemini 2.0 is also "Enabling the agentic era". Like Operator, Gemini 2.0 can navigate the web, reasoning through each step of its interactions. Leveraging Google Search, it can access and process up-to-the-minute information, resulting in more accurate and relevant responses.
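As a minimal sketch of what that Search grounding looks like from a developer's point of view, the snippet below calls the public Gemini REST endpoint with the google_search tool enabled. The model name, endpoint and field names reflect the public documentation at the time of writing and may change; the API key is assumed to be in an environment variable.

```kotlin
import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse

// Ask Gemini 2.0 a question with Google Search grounding enabled,
// so the answer can draw on up-to-the-minute information.
fun main() {
    val apiKey = System.getenv("GEMINI_API_KEY") ?: error("Set GEMINI_API_KEY")
    val body = """
        {
          "contents": [{ "parts": [{ "text": "What were yesterday's top tech headlines?" }] }],
          "tools": [{ "google_search": {} }]
        }
    """.trimIndent()

    val request = HttpRequest.newBuilder()
        .uri(URI.create(
            "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent?key=$apiKey"))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build()

    val response = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString())
    println(response.body()) // JSON with the grounded answer and search metadata
}
```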
But Gemini 2.0 goes a step further: it can navigate both physical (3D) and virtual worlds. Its multimodal capabilities allow it to process images, audio, video, and text, enabling it to interpret and interact with virtual environments. Imagine AI assistance while navigating a video game – that's the potential of Gemini 2.0.
Android is also getting ready for action
The developer preview of the upcoming Android 16 includes documentation for "App Functions", an API currently in beta. These functions, such as "orderFood", could allow developers to expose specific app functionality to AI agents. This could mean that when a user asks their AI to "order a pizza", the AI could pick an app that exposes a matching function and call it directly.
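What might that look like in an app's code? The sketch below is purely illustrative: the @AppFunction annotation and the types around it are hypothetical placeholders, since the beta API surface may differ, but it shows the idea of an agent-callable "orderFood" entry point that delegates to the app's existing logic.

```kotlin
// Hypothetical placeholder for the App Functions annotation; the real beta API
// (and its package) may differ.
@Target(AnnotationTarget.FUNCTION)
annotation class AppFunction

data class OrderConfirmation(val orderId: String, val etaMinutes: Int)

class OrderAppFunctions {

    // A hypothetical "orderFood" function an AI agent could discover and invoke
    // when the user asks an assistant to "order a pizza".
    @AppFunction
    fun orderFood(restaurant: String, items: List<String>): OrderConfirmation {
        // In a real app this would delegate to the existing ordering logic.
        return OrderConfirmation(orderId = "demo-123", etaMinutes = 30)
    }
}
```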
Apple is also hinting at its own advancements in this space. Recent invitations to "Explore the power of Apple Intelligence and App Intents" suggest that Apple is actively developing similar capabilities.
AI is about to get much more hands-on. As AI agents become more common on our phones and in our apps, it'll be harder to tell where human action ends and AI begins. Soon, AI will be doing more than just answering questions; it will be taking action and helping us get things done.