Gemini 2.5 Computer Use: Google's Free Browser Use Agent!

Google has just introduced its new browser-use agent from Google DeepMind, powered by Gemini 2.5 Pro. Built on the Gemini API, it can “see” and interact with web interfaces and applications: clicking, typing, and scrolling just as a human would. This new AI automation model bridges the gap between understanding and action. In this article, we will explore the key features of Gemini 2.5 Computer Use, its capabilities, and how to integrate it into your AI agent workflows.

What is Gemini 2.5 Computer Use?

Gemini 2.5 Computer Use is an AI agent that can control a browser using natural language. You describe the goal, and it performs the steps needed to complete it. Built on the new computer_use tool in the Gemini API, it analyzes screenshots of web pages or applications and then generates actions such as “click”, “type”, or “scroll”. A client such as Playwright executes these actions and returns another screenshot, and so the task progresses.

The model interprets buttons, text fields, and other interface elements to decide how to act. It inherits Gemini 2.5 Pro’s strong visual reasoning, allowing it to complete complex tasks with minimal human input. It currently focuses on browser environments and does not control desktop applications outside the browser.

Key Capabilities

  • Automated data entry and form filling on websites. The agent can locate fields, enter text, and submit forms where instructed.
  • Testing web applications and user flows by navigating the site, triggering actions, and verifying that pages render correctly.
  • Research across multiple websites. For example, it can gather information about products, prices, or reviews from several e-commerce pages and summarize the results.

How to Access Gemini 2.5 Computer Use?

The experimental Gemini 2.5 Computer Use capabilities are now publicly available to all developers. Developers simply need to sign up for the Gemini API via AI Studio or Vertex AI and then request access to the computer-use model. Google provides documentation, starter samples, and reference implementations that you can try.

You would also set up a browser automation environment such as Playwright. The steps are as follows:

  • Sign up for the Gemini API via AI Studio or Vertex AI.
  • Request access to the computer-use model.
  • Review Google’s documentation, starter samples, and reference implementations.

As an example, the Gemini API documentation provides a four-step agent loop in Python, using the Google GenAI SDK and Playwright for browser automation.

Also read: How to access and use the Gemini API?

Example: Basic Setup

Here is a general example of what your code setup looks like:

# Load environment variables
from dotenv import load_dotenv
load_dotenv()

# Initialize Gemini API client
from google import genai
client = genai.Client()

# Set screen dimensions (recommended by Google)
SCREEN_WIDTH = 1440
SCREEN_HEIGHT = 900

Environment Setup

We start by loading the environment variables for the API credentials and initializing the Gemini client. The screen dimensions recommended by Google are defined and later used to convert normalized coordinates into the actual pixels needed for UI actions.
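For illustration, here is a minimal sketch of that conversion. The denormalize helper is not part of Google's sample (the name is mine); the 0–1000 grid matches the action handlers shown later in this article:

# Convert the model's normalized 0-1000 coordinates into real pixels
def denormalize(x, y):
    actual_x = int(x / 1000 * SCREEN_WIDTH)
    actual_y = int(y / 1000 * SCREEN_HEIGHT)
    return actual_x, actual_y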

Next, the code sets up Playwright to automate your browser:

from playwright.sync_api import sync_playwright

playwright = sync_playwright().start()
browser = playwright.chromium.launch(
    headless=False,
    args=[
        '--disable-blink-features=AutomationControlled',
        '--disable-dev-shm-usage',
        '--no-sandbox',
    ],
)
context = browser.new_context(
    viewport={"width": SCREEN_WIDTH, "height": SCREEN_HEIGHT},
    user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/131.0.0.0 Safari/537.36",
)
page = context.new_page()

Starting the Agent loop

Here we launch the Chromium browser with anti-detection flags so that websites are less likely to recognize the automation. We then define a realistic viewport and user agent to imitate a normal user, and create a new page for navigation and interaction.

With the browser set up and ready, the model is given the user’s goal together with an initial screenshot:

from google.genai.types import Content, Part

USER_PROMPT = "Go to BBC News and find today's top technology headlines"

initial_screenshot = page.screenshot(type="png")

contents = [
    Content(role="user", parts=[
        Part(text=USER_PROMPT),
        Part.from_bytes(data=initial_screenshot, mime_type="image/png"),
    ])
]

USER_PROMPT defines the natural language task the agent will perform. The initial screenshot captures the browser’s current state and is sent along with the prompt. Both are wrapped in a Content object that is later passed to the Gemini model.

Finally, the agent loop starts: it sends the current state to the model and carries out the actions that come back.

The computer_use tool prompts the model to generate function calls, which are then executed in the browser environment. thinking_config keeps the intermediate reasoning steps, providing transparency that can be useful later for debugging or understanding the agent’s decision-making process.

from google.genai import types

config = types.GenerateContentConfig(
    tools=[types.Tool(
        computer_use=types.ComputerUse(
            environment=types.Environment.ENVIRONMENT_BROWSER
        )
    )],
    thinking_config=types.ThinkingConfig(include_thoughts=True),
)
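The loop body itself is not shown above, so here is a minimal sketch of how the pieces could be tied together. It assumes the helper functions execute_function_calls and get_function_responses described later in this article, and the model ID and turn limit are assumptions you should check against Google’s current documentation:

MODEL_ID = "gemini-2.5-computer-use-preview-10-2025"  # assumed model name; verify in the docs
MAX_TURNS = 15  # safety limit so the loop always terminates

for turn in range(MAX_TURNS):
    # 1. Send the goal, screenshots, and action history to the model
    response = client.models.generate_content(
        model=MODEL_ID, contents=contents, config=config
    )
    candidate = response.candidates[0]
    # Keep the model's turn (thoughts + function calls) in the history
    contents.append(candidate.content)

    # 2. Stop when the model returns no more function calls
    function_calls = [p.function_call for p in candidate.content.parts if p.function_call]
    if not function_calls:
        print("Task finished:", response.text)
        break

    # 3. Execute the requested UI actions in the browser
    results = execute_function_calls(candidate, page, SCREEN_WIDTH, SCREEN_HEIGHT)

    # 4. Send a fresh screenshot and the current URL back as function responses
    function_responses = get_function_responses(page, results)
    contents.append(types.Content(
        role="user",
        parts=[types.Part(function_response=fr) for fr in function_responses],
    ))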

How does it work?

Gemini 2.5 Computer Use runs as a closed-loop agent. You give it a goal and a screenshot, it predicts the next action, the client executes it, and it then checks the updated screenshot to decide what to do next. This feedback loop lets Gemini see, reason, and act much like a human navigating a browser. The whole process is powered by the computer_use tool in the Gemini API.

The inputs that the model receives in every iteration

In each iteration, the model receives:

  • User request: The natural language objective or instruction (for example: “Find the top news headlines”).
  • Current screenshot: An image of the browser or application window in its current state.
  • Action history: A record of the actions taken so far (for context).

The model analyzes the screenshot and the user’s goal, then generates one or more function calls, each representing a UI action. For example:

{"name": "click_at", "args": {"x": 400, "y": 600}}

This would instruct the client to click at those coordinates. Each function call can represent an action such as clicking, typing, scrolling, or navigating.

Client-Side Execution

The client program (for example, using Playwright’s mouse and keyboard APIs) performs these actions in the browser. After each action, it captures a new screenshot and sends it back to the model as a function response. The model uses this feedback to decide what to do next. The loop repeats until the agent has finished the task or there are no further actions to perform. This is why Gemini 2.5 Computer Use is described as a closed-loop structure. The steps are as follows:

  1. Input goal and screenshot: The model receives the user’s instructions (e.g. “Find the top headlines”) and a screenshot of the current browser state.
  2. Generate actions: The model proposes one or more function calls that correspond to UI actions, using the computer_use tool.
  3. Execute actions: The client program performs these function calls in the browser.

As an example, here is a function that executes the function calls:

def execute_function_calls(candidate, page, screen_width, screen_height):
    results = []
    for part in candidate.content.parts:
        if part.function_call:
            fname = part.function_call.name
            args = part.function_call.args
            if fname == "click_at":
                # Convert normalized (0-1000) coordinates to actual pixels
                actual_x = int(args["x"] / 1000 * screen_width)
                actual_y = int(args["y"] / 1000 * screen_height)
                page.mouse.click(actual_x, actual_y)
            elif fname == "type_text_at":
                actual_x = int(args["x"] / 1000 * screen_width)
                actual_y = int(args["y"] / 1000 * screen_height)
                page.mouse.click(actual_x, actual_y)
                page.keyboard.type(args["text"])
            # ...other supported actions...
            # Record (name, result, extra_fields) for get_function_responses below
            results.append((fname, {}, {}))
    return results

This function parses the FunctionCall items returned by the model and performs each action in the browser, for example click_at or type_text_at. It converts normalized coordinates (0–1000) into actual pixels based on the screen size. The same logic extends to other actions such as typing, scrolling, navigation, and drag-and-drop.
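As a rough sketch of how a few more of those branches might look, here is a standalone helper whose branches could be folded into execute_function_calls above. The argument names (url, direction, keys) mirror the documented action names but should be treated as assumptions:

def execute_extra_action(fname, args, page, screen_width, screen_height):
    # Additional branches that could be added to execute_function_calls above
    if fname == "navigate":
        page.goto(args["url"])             # assumed argument name
    elif fname == "go_back":
        page.go_back()
    elif fname == "go_forward":
        page.go_forward()
    elif fname == "scroll_document":
        # assumed argument: direction of the scroll ("up" or "down")
        delta = -500 if args.get("direction") == "up" else 500
        page.mouse.wheel(0, delta)
    elif fname == "key_combination":
        page.keyboard.press(args["keys"])  # e.g. "Control+A"; assumed argument name
    elif fname == "hover_at":
        page.mouse.move(int(args["x"]) / 1000 * screen_width,
                        int(args["y"]) / 1000 * screen_height)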

  4. Capture feedback: After the actions are executed, a new screenshot and the current URL are captured and sent back to the model:
def get_function_responses(page, results):
    screenshot_bytes = page.screenshot(type="png")
    current_url = page.url
    function_responses = []

    for name, result, extra_fields in results:
        response_data = {"url": current_url}
        response_data.update(result)
        response_data.update(extra_fields)
        function_responses.append(
            types.FunctionResponse(
                name=name,
                response=response_data,
                parts=[types.FunctionResponsePart(
                    inline_data=types.FunctionResponseBlob(
                        mime_type="image/png",
                        data=screenshot_bytes
                    )
                )]
            )
        )
    return function_responses

Here the new browser state is packaged into FunctionResponse objects, which the model uses to decide what to do next. The loop continues until the model no longer returns any function calls, at which point the task is complete.

Read also: Top 7 Agents using your computer

The Agent Loop

With the computer_use tool loaded, a typical agent loop follows these steps:

  1. Send a request to the model: Include the user’s goal and a screenshot of the current browser state in the API call.
  2. Receive the model response: The model returns a response containing text and/or one or more function calls.
  3. Execute actions: The client code parses each function call and performs the action in the browser.
  4. Capture and send feedback: After execution, the client takes a new screenshot and records the current URL. It packs these into FunctionResponse objects and sends them back to the model as the next user message. This tells the model the result of its action so it can plan the next step.

This process runs automatically in a loop. When the model stops generating new function calls, it indicates that the task is complete. At that point, it returns any final text output, such as a summary of what it achieved. In most cases, the agent goes through several cycles before either completing the goal or reaching the turn limit.

Supported Actions

The computer_use tool ships with a set of built-in actions. The basic set covers typical web interactions, including:

  • open_web_browser: Initializes the browser before the agent loop starts (usually handled by the client).
  • click_at: Clicks at specific (x, y) coordinates on the page.
  • type_text_at: Clicks at a point and types the given text, optionally pressing Enter afterwards.
  • navigate: Opens a new URL in the browser.
  • go_back / go_forward: Moves backward or forward in the browser history.
  • hover_at: Moves the mouse to a specific point to trigger hover effects.
  • scroll_document / scroll_at: Scrolls the entire page or a specific element.
  • key_combination: Simulates keyboard shortcuts.
  • wait_5_seconds: Pauses execution, useful for waiting for animations or page loads.
  • drag_and_drop: Clicks and drags an element to another location on the page.

Google’s documentation states that the sample implementation includes the three most common actions: open_web_browser, click_at, and type_text_at. You can extend this by adding any other actions you need, or exclude actions that do not fit your workflow (see the sketch below).
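As a sketch of how such exclusions might be configured, assuming your SDK version exposes the excluded_predefined_functions field on the ComputerUse tool (the field name is taken from Google’s documentation, but verify it for your SDK version):

# Allow only the actions our client implements; exclude the rest
config = types.GenerateContentConfig(
    tools=[types.Tool(
        computer_use=types.ComputerUse(
            environment=types.Environment.ENVIRONMENT_BROWSER,
            # field name assumed from the docs; check your SDK version
            excluded_predefined_functions=["drag_and_drop", "key_combination"],
        )
    )],
)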

Performance and Benchmarks

Gemini 2.5 Computer Use performs strongly on UI control tasks. In Google’s tests, it scored above 70% with roughly 225 ms of latency, outperforming other models on web and mobile benchmarks such as browsing websites and completing application workflows.

In practice, the agent can chain tasks such as filling in forms and retrieving data. Independent benchmarks position it as among the strongest and fastest publicly available AI models for simple browser automation. Its performance comes from Gemini 2.5 Pro’s visual reasoning and an optimized API pipeline. Since the model is still in preview, you should supervise its actions, as mistakes are possible.


Conclusion

Gemini 2.5 Computer Use is a significant step forward for AI agents, allowing them to interact effectively and efficiently with real user interfaces. With it, developers can automate tasks such as web browsing, data entry, or data extraction with high accuracy and speed.

Now publicly available, it offers developers a way to safely experiment with Gemini 2.5 Computer Use’s capabilities and adapt them to their own workflows. Overall, it provides a flexible framework for building the next generation of intelligent assistants and powerful automation pipelines across a variety of use cases and domains.

