Automation
Web automation is a comprehensive topic: program repetitive workflows on graphical user interfaces, wire up on-page tutorials, or enable AI agents to interact with the web. The latter, web AI agents, are becoming increasingly popular. At Webfuse, we therefore focus on providing powerful tools for agent automation.
Conventionally, integrating agents from providers like Cognigy or ElevenLabs represents yet another feature implementation. It requires invasive changes to the underlying web application’s code, plus adjacent steps such as deployment. Here’s the thing: scalable agents should not depend on hardcoded instructions for every individual web page. Instead, smart agents draw context from the current web page at runtime (universality). They can subsequently act in the page to respond to a specific prompt. Webfuse bridges the pain points of conventional agent implementation (invasive code, integration testing, the need for deployment, …). Serve an existing website through Webfuse to augment it with an agent widget, and only if needed. Metaphorically speaking, Webfuse provides agent applications with ‘eyes’ to see a page and ‘hands’ to act in a page. The agent is the brain (logic) in between.
| | Conventional | With Webfuse |
|---|---|---|
| Persistent agent widget across page navigations | ❌ | ✔️ |
| Secure and isolated agent evaluation in web client | ❌ | ✔️ |
| Standalone agent deployment in the cloud | ❌ | ✔️ |
| Separate agent and website code | ❌ | ✔️ |
| Reusable agent implementation (Extension) | ❌ | ✔️ |
Agent Lifecycle
The high-level lifecycle of a basic web AI agent looks as follows:
1. See the current web page, which corresponds to serializing it somehow (often referred to as taking a ‘snapshot’).
2. Query a suitable AI model (commonly an LLM) with the snapshot and an interaction problem prompt (e.g., “I want to log in”).
3. Receive suggestions on how to act in the page to solve the interaction problem.
4. Act in the page according to the suggestions.
Steps 3 and 4 can optionally be bridged through the Model Context Protocol (MCP).
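The steps above can be sketched as a minimal loop. All helper names here (`takeSnapshot`, `queryModel`, `applyAction`) are illustrative stubs, not part of the Webfuse API:

```javascript
// Minimal sketch of the four lifecycle steps, using stubbed helpers.
// None of these function names belong to the Webfuse API.

// Step 1: see — serialize the current page into a snapshot.
function takeSnapshot() {
  return '<body><form><input name="destination"></form></body>';
}

// Steps 2 and 3: query the model with snapshot + prompt, receive suggestions.
// A real agent would call an LLM here; we return a canned suggestion.
function queryModel(snapshot, prompt) {
  return [{ action: 'type', selector: 'input[name="destination"]', text: 'Amsterdam' }];
}

// Step 4: act — apply each suggested action to the page (recorded here).
const log = [];
function applyAction(suggestion) {
  log.push(`${suggestion.action} → ${suggestion.selector}`);
}

for (const suggestion of queryModel(takeSnapshot(), 'I want to book a flight')) {
  applyAction(suggestion);
}
console.log(log); // [ 'type → input[name="destination"]' ]
```

In a real agent, the loop would repeat: after acting, a fresh snapshot is taken and fed back to the model until the interaction problem is solved.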
If you decide to develop your own agent, with custom logic, state, and model communication, the Webfuse scope stretches to the AI model interface:
Automation API
Webfuse Extensions provide a uniform interface to augment an existing website with custom logic and UI. Tailored for agentic AI, the Automation API provides comprehensive tools to see and act in a page right from an extension. Agents are technically able to see web pages by vision, just like human users, but also by DOM, which is the web browser’s internal representation of the rendered page. Both means of context come with different advantages and disadvantages; which approach to use depends on the particular agent problem and environment. The Automation API has been designed for both DOM- and vision-based agent logic.
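Whether DOM- or vision-based, an agent’s loop differs mainly in the snapshot it takes and the way actions are addressed: CSS selectors in one case, pixel coordinates in the other. A minimal sketch of a dispatcher illustrating this (the `dispatch` helper and suggestion shape are assumptions for illustration, not part of the Automation API):

```javascript
// Illustrative dispatcher: routes a model suggestion to the matching act
// call, whether it targets a CSS selector (DOM) or coordinates (vision).
// The `act` object below is a stub standing in for the Automation API.
function dispatch(act, suggestion) {
  const target = suggestion.selector ?? suggestion.coordinates;
  if (suggestion.action === 'type') {
    act.type(target, suggestion.text);
  } else if (suggestion.action === 'click') {
    act.click(target);
  }
}

// Stub that records calls instead of acting in a real page.
const calls = [];
const act = {
  type: (target, text) => calls.push(['type', target, text]),
  click: (target) => calls.push(['click', target]),
};

dispatch(act, { action: 'type', selector: 'input[name="destination"]', text: 'Amsterdam' });
dispatch(act, { action: 'click', coordinates: [325, 570] });

console.log(calls.length); // 2
```

The rest of the agent logic (model prompting, state) stays identical across both approaches.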
DOM-based Automation
DOM-based agents see the web through a page’s latest document object model (DOM) state:
```javascript
browser.webfuseSession
  .automation
  .see
  .domSnapshot()
```

```html
<body>
  <h1>Book a Flight</h1>
  <form>
    <input placeholder="Destination Airport" name="destination">
    <button type="submit" id="btn-submit">Search</button>
  </form>
</body>
```

As a result, the agent can act in the page through CSS selectors:
```javascript
browser.webfuseSession
  .automation
  .act
  .type('input[name="destination"]', 'Amsterdam')

browser.webfuseSession
  .automation
  .act
  .click('#btn-submit')
```

Vision-based Automation
Vision-based agents see the web through a rendered page’s graphical user interface (GUI):
```javascript
browser.webfuseSession
  .automation
  .see
  .guiSnapshot()
```

<IMAGE>

In turn, the agent can act in the page through pixel coordinates:
```javascript
browser.webfuseSession
  .automation
  .act
  .type([350, 405], 'Amsterdam')

browser.webfuseSession
  .automation
  .act
  .click([325, 570])
```

Here’s an instance of vision-based artefacts along the agent cycle for a Wikipedia search:
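Pixel coordinates come with one practical subtlety: vision models often reason over a downscaled screenshot, so coordinates they return must be mapped back to the real viewport before acting. A minimal sketch, assuming made-up resolutions (none of these numbers are Webfuse defaults):

```javascript
// Scale a coordinate pair from the model's image size to the viewport size.
// The sizes used below are illustrative assumptions.
function scaleToViewport([x, y], modelSize, viewportSize) {
  return [
    Math.round((x * viewportSize.width) / modelSize.width),
    Math.round((y * viewportSize.height) / modelSize.height),
  ];
}

const modelSize = { width: 512, height: 512 };
const viewportSize = { width: 1024, height: 768 };

const point = scaleToViewport([175, 270], modelSize, viewportSize);
console.log(point); // [ 350, 405 ]
```

The scaled pair can then be passed directly to coordinate-based act calls such as `type` or `click`.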