Automation API
The Automation API is available from the browser.webfuseSession.automation namespace.
- Perception methods (‘see’ 👁️) are grouped under automation.see.
- Actuation methods (‘act’ 👆️) are grouped under automation.act.
- Auxiliary tool methods (‘tool’ 🛠️) are grouped under automation.tool.
- Other methods and properties are available directly from automation.
Targeting
The Automation API can be used with both DOM-based and vision-based agents. To support both, the target of an action can be given as any of the following types:
type Target = HTMLElement | CSSSelector | Point | WebfuseID | MetaTarget;

type CSSSelector = string;
type Point = [number, number]; // [x, y]
type WebfuseID = string; // Unique ID of an element per Tab in a Webfuse Session
enum MetaTarget {
  POINTER, // Current virtual pointer position
  FOCUS    // Currently focused element
}

// By element reference
browser.webfuseSession
  .automation
  .act
  .click(document.getElementById('cta'))

// By CSS selector
browser.webfuseSession
  .automation
  .act
  .click('main > button.cta')

// By point coordinate
browser.webfuseSession
  .automation
  .act
  .click([420, 890])

// By meta target
browser.webfuseSession
  .automation
  .act
  .click(
    browser.webfuseSession.automation.Target.POINTER
  )

Cross-Shadow and -Frame Targeting (Webfuse Exclusive)
Relevant elements may, in some cases, be hidden inside shadow DOM or even iframes, for example, if the agent-enhanced web page embeds a checkout component from a third-party provider. Targeting with the Automation API can pierce shadow root and iframe boundaries. Point coordinates that land on an iframe lead to descending into the iframe and targeting with normalized coordinates until a non-frame element is found. In standard CSS, selectors cannot cross shadow or frame boundaries by design. To enable cross-shadow and cross-frame targeting via CSS selectors, Webfuse treats both types of DOM nodes as ordinary container elements. For shadow DOM, the pseudo container tag name shadow-root is introduced for this purpose. As a result, shadow roots and frames act as implicit container tags, just like div or section.
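To illustrate how shadow roots and frames behave like ordinary containers in a selector, the following sketch composes a descendant selector from path segments. The helper composeSelector and its Segment shape are hypothetical and not part of the Automation API; only the shadow-root pseudo tag name and the use of iframe as a container are taken from the description above.

```typescript
// Hypothetical helper (not part of the Automation API): builds a descendant
// selector in which shadow roots and iframes appear as plain container tags.
type Segment =
  | { kind: 'element'; selector: string } // ordinary selector part
  | { kind: 'shadow' }                    // descend into a shadow root
  | { kind: 'frame'; selector?: string }; // descend into an iframe

function composeSelector(path: Segment[]): string {
  return path
    .map((segment) => {
      switch (segment.kind) {
        case 'shadow':
          return 'shadow-root';
        case 'frame':
          return segment.selector ?? 'iframe';
        case 'element':
          return segment.selector;
      }
    })
    .join(' ');
}

// A button inside the shadow DOM of <my-component>:
const crossShadow = composeSelector([
  { kind: 'element', selector: 'body my-component' },
  { kind: 'shadow' },
  { kind: 'element', selector: 'button#submit' },
]);

// A button inside an embedded frame:
const crossFrame = composeSelector([
  { kind: 'element', selector: 'body' },
  { kind: 'frame' },
  { kind: 'element', selector: 'button#submit' },
]);
```

The resulting strings could then be passed as a target, for instance to act.click().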
Cross-Shadow Targeting
Shadow DOM subtrees in the browser are usually invisible upon parent DOM serialization (.outerHTML/.innerHTML). The position of a shadow root is marked with # Shadow in the following example:

<div>
  <custom-element>
    # Shadow
    <b>Slotted</b>
  </custom-element>
</div>

Webfuse, however, implies a shadow root element, which is reflected in DOM snapshots taken with Webfuse-native perception. For the example above, this looks as follows:

<div>
  <custom-element>
    <shadow-root>
      <strong>Shadow</strong>
      <p>
        <slot></slot>
      </p>
    </shadow-root>
    <b>Slotted</b>
  </custom-element>
</div>

Now, the shadow root and elements within the shadow DOM can be targeted with valid CSS selector syntax:

// Cross-shadow
browser.webfuseSession
  .automation
  .act
  .click('body my-component shadow-root button#submit')

Cross-Frame Targeting
To isolate embedded DOMs from each other, subtrees beneath iframes are usually hidden upon parent DOM serialization (.outerHTML/.innerHTML):

<html>
  <head></head>
  <body>
    <h1>Parent</h1>
    <iframe src="/child"></iframe>
  </body>
</html>

Webfuse treats frames as native container elements of the parent DOM. Optionally, DOM snapshots taken with Webfuse-native perception inline iframe contents. For the example above, this looks as follows:

<html>
  <head></head>
  <body>
    <h1>Parent</h1>
    <iframe src="/child">
      <html>
        <h1>Child</h1>
      </html>
    </iframe>
  </body>
</html>

Now, elements within the frame can be targeted with valid CSS selector syntax:

// Cross-frame
browser.webfuseSession
  .automation
  .act
  .click('body iframe button#submit')

.act Actuation Scope
Actuation refers to the process of interacting with a web page to change its state. For human users, this typically involves a few common actions, such as clicking, typing, or scrolling.
In the browser, an action is actually a sequence of multiple events fired on a specific target element. For example, a single “click” triggers a chain of events including, in this order:
- mousedown
- mouseup
- click
Webfuse emulates human actuation as closely as possible. When an action is performed via Webfuse, all corresponding events are dispatched not only on the target element but also along the trajectory of the mouse pointer movement, ensuring the web application reacts exactly as it would for a real user.
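As a rough illustration of what dispatching events along a pointer trajectory can mean, the sketch below plans the events for a click that starts at one point and ends at another. It is not Webfuse's implementation: the function planClick, the PlannedEvent shape, and the straight-line interpolation are assumptions made for this example only.

```typescript
type Point = [number, number]; // [x, y]

interface PlannedEvent {
  type: string; // DOM event name
  at: Point;    // pointer position at which the event would fire
}

// Hypothetical planner: mousemove events on a straight line from `from` to
// `to`, followed by the press, release, and click events at the target.
function planClick(from: Point, to: Point, steps: number): PlannedEvent[] {
  const events: PlannedEvent[] = [];
  for (let i = 1; i <= steps; i++) {
    const t = i / steps; // interpolation factor, reaches 1 at the target
    events.push({
      type: 'mousemove',
      at: [
        Math.round(from[0] + (to[0] - from[0]) * t),
        Math.round(from[1] + (to[1] - from[1]) * t),
      ],
    });
  }
  for (const type of ['mousedown', 'mouseup', 'click']) {
    events.push({ type, at: to });
  }
  return events;
}

const plan = planClick([0, 0], [100, 400], 4);
console.log(plan.map((event) => event.type).join(' > '));
```

A real pointer path would typically be curved and timed; the linear path only keeps the sketch short.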
act.mouseMove()
Move the virtual mouse pointer.

browser.webfuseSession.automation.act.mouseMove(
  target: Target,
  options?: {
    persistent?: boolean;
  }
): Promise<void>

Parameters
target
- Mouse pointer target.
[options]
- Mouse move options:
  - [persistent]: Whether to keep the pointer on screen after it was moved. By default, the pointer fades out after some time.
Returns
A promise that resolves once the mouse pointer has been moved.
Example
// Move the virtual mouse pointer to pixel position x=100 and y=400:
await browser.webfuseSession
  .automation
  .act
  .mouseMove([100, 400]);

act.scroll()
Scroll the deepest scrollable element under the target by the given amount in the given direction.

browser.webfuseSession.automation.act.scroll(
  target: Target,
  direction: 'vertical' | 'horizontal',
  amount: number
): Promise<void>

Parameters
target
- Scroll(able) target.
direction
- The direction to scroll.
amount
- The number of pixels to scroll.
Returns
A promise that resolves once scrolling has ended.
Example
// Scroll the body element 100 pixels up:
await browser.webfuseSession
  .automation
  .act
  .scroll('body', 'vertical', 100);

act.click()
Perform a mouse click (with the left, i.e., primary, button by default).

browser.webfuseSession.automation.act.click(
  target: Target,
  options?: {
    button?: 'left' | 'middle' | 'right';
    moveMouse?: boolean;
    scrollIntoView?: boolean;
  }
): Promise<void>

Parameters
target
- Click target.
[options]
- Click options:
  - [button]: Mouse button to click (left by default).
  - [moveMouse]: Whether to move the virtual mouse pointer to the target center before performing the action (false by default).
  - [scrollIntoView]: Whether to scroll the target element into view before performing the action (true by default).
Returns
A promise that resolves once the click has been performed.
Example
// Click the fourth element in the second frame.
// Move the virtual mouse pointer to the center of the target element beforehand:
await browser.webfuseSession
  .automation
  .act
  .click('2-4', {
    moveMouse: true,
  });

act.type()
browser.webfuseSession.automation.act.type(
  target: Target,
  text: string,
  options?: {
    followFocus?: boolean;
    overwrite?: boolean;
    timePerChar?: number;
    moveMouse?: boolean;
    scrollIntoView?: boolean;
  }
): Promise<void>

Type text into an element. Typing is natural, i.e., as if a human pressed a sequence of keys.
Parameters
target
- Typing target.
text
- Text to type.
[options]
- Type options:
  - [followFocus]: Whether to type into the element that has focus even if focus changed to a different target (true by default).
  - [overwrite]: Whether to overwrite the current contents of the target input (true by default).
  - [timePerChar]: Expected mean time to press each character in ms (100 by default).
  - [moveMouse]: Whether to move the virtual mouse pointer to the target center before performing the action (false by default).
  - [scrollIntoView]: Whether to scroll the target element into view before performing the action (true by default).
Returns
A promise that resolves once the text has been typed.
Example
// Type 'Jane Doe' into the currently focused element (suppose it is an input field):
await browser.webfuseSession
  .automation
  .act
  .type(browser.webfuseSession.automation.Target.FOCUS, 'Jane Doe');

act.select()
browser.webfuseSession.automation.act.select(
  target: Target,
  value: string,
  options?: {
    moveMouse?: boolean;
    scrollIntoView?: boolean;
  }
): Promise<void>

Select an option of a dropdown element by value.
Parameters
target
- Select target.
value
- Value to select (according to the value attribute).
[options]
- Select options:
  - [moveMouse]: Whether to move the virtual mouse pointer to the target center before performing the action (false by default).
  - [scrollIntoView]: Whether to scroll the target element into view before performing the action (true by default).
Returns
A promise that resolves once the value has been selected.
Example
// Select 'Netherlands' from a country dropdown list:
await browser.webfuseSession
  .automation
  .act
  .select('select#country', 'netherlands', {
    scrollIntoView: true
  });

act.keyPress()
browser.webfuseSession.automation.act.keyPress(
  target: Target,
  key: 'a' | 'b' | ... | 'Z' | '.' | '!' | ... | 'Enter' | 'ArrowUp' | ...,
  options?: {
    altKey?: boolean;
    ctrlKey?: boolean;
    metaKey?: boolean;
    shiftKey?: boolean;
    moveMouse?: boolean;
    scrollIntoView?: boolean;
  }
): Promise<void>

Press a key on an element. The key argument must be equivalent to a supported KeyboardEvent.key property, which is either:
- a control key, e.g., Enter to submit forms, or ArrowUp to scroll up a page, or
- a printable key, i.e., a symbol that could be typed into an input field, e.g., a, B, !, or >.

Supported control keys:
Tab, Enter, Backspace, Delete, Insert, ArrowUp, ArrowDown, ArrowLeft, ArrowRight, Home, End, PageUp, PageDown, Escape, CapsLock, Shift, Control, Alt, Meta, F1, F2, F3, F4, F5, F6, F7, F8, F9, F10, F11, F12, Numpad0, Numpad1, Numpad2, Numpad3, Numpad4, Numpad5, Numpad6, Numpad7, Numpad8, Numpad9, NumpadAdd, NumpadSubtract, NumpadMultiply, NumpadDivide, NumpadDecimal, NumpadEnter

Supported printable keys:
a b c d e f g h i j k l m n o p q r s t u v w x y z
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
0 1 2 3 4 5 6 7 8 9
, < . > / ? ; : ! @ # $ % ^ & * ~ - _ = + ( ) [ ] { } \ | ` ' " ␠ \t \n \r

Parameters
target
- Key press target.
key
- Key to press.
[options]
- Key press options:
  - [altKey], [ctrlKey], [metaKey], [shiftKey]: Whether to hold down the respective modifier key (Alt, Ctrl, Meta, or Shift) during the press.
  - [moveMouse]: Whether to move the virtual mouse pointer to the target center before performing the action (false by default).
  - [scrollIntoView]: Whether to scroll the target element into view before performing the action (true by default).
Returns
A promise that resolves once the key has been pressed.
Example
// Hit 'Enter' key on an identifiable submit button:
await browser.webfuseSession
  .automation
  .act
  .keyPress('#submit', 'Enter');

act.textSelect()
browser.webfuseSession.automation.act.textSelect(
  target: Target,
  text: string,
  options?: {
    occurrence?: number;
    moveMouse?: boolean;
    scrollIntoView?: boolean;
  }
): Promise<void>

Select contiguous text in the page.
Parameters
target
- Text content selection target.
text
- Text to select (empty text also removes any existing selection).
[options]
- Text selection options:
  - [occurrence]: Zero-based index of the occurrence to select if the text appears more than once (0, i.e., the first occurrence, by default).
  - [moveMouse]: Whether to move the virtual mouse pointer to the target center before performing the action (false by default).
  - [scrollIntoView]: Whether to scroll the target element into view before performing the action (true by default).
Returns
A promise that resolves once the selection has been applied.
Example
// Select the third occurrence of the text 'ipsum' in the main element:
await browser.webfuseSession
  .automation
  .act
  .textSelect('main', 'ipsum', {
    occurrence: 2
  });

.see Perception Scope
Perception is the process of interpreting the current state of a web page or, in the case of web automation, of a web application. While human users primarily perceive websites via graphical user interfaces (GUIs), computer agents can readily interpret web application state by more than visual means. A prominent technical representation of non-visual state is the document object model (DOM), a web browser’s runtime model of a web application.
By analyzing the DOM, the agent can programmatically understand:
- Structural hierarchy (how elements are nested).
- Element attributes (IDs, classes, and accessibility labels).
- Metadata not immediately visible to the naked eye.
see.textSelection()
browser.webfuseSession.automation.see.textSelection(): Promise<string>

Get the currently selected text in the page.
Returns
A promise that resolves with the currently selected text (or an empty string if nothing is selected).
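Because an empty string signals that nothing is selected, callers may want to normalize the result. The following is a minimal sketch under that assumption; the helper readSelection and the stand-in object fakeSee are hypothetical, and SeeScope only mirrors the single signature documented above.

```typescript
// Minimal structural type covering only the call used in this sketch.
interface SeeScope {
  textSelection(): Promise<string>;
}

// Hypothetical helper: resolves with the trimmed selection, or null if
// nothing (or only whitespace) is selected.
async function readSelection(see: SeeScope): Promise<string | null> {
  const text = (await see.textSelection()).trim();
  return text.length > 0 ? text : null;
}

// Stand-in for browser.webfuseSession.automation.see, so that the sketch
// also runs outside of a Webfuse Session.
const fakeSee: SeeScope = {
  textSelection: async () => '  ipsum ',
};

readSelection(fakeSee).then((selection) => console.log(selection));
```

Inside a Webfuse Session, the call would presumably be readSelection(browser.webfuseSession.automation.see).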
see.domSnapshot()
browser.webfuseSession.automation.see.domSnapshot(options?: {
  root?: Target; // Snapshot tree root
  crossframe?: boolean;
  crossshadow?: boolean;
  inlineWebfuseIDs?: boolean;
  revealMaskedElements?: boolean;
}): Promise<string>

Take a web page snapshot, which is a time-sensitive serialization of the current web page state.
Parameters
[options]
- DOM snapshot options:
  - [root]: Snapshot root element that scopes the DOM snapshot to a specific subtree (integrates with targeting; body by default).
  - [crossframe] (Webfuse Exclusive): Whether to include iframe subtrees (false by default).
  - [crossshadow] (Webfuse Exclusive): Whether to include shadow DOM subtrees (true by default).
  - [inlineWebfuseIDs]: Whether to assign each element its Webfuse ID via the HTML pseudo attribute wf-id (false by default).
  - [revealMaskedElements]: Whether to include masked elements (false by default).
Returns
A promise that resolves with the DOM snapshot, i.e., DOM state serialized as HTML.
Example
// Take a DOM snapshot, i.e., serialize the current state of the DOM as HTML, descending only from the main element.
// To each element, assign its unique Webfuse ID via `wf-id` pseudo attribute.
// Do not include shadow DOMs in snapshot:
await browser.webfuseSession
  .automation
  .see
  .domSnapshot({
    root: 'main',
    crossshadow: false,
    inlineWebfuseIDs: true,
  });

Cross-Frame Snapshots
Cross-frame snapshots (crossframe = true) will have all iframe contents inlined, e.g.:

<html>
  <head></head>
  <body>
    <h1>Parent</h1>
    <iframe src="/child">
      <html>
        <h1>Child</h1>
      </html>
    </iframe>
  </body>
</html>

Cross-Shadow Snapshots
Cross-shadow snapshots (crossshadow = true) will have all shadow root contents inlined, e.g.:

<div>
  <custom-element>
    <shadow-root>
      <strong>Shadow</strong>
      <p>
        <slot></slot>
      </p>
    </shadow-root>
    <b>Slotted</b>
  </custom-element>
</div>

Snapshots and Agentic AI
The web AI agent lifecycle is a continuous loop of perceiving the current state of a website, prompting the model for actuation suggestions, and acting out these suggestions. Snapshots paired with Webfuse IDs are a robust means of targeting, even after destructive snapshot processing.
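The perceive, prompt, and act loop described above can be sketched as follows. This is an illustration under stated assumptions, not a prescribed agent design: the Automation type only mirrors the two calls used here, while SuggestAction (a model call that returns the Webfuse ID to click, or null when the task is done) and runAgent are hypothetical names introduced for this example.

```typescript
// Minimal structural type covering only the calls used in this sketch.
interface Automation {
  see: { domSnapshot(options?: { inlineWebfuseIDs?: boolean }): Promise<string> };
  act: { click(target: string): Promise<void> };
}

// Hypothetical model interface: resolves with the Webfuse ID of the element
// to click next, or with null once the task is considered complete.
type SuggestAction = (prompt: string, snapshot: string) => Promise<string | null>;

// Runs the loop for at most `maxSteps` actions and resolves with the number
// of actions that were performed.
async function runAgent(
  automation: Automation,
  suggest: SuggestAction,
  prompt: string,
  maxSteps = 10,
): Promise<number> {
  let steps = 0;
  while (steps < maxSteps) {
    // Perceive: serialize the page with Webfuse IDs inlined.
    const snapshot = await automation.see.domSnapshot({ inlineWebfuseIDs: true });
    // Prompt: ask the model which element to act on.
    const webfuseId = await suggest(prompt, snapshot);
    if (webfuseId === null) break;
    // Act: target the suggested element via its Webfuse ID.
    await automation.act.click(webfuseId);
    steps++;
  }
  return steps;
}
```

Inside a Webfuse Session, the first argument would presumably be browser.webfuseSession.automation. The step limit guards against a model that never signals completion.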
PROMPT
Book a flight to Amsterdam.

SNAPSHOT
<body WF-ID="1">
  <h1 WF-ID="2">Book Flight</h1>
  <p>Please confirm your booking information.</p>
  <checkout-form WF-ID="3">
    <shadow-root>
      <button type="button" WF-ID="4">Confirm</button>
    </shadow-root>
  </checkout-form>
</body>

Based on the AI model’s suggestions, actuation can target elements via Webfuse ID:

browser.webfuseSession
  .automation
  .act
  .click('4')

see.guiSnapshot()
Serialize the GUI for processing purposes, such as LLM input. The serialized GUI corresponds to a screenshot; hence, this method is an alias of webfuseSession.takeScreenshot().

browser.webfuseSession.automation.see.guiSnapshot(): Promise<ImageBitmap>

Returns
A promise that resolves with the GUI snapshot, i.e., GUI state serialized as an image bitmap.
.tool Tool Scope
tool.computeAccessibilityTree()
browser.webfuseSession.automation.tool.computeAccessibilityTree(domSnapshot: string): object

Translate a DOM snapshot to its accessibility tree object representation.
Parameters
domSnapshot
- DOM snapshot to compute an accessibility tree representation from.
Example
<form role="form" aria-describedby="recipe-hint" aria-labelledby="recipe-form-title">
  <div role="group" aria-labelledby="checkbox-group">
    <h3 id="checkbox-group">Recipe Preferences</h3>
    <label for="notifications" aria-describedby="notifications-description">
      <input type="checkbox" id="notifications" name="notifications" aria-label="Enable recipe update notifications">
      Receive recipe updates
    </label>
    <p id="notifications-description">I would like to receive updates.</p>
  </div>
</form>

const snapshot = await browser.webfuseSession
  .automation
  .see
  .domSnapshot({
    root: 'form',
  });

await browser.webfuseSession
  .automation
  .tool
  .computeAccessibilityTree(snapshot);

{
  "role": "RootWebArea",
  "source": "html",
  "children": [
    {
      "name": "Recipe Preferences",
      "properties": { "level": 3 },
      "role": "heading",
      "source": "#checkbox-group"
    },
    {
      "children": [
        {
          "name": "Enable recipe update notifications",
          "properties": { "aria-label": "Enable recipe update notifications" },
          "role": "checkbox",
          "source": "#notifications",
          "states": { "checked": false }
        }
      ],
      "properties": { "aria-describedby": "notifications-description" },
      "role": "generic",
      "source": "html > body > section > form > div > label",
      "description": "I would like to receive updates."
    }
  ]
}

tool.applyD2Snap()
Apply the D2Snap DOM downsampling algorithm to a DOM snapshot. This reduces its size while retaining its overall structure and the majority of its inherent UI features. The D2Snap algorithm was developed to mitigate the token size disadvantage that DOM snapshots commonly have as LLM input.

browser.webfuseSession.automation.tool.applyD2Snap(
  domSnapshot: string,
  hierarchyRatio: number = 0.4,
  textRatio: number = 0.6,
  attributeRatio: number = 0.8,
  options?: {
    keepUnknownElements?: boolean;
    skipMarkdownTranslation?: boolean;
  }
): Promise<string>

Parameters
domSnapshot
- DOM snapshot to downsample.
[hierarchyRatio]
- Hierarchy (nesting) compression ratio of the result.
[textRatio]
- Text (length) compression ratio of the result.
[attributeRatio]
- Attribute (count) compression ratio of the result.
[options]
- Snapshot options:
  - [keepUnknownElements]: Whether to keep unknown (custom) elements in the downsampled DOM (false by default).
  - [skipMarkdownTranslation]: Whether to skip translating HTML content to Markdown (false by default).
Returns
A promise that resolves with the downsampled DOM snapshot.
Example
Raw
<section class="container" tabindex="3" required="true" type="example">
  <div class="mx-auto" data-topic="products" required="false">
    <h1>Our Pizza</h1>
    <div>
      <div class="shadow-lg">
        <h2>Margherita</h2>
        <p>
          A simple classic: mozzarella, tomatoes and basil.
          An everyday choice!
        </p>
        <button type="button">Add</button>
      </div>
      <div class="shadow-lg">
        <h2>Capricciosa</h2>
        <p>
          A rich taste: mozzarella, ham, mushrooms, artichokes, and olives.
          A true favourite!
        </p>
        <button type="button">Add</button>
      </div>
    </div>
  </div>
</section>

const snapshot = await browser.webfuseSession
  .automation
  .see
  .domSnapshot({
    root: 'section.container:first-of-type',
  });

// Downsample ('compress') the raw DOM snapshot.
// Compress hierarchy, i.e., element nesting depth by about 50%.
// Compress text, i.e., paragraph sentence length by about 30%.
// Compress attributes, i.e., attribute amount by about 80%:
await browser.webfuseSession
  .automation
  .tool
  .applyD2Snap(snapshot, 0.5, 0.3, 0.8);

Downsampled
<!-- hierarchyRatio = .4, textRatio = .6, attributeRatio = .8 -->
<section>
  # Our Pizza
  <div>
    ## Margherita
    A simple classic:
    <button>Add</button>
    ## Capricciosa
    A rich taste:
    <button>Add</button>
  </div>
</section>

tool.applyAdaptiveD2Snap()
Alias: tool.downsample()
Apply the AdaptiveD2Snap DOM downsampling algorithm to a DOM snapshot. This is an adaptive version of the D2Snap algorithm that does not require explicit compression ratio parameters.

browser.webfuseSession.automation.tool.applyAdaptiveD2Snap(
  domSnapshot: string,
  maxTokens: number = 2**15, // ≈ 32K
  maxIterations: number = 3,
  options?: {
    inlineWebfuseIDs?: boolean;
    keepUnknownElements?: boolean;
    skipMarkdownTranslation?: boolean;
  }
): Promise<string>

Parameters
domSnapshot
- DOM snapshot to downsample.
[maxTokens]
- Maximum expected snapshot size in estimated LLM input tokens (1 token ≈ 4 bytes/symbols).
[maxIterations]
- Maximum number of tries to downsample with increasing compression ratio parameters in order to obtain a snapshot below the given token limit (throws an error otherwise).
[options]
- Snapshot options:
  - [inlineWebfuseIDs]: Whether to add a unique data attribute wf-id to every element in the DOM in order to allow identification of equivalent elements across the original and the downsampled DOM, for example, <button class="btn btn-primary" wf-id="27">Click here!</button> (false by default).
  - [keepUnknownElements]: Whether to keep unknown (custom) elements in the downsampled DOM (false by default).
  - [skipMarkdownTranslation]: Whether to skip translating HTML content to Markdown (false by default).
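To make the interplay of the token limit and the iteration limit more concrete, the sketch below estimates the token count with the rule of thumb stated above and retries with a stronger compression ratio until the limit is met. It does not reproduce AdaptiveD2Snap: the Downsample callback, the linear ratio schedule, and the toy truncate downsampler are assumptions made for this example only.

```typescript
// Rough size estimate following the rule of thumb (1 token ≈ 4 symbols).
function estimateTokens(snapshot: string): number {
  return Math.ceil(snapshot.length / 4);
}

// Hypothetical stand-in for a single downsampling pass; `ratio` lies in
// [0, 1], where higher values mean stronger compression.
type Downsample = (snapshot: string, ratio: number) => string;

function downsampleAdaptively(
  snapshot: string,
  downsample: Downsample,
  maxTokens = 2 ** 15,
  maxIterations = 3,
): string {
  for (let i = 1; i <= maxIterations; i++) {
    // Assumed schedule: evenly spaced ratios, e.g., 0.25, 0.5, 0.75.
    const ratio = i / (maxIterations + 1);
    const candidate = downsample(snapshot, ratio);
    if (estimateTokens(candidate) <= maxTokens) return candidate;
  }
  throw new Error(
    `Snapshot still exceeds ${maxTokens} tokens after ${maxIterations} iterations`,
  );
}

// Toy downsampler for demonstration only: it simply truncates the input.
const truncate: Downsample = (snapshot, ratio) =>
  snapshot.slice(0, Math.floor(snapshot.length * (1 - ratio)));

const result = downsampleAdaptively('x'.repeat(400), truncate, 60);
console.log(estimateTokens(result));
```

With these toy inputs, the first pass still exceeds the limit of 60 estimated tokens, and the second pass falls below it.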
Returns
A promise that resolves with the downsampled DOM snapshot.
Example
The recommended way to get started with downsampled DOM snapshots is to call the adaptive D2Snap tool alias with the default arguments:

const snapshot = await browser.webfuseSession
  .automation
  .see
  .domSnapshot();

await browser.webfuseSession
  .automation
  .tool
  .downsample(snapshot); // .applyAdaptiveD2Snap(snapshot)

wait()
browser.webfuseSession.automation.wait(ms: number): Promise<void>

Parameters
ms
- The number of milliseconds to wait.
Returns
A promise that resolves once the given wait time has passed.
Example
await browser.webfuseSession
  .automation
  .act
  .mouseMove([230, 1215]);

// Wait for 500ms:
await browser.webfuseSession
  .automation
  .wait(500);

await browser.webfuseSession
  .automation
  .act
  .type(browser.webfuseSession.automation.Target.POINTER, 'Amsterdam');