Automation API

The Automation API is available from the browser.webfuseSession.automation namespace.

  • Perception methods (‘see’ 👁️) are grouped under automation.see.
  • Actuation methods (‘act’ 👆️) are grouped under automation.act.
  • Auxiliary tool methods (‘tool’ 🛠️) are grouped under automation.tool.
  • Other methods and properties are available right from automation.

The Automation API works with both DOM-based and vision-based agents. To support both, the target argument of a method accepts several types:

type Target = HTMLElement | CSSSelector | Point | WebfuseID | MetaTarget;
type CSSSelector = string;
type Point = [number, number]; // [x, y];
type WebfuseID = string; // Unique ID of an element per Tab in a Webfuse Session
enum MetaTarget {
POINTER, // Current virtual pointer position
FOCUS // Currently focused element
}
// By element reference
browser.webfuseSession
.automation
.act
.click(document.getElementById('cta'))
// By CSS selector
browser.webfuseSession
.automation
.act
.click('main > button.cta')
// By point coordinate
browser.webfuseSession
.automation
.act
.click([420, 890])
// By meta target
browser.webfuseSession
.automation
.act
.click(
browser.webfuseSession.automation.Target.POINTER
)
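
When an agent produces targets of mixed kinds, it can help to tell them apart before calling an action, for example for logging. The following is a minimal sketch under stated assumptions: the local types only mirror the Target union above in simplified form, describeTarget is a hypothetical helper, and the digit-based pattern for Webfuse IDs is an assumption derived from IDs such as '4' or '2-4' used on this page, not a documented format.

```typescript
// Simplified local mirror of the Target union above (assumption: not imported from Webfuse).
type ElementLike = { nodeType: number };
type Point = [number, number];
enum MetaTarget { POINTER, FOCUS }
type Target = ElementLike | string | Point | MetaTarget;

type TargetKind = 'element' | 'selector' | 'point' | 'webfuse-id' | 'meta';

// Hypothetical heuristic: treat digit groups joined by '-' (e.g., '4', '2-4') as Webfuse IDs.
const WEBFUSE_ID_PATTERN = /^\d+(-\d+)*$/;

// Classify a target by its runtime shape.
function describeTarget(target: Target): TargetKind {
  if (Array.isArray(target)) return 'point';        // [x, y]
  if (typeof target === 'number') return 'meta';    // numeric enum member
  if (typeof target === 'string') {
    return WEBFUSE_ID_PATTERN.test(target) ? 'webfuse-id' : 'selector';
  }
  return 'element';                                 // element reference
}
```

The API itself accepts all of these kinds directly, so such a classification is only needed on the calling side.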

Cross-Shadow and -Frame Targeting Webfuse Exclusive

Relevant elements may, in some cases, be hidden inside a shadow DOM or even an iframe, for example, if the agent-enhanced web page embeds a checkout component from a third-party provider. Targeting with the Automation API is able to pierce shadow root and iframe boundaries. Point coordinates that hit an iframe lead to descending into that iframe and targeting with normalized coordinates until a non-frame element is found.

Standard CSS selectors, by design, cannot cross shadow or frame boundaries. To enable cross-shadow and cross-frame targeting via CSS selectors, Webfuse treats both types of DOM nodes as ordinary container elements. For shadow DOM, the pseudo container tag name shadow-root is introduced. As a result, shadow roots and frames act as implicit container tags, much like div or section.

Shadow DOM subtrees in the browser are usually invisible upon parent DOM serialization (.outerHTML/.innerHTML). The position of a shadow root is visualized with # Shadow in the following example:

<div>
<custom-element>
# Shadow
<b>Slotted</b>
</custom-element>
</div>

Webfuse, however, implies a shadow root element, which Webfuse-native perception reflects in DOM snapshots. For the example above, this looks as follows:

<div>
<custom-element>
<shadow-root>
<strong>Shadow</strong>
<p>
<slot></slot>
</p>
</shadow-root>
<b>Slotted</b>
</custom-element>
</div>

Now, the shadow root and elements within the shadow DOM can be targeted with valid CSS selector syntax:

// Cross-shadow
browser.webfuseSession
.automation
.act
.click('body my-component shadow-root button#submit')

To isolate embedded DOMs from each other, subtrees beneath iframes are usually hidden upon parent DOM serialization (.outerHTML/.innerHTML):

<html>
<head></head>
<body>
<h1>Parent</h1>
<iframe src="/child"></iframe>
</body>
</html>

Webfuse treats frames as native container elements of the parent DOM. Optionally, DOM snapshots taken through Webfuse-native perception inline iframe contents. For the example above, this looks as follows:

<html>
<head></head>
<body>
<h1>Parent</h1>
<iframe src="/child">
<html>
<h1>Child</h1>
</html>
</iframe>
</body>
</html>

Now, elements within the frame can be targeted with valid CSS selector syntax:

// Cross-frame
browser.webfuseSession
.automation
.act
.click('body iframe button#submit')
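
The selector convention described in this section can be captured in two small string helpers. This is only a sketch: inShadow and inFrame are hypothetical names, and the nested combination at the end assumes that frame and shadow descent compose in the same way as shown individually.

```typescript
// Hypothetical helpers; only the resulting selector strings follow the convention described above.

// Descend into the shadow root of a host element via the `shadow-root` pseudo tag.
function inShadow(hostSelector: string, innerSelector: string): string {
  return `${hostSelector} shadow-root ${innerSelector}`;
}

// Descend into an iframe, which is treated as an ordinary container element.
function inFrame(frameSelector: string, innerSelector: string): string {
  return `${frameSelector} ${innerSelector}`;
}

const crossShadow = inShadow('body my-component', 'button#submit');
const crossFrame = inFrame('body iframe', 'button#submit');
// A shadow host nested inside a frame (assumed to compose the same way):
const nestedSelector = inFrame('body iframe', inShadow('checkout-form', 'button'));
```

The resulting strings can be passed as the target argument of any action, such as click.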

Actuation refers to the process of interacting with a web page to change its state. For human users, this typically involves a few common actions, such as clicking, typing, or scrolling.

In the browser, an action is actually a sequence of multiple events fired on a specific target element. For example, a single “click” triggers a chain of events including:

  • mousedown
  • mouseup
  • click

Webfuse emulates human actuation as closely as possible. When an action is performed via Webfuse, all corresponding events are dispatched not only on the target element but also along the trajectory of the mouse pointer movement, ensuring the web application reacts exactly as it would for a real user.
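
To make the idea of events along a pointer trajectory concrete, the following sketch plans the events for a click that starts with a pointer movement. It is an illustration only: planClick is a hypothetical function, the linear interpolation and the step count are assumptions, and the events Webfuse actually dispatches (for instance additional pointer or hover events) are not specified here.

```typescript
type Point = [number, number];

interface PlannedEvent {
  type: 'mousemove' | 'mousedown' | 'mouseup' | 'click';
  at: Point;
}

// Plan pointer movement from `from` to `to` in `steps` linear increments,
// followed by the native click sequence at the destination.
function planClick(from: Point, to: Point, steps: number): PlannedEvent[] {
  const plan: PlannedEvent[] = [];
  for (let i = 1; i <= steps; i++) {
    const t = i / steps;
    plan.push({
      type: 'mousemove',
      at: [from[0] + (to[0] - from[0]) * t, from[1] + (to[1] - from[1]) * t],
    });
  }
  // For the primary button, browsers fire mousedown, then mouseup, then click.
  plan.push({ type: 'mousedown', at: to }, { type: 'mouseup', at: to }, { type: 'click', at: to });
  return plan;
}

const clickPlan = planClick([0, 0], [10, 0], 2);
```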

Move the virtual mouse pointer.

browser.webfuseSession.automation.act.mouseMove(
target: Target,
options?: {
persistent?: boolean;
}
): Promise<void>

target

  • Mouse pointer target.

[options]

  • Mouse move options:
    • [persistent] Whether to keep the pointer on screen after it was moved. By default, the pointer fades out after some time.

A promise that resolves once the mouse was moved.

// Move the virtual mouse pointer to pixel position x=100 and y=400:
await browser.webfuseSession
.automation
.act
.mouseMove([100, 400]);

Scrolls the deepest scrollable element under the target by the given amount in the given direction.

browser.webfuseSession.automation.act.scroll(
target: Target,
direction: 'vertical' | 'horizontal',
amount: number
): Promise<void>

target

  • Scroll(able) target.

direction

  • The direction to scroll.

amount

  • The amount of pixels to scroll.

A promise that resolves once scroll ended.

// Scroll the body element 100 pixels up:
await browser.webfuseSession
.automation
.act
.scroll('body', 'vertical', 100);

Perform a mouse click, using the left (primary) button by default.

browser.webfuseSession.automation.act.click(
target: Target,
options?: {
button?: 'left' | 'middle' | 'right';
moveMouse?: boolean;
scrollIntoView?: boolean;
}
): Promise<void>

target

  • Click target.

[options]

  • Click options:
    • [button] Mouse button to click (left by default).
    • [moveMouse] Whether to move the virtual mouse pointer to the target center before performing the action (false by default).
    • [scrollIntoView] Whether to scroll the target element into view before performing the action (true by default).

A promise that resolves once click was performed.

// Click the fourth element in the second frame.
// Move the virtual mouse pointer to the center of the target element beforehand:
await browser.webfuseSession
.automation
.act
.click('2-4', {
moveMouse: true,
});
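
The documented defaults can be made explicit on the calling side, which is handy when logging what an agent is about to do. A minimal sketch, assuming a local copy of the options shape; resolveClickOptions is a hypothetical helper and not part of the API.

```typescript
interface ClickOptions {
  button?: 'left' | 'middle' | 'right';
  moveMouse?: boolean;
  scrollIntoView?: boolean;
}

// Defaults as documented above.
const CLICK_DEFAULTS: Required<ClickOptions> = {
  button: 'left',
  moveMouse: false,
  scrollIntoView: true,
};

// Merge caller-provided options over the documented defaults.
function resolveClickOptions(options: ClickOptions = {}): Required<ClickOptions> {
  return { ...CLICK_DEFAULTS, ...options };
}

const effectiveOptions = resolveClickOptions({ moveMouse: true });
```
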
browser.webfuseSession.automation.act.type(
target: Target,
text: string,
options?: {
followFocus?: boolean;
overwrite?: boolean;
timePerChar?: number;
moveMouse?: boolean;
scrollIntoView?: boolean;
}
): Promise<void>

Type text to an element. Typing is natural, i.e. as if a human presses a sequence of keys.

target

  • Typing target.

text

  • Text to type.

[options]

  • Type options:
    • [followFocus] Whether to type to the element that has focus even if it changed to a different target (true by default).
    • [overwrite] Whether to overwrite the current contents of the target input (true by default).
    • [timePerChar] Expected mean time to press each character in ms (100 by default).
    • [moveMouse] Whether to move the virtual mouse pointer to the target center before performing the action (false by default).
    • [scrollIntoView] Whether to scroll the target element into view before performing the action (true by default).

A promise that resolves once text was typed.

// Type 'Jane Doe' to the currently focused element (suppose it is an input field):
await browser.webfuseSession
.automation
.act
.type(browser.webfuseSession.automation.Target.FOCUS, 'Jane Doe');
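
Because typing is spread over time, a caller may want a rough idea of how long a type action takes. The sketch below assumes the duration scales linearly with the text length; since timePerChar is an expected mean, actual durations will vary. estimateTypingMs is a hypothetical helper.

```typescript
// Rough duration estimate: number of characters times the mean time per character.
function estimateTypingMs(text: string, timePerChar: number = 100): number {
  return text.length * timePerChar;
}

// 'Jane Doe' has 8 characters, so about 800 ms with the default of 100 ms per character.
const janeDoeMs = estimateTypingMs('Jane Doe');
```
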
browser.webfuseSession.automation.act.select(
target: Target,
value: string,
options?: {
moveMouse?: boolean;
scrollIntoView?: boolean;
}
): Promise<void>

Select an option of a dropdown element by value.

target

  • Select target.

value

  • Value to select (according to value attribute).

[options]

  • Select options:
    • [moveMouse] Whether to move the virtual mouse pointer to the target center before performing the action (false by default).
    • [scrollIntoView] Whether to scroll the target element into view before performing the action (true by default).

A promise that resolves once value was selected.

// Select 'Netherlands' from a country dropdown list:
await browser.webfuseSession
.automation
.act
.select('select#country', 'netherlands', { scrollIntoView: true });
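
Since select expects the value attribute and not the visible label, an agent that only knows the label has to look up the value first. The following sketch does this on serialized HTML with a regular expression. It assumes double-quoted value attributes and plain-text option labels; optionValueByLabel and the sample markup are hypothetical.

```typescript
// Find the value attribute of the option whose visible label matches (case-insensitive).
function optionValueByLabel(selectHtml: string, label: string): string | undefined {
  const optionPattern = /<option\b[^>]*\bvalue="([^"]*)"[^>]*>([^<]*)<\/option>/gi;
  let match: RegExpExecArray | null;
  while ((match = optionPattern.exec(selectHtml)) !== null) {
    if (match[2].trim().toLowerCase() === label.trim().toLowerCase()) {
      return match[1];
    }
  }
  return undefined;
}

// Hypothetical markup of a country dropdown:
const countrySelectHtml = `
<select id="country">
  <option value="germany">Germany</option>
  <option value="netherlands">Netherlands</option>
</select>`;

const countryValue = optionValueByLabel(countrySelectHtml, 'Netherlands');
```
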
browser.webfuseSession.automation.act.keyPress(
target: Target,
key: 'a' | 'b' | ... | 'Z' | '.' | '!' | ... | 'Enter' | 'ArrowUp' | ...,
options?: {
altKey?: boolean;
ctrlKey?: boolean;
metaKey?: boolean;
shiftKey?: boolean;
moveMouse?: boolean;
scrollIntoView?: boolean;
}
): Promise<void>

Press a key on an element. The key argument must be equivalent to a supported KeyboardEvent.key property, which is either:

  • a control key, e.g., Enter to submit forms, or ArrowUp to scroll up a page, or
  • a printable key, i.e., a typeable symbol that could be typed to an input field, e.g., a, B, !, or >.

Tab, Enter, Backspace, Delete, Insert
ArrowUp, ArrowDown, ArrowLeft, ArrowRight
Home, End, PageUp, PageDown
Escape, CapsLock, Shift, Control, Alt, Meta
F1, F2, F3, F4, F5, F6, F7, F8, F9, F10, F11, F12
Numpad0, Numpad1, Numpad2, Numpad3, Numpad4, Numpad5, Numpad6, Numpad7, Numpad8, Numpad9
NumpadAdd, NumpadSubtract, NumpadMultiply, NumpadDivide, NumpadDecimal, NumpadEnter

target

  • Key press target.

key

  • Key to press.

[options]

  • [altKey], [ctrlKey], [metaKey], [shiftKey] Whether to hold down the respective modifier key (Alt, Ctrl, Meta, or Shift) during the press (boolean each).
  • [moveMouse] Whether to move the virtual mouse pointer to the target center before performing the action (false by default).
  • [scrollIntoView] Whether to scroll the target element into view before performing the action (true by default).

A promise that resolves once key was pressed.

// Hit 'Enter' key on an identifiable submit button:
await browser.webfuseSession
.automation
.act
.keyPress('#submit', 'Enter');
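
Agents often describe key combinations as strings such as 'Ctrl+Shift+K'. The sketch below translates such a string into the key argument and the modifier options documented above. parseShortcut and the accepted modifier spellings are assumptions, and a literal '+' key is not handled.

```typescript
interface KeyPressOptions {
  altKey?: boolean;
  ctrlKey?: boolean;
  metaKey?: boolean;
  shiftKey?: boolean;
}

interface ParsedShortcut {
  key: string;
  options: KeyPressOptions;
}

// Accepted modifier spellings (assumption) mapped to the documented option names.
const MODIFIER_FLAGS: { [modifier: string]: keyof KeyPressOptions } = {
  alt: 'altKey',
  ctrl: 'ctrlKey',
  control: 'ctrlKey',
  meta: 'metaKey',
  shift: 'shiftKey',
};

// Split 'Ctrl+Shift+K' into the final key and the modifier flags before it.
function parseShortcut(shortcut: string): ParsedShortcut {
  const parts = shortcut.split('+').map((part) => part.trim());
  const key = parts[parts.length - 1];
  const options: KeyPressOptions = {};
  for (const modifier of parts.slice(0, -1)) {
    const lower = modifier.toLowerCase();
    if (!Object.prototype.hasOwnProperty.call(MODIFIER_FLAGS, lower)) {
      throw new Error(`Unknown modifier: ${modifier}`);
    }
    options[MODIFIER_FLAGS[lower]] = true;
  }
  return { key, options };
}

const parsedShortcut = parseShortcut('Ctrl+Shift+K');
```

The parsed key and options could then be passed as the second and third argument of keyPress.
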
browser.webfuseSession.automation.act.textSelect(
target: Target,
text: string,
options?: {
occurrence?: number;
moveMouse?: boolean;
scrollIntoView?: boolean;
}
): Promise<void>

Select continuous text in the page.

target

  • Text content selection target.

text

  • Text to select (empty text also removes any existing selection).

[options]

  • Text select options:
    • [occurrence] Zero-based index of the occurrence to select if the text appears more than once (0, i.e., the first occurrence, by default).
    • [moveMouse] Whether to move the virtual mouse pointer to the target center before performing the action (false by default).
    • [scrollIntoView] Whether to scroll the target element into view before performing the action (true by default).

A promise that resolves once the selection was applied.

// Select the third occurrence of the text 'ipsum' in the main element:
await browser.webfuseSession
.automation
.act
.textSelect('main', 'ipsum', {
occurrence: 2
});
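
The zero-based occurrence index can be illustrated on a plain string. This sketch counts non-overlapping matches in a single string; how Webfuse counts occurrences across element boundaries is not specified here, and nthOccurrenceIndex is a hypothetical helper.

```typescript
// Return the start index of the zero-based nth occurrence of `needle`, or -1 if there is none.
function nthOccurrenceIndex(haystack: string, needle: string, occurrence: number = 0): number {
  if (needle.length === 0) return -1;
  let index = -1;
  let from = 0;
  for (let i = 0; i <= occurrence; i++) {
    index = haystack.indexOf(needle, from);
    if (index === -1) return -1;
    from = index + needle.length; // continue after the current match
  }
  return index;
}

const loremText = 'lorem ipsum dolor ipsum sit ipsum';
// occurrence = 2 addresses the third 'ipsum', which starts at index 28.
const thirdIpsum = nthOccurrenceIndex(loremText, 'ipsum', 2);
```
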

Perception is the process of interpreting the current state of a web page or, in the case of web automation, of a web application. While human users primarily perceive websites via graphical user interfaces (GUIs), computer agents can also understand web application state by more than visual means. A prominent technical representation of non-visual state is the document object model (DOM), a web browser’s runtime model of a web application.

By analyzing the DOM, the agent can programmatically understand:

  • Structural hierarchy (how elements are nested).
  • Element attributes (IDs, classes, and accessibility labels).
  • Metadata not immediately visible to the naked eye.
browser.webfuseSession.automation.see.textSelection(): Promise<string>

Get the currently selected text in the page.

A promise that resolves with the currently selected text (or empty string if nothing is selected).

browser.webfuseSession.automation.see.domSnapshot(options?: {
root?: Target; // Snapshot tree root
crossframe?: boolean;
crossshadow?: boolean;
inlineWebfuseIDs?: boolean;
revealMaskedElements?: boolean;
}): Promise<string>

Take a web page snapshot, which is a time-sensitive serialization of the current web page state.

[options] DOM snapshot options:

  • [root] Target a snapshot root element to scope the DOM snapshot to a specific subtree (integrates with targeting, body by default).
  • [crossframe] Webfuse Exclusive Whether to include iframe subtrees (false by default).
  • [crossshadow] Webfuse Exclusive Whether to include shadow DOM subtrees (true by default).
  • [inlineWebfuseIDs] Whether to assign each element its Webfuse ID via the HTML pseudo attribute wf-id (false by default).
  • [revealMaskedElements] Whether to include masked elements (false by default).

A promise that resolves with the DOM snapshot, i.e., DOM state serialized as HTML.

// Take a DOM snapshot, i.e., serialize the current state of the DOM as HTML, descending only from the main element.
// To each element, assign its unique Webfuse ID via `wf-id` pseudo attribute.
// Do not include shadow DOMs in snapshot:
await browser.webfuseSession
.automation
.see
.domSnapshot({
root: 'main',
crossshadow: false,
inlineWebfuseIDs: true,
});

Cross-frame snapshots (crossframe = true) will have all iframe contents inlined, e.g.:

<html>
<head></head>
<body>
<h1>Parent</h1>
<iframe src="/child">
<html>
<h1>Child</h1>
</html>
</iframe>
</body>
</html>

Cross-shadow snapshots (crossshadow = true) will have all shadow root contents inlined, e.g.:

<div>
<custom-element>
<shadow-root>
<strong>Shadow</strong>
<p>
<slot></slot>
</p>
</shadow-root>
<b>Slotted</b>
</custom-element>
</div>

The web AI agent lifecycle models a constant loop of perceiving the current state of a website, prompting the model for actuation suggestions, and acting out these suggestions. Snapshots paired with Webfuse IDs represent a robust means of targeting, even after destructive snapshot processing.

PROMPT
Book a flight to Amsterdam.
SNAPSHOT
<body WF-ID="1">
<h1 WF-ID="2">Book Flight</h1>
<p>Please confirm your booking information.</p>
<checkout-form WF-ID="3">
<shadow-root>
<button type="button" WF-ID="4">Confirm</button>
</shadow-root>
</checkout-form>
</body>

Based on the AI model’s suggestions, actuation can target elements via Webfuse ID:

browser.webfuseSession
.automation
.act
.click('4')
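
Before acting on a model's suggestion, it can be worth checking that the suggested Webfuse ID actually occurs in the snapshot that was sent. The following is a sketch based on a regular expression rather than a full HTML parser; listWebfuseIds and isKnownWebfuseId are hypothetical helpers, and the attribute is matched case-insensitively because both wf-id and WF-ID appear on this page.

```typescript
interface IdentifiedElement {
  id: string;
  tag: string;
}

// Collect all elements that carry a Webfuse ID attribute in a serialized snapshot.
function listWebfuseIds(snapshot: string): IdentifiedElement[] {
  const pattern = /<([a-z][\w-]*)\b[^>]*?\bwf-id="([^"]+)"/gi;
  const found: IdentifiedElement[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(snapshot)) !== null) {
    found.push({ tag: match[1].toLowerCase(), id: match[2] });
  }
  return found;
}

// Guard against IDs that the model made up.
function isKnownWebfuseId(snapshot: string, id: string): boolean {
  return listWebfuseIds(snapshot).some((element) => element.id === id);
}

const bookingSnapshot = `
<body WF-ID="1">
  <h1 WF-ID="2">Book Flight</h1>
  <p>Please confirm your booking information.</p>
  <checkout-form WF-ID="3">
    <shadow-root>
      <button type="button" WF-ID="4">Confirm</button>
    </shadow-root>
  </checkout-form>
</body>`;

const canClickSuggested = isKnownWebfuseId(bookingSnapshot, '4');
```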

Serialize the GUI for various processing purposes, such as LLM input. The serialized GUI corresponds to a screenshot. Hence, this is an alias of webfuseSession.takeScreenshot().

browser.webfuseSession.automation.see.guiSnapshot(): Promise<ImageBitmap>

A promise that resolves with the GUI snapshot, i.e., GUI state serialized as an image bitmap.


browser.webfuseSession.automation.tool.computeAccessibilityTree(domSnapshot: string): object

Translate a DOM snapshot to its accessibility tree object representation.

domSnapshot

  • DOM snapshot to compute an accessibility tree representation from.
<form
role="form"
aria-describedby="recipe-hint"
aria-labelledby="recipe-form-title">
<div
role="group"
aria-labelledby="checkbox-group">
<h3 id="checkbox-group">Recipe Preferences</h3>
<label for="notifications"
aria-describedby="notifications-description">
<input type="checkbox" id="notifications"
name="notifications"
aria-label="Enable recipe update notifications">
Receive recipe updates
</label>
<p id="notifications-description">I would like to receive updates.</p>
</div>
</form>
const snapshot = await browser.webfuseSession
.automation
.see
.domSnapshot({
root: 'form',
});
await browser.webfuseSession
.automation
.tool
.computeAccessibilityTree(snapshot);
{
"role": "RootWebArea",
"source": "html",
"children": [
{
"name": "Recipe Preferences",
"properties": {
"level": 3
},
"role": "heading",
"source": "#checkbox-group"
},
{
"children": [
{
"name": "Enable recipe update notifications",
"properties": {
"aria-label": "Enable recipe update notifications"
},
"role": "checkbox",
"source": "#notifications",
"states": {
"checked": false
}
}
],
"properties": {
"aria-describedby": "notifications-description"
},
"role": "generic",
"source": "html > body > section > form > div > label",
"description": "I would like to receive updates."
}
]
}
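
An accessibility tree like the one above can be searched recursively, for instance to list all checkboxes an agent could toggle. The sketch assumes that nodes generally have the shape seen in the sample output (role, optional name, source, and children); the AccessibilityNode interface and findByRole are hypothetical, and since the method is typed to return object, a cast on the calling side would be needed.

```typescript
// Assumed node shape, derived from the sample output above.
interface AccessibilityNode {
  role: string;
  name?: string;
  source?: string;
  children?: AccessibilityNode[];
}

// Collect all nodes with the given role, depth-first.
function findByRole(node: AccessibilityNode, role: string): AccessibilityNode[] {
  const matches: AccessibilityNode[] = node.role === role ? [node] : [];
  for (const child of node.children ?? []) {
    matches.push(...findByRole(child, role));
  }
  return matches;
}

// Abbreviated version of the sample tree above:
const sampleTree: AccessibilityNode = {
  role: 'RootWebArea',
  source: 'html',
  children: [
    { role: 'heading', name: 'Recipe Preferences', source: '#checkbox-group' },
    {
      role: 'generic',
      children: [
        { role: 'checkbox', name: 'Enable recipe update notifications', source: '#notifications' },
      ],
    },
  ],
};

const checkboxes = findByRole(sampleTree, 'checkbox');
```

The source property of a match (here '#notifications') looks like a CSS selector and might therefore serve as a target for a subsequent action.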

Apply the D2Snap DOM downsampling algorithm to a DOM snapshot. This will reduce its size, while retaining its overall structural features, and also a majority of inherent UI features. The D2Snap algorithm was developed in order to mitigate the prevalent DOM token size disadvantage.

browser.webfuseSession.automation.tool.applyD2Snap(
domSnapshot: string,
hierarchyRatio: number = 0.4, textRatio: number = 0.6, attributeRatio: number = 0.8,
options?: {
keepUnknownElements?: boolean;
skipMarkdownTranslation?: boolean;
}
): Promise<string>

domSnapshot

  • DOM snapshot to downsample.

[hierarchyRatio]

  • Hierarchy (nesting) compression ratio of the result.

[textRatio]

  • Text (length) compression ratio of the result.

[attributeRatio]

  • Attribute (count) compression ratio of the result.

[options]

  • Snapshot options:
    • [keepUnknownElements] Whether to keep unknown (custom) elements in the downsampled DOM (false by default).
    • [skipMarkdownTranslation] Whether to skip content HTML to Markdown translation (false by default).

A promise that resolves with the downsampled DOM snapshot.

Raw
<section class="container" tabindex="3" required="true" type="example">
<div class="mx-auto" data-topic="products" required="false">
<h1>Our Pizza</h1>
<div>
<div class="shadow-lg">
<h2>Margherita</h2>
<p>
A simple classic: mozzarella, tomatoes and basil.
An everyday choice!
</p>
<button type="button">Add</button>
</div>
<div class="shadow-lg">
<h2>Capricciosa</h2>
<p>
A rich taste: mozzarella, ham, mushrooms, artichokes, and olives.
A true favourite!
</p>
<button type="button">Add</button>
</div>
</div>
</div>
</section>
const snapshot = await browser.webfuseSession
.automation
.see
.domSnapshot({
root: 'section.container:first-of-type',
});
// Downsample ('compress') the raw DOM snapshot.
// Compress hierarchy, i.e., element nesting depth by about 50%.
// Compress text, i.e., paragraph sentence length by about 30%.
// Compress attributes, i.e., attribute amount by about 80%:
await browser.webfuseSession
.automation
.tool
.applyD2Snap(snapshot, 0.5, 0.3, 0.8);
Downsampled
<!-- hierarchyRatio = .4, textRatio = .6, attributeRatio = .8 -->
<section>
# Our Pizza
<div>
## Margherita
A simple classic:
<button>Add</button>
## Capricciosa
A rich taste:
<button>Add</button>
</div>
</section>

Alias: tool.downsample()

Apply the AdaptiveD2Snap DOM downsampling algorithm to a DOM snapshot. This is an adaptive version of the D2Snap algorithm that does not require explicit parameters.

browser.webfuseSession.automation.tool.applyAdaptiveD2Snap(
domSnapshot: string,
maxTokens: number = 2**15, // ≈ 32K
maxIterations: number = 3,
options?: {
inlineWebfuseIDs?: boolean;
keepUnknownElements?: boolean;
skipMarkdownTranslation?: boolean;
}
): Promise<string>

domSnapshot

  • DOM snapshot to downsample.

[maxTokens]

  • Maximum expected snapshot size in estimated LLM input tokens (1 token ≈ 4 bytes/symbols).

[maxIterations]

  • Maximum number of tries to downsample with increasing compression ratio parameters to obtain a snapshot below the given token limit (throws an error otherwise).

[options]

  • Snapshot options:
    • [inlineWebfuseIDs] Whether to add a unique data attribute wf-id to every element in the DOM in order to allow identification of equivalent elements across the original and the downsampled DOM. For example, <button class="btn btn-primary" wf-id="27">Click here!</button> (false by default).
    • [keepUnknownElements] Whether to keep unknown (custom) elements in the downsampled DOM (false by default).
    • [skipMarkdownTranslation] Whether to skip content HTML to Markdown translation (false by default).

A promise that resolves with the downsampled DOM snapshot.
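
The interplay of maxTokens and maxIterations can be pictured as a retry loop around a compressor. The sketch below is not the AdaptiveD2Snap implementation: the compress callback is a hypothetical stand-in for a downsampling step, the token estimate uses the rule of thumb of 4 symbols per token given above, and treating a snapshot at exactly the limit as acceptable is an assumption.

```typescript
// 1 token ≈ 4 symbols, as stated above; a real tokenizer will differ.
function estimateTokens(snapshot: string): number {
  return Math.ceil(snapshot.length / 4);
}

// Hypothetical compressor: expected to return a smaller snapshot for a higher level.
type Compressor = (snapshot: string, level: number) => string;

// Retry with increasing compression until the estimate fits the budget.
function downsampleWithinBudget(
  snapshot: string,
  compress: Compressor,
  maxTokens: number = 2 ** 15,
  maxIterations: number = 3
): string {
  for (let level = 1; level <= maxIterations; level++) {
    const candidate = compress(snapshot, level);
    if (estimateTokens(candidate) <= maxTokens) return candidate;
  }
  throw new Error(`Snapshot still exceeds ${maxTokens} tokens after ${maxIterations} iterations`);
}
```

A caller can use estimateTokens on its own to decide whether a raw snapshot needs downsampling at all.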

The recommended way to get started with downsampled DOM snapshots is by simply calling the adaptive D2Snap tool alias with the default arguments:

const snapshot = await browser.webfuseSession
.automation
.see
.domSnapshot();
await browser.webfuseSession
.automation
.tool
.downsample(snapshot); // .applyAdaptiveD2Snap(snapshot)

browser.webfuseSession.automation.wait(ms: number): Promise<void>

ms

  • The amount of milliseconds to wait.

A promise that resolves once the given wait time has passed.

await browser.webfuseSession
.automation
.act
.mouseMove([230, 1215]);
// Wait for 500ms:
await browser.webfuseSession
.automation
.wait(500);
await browser.webfuseSession
.automation
.act
.type(browser.webfuseSession.automation.Target.POINTER, 'Amsterdam');