browsy

Zero-render browser engine for AI agents. browsy.dev

browsy converts web pages into a structured Spatial DOM -- a flat list of interactive and text elements with bounding boxes, roles, and states -- without rendering pixels. On top of this, it layers page intelligence: automatic page type detection, suggested actions with stable element IDs, CAPTCHA detection, and hidden content exposure.

$ browsy fetch https://github.com/login

page_type: Login
suggested_actions:
  Login { username: 19, password: 21, submit: 34 }

[19:input "Username or email address" @top-C]
[21:input "Password" @mid-C]
[34:button "Sign in" @mid-C]

// 203ms. No Chromium. No LLM needed.

Why browsy?

Most AI agents that touch the web today launch a Chromium instance of roughly 300MB, wait several seconds for it to render, and then ask an LLM "what am I looking at?"

browsy skips all of that:

| | Chromium-based tools | browsy |
|---|---|---|
| Speed | 5-30 seconds per page | ~200ms |
| Dependencies | 282MB+ Chromium | 6MB binary |
| Page intelligence | None (LLM must figure it out) | 12 page types, 13 action recipes |
| Hidden content | Not accessible | Exposed with hidden: true |
| CAPTCHA detection | None | reCAPTCHA, hCaptcha, Turnstile, Cloudflare, image grid |
| Output | Raw accessibility tree | Structured Spatial DOM |
| Deterministic | No (LLM variance) | Yes (same HTML = same output) |

When to use browsy

browsy handles server-rendered HTML -- the 90% of the web that doesn't need a browser to understand. Login forms, search pages, news sites, government portals, documentation, e-commerce product pages.

For JS-rendered SPAs (React, Angular, Vue apps that render client-side), you still need a real browser. browsy is the fast path, not a full browser replacement.

Key features

  • Page intelligence -- 12 page types detected automatically, 13 action recipes with element IDs
  • CAPTCHA detection -- identifies reCAPTCHA, hCaptcha, Cloudflare Turnstile, image grids with sitekey extraction
  • Hidden content exposure -- dropdowns, modals, accordions included with hidden: true
  • Session API -- navigate, click, type, select, search -- with cookie persistence
  • Built-in web search -- DuckDuckGo and Google, search and fetch results in one call
  • Smart deduplication -- 34-42% element reduction on real sites
  • Delta output -- only changes after first load
  • MCP server -- use browsy from Claude Code or any MCP client
  • Python bindings -- PyO3-based, full session API
  • 6MB binary -- zero runtime dependencies

Quickstart

This guide covers the core browsy-core workflow: parse HTML, fetch live pages, read page intelligence, and interact with forms.

1. Install

cargo add browsy-core

This pulls in the fetch feature by default, which includes HTTP fetching via reqwest. See Installation for other installation methods.

2. Parse HTML

The simplest entry point is browsy_core::parse. Pass it an HTML string and a viewport size, and it returns a SpatialDom -- a flat list of elements with bounding boxes, roles, and states.

let html = r#"
<html>
  <body>
    <h1>Hello, world</h1>
    <a href="/about">About</a>
    <input type="text" placeholder="Search..." />
  </body>
</html>
"#;

let dom = browsy_core::parse(html, 1920.0, 1080.0);

// Iterate over elements
for el in &dom.els {
    println!("[{}:{} {:?}]", el.id, el.tag, el.text);
}

The viewport dimensions (1920x1080 here) affect layout computation -- elements get positioned and sized as they would in a real browser at that resolution.

SpatialDom serializes to JSON via serde:

let json = serde_json::to_string_pretty(&dom).unwrap();
println!("{}", json);

3. Fetch and parse a live page

The Session API handles HTTP fetching, cookie persistence, and page interaction. It requires the fetch feature (enabled by default).

use browsy_core::fetch::Session;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut session = Session::new()?;
    let dom = session.goto("https://example.com")?;

    println!("Title: {}", dom.title);
    println!("Elements: {}", dom.els.len());

    // Elements are accessible by ID
    if let Some(el) = dom.get(1) {
        println!("First element: {} {:?}", el.tag, el.text);
    }

    // Filter to visible-only or above-the-fold
    let visible = dom.visible();
    let above_fold = dom.above_fold();

    Ok(())
}

Sessions persist cookies across navigations. Each call to goto returns a fresh SpatialDom for the new page.

4. Read page intelligence

Every SpatialDom includes two forms of page intelligence: a detected page type and a list of suggested actions with stable element IDs.

use browsy_core::fetch::Session;
use browsy_core::output::{PageType, SuggestedAction};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut session = Session::new()?;
    let dom = session.goto("https://github.com/login")?;

    // Page type: Login, Search, Article, Form, List, Dashboard, etc.
    println!("Page type: {:?}", dom.page_type);

    // Suggested actions tell the agent exactly what to do
    for action in &dom.suggested_actions {
        match action {
            SuggestedAction::Login { username_id, password_id, submit_id, .. } => {
                println!("Login form found:");
                println!("  Username field: element {}", username_id);
                println!("  Password field: element {}", password_id);
                println!("  Submit button: element {}", submit_id);
            }
            SuggestedAction::Search { input_id, submit_id } => {
                println!("Search: input={}, submit={}", input_id, submit_id);
            }
            SuggestedAction::EnterCode { input_id, submit_id, code_length } => {
                println!("2FA code: input={}, submit={}, length={:?}",
                    input_id, submit_id, code_length);
            }
            _ => println!("Action: {:?}", action),
        }
    }

    Ok(())
}

Page types

browsy detects 14 page types automatically (including the Other fallback):

| PageType | Meaning |
|---|---|
| Login | Login form with username/password fields |
| TwoFactorAuth | Verification code entry (2FA, email confirmation) |
| OAuthConsent | OAuth authorization prompt |
| Captcha | CAPTCHA challenge page |
| Search | Search page (empty query state) |
| SearchResults | Search results page |
| Inbox | Email or message inbox |
| EmailBody | Single email or message view |
| Dashboard | Dashboard or admin panel |
| Form | Generic form (registration, contact, settings) |
| Article | Article, blog post, documentation page |
| List | List or catalog page (products, directory) |
| Error | Error page (404, 500, access denied) |
| Other | No specific type detected |

CAPTCHA detection

When browsy detects a CAPTCHA, it sets page_type to Captcha and populates captcha with details:

if dom.page_type == PageType::Captcha {
    if let Some(captcha) = &dom.captcha {
        println!("CAPTCHA type: {:?}", captcha.captcha_type);
        // ReCaptcha, HCaptcha, Turnstile, CloudflareChallenge, ImageGrid, TextCaptcha
        if let Some(sitekey) = &captcha.sitekey {
            println!("Site key: {}", sitekey);
        }
    }
}

Or use the session convenience methods:

if session.is_captcha() {
    println!("CAPTCHA: {:?}", session.captcha_info());
}

5. Log in to a site

browsy provides two ways to interact with login forms: manual (using element IDs) and automatic (using session.login).

Manual login

Use the element IDs from SuggestedAction::Login to type credentials and submit:

use browsy_core::fetch::Session;
use browsy_core::output::SuggestedAction;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut session = Session::new()?;
    let dom = session.goto("https://github.com/login")?;

    // Find the login action
    for action in &dom.suggested_actions {
        if let SuggestedAction::Login { username_id, password_id, submit_id, .. } = action {
            session.type_text(*username_id, "user@example.com")?;
            session.type_text(*password_id, "my-password")?;
            let result = session.click(*submit_id)?;
            println!("After login: {:?}", result.page_type);
            break;
        }
    }

    Ok(())
}

Automatic login

session.login detects the login form from suggested_actions and fills it in one call:

let mut session = Session::new()?;
session.goto("https://github.com/login")?;

let result = session.login("user@example.com", "my-password")?;
println!("After login: {:?}", result.page_type);

This fails with FetchError::ActionError if no SuggestedAction::Login is detected on the current page.

2FA / verification codes

If the login redirects to a 2FA page, use enter_code:

if result.page_type == PageType::TwoFactorAuth {
    let final_page = session.enter_code("123456")?;
    println!("After 2FA: {:?}", final_page.page_type);
}

6. Search the web

browsy has built-in web search via DuckDuckGo and Google. No API keys required.

Get search results

use browsy_core::fetch::{Session, SearchEngine};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut session = Session::new()?;

    // DuckDuckGo (default)
    let results = session.search("rust web frameworks")?;
    for r in &results {
        println!("{}: {}", r.title, r.url);
        println!("  {}", r.snippet);
    }

    // Google
    let results = session.search_with("rust web frameworks", SearchEngine::Google)?;

    Ok(())
}

Search and read pages

search_and_read fetches the top N results and returns each page's SpatialDom:

let pages = session.search_and_read("browsy browser engine", 3)?;

for page in &pages {
    println!("--- {} ---", page.result.title);
    if let Some(dom) = &page.dom {
        println!("  Page type: {:?}", dom.page_type);
        println!("  Elements: {}", dom.els.len());
    } else {
        println!("  (fetch failed)");
    }
}

Installation

browsy is available as a Rust library, a CLI binary, a Python package, and an MCP server.

Rust library

Add browsy-core to your project:

cargo add browsy-core

This enables the fetch feature by default, which includes HTTP fetching, session management, and web search via reqwest.

Without networking

To use browsy as a pure HTML-to-Spatial-DOM parser with no network dependencies:

cargo add browsy-core --no-default-features

This disables the fetch feature. You get browsy_core::parse(html, width, height) and nothing else -- no Session, no HTTP, no reqwest. Useful for embedding browsy in contexts where you handle fetching yourself.

// Available without fetch feature
let dom = browsy_core::parse(html, 1920.0, 1080.0);

// Requires fetch feature (enabled by default)
use browsy_core::fetch::Session;
let mut session = Session::new()?;

Feature flags

| Feature | Default | Description |
|---|---|---|
| fetch | Yes | HTTP fetching, Session API, web search, cookie persistence |

CLI

Install the browsy CLI binary:

cargo install browsy

Usage:

# Fetch and parse a live page
browsy fetch https://example.com

# Parse local HTML from stdin
cat page.html | browsy parse

# JSON output
browsy fetch https://example.com --format json

REST API server

The CLI includes a built-in REST API + A2A server:

browsy serve
browsy serve --port 8080
browsy serve --allow-private-network

See REST API for endpoint documentation and A2A Protocol for agent-to-agent integration.

Python

browsy has PyO3 bindings published as the browsy-ai package:

pip install browsy-ai

import browsy

# Parse HTML directly
dom = browsy.parse(html, 1920.0, 1080.0)
print(dom.page_type)
print(dom.suggested_actions)

# Session-based browsing
session = browsy.Session()
dom = session.goto("https://example.com")
session.type_text(19, "hello")
session.click(34)

The Python bindings expose the same Session API as the Rust library, including login, search, enter_code, and all form interaction methods.

Framework integrations

Install browsy with framework-specific extras:

pip install browsy-ai[langchain]   # LangChain tools
pip install browsy-ai[crewai]      # CrewAI tool
pip install browsy-ai[openai]      # OpenAI function calling
pip install browsy-ai[autogen]     # AutoGen integration
pip install browsy-ai[smolagents]  # HuggingFace smolagents
pip install browsy-ai[all]         # All integrations

See Framework Integrations for usage guides.

Requirements

  • Python 3.9+
  • No native dependencies (the compiled extension includes everything)

JavaScript / TypeScript

The browsy-ai npm package provides a TypeScript SDK with integrations for LangChain.js, OpenAI, and Vercel AI SDK:

npm install browsy-ai

import { BrowsyClient, BrowsyContext } from "browsy-ai";       // Core SDK
import { getTools } from "browsy-ai/langchain";                  // LangChain.js
import { getToolDefinitions, handleToolCall } from "browsy-ai/openai";  // OpenAI
import { browsyTools } from "browsy-ai/vercel-ai";               // Vercel AI SDK

Framework dependencies are optional peer dependencies -- install only what you need:

npm install browsy-ai @langchain/core    # LangChain.js
npm install browsy-ai openai             # OpenAI
npm install browsy-ai ai                 # Vercel AI SDK

Requires Node.js 22+ and the browsy CLI (cargo install browsy) for the REST server.

See JavaScript / TypeScript for the full SDK guide.

MCP Server

browsy ships an MCP server that exposes the full Session API as tools. This works with Claude Code, Claude Desktop, and any MCP-compatible client.

Install

cargo install browsy-mcp

Configure for Claude Code

Add to your Claude Code MCP configuration (.claude/mcp.json or equivalent):

{
  "mcpServers": {
    "browsy": {
      "command": "browsy-mcp",
      "args": []
    }
  }
}

Configure for Claude Desktop

Add to your Claude Desktop config (claude_desktop_config.json):

{
  "mcpServers": {
    "browsy": {
      "command": "browsy-mcp",
      "args": []
    }
  }
}

Available MCP tools

The MCP server exposes these tools:

| Tool | Description |
|---|---|
| browse | Navigate to a URL, returns Spatial DOM |
| click | Click an element by ID |
| type_text | Type into an input field by ID |
| check / uncheck | Toggle checkboxes and radio buttons |
| select | Select a dropdown option |
| get_page | Get the current page DOM with form state |
| back | Go back in navigation history |
| search | Web search via DuckDuckGo or Google |
| find | Find elements by text or ARIA role |
| login | Fill and submit a login form |
| enter_code | Fill and submit a verification code |
| tables | Extract structured table data |
| page_info | Get page metadata, type, and suggested actions |

Building from source

git clone https://github.com/GhostPeony/browsy
cd browsy

# Build everything (library + CLI + MCP server)
cargo build --release

# Run tests
cargo test -p browsy-core

# Install CLI and MCP server from local source
cargo install --path crates/cli
cargo install --path crates/mcp

Spatial DOM

The Spatial DOM is the primary output of browsy. It converts an HTML document into a flat list of SpatialElement structs -- each representing an interactive element, text block, or structural landmark -- with bounding boxes, ARIA roles, and form state. No tree traversal, no pixel rendering.

use browsy_core::parse;

let dom = parse(html, 1920.0, 1080.0);
// dom.els: Vec<SpatialElement> -- flat, ordered, ready for agent consumption

SpatialElement fields

Every element in the Spatial DOM is a SpatialElement with these fields:

| Field | Type | Description |
|---|---|---|
| id | u32 | Stable numeric ID, assigned sequentially. Used for all interactions (click, type_text, etc.) |
| tag | String | HTML tag name (a, button, input, p, h1, etc.) |
| role | Option<String> | ARIA role -- explicit from role attr or implicit from tag. link, button, textbox, heading, navigation, etc. |
| text | Option<String> | Visible text content. For images, this is the alt text |
| href | Option<String> | Link destination (resolved to absolute URL when parsed via Session) |
| b | [i32; 4] | Bounding box: [x, y, width, height] in pixels relative to the document |
| hidden | Option<bool> | Some(true) if the element is hidden. Absent (None) when visible |
| name | Option<String> | HTML name attribute (form fields only: input, textarea, select) |
| val | Option<String> | Current value from the HTML value attribute |
| ph | Option<String> | Placeholder text |
| label | Option<String> | Associated <label> text (resolved via <label for="id">) |
| input_type | Option<String> | Input type (text, password, email, checkbox, radio, search, etc.). Serializes as type in JSON |
| checked | Option<bool> | Whether a checkbox/radio is checked |
| disabled | Option<bool> | Whether the element is disabled |
| expanded | Option<bool> | ARIA expanded state (dropdowns, accordions) |
| selected | Option<bool> | ARIA selected state (tabs, options) |
| required | Option<bool> | Whether the field is required |
| alert_type | Option<String> | Alert classification: "alert", "status", "error", "success", "warning" |

All Option fields use skip_serializing_if -- absent fields are omitted from JSON output to keep payloads compact.

Hidden content exposure

Elements with display: none, visibility: hidden, aria-hidden="true", or the hidden attribute are not discarded. They appear in the Spatial DOM with hidden: Some(true).

This is a deliberate design decision. Without JavaScript execution, browsy cannot toggle visibility. By including hidden elements, agents can see:

  • Dropdown menus -- <ul> inside a nav that only appears on hover
  • Modal dialogs -- login forms, cookie consent, popups
  • Accordion panels -- FAQ content behind collapsed sections
  • Tab content -- inactive tab panels
  • Off-canvas navigation -- mobile menus hidden at desktop widths

// All elements including hidden
let all = &dom.els;

// Only visible elements
let visible = dom.visible();

// Hidden elements are distinguishable
for el in &dom.els {
    if el.hidden == Some(true) {
        // This element is hidden in the rendered page
    }
}

Hidden elements always have a zero-size exemption -- they are preserved regardless of bounding box dimensions. Visible elements with zero width and height are skipped as layout artifacts.

Deduplication

HTML commonly wraps interactive elements in container tags that carry no additional meaning:

<li><a href="/about">About</a></li>
<td><span><button>Submit</button></span></td>

browsy collapses these wrappers. When a wrapper tag (li, td, th, span, p, dt, dd) contains only interactive children and no meaningful text of its own, the wrapper is skipped. Only the inner interactive element is emitted.

This produces a 34-42% element reduction on real sites without losing any semantic content.
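
The collapsing rule above can be sketched in a few lines of standalone Rust. This is only an illustration under stated assumptions, not browsy's implementation: the `Node` type, the `node` helper, and the `INTERACTIVE_TAGS` list are hypothetical (the document lists the wrapper tags but does not enumerate which tags count as interactive).

```rust
// Hypothetical, simplified node type for illustration only.
#[derive(Debug)]
struct Node {
    tag: &'static str,
    own_text: &'static str, // text directly inside this node, excluding children
    children: Vec<Node>,
}

// Wrapper tags as listed in the text above.
const WRAPPER_TAGS: [&str; 7] = ["li", "td", "th", "span", "p", "dt", "dd"];
// Assumed set of interactive tags; not taken from the document.
const INTERACTIVE_TAGS: [&str; 5] = ["a", "button", "input", "select", "textarea"];

fn node(tag: &'static str, own_text: &'static str, children: Vec<Node>) -> Node {
    Node { tag, own_text, children }
}

/// True when `n` is a wrapper with no text of its own whose children are all
/// interactive elements or further collapsible wrappers.
fn is_collapsible(n: &Node) -> bool {
    WRAPPER_TAGS.contains(&n.tag)
        && n.own_text.trim().is_empty()
        && !n.children.is_empty()
        && n.children
            .iter()
            .all(|c| INTERACTIVE_TAGS.contains(&c.tag) || is_collapsible(c))
}

/// Depth-first walk that emits every tag except collapsed wrappers.
fn emitted_tags(n: &Node, out: &mut Vec<&'static str>) {
    if !is_collapsible(n) {
        out.push(n.tag);
    }
    for child in &n.children {
        emitted_tags(child, out);
    }
}

fn main() {
    // <td><span><button>Submit</button></span></td>
    let cell = node("td", "", vec![node("span", "", vec![node("button", "Submit", vec![])])]);
    let mut out = Vec::new();
    emitted_tags(&cell, &mut out);
    println!("{:?}", out); // only the button survives
}
```

A wrapper that carries text of its own (for example an li containing "Docs: " followed by a link) is kept in this sketch, since dropping it would lose content.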

Landmark markers

HTML5 landmark elements (nav, header, footer, main, aside, section, form) and elements with explicit landmark ARIA roles (navigation, banner, contentinfo, complementary, region, main, form) emit as role-only structural markers.

A landmark element appears in the output with its role but no recursive text. Its children carry the actual content as separate elements. This prevents the entire navigation bar's text from being duplicated into a single massive nav element.

{"id": 1, "tag": "nav", "role": "navigation", "b": [0, 0, 1920, 60]},
{"id": 2, "tag": "a", "role": "link", "text": "Home", "href": "/", "b": [20, 10, 80, 40]},
{"id": 3, "tag": "a", "role": "link", "text": "About", "href": "/about", "b": [120, 10, 80, 40]}

Element lookup

The SpatialDom maintains an internal HashMap<u32, usize> index for O(1) element lookup by ID:

// O(1) -- does not scan the element list
let element = dom.get(42);

The index is built automatically during parsing and can be rebuilt after mutation:

dom.els.push(new_element);
dom.rebuild_index();
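
To see why the rebuild step matters, here is a minimal standalone sketch of an ID-to-position index. The `Dom` and `Element` types below are hypothetical stand-ins that mirror the names used in this section; browsy's actual internals may differ.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for SpatialElement, reduced to two fields.
#[derive(Debug)]
struct Element {
    id: u32,
    tag: &'static str,
}

// Hypothetical stand-in for SpatialDom: a flat list plus an id -> position index.
struct Dom {
    els: Vec<Element>,
    index: HashMap<u32, usize>,
}

impl Dom {
    fn new(els: Vec<Element>) -> Self {
        let mut dom = Dom { els, index: HashMap::new() };
        dom.rebuild_index();
        dom
    }

    /// Recompute the id -> position map from the current element list.
    fn rebuild_index(&mut self) {
        self.index = self
            .els
            .iter()
            .enumerate()
            .map(|(pos, el)| (el.id, pos))
            .collect();
    }

    /// Hash lookup followed by a direct slice access; no scan of `els`.
    fn get(&self, id: u32) -> Option<&Element> {
        let pos = *self.index.get(&id)?;
        self.els.get(pos)
    }
}

fn main() {
    let mut dom = Dom::new(vec![
        Element { id: 1, tag: "h1" },
        Element { id: 2, tag: "a" },
    ]);
    dom.els.push(Element { id: 3, tag: "input" });
    println!("before rebuild: {:?}", dom.get(3).map(|e| e.tag));
    dom.rebuild_index();
    println!("after rebuild: {:?}", dom.get(3).map(|e| e.tag));
}
```

In this sketch an element pushed directly onto the list is invisible to `get` until the index is rebuilt, which is the situation the rebuild call is meant to cover.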

Filtering

// Only visible (non-hidden) elements
let visible: Vec<&SpatialElement> = dom.visible();

// Elements whose top edge is within the viewport
let above: Vec<&SpatialElement> = dom.above_fold();

// Elements whose top edge is below the viewport
let below: Vec<&SpatialElement> = dom.below_fold();

// New SpatialDom containing only above-fold elements (for token-limited contexts)
let trimmed: SpatialDom = dom.filter_above_fold();

The fold line is determined by dom.vp[1] (viewport height, default 1080px).
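
As a rough illustration of the fold test, the sketch below splits elements by comparing the top edge (`b[1]`) against the viewport height. The `Element` type and `split_at_fold` function are hypothetical, and treating an element whose top edge sits exactly on the fold as below it is an assumption; the document does not specify the boundary case.

```rust
// Hypothetical stand-in for SpatialElement: b = [x, y, width, height].
#[derive(Debug)]
struct Element {
    id: u32,
    b: [i32; 4],
}

/// Split elements into (above fold, below fold) by their top edge.
/// Assumption: a top edge exactly at `viewport_height` counts as below.
fn split_at_fold(els: &[Element], viewport_height: f32) -> (Vec<&Element>, Vec<&Element>) {
    els.iter()
        .partition(|el| (el.b[1] as f32) < viewport_height)
}

fn main() {
    let els = vec![
        Element { id: 1, b: [0, 0, 1920, 60] },
        Element { id: 2, b: [0, 1079, 400, 40] },
        Element { id: 3, b: [0, 1080, 400, 40] },
        Element { id: 4, b: [0, 2400, 400, 40] },
    ];
    let (above, below) = split_at_fold(&els, 1080.0);
    println!("above: {:?}", above.iter().map(|e| e.id).collect::<Vec<_>>());
    println!("below: {:?}", below.iter().map(|e| e.id).collect::<Vec<_>>());
}
```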

Tables

dom.tables() extracts structured table data by grouping th and td elements by their Y coordinates:

let tables: Vec<TableData> = dom.tables();
for table in &tables {
    println!("Headers: {:?}", table.headers);   // Vec<String>
    for row in &table.rows {
        println!("Row: {:?}", row);              // Vec<String>
    }
}

Elements within 5px of the same Y coordinate are grouped into the same row. Cells are sorted left-to-right by X position within each row.
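
The grouping rule can be sketched without browsy itself. The code below is an illustrative approximation, not the library's algorithm: the `Cell` type and helper names are hypothetical, and it assumes each cell is compared against the Y coordinate of the first cell in the current row.

```rust
// Hypothetical cell: text plus the top-left corner of its bounding box.
#[derive(Debug, Clone)]
struct Cell {
    text: String,
    x: i32,
    y: i32,
}

const ROW_TOLERANCE_PX: i32 = 5;

fn cell(text: &str, x: i32, y: i32) -> Cell {
    Cell { text: text.to_string(), x, y }
}

/// Group cells into rows by Y (within the tolerance), then order each row by X.
fn group_rows(mut cells: Vec<Cell>) -> Vec<Vec<String>> {
    cells.sort_by_key(|c| c.y);

    // Each row remembers the Y of its first cell as the anchor (an assumption).
    let mut rows: Vec<(i32, Vec<Cell>)> = Vec::new();
    for cell in cells {
        let same_row = rows
            .last()
            .map_or(false, |(row_y, _)| (cell.y - row_y).abs() <= ROW_TOLERANCE_PX);
        if same_row {
            rows.last_mut().unwrap().1.push(cell);
        } else {
            rows.push((cell.y, vec![cell]));
        }
    }

    rows.into_iter()
        .map(|(_, mut row)| {
            row.sort_by_key(|c| c.x);
            row.into_iter().map(|c| c.text).collect()
        })
        .collect()
}

fn main() {
    let rows = group_rows(vec![
        cell("B", 200, 102),
        cell("A", 10, 100),
        cell("D", 200, 143),
        cell("C", 10, 140),
    ]);
    for row in &rows {
        println!("{:?}", row);
    }
}
```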

Alerts

dom.alerts() returns elements with a detected alert_type:

let alerts: Vec<&SpatialElement> = dom.alerts();
for alert in &alerts {
    println!("{}: {}", alert.alert_type.as_deref().unwrap(), alert.text.as_deref().unwrap_or(""));
    // "error: Invalid password"
    // "success: Account created"
}

Alert types are detected from ARIA role attributes (alert, status) and CSS class patterns (alert-error, msg-danger, flash-success, etc.). Only compound class patterns are matched -- a bare error class is too ambiguous.

Verification codes

dom.find_codes() extracts 4-8 digit verification codes from page text:

let codes: Vec<String> = dom.find_codes();
// ["847291"] -- extracted from "Your verification code is 847291"

Codes are found near keyword context (verification code, security code, your code, otp, passcode, one-time). Year-like 4-digit numbers (1900-2099) are filtered out. Proximity matching also checks nearby elements within 100px Y distance for keyword context.
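
A simplified version of these rules, operating on a single text block, might look like the sketch below. It is an approximation under stated assumptions rather than browsy's implementation: the free function `find_codes` here is hypothetical, it requires the keyword to appear in the same text block, and it leaves out the 100px proximity check across neighbouring elements.

```rust
// Keyword list taken from the text above.
const KEYWORDS: [&str; 6] = [
    "verification code",
    "security code",
    "your code",
    "otp",
    "passcode",
    "one-time",
];

/// True for 4-digit runs in the 1900-2099 range, which are likely years.
fn looks_like_year(run: &str) -> bool {
    run.len() == 4 && matches!(run.parse::<u32>(), Ok(1900..=2099))
}

/// Extract 4-8 digit runs from `text`, but only if a keyword is present.
fn find_codes(text: &str) -> Vec<String> {
    let lower = text.to_lowercase();
    if !KEYWORDS.iter().any(|k| lower.contains(*k)) {
        return Vec::new();
    }
    text.split(|c: char| !c.is_ascii_digit()) // maximal digit runs
        .filter(|run| (4..=8).contains(&run.len()))
        .filter(|run| !looks_like_year(run))
        .map(String::from)
        .collect()
}

fn main() {
    println!("{:?}", find_codes("Your verification code is 847291"));
    println!("{:?}", find_codes("Copyright 2024. Your code is 5731"));
    println!("{:?}", find_codes("Order 123456 has shipped"));
}
```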

Text fallback chain

For interactive elements (links, buttons) that contain no direct text -- only images or icons -- browsy walks a fallback chain to find meaningful text:

  1. aria-label attribute
  2. title attribute
  3. Child <img> alt text
  4. Child <svg> <title> text

This ensures that icon-only buttons and image links always have text for the agent to read.
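
The chain maps naturally onto `Option::or_else`. The sketch below is illustrative only: the `IconControl` struct is hypothetical, and treating whitespace-only attribute values as missing is an assumption not stated in the document.

```rust
// Hypothetical view of an icon-only link or button: the four fallback sources.
#[derive(Default)]
struct IconControl {
    aria_label: Option<String>,
    title: Option<String>,
    img_alt: Option<String>,
    svg_title: Option<String>,
}

/// Treat missing and whitespace-only values the same way (an assumption).
fn non_empty(value: &Option<String>) -> Option<&str> {
    value.as_deref().map(str::trim).filter(|s| !s.is_empty())
}

/// Walk the fallback chain in the documented order; the first hit wins.
fn fallback_text(el: &IconControl) -> Option<&str> {
    non_empty(&el.aria_label)
        .or_else(|| non_empty(&el.title))
        .or_else(|| non_empty(&el.img_alt))
        .or_else(|| non_empty(&el.svg_title))
}

fn main() {
    let close = IconControl {
        title: Some("Close".into()),
        img_alt: Some("x icon".into()),
        ..Default::default()
    };
    // title comes before the image alt text in the chain
    println!("{:?}", fallback_text(&close));
}
```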

Page Intelligence

Page intelligence is browsy's deterministic classification layer. Given a Spatial DOM, browsy computes a page type and a set of suggested actions (action recipes) -- each with concrete element IDs that agents can use directly. No LLM inference, no probabilistic guessing.

let dom = session.goto("https://github.com/login")?;

assert_eq!(dom.page_type, PageType::Login);
// dom.suggested_actions[0] == Login { username_id: 19, password_id: 21, submit_id: 34 }

Page types

browsy classifies pages into one of 14 types, detected via priority-ordered heuristics applied to the Spatial DOM. The first matching rule wins.

| Page Type | Detection Signal |
|---|---|
| Error | Alert elements with alert_type == "error", or title contains 404, 500, 403, not found, error |
| Captcha | CAPTCHA service detected in HTML (reCAPTCHA, hCaptcha, Turnstile), or title/heading contains captcha, verify you're human, just a moment |
| Login | Visible password input field present |
| TwoFactorAuth | Title/heading contains verification keywords (verification, 2fa, otp, one-time, passcode) AND a visible text/number/tel input exists |
| OAuthConsent | Title/heading contains authorize, allow access, grant permission, oauth, consent |
| Inbox | Title contains inbox, mail, messages AND page has 10+ links |
| EmailBody | 3+ email markers present in element text (from:, to:, subject:, date:) |
| Dashboard | Title/heading contains dashboard, welcome back, overview AND both nav and main landmarks exist |
| Article | 3+ headings AND 2+ long paragraphs (>100 chars). When link count >= 20, requires 10+ long paragraphs. Heading-heavy pages (15+ headings with low paragraph ratio) are excluded |
| SearchResults | Search input present AND 8+ links AND (title/heading contains search results/results for OR URL contains search query params like ?q=) |
| List | 10+ visible links |
| Search | Visible search input (type search, role searchbox, name q, or placeholder/name containing search) |
| Form | 2+ visible data-entry inputs (excludes checkbox, radio, hidden, submit, button, image) |
| Other | No other type matched |

Detection order matters. A page with a password field and a search bar is classified as Login, not Search, because Login is checked first.
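
The first-match-wins behaviour can be illustrated with an ordered rule table. This is a reduced sketch, not browsy's detector: it covers only three of the documented rules, and the `Signals` struct, the rule functions, and the `RULES` table are hypothetical names introduced for the example.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum PageType {
    Login,
    List,
    Search,
    Other,
}

// Hypothetical pre-computed signals; a real detector would derive these from the DOM.
struct Signals {
    visible_password_input: bool,
    visible_links: usize,
    search_input: bool,
}

fn is_login(s: &Signals) -> bool {
    s.visible_password_input
}
fn is_list(s: &Signals) -> bool {
    s.visible_links >= 10
}
fn is_search(s: &Signals) -> bool {
    s.search_input
}

// Order encodes priority: earlier rules shadow later ones.
const RULES: [(PageType, fn(&Signals) -> bool); 3] = [
    (PageType::Login, is_login),
    (PageType::List, is_list),
    (PageType::Search, is_search),
];

/// Return the type of the first rule that matches, or Other if none do.
fn detect_page_type(s: &Signals) -> PageType {
    for (page_type, rule) in RULES.iter() {
        if rule(s) {
            return *page_type;
        }
    }
    PageType::Other
}

fn main() {
    // A page with both a password field and a search bar.
    let page = Signals { visible_password_input: true, visible_links: 4, search_input: true };
    println!("{:?}", detect_page_type(&page)); // Login is checked before Search
}
```

Keeping the rules in one ordered table makes the priority explicit: reordering the table is the only way to change which type wins when several rules match.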

Action recipes

Alongside page type, browsy detects suggested actions -- structured recipes telling the agent exactly what to do and which element IDs to use. Each action maps directly to Session API calls.

Login

Detected when a visible password input exists near a text/email input.

{
  "action": "Login",
  "username_id": 19,
  "password_id": 21,
  "submit_id": 34,
  "remember_me_id": 36
}

Agent usage: session.type_text(19, "user@example.com"), session.type_text(21, "pass"), session.click(34). Or simply: session.login("user@example.com", "pass").

Register

Detected when a password field is accompanied by a confirm-password field or registration keywords in the title/heading. Login takes priority when both login and registration sections are present on the same page.

{
  "action": "Register",
  "email_id": 12,
  "username_id": 14,
  "password_id": 16,
  "confirm_password_id": 18,
  "name_id": 10,
  "submit_id": 22
}

EnterCode

Detected on verification/2FA pages with code-related keywords in the title or heading.

{
  "action": "EnterCode",
  "input_id": 8,
  "submit_id": 12,
  "code_length": 6
}

code_length is set when the page uses separate narrow digit inputs (4-8 inputs each <60px wide).
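
As a small sketch of that rule, the function below counts narrow inputs and reports a length only when the count falls in the 4-8 range. The function name and the idea of passing in a list of input widths are assumptions made for illustration; the document does not describe how browsy gathers the candidate inputs.

```rust
/// Given the pixel widths of the visible inputs on a code-entry page, return
/// the code length if the page appears to use one narrow box per digit.
fn detect_code_length(input_widths: &[i32]) -> Option<usize> {
    // "Narrow" means under 60px wide, per the rule described above.
    let narrow = input_widths.iter().filter(|&&w| w < 60).count();
    if (4..=8).contains(&narrow) {
        Some(narrow)
    } else {
        None
    }
}

fn main() {
    println!("{:?}", detect_code_length(&[48, 48, 48, 48, 48, 48])); // six digit boxes
    println!("{:?}", detect_code_length(&[320])); // one wide input: no length reported
}
```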

Search

Detected when an input has type search, role searchbox, name q, or a name/placeholder containing search.

{
  "action": "Search",
  "input_id": 5,
  "submit_id": 7
}

Consent

Detected on OAuth/authorization pages with approve/deny buttons.

{
  "action": "Consent",
  "approve_ids": [15, 18],
  "deny_ids": [20]
}

CookieConsent

Detected when a substantial text block mentions cookies/GDPR and accept/reject buttons are present.

{
  "action": "CookieConsent",
  "accept_id": 42,
  "reject_id": 44
}

Contact

Detected on pages with contact-related keywords and a visible textarea for the message body.

{
  "action": "Contact",
  "name_id": 5,
  "email_id": 7,
  "message_id": 9,
  "submit_id": 11
}

FillForm

Generic form detection. Emitted when visible form fields exist and no more specific action (Login, Register, Contact) matched. Includes labeled field metadata.

{
  "action": "FillForm",
  "fields": [
    {"id": 10, "label": "First Name", "name": "first_name", "type": "text"},
    {"id": 12, "label": "Email", "name": "email", "type": "email"}
  ],
  "submit_id": 20
}

SelectFromList

Detected when 5+ links are arranged in distinct vertical rows (list-like layout).

{
  "action": "SelectFromList",
  "items": [3, 8, 13, 18, 23]
}

Paginate

Detected when next/previous navigation links are found (text matching next, previous, >, >>, etc.).

{
  "action": "Paginate",
  "next_id": 95,
  "prev_id": 91
}

Download

Detected when links point to downloadable file types.

{
  "action": "Download",
  "items": [{"id": 30, "text": "Report Q4 2024", "href": "/files/report.pdf"}]
}

CaptchaChallenge

Detected when a CAPTCHA service is found in the HTML structure.

{
  "action": "CaptchaChallenge",
  "captcha_type": "ReCaptcha",
  "sitekey": "6LcXxxAAAABBBCCC...",
  "submit_id": 50
}

CAPTCHA detection

browsy identifies CAPTCHA services by scanning the HTML structure for known markers:

| Type | Detection |
|---|---|
| ReCaptcha | g-recaptcha class, data-sitekey attr, reCAPTCHA script URLs |
| HCaptcha | h-captcha class, hCaptcha script URLs |
| Turnstile | cf-turnstile class, Turnstile script URLs |
| CloudflareChallenge | Cloudflare "Just a moment..." challenge page pattern |
| ImageGrid | Custom image-grid CAPTCHA (select matching images) |
| TextCaptcha | Text-based CAPTCHA (type characters from an image) |
| Unknown | CAPTCHA detected but service not identified |

CAPTCHA info is available at dom.captcha:

if let Some(captcha) = &dom.captcha {
    println!("Type: {:?}", captcha.captcha_type);     // CaptchaType::ReCaptcha
    println!("Sitekey: {:?}", captcha.sitekey);        // Some("6Lc...")
}

How detection works

All detection is deterministic and heuristic-based, and page type checks are priority-ordered. No machine learning models, no token costs. The same HTML always produces the same page type and action set.

The detection pipeline:

  1. Parse HTML into the Spatial DOM (element list with bounding boxes and roles)
  2. Scan for CAPTCHA markers in the layout tree
  3. Run detect_page_type -- walks through page type checks in priority order, returns the first match
  4. Run detect_suggested_actions -- runs all action detectors independently, collecting all that match

Multiple actions can coexist. A login page might have both Login and CookieConsent actions. A search results page might have Search, SelectFromList, and Paginate.

Example flow

use browsy_core::fetch::Session;
use browsy_core::output::PageType;

let mut session = Session::new()?;
let dom = session.goto("https://example.com/login")?;

match dom.page_type {
    PageType::Login => {
        // Use the Login action recipe directly
        session.login("user@example.com", "hunter2")?;
    }
    PageType::TwoFactorAuth => {
        session.enter_code("847291")?;
    }
    PageType::Captcha => {
        let info = session.captcha_info();
        // Report to the caller -- browsy cannot solve CAPTCHAs
    }
    _ => {
        // Read the page content, follow links, etc.
    }
}

Session API

The Session API provides stateful web browsing with cookie persistence, form interaction, navigation history, and built-in web search. It is the primary interface for agents interacting with the web through browsy.

use browsy_core::fetch::Session;

let mut session = Session::new()?;
let dom = session.goto("https://example.com")?;

Requires the fetch feature (enabled by default).

Creating a session

Session::new()

Creates a session with default configuration (1920x1080 viewport, 30s timeout, CSS fetching enabled).

let mut session = Session::new()?;

Session::with_config(config)

Creates a session with custom configuration.

use browsy_core::fetch::{Session, SessionConfig};

let config = SessionConfig {
    viewport_width: 1366.0,
    viewport_height: 768.0,
    timeout_secs: 15,
    fetch_css: false,  // Skip external CSS for speed
    ..Default::default()
};
let mut session = Session::with_config(config)?;

SessionConfig fields

| Field | Type | Default | Description |
|---|---|---|---|
| viewport_width | f32 | 1920.0 | Viewport width in pixels. Affects layout computation and fold detection |
| viewport_height | f32 | 1080.0 | Viewport height in pixels. Defines the fold line |
| user_agent | String | Chrome-like UA | HTTP User-Agent header |
| timeout_secs | u64 | 30 | HTTP request timeout |
| fetch_css | bool | true | Whether to fetch external CSS stylesheets. Disabling speeds up parsing but reduces layout accuracy |
| blocked_patterns | Vec<String> | Analytics/tracking URLs | URL patterns to block (analytics, ads, tracking pixels) |
| max_response_bytes | usize | 5MB | Maximum HTML response size |
| max_css_bytes_total | usize | 2MB | Maximum total CSS bytes across all stylesheets |
| max_css_bytes_per_file | usize | 512KB | Maximum size per individual CSS file |
| max_redirects | usize | 10 | Maximum HTTP redirect chain length |
| allow_private_network | bool | false | Whether to allow requests to private/internal IPs |
| allow_non_http | bool | false | Whether to allow non-HTTP(S) schemes |

goto(url) -> Result<SpatialDom, FetchError>

Navigate to a URL. Fetches the page, parses HTML, optionally fetches external CSS, computes layout, and returns the Spatial DOM. Cookies are persisted automatically.

let dom = session.goto("https://news.ycombinator.com")?;
println!("Title: {}", dom.title);
println!("Elements: {}", dom.els.len());

back() -> Result<SpatialDom, FetchError>

Navigate to the previous page in history. Returns an error if there is no history.

session.goto("https://example.com")?;
session.goto("https://example.com/about")?;
let dom = session.back()?;  // Back to example.com

url() -> Option<&str>

Returns the current page URL.

if let Some(url) = session.url() {
    println!("Currently at: {}", url);
}

Interaction

click(id) -> Result<SpatialDom, FetchError>

Click an element by ID. Behavior depends on the element type:

  • Links (<a>) -- navigates to the href URL. Skips javascript:, mailto:, tel:, and anchor-only (#) links.
  • Buttons / submit inputs -- submits the parent form with all current form values.
  • Elements with JS behaviors -- simulated. onclick handlers with window.location trigger navigation. Toggle/show/hide behaviors modify the DOM.

let dom = session.goto("https://news.ycombinator.com")?;
// Click the first link
let dom = session.click(3)?;
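
The link-skipping rule above can be sketched in a few lines of standard-library Rust. This is only an illustration of the described behavior, not browsy's actual code: the function name is_navigable_href and the case-insensitive scheme comparison are assumptions.

```rust
/// Returns true when clicking a link with this href should trigger navigation.
/// Hypothetical helper: mirrors the documented skip list, nothing more.
fn is_navigable_href(href: &str) -> bool {
    let trimmed = href.trim();
    // Anchor-only links stay on the current page.
    if trimmed.starts_with('#') {
        return false;
    }
    // Assumption: schemes are compared case-insensitively.
    let lower = trimmed.to_ascii_lowercase();
    !["javascript:", "mailto:", "tel:"]
        .iter()
        .any(|scheme| lower.starts_with(scheme))
}

fn main() {
    let samples = ["/about", "#top", "javascript:void(0)", "mailto:a@b.c", "https://example.com"];
    for href in samples {
        println!("{:<24} navigable: {}", href, is_navigable_href(href));
    }
}
```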

type_text(id, text) -> Result<(), FetchError>

Type text into an input or textarea. The value is stored in the session and overlaid onto the DOM. When a form is submitted via click, these values are included in the form data.

session.type_text(19, "user@example.com")?;
session.type_text(21, "hunter2")?;

Returns an error if the element is not an input or textarea.

check(id) -> Result<(), FetchError>

Check a checkbox or radio button.

session.check(36)?;  // Check "Remember me"

uncheck(id) -> Result<(), FetchError>

Uncheck a checkbox or radio button.

session.uncheck(36)?;

toggle(id) -> Result<(), FetchError>

Toggle a checkbox or radio button based on its current effective state (considering session overrides and HTML defaults).

session.toggle(36)?;  // If checked, unchecks. If unchecked, checks.
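
To make "effective state" concrete, here is a minimal sketch of a session override layered over the HTML default. FormState and its fields are hypothetical stand-ins for illustration; the source does not describe how browsy stores this state internally.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for session form state: overrides recorded by
/// check/uncheck/toggle, layered over the checked state parsed from HTML.
struct FormState {
    html_checked: HashMap<u32, bool>,
    overrides: HashMap<u32, bool>,
}

impl FormState {
    /// Effective state: a session override wins, otherwise the HTML default.
    fn is_checked(&self, id: u32) -> bool {
        self.overrides
            .get(&id)
            .or_else(|| self.html_checked.get(&id))
            .copied()
            .unwrap_or(false)
    }

    /// Flip the effective state and record the result as an override.
    fn toggle(&mut self, id: u32) -> bool {
        let next = !self.is_checked(id);
        self.overrides.insert(id, next);
        next
    }
}

fn main() {
    let mut state = FormState {
        html_checked: HashMap::from([(36, true)]), // checkbox rendered as checked
        overrides: HashMap::new(),
    };
    println!("after first toggle: {}", state.toggle(36));
    println!("after second toggle: {}", state.toggle(36));
}
```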

select(id, value) -> Result<(), FetchError>

Select an option in a <select> element by value.

session.select(15, "california")?;

Reading page state

dom() -> Option<SpatialDom>

Returns the current Spatial DOM with form state overlaid. Typed values, checked/unchecked states from type_text, check, and uncheck are reflected in the returned DOM.

session.type_text(19, "hello")?;
let dom = session.dom().unwrap();
let el = dom.get(19).unwrap();
assert_eq!(el.val.as_deref(), Some("hello"));

dom_ref() -> Option<&SpatialDom>

Returns a reference to the raw Spatial DOM without form state overlay. Reflects the page as parsed, ignoring any type_text/check/uncheck calls.

let raw = session.dom_ref().unwrap();

delta() -> Option<DeltaDom>

Returns the diff between the current and previous page. Only available after at least two navigations.

session.goto("https://example.com")?;
session.goto("https://example.com/about")?;
if let Some(delta) = session.delta() {
    println!("Added/changed: {}", delta.changed.len());
    println!("Removed IDs: {:?}", delta.removed);
}

element(id) -> Option<&SpatialElement>

O(1) element lookup by ID.

if let Some(el) = session.element(42) {
    println!("{}: {}", el.tag, el.text.as_deref().unwrap_or(""));
}

Finding elements

find_by_text(text) -> Vec<&SpatialElement>

Exact substring match on element text (case-sensitive).

let results = session.find_by_text("Sign in");

find_by_text_fuzzy(text) -> Vec<&SpatialElement>

Case-insensitive substring match on element text.

let results = session.find_by_text_fuzzy("sign in");
// Matches "Sign In", "SIGN IN", "Please sign in", etc.

find_by_role(role) -> Vec<&SpatialElement>

Find all elements with a specific ARIA role.

let headings = session.find_by_role("heading");
let links = session.find_by_role("link");
let buttons = session.find_by_role("button");

find_input_by_purpose(purpose) -> Option<&SpatialElement>

Find an input element by its semantic purpose. Matches on input type, name, label, and placeholder.

use browsy_core::fetch::InputPurpose;

let password = session.find_input_by_purpose(InputPurpose::Password);
let email = session.find_input_by_purpose(InputPurpose::Email);
let username = session.find_input_by_purpose(InputPurpose::Username);
let code = session.find_input_by_purpose(InputPurpose::VerificationCode);
let search = session.find_input_by_purpose(InputPurpose::Search);
let phone = session.find_input_by_purpose(InputPurpose::Phone);

| Purpose | Matching logic |
|---|---|
| Password | input[type="password"] |
| Email | input[type="email"] or name/label contains email |
| Username | Text/email input with name/label containing user or login |
| VerificationCode | Text/number/tel input with name/label/placeholder containing code, otp, or verify |
| Search | input[type="search"], role searchbox, or name containing search |
| Phone | input[type="tel"] or name/label containing phone |
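
The matching table can be read as one predicate per purpose. The sketch below encodes those rows using local stand-in types (Purpose, Input, has_purpose are hypothetical names, not browsy_core items) and assumes the substring checks are case-insensitive, which the table does not state.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Purpose { Password, Email, Username, VerificationCode, Search, Phone }

/// Hypothetical, simplified view of an input element.
struct Input<'a> {
    input_type: &'a str,
    role: &'a str,
    name: &'a str,
    label: &'a str,
    placeholder: &'a str,
}

/// True if any field contains any needle (case-insensitive by assumption).
fn contains_any(fields: &[&str], needles: &[&str]) -> bool {
    fields.iter().any(|f| {
        let f = f.to_ascii_lowercase();
        needles.iter().any(|n| f.contains(n))
    })
}

/// One match arm per row of the table above.
fn has_purpose(i: &Input, purpose: Purpose) -> bool {
    let ty = i.input_type;
    match purpose {
        Purpose::Password => ty == "password",
        Purpose::Email => ty == "email" || contains_any(&[i.name, i.label], &["email"]),
        Purpose::Username => {
            matches!(ty, "text" | "email") && contains_any(&[i.name, i.label], &["user", "login"])
        }
        Purpose::VerificationCode => {
            matches!(ty, "text" | "number" | "tel")
                && contains_any(&[i.name, i.label, i.placeholder], &["code", "otp", "verify"])
        }
        Purpose::Search => {
            ty == "search" || i.role == "searchbox" || contains_any(&[i.name], &["search"])
        }
        Purpose::Phone => ty == "tel" || contains_any(&[i.name, i.label], &["phone"]),
    }
}

fn main() {
    let login = Input {
        input_type: "text",
        role: "textbox",
        name: "login",
        label: "Username or email address",
        placeholder: "",
    };
    println!("username? {}", has_purpose(&login, Purpose::Username));
    println!("password? {}", has_purpose(&login, Purpose::Password));
}
```

Note that a single input can satisfy more than one purpose under these rules (the field above matches both Username and Email), so callers should ask for the purpose they need rather than expect a unique classification.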

find_nearest_button(input_id) -> Option<&SpatialElement>

Find the nearest submit button to a given input element. Prefers buttons below the input, scored by Manhattan distance with Y weighted 2x.

if let Some(btn) = session.find_nearest_button(19) {
    println!("Submit button: {} (id: {})", btn.text.as_deref().unwrap_or(""), btn.id);
}
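
As a rough illustration of the scoring described above, the sketch below computes a Manhattan distance between element centers with the vertical component doubled. Several details are assumptions, since the source only names the metric: distances are measured center to center, and "prefers buttons below" is modeled by considering buttons above the input only when none lie below it. Rect and nearest_button are hypothetical names.

```rust
#[derive(Debug, Clone, Copy)]
struct Rect { x: f32, y: f32, w: f32, h: f32 }

impl Rect {
    fn center(&self) -> (f32, f32) {
        (self.x + self.w / 2.0, self.y + self.h / 2.0)
    }
}

/// Manhattan distance between centers, with the Y component weighted 2x.
fn score(input: &Rect, button: &Rect) -> f32 {
    let (ix, iy) = input.center();
    let (bx, by) = button.center();
    (bx - ix).abs() + 2.0 * (by - iy).abs()
}

/// Lowest-scoring button below the input; falls back to buttons above it.
fn nearest_button(input: &Rect, buttons: &[(u32, Rect)]) -> Option<u32> {
    let best = |below: bool| {
        buttons
            .iter()
            .filter(|(_, b)| (b.y >= input.y) == below)
            .min_by(|a, b| score(input, &a.1).total_cmp(&score(input, &b.1)))
            .map(|(id, _)| *id)
    };
    best(true).or_else(|| best(false))
}

fn main() {
    let input = Rect { x: 480.0, y: 320.0, w: 960.0, h: 40.0 };
    let buttons = [
        (2, Rect { x: 1800.0, y: 10.0, w: 100.0, h: 40.0 }),  // nav button above the input
        (34, Rect { x: 480.0, y: 440.0, w: 960.0, h: 44.0 }), // submit button below it
    ];
    println!("nearest: {:?}", nearest_button(&input, &buttons));
}
```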

Compound actions

These methods combine multiple interactions into a single call, using the page intelligence action recipes.

login(username, password) -> Result<SpatialDom, FetchError>

Detects the login form from suggested_actions, fills in credentials, and submits. Returns the resulting page.

let dom = session.goto("https://github.com/login")?;
let result = session.login("user@example.com", "hunter2")?;

Returns an error if no Login action recipe was detected on the current page.

enter_code(code) -> Result<SpatialDom, FetchError>

Fills in a verification code and submits the form, using the EnterCode action recipe.

let result = session.enter_code("847291")?;

find_verification_code() -> Option<String>

Extracts a verification code from the current page text (4-8 digit sequences near code-related keywords).

// On a page that says "Your verification code is 847291"
if let Some(code) = session.find_verification_code() {
    session.enter_code(&code)?;
}
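
The extraction heuristic can be approximated without any dependencies. The sketch below scans for runs of 4 to 8 ASCII digits and accepts the first run with a keyword nearby. The keyword list and the 40-byte context window are assumptions chosen for illustration; the source does not specify either.

```rust
/// Assumed keyword list and context window; not taken from browsy.
const KEYWORDS: [&str; 3] = ["code", "otp", "verif"];
const WINDOW: usize = 40;

/// Returns the first 4-8 digit run that has a keyword within WINDOW bytes.
fn find_code(text: &str) -> Option<String> {
    // ASCII lowercasing keeps byte offsets identical to the input.
    let lower = text.to_ascii_lowercase();
    let bytes = lower.as_bytes();
    let mut i = 0;
    while i < bytes.len() {
        if !bytes[i].is_ascii_digit() {
            i += 1;
            continue;
        }
        // Consume the whole digit run so longer numbers are not split.
        let start = i;
        while i < bytes.len() && bytes[i].is_ascii_digit() {
            i += 1;
        }
        if (4..=8).contains(&(i - start)) {
            let ctx = &bytes[start.saturating_sub(WINDOW)..(i + WINDOW).min(bytes.len())];
            let near_keyword = KEYWORDS
                .iter()
                .any(|k| ctx.windows(k.len()).any(|w| w == k.as_bytes()));
            if near_keyword {
                return Some(lower[start..i].to_string());
            }
        }
    }
    None
}

fn main() {
    println!("{:?}", find_code("Your verification code is 847291"));
    println!("{:?}", find_code("Order 12345 shipped in 2024"));
}
```

A proximity window like this can misfire, for example on a year that happens to sit next to the word "code", so treat the result as a candidate rather than a guarantee.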

CAPTCHA detection

is_captcha() -> bool

Returns true if the current page is classified as a CAPTCHA challenge.

if session.is_captcha() {
    println!("CAPTCHA detected -- cannot proceed automatically");
}

captcha_info() -> Option<&CaptchaInfo>

Returns CAPTCHA details if detected: captcha_type (ReCaptcha, HCaptcha, Turnstile, CloudflareChallenge, ImageGrid, TextCaptcha, Unknown) and optional sitekey.

if let Some(info) = session.captcha_info() {
    match info.captcha_type {
        CaptchaType::ReCaptcha => {
            println!("reCAPTCHA sitekey: {:?}", info.sitekey);
        }
        CaptchaType::CloudflareChallenge => {
            println!("Cloudflare challenge -- wait and retry");
        }
        _ => {}
    }
}

search(query) -> Result<Vec<SearchResult>, FetchError>

Search the web using DuckDuckGo. Returns structured results with title, URL, and snippet.

let results = session.search("rust programming language")?;
for r in &results {
    println!("{}: {} -- {}", r.title, r.url, r.snippet);
}

search_with(query, engine) -> Result<Vec<SearchResult>, FetchError>

Search with a specific engine.

use browsy_core::fetch::SearchEngine;

let results = session.search_with("browsy", SearchEngine::Google)?;

Available engines: SearchEngine::DuckDuckGo (default, most reliable) and SearchEngine::Google (may return CAPTCHAs for automated requests).

search_and_read(query, n) -> Result<Vec<SearchPage>, FetchError>

Search and fetch the top N results, returning each page's Spatial DOM alongside the search result metadata.

let pages = session.search_and_read("rust web scraping", 3)?;
for page in &pages {
    println!("{}:", page.result.title);
    if let Some(ref dom) = page.dom {
        println!("  {} elements, page_type: {:?}", dom.els.len(), dom.page_type);
    }
}

Behaviors

behaviors() -> Vec<JsBehavior>

Detects JavaScript behaviors from HTML attributes (onclick, data-toggle, data-bs-toggle, etc.). Returns trigger element IDs and inferred actions.

let behaviors = session.behaviors();
for b in &behaviors {
    println!("Element {} triggers {:?}", b.trigger_id, b.action);
}

Error handling

All fallible methods return Result<_, FetchError>. Error variants:

| Variant | Cause |
|---|---|
| FetchError::InvalidUrl(msg) | URL could not be parsed |
| FetchError::BlockedUrl(url) | URL matched a blocked pattern or is a private network address |
| FetchError::Network(msg) | HTTP request failed (timeout, DNS, connection refused) |
| FetchError::HttpError(status) | Non-2xx HTTP status code |
| FetchError::ResponseTooLarge(size, max) | Response exceeded max_response_bytes |
| FetchError::ActionError(msg) | Invalid interaction (element not found, wrong element type, no page loaded) |

Output Formats

browsy supports three output formats for the Spatial DOM: JSON (full fidelity), compact (minimal tokens), and delta (changes only). The choice depends on your token budget and whether you need machine-readable structure or LLM-friendly brevity.

JSON format

The full SpatialDom serialized as JSON. Every field, every element, complete fidelity.

let json = serde_json::to_string_pretty(&dom)?;

{
  "url": "https://example.com",
  "title": "Example",
  "vp": [1920.0, 1080.0],
  "scroll": [0.0, 0.0],
  "page_type": "Login",
  "suggested_actions": [
    {
      "action": "Login",
      "username_id": 19,
      "password_id": 21,
      "submit_id": 34
    }
  ],
  "els": [
    {
      "id": 1,
      "tag": "nav",
      "role": "navigation",
      "b": [0, 0, 1920, 60]
    },
    {
      "id": 19,
      "tag": "input",
      "role": "textbox",
      "ph": "Username or email address",
      "type": "text",
      "name": "login",
      "label": "Username or email address",
      "b": [480, 320, 960, 40]
    },
    {
      "id": 21,
      "tag": "input",
      "role": "textbox",
      "ph": "Password",
      "type": "password",
      "name": "password",
      "label": "Password",
      "b": [480, 380, 960, 40]
    },
    {
      "id": 34,
      "tag": "button",
      "role": "button",
      "text": "Sign in",
      "b": [480, 440, 960, 44]
    }
  ]
}

Optional fields (text, href, ph, val, name, label, input_type, hidden, checked, disabled, expanded, selected, required, alert_type) are omitted when absent, keeping the JSON compact. The page_type field is omitted when it is Other. The captcha field is omitted when no CAPTCHA is detected.

Use JSON when you need programmatic access to the full DOM structure, or when feeding the output to code rather than an LLM.

Compact format

A one-line-per-element text format designed for minimal token usage. This is the default output format in the MCP server and CLI.

use browsy_core::output::to_compact_string;

let compact = to_compact_string(&dom);

Each element is rendered as a bracketed line:

[id:tag "text" ->href]

Full example output:

[1:nav]
[5:h1 "Welcome"]
[19:input [login] "Username or email address" wide]
[21:input:password [password] "Password" wide]
[!25:a "Forgot password?" ->/reset]
[34:button "Sign in" wide]
[40:a "Create an account" ->/signup @bot]

Compact format rules

Basic structure: [id:tag ...] where id is the numeric element ID and tag is the HTML tag.

Input types: Non-text input types are appended after the tag: [21:input:password ...], [30:input:checkbox ...], [35:input:email ...]. Plain text inputs omit the type suffix.

Text content: Quoted strings show the element's text or placeholder: "Sign in", "Enter your email".

Links: Destinations shown with ->: [12:a "About" ->/about].

Form field names: Shown in square brackets: [login], [password], [email].

Checked state: [v] indicates a checked checkbox or radio button.

Required state: [*] indicates a required field.

Current value: [=value] shows the current value of a form field.

Hidden elements: Prefixed with ! to distinguish from visible elements: [!25:a "Forgot password?"].

Size hints: Form elements (input, button, textarea, select) include a width classification relative to viewport:

| Hint | Meaning |
|---|---|
| narrow | Width < 15% of viewport |
| wide | Width > 50% of viewport |
| full | Width > 90% of viewport |

No hint is shown for elements between 15% and 50% of viewport width.
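
The thresholds above translate directly into a small classifier. This is a sketch rather than browsy's code: the function name size_hint is hypothetical, and it assumes the most specific hint wins where the "wide" and "full" ranges overlap.

```rust
/// Classifies an element's width relative to the viewport width.
/// Assumption: "full" takes precedence over "wide" above 90%.
fn size_hint(width: f32, viewport_width: f32) -> Option<&'static str> {
    let ratio = width / viewport_width;
    if ratio > 0.9 {
        Some("full")
    } else if ratio > 0.5 {
        Some("wide")
    } else if ratio < 0.15 {
        Some("narrow")
    } else {
        None // 15% to 50%: no hint is emitted
    }
}

fn main() {
    for width in [100.0, 400.0, 1000.0, 1920.0] {
        println!("{width}px of 1920px -> {:?}", size_hint(width, 1920.0));
    }
}
```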

Position disambiguation: When multiple elements share the same (tag, text) tuple, a position tag is appended to disambiguate: @top-L, @top, @top-R, @mid-L, @mid, @mid-R, @bot-L, @bot, @bot-R, or @below (below the fold). Position tags are only added when needed -- unique elements have no position suffix.

The viewport is divided into a 3x3 grid for classification:

+--------+--------+--------+
| top-L  |  top   | top-R  |
+--------+--------+--------+
| mid-L  |  mid   | mid-R  |
+--------+--------+--------+
| bot-L  |  bot   | bot-R  |
+--------+--------+--------+
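
A position tag can be derived from the grid with two integer divisions. The sketch below assumes the element's center point is what gets classified and that the grid cells are equal thirds of the viewport; neither detail is stated in the text, and position_tag is a hypothetical name.

```rust
/// Maps a point (assumed to be the element's center) to a position tag.
/// vp is [viewport_width, viewport_height].
fn position_tag(cx: f32, cy: f32, vp: [f32; 2]) -> String {
    // Anything at or past the viewport height is below the fold.
    if cy >= vp[1] {
        return "@below".to_string();
    }
    let row = match (cy / vp[1] * 3.0) as usize {
        0 => "top",
        1 => "mid",
        _ => "bot",
    };
    let col = match (cx / vp[0] * 3.0) as usize {
        0 => "-L",
        1 => "", // the center column carries no suffix
        _ => "-R",
    };
    format!("@{row}{col}")
}

fn main() {
    let vp = [1920.0, 1080.0];
    println!("{}", position_tag(960.0, 340.0, vp));
    println!("{}", position_tag(100.0, 1000.0, vp));
    println!("{}", position_tag(960.0, 1500.0, vp));
}
```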

Compact format header

When served through the MCP server or CLI, compact output includes a metadata header:

title: GitHub Login
url: https://github.com/login
els: 47
---
[1:nav]
[5:h1 "Sign in to GitHub"]
...

Delta format

After the first page load, subsequent navigations can use delta output -- only the elements that changed. This dramatically reduces token usage for multi-step workflows.

use browsy_core::output::{diff, delta_to_compact_string};

let delta = diff(&old_dom, &new_dom);
let compact_delta = delta_to_compact_string(&delta);

The DeltaDom struct contains:

pub struct DeltaDom {
    pub changed: Vec<SpatialElement>,  // Added or modified elements
    pub removed: Vec<u32>,             // IDs of removed elements
    pub vp: [f32; 2],                  // Viewport for size hints
}

Compact delta format uses + for added/changed elements and - for removed IDs:

-[3,7,12,15]
[+19:input "Search" wide]
[+20:button "Go"]
[+21:h2 "Results"]
[+22:a "First result" ->https://example.com]

Matching between old and new elements is done by content similarity (tag + text + placeholder + href + input type + bounds), not by ID. IDs are assigned sequentially and may differ between page loads.
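
To show why matching on content rather than ID matters, here is a simplified, self-contained diff. It is a sketch under stated assumptions: El is a hypothetical stand-in for SpatialElement, the key covers only tag, text, href, and bounds (the real key also includes placeholder and input type), and "similarity" is reduced to exact equality.

```rust
use std::collections::HashSet;

#[derive(Debug, Clone, PartialEq)]
struct El {
    id: u32,
    tag: &'static str,
    text: &'static str,
    href: &'static str,
    b: [i32; 4],
}

/// Content key used for matching; deliberately excludes the ID.
type Key = (&'static str, &'static str, &'static str, [i32; 4]);

fn key(e: &El) -> Key {
    (e.tag, e.text, e.href, e.b)
}

/// Returns (added-or-changed elements from `new`, IDs from `old` that vanished).
fn diff(old: &[El], new: &[El]) -> (Vec<El>, Vec<u32>) {
    let old_keys: HashSet<Key> = old.iter().map(key).collect();
    let new_keys: HashSet<Key> = new.iter().map(key).collect();
    let changed: Vec<El> = new
        .iter()
        .filter(|e| !old_keys.contains(&key(e)))
        .cloned()
        .collect();
    let removed: Vec<u32> = old
        .iter()
        .filter(|e| !new_keys.contains(&key(e)))
        .map(|e| e.id)
        .collect();
    (changed, removed)
}

fn main() {
    let old = [
        El { id: 1, tag: "nav", text: "", href: "", b: [0, 0, 1920, 60] },
        El { id: 3, tag: "a", text: "About", href: "/about", b: [10, 10, 80, 20] },
    ];
    // The nav bar gets a different ID on the new page but identical content.
    let new = [
        El { id: 7, tag: "nav", text: "", href: "", b: [0, 0, 1920, 60] },
        El { id: 8, tag: "h2", text: "Results", href: "", b: [10, 100, 400, 30] },
    ];
    let (changed, removed) = diff(&old, &new);
    println!("changed: {changed:?}");
    println!("removed: {removed:?}");
}
```

Because the nav bar matches by content, it is reported as neither changed nor removed even though its ID differs between the two loads.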

Using delta in the Session API

use browsy_core::fetch::Session;
use browsy_core::output::delta_to_compact_string;

let mut session = Session::new()?;
session.goto("https://example.com")?;
session.goto("https://example.com/about")?;

if let Some(delta) = session.delta() {
    let output = delta_to_compact_string(&delta);
    println!("{}", output);
}

Token comparison

Compact format uses approximately 58 characters per element on average, compared to 96-157 characters for JSON and accessibility-tree-based competitors. On a typical page with 80 elements:

| Format | Approximate tokens |
|---|---|
| Compact | ~1,200 |
| JSON | ~2,500 |
| Raw accessibility tree | ~4,000+ |

Delta format reduces this further on subsequent pages -- a navigation that changes 15 elements and removes 10 produces roughly 200 tokens instead of re-sending the full 1,200.
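
These figures are consistent with a back-of-the-envelope estimate. The sketch below assumes roughly 4 characters per token, a common rule of thumb rather than anything browsy measures, and reuses the per-element character counts quoted above.

```rust
/// Rough token estimate from element count and average characters per element.
fn estimate_tokens(elements: usize, chars_per_element: usize) -> usize {
    const CHARS_PER_TOKEN: usize = 4; // rule-of-thumb assumption, not a browsy constant
    elements * chars_per_element / CHARS_PER_TOKEN
}

fn main() {
    println!("compact, 80 elements:   ~{}", estimate_tokens(80, 58));
    println!("JSON low, 80 elements:  ~{}", estimate_tokens(80, 96));
    println!("JSON high, 80 elements: ~{}", estimate_tokens(80, 157));
    println!("delta, 15 changed:      ~{}", estimate_tokens(15, 58));
}
```

The compact estimate (1,160) and the delta estimate (217, before the short removed-ID line) land close to the ~1,200 and ~200 quoted above, and the JSON range brackets the ~2,500 figure.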

Choosing a format

| Scenario | Format |
|---|---|
| Programmatic consumption (code, not LLM) | JSON |
| LLM agent with normal context | Compact |
| LLM agent with tight token budget | Compact + filter_above_fold() |
| Multi-step browsing workflow | Compact for first page, delta for subsequent |
| Debugging / inspection | JSON |

MCP Server (Claude Code)

browsy runs as a Model Context Protocol (MCP) server, exposing its browser engine as tools that Claude Code (or any MCP client) can call directly.

Starting the server

browsy mcp

This launches browsy as a stdio-based MCP server. It creates a single persistent Session with cookie jar, navigation history, and form state.

Claude Code configuration

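Add browsy to your MCP client configuration. Claude Desktop reads claude_desktop_config.json; Claude Code accepts the same mcpServers entry in a project-level .mcp.json: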

{
  "mcpServers": {
    "browsy": {
      "command": "browsy",
      "args": ["mcp"]
    }
  }
}

The server advertises itself as browsy-mcp and exposes 14 tools.

Available tools

browse

Navigate to a URL and return the page content.

| Parameter | Type | Required | Description |
|---|---|---|---|
| url | string | yes | URL to navigate to |
| format | string | no | "compact" (default) or "json" |
| scope | string | no | "all" (default), "visible", "above_fold", or "visible_above_fold" |

Returns the full Spatial DOM. In compact format, the output begins with a header block:

title: Example Domain
url: https://example.com
els: 12
---
[1:h1 "Example Domain"]
[2:p "This domain is for use in illustrative examples..."]
[3:a "More information..." ->https://www.iana.org/domains/example]

If a CAPTCHA is detected, a warning is prepended to the output:

CAPTCHA detected (ReCaptcha) -- this page requires human verification to proceed.

click

Click an element by its ID. Links navigate to new pages, buttons submit forms.

| Parameter | Type | Required | Description |
|---|---|---|---|
| id | u32 | yes | Element ID to click |

Returns the resulting page DOM. Link clicks trigger navigation (fetching the href). Button clicks submit the enclosing form with all typed values and checked states. If a CAPTCHA is detected on the resulting page, a warning is included.

type_text

Type text into an input field or textarea by element ID.

| Parameter | Type | Required | Description |
|---|---|---|---|
| id | u32 | yes | Element ID of the text input |
| text | string | yes | Text to type into the input |

This stores the value in session state. The value is included in form submissions and reflected in subsequent get_page calls. Only works on <input> and <textarea> elements.

check

Check a checkbox or radio button by element ID.

| Parameter | Type | Required | Description |
|---|---|---|---|
| id | u32 | yes | Element ID of the checkbox or radio button |

uncheck

Uncheck a checkbox or radio button by element ID.

| Parameter | Type | Required | Description |
|---|---|---|---|
| id | u32 | yes | Element ID of the checkbox or radio button |

select

Select an option in a dropdown/select element.

| Parameter | Type | Required | Description |
|---|---|---|---|
| id | u32 | yes | Element ID of the select element |
| value | string | yes | Value to select |

get_page

Get the current page DOM with form state overlaid. Use after type_text, check, select, or uncheck to see the updated form values without re-fetching.

| Parameter | Type | Required | Description |
|---|---|---|---|
| format | string | no | "compact" (default) or "json" |
| scope | string | no | "all" (default), "visible", "above_fold", or "visible_above_fold" |

search

Search the web and return structured results with title, URL, and snippet.

| Parameter | Type | Required | Description |
|---|---|---|---|
| query | string | yes | Search query |
| engine | string | no | "duckduckgo" (default) or "google" |

Returns a JSON array of search results, each with title, url, and snippet fields.

back

Go back to the previous page in browsing history. No parameters. Returns the previous page's DOM.

login

Fill in a detected login form and submit it. Requires a page with a Login suggested action.

| Parameter | Type | Required | Description |
|---|---|---|---|
| username | string | yes | Username or email |
| password | string | yes | Password |

This is a compound action: it types the username into the detected username field, types the password into the password field, and clicks the submit button. Returns the resulting page DOM.

enter_code

Enter a verification or 2FA code into the detected code input field. Requires a page with an EnterCode suggested action.

| Parameter | Type | Required | Description |
|---|---|---|---|
| code | string | yes | Verification or 2FA code |

Types the code into the detected input and clicks submit. Returns the resulting page DOM.

find

Find elements on the current page by text content or ARIA role.

| Parameter | Type | Required | Description |
|---|---|---|---|
| text | string | no | Find elements containing this text |
| role | string | no | Find elements with this ARIA role |

At least one of text or role must be provided. Returns a JSON array of matching elements.

tables

Extract structured table data from the current page. No parameters. Returns a JSON array of tables, each with headers (string array) and rows (array of string arrays).

page_info

Get page metadata without the full element list. No parameters. Returns:

{
  "title": "Sign In - Example",
  "url": "https://example.com/login",
  "page_type": "Login",
  "suggested_actions": [
    {
      "action": "Login",
      "username_id": 5,
      "password_id": 8,
      "submit_id": 12
    }
  ],
  "alerts": [],
  "pagination": null
}

When a CAPTCHA is detected, the response includes a captcha field with captcha_type and optional sitekey.

Example conversation flow

A typical agent interaction with a login-protected site:

  1. browse https://app.example.com -- page_type is Login, suggested_actions includes Login with field IDs.
  2. login with username and password -- the agent calls login directly, which fills and submits the form.
  3. The result page might be TwoFactorAuth with an EnterCode action.
  4. enter_code with the 2FA code -- fills the code input and submits.
  5. The result page is now Dashboard -- the agent can proceed with its task.

For pages without compound actions, the lower-level tools work:

  1. browse the URL.
  2. type_text to fill form fields by ID.
  3. check or select for checkboxes and dropdowns.
  4. get_page to verify the form state looks correct.
  5. click the submit button to submit.

CAPTCHA warnings

When browse or click returns a page detected as Captcha, a warning line is prepended to the output:

CAPTCHA detected (HCaptcha) -- this page requires human verification to proceed.

The page_info tool also surfaces CAPTCHA details in a structured captcha field. browsy cannot solve CAPTCHAs -- it detects and classifies them so the agent can decide how to proceed (request human help, use a third-party solver, or try a different approach).

Output format

In compact mode (the default), elements are rendered as:

[id:tag "text"]

With additional annotations:

  • !id:tag -- hidden element (display:none, visibility:hidden, aria-hidden, or hidden attribute)
  • [name] -- HTML name attribute
  • [v] -- checked checkbox/radio
  • [*] -- required field
  • [=value] -- current value
  • ->url -- href target
  • narrow / wide / full -- size hint for form elements
  • @top-L / @mid / @bot-R -- position hint (only shown to disambiguate duplicate elements)

REST API

browsy includes a built-in HTTP server that exposes the full Session API as REST endpoints. This is the primary integration point for non-Rust, non-Python, and non-MCP clients.

Starting the server

browsy serve --port 3847

The server listens on http://localhost:3847 by default. See CLI Usage for all flags.

Session management

The server manages multiple concurrent browsing sessions. Each session has its own cookie jar, navigation history, and form state.

Sessions are identified by the X-Browsy-Session header:

| Scenario | Behavior |
|---|---|
| No X-Browsy-Session header | Server creates a new session and returns the token in the response header |
| Valid token in header | Existing session is reused |
| Invalid or expired token | Server creates a new session and returns the new token |
| Session idle > 30 minutes | Session expires and is cleaned up |
| Server at capacity (default: 100 sessions) | Returns 503 Service Unavailable |

Every response includes the X-Browsy-Session header. Clients should capture it from the first response and include it in all subsequent requests.

# First request -- capture the session token
TOKEN=$(curl -s -D- -o /dev/null http://localhost:3847/api/browse \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}' | grep -i x-browsy-session | tr -d '\r' | cut -d' ' -f2)

# Subsequent requests -- reuse the session
curl http://localhost:3847/api/page-info -H "X-Browsy-Session: $TOKEN"
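
The session rules in the table above can be modeled with a small in-memory store. This is a sketch for reasoning about the behavior, not the server's implementation: SessionStore, Resolve, and the counter-based tokens are hypothetical (a real server would issue unguessable random tokens), and it assumes a valid token is still honored when the server is at capacity.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

const IDLE_TIMEOUT: Duration = Duration::from_secs(30 * 60);

#[derive(Debug, PartialEq)]
enum Resolve {
    Existing(String), // valid token: session reused
    Created(String),  // missing, invalid, or expired token: new session
    AtCapacity,       // would map to 503 Service Unavailable
}

struct SessionStore {
    last_seen: HashMap<String, Instant>,
    capacity: usize,
    next_id: u64,
}

impl SessionStore {
    fn new(capacity: usize) -> Self {
        SessionStore { last_seen: HashMap::new(), capacity, next_id: 0 }
    }

    fn resolve(&mut self, token: Option<&str>, now: Instant) -> Resolve {
        // Drop sessions that have been idle for longer than the timeout.
        self.last_seen
            .retain(|_, seen| now.duration_since(*seen) <= IDLE_TIMEOUT);
        if let Some(t) = token {
            if let Some(seen) = self.last_seen.get_mut(t) {
                *seen = now; // refresh the idle timer
                return Resolve::Existing(t.to_string());
            }
        }
        if self.last_seen.len() >= self.capacity {
            return Resolve::AtCapacity;
        }
        self.next_id += 1;
        let token = format!("session-{}", self.next_id); // placeholder token scheme
        self.last_seen.insert(token.clone(), now);
        Resolve::Created(token)
    }
}

fn main() {
    let mut store = SessionStore::new(100);
    let now = Instant::now();
    println!("{:?}", store.resolve(None, now));
    println!("{:?}", store.resolve(Some("session-1"), now));
    println!("{:?}", store.resolve(Some("bogus"), now));
}
```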

CORS

The server sends CORS headers on all responses:

  • Access-Control-Allow-Origin: *
  • Access-Control-Allow-Headers: Content-Type, X-Browsy-Session
  • Access-Control-Expose-Headers: X-Browsy-Session

This allows browser-based clients to call the API directly.

Endpoint reference

| Method | Path | Description |
|---|---|---|
| POST | /api/browse | Navigate to a URL |
| POST | /api/click | Click an element by ID |
| POST | /api/type | Type text into an input |
| POST | /api/check | Check a checkbox or radio |
| POST | /api/uncheck | Uncheck a checkbox or radio |
| POST | /api/select | Select a dropdown option |
| POST | /api/search | Web search |
| POST | /api/login | Fill and submit a login form |
| POST | /api/enter-code | Enter a verification code |
| POST | /api/find | Find elements by text or role |
| POST | /api/back | Go back in history |
| GET | /api/page | Get current page DOM |
| GET | /api/page-info | Get page metadata |
| GET | /api/tables | Extract table data |
| GET | /health | Health check |

All POST endpoints accept Content-Type: application/json.

Endpoints

POST /api/browse

Navigate to a URL and return the Spatial DOM.

Request body:

| Field | Type | Required | Description |
|---|---|---|---|
| url | string | yes | URL to navigate to |
| format | string | no | "compact" (default) or "json" |
| scope | string | no | "all" (default), "visible", "above_fold", or "visible_above_fold" |

curl http://localhost:3847/api/browse \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'

Response: The Spatial DOM in the requested format. Compact format returns plain text; JSON format returns the full structured DOM.

# JSON format with only visible elements
curl http://localhost:3847/api/browse \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "format": "json", "scope": "visible"}'

POST /api/click

Click an element by its ID. Links navigate to new pages; buttons submit forms.

Request body:

| Field | Type | Required | Description |
|---|---|---|---|
| id | integer | yes | Element ID to click |

curl http://localhost:3847/api/click \
  -H "Content-Type: application/json" \
  -H "X-Browsy-Session: $TOKEN" \
  -d '{"id": 3}'

Response: The resulting page DOM (after navigation or form submission).

POST /api/type

Type text into an input field or textarea.

Request body:

| Field | Type | Required | Description |
|---|---|---|---|
| id | integer | yes | Element ID of the text input |
| text | string | yes | Text to type |

curl http://localhost:3847/api/type \
  -H "Content-Type: application/json" \
  -H "X-Browsy-Session: $TOKEN" \
  -d '{"id": 5, "text": "user@example.com"}'

Response: Confirmation. Use GET /api/page to see the updated form state.

POST /api/check

Check a checkbox or radio button.

Request body:

| Field | Type | Required | Description |
|---|---|---|---|
| id | integer | yes | Element ID |

curl http://localhost:3847/api/check \
  -H "Content-Type: application/json" \
  -H "X-Browsy-Session: $TOKEN" \
  -d '{"id": 10}'

POST /api/uncheck

Uncheck a checkbox or radio button.

Request body:

| Field | Type | Required | Description |
|---|---|---|---|
| id | integer | yes | Element ID |

curl http://localhost:3847/api/uncheck \
  -H "Content-Type: application/json" \
  -H "X-Browsy-Session: $TOKEN" \
  -d '{"id": 10}'

POST /api/select

Select an option in a dropdown.

Request body:

| Field | Type | Required | Description |
|---|---|---|---|
| id | integer | yes | Element ID of the select element |
| value | string | yes | Value to select |

curl http://localhost:3847/api/select \
  -H "Content-Type: application/json" \
  -H "X-Browsy-Session: $TOKEN" \
  -d '{"id": 12, "value": "en-US"}'

POST /api/search

Search the web and return structured results.

Request body:

| Field | Type | Required | Description |
|---|---|---|---|
| query | string | yes | Search query |
| engine | string | no | "duckduckgo" (default) or "google" |

curl http://localhost:3847/api/search \
  -H "Content-Type: application/json" \
  -d '{"query": "rust web framework"}'

Response:

[
  {
    "title": "Actix Web - Rust Web Framework",
    "url": "https://actix.rs",
    "snippet": "A powerful, pragmatic, and fast web framework for Rust."
  }
]

POST /api/login

Fill and submit a detected login form. Requires a page with a Login suggested action loaded in the session.

Request body:

| Field | Type | Required | Description |
|---|---|---|---|
| username | string | yes | Username or email |
| password | string | yes | Password |

# First navigate to the login page
curl http://localhost:3847/api/browse \
  -H "Content-Type: application/json" \
  -H "X-Browsy-Session: $TOKEN" \
  -d '{"url": "https://app.example.com/login"}'

# Then submit credentials
curl http://localhost:3847/api/login \
  -H "Content-Type: application/json" \
  -H "X-Browsy-Session: $TOKEN" \
  -d '{"username": "user@example.com", "password": "secretpassword"}'

Response: The resulting page DOM after login submission.

POST /api/enter-code

Enter a verification or 2FA code. Requires a page with an EnterCode suggested action.

Request body:

| Field | Type | Required | Description |
|---|---|---|---|
| code | string | yes | Verification or 2FA code |

curl http://localhost:3847/api/enter-code \
  -H "Content-Type: application/json" \
  -H "X-Browsy-Session: $TOKEN" \
  -d '{"code": "847291"}'

Response: The resulting page DOM after code submission.

POST /api/find

Find elements on the current page by text content or ARIA role.

Request body:

| Field | Type | Required | Description |
|---|---|---|---|
| text | string | no | Find elements containing this text |
| role | string | no | Find elements with this ARIA role |

At least one of text or role must be provided.

# Find by text
curl http://localhost:3847/api/find \
  -H "Content-Type: application/json" \
  -H "X-Browsy-Session: $TOKEN" \
  -d '{"text": "Sign In"}'

# Find by role
curl http://localhost:3847/api/find \
  -H "Content-Type: application/json" \
  -H "X-Browsy-Session: $TOKEN" \
  -d '{"role": "button"}'

Response: JSON array of matching elements.

POST /api/back

Go back to the previous page in browsing history. No request body required.

curl -X POST http://localhost:3847/api/back \
  -H "X-Browsy-Session: $TOKEN"

Response: The previous page's DOM.

GET /api/page

Get the current page DOM with form state overlaid. Use after type, check, select, or uncheck to see updated form values without re-fetching.

Query parameters:

| Parameter | Type | Required | Description |
|---|---|---|---|
| scope | string | no | "all" (default), "visible", "above_fold", or "visible_above_fold" |
| format | string | no | "compact" (default) or "json" |

curl "http://localhost:3847/api/page?format=json&scope=visible" \
  -H "X-Browsy-Session: $TOKEN"

GET /api/page-info

Get page metadata without the full element list. No parameters.

curl http://localhost:3847/api/page-info \
  -H "X-Browsy-Session: $TOKEN"

Response:

{
  "title": "Sign In - Example",
  "url": "https://example.com/login",
  "page_type": "Login",
  "suggested_actions": [
    {
      "action": "Login",
      "username_id": 5,
      "password_id": 8,
      "submit_id": 12
    }
  ],
  "alerts": [],
  "pagination": null
}

GET /api/tables

Extract structured table data from the current page. No parameters.

curl http://localhost:3847/api/tables \
  -H "X-Browsy-Session: $TOKEN"

Response:

[
  {
    "headers": ["Name", "Price", "Stock"],
    "rows": [
      ["Widget A", "$9.99", "In stock"],
      ["Widget B", "$14.99", "Out of stock"]
    ]
  }
]

GET /health

Health check endpoint. No session required.

curl http://localhost:3847/health

Response:

{
  "status": "ok"
}

Scopes

The scope parameter controls which elements are included in the output:

| Scope | Description |
|---|---|
| all | All elements including hidden ones (default) |
| visible | Only non-hidden elements |
| above_fold | Only elements with top edge within the viewport height |
| visible_above_fold | Non-hidden elements above the fold |
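
As an illustration of how the four scopes combine two independent tests (hidden or not, above the fold or not), the sketch below filters a list of simplified elements. Scope, El, and filter_ids are hypothetical names; the fold test assumes a strict comparison of the element's top edge against the viewport height.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Scope { All, Visible, AboveFold, VisibleAboveFold }

impl Scope {
    /// Parses the string values accepted by the scope parameter.
    fn parse(s: &str) -> Option<Scope> {
        match s {
            "all" => Some(Scope::All),
            "visible" => Some(Scope::Visible),
            "above_fold" => Some(Scope::AboveFold),
            "visible_above_fold" => Some(Scope::VisibleAboveFold),
            _ => None,
        }
    }
}

/// Hypothetical, simplified element: ID, top edge in pixels, hidden flag.
struct El { id: u32, top: f32, hidden: bool }

fn filter_ids(els: &[El], scope: Scope, viewport_height: f32) -> Vec<u32> {
    els.iter()
        .filter(|el| {
            let above_fold = el.top < viewport_height;
            match scope {
                Scope::All => true,
                Scope::Visible => !el.hidden,
                Scope::AboveFold => above_fold,
                Scope::VisibleAboveFold => !el.hidden && above_fold,
            }
        })
        .map(|el| el.id)
        .collect()
}

fn main() {
    let els = [
        El { id: 1, top: 0.0, hidden: false },
        El { id: 25, top: 400.0, hidden: true },
        El { id: 40, top: 1500.0, hidden: false },
    ];
    for name in ["all", "visible", "above_fold", "visible_above_fold"] {
        let scope = Scope::parse(name).expect("known scope");
        println!("{name}: {:?}", filter_ids(&els, scope, 1080.0));
    }
}
```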

Output formats

The format parameter controls the response format:

| Format | Content-Type | Description |
|---|---|---|
| compact | text/plain | Minimal token-efficient text format (default) |
| json | application/json | Full structured Spatial DOM |

See Output Formats for details on both formats.

Error responses

Errors return JSON with an error field:

{
  "error": "Element 999 not found"
}

| Status | Cause |
|---|---|
| 400 | Invalid request body or parameters |
| 404 | Element not found, no page loaded, or no matching action |
| 503 | Server at session capacity |

Example: complete login flow

# Start the server
browsy serve --port 3847 &

# Browse to login page (captures session token)
TOKEN=$(curl -s -D- -o /dev/null http://localhost:3847/api/browse \
  -H "Content-Type: application/json" \
  -d '{"url": "https://app.example.com/login"}' \
  | grep -i x-browsy-session | tr -d '\r' | cut -d' ' -f2)

# Check page type
curl -s http://localhost:3847/api/page-info \
  -H "X-Browsy-Session: $TOKEN" | jq .page_type
# "Login"

# Submit credentials
curl -s http://localhost:3847/api/login \
  -H "Content-Type: application/json" \
  -H "X-Browsy-Session: $TOKEN" \
  -d '{"username": "user@example.com", "password": "secret"}'

# Check if 2FA is needed
curl -s http://localhost:3847/api/page-info \
  -H "X-Browsy-Session: $TOKEN" | jq .page_type
# "TwoFactorAuth"

# Enter 2FA code
curl -s http://localhost:3847/api/enter-code \
  -H "Content-Type: application/json" \
  -H "X-Browsy-Session: $TOKEN" \
  -d '{"code": "847291"}'

# Now on the dashboard -- extract tables
curl -s http://localhost:3847/api/tables \
  -H "X-Browsy-Session: $TOKEN" | jq .
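The `grep | tr | cut` pipeline in the first step pulls the session token out of the response headers. For scripts written in Python, the same extraction might look like the sketch below. The header name comes from the example above; the function name and the sample token value are made up for illustration.

```python
def extract_session_token(raw_headers):
    """Return the X-Browsy-Session value from a raw header dump, or None."""
    for line in raw_headers.splitlines():
        if not line.strip():
            break  # a blank line marks the end of the header block
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "x-browsy-session":
            return value.strip()
    return None


# Hypothetical output of `curl -s -D-`: headers, a blank line, then the body.
sample = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/plain\r\n"
    "X-Browsy-Session: example-token\r\n"
    "\r\n"
    "page_type: Login\r\n"
)
print(extract_session_token(sample))  # example-token
```

Unlike the shell pipeline, this stops at the end of the header block, so a page body that happens to contain the header name cannot be mistaken for the token.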

A2A Protocol

browsy implements Google's Agent2Agent (A2A) protocol, enabling agent discovery and task delegation over HTTP. Any A2A-compatible agent can discover browsy's capabilities and delegate web browsing tasks to it.

Overview

A2A is a standard for agents to find and communicate with each other. browsy's A2A support consists of two parts:

  1. Agent card -- a JSON manifest at a well-known URL describing browsy's capabilities.
  2. Task execution -- an endpoint that accepts goals, executes them as browsing tasks, and streams status events back via SSE.

Both are served automatically by browsy serve.

browsy serve --port 3847

Agent card

The agent card is served at GET /.well-known/agent.json and describes browsy's identity and capabilities.

curl http://localhost:3847/.well-known/agent.json

Response:

{
  "name": "browsy",
  "description": "Zero-render browser engine for AI agents. Navigates, extracts, and interacts with web pages without rendering pixels.",
  "url": "http://localhost:3847",
  "version": "1.0",
  "capabilities": {
    "streaming": true,
    "pushNotifications": false
  },
  "skills": [
    {
      "id": "web-browse",
      "name": "Web Browsing",
      "description": "Navigate to URLs, interact with pages, extract content, fill forms, and search the web.",
      "tags": ["browse", "scrape", "extract", "search", "login", "forms"]
    }
  ]
}

Agents discover browsy by fetching this card and inspecting the skills array. The streaming: true capability indicates that task responses are delivered as Server-Sent Events (SSE).
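A discovering agent might inspect the card roughly as follows. This is a sketch that assumes only the card fields shown above; `find_skills` and `uses_sse` are hypothetical helper names, and the card is abridged to the fields the helpers read.

```python
import json

# Abridged copy of the agent card shown above.
card = json.loads("""
{
  "name": "browsy",
  "capabilities": {"streaming": true, "pushNotifications": false},
  "skills": [
    {
      "id": "web-browse",
      "name": "Web Browsing",
      "tags": ["browse", "scrape", "extract", "search", "login", "forms"]
    }
  ]
}
""")


def find_skills(card, tag):
    """Return the IDs of skills that advertise the given tag."""
    return [s["id"] for s in card.get("skills", []) if tag in s.get("tags", [])]


def uses_sse(card):
    """True when the card declares the streaming capability."""
    return bool(card.get("capabilities", {}).get("streaming"))


print(find_skills(card, "login"))  # ['web-browse']
print(uses_sse(card))              # True
```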

Task execution

POST /a2a/tasks

Submit a task for browsy to execute. The response is an SSE event stream with status updates.

Request body:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| goal | string | yes | Natural language description of the task |
| params | object | no | Structured parameters (see below) |

Params fields:

| Field | Type | Description |
| --- | --- | --- |
| url | string | Target URL to browse |
| credentials | object | { "username": "...", "password": "..." } for login tasks |
| search_query | string | Query string for search tasks |
| extract | string | What to extract from the page (e.g., "tables", "links", "text") |

browsy infers the task intent from the goal text and params fields. Explicit params take priority over goal parsing.

Intent detection

browsy maps each task to one of these intents:

| Intent | Trigger | Behavior |
| --- | --- | --- |
| Search | search_query param, or goal contains "search" | Performs a web search, returns results |
| Login | credentials param, or goal contains "login"/"sign in" | Navigates to URL, fills login form, submits |
| Extract | extract param (not "tables"), or goal contains "extract"/"scrape" | Navigates to URL, returns page content |
| ExtractTables | extract: "tables", or goal contains "table" | Navigates to URL, extracts structured table data |
| FillForm | Goal contains "fill"/"form"/"submit" | Navigates to URL, interacts with form elements |
| Browse | Default fallback | Navigates to URL, returns the Spatial DOM |
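The rules in the table can be read as "params first, then keywords in the goal". The sketch below encodes that reading. It is not browsy's implementation: the order in which goal keywords are checked (in particular, "table" before "extract") and the use of plain substring matching are assumptions, since the table does not specify how overlapping keywords are resolved.

```python
def detect_intent(goal, params=None):
    """Map a task to an intent name following the trigger table (illustrative only)."""
    params = params or {}
    text = goal.lower()

    # Explicit params take priority over goal parsing.
    if "search_query" in params:
        return "Search"
    if "credentials" in params:
        return "Login"
    if "extract" in params:
        return "ExtractTables" if params["extract"] == "tables" else "Extract"

    # Keyword fallbacks. The ordering here is an assumption.
    keyword_rules = [
        ("Search", ("search",)),
        ("Login", ("login", "sign in")),
        ("ExtractTables", ("table",)),
        ("Extract", ("extract", "scrape")),
        ("FillForm", ("fill", "form", "submit")),
    ]
    for intent, keywords in keyword_rules:
        if any(keyword in text for keyword in keywords):
            return intent
    return "Browse"


print(detect_intent("Extract the pricing table", {"extract": "tables"}))  # ExtractTables
print(detect_intent("Fill out the contact form", {"url": "https://example.com/contact"}))  # FillForm
print(detect_intent("Browse the Hacker News front page"))  # Browse
```

Substring matching is deliberately naive: a goal containing "information" would match "form". Passing explicit params avoids that ambiguity.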

SSE event stream

The response uses Content-Type: text/event-stream. Each event is a JSON object with the following structure:

data: {"id":"task_abc123","status":"working","steps":[{"description":"Navigating to https://example.com"}]}

data: {"id":"task_abc123","status":"completed","steps":[{"description":"Navigating to https://example.com"},{"description":"Page loaded: Example Domain (3 elements)"}],"result":{"page_type":"Other","title":"Example Domain","elements":3}}

Event fields:

| Field | Type | Description |
| --- | --- | --- |
| id | string | Unique task identifier |
| status | string | "working", "completed", or "failed" |
| steps | array | List of { "description": "..." } objects showing progress |
| result | object | Present when status is "completed". Contains extracted data |
| error | string | Present when status is "failed". Describes what went wrong |

The stream always ends with a terminal event ("completed" or "failed").
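A client consuming this stream needs to strip the `data:` prefix, decode each JSON payload, and stop at the terminal event. The sketch below shows one way to do that over an iterable of text lines. It assumes each event fits on a single `data:` line, as in the examples above, and `iter_task_events` is a hypothetical helper name.

```python
import json

TERMINAL_STATUSES = {"completed", "failed"}


def iter_task_events(lines):
    """Yield decoded events from SSE lines, stopping after the terminal event."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not a data field
        event = json.loads(line[len("data:"):].strip())
        yield event
        if event.get("status") in TERMINAL_STATUSES:
            return


# The two events from the example stream above.
sample_stream = [
    'data: {"id":"task_abc123","status":"working","steps":[{"description":"Navigating to https://example.com"}]}',
    "",
    'data: {"id":"task_abc123","status":"completed","steps":[{"description":"Navigating to https://example.com"},{"description":"Page loaded: Example Domain (3 elements)"}],"result":{"page_type":"Other","title":"Example Domain","elements":3}}',
]

for event in iter_task_events(sample_stream):
    print(event["status"], len(event["steps"]))
# working 1
# completed 2
```

In a real client the `lines` argument would be the response body read line by line from an HTTP library of your choice; the parsing logic stays the same.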

Examples

Browse a page

curl -N http://localhost:3847/a2a/tasks \
  -H "Content-Type: application/json" \
  -d '{
    "goal": "Browse the Hacker News front page",
    "params": { "url": "https://news.ycombinator.com" }
  }'

Event stream:

data: {"id":"task_1","status":"working","steps":[{"description":"Navigating to https://news.ycombinator.com"}]}

data: {"id":"task_1","status":"completed","steps":[{"description":"Navigating to https://news.ycombinator.com"},{"description":"Page loaded: Hacker News (120 elements)"}],"result":{"page_type":"List","title":"Hacker News","elements":120}}

Search the web

curl -N http://localhost:3847/a2a/tasks \
  -H "Content-Type: application/json" \
  -d '{
    "goal": "Search for Rust web frameworks",
    "params": { "search_query": "rust web framework 2026" }
  }'

Login to a site

curl -N http://localhost:3847/a2a/tasks \
  -H "Content-Type: application/json" \
  -d '{
    "goal": "Login to the application",
    "params": {
      "url": "https://app.example.com/login",
      "credentials": { "username": "user@example.com", "password": "secret" }
    }
  }'

Event stream:

data: {"id":"task_3","status":"working","steps":[{"description":"Navigating to https://app.example.com/login"}]}

data: {"id":"task_3","status":"working","steps":[{"description":"Navigating to https://app.example.com/login"},{"description":"Login page detected, submitting credentials"}]}

data: {"id":"task_3","status":"completed","steps":[{"description":"Navigating to https://app.example.com/login"},{"description":"Login page detected, submitting credentials"},{"description":"Login successful, redirected to Dashboard"}],"result":{"page_type":"Dashboard","title":"Dashboard - App"}}

Extract table data

curl -N http://localhost:3847/a2a/tasks \
  -H "Content-Type: application/json" \
  -d '{
    "goal": "Extract the pricing table",
    "params": {
      "url": "https://example.com/pricing",
      "extract": "tables"
    }
  }'

Extract page content

curl -N http://localhost:3847/a2a/tasks \
  -H "Content-Type: application/json" \
  -d '{
    "goal": "Extract the main article text",
    "params": {
      "url": "https://example.com/blog/post",
      "extract": "text"
    }
  }'

Fill a form

curl -N http://localhost:3847/a2a/tasks \
  -H "Content-Type: application/json" \
  -d '{
    "goal": "Fill out the contact form with name John and email john@example.com",
    "params": { "url": "https://example.com/contact" }
  }'

Task status polling

A stub endpoint exists for polling task status by ID:

GET /a2a/tasks/{task_id}

curl http://localhost:3847/a2a/tasks/task_abc123

This returns the last known state of the task. Since tasks execute synchronously over SSE, polling is primarily useful for checking whether a task completed after a disconnection.

Error handling

When a task fails, the final SSE event includes an error field:

data: {"id":"task_5","status":"failed","steps":[{"description":"Navigating to https://invalid.example"}],"error":"Network error: DNS resolution failed"}

Common failure causes:

| Error | Cause |
| --- | --- |
| Network error | DNS failure, connection refused, timeout |
| CAPTCHA detected | Target page requires human verification |
| No login form found | Login intent but page has no detected login action |
| Element not found | Form interaction referenced a nonexistent element |

Framework Integrations

browsy provides native integrations for popular AI/agent frameworks in both Python and JavaScript/TypeScript. Each integration wraps browsy as framework-compatible tools, so agents can browse the web using their native tool-calling patterns.

JavaScript / TypeScript

The browsy-ai npm package provides integrations for LangChain.js, OpenAI, and Vercel AI SDK. Install the core package and whichever framework you use:

npm install browsy-ai                    # Core SDK
npm install browsy-ai @langchain/core    # + LangChain.js
npm install browsy-ai openai             # + OpenAI
npm install browsy-ai ai                 # + Vercel AI SDK

LangChain.js

import { getTools } from "browsy-ai/langchain";

const tools = getTools();  // -> 14 LangChain tool instances

OpenAI function calling

import { getToolDefinitions, handleToolCall } from "browsy-ai/openai";

const tools = getToolDefinitions();
const result = await handleToolCall("browsy_browse", { url: "https://example.com" });

Vercel AI SDK

import { browsyTools } from "browsy-ai/vercel-ai";
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

const result = await generateText({
  model: openai("gpt-4o"),
  tools: browsyTools(),
  prompt: "Go to example.com and summarize it",
  maxSteps: 10,
});

See the full JavaScript / TypeScript guide for complete examples and API reference.


Python

Install browsy with the extras for your framework:

pip install browsy-ai[langchain]   # LangChain tools
pip install browsy-ai[crewai]      # CrewAI tool
pip install browsy-ai[openai]      # OpenAI function calling
pip install browsy-ai[autogen]     # AutoGen integration
pip install browsy-ai[smolagents]  # HuggingFace smolagents
pip install browsy-ai[all]         # All integrations

All Python integrations share a lazily-initialized Browser instance. You can pass your own Browser for custom viewport configuration.

LangChain

The LangChain integration provides individual tools that plug directly into LangChain agents and chains.

from browsy.langchain import get_tools

Available tools

| Tool class | Description |
| --- | --- |
| BrowsyBrowseTool | Navigate to a URL, returns Spatial DOM |
| BrowsyClickTool | Click an element by ID |
| BrowsyTypeTextTool | Type text into an input field |
| BrowsySearchTool | Web search via DuckDuckGo or Google |
| BrowsyLoginTool | Fill and submit a login form |
| BrowsyPageInfoTool | Get page metadata and suggested actions |

Quick start

from browsy.langchain import get_tools
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent

llm = ChatOpenAI(model="gpt-4o")
tools = get_tools()

agent = create_react_agent(llm, tools)

result = agent.invoke({
    "messages": [{"role": "user", "content": "Go to news.ycombinator.com and list the top 5 stories"}]
})

Custom browser

Pass a Browser instance to control viewport size or other settings:

from browsy import Browser
from browsy.langchain import get_tools

browser = Browser(viewport_width=375, viewport_height=812)
tools = get_tools(browser=browser)

Using individual tools

from browsy.langchain import BrowsyBrowseTool, BrowsyClickTool

browse = BrowsyBrowseTool()
page = browse.invoke({"url": "https://example.com"})

click = BrowsyClickTool()
result = click.invoke({"id": 3})

CrewAI

The CrewAI integration wraps all browsy actions into a single tool that CrewAI agents can call.

from browsy.crewai import BrowsyTool

Quick start

from browsy.crewai import BrowsyTool
from crewai import Agent, Task, Crew

browsy_tool = BrowsyTool()

researcher = Agent(
    role="Web Researcher",
    goal="Find and summarize information from web pages",
    backstory="You are an expert at navigating websites and extracting key information.",
    tools=[browsy_tool],
    verbose=True,
)

task = Task(
    description="Go to https://news.ycombinator.com and summarize the top 3 stories.",
    expected_output="A summary of the top 3 Hacker News stories with titles and URLs.",
    agent=researcher,
)

crew = Crew(agents=[researcher], tasks=[task])
result = crew.kickoff()
print(result)

Tool actions

The BrowsyTool accepts a JSON string with an action field and action-specific parameters:

# Browse
browsy_tool.run('{"action": "browse", "url": "https://example.com"}')

# Click
browsy_tool.run('{"action": "click", "id": 3}')

# Type
browsy_tool.run('{"action": "type", "id": 5, "text": "hello"}')

# Search
browsy_tool.run('{"action": "search", "query": "rust web framework"}')

# Login
browsy_tool.run('{"action": "login", "username": "user@example.com", "password": "secret"}')

# Page info
browsy_tool.run('{"action": "page_info"}')

OpenAI function calling

The OpenAI integration provides tool definitions compatible with the OpenAI Chat Completions API and a dispatcher to handle tool calls.

from browsy.openai import get_tool_definitions, handle_tool_call

Tool definitions

get_tool_definitions() returns a list of OpenAI-compatible tool schemas:

from browsy.openai import get_tool_definitions

tools = get_tool_definitions()
# Returns list of {"type": "function", "function": {"name": ..., "parameters": ...}}

Handling tool calls

handle_tool_call(name, args) dispatches a tool call to browsy and returns the result as a string:

from browsy.openai import handle_tool_call

result = handle_tool_call("browsy_browse", {"url": "https://example.com"})

Complete example

import json
from openai import OpenAI
from browsy.openai import get_tool_definitions, handle_tool_call

client = OpenAI()
tools = get_tool_definitions()

messages = [
    {"role": "user", "content": "Go to example.com and tell me what's on the page."}
]

# Initial request
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    tools=tools,
)

# Tool call loop
while response.choices[0].message.tool_calls:
    msg = response.choices[0].message
    messages.append(msg)

    for tool_call in msg.tool_calls:
        args = json.loads(tool_call.function.arguments)
        result = handle_tool_call(tool_call.function.name, args)

        messages.append({
            "role": "tool",
            "tool_call_id": tool_call.id,
            "content": result,
        })

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        tools=tools,
    )

print(response.choices[0].message.content)

Available functions

| Function name | Parameters | Description |
| --- | --- | --- |
| browsy_browse | url, format?, scope? | Navigate to a URL |
| browsy_click | id | Click an element |
| browsy_type_text | id, text | Type into an input |
| browsy_search | query, engine? | Web search |
| browsy_login | username, password | Login to a site |
| browsy_page_info | (none) | Get page metadata |

AutoGen

The AutoGen integration provides a BrowsyBrowser class compatible with Microsoft AutoGen's ConversableAgent.

from browsy.autogen import BrowsyBrowser

Quick start

from browsy.autogen import BrowsyBrowser
from autogen import ConversableAgent, UserProxyAgent

browser = BrowsyBrowser()

assistant = ConversableAgent(
    name="web_assistant",
    system_message="You help users browse the web and extract information.",
    llm_config={"config_list": [{"model": "gpt-4o"}]},
)

# Register browsy tools with the agent
browser.register(assistant)

user = UserProxyAgent(
    name="user",
    human_input_mode="NEVER",
    code_execution_config=False,
)
browser.register(user)

user.initiate_chat(
    assistant,
    message="Go to https://example.com and describe what you see.",
)

Custom browser

from browsy import Browser
from browsy.autogen import BrowsyBrowser

custom = Browser(viewport_width=1366, viewport_height=768)
browser = BrowsyBrowser(browser=custom)

Smolagents

The smolagents integration provides a tool compatible with HuggingFace's smolagents framework.

from browsy.smolagents import BrowsyTool

Quick start

from browsy.smolagents import BrowsyTool
from smolagents import CodeAgent, HfApiModel

tool = BrowsyTool()

agent = CodeAgent(
    tools=[tool],
    model=HfApiModel("Qwen/Qwen2.5-Coder-32B-Instruct"),
)

result = agent.run("Go to https://example.com and extract the main heading text.")
print(result)

Custom browser

from browsy import Browser
from browsy.smolagents import BrowsyTool

browser = Browser(viewport_width=1920, viewport_height=1080)
tool = BrowsyTool(browser=browser)

OpenClaw / SimpleClaw

The openclaw-browsy plugin integrates browsy as a first-class tool in OpenClaw and compatible frameworks like SimpleClaw. Unlike the Python integrations above, this is a TypeScript/Node.js plugin that manages its own browsy server process.

npm install openclaw-browsy

import { register } from "openclaw-browsy";
export default { register };

The plugin auto-starts a browsy serve process and injects 14 browsing tools into every agent. It can also intercept built-in Playwright browser tools for a transparent speed upgrade.

See the full OpenClaw / SimpleClaw integration guide for configuration, standalone usage, and custom orchestrator support.

Shared Browser instance

All integrations lazily initialize a Browser instance with default settings (1920x1080 viewport) if none is provided. The Browser instance is shared across all tool calls within the same integration, maintaining session state (cookies, history, form values) across interactions.

To share a single Browser across multiple integrations:

from browsy import Browser
from browsy.langchain import get_tools as get_langchain_tools
from browsy.openai import get_tool_definitions

browser = Browser(viewport_width=1920, viewport_height=1080)

# Both use the same session
langchain_tools = get_langchain_tools(browser=browser)
openai_tools = get_tool_definitions(browser=browser)

JavaScript / TypeScript

The browsy-ai npm package provides a TypeScript SDK for the browsy REST API, plus ready-made integrations for LangChain.js, OpenAI, and Vercel AI SDK.

Installation

npm install browsy-ai

The package uses ESM and requires Node.js 22+. Framework dependencies are optional peer dependencies — install only what you need.

Core SDK

The core SDK manages the browsy server process, HTTP communication, and per-agent session isolation.

import { BrowsyClient, BrowsyContext, ServerManager } from "browsy-ai";

BrowsyContext

The simplest way to use browsy. BrowsyContext is a facade that coordinates the client, server manager, and session manager.

import { BrowsyContext } from "browsy-ai";

const ctx = new BrowsyContext({ port: 3847 });

// Execute tool calls — server auto-starts, sessions auto-managed
const page = await ctx.executeToolCall("browse", { url: "https://example.com" });
console.log(page);

const info = await ctx.executeToolCall("pageInfo", {});
console.log(info);

BrowsyClient

Lower-level HTTP client for direct API calls. Use this when you manage the server and sessions yourself.

import { BrowsyClient } from "browsy-ai";

const client = new BrowsyClient(3847);

// Navigate
const res = await client.browse({ url: "https://example.com" });
console.log(res.body);

// Interact using the session from the response
await client.typeText({ id: 5, text: "hello" }, res.session);
await client.click({ id: 12 }, res.session);

// Extract data
const tables = await client.tables(res.session);
const info = await client.pageInfo(res.session);

Configuration

import { BrowsyContext } from "browsy-ai";

const ctx = new BrowsyContext({
  port: 3847,           // REST server port (default: 3847)
  autoStart: true,      // Auto-start browsy serve (default: true)
  allowPrivateNetwork: false,  // Allow private network URLs (default: false)
  serverTimeout: 10_000,      // Startup timeout in ms (default: 10000)
});

When autoStart is true, the SDK finds the browsy binary in your PATH (or via the BROWSY_BIN environment variable) and spawns browsy serve --port <port>.

Session isolation

Each agent gets its own isolated browsing session with independent cookies, history, and form state:

const ctx = new BrowsyContext();

// Different agents get different sessions
const page1 = await ctx.executeToolCall("browse", { url: "https://a.com" }, "agent-1");
const page2 = await ctx.executeToolCall("browse", { url: "https://b.com" }, "agent-2");

LangChain.js

npm install browsy-ai @langchain/core

import { getTools } from "browsy-ai/langchain";

Quick start

import { getTools } from "browsy-ai/langchain";
import { ChatOpenAI } from "@langchain/openai";
import { createReactAgent } from "@langchain/langgraph/prebuilt";

const tools = getTools({ port: 3847 });
const llm = new ChatOpenAI({ model: "gpt-4o" });
const agent = createReactAgent({ llm, tools });

const result = await agent.invoke({
  messages: [{ role: "user", content: "Go to news.ycombinator.com and list the top 5 stories" }],
});

Custom context

Pass a BrowsyContext for full control:

import { BrowsyContext } from "browsy-ai";
import { getTools } from "browsy-ai/langchain";

const ctx = new BrowsyContext({ port: 9000, autoStart: false });
const tools = getTools(ctx);

Available tools

getTools() returns 14 LangChain tool instances:

| Tool name | Parameters | Description |
| --- | --- | --- |
| browsy_browse | url, format?, scope? | Navigate to a URL |
| browsy_click | id | Click an element by ID |
| browsy_type_text | id, text | Type into an input field |
| browsy_check | id | Check a checkbox/radio |
| browsy_uncheck | id | Uncheck a checkbox/radio |
| browsy_select | id, value | Select a dropdown option |
| browsy_search | query, engine? | Web search |
| browsy_login | username, password | Log in using detected form |
| browsy_enter_code | code | Enter 2FA/verification code |
| browsy_find | text?, role? | Find elements by text or role |
| browsy_get_page | format?, scope? | Get current page with form state |
| browsy_page_info | (none) | Page metadata and suggested actions |
| browsy_tables | (none) | Extract structured table data |
| browsy_back | (none) | Go back in history |

OpenAI

npm install browsy-ai openai

import { getToolDefinitions, handleToolCall } from "browsy-ai/openai";

Quick start

import OpenAI from "openai";
import { getToolDefinitions, handleToolCall } from "browsy-ai/openai";

const client = new OpenAI();
const tools = getToolDefinitions();

// Typed as the SDK's message union so assistant and tool messages can be pushed in the loop below
const messages: OpenAI.Chat.ChatCompletionMessageParam[] = [
  { role: "user", content: "Go to example.com and tell me what's there." },
];

let response = await client.chat.completions.create({
  model: "gpt-4o",
  messages,
  tools,
});

// Tool call loop
while (response.choices[0].message.tool_calls?.length) {
  const msg = response.choices[0].message;
  messages.push(msg);

  for (const toolCall of msg.tool_calls!) {
    const args = JSON.parse(toolCall.function.arguments);
    const result = await handleToolCall(toolCall.function.name, args);

    messages.push({
      role: "tool" as const,
      tool_call_id: toolCall.id,
      content: result,
    });
  }

  response = await client.chat.completions.create({
    model: "gpt-4o",
    messages,
    tools,
  });
}

console.log(response.choices[0].message.content);

Bound handler

Use createToolCallHandler() to get a pre-bound handler:

import { getToolDefinitions, createToolCallHandler } from "browsy-ai/openai";

const tools = getToolDefinitions();
const handle = createToolCallHandler({ port: 3847 });

// In your tool call loop:
const result = await handle(toolCall.function.name, args);

Vercel AI SDK

npm install browsy-ai ai

import { browsyTools } from "browsy-ai/vercel-ai";

Quick start

import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";
import { browsyTools } from "browsy-ai/vercel-ai";

const result = await generateText({
  model: openai("gpt-4o"),
  tools: browsyTools(),
  prompt: "Go to news.ycombinator.com and list the top 5 stories",
  maxSteps: 10,
});

console.log(result.text);

Custom context

import { BrowsyContext } from "browsy-ai";
import { browsyTools } from "browsy-ai/vercel-ai";

const ctx = new BrowsyContext({ port: 9000 });
const tools = browsyTools(ctx);

Zod schemas

All tool parameter schemas are exported as Zod objects for use in custom integrations:

import {
  BrowseParams,
  ClickParams,
  TypeTextParams,
  SearchParams,
  TOOL_DESCRIPTIONS,
  TOOL_SCHEMAS,
} from "browsy-ai";

// Use in your own tool definitions
const parsed = BrowseParams.parse({ url: "https://example.com" });

// Iterate over all tools
for (const { name, method, schema } of TOOL_SCHEMAS) {
  console.log(name, TOOL_DESCRIPTIONS[name]);
}

Prerequisites

The SDK talks to a browsy REST server. You need the browsy CLI installed:

cargo install browsy

With autoStart: true (the default), the SDK starts the server automatically. With autoStart: false, start it manually:

browsy serve --port 3847

OpenClaw Integration

browsy integrates with OpenClaw as a first-class plugin, giving every agent fast, zero-render browsing capabilities without Playwright or Chromium.

Why use browsy in OpenClaw?

OpenClaw's built-in browser uses Playwright + CDP: ~300MB RAM, 2-5s per page. browsy handles 70%+ of agent browsing tasks at 10x the speed with 60x less memory. The plugin auto-starts a browsy server and injects 14 browsing tools into every agent.

| | Built-in Browser | browsy Plugin |
| --- | --- | --- |
| Engine | Chromium via Playwright | Zero-render Spatial DOM |
| Memory | ~300MB/page | ~5MB/page |
| Latency | 2-5s/page | <100ms/page |
| JS support | Full | Hidden content exposure |
| Setup | Bundled | npm install openclaw-browsy + browsy CLI |

Installation

# Install the OpenClaw plugin
npm install openclaw-browsy

# Install the browsy CLI (needed for the server)
cargo install browsy

Configuration

Add to your OpenClaw config:

{
  "plugins": {
    "openclaw-browsy": {
      "port": 3847,
      "autoStart": true,
      "allowPrivateNetwork": false,
      "preferBrowsy": true,
      "serverTimeout": 10000
    }
  }
}

| Option | Default | Description |
| --- | --- | --- |
| port | 3847 | Port for the browsy REST server |
| autoStart | true | Start browsy serve automatically on plugin init |
| allowPrivateNetwork | false | Allow fetching private/internal network URLs |
| preferBrowsy | true | Intercept built-in browser tool calls and redirect through browsy |
| serverTimeout | 10000 | Timeout (ms) waiting for server startup |

Plugin registration

// openclaw.config.ts
import { register } from "openclaw-browsy";
export default { register };

The plugin registers four components following OpenClaw's standard pattern:

  1. preToolExecution hook — intercepts built-in browser tools (browser, web_browser, playwright_browser, browse_web) and redirects them through browsy when preferBrowsy is enabled
  2. agent:bootstrap hook — injects 14 browsy tools into every agent's toolset at startup
  3. browsy-server service — manages the browsy serve process lifecycle (auto-start, health polling, shutdown)
  4. Gateway methods + CLI commands: browsy.status, browsy.restart, /browsy-status, /browsy-sessions

Available tools

Every agent gets these 14 tools automatically:

| Tool | Parameters | Description |
| --- | --- | --- |
| browsy_browse | url, format?, scope? | Navigate to a URL |
| browsy_click | id | Click an element by ID |
| browsy_type_text | id, text | Type text into an input field |
| browsy_check | id | Check a checkbox or radio button |
| browsy_uncheck | id | Uncheck a checkbox or radio button |
| browsy_select | id, value | Select a dropdown option |
| browsy_search | query, engine? | Search the web (DuckDuckGo or Google) |
| browsy_login | username, password | Log in using detected form fields |
| browsy_enter_code | code | Enter a verification or 2FA code |
| browsy_find | text?, role? | Find elements by text or ARIA role |
| browsy_get_page | format?, scope? | Get current page DOM with form state |
| browsy_page_info | (none) | Get page metadata and suggested actions |
| browsy_tables | (none) | Extract structured table data |
| browsy_back | (none) | Go back in browsing history |

How it works

The plugin is a pure proxy — it talks to browsy's REST API via fetch() and manages sessions:

Agent → browsy_browse("https://example.com")
  → Plugin ensures browsy server is running
  → Plugin gets/creates session for this agent
  → POST /api/browse with X-Browsy-Session header
  → browsy fetches, parses, and returns Spatial DOM
  → Plugin updates session token
  → Agent receives page content

Each agent gets its own isolated session with independent cookies, history, and form state.
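The session bookkeeping in that flow amounts to a per-agent map of tokens. The sketch below models it in Python for illustration only: the real plugin is TypeScript, and `SessionRegistry`, `call_with_session`, and the `send` callable are hypothetical names. Only the `X-Browsy-Session` header comes from the flow above.

```python
class SessionRegistry:
    """Keeps one session token per agent ID (illustrative sketch)."""

    def __init__(self):
        self._tokens = {}

    def headers_for(self, agent_id):
        """Headers to send for this agent; empty until a token is known."""
        token = self._tokens.get(agent_id)
        return {"X-Browsy-Session": token} if token else {}

    def update(self, agent_id, token):
        """Record the token returned by the server, if any."""
        if token:
            self._tokens[agent_id] = token


def call_with_session(registry, agent_id, send):
    """Send a request with the agent's token, then store the token that comes back.

    `send` is a hypothetical callable: it takes request headers and returns
    a (body, session_token) pair.
    """
    body, new_token = send(registry.headers_for(agent_id))
    registry.update(agent_id, new_token)
    return body


# Stand-in for the HTTP call: issues a token when none is sent, else echoes it back.
def fake_send(headers):
    token = headers.get("X-Browsy-Session", "token-issued-by-server")
    return "page content", token


registry = SessionRegistry()
call_with_session(registry, "agent-1", fake_send)
print(registry.headers_for("agent-1"))  # {'X-Browsy-Session': 'token-issued-by-server'}
print(registry.headers_for("agent-2"))  # {}
```

Keying the map by agent ID is what keeps cookies and history separate: two agents never send the same token unless they share an ID.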

SimpleClaw and other OpenClaw-compatible frameworks

The openclaw-browsy plugin works with any framework that implements the OpenClaw plugin API. This includes SimpleClaw and other lightweight agent orchestrators built on the OpenClaw standard.

SimpleClaw quick start

import { SimpleClaw } from "simpleclaw";
import { register } from "openclaw-browsy";

const claw = new SimpleClaw({
  plugins: [{ register }],
  config: {
    "openclaw-browsy": {
      port: 3847,
      preferBrowsy: true,
    },
  },
});

// Agents automatically get browsy tools
const agent = claw.createAgent({
  name: "researcher",
  instructions: "You browse the web and extract information.",
});

const result = await agent.run("Search for 'Rust web frameworks' and summarize the top 3 results");

Standalone usage (no framework)

You can also use the browsy client directly without OpenClaw:

import { BrowsyContext } from "openclaw-browsy";

const ctx = new BrowsyContext({ port: 3847, autoStart: false });

// Assumes browsy serve is already running
const page = await ctx.executeToolCall("browse", { url: "https://example.com" });
console.log(page);

const info = await ctx.executeToolCall("pageInfo", {});
console.log(info);

Custom agent orchestrators

Any orchestrator that implements the four-method OpenClawPluginApi interface can use the plugin:

interface OpenClawPluginApi {
  registerHook(name: string, handler: (...args: unknown[]) => unknown): void;
  registerService(name: string, service: { start?: () => void | Promise<void>; stop?: () => void | Promise<void> }): void;
  registerGatewayMethod(name: string, handler: (...args: unknown[]) => unknown): void;
  registerCommand(name: string, handler: (...args: unknown[]) => unknown): void;
}

Call register(api) with your implementation and browsy tools become available to your agents.

preferBrowsy mode

When preferBrowsy is enabled (the default), the plugin intercepts calls to built-in browser tools and aborts them with a message directing the agent to use browsy_browse instead. This provides a transparent speed upgrade for agents that were previously using Playwright.

The intercepted tool names are:

  • browser
  • web_browser
  • playwright_browser
  • browse_web

To disable interception and run browsy alongside the built-in browser:

{
  "openclaw-browsy": {
    "preferBrowsy": false
  }
}

When to fall back to a full browser

browsy handles server-rendered pages, forms, search, and data extraction. Fall back to Playwright for:

  • JS-heavy SPAs — React, Vue, Angular apps that render entirely client-side
  • Screenshots — when you need visual/pixel-level inspection
  • Complex JS interactions — drag-and-drop, infinite scroll, WebSocket-driven UIs
  • PDF generation — print-to-PDF workflows

With preferBrowsy: false, both browsy and the built-in browser are available. Agents can choose the right tool for each task.

Bundled skills

The plugin includes three runtime skills for common browsing patterns:

browse-and-extract

Navigate to a URL and extract data, automatically handling cookie consent and login walls.

web-research

Search the web, visit multiple pages, and compile a research summary with source attribution.

form-filler

Detect form fields using browsy's page intelligence, fill them with provided data, and submit.

Python Bindings

browsy provides Python bindings via PyO3. The API closely mirrors the Rust Session API.

Installation

pip install browsy-ai

The package ships a compiled native extension (_core.pyd / _core.so). No Rust toolchain required for installation from wheels.

Module contents

from browsy import Browser, Page, Element

| Class | Description |
| --- | --- |
| Browser | A browsing session with cookie persistence and form state |
| Page | A parsed page (the Spatial DOM) |
| Element | A single element in the Spatial DOM |

Basic usage: parsing HTML

The Browser class can parse local HTML without network access:

from browsy import Browser

browser = Browser(viewport_width=1920, viewport_height=1080)
page = browser.load_html('<h1>Hello</h1><a href="/about">About</a>', 'https://example.com')

print(page.title)       # ""
print(len(page))        # 2
for el in page.elements:
    print(el.id, el.tag, el.text)
# 1 h1 Hello
# 2 a About

Browsing: navigating URLs

from browsy import Browser

browser = Browser()
page = browser.goto("https://example.com")

print(page.title)       # "Example Domain"
print(page.url)         # "https://example.com"
print(page.page_type()) # "Other"

Page properties and methods

page.title              # str: page title
page.url                # str: current URL
page.elements           # list[Element]: all elements
page.visible()          # list[Element]: non-hidden elements only
page.above_fold()       # list[Element]: elements with top edge within viewport
page.get(id)            # Element or None: lookup by ID
page.page_type()        # str: "Login", "Search", "Article", "List", etc.
page.suggested_actions() # list[dict]: detected action recipes
page.alerts()           # list[Element]: elements with alert_type set
page.tables()           # list[dict]: extracted table data (headers + rows)
page.pagination()       # dict or None: next/prev/pages links
page.to_json()          # str: full JSON serialization
page.to_compact()       # str: compact text format
len(page)               # int: element count

Element properties

el.id                   # int: unique element ID
el.tag                  # str: HTML tag name
el.role                 # str or None: ARIA role (implicit or explicit)
el.text                 # str or None: visible text content
el.href                 # str or None: link target (resolved to absolute URL)
el.placeholder          # str or None: placeholder text
el.value                # str or None: current value
el.input_type           # str or None: input type attribute
el.name                 # str or None: HTML name attribute
el.label                # str or None: associated label text
el.alert_type           # str or None: "alert", "error", "success", "warning"
el.disabled             # bool or None
el.checked              # bool or None
el.expanded             # bool or None
el.selected             # bool or None
el.required             # bool or None
el.hidden               # bool or None: True if element is hidden
el.bounds               # tuple[int, int, int, int]: (x, y, width, height)

Form interaction

browser = Browser()
page = browser.goto("https://example.com/login")

# Type into fields by element ID
browser.type_text(5, "user@example.com")
browser.type_text(8, "secretpassword")

# Check a "remember me" checkbox
browser.check(10)

# Select a dropdown option
browser.select(12, "en-US")

# Read the updated DOM with form state overlaid
page = browser.dom()

# Submit by clicking the submit button
page = browser.click(15)

Compound actions

For detected form patterns, compound actions handle the full workflow:

# Login (requires Login suggested action on current page)
page = browser.login("user@example.com", "password123")

# Enter verification code (requires EnterCode suggested action)
page = browser.enter_code("123456")

Search

# Search the web (DuckDuckGo by default)
results = browser.search("python web scraping")
for r in results:
    print(r["title"], r["url"], r["snippet"])

Finding elements

# Find by text content (exact substring match)
elements = browser.find_by_text("Sign In")

# Find by text content (case-insensitive substring)
elements = browser.find_by_text_fuzzy("sign in")

# Find by ARIA role
buttons = browser.find_by_role("button")
headings = browser.find_by_role("heading")
links = browser.find_by_role("link")

# Find input by semantic purpose
password_input = browser.find_input_by_purpose("password")
email_input = browser.find_input_by_purpose("email")
search_input = browser.find_input_by_purpose("search")
# Supported purposes: "password", "email", "username", "code", "search", "phone"

# Find verification codes on the page
code = browser.find_verification_code()  # str or None

Navigation

# Navigate to a URL
page = browser.goto("https://example.com")

# Click a link (navigates to its href)
page = browser.click(3)

# Go back
page = browser.back()

Suggested actions

page = browser.goto("https://example.com/login")

for action in page.suggested_actions():
    print(action)
    # {"action": "Login", "username_id": 5, "password_id": 8, "submit_id": 12}

Each action is a dictionary with an "action" key identifying the type and additional fields with element IDs. See the Action Recipes Reference for all variants.

Viewport configuration

# Mobile viewport
browser = Browser(viewport_width=375, viewport_height=812)

# Desktop viewport (default)
browser = Browser(viewport_width=1920, viewport_height=1080)

The viewport dimensions affect CSS media query evaluation and layout computation, which in turn affect element positions and visibility.

CLI Usage

The browsy CLI provides four commands: fetch for URLs, parse for local HTML files, serve for the REST API server, and mcp for the MCP server.

Installation

cargo install browsy

Commands

fetch

Fetch a URL, compute the Spatial DOM, and print the result.

browsy fetch <URL> [OPTIONS]

| Flag | Description |
|---|---|
| --json | Output as JSON instead of compact format |
| --viewport <WxH> | Viewport size (default: 1920x1080) |
| --no-css | Skip fetching external CSS stylesheets |
| --visible-only | Only include visible (non-hidden) elements |
| --above-fold | Only include elements above the viewport fold |

Examples:

# Compact output (default)
browsy fetch https://example.com

# JSON output
browsy fetch https://example.com --json

# Mobile viewport
browsy fetch https://example.com --viewport 375x812

# Skip external CSS for faster parsing
browsy fetch https://example.com --no-css

# Only visible above-fold elements
browsy fetch https://example.com --visible-only --above-fold

parse

Parse a local HTML file and print the Spatial DOM. No network requests are made (external stylesheets are not fetched).

browsy parse <FILE> [OPTIONS]

| Flag | Description |
|---|---|
| --json | Output as JSON instead of compact format |
| --viewport <WxH> | Viewport size (default: 1920x1080) |

Use - to read from stdin:

echo '<h1>Hello</h1>' | browsy parse -
curl -s https://example.com | browsy parse -

Examples:

# Parse a local file
browsy parse index.html

# Parse with JSON output
browsy parse index.html --json

# Parse from stdin
cat page.html | browsy parse -

serve

Start the REST API + A2A server.

browsy serve [OPTIONS]

| Flag | Description |
|---|---|
| --port <PORT> | Port to listen on (default: 3847) |
| --allow-private-network | Allow fetching private/LAN addresses |

Examples:

# Start on default port
browsy serve

# Custom port
browsy serve --port 8080

# Allow local development server access
browsy serve --allow-private-network

The server exposes a REST API and A2A protocol endpoints. See REST API and A2A Protocol.

Output formats

Compact format (default)

The compact format is designed for minimal token usage in LLM contexts:

title: Example Domain
url: https://example.com
vp: 1920x1080
els: 3
---
[1:h1 "Example Domain"]
[2:p "This domain is for use in illustrative examples in documents."]
[3:a "More information..." ->https://www.iana.org/domains/example]

The header shows the page title, URL, viewport dimensions, and element count. Each element line follows the pattern [id:tag "text"] with optional annotations:

  • !id:tag -- hidden element
  • id:input:password -- input type (when not "text")
  • [name] -- HTML name attribute
  • [v] -- checked
  • [*] -- required
  • [=value] -- current value
  • ->url -- href
  • narrow / wide / full -- width relative to viewport
  • @region -- position (only when needed to disambiguate duplicates)
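
As a rough illustration of the line pattern, the sketch below parses the header and the simplest element lines of the example output above. It is not part of browsy: parse_compact and its regular expression are hypothetical, and only the [id:tag "text" ->href] shape (plus the ! hidden marker) is handled. Lines carrying other annotations are skipped.

```python
import re

# Hypothetical pattern for the simplest element lines: [id:tag "text" ->href]
ELEMENT_RE = re.compile(r'^\[(!?)(\d+):([\w:]+) "(.*?)"(?: ->(\S+))?\]$')

def parse_compact(text):
    """Split compact output into a header dict and a list of element dicts."""
    header_part, _, body = text.partition("\n---\n")
    header = {}
    for line in header_part.splitlines():
        key, _, value = line.partition(": ")
        header[key] = value
    elements = []
    for line in body.splitlines():
        m = ELEMENT_RE.match(line.strip())
        if m is None:
            continue  # annotations beyond this sketch are skipped
        hidden, el_id, tag, label, href = m.groups()
        elements.append({
            "id": int(el_id),
            "tag": tag,  # may include the input type, e.g. "input:password"
            "text": label,
            "href": href,
            "hidden": hidden == "!",
        })
    return header, elements

sample = """title: Example Domain
url: https://example.com
vp: 1920x1080
els: 3
---
[1:h1 "Example Domain"]
[2:p "This domain is for use in illustrative examples in documents."]
[3:a "More information..." ->https://www.iana.org/domains/example]"""

header, elements = parse_compact(sample)
print(header["title"], len(elements), elements[2]["href"])
```

For anything beyond quick inspection, the JSON format is likely the safer input, since it does not require guessing at annotation order.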

JSON format

The JSON format includes the full SpatialDom structure with all element properties. See the Architecture page for the complete schema.

MCP server mode

browsy also runs as an MCP server for use with Claude Code and other MCP clients. See MCP Server for details.

browsy mcp

Web Search

browsy includes built-in web search via DuckDuckGo and Google. No API keys or external services required -- it fetches search result pages directly and parses the HTML.

Search engines

| Engine | Endpoint | Reliability |
|---|---|---|
| DuckDuckGo | https://html.duckduckgo.com/html/ | High. Uses the HTML-only endpoint, no JavaScript needed. |
| Google | https://www.google.com/search | Variable. Google may return CAPTCHAs or block automated requests. |

DuckDuckGo is the default and recommended engine.

Rust API

use browsy_core::fetch::{Session, SearchEngine};

let mut session = Session::new()?;
let results = session.search("rust web scraping")?;

for r in &results {
    println!("{}: {} -- {}", r.title, r.url, r.snippet);
}

Choosing a search engine

let results = session.search_with("rust web scraping", SearchEngine::Google)?;

Search and read

Search and automatically fetch the top N result pages:

let pages = session.search_and_read("rust web scraping", 3)?;

for page in &pages {
    println!("--- {} ---", page.result.title);
    if let Some(ref dom) = page.dom {
        println!("  Page type: {:?}", dom.page_type);
        println!("  Elements: {}", dom.els.len());
    } else {
        println!("  (fetch failed)");
    }
}

Each SearchPage contains the original SearchResult (title, URL, snippet) and an Option<SpatialDom> for the fetched page. Pages that fail to fetch have dom: None.

To pick the engine explicitly, use search_and_read_with:

let pages = session.search_and_read_with(
    "rust web scraping",
    5,
    SearchEngine::DuckDuckGo,
)?;

Python API

from browsy import Browser

browser = Browser()

# Basic search (DuckDuckGo)
results = browser.search("python asyncio tutorial")
for r in results:
    print(r["title"], r["url"])

Search results are returned as a list of dictionaries, each with title, url, and snippet keys.

MCP API

The search tool accepts a query and optional engine:

{
  "query": "browsy zero-render browser",
  "engine": "duckduckgo"
}

Returns a JSON array of results:

[
  {
    "title": "browsy - Zero-render browser engine",
    "url": "https://example.com/browsy",
    "snippet": "A browser engine for AI agents..."
  }
]

SearchResult struct

pub struct SearchResult {
    pub title: String,
    pub url: String,
    pub snippet: String,
}

How it works

DuckDuckGo

browsy fetches https://html.duckduckgo.com/html/?q=<query>, which returns a pure HTML page with no JavaScript. Results are extracted by finding <div class="result"> containers and parsing the title link (result__a), URL (result__url), and snippet (result__snippet). Redirect URLs are decoded from the uddg query parameter.
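
The redirect decoding step can be sketched with the standard library. This is an illustration, not browsy's code: decode_ddg_redirect is a hypothetical helper, and the shape of the sample redirect link is an assumption about what the HTML endpoint emits.

```python
from urllib.parse import urlparse, parse_qs

def decode_ddg_redirect(href):
    """Return the target URL from a DuckDuckGo redirect link, or href unchanged."""
    query = parse_qs(urlparse(href).query)
    targets = query.get("uddg")
    # parse_qs already percent-decodes the value.
    return targets[0] if targets else href

# Assumed redirect shape: the real target is carried in the uddg parameter.
print(decode_ddg_redirect("//duckduckgo.com/l/?uddg=https%3A%2F%2Fexample.com%2Fdocs&rut=abc"))
```

Links without a uddg parameter are returned as-is, so the helper can be applied to every result URL.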

Google

browsy fetches https://www.google.com/search?q=<query>&num=10. Results are extracted using a structural pattern: anchor tags containing an <h3> descendant. The title comes from the h3 text, the URL from the anchor href (with /url?q= redirect decoding), and snippets from nearby div elements. The parser targets the #rso results container to skip ads and navigation.

Google results may be less reliable because Google actively detects and blocks automated requests. DuckDuckGo's HTML endpoint is specifically designed for non-JavaScript clients and is the recommended default.

Page Types Reference

browsy classifies every page into a PageType to help agents decide what to do next. The classification is based on structural heuristics applied to the Spatial DOM -- no machine learning, no external services.

Page types are evaluated in priority order. The first match wins.

PageType enum

pub enum PageType {
    Error,
    Captcha,
    Login,
    TwoFactorAuth,
    OAuthConsent,
    Inbox,
    EmailBody,
    Dashboard,
    Article,
    SearchResults,
    List,
    Search,
    Form,
    Other,          // default
}

Detection criteria

| Page Type | Detection Criteria |
|---|---|
| Error | Title contains HTTP error codes (404, 500, 403, not found, error) OR page has elements with alert_type == "error". |
| Captcha | Title contains CAPTCHA keywords (captcha, verify you're human, robot, security check, just a moment, attention required) OR heading contains CAPTCHA phrases OR a CAPTCHA service (reCAPTCHA, hCaptcha, Turnstile, Cloudflare challenge) is detected in the HTML structure. |
| Login | Page has a visible <input type="password">. |
| TwoFactorAuth | Title or heading contains verification keywords (verification, enter code, security code, 2fa, two-factor, otp, one-time, passcode) AND page has a visible text/number/tel input. No password field present (that would be Login). |
| OAuthConsent | Title or heading contains OAuth keywords (authorize, allow access, grant permission, oauth, consent). |
| Inbox | Title contains inbox keywords (inbox, mail, messages) AND page has 10+ visible links. |
| EmailBody | Page text contains 3+ of the email markers: from:, to:, subject:, date:. |
| Dashboard | Title or heading contains dashboard keywords (dashboard, welcome back, overview) AND page has both a <nav> and <main> landmark. |
| Article | Page has 3+ headings AND enough long paragraphs (>100 chars). When the page has 20+ links, the threshold is 10 long paragraphs (vs 2 for low-link pages). Pages with 15+ headings must have a paragraph-to-heading ratio of at least 0.8 to distinguish articles (Wikipedia) from heading-heavy list pages (BBC News). |
| SearchResults | Page has a search input (visible or hidden) AND 8+ links AND search context: title/heading contains search-result keywords (search results, results for, search) OR URL contains search query parameters (?q=, ?query=, ?s=, ?search=, /search). |
| List | Page has 10+ visible links. Evaluated after Article and SearchResults. |
| Search | Page has a visible search input. Evaluated after List (many list pages have search bars in navigation). Also fires as a fallback when a page has fewer than 5 visible elements but has a hidden search input (common in JS-rendered search engines without JS execution). |
| Form | Page has 2+ visible data-entry inputs (excludes checkbox, radio, hidden, submit, button, and image inputs). |
| Other | Default when no heuristic matches. |

Evaluation order

The order matters. For example:

  • A login page with a search bar in the nav is classified as Login (password field check comes first), not Search.
  • A search results page with many links is SearchResults, not List, because SearchResults is checked before List.
  • An article with a search bar is Article, not Search, because Article is checked first.
  • An error page with a login form is Error, because error checks come before Login.
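
The first-match-wins behaviour can be sketched as an ordered rule list. This is a simplified illustration rather than browsy's classifier: the feature names are hypothetical, only a subset of page types is covered, and each predicate is a reduced form of the criteria in the table above.

```python
# Each rule is (page type, predicate over a dict of precomputed page features).
# Feature names are hypothetical; the order follows the PageType enum.
RULES = [
    ("Error", lambda f: f.get("has_error_alert", False)),
    ("Captcha", lambda f: f.get("captcha_service") is not None),
    ("Login", lambda f: f.get("visible_password_inputs", 0) > 0),
    ("SearchResults", lambda f: f.get("has_search_input", False)
        and f.get("links", 0) >= 8
        and f.get("search_context", False)),
    ("List", lambda f: f.get("visible_links", 0) >= 10),
    ("Search", lambda f: f.get("has_visible_search_input", False)),
    ("Form", lambda f: f.get("data_entry_inputs", 0) >= 2),
]

def classify(features):
    """Return the first page type whose predicate matches, else "Other"."""
    for page_type, matches in RULES:
        if matches(features):
            return page_type
    return "Other"

# A login page with a nav search bar: the password check runs before Search.
print(classify({"visible_password_inputs": 1, "has_visible_search_input": True}))
# An error page that also contains a login form: Error is checked first.
print(classify({"has_error_alert": True, "visible_password_inputs": 1}))
```

Because the rules are plain data, reordering them is enough to see how a different priority would change the result for the same page.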

Accessing page type

Rust

use browsy_core::output::PageType;

let dom = browsy_core::parse(html, 1920.0, 1080.0);
match dom.page_type {
    PageType::Login => println!("This is a login page"),
    PageType::Article => println!("This is an article"),
    _ => println!("Page type: {:?}", dom.page_type),
}

Python

page = browser.goto("https://example.com")
print(page.page_type())  # "Login", "Article", "Other", etc.

MCP

The page_info tool returns page_type as a string. The browse tool includes it in the JSON output format.

JSON serialization

PageType is serialized as a string. The field is omitted from JSON when the value is Other (via skip_serializing_if).

{
  "page_type": "Login",
  "title": "Sign In",
  "url": "https://example.com/login"
}

Action Recipes Reference

browsy detects structured action patterns on each page and emits them as SuggestedAction variants. Each action provides element IDs that an agent can use directly with click, type_text, check, and select operations.

Actions are detected after page type classification. Multiple actions can coexist on a single page (a login page might also have a Search action for the nav bar and a CookieConsent action for a banner).

SuggestedAction enum

#[derive(Debug, Clone, Serialize, Deserialize)]
#[serde(tag = "action")]
pub enum SuggestedAction {
    Login { ... },
    Register { ... },
    Contact { ... },
    FillForm { ... },
    Search { ... },
    EnterCode { ... },
    Download { ... },
    CaptchaChallenge { ... },
    CookieConsent { ... },
    Consent { ... },
    SelectFromList { ... },
    Paginate { ... },
}

All actions are serialized with an "action" tag field for easy pattern matching.
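
On the consumer side, the tag makes dispatch a simple dictionary lookup. The sketch below works on plain dictionaries shaped like the JSON examples in this reference; find_action is a hypothetical helper, not a browsy API.

```python
def find_action(actions, kind):
    """Return the first suggested action whose "action" tag equals kind, else None."""
    for action in actions:
        if action.get("action") == kind:
            return action
    return None

# Dictionaries shaped like the JSON examples in this reference.
actions = [
    {"action": "CookieConsent", "accept_id": 50, "reject_id": 52},
    {"action": "Login", "username_id": 5, "password_id": 8, "submit_id": 12},
]

login = find_action(actions, "Login")
print(login["username_id"], login["password_id"], login["submit_id"])
```

Optional fields such as remember_me_id may be absent, so reading them with dict.get is the safer choice.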


Login

Detected when the page has a visible password input, a nearby text/email input, and a submit button.

{
  "action": "Login",
  "username_id": 5,
  "password_id": 8,
  "submit_id": 12,
  "remember_me_id": 10
}

| Field | Type | Description |
|---|---|---|
| username_id | u32 | Text or email input nearest to the password field (within 500px Y) |
| password_id | u32 | The <input type="password"> element |
| submit_id | u32 | Nearest submit button below the password field |
| remember_me_id | Option<u32> | Checkbox with "remember" in its label or name |

When it fires: Page has a visible password input and a nearby username/email input. It does not fire when the page has registration context (a confirm password field or registration keywords); Register takes priority in that case. The exception is a page that has both login and registration sections (like Hacker News), where Login takes priority over Register.

Usage: The MCP login tool and Python browser.login() use this action internally. They type into username_id and password_id, then click submit_id.


Register

Detected on registration pages: password field plus either a confirm password field or registration keywords in the title/heading.

{
  "action": "Register",
  "email_id": 3,
  "username_id": 4,
  "password_id": 7,
  "confirm_password_id": 9,
  "name_id": 2,
  "submit_id": 11
}

| Field | Type | Description |
|---|---|---|
| email_id | Option<u32> | Email input |
| username_id | Option<u32> | Username text input |
| password_id | u32 | Primary password input |
| confirm_password_id | Option<u32> | Second password input (confirm) |
| name_id | Option<u32> | Full name text input |
| submit_id | u32 | Submit button |

When it fires: Page has a visible password field AND either (a) two or more password fields (confirm password pattern) or (b) title/heading contains registration keywords (register, sign up, signup, create account, join, new account). Does not fire when login keywords are present alongside confirm password (dual login/register pages prefer Login).


Contact

Detected on contact forms: a textarea (message body) plus contact-related context in the title or headings.

{
  "action": "Contact",
  "name_id": 2,
  "email_id": 4,
  "message_id": 6,
  "submit_id": 8
}

| Field | Type | Description |
|---|---|---|
| name_id | Option<u32> | Name input |
| email_id | Option<u32> | Email input |
| message_id | u32 | Textarea element |
| submit_id | u32 | Submit button |

When it fires: Page has a visible textarea AND title/heading contains contact keywords (contact us, contact form, get in touch, reach out, send us a message, inquiry).


FillForm

Generic form action for pages classified as Form that don't match a more specific pattern (Login, Register, Contact, Search).

{
  "action": "FillForm",
  "fields": [
    { "id": 3, "label": "First Name", "name": "first_name", "type": "text" },
    { "id": 5, "label": "Email Address", "name": "email", "type": "email" },
    { "id": 7, "label": "Phone", "name": "phone", "type": "tel" }
  ],
  "submit_id": 10
}

| Field | Type | Description |
|---|---|---|
| fields | Vec<FormField> | Visible data-entry fields with labels |
| submit_id | u32 | Submit button |

Each FormField contains:

| Field | Type | Description |
|---|---|---|
| id | u32 | Element ID |
| label | Option<String> | Associated label text (from <label> or placeholder) |
| name | Option<String> | HTML name attribute |
| input_type | Option<String> | Input type attribute |

When it fires: Page type is Form (2+ data-entry inputs) AND no more specific form action (Login, Register, Contact, Search) was already detected.


Search

Detected when a search input is present on the page.

{
  "action": "Search",
  "input_id": 15,
  "submit_id": 16
}

| Field | Type | Description |
|---|---|---|
| input_id | u32 | Search input element |
| submit_id | u32 | Submit button |

When it fires: Page has an input matching search criteria: type="search", role="searchbox", name="q", name contains "search", or placeholder contains "search". Prefers visible inputs but falls back to hidden ones (for JS-rendered search engines).


EnterCode

Detected on verification/2FA pages with code-related context.

{
  "action": "EnterCode",
  "input_id": 4,
  "submit_id": 6,
  "code_length": 6
}

| Field | Type | Description |
|---|---|---|
| input_id | u32 | Code input element (first input if multiple narrow digit inputs) |
| submit_id | u32 | Submit button |
| code_length | Option<usize> | Expected code length (set when 4-8 narrow inputs are detected) |

When it fires: Title or heading contains verification keywords AND the page has a visible text/number/tel input. Does not fire if a password field is present (that is Login). Detects separate-digit inputs (width < 60px, 4-8 inputs) and reports the code length.
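
The separate-digit heuristic can be sketched as a width filter. This is an illustration under the thresholds stated above (width < 60px, 4-8 inputs); detect_code_length is a hypothetical name, and the real detector may apply further checks.

```python
def detect_code_length(input_widths, max_width=60, min_count=4, max_count=8):
    """Return the number of narrow inputs if they look like a split-digit code field."""
    narrow = [w for w in input_widths if w < max_width]
    if min_count <= len(narrow) <= max_count:
        return len(narrow)
    return None  # a single wide input, or too few/many narrow ones

print(detect_code_length([48, 48, 48, 48, 48, 48]))
print(detect_code_length([300]))
```

A None result corresponds to a code_length that is left unset, for example when the page uses one ordinary text input for the whole code.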

Usage: The MCP enter_code tool and Python browser.enter_code() use this action internally.


Download

Detected when the page has links or buttons with download-related text or file extension hrefs.

{
  "action": "Download",
  "items": [
    { "id": 20, "text": "Download v2.1.0", "href": "https://example.com/release.zip" },
    { "id": 22, "text": "Download PDF", "href": "https://example.com/guide.pdf" }
  ]
}

| Field | Type | Description |
|---|---|---|
| items | Vec<DownloadItem> | Downloadable links/buttons |

Each DownloadItem contains:

| Field | Type | Description |
|---|---|---|
| id | u32 | Element ID |
| text | Option<String> | Link/button text |
| href | Option<String> | Download URL |

When it fires: Page has visible links or buttons where the text starts with "download" (and is short) or the href ends with a known file extension (.zip, .tar.gz, .dmg, .exe, .msi, .deb, .rpm, .pkg, .appimage, .pdf, .csv, .xlsx).


CaptchaChallenge

Detected when a CAPTCHA service is found in the HTML structure or the page is classified as Captcha.

{
  "action": "CaptchaChallenge",
  "captcha_type": "ReCaptcha",
  "sitekey": "6Le-wvkSAAAAABx7...",
  "submit_id": 15
}

| Field | Type | Description |
|---|---|---|
| captcha_type | CaptchaType | Type of CAPTCHA detected |
| sitekey | Option<String> | Site key from data-sitekey attribute |
| submit_id | Option<u32> | Submit/verify button |

When it fires: Page has a captcha field set (detected CAPTCHA service in HTML) OR page type is Captcha. See CAPTCHA Detection for details.


CookieConsent

Detected when the page has a cookie notice with accept/reject buttons.

{
  "action": "CookieConsent",
  "accept_id": 50,
  "reject_id": 52
}

| Field | Type | Description |
|---|---|---|
| accept_id | u32 | Accept/agree button |
| reject_id | Option<u32> | Reject button (not always present) |

When it fires: Page has a substantial text block (>30 chars) mentioning cookies/GDPR AND a button with accept-related text (accept all, accept cookies, allow cookies, allow all, agree, got it, i understand, i agree).


Consent

Detected on OAuth/authorization consent pages with approve/deny buttons.

{
  "action": "Consent",
  "approve_ids": [30],
  "deny_ids": [32]
}

| Field | Type | Description |
|---|---|---|
| approve_ids | Vec<u32> | Approve/allow/authorize buttons |
| deny_ids | Vec<u32> | Deny/cancel/decline buttons |

When it fires: Title or heading contains OAuth keywords (authorize, allow access, grant permission, oauth, consent) AND the page has buttons with approve or deny text.


SelectFromList

Detected on pages with many links arranged in a list-like pattern.

{
  "action": "SelectFromList",
  "items": [10, 14, 18, 22, 26]
}

| Field | Type | Description |
|---|---|---|
| items | Vec<u32> | One link ID per row (the first link in each row group) |

When it fires: Page has 5+ visible links that form 5+ distinct rows (links within 30px Y are grouped into the same row). The action provides the first link ID from each row as representative items.
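
The row grouping can be sketched as follows. This is an assumption-laden illustration: select_from_list is a hypothetical name, links are given as (element ID, y) pairs, and a link joins the current row when its y is within 30px of that row's first link. The real grouping may compare against a different reference point.

```python
def select_from_list(links, row_tolerance=30, min_rows=5):
    """links: list of (element_id, y) pairs for visible links.

    Returns the first link ID of each row, or None if there are too few rows.
    """
    rows = []  # each entry is (row_y, first_link_id)
    for el_id, y in sorted(links, key=lambda pair: pair[1]):
        if rows and y - rows[-1][0] <= row_tolerance:
            continue  # same row as the current representative
        rows.append((y, el_id))
    if len(rows) < min_rows:
        return None
    return [el_id for _, el_id in rows]

links = [(10, 100), (11, 105), (14, 200), (18, 300), (22, 400), (26, 500)]
print(select_from_list(links))
```

Here link 11 sits 5px below link 10, so it is folded into the first row and only link 10 is reported for it.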


Paginate

Detected when the page has next/previous navigation links or numbered page links.

{
  "action": "Paginate",
  "next_id": 100,
  "prev_id": 98
}

| Field | Type | Description |
|---|---|---|
| next_id | Option<u32> | Next page link |
| prev_id | Option<u32> | Previous page link |

When it fires: Page has links with pagination text (next, prev, previous, >, >>, <, <<, and Unicode equivalents).


Detection order

Actions are detected in this order:

  1. Register (or Login if no registration context)
  2. EnterCode
  3. Consent
  4. Contact
  5. Search
  6. SelectFromList
  7. CookieConsent
  8. Paginate
  9. FillForm (only if no more specific form action exists)
  10. Download
  11. CaptchaChallenge

Multiple actions can coexist. A login page with a cookie banner and nav search bar will have Login, CookieConsent, and Search actions simultaneously.

CAPTCHA Detection

browsy detects CAPTCHAs from HTML structure alone -- no rendering, no image analysis, no JavaScript execution. Detection works by scanning the raw DOM tree for known CAPTCHA service indicators before the Spatial DOM is generated.

CaptchaType enum

pub enum CaptchaType {
    ReCaptcha,           // Google reCAPTCHA v2 or v3
    HCaptcha,            // hCaptcha
    Turnstile,           // Cloudflare Turnstile
    CloudflareChallenge, // Cloudflare JS challenge ("Just a moment...")
    ImageGrid,           // Custom image-grid CAPTCHA ("select all images containing...")
    TextCaptcha,         // Text-based CAPTCHA (type characters from an image)
    Unknown,             // CAPTCHA detected but type not identified
}

Detection signals

browsy scans the layout tree for these patterns:

Script sources

| Pattern | Detected as |
|---|---|
| src contains recaptcha or google.com/recaptcha | ReCaptcha |
| src contains hcaptcha.com | HCaptcha |
| src contains challenges.cloudflare.com/turnstile | Turnstile |

Iframe sources

| Pattern | Detected as |
|---|---|
| src contains recaptcha or google.com/recaptcha | ReCaptcha |
| src contains hcaptcha.com or newassets.hcaptcha.com | HCaptcha |

Div classes

| Pattern | Detected as |
|---|---|
| Class contains g-recaptcha | ReCaptcha |
| Class contains h-captcha | HCaptcha |
| Class contains cf-turnstile | Turnstile |

Div IDs

| Pattern | Detected as |
|---|---|
| ID contains challenge-running or cf-challenge | CloudflareChallenge |

Site key

Any element with a data-sitekey attribute has its value captured. This attribute is used by reCAPTCHA, hCaptcha, and Turnstile to embed the site key.
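
Taken together, the script, iframe, class, ID, and site key signals amount to a handful of substring checks. The sketch below illustrates that idea on a flat list of attribute dictionaries; it is not the detect_captcha_from_tree implementation. The node shape, the first-match-wins rule, and the choice to apply src patterns to any tag are all assumptions.

```python
def detect_captcha(nodes):
    """nodes: dicts with optional "src", "class", "id", "data-sitekey" keys."""
    captcha_type = None
    sitekey = None
    for node in nodes:
        src = node.get("src", "")
        classes = node.get("class", "")
        node_id = node.get("id", "")
        # Capture the first site key seen, whatever element carries it.
        if sitekey is None and node.get("data-sitekey"):
            sitekey = node["data-sitekey"]
        if captcha_type is not None:
            continue  # assumption: the first recognised service wins
        if "recaptcha" in src or "g-recaptcha" in classes:
            captcha_type = "ReCaptcha"
        elif "hcaptcha.com" in src or "h-captcha" in classes:
            captcha_type = "HCaptcha"
        elif "challenges.cloudflare.com/turnstile" in src or "cf-turnstile" in classes:
            captcha_type = "Turnstile"
        elif "challenge-running" in node_id or "cf-challenge" in node_id:
            captcha_type = "CloudflareChallenge"
    if captcha_type is None:
        return None
    return {"captcha_type": captcha_type, "sitekey": sitekey}

nodes = [
    {"tag": "script", "src": "https://www.google.com/recaptcha/api.js"},
    {"tag": "div", "class": "g-recaptcha", "data-sitekey": "SITEKEY-PLACEHOLDER"},
]
print(detect_captcha(nodes))
```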

Title and heading keywords

Page type detection checks title and headings for CAPTCHA-related phrases. These trigger PageType::Captcha even without a known CAPTCHA service:

Title keywords: captcha, verify you're human, verify you are human, robot, security check, challenge, just a moment, attention required, are you human

Heading keywords: captcha, verify you're human, security check, are you human, complete the challenge, human verification

CaptchaInfo struct

pub struct CaptchaInfo {
    pub captcha_type: CaptchaType,
    pub sitekey: Option<String>,
}

The sitekey is populated when a data-sitekey attribute is found. It is the value needed by third-party CAPTCHA solving services.

CaptchaChallenge action

When a CAPTCHA is detected, the CaptchaChallenge suggested action is emitted:

SuggestedAction::CaptchaChallenge {
    captcha_type: CaptchaType,
    sitekey: Option<String>,
    submit_id: Option<u32>,
}

The submit_id is the nearest verify/submit/continue button, if one exists. When no known CAPTCHA service is detected but the page is classified as Captcha, browsy infers the type:

  • 4+ image buttons on the page: ImageGrid
  • Otherwise: Unknown

Session methods

Rust

let mut session = Session::new()?;
let dom = session.goto("https://example.com")?;

// Check if the current page is a CAPTCHA
if session.is_captcha() {
    println!("CAPTCHA detected!");
}

// Get CAPTCHA details
if let Some(info) = session.captcha_info() {
    println!("Type: {:?}", info.captcha_type);
    if let Some(ref key) = info.sitekey {
        println!("Site key: {}", key);
    }
}

Python

browser = Browser()
page = browser.goto("https://example.com")

if page.page_type() == "Captcha":
    for action in page.suggested_actions():
        if action["action"] == "CaptchaChallenge":
            print(f"Type: {action['captcha_type']}")
            print(f"Site key: {action.get('sitekey')}")

MCP behavior

When the browse or click tools return a page detected as Captcha, the output is prefixed with a warning:

CAPTCHA detected (ReCaptcha) -- this page requires human verification to proceed.

The page_info tool includes the full CAPTCHA information:

{
  "page_type": "Captcha",
  "captcha": {
    "captcha_type": "ReCaptcha",
    "sitekey": "6Le-wvkSAAAA..."
  },
  "suggested_actions": [
    {
      "action": "CaptchaChallenge",
      "captcha_type": "ReCaptcha",
      "sitekey": "6Le-wvkSAAAA...",
      "submit_id": 15
    }
  ]
}

What browsy cannot do

browsy detects and classifies CAPTCHAs. It does not solve them. When a CAPTCHA is encountered, the agent has several options:

  1. Human-in-the-loop: Surface the CAPTCHA to a human operator.
  2. Third-party solver: Pass the captcha_type and sitekey to a CAPTCHA solving service (2captcha, Anti-Captcha, etc.), receive the solution token, and inject it.
  3. Alternative approach: Try a different URL, use an API instead of the web interface, or skip the blocked resource.
  4. Wait and retry: Some Cloudflare challenges resolve after a delay.

The sitekey in the CaptchaInfo is the value that third-party solving services typically require.

Detection pipeline

CAPTCHA detection happens in three stages:

  1. Tree scan (detect_captcha_from_tree): Before the Spatial DOM is generated, the layout tree is scanned for CAPTCHA service indicators (script/iframe sources, div classes/IDs, data-sitekey). This produces the CaptchaInfo stored on SpatialDom.captcha.

  2. Page type classification (detect_page_type): After the Spatial DOM is built, the page type heuristic checks for CAPTCHA signals: title keywords, heading keywords, and the presence of captcha on the SpatialDom. If any signal matches, the page is classified as PageType::Captcha.

  3. Action detection (detect_captcha_challenge_action): If captcha is set or the page type is Captcha, the CaptchaChallenge action is emitted with the type, sitekey, and submit button.

CSS Engine

browsy includes a CSS engine built from scratch in Rust. It handles selector matching, property parsing, variable resolution, calc() expressions, @media queries, and specificity ordering. The engine computes the subset of CSS properties needed for layout -- approximately 40 properties that affect bounding box computation.

Architecture

HTML ──> DomNode tree
          │
          ├── <style> blocks ──> parse_stylesheet() ──> Vec<CssRule>
          ├── External <link> CSS ──> fetched + parse_stylesheet()
          ├── Inline style="" ──> parse_inline_style_with_vars()
          │
          └── compute_styles() ──> StyledNode tree (LayoutStyle per node)
                │
                └── Taffy layout ──> bounding boxes

Style computation walks the DOM tree, matching each element against all CSS rules by specificity. Inline styles override stylesheet rules. CSS custom properties (--var) inherit through the tree.

Selector matching

The selector engine supports these selector types:

| Selector | Example | Description |
|---|---|---|
| Tag | div, button | Matches element tag name |
| Class | .nav-item | Matches class attribute |
| ID | #header | Matches id attribute |
| Universal | * | Matches any element |
| Descendant | div p | Matches p inside any div ancestor |
| Child | div > p | Matches p that is a direct child of div |
| Pseudo-class | :hover, :first-child | Parsed but ignored for layout (no interaction state) |
| Attribute (exists) | [disabled] | Element has the attribute |
| Attribute (exact) | [type="submit"] | Attribute equals value |
| Attribute (word) | [class~="active"] | Whitespace-separated word match |
| Attribute (prefix) | [href^="/"] | Attribute starts with value |
| Attribute (suffix) | [src$=".png"] | Attribute ends with value |
| Attribute (contains) | [class*="btn"] | Attribute contains substring |
| Attribute (hyphen-prefix) | [lang\|="en"] | Exact match or prefix with hyphen |
| Comma-separated | h1, h2, h3 | Union of selectors |

Specificity

Selectors are ordered by CSS specificity rules:

  • ID selectors: weight 100
  • Class selectors, attribute selectors, pseudo-classes: weight 10
  • Tag selectors, universal: weight 1

Higher specificity rules override lower specificity rules. Equal specificity resolves by source order (later wins). Inline styles always win over stylesheet rules.
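
To make the ordering concrete, here is a small sketch of the weighting and tie-break described above. It is an illustration, not the engine's code: specificity and winning_value are hypothetical helpers, the regular expressions cover only the selector forms in the table, and winning_value assumes every rule already matches the element.

```python
import re

def specificity(selector):
    """Weight of a selector using the 100/10/1 scheme described above."""
    # Pull out attribute selectors first so their contents are not miscounted.
    attrs = re.findall(r"\[[^\]]*\]", selector)
    rest = re.sub(r"\[[^\]]*\]", " ", selector)
    ids = re.findall(r"#[\w-]+", rest)
    classes = re.findall(r"\.[\w-]+", rest)
    pseudos = re.findall(r":[\w-]+", rest)
    # Whatever remains after removing those is tag names and '*'.
    stripped = re.sub(r"[#.:][\w-]+", " ", rest)
    tags = re.findall(r"[A-Za-z][\w-]*|\*", stripped)
    return 100 * len(ids) + 10 * (len(classes) + len(attrs) + len(pseudos)) + len(tags)

def winning_value(rules):
    """rules: (selector, value) pairs in source order; later wins on ties."""
    best = max(enumerate(rules), key=lambda item: (specificity(item[1][0]), item[0]))
    return best[1][1]

print(specificity("#header"), specificity("div.card:hover"), specificity("div > p"))
print(winning_value([("#main", "10px"), (".box", "20px"), ("div", "30px")]))
print(winning_value([(".a", "1px"), (".b", "2px")]))
```

Summing weights into one integer mirrors the description here; note that with this scheme eleven classes would outweigh one ID, which standard CSS specificity does not allow.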

Property parsing

Supported properties

The engine parses approximately 40 layout-affecting CSS properties:

| Category | Properties |
|---|---|
| Box model | display, box-sizing, width, height, min-width, min-height, max-width, max-height |
| Spacing | margin (+ sides), padding (+ sides), border-width (+ sides) |
| Position | position, top, right, bottom, left |
| Flexbox | flex-direction, flex-wrap, flex-grow, flex-shrink, flex-basis, align-items, align-self, justify-content, gap |
| Grid | grid-template-columns, grid-template-rows, grid-column, grid-row |
| Typography | font-size, line-height |
| Visibility | visibility, overflow |

Shorthand properties are expanded: margin: 10px 20px expands to margin-top, margin-right, margin-bottom, margin-left. Similarly for padding, border-width, flex, and gap.
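
The expansion follows the standard CSS 1-4 value rule. A minimal sketch, assuming whitespace-separated values with no embedded spaces (so calc() arguments would need a real tokenizer); expand_box_shorthand is a hypothetical helper name.

```python
def expand_box_shorthand(prop, value):
    """Expand a 1-4 value shorthand such as margin or padding into four sides."""
    parts = value.split()
    if len(parts) == 1:
        top = right = bottom = left = parts[0]
    elif len(parts) == 2:
        top = bottom = parts[0]   # vertical
        right = left = parts[1]   # horizontal
    elif len(parts) == 3:
        top, right, bottom = parts
        left = right
    elif len(parts) == 4:
        top, right, bottom, left = parts
    else:
        raise ValueError(f"expected 1-4 values, got {len(parts)}")
    return {
        f"{prop}-top": top,
        f"{prop}-right": right,
        f"{prop}-bottom": bottom,
        f"{prop}-left": left,
    }

print(expand_box_shorthand("margin", "10px 20px"))
```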

Dimension types

pub enum Dimension {
    Px(f32),           // Absolute pixels
    Percent(f32),      // Percentage of parent
    Calc(f32, f32),    // calc() result: (px_component, percent_component)
    Auto,              // Auto sizing
}

The engine resolves em values against the element's computed font-size and rem values against the root font size (16px default).
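
A minimal sketch of that resolution step, assuming em and rem are converted to pixels at parse time. The Dimension enum is repeated so the example is self-contained; parse_dimension and ROOT_FONT_SIZE are hypothetical names, not browsy's API.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Dimension {
    Px(f32),
    Percent(f32),
    Calc(f32, f32),
    Auto,
}

/// Default root font size stated in the text above.
const ROOT_FONT_SIZE: f32 = 16.0;

/// Parse a single length token. `font_size` is the element's computed
/// font-size in pixels, used to resolve em units.
pub fn parse_dimension(value: &str, font_size: f32) -> Option<Dimension> {
    let v = value.trim();
    if v == "auto" {
        return Some(Dimension::Auto);
    }
    // Check "rem" before "em": every rem value also ends with "em".
    if let Some(n) = v.strip_suffix("rem") {
        return n.parse::<f32>().ok().map(|n| Dimension::Px(n * ROOT_FONT_SIZE));
    }
    if let Some(n) = v.strip_suffix("em") {
        return n.parse::<f32>().ok().map(|n| Dimension::Px(n * font_size));
    }
    if let Some(n) = v.strip_suffix("px") {
        return n.parse::<f32>().ok().map(Dimension::Px);
    }
    if let Some(n) = v.strip_suffix('%') {
        return n.parse::<f32>().ok().map(Dimension::Percent);
    }
    None // unsupported unit or malformed number
}
```

With a 20px computed font size, 2em resolves to 40px while 1.5rem resolves to 24px regardless of the element's own font size.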

var() resolution

CSS custom properties are collected during style computation and inherited through the DOM tree:

:root {
  --primary-color: #333;
  --spacing: 16px;
}

.container {
  padding: var(--spacing);
  color: var(--primary-color);
}

.card {
  margin: var(--spacing-large, 24px);  /* fallback value */
}

The var() resolver supports:

  • Simple references: var(--name)
  • Fallback values: var(--name, fallback)
  • Nested var() references in fallbacks
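
The three cases listed above can be sketched with a small recursive resolver. This is an assumption-laden illustration: resolve_vars is a hypothetical name, the custom properties are passed in as an already-inherited map, and cyclic references are not detected.

```rust
use std::collections::HashMap;

/// Resolve every var() reference in `value` against `vars`.
/// Returns None when a referenced property is missing and has no fallback.
pub fn resolve_vars(value: &str, vars: &HashMap<String, String>) -> Option<String> {
    let start = match value.find("var(") {
        Some(i) => i,
        None => return Some(value.to_string()), // nothing to resolve
    };
    // Find the parenthesis that closes this var(...), tracking nesting.
    let body_start = start + 4;
    let mut depth = 1;
    let mut end = None;
    for (i, c) in value[body_start..].char_indices() {
        match c {
            '(' => depth += 1,
            ')' => {
                depth -= 1;
                if depth == 0 {
                    end = Some(body_start + i);
                    break;
                }
            }
            _ => {}
        }
    }
    let end = end?; // unbalanced parentheses
    let body = &value[body_start..end];
    // The name ends at the first comma; everything after it is the fallback.
    let (name, fallback) = match body.split_once(',') {
        Some((n, f)) => (n.trim(), Some(f.trim())),
        None => (body.trim(), None),
    };
    let replacement = match vars.get(name) {
        Some(v) => resolve_vars(v, vars)?,       // values may reference other vars
        None => resolve_vars(fallback?, vars)?,  // fallbacks may nest var()
    };
    // Continue with any further var() references after this one.
    let rest = resolve_vars(&value[end + 1..], vars)?;
    Some(format!("{}{}{}", &value[..start], replacement, rest))
}
```

Given --spacing: 16px, this turns var(--spacing-large, 24px) into 24px and calc(2 * var(--spacing)) into calc(2 * 16px).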

calc() expressions

The calc() parser handles full arithmetic expressions with mixed units:

.element {
  width: calc(100% - 32px);
  margin: calc(16px + 1em);
  padding: calc(2 * var(--spacing));
}

Supported operators: +, -, *, /. The parser respects operator precedence and handles parenthesized sub-expressions. Mixed px and % units are preserved as a Calc(px, percent) dimension and resolved during layout.
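
To make the (px, percent) representation concrete, here is a small recursive-descent sketch of such a parser. It is not browsy's implementation: all names are hypothetical, only px and % units are handled (em, rem, and var() are assumed to be resolved beforehand), and negative literals are not supported.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Val {
    Num(f32),      // unitless number
    Len(f32, f32), // (px component, percent component)
}

struct Parser<'a> {
    src: &'a [u8],
    pos: usize,
}

impl<'a> Parser<'a> {
    /// Skip spaces, then return the next byte without consuming it.
    fn peek(&mut self) -> Option<u8> {
        while self.pos < self.src.len() && self.src[self.pos] == b' ' {
            self.pos += 1;
        }
        self.src.get(self.pos).copied()
    }

    // expr := term (('+' | '-') term)*
    fn expr(&mut self) -> Option<Val> {
        let mut acc = self.term()?;
        while let Some(op @ (b'+' | b'-')) = self.peek() {
            self.pos += 1;
            let rhs = self.term()?;
            let sign = if op == b'+' { 1.0 } else { -1.0 };
            acc = match (acc, rhs) {
                (Val::Num(a), Val::Num(b)) => Val::Num(a + sign * b),
                (Val::Len(p, q), Val::Len(r, s)) => Val::Len(p + sign * r, q + sign * s),
                _ => return None, // cannot add a number to a length
            };
        }
        Some(acc)
    }

    // term := atom (('*' | '/') atom)*  -- binds tighter than + and -
    fn term(&mut self) -> Option<Val> {
        let mut acc = self.atom()?;
        while let Some(op @ (b'*' | b'/')) = self.peek() {
            self.pos += 1;
            let rhs = self.atom()?;
            acc = match (op, acc, rhs) {
                (b'*', Val::Num(a), Val::Num(b)) => Val::Num(a * b),
                (b'*', Val::Num(n), Val::Len(p, q))
                | (b'*', Val::Len(p, q), Val::Num(n)) => Val::Len(p * n, q * n),
                (b'/', Val::Num(a), Val::Num(b)) => Val::Num(a / b),
                (b'/', Val::Len(p, q), Val::Num(n)) => Val::Len(p / n, q / n),
                _ => return None, // e.g. length * length
            };
        }
        Some(acc)
    }

    // atom := '(' expr ')' | number ('px' | '%')?
    fn atom(&mut self) -> Option<Val> {
        if self.peek()? == b'(' {
            self.pos += 1;
            let v = self.expr()?;
            if self.peek()? != b')' {
                return None;
            }
            self.pos += 1;
            return Some(v);
        }
        let start = self.pos;
        while self.pos < self.src.len()
            && (self.src[self.pos].is_ascii_digit() || self.src[self.pos] == b'.')
        {
            self.pos += 1;
        }
        let n: f32 = std::str::from_utf8(&self.src[start..self.pos]).ok()?.parse().ok()?;
        let rest = &self.src[self.pos..];
        if rest.starts_with(b"px") {
            self.pos += 2;
            Some(Val::Len(n, 0.0))
        } else if rest.starts_with(b"%") {
            self.pos += 1;
            Some(Val::Len(0.0, n))
        } else {
            Some(Val::Num(n))
        }
    }
}

/// Parse "calc(...)" into (px, percent) components.
pub fn parse_calc(input: &str) -> Option<(f32, f32)> {
    let inner = input.trim().strip_prefix("calc(")?.strip_suffix(')')?;
    let mut p = Parser { src: inner.as_bytes(), pos: 0 };
    let v = p.expr()?;
    if p.peek().is_some() {
        return None; // trailing input
    }
    match v {
        Val::Len(px, pct) => Some((px, pct)),
        Val::Num(_) => None, // a bare number is not a length
    }
}
```

In this sketch calc(100% - 32px) becomes (-32.0, 100.0), which a layout pass could later evaluate as 100% of the parent size minus 32 pixels.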

@media queries

The engine evaluates @media queries against the viewport dimensions provided at parse time:

@media (max-width: 768px) {
  .sidebar { display: none; }
}

@media screen and (min-width: 1024px) {
  .container { max-width: 1200px; }
}

Supported media features

| Feature | Example | Description |
| --- | --- | --- |
| min-width | (min-width: 768px) | Viewport width >= value |
| max-width | (max-width: 1024px) | Viewport width <= value |
| min-height | (min-height: 600px) | Viewport height >= value |
| max-height | (max-height: 900px) | Viewport height <= value |
| width | (width: 1920px) | Exact viewport width |
| height | (height: 1080px) | Exact viewport height |
| orientation | (orientation: portrait) | Portrait or landscape |
| screen | screen | Always matches |
| print | print | Never matches |
| all | all | Always matches |

Multiple conditions joined with and are evaluated conjunctively. The screen and / all and prefix is stripped before evaluating conditions.
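
The evaluation rules above can be sketched as a conjunction over the and-separated conditions. This is a simplified illustration with hypothetical function names: it accepts the text after @media, treats screen and all as always-true conditions (equivalent to stripping the prefix), handles only px values, and uses the CSS definition of portrait as height >= width.

```rust
/// Evaluate a media query (the text after `@media`) against a viewport.
pub fn media_matches(query: &str, vw: f32, vh: f32) -> bool {
    query.split(" and ").map(str::trim).all(|cond| match cond {
        "screen" | "all" => true,
        "print" => false,
        _ => feature_matches(cond, vw, vh),
    })
}

/// Evaluate one parenthesized feature such as "(max-width: 768px)".
fn feature_matches(cond: &str, vw: f32, vh: f32) -> bool {
    let inner = match cond.strip_prefix('(').and_then(|c| c.strip_suffix(')')) {
        Some(i) => i,
        None => return false,
    };
    let (name, value) = match inner.split_once(':') {
        Some((n, v)) => (n.trim(), v.trim()),
        None => return false,
    };
    if name == "orientation" {
        return match value {
            "portrait" => vh >= vw,
            "landscape" => vw > vh,
            _ => false,
        };
    }
    let px = match value.strip_suffix("px").and_then(|n| n.parse::<f32>().ok()) {
        Some(p) => p,
        None => return false, // non-px units are out of scope for this sketch
    };
    match name {
        "min-width" => vw >= px,
        "max-width" => vw <= px,
        "min-height" => vh >= px,
        "max-height" => vh <= px,
        "width" => vw == px,
        "height" => vh == px,
        _ => false, // unknown features never match
    }
}
```

For a 1920x1080 viewport, "screen and (min-width: 1024px)" matches and "(max-width: 768px)" does not.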

External stylesheets

When using the fetch feature (enabled by default), browsy automatically fetches external CSS linked via <link rel="stylesheet"> tags. Fetched CSS is parsed and merged with inline <style> blocks during style computation.

Resource limits prevent abuse:

  • Maximum total CSS bytes (across all external stylesheets)
  • Maximum bytes per individual stylesheet
  • Blocked URL patterns (analytics, tracking, ad-related CSS)
  • Private network and non-HTTP URL blocking
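
One way such limits could be enforced is sketched below. Everything here is hypothetical: the CssBudget type, the url_allowed function, the limit values, and the blocked patterns are illustrative only, since the actual thresholds and pattern list are not stated here. Hostnames are not resolved, so a real check would also need to inspect the resolved IP address.

```rust
use std::net::Ipv4Addr;

/// Running byte budget for external CSS with caller-supplied limits.
pub struct CssBudget {
    max_total: usize,
    max_per_sheet: usize,
    used: usize,
}

impl CssBudget {
    pub fn new(max_total: usize, max_per_sheet: usize) -> Self {
        CssBudget { max_total, max_per_sheet, used: 0 }
    }

    /// Accept a stylesheet only if it fits both the per-sheet and total limits.
    pub fn try_accept(&mut self, sheet_len: usize) -> bool {
        if sheet_len > self.max_per_sheet || self.used + sheet_len > self.max_total {
            return false;
        }
        self.used += sheet_len;
        true
    }
}

/// Reject non-HTTP schemes, blocked patterns, and literal private addresses.
pub fn url_allowed(url: &str, blocked_patterns: &[&str]) -> bool {
    let rest = match url.strip_prefix("https://").or_else(|| url.strip_prefix("http://")) {
        Some(r) => r,
        None => return false, // file:, data:, ftp:, ...
    };
    if blocked_patterns.iter().any(|p| url.contains(*p)) {
        return false;
    }
    // The host ends at the first '/' (path) or ':' (port).
    let host = rest.split(|c: char| c == '/' || c == ':').next().unwrap_or("");
    if host == "localhost" {
        return false;
    }
    match host.parse::<Ipv4Addr>() {
        Ok(ip) => !(ip.is_private() || ip.is_loopback() || ip.is_link_local()),
        Err(_) => true, // a hostname; a real check would resolve DNS first
    }
}
```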

Layout engine

After style computation, browsy feeds the styled tree into Taffy (from the Dioxus project) for layout computation. Taffy handles:

  • Flexbox: All flex container and flex item properties
  • CSS Grid: Template columns/rows, explicit placement
  • Block layout: Standard block flow with margins, padding, borders

Taffy returns bounding boxes (x, y, width, height) for every element, which browsy uses to build the Spatial DOM.

What is NOT supported

The CSS engine focuses on properties that affect element position and size. The following are intentionally not implemented:

  • Visual properties: color, background, border-color, border-radius, box-shadow, opacity, z-index
  • Transforms: transform, translate, rotate, scale
  • Animations: animation, transition, @keyframes
  • Pseudo-elements: ::before, ::after, ::placeholder (no content generation)
  • Advanced selectors: :nth-child(), :not(), ~ (general sibling), + (adjacent sibling)
  • Advanced grid: grid-auto-flow, grid-auto-rows, named grid areas, minmax() in some contexts
  • Columns: column-count, column-width
  • Table layout: table-layout, border-collapse

These omissions are by design. browsy computes where elements are and how large they are, not what they look like. The Spatial DOM output contains position and size data; color and visual styling are irrelevant for agent interaction.

Architecture

browsy is a zero-render browser engine. It converts raw HTML into a flat list of interactive and text elements with bounding boxes, page type classification, and suggested actions -- without rendering pixels or executing JavaScript.

Pipeline

HTML
 │
 ├──────────────────────────────────────────────────────────────────┐
 v                                                                  │
DOM Parser (html5ever)                                              │
 │                                                                  │
 v                                                                  │
DomNode tree ──> External CSS fetch (reqwest) ──> merged CSS text   │
 │                                                                  │
 v                                                                  │
CSS Engine (browsy)                                                 │
 ├── Selector matching (tag, class, ID, attribute, combinators)     │
 ├── Property parsing (var(), calc(), shorthands)                   │
 ├── @media query evaluation                                        │
 └── Specificity + cascade ordering                                 │
 │                                                                  │
 v                                                                  │
StyledNode tree (LayoutStyle per element)                           │
 │                                                                  │
 v                                                                  │
Layout Engine (Taffy)                                               │
 ├── Flexbox                                                        │
 ├── CSS Grid                                                       │
 └── Block flow                                                     │
 │                                                                  │
 v                                                                  │
LayoutNode tree (with bounding boxes)                               │
 │                                                                  │
 v                                                                  │
Spatial DOM Generator (browsy)                                      │
 ├── Element emission (interactive + text + landmark + img)         │
 ├── CAPTCHA detection (from tree scan)                             │
 ├── Deduplication (wrapper skip)                                   │
 ├── Hidden content preservation                                    │
 ├── Text fallback chain (aria-label > title > img alt > svg title) │
 ├── Label association (<label for="id">)                           │
 ├── URL resolution (relative -> absolute)                          │
 ├── Page type classification                                       │
 └── Suggested action detection                                     │
 │                                                                  │
 v                                                                  │
SpatialDom                                                          │
 ├── els: Vec<SpatialElement>  (flat list with IDs + bounds)        │
 ├── page_type: PageType                                            │
 ├── suggested_actions: Vec<SuggestedAction>                        │
 ├── captcha: Option<CaptchaInfo>                                   │
 └── title, url, viewport, scroll                                   │

Entry point

The primary entry point is browsy_core::parse:

pub fn parse(html: &str, viewport_width: f32, viewport_height: f32) -> SpatialDom {
    let dom_tree = dom::parse_html(html);
    let styled = css::compute_styles_with_viewport(&dom_tree, viewport_width, viewport_height);
    let laid_out = layout::compute_layout(&styled, viewport_width, viewport_height);
    output::generate_spatial_dom(&laid_out, viewport_width, viewport_height)
}

For network-aware usage, Session::goto() fetches the HTML, resolves external CSS, and runs the full pipeline.

Project structure

crates/
  core/                 browsy-core library (the engine)
    src/
      lib.rs              Entry point: parse(html, w, h) -> SpatialDom
      dom/mod.rs           HTML -> DomNode tree (thin wrapper around html5ever)
      css/
        mod.rs              Style computation, CSS variable inheritance
        selector.rs         CSS selector matching engine
        properties.rs       CSS property parsing, var() resolution, calc()
      layout/mod.rs        Style tree -> Taffy -> bounding boxes
      output/mod.rs        SpatialDom generation, page type, actions, CAPTCHA
      js/mod.rs            Behavior detection from HTML attributes
      fetch/
        mod.rs              HTTP fetching, form extraction, resource blocking
        session.rs          Session API, search, navigation, form interaction
    tests/
      css_layout.rs        CSS + layout integration tests
      output.rs            Spatial DOM output tests
      benchmark.rs         Detection accuracy benchmark runner
      corpus/              HTML snapshots with ground truth labels

  cli/                  browsy CLI binary
    src/main.rs           fetch and parse commands

  mcp/                  browsy MCP server
    src/
      lib.rs              MCP tool definitions (14 tools)
      main.rs             stdio server entry point

  python/               Python bindings (PyO3)
    src/lib.rs            Browser, Page, Element classes
    browsy/__init__.py    Python module

What is ours vs external

browsy depends on two external crates for foundational work:

| Crate | Role | What it does |
| --- | --- | --- |
| html5ever (Mozilla/Servo) | HTML parsing | Converts raw HTML into a DOM tree. Handles malformed HTML, character encoding, and the full HTML5 parsing algorithm. |
| Taffy (Dioxus) | Layout computation | Computes bounding boxes from a style tree. Handles Flexbox, CSS Grid, and block layout. |

Everything else is built from scratch in browsy:

| Component | Description |
| --- | --- |
| CSS selector matching | Tag, class, ID, attribute selectors (7 operator types), descendant/child combinators, specificity ordering |
| CSS property parsing | ~40 layout properties, shorthand expansion, var() resolution with fallbacks, calc() with full expression parser |
| CSS variables | Custom property collection, inheritance through DOM tree |
| @media queries | min-width, max-width, min-height, max-height, orientation, screen/print |
| Spatial DOM output | Element emission, deduplication, landmark markers, text fallback chains, hidden content exposure, alert detection, table extraction |
| Page intelligence | Page type classification (14 types), suggested action detection (12 action types), CAPTCHA detection (7 CAPTCHA types), pagination detection, verification code extraction |
| Session API | Cookie persistence, navigation history, form state overlay, form submission, compound actions (login, enter_code) |
| Web search | DuckDuckGo and Google result parsing |
| Behavior detection | onclick/ARIA/Bootstrap pattern inference from HTML attributes |

Key design decisions

Hidden content exposure

Elements with display:none, visibility:hidden, aria-hidden="true", or the hidden attribute are NOT discarded. They appear in the Spatial DOM with hidden: true. This is intentional -- agents need to see dropdown menus, accordion panels, modal dialogs, tab content, and other JS-toggled content that is present in the HTML but not visible without JavaScript execution.

Landmark markers

HTML5 landmarks (<nav>, <header>, <footer>, <main>, <aside>, <section>, <form>) and elements with explicit landmark ARIA roles are emitted as structural markers with their role only -- no recursive text collection. Their children carry the actual content. This prevents a <nav> from emitting a giant concatenated string of all its link texts.

Text fallback chain

Interactive elements (links, buttons) that contain only images or icons and no text of their own take their text from the first available source in this fallback chain:

  1. aria-label attribute
  2. title attribute
  3. Child <img alt> text
  4. Child <svg><title> text

This ensures that icon-only buttons like a hamburger menu or close button have accessible text in the Spatial DOM.
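
The chain can be sketched as a first-match lookup. The IconElement struct and accessible_text function are hypothetical names, and skipping empty or whitespace-only candidates is an assumption of this sketch rather than documented behavior.

```rust
/// Text sources for one interactive element (hypothetical shape).
pub struct IconElement<'a> {
    pub own_text: &'a str,
    pub aria_label: Option<&'a str>,
    pub title: Option<&'a str>,
    pub img_alt: Option<&'a str>,   // alt of a child <img>
    pub svg_title: Option<&'a str>, // <title> of a child <svg>
}

/// Return the element's own text, or the first non-empty fallback in
/// priority order: aria-label, title, img alt, svg title.
pub fn accessible_text<'a>(el: &IconElement<'a>) -> Option<&'a str> {
    let own = el.own_text.trim();
    if !own.is_empty() {
        return Some(own);
    }
    let chain = [el.aria_label, el.title, el.img_alt, el.svg_title];
    chain
        .iter()
        .filter_map(|candidate| *candidate)
        .map(str::trim)
        .find(|text| !text.is_empty())
}
```

An icon-only button with title="Menu" and a child image with alt="icon" would be reported as "Menu", because title ranks above the image's alt text.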

SVG handling

SVG child elements are not emitted (they are visual, not semantic). However, <svg><title> text is extracted and stored as the SVG element's aria-label, making it available through the text fallback chain.

Deduplication

Wrapper elements that only wrap a single interactive child (like <li><a>..., <td><span>..., <p><a>...) are skipped. Only the meaningful child element is emitted. This prevents duplicate text in the output. When a wrapper has its own text that would not be captured by the child, it is emitted with only its own text.

Zero-size skip

Visible elements with zero width and height are skipped as layout artifacts. Hidden elements are always preserved regardless of size.
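
Combined with the hidden content rule, the emission decision might look like the sketch below. The Visibility struct and both functions are hypothetical, the sketch reads "zero width and height" as both dimensions being zero, and it does not model hiding inherited from ancestors.

```rust
/// Style and attribute facts for one element (hypothetical shape).
pub struct Visibility<'a> {
    pub display: &'a str,
    pub visibility: &'a str,
    pub aria_hidden: Option<&'a str>,
    pub has_hidden_attr: bool,
}

/// An element is hidden if any of the four signals is present.
pub fn is_hidden(v: &Visibility) -> bool {
    v.display == "none"
        || v.visibility == "hidden"
        || v.aria_hidden == Some("true")
        || v.has_hidden_attr
}

/// Hidden elements are always emitted (flagged hidden: true);
/// visible ones are skipped only when both width and height are zero.
pub fn should_emit(hidden: bool, width: f32, height: f32) -> bool {
    hidden || !(width == 0.0 && height == 0.0)
}
```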

Element ID assignment

Element IDs are assigned sequentially (1, 2, 3, ...) during a single parse. IDs are NOT stable across page loads -- they are positional, not content-based. The delta diff system uses content keys (tag + text + href + bounds) rather than IDs to match elements across page transitions.
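
A minimal sketch of content-key matching, assuming a key built from exactly the four fields named above. The El struct, the integer bounds, and the diff function are hypothetical; the real delta system may normalize or weight these fields differently.

```rust
use std::collections::HashSet;

#[derive(Debug, Clone, PartialEq)]
pub struct El {
    pub id: u32,
    pub tag: String,
    pub text: String,
    pub href: Option<String>,
    pub bounds: (i32, i32, i32, i32), // x, y, width, height (integer pixels assumed)
}

/// Identity of an element independent of its sequential ID.
type ContentKey<'a> = (&'a str, &'a str, Option<&'a str>, (i32, i32, i32, i32));

fn content_key(el: &El) -> ContentKey<'_> {
    (el.tag.as_str(), el.text.as_str(), el.href.as_deref(), el.bounds)
}

/// Compare two parses. Returns the old IDs that disappeared and the
/// new elements that have no content-key match in the old page.
pub fn diff<'a>(old: &[El], new: &'a [El]) -> (Vec<u32>, Vec<&'a El>) {
    let old_keys: HashSet<ContentKey> = old.iter().map(content_key).collect();
    let new_keys: HashSet<ContentKey> = new.iter().map(content_key).collect();
    let removed = old
        .iter()
        .filter(|e| !new_keys.contains(&content_key(e)))
        .map(|e| e.id)
        .collect();
    let added = new
        .iter()
        .filter(|e| !old_keys.contains(&content_key(e)))
        .collect();
    (removed, added)
}
```

Because matching ignores IDs, a link that keeps its tag, text, href, and bounds is treated as unchanged even if it was element 2 on the old page and element 1 on the new one.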

Testing

Integration tests

Tests live in crates/core/tests/ as integration tests:

cargo test -p browsy-core                        # all tests
cargo test -p browsy-core --test css_layout      # CSS + layout
cargo test -p browsy-core --test output          # Spatial DOM output

Detection benchmark

The crates/core/tests/corpus/ directory contains HTML snapshots of real websites with ground truth labels in manifest.json. The benchmark runner parses every snapshot and verifies:

  • Correct page type classification
  • Correct suggested action detection
  • Valid element IDs in all actions (referencing real elements)
  • Verification code extraction accuracy

cargo test -p browsy-core --test benchmark -- --nocapture

Adding a new test case:

  1. Harvest an HTML snapshot with HARVEST_URL and HARVEST_NAME environment variables.
  2. Add the expected labels to corpus/manifest.json.
  3. Run the benchmark to confirm the failure.
  4. Fix the heuristics in output/mod.rs.
  5. Re-run the benchmark to confirm the fix with no regressions.

Output formats

JSON

Full structured output via serde_json. All optional fields use skip_serializing_if to keep the JSON compact.

Compact text

A minimal text format designed for LLM token efficiency:

[1:h1 "Page Title"]
[!2:div "Hidden content"]
[3:input:email [email] [*] "Enter email" wide]
[4:button "Submit" full]
[5:a "Link" ->https://example.com @top-R]

Each element is one line: [id:tag "text"] with annotations for type, name, state, size, href, and position.
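
The line layout can be reproduced with a small formatter. This sketch is inferred from the examples above and is not browsy's serializer: CompactEl and to_compact are hypothetical names, the leading ! is assumed to mark hidden elements, the state marker is passed through as an opaque string, and the annotation order follows the sentence above.

```rust
#[derive(Default)]
pub struct CompactEl<'a> {
    pub id: u32,
    pub tag: &'a str,
    pub hidden: bool,
    pub input_type: Option<&'a str>,
    pub name: Option<&'a str>,
    pub state: Option<&'a str>,
    pub text: &'a str,
    pub size: Option<&'a str>,
    pub href: Option<&'a str>,
    pub position: Option<&'a str>,
}

/// Render one element as a single compact line.
pub fn to_compact(el: &CompactEl) -> String {
    let mut out = String::from("[");
    if el.hidden {
        out.push('!'); // assumed hidden marker, as in [!2:div ...]
    }
    out.push_str(&format!("{}:{}", el.id, el.tag));
    if let Some(t) = el.input_type {
        out.push_str(&format!(":{}", t));
    }
    if let Some(n) = el.name {
        out.push_str(&format!(" [{}]", n));
    }
    if let Some(s) = el.state {
        out.push_str(&format!(" [{}]", s));
    }
    out.push_str(&format!(" \"{}\"", el.text));
    if let Some(s) = el.size {
        out.push_str(&format!(" {}", s));
    }
    if let Some(h) = el.href {
        out.push_str(&format!(" ->{}", h));
    }
    if let Some(p) = el.position {
        out.push_str(&format!(" @{}", p));
    }
    out.push(']');
    out
}
```

Optional fields that are None simply leave no trace in the line, which is what keeps the format short for plain headings and buttons.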

Delta format

For page transitions, the delta format shows only what changed:

-[3,5,7]
[+8:h1 "New Heading"]
[+9:a "New Link" ->https://example.com]

Removed element IDs are prefixed with -, added/changed elements with +.
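
As a rough sketch of how such output could be assembled from a list of removed IDs and added elements: the Added struct and render_delta function are hypothetical, and omitting the removal line when nothing was removed is an assumption.

```rust
/// A new or changed element to report in the delta (hypothetical shape).
pub struct Added<'a> {
    pub id: u32,
    pub tag: &'a str,
    pub text: &'a str,
    pub href: Option<&'a str>,
}

/// Render removed IDs on one line, then one line per added element.
pub fn render_delta(removed: &[u32], added: &[Added]) -> String {
    let mut lines = Vec::new();
    if !removed.is_empty() {
        let ids: Vec<String> = removed.iter().map(|id| id.to_string()).collect();
        lines.push(format!("-[{}]", ids.join(",")));
    }
    for el in added {
        let href = el
            .href
            .map(|h| format!(" ->{}", h))
            .unwrap_or_default();
        lines.push(format!("[+{}:{} \"{}\"{}]", el.id, el.tag, el.text, href));
    }
    lines.join("\n")
}
```

Feeding it removed IDs 3, 5, 7 plus the heading and link from the example above reproduces those three lines.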