browsy
Zero-render browser engine for AI agents. browsy.dev
browsy converts web pages into a structured Spatial DOM -- a flat list of interactive and text elements with bounding boxes, roles, and states -- without rendering pixels. On top of this, it layers page intelligence: automatic page type detection, suggested actions with stable element IDs, CAPTCHA detection, and hidden content exposure.
$ browsy fetch https://github.com/login
page_type: Login
suggested_actions:
Login { username: 19, password: 21, submit: 34 }
[19:input "Username or email address" @top-C]
[21:input "Password" @mid-C]
[34:button "Sign in" @mid-C]
// 203ms. No Chromium. No LLM needed.
Why browsy?
Every AI agent that touches the web today launches a 300MB Chromium instance, waits 5 seconds for it to render, then asks an LLM "what am I looking at?"
browsy skips all of that:
| | Chromium-based tools | browsy |
|---|---|---|
| Speed | 5-30 seconds per page | ~200ms |
| Dependencies | 282MB+ Chromium | 6MB binary |
| Page intelligence | None (LLM must figure it out) | 12 page types, 13 action recipes |
| Hidden content | Not accessible | Exposed with hidden: true |
| CAPTCHA detection | None | reCAPTCHA, hCaptcha, Turnstile, Cloudflare, image grid |
| Output | Raw accessibility tree | Structured Spatial DOM |
| Deterministic | No (LLM variance) | Yes (same HTML = same output) |
When to use browsy
browsy handles server-rendered HTML -- the 90% of the web that doesn't need a browser to understand. Login forms, search pages, news sites, government portals, documentation, e-commerce product pages.
For JS-rendered SPAs (React, Angular, Vue apps that render client-side), you still need a real browser. browsy is the fast path, not a full browser replacement.
Key features
- Page intelligence -- 12 page types detected automatically, 13 action recipes with element IDs
- CAPTCHA detection -- identifies reCAPTCHA, hCaptcha, Cloudflare Turnstile, image grids with sitekey extraction
- Hidden content exposure -- dropdowns, modals, accordions included with hidden: true
- Session API -- navigate, click, type, select, search -- with cookie persistence
- Built-in web search -- DuckDuckGo and Google, search and fetch results in one call
- Smart deduplication -- 34-42% element reduction on real sites
- Delta output -- only changes after first load
- MCP server -- use browsy from Claude Code or any MCP client
- Python bindings -- PyO3-based, full session API
- 6MB binary -- zero runtime dependencies
Quickstart
This guide covers the core browsy-core workflow: parse HTML, fetch live pages, read page intelligence, and interact with forms.
1. Install
cargo add browsy-core
This pulls in the fetch feature by default, which includes HTTP fetching via reqwest. See Installation for other installation methods.
2. Parse HTML
The simplest entry point is browsy_core::parse. Pass it an HTML string and a viewport size, and it returns a SpatialDom -- a flat list of elements with bounding boxes, roles, and states.
let html = r#"
<html>
  <body>
    <h1>Hello, world</h1>
    <a href="/about">About</a>
    <input type="text" placeholder="Search..." />
  </body>
</html>
"#;

let dom = browsy_core::parse(html, 1920.0, 1080.0);

// Iterate over elements
for el in &dom.els {
    println!("[{}:{} {:?}]", el.id, el.tag, el.text);
}
The viewport dimensions (1920x1080 here) affect layout computation -- elements get positioned and sized as they would in a real browser at that resolution.
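To see the effect, parse the same document at two viewport sizes and compare what lands above the fold. This is only a sketch -- page.html is a stand-in for any saved HTML, and the exact counts depend on the page and its CSS:

let html = std::fs::read_to_string("page.html")?;

// Desktop-sized viewport
let desktop = browsy_core::parse(&html, 1920.0, 1080.0);
// Narrow, phone-sized viewport
let mobile = browsy_core::parse(&html, 390.0, 844.0);

// Same elements, but fold classification shifts with the viewport height
println!("above fold at 1920x1080: {}", desktop.above_fold().len());
println!("above fold at 390x844:   {}", mobile.above_fold().len());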
SpatialDom serializes to JSON via serde:
let json = serde_json::to_string_pretty(&dom).unwrap();
println!("{}", json);
3. Fetch and parse a live page
The Session API handles HTTP fetching, cookie persistence, and page interaction. It requires the fetch feature (enabled by default).
use browsy_core::fetch::Session;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut session = Session::new()?;
    let dom = session.goto("https://example.com")?;

    println!("Title: {}", dom.title);
    println!("Elements: {}", dom.els.len());

    // Elements are accessible by ID
    if let Some(el) = dom.get(1) {
        println!("First element: {} {:?}", el.tag, el.text);
    }

    // Filter to visible-only or above-the-fold
    let visible = dom.visible();
    let above_fold = dom.above_fold();

    Ok(())
}
Sessions persist cookies across navigations. Each call to goto returns a fresh SpatialDom for the new page.
4. Read page intelligence
Every SpatialDom includes two forms of page intelligence: a detected page type and a list of suggested actions with stable element IDs.
use browsy_core::fetch::Session;
use browsy_core::output::{PageType, SuggestedAction};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut session = Session::new()?;
    let dom = session.goto("https://github.com/login")?;

    // Page type: Login, Search, Article, Form, List, Dashboard, etc.
    println!("Page type: {:?}", dom.page_type);

    // Suggested actions tell the agent exactly what to do
    for action in &dom.suggested_actions {
        match action {
            SuggestedAction::Login { username_id, password_id, submit_id, .. } => {
                println!("Login form found:");
                println!("  Username field: element {}", username_id);
                println!("  Password field: element {}", password_id);
                println!("  Submit button: element {}", submit_id);
            }
            SuggestedAction::Search { input_id, submit_id } => {
                println!("Search: input={}, submit={}", input_id, submit_id);
            }
            SuggestedAction::EnterCode { input_id, submit_id, code_length } => {
                println!("2FA code: input={}, submit={}, length={:?}", input_id, submit_id, code_length);
            }
            _ => println!("Action: {:?}", action),
        }
    }

    Ok(())
}
Page types
browsy detects the following page types automatically:
| PageType | Meaning |
|---|---|
Login | Login form with username/password fields |
TwoFactorAuth | Verification code entry (2FA, email confirmation) |
OAuthConsent | OAuth authorization prompt |
Captcha | CAPTCHA challenge page |
Search | Search page (empty query state) |
SearchResults | Search results page |
Inbox | Email or message inbox |
EmailBody | Single email or message view |
Dashboard | Dashboard or admin panel |
Form | Generic form (registration, contact, settings) |
Article | Article, blog post, documentation page |
List | List or catalog page (products, directory) |
Error | Error page (404, 500, access denied) |
Other | No specific type detected |
CAPTCHA detection
When browsy detects a CAPTCHA, it sets page_type to Captcha and populates captcha with details:
if dom.page_type == PageType::Captcha {
    if let Some(captcha) = &dom.captcha {
        println!("CAPTCHA type: {:?}", captcha.captcha_type);
        // ReCaptcha, HCaptcha, Turnstile, CloudflareChallenge, ImageGrid, TextCaptcha
        if let Some(sitekey) = &captcha.sitekey {
            println!("Site key: {}", sitekey);
        }
    }
}
Or use the session convenience methods:
if session.is_captcha() {
    println!("CAPTCHA: {:?}", session.captcha_info());
}
5. Log in to a site
browsy provides two ways to interact with login forms: manual (using element IDs) and automatic (using session.login).
Manual login
Use the element IDs from SuggestedAction::Login to type credentials and submit:
use browsy_core::fetch::Session;
use browsy_core::output::SuggestedAction;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut session = Session::new()?;
    let dom = session.goto("https://github.com/login")?;

    // Find the login action
    for action in &dom.suggested_actions {
        if let SuggestedAction::Login { username_id, password_id, submit_id, .. } = action {
            session.type_text(*username_id, "user@example.com")?;
            session.type_text(*password_id, "my-password")?;
            let result = session.click(*submit_id)?;
            println!("After login: {:?}", result.page_type);
            break;
        }
    }

    Ok(())
}
Automatic login
session.login detects the login form from suggested_actions and fills it in one call:
let mut session = Session::new()?;
session.goto("https://github.com/login")?;

let result = session.login("user@example.com", "my-password")?;
println!("After login: {:?}", result.page_type);
This fails with FetchError::ActionError if no SuggestedAction::Login is detected on the current page.
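If you would rather fall back to the manual flow than propagate the error, match on that variant. A sketch, assuming FetchError is exported alongside Session from browsy_core::fetch:

use browsy_core::fetch::{FetchError, Session};

let mut session = Session::new()?;
session.goto("https://github.com/login")?;

match session.login("user@example.com", "my-password") {
    Ok(dom) => println!("After login: {:?}", dom.page_type),
    Err(FetchError::ActionError(msg)) => {
        // No Login recipe detected -- fall back to the manual element-ID flow above
        eprintln!("no login form found: {}", msg);
    }
    Err(other) => eprintln!("fetch failed: {}", other),
}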
2FA / verification codes
If the login redirects to a 2FA page, use enter_code:
if result.page_type == PageType::TwoFactorAuth {
    let final_page = session.enter_code("123456")?;
    println!("After 2FA: {:?}", final_page.page_type);
}
6. Search the web
browsy has built-in web search via DuckDuckGo and Google. No API keys required.
Get search results
use browsy_core::fetch::{Session, SearchEngine};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut session = Session::new()?;

    // DuckDuckGo (default)
    let results = session.search("rust web frameworks")?;
    for r in &results {
        println!("{}: {}", r.title, r.url);
        println!("  {}", r.snippet);
    }

    // Google
    let results = session.search_with("rust web frameworks", SearchEngine::Google)?;

    Ok(())
}
Search and read pages
search_and_read fetches the top N results and returns each page's SpatialDom:
let pages = session.search_and_read("browsy browser engine", 3)?;

for page in &pages {
    println!("--- {} ---", page.result.title);
    if let Some(dom) = &page.dom {
        println!("  Page type: {:?}", dom.page_type);
        println!("  Elements: {}", dom.els.len());
    } else {
        println!("  (fetch failed)");
    }
}
Next steps
- Spatial DOM -- understand the output format in detail
- Page Intelligence -- all 13 action recipes explained
- Session API -- full reference for navigation, forms, and interaction
- MCP Server -- use browsy from Claude Code
Installation
browsy is available as a Rust library, a CLI binary, a Python package, and an MCP server.
Rust library
Add browsy-core to your project:
cargo add browsy-core
This enables the fetch feature by default, which includes HTTP fetching, session management, and web search via reqwest.
Without networking
To use browsy as a pure HTML-to-Spatial-DOM parser with no network dependencies:
cargo add browsy-core --no-default-features
This disables the fetch feature. You get browsy_core::parse(html, width, height) and nothing else -- no Session, no HTTP, no reqwest. Useful for embedding browsy in contexts where you handle fetching yourself.
// Available without fetch feature
let dom = browsy_core::parse(html, 1920.0, 1080.0);

// Requires fetch feature (enabled by default)
use browsy_core::fetch::Session;
let mut session = Session::new()?;
Feature flags
| Feature | Default | Description |
|---|---|---|
fetch | Yes | HTTP fetching, Session API, web search, cookie persistence |
CLI
Install the browsy CLI binary:
cargo install browsy
Usage:
# Fetch and parse a live page
browsy fetch https://example.com
# Parse local HTML from stdin
cat page.html | browsy parse
# JSON output
browsy fetch https://example.com --format json
REST API server
The CLI includes a built-in REST API + A2A server:
browsy serve
browsy serve --port 8080
browsy serve --allow-private-network
See REST API for endpoint documentation and A2A Protocol for agent-to-agent integration.
Python
browsy has PyO3 bindings published as the browsy-ai package:
pip install browsy-ai
import browsy
# Parse HTML directly
dom = browsy.parse(html, 1920.0, 1080.0)
print(dom.page_type)
print(dom.suggested_actions)
# Session-based browsing
session = browsy.Session()
dom = session.goto("https://example.com")
session.type_text(19, "hello")
session.click(34)
The Python bindings expose the same Session API as the Rust library, including login, search, enter_code, and all form interaction methods.
Framework integrations
Install browsy with framework-specific extras:
pip install browsy-ai[langchain] # LangChain tools
pip install browsy-ai[crewai] # CrewAI tool
pip install browsy-ai[openai] # OpenAI function calling
pip install browsy-ai[autogen] # AutoGen integration
pip install browsy-ai[smolagents] # HuggingFace smolagents
pip install browsy-ai[all] # All integrations
See Framework Integrations for usage guides.
Requirements
- Python 3.9+
- No native dependencies (the compiled extension includes everything)
JavaScript / TypeScript
The browsy-ai npm package provides a TypeScript SDK with integrations for LangChain.js, OpenAI, and Vercel AI SDK:
npm install browsy-ai
import { BrowsyClient, BrowsyContext } from "browsy-ai"; // Core SDK
import { getTools } from "browsy-ai/langchain"; // LangChain.js
import { getToolDefinitions, handleToolCall } from "browsy-ai/openai"; // OpenAI
import { browsyTools } from "browsy-ai/vercel-ai"; // Vercel AI SDK
Framework dependencies are optional peer dependencies -- install only what you need:
npm install browsy-ai @langchain/core # LangChain.js
npm install browsy-ai openai # OpenAI
npm install browsy-ai ai # Vercel AI SDK
Requires Node.js 22+ and the browsy CLI (cargo install browsy) for the REST server.
See JavaScript / TypeScript for the full SDK guide.
MCP Server
browsy ships an MCP server that exposes the full Session API as tools. This works with Claude Code, Claude Desktop, and any MCP-compatible client.
Install
cargo install browsy-mcp
Configure for Claude Code
Add to your Claude Code MCP configuration (.claude/mcp.json or equivalent):
{
"mcpServers": {
"browsy": {
"command": "browsy-mcp",
"args": []
}
}
}
Configure for Claude Desktop
Add to your Claude Desktop config (claude_desktop_config.json):
{
"mcpServers": {
"browsy": {
"command": "browsy-mcp",
"args": []
}
}
}
Available MCP tools
The MCP server exposes these tools:
| Tool | Description |
|---|---|
browse | Navigate to a URL, returns Spatial DOM |
click | Click an element by ID |
type_text | Type into an input field by ID |
check / uncheck | Toggle checkboxes and radio buttons |
select | Select a dropdown option |
get_page | Get the current page DOM with form state |
back | Go back in navigation history |
search | Web search via DuckDuckGo or Google |
find | Find elements by text or ARIA role |
login | Fill and submit a login form |
enter_code | Fill and submit a verification code |
tables | Extract structured table data |
page_info | Get page metadata, type, and suggested actions |
Building from source
git clone https://github.com/GhostPeony/browsy
cd browsy
# Build everything (library + CLI + MCP server)
cargo build --release
# Run tests
cargo test -p browsy-core
# Install CLI and MCP server from local source
cargo install --path crates/cli
cargo install --path crates/mcp
Spatial DOM
The Spatial DOM is the primary output of browsy. It converts an HTML document into a flat list of SpatialElement structs -- each representing an interactive element, text block, or structural landmark -- with bounding boxes, ARIA roles, and form state. No tree traversal, no pixel rendering.
use browsy_core::parse;

let dom = parse(html, 1920.0, 1080.0);
// dom.els: Vec<SpatialElement> -- flat, ordered, ready for agent consumption
SpatialElement fields
Every element in the Spatial DOM is a SpatialElement with these fields:
| Field | Type | Description |
|---|---|---|
id | u32 | Stable numeric ID, assigned sequentially. Used for all interactions (click, type_text, etc.) |
tag | String | HTML tag name (a, button, input, p, h1, etc.) |
role | Option<String> | ARIA role -- explicit from role attr or implicit from tag. link, button, textbox, heading, navigation, etc. |
text | Option<String> | Visible text content. For images, this is the alt text |
href | Option<String> | Link destination (resolved to absolute URL when parsed via Session) |
b | [i32; 4] | Bounding box: [x, y, width, height] in pixels relative to the document |
hidden | Option<bool> | Some(true) if the element is hidden. Absent (None) when visible |
name | Option<String> | HTML name attribute (form fields only: input, textarea, select) |
val | Option<String> | Current value from the HTML value attribute |
ph | Option<String> | Placeholder text |
label | Option<String> | Associated <label> text (resolved via <label for="id">) |
input_type | Option<String> | Input type (text, password, email, checkbox, radio, search, etc.). Serializes as type in JSON |
checked | Option<bool> | Whether a checkbox/radio is checked |
disabled | Option<bool> | Whether the element is disabled |
expanded | Option<bool> | ARIA expanded state (dropdowns, accordions) |
selected | Option<bool> | ARIA selected state (tabs, options) |
required | Option<bool> | Whether the field is required |
alert_type | Option<String> | Alert classification: "alert", "status", "error", "success", "warning" |
All Option fields use skip_serializing_if -- absent fields are omitted from JSON output to keep payloads compact.
Hidden content exposure
Elements with display: none, visibility: hidden, aria-hidden="true", or the hidden attribute are not discarded. They appear in the Spatial DOM with hidden: Some(true).
This is a deliberate design decision. Without JavaScript execution, browsy cannot toggle visibility. By including hidden elements, agents can see:
- Dropdown menus -- <ul> inside a nav that only appears on hover
- Modal dialogs -- login forms, cookie consent, popups
- Accordion panels -- FAQ content behind collapsed sections
- Tab content -- inactive tab panels
- Off-canvas navigation -- mobile menus hidden at desktop widths
// All elements including hidden
let all = &dom.els;

// Only visible elements
let visible = dom.visible();

// Hidden elements are distinguishable
for el in &dom.els {
    if el.hidden == Some(true) {
        // This element is hidden in the rendered page
    }
}
Hidden elements are exempt from zero-size filtering -- they are preserved regardless of bounding box dimensions. Visible elements with zero width and height are skipped as layout artifacts.
Deduplication
HTML commonly wraps interactive elements in container tags that carry no additional meaning:
<li><a href="/about">About</a></li>
<td><span><button>Submit</button></span></td>
browsy collapses these wrappers. When a wrapper tag (li, td, th, span, p, dt, dd) contains only interactive children and no meaningful text of its own, the wrapper is skipped. Only the inner interactive element is emitted.
This produces a 34-42% element reduction on real sites without losing any semantic content.
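A quick way to see this is to parse a wrapper-heavy fragment and print the compact output. Exact IDs and boxes will vary; the point is that no standalone li, td, or span entries appear:

use browsy_core::output::to_compact_string;

let html = r#"
<ul>
  <li><a href="/about">About</a></li>
  <li><a href="/pricing">Pricing</a></li>
</ul>
<table><tr><td><span><button>Submit</button></span></td></tr></table>
"#;

let dom = browsy_core::parse(html, 1920.0, 1080.0);
// Expect the two links and the button, with the wrappers collapsed away
println!("{}", to_compact_string(&dom));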
Landmark markers
HTML5 landmark elements (nav, header, footer, main, aside, section, form) and elements with explicit landmark ARIA roles (navigation, banner, contentinfo, complementary, region, main, form) emit as role-only structural markers.
A landmark element appears in the output with its role but no recursive text. Its children carry the actual content as separate elements. This prevents the entire navigation bar's text from being duplicated into a single massive nav element.
{"id": 1, "tag": "nav", "role": "navigation", "b": [0, 0, 1920, 60]},
{"id": 2, "tag": "a", "role": "link", "text": "Home", "href": "/", "b": [20, 10, 80, 40]},
{"id": 3, "tag": "a", "role": "link", "text": "About", "href": "/about", "b": [120, 10, 80, 40]}
Element lookup
The SpatialDom maintains an internal HashMap<u32, usize> index for O(1) element lookup by ID:
// O(1) -- does not scan the element list
let element = dom.get(42);
The index is built automatically during parsing and can be rebuilt after mutation:
dom.els.push(new_element);
dom.rebuild_index();
Filtering
// Only visible (non-hidden) elements
let visible: Vec<&SpatialElement> = dom.visible();

// Elements whose top edge is within the viewport
let above: Vec<&SpatialElement> = dom.above_fold();

// Elements whose top edge is below the viewport
let below: Vec<&SpatialElement> = dom.below_fold();

// New SpatialDom containing only above-fold elements (for token-limited contexts)
let trimmed: SpatialDom = dom.filter_above_fold();
The fold line is determined by dom.vp[1] (viewport height, default 1080px).
Tables
dom.tables() extracts structured table data by grouping th and td elements by their Y coordinates:
let tables: Vec<TableData> = dom.tables();

for table in &tables {
    println!("Headers: {:?}", table.headers); // Vec<String>
    for row in &table.rows {
        println!("Row: {:?}", row); // Vec<String>
    }
}
Elements within 5px of the same Y coordinate are grouped into the same row. Cells are sorted left-to-right by X position within each row.
Alerts
dom.alerts() returns elements with a detected alert_type:
let alerts: Vec<&SpatialElement> = dom.alerts();

for alert in &alerts {
    println!("{}: {}",
        alert.alert_type.as_deref().unwrap(),
        alert.text.as_deref().unwrap_or(""));
    // "error: Invalid password"
    // "success: Account created"
}
Alert types are detected from ARIA role attributes (alert, status) and CSS class patterns (alert-error, msg-danger, flash-success, etc.). Only compound class patterns are matched -- a bare error class is too ambiguous.
Verification codes
dom.find_codes() extracts 4-8 digit verification codes from page text:
let codes: Vec<String> = dom.find_codes();
// ["847291"] -- extracted from "Your verification code is 847291"
Codes are found near keyword context (verification code, security code, your code, otp, passcode, one-time). Year-like 4-digit numbers (1900-2099) are filtered out. Proximity matching also checks nearby elements within 100px Y distance for keyword context.
Text fallback chain
For interactive elements (links, buttons) that contain no direct text -- only images or icons -- browsy walks a fallback chain to find meaningful text:
1. aria-label attribute
2. title attribute
3. Child <img> alt text
4. Child <svg><title> text
This ensures that icon-only buttons and image links always have text for the agent to read.
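A small sketch: an icon-only button and an image link still come back with readable text. The expected values in the comments follow from the chain above:

let html = r#"
<button aria-label="Close dialog">
  <svg viewBox="0 0 24 24"><title>Close</title><path d="M6 6l12 12"/></svg>
</button>
<a href="/"><img src="/logo.svg" alt="Acme home"></a>
"#;

let dom = browsy_core::parse(html, 1920.0, 1080.0);
for el in &dom.els {
    // Expect "Close dialog" for the button and "Acme home" for the link
    println!("{}: {:?}", el.tag, el.text);
}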
Page Intelligence
Page intelligence is browsy's deterministic classification layer. Given a Spatial DOM, browsy computes a page type and a set of suggested actions (action recipes) -- each with concrete element IDs that agents can use directly. No LLM inference, no probabilistic guessing.
let dom = session.goto("https://github.com/login")?;

assert_eq!(dom.page_type, PageType::Login);
// dom.suggested_actions[0] == Login { username_id: 19, password_id: 21, submit_id: 34 }
Page types
browsy classifies pages into one of 14 types, detected via priority-ordered heuristics applied to the Spatial DOM. The first matching rule wins.
| Page Type | Detection Signal |
|---|---|
Error | Alert elements with alert_type == "error", or title contains 404, 500, 403, not found, error |
Captcha | CAPTCHA service detected in HTML (reCAPTCHA, hCaptcha, Turnstile), or title/heading contains captcha, verify you're human, just a moment |
Login | Visible password input field present |
TwoFactorAuth | Title/heading contains verification keywords (verification, 2fa, otp, one-time, passcode) AND a visible text/number/tel input exists |
OAuthConsent | Title/heading contains authorize, allow access, grant permission, oauth, consent |
Inbox | Title contains inbox, mail, messages AND page has 10+ links |
EmailBody | 3+ email markers present in element text (from:, to:, subject:, date:) |
Dashboard | Title/heading contains dashboard, welcome back, overview AND both nav and main landmarks exist |
Article | 3+ headings AND 2+ long paragraphs (>100 chars). When link count >= 20, requires 10+ long paragraphs. Heading-heavy pages (15+ headings with low paragraph ratio) are excluded |
SearchResults | Search input present AND 8+ links AND (title/heading contains search results/results for OR URL contains search query params like ?q=) |
List | 10+ visible links |
Search | Visible search input (type search, role searchbox, name q, or placeholder/name containing search) |
Form | 2+ visible data-entry inputs (excludes checkbox, radio, hidden, submit, button, image) |
Other | No other type matched |
Detection order matters. A page with a password field and a search bar is classified as Login, not Search, because Login is checked first.
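A minimal illustration -- the markup is synthetic and exactly which recipes get emitted depends on the detectors, but the password field should outrank the search box:

let html = r#"
<form action="/search"><input type="search" name="q" placeholder="Search docs"></form>
<form action="/login">
  <input type="email" name="email">
  <input type="password" name="password">
  <button>Sign in</button>
</form>
"#;

let dom = browsy_core::parse(html, 1920.0, 1080.0);
// Login is checked before Search, so the password field decides the page type;
// the Search recipe can still show up in suggested_actions alongside Login
println!("{:?}", dom.page_type);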
Action recipes
Alongside page type, browsy detects suggested actions -- structured recipes telling the agent exactly what to do and which element IDs to use. Each action maps directly to Session API calls.
Login
Detected when a visible password input exists near a text/email input.
{
"action": "Login",
"username_id": 19,
"password_id": 21,
"submit_id": 34,
"remember_me_id": 36
}
Agent usage: session.type_text(19, "user@example.com"), session.type_text(21, "pass"), session.click(34). Or simply: session.login("user@example.com", "pass").
Register
Detected when a password field is accompanied by a confirm-password field or registration keywords in the title/heading. Login takes priority when both login and registration sections are present on the same page.
{
"action": "Register",
"email_id": 12,
"username_id": 14,
"password_id": 16,
"confirm_password_id": 18,
"name_id": 10,
"submit_id": 22
}
EnterCode
Detected on verification/2FA pages with code-related keywords in the title or heading.
{
"action": "EnterCode",
"input_id": 8,
"submit_id": 12,
"code_length": 6
}
code_length is set when the page uses separate narrow digit inputs (4-8 inputs each <60px wide).
Search
Detected when an input has type search, role searchbox, name q, or a name/placeholder containing search.
{
"action": "Search",
"input_id": 5,
"submit_id": 7
}
Consent
Detected on OAuth/authorization pages with approve/deny buttons.
{
"action": "Consent",
"approve_ids": [15, 18],
"deny_ids": [20]
}
CookieConsent
Detected when a substantial text block mentions cookies/GDPR and accept/reject buttons are present.
{
"action": "CookieConsent",
"accept_id": 42,
"reject_id": 44
}
Contact
Detected on pages with contact-related keywords and a visible textarea for the message body.
{
"action": "Contact",
"name_id": 5,
"email_id": 7,
"message_id": 9,
"submit_id": 11
}
FillForm
Generic form detection. Emitted when visible form fields exist and no more specific action (Login, Register, Contact) matched. Includes labeled field metadata.
{
"action": "FillForm",
"fields": [
{"id": 10, "label": "First Name", "name": "first_name", "type": "text"},
{"id": 12, "label": "Email", "name": "email", "type": "email"}
],
"submit_id": 20
}
SelectFromList
Detected when 5+ links are arranged in distinct vertical rows (list-like layout).
{
"action": "SelectFromList",
"items": [3, 8, 13, 18, 23]
}
Paginate
Detected when next/previous navigation links are found (text matching next, previous, >, >>, etc.).
{
"action": "Paginate",
"next_id": 95,
"prev_id": 91
}
Download
Detected when links point to downloadable file types.
{
"action": "Download",
"items": [{"id": 30, "text": "Report Q4 2024", "href": "/files/report.pdf"}]
}
CaptchaChallenge
Detected when a CAPTCHA service is found in the HTML structure.
{
"action": "CaptchaChallenge",
"captcha_type": "ReCaptcha",
"sitekey": "6LcXxxAAAABBBCCC...",
"submit_id": 50
}
CAPTCHA detection
browsy identifies CAPTCHA services by scanning the HTML structure for known markers:
| Type | Detection |
|---|---|
ReCaptcha | g-recaptcha class, data-sitekey attr, reCAPTCHA script URLs |
HCaptcha | h-captcha class, hCaptcha script URLs |
Turnstile | cf-turnstile class, Turnstile script URLs |
CloudflareChallenge | Cloudflare "Just a moment..." challenge page pattern |
ImageGrid | Custom image-grid CAPTCHA (select matching images) |
TextCaptcha | Text-based CAPTCHA (type characters from an image) |
Unknown | CAPTCHA detected but service not identified |
CAPTCHA info is available at dom.captcha:
if let Some(captcha) = &dom.captcha {
    println!("Type: {:?}", captcha.captcha_type); // CaptchaType::ReCaptcha
    println!("Sitekey: {:?}", captcha.sitekey);   // Some("6Lc...")
}
How detection works
All detection is deterministic, heuristic-based, priority-ordered. No machine learning models, no token costs. The same HTML always produces the same page type and action set.
The detection pipeline:
- Parse HTML into the Spatial DOM (element list with bounding boxes and roles)
- Scan for CAPTCHA markers in the layout tree
- Run detect_page_type -- walks through page type checks in priority order, returns the first match
- Run detect_suggested_actions -- runs all action detectors independently, collecting all that match
Multiple actions can coexist. A login page might have both Login and CookieConsent actions. A search results page might have Search, SelectFromList, and Paginate.
Example flow
use browsy_core::fetch::Session;
use browsy_core::output::PageType;

let mut session = Session::new()?;
let dom = session.goto("https://example.com/login")?;

match dom.page_type {
    PageType::Login => {
        // Use the Login action recipe directly
        session.login("user@example.com", "hunter2")?;
    }
    PageType::TwoFactorAuth => {
        session.enter_code("847291")?;
    }
    PageType::Captcha => {
        let info = session.captcha_info();
        // Report to the caller -- browsy cannot solve CAPTCHAs
    }
    _ => {
        // Read the page content, follow links, etc.
    }
}
Session API
The Session API provides stateful web browsing with cookie persistence, form interaction, navigation history, and built-in web search. It is the primary interface for agents interacting with the web through browsy.
use browsy_core::fetch::Session;

let mut session = Session::new()?;
let dom = session.goto("https://example.com")?;
Requires the fetch feature (enabled by default).
Creating a session
Session::new()
Creates a session with default configuration (1920x1080 viewport, 30s timeout, CSS fetching enabled).
let mut session = Session::new()?;
Session::with_config(config)
Creates a session with custom configuration.
use browsy_core::fetch::{Session, SessionConfig};

let config = SessionConfig {
    viewport_width: 1366.0,
    viewport_height: 768.0,
    timeout_secs: 15,
    fetch_css: false, // Skip external CSS for speed
    ..Default::default()
};

let mut session = Session::with_config(config)?;
SessionConfig fields
| Field | Type | Default | Description |
|---|---|---|---|
viewport_width | f32 | 1920.0 | Viewport width in pixels. Affects layout computation and fold detection |
viewport_height | f32 | 1080.0 | Viewport height in pixels. Defines the fold line |
user_agent | String | Chrome-like UA | HTTP User-Agent header |
timeout_secs | u64 | 30 | HTTP request timeout |
fetch_css | bool | true | Whether to fetch external CSS stylesheets. Disabling speeds up parsing but reduces layout accuracy |
blocked_patterns | Vec<String> | Analytics/tracking URLs | URL patterns to block (analytics, ads, tracking pixels) |
max_response_bytes | usize | 5MB | Maximum HTML response size |
max_css_bytes_total | usize | 2MB | Maximum total CSS bytes across all stylesheets |
max_css_bytes_per_file | usize | 512KB | Maximum size per individual CSS file |
max_redirects | usize | 10 | Maximum HTTP redirect chain length |
allow_private_network | bool | false | Whether to allow requests to private/internal IPs |
allow_non_http | bool | false | Whether to allow non-HTTP(S) schemes |
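For example, hitting a service on localhost is blocked by default and requires opting in to private-network requests. A sketch using the fields above (the URL and timeout are placeholders):

use browsy_core::fetch::{Session, SessionConfig};

let config = SessionConfig {
    allow_private_network: true, // e.g. a local dev server in integration tests
    timeout_secs: 5,
    ..Default::default()
};

let mut session = Session::with_config(config)?;
let dom = session.goto("http://localhost:8080/login")?;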
Navigation
goto(url) -> Result<SpatialDom, FetchError>
Navigate to a URL. Fetches the page, parses HTML, optionally fetches external CSS, computes layout, and returns the Spatial DOM. Cookies are persisted automatically.
let dom = session.goto("https://news.ycombinator.com")?;
println!("Title: {}", dom.title);
println!("Elements: {}", dom.els.len());
back() -> Result<SpatialDom, FetchError>
Navigate to the previous page in history. Returns an error if there is no history.
session.goto("https://example.com")?;
session.goto("https://example.com/about")?;

let dom = session.back()?; // Back to example.com
url() -> Option<&str>
Returns the current page URL.
if let Some(url) = session.url() {
    println!("Currently at: {}", url);
}
Interaction
click(id) -> Result<SpatialDom, FetchError>
Click an element by ID. Behavior depends on the element type:
- Links (<a>) -- navigates to the href URL. Skips javascript:, mailto:, tel:, and anchor-only (#) links.
- Buttons / submit inputs -- submits the parent form with all current form values.
- Elements with JS behaviors -- simulated. onclick handlers with window.location trigger navigation. Toggle/show/hide behaviors modify the DOM.
let dom = session.goto("https://news.ycombinator.com")?;

// Click the first link
let dom = session.click(3)?;
type_text(id, text) -> Result<(), FetchError>
Type text into an input or textarea. The value is stored in the session and overlaid onto the DOM. When a form is submitted via click, these values are included in the form data.
session.type_text(19, "user@example.com")?;
session.type_text(21, "hunter2")?;
Returns an error if the element is not an input or textarea.
check(id) -> Result<(), FetchError>
Check a checkbox or radio button.
session.check(36)?; // Check "Remember me"
uncheck(id) -> Result<(), FetchError>
Uncheck a checkbox or radio button.
session.uncheck(36)?;
toggle(id) -> Result<(), FetchError>
Toggle a checkbox or radio button based on its current effective state (considering session overrides and HTML defaults).
session.toggle(36)?; // If checked, unchecks. If unchecked, checks.
select(id, value) -> Result<(), FetchError>
Select an option in a <select> element by value.
session.select(15, "california")?;
Reading page state
dom() -> Option<SpatialDom>
Returns the current Spatial DOM with form state overlaid. Typed values, checked/unchecked states from type_text, check, and uncheck are reflected in the returned DOM.
session.type_text(19, "hello")?;

let dom = session.dom().unwrap();
let el = dom.get(19).unwrap();
assert_eq!(el.val.as_deref(), Some("hello"));
dom_ref() -> Option<&SpatialDom>
Returns a reference to the raw Spatial DOM without form state overlay. Reflects the page as parsed, ignoring any type_text/check/uncheck calls.
let raw = session.dom_ref().unwrap();
delta() -> Option<DeltaDom>
Returns the diff between the current and previous page. Only available after at least two navigations.
session.goto("https://example.com")?;
session.goto("https://example.com/about")?;

if let Some(delta) = session.delta() {
    println!("Added/changed: {}", delta.changed.len());
    println!("Removed IDs: {:?}", delta.removed);
}
element(id) -> Option<&SpatialElement>
O(1) element lookup by ID.
if let Some(el) = session.element(42) {
    println!("{}: {}", el.tag, el.text.as_deref().unwrap_or(""));
}
Finding elements
find_by_text(text) -> Vec<&SpatialElement>
Exact substring match on element text (case-sensitive).
let results = session.find_by_text("Sign in");
find_by_text_fuzzy(text) -> Vec<&SpatialElement>
Case-insensitive substring match on element text.
let results = session.find_by_text_fuzzy("sign in");
// Matches "Sign In", "SIGN IN", "Please sign in", etc.
find_by_role(role) -> Vec<&SpatialElement>
Find all elements with a specific ARIA role.
let headings = session.find_by_role("heading");
let links = session.find_by_role("link");
let buttons = session.find_by_role("button");
find_input_by_purpose(purpose) -> Option<&SpatialElement>
Find an input element by its semantic purpose. Matches on input type, name, label, and placeholder.
use browsy_core::fetch::InputPurpose;

let password = session.find_input_by_purpose(InputPurpose::Password);
let email = session.find_input_by_purpose(InputPurpose::Email);
let username = session.find_input_by_purpose(InputPurpose::Username);
let code = session.find_input_by_purpose(InputPurpose::VerificationCode);
let search = session.find_input_by_purpose(InputPurpose::Search);
let phone = session.find_input_by_purpose(InputPurpose::Phone);
| Purpose | Matching logic |
|---|---|
Password | input[type="password"] |
Email | input[type="email"] or name/label contains email |
Username | Text/email input with name/label containing user or login |
VerificationCode | Text/number/tel input with name/label/placeholder containing code, otp, or verify |
Search | input[type="search"], role searchbox, or name containing search |
Phone | input[type="tel"] or name/label containing phone |
find_nearest_button(input_id) -> Option<&SpatialElement>
Find the nearest submit button to a given input element. Prefers buttons below the input, scored by Manhattan distance with Y weighted 2x.
if let Some(btn) = session.find_nearest_button(19) {
    println!("Submit button: {} (id: {})", btn.text.as_deref().unwrap_or(""), btn.id);
}
Compound actions
These methods combine multiple interactions into a single call, using the page intelligence action recipes.
login(username, password) -> Result<SpatialDom, FetchError>
Detects the login form from suggested_actions, fills in credentials, and submits. Returns the resulting page.
let dom = session.goto("https://github.com/login")?;
let result = session.login("user@example.com", "hunter2")?;
Returns an error if no Login action recipe was detected on the current page.
enter_code(code) -> Result<SpatialDom, FetchError>
Fills in a verification code and submits the form, using the EnterCode action recipe.
let result = session.enter_code("847291")?;
find_verification_code() -> Option<String>
Extracts a verification code from the current page text (4-8 digit sequences near code-related keywords).
// On a page that says "Your verification code is 847291"
if let Some(code) = session.find_verification_code() {
    session.enter_code(&code)?;
}
CAPTCHA detection
is_captcha() -> bool
Returns true if the current page is classified as a CAPTCHA challenge.
if session.is_captcha() {
    println!("CAPTCHA detected -- cannot proceed automatically");
}
captcha_info() -> Option<&CaptchaInfo>
Returns CAPTCHA details if detected: captcha_type (ReCaptcha, HCaptcha, Turnstile, CloudflareChallenge, ImageGrid, TextCaptcha, Unknown) and optional sitekey.
if let Some(info) = session.captcha_info() {
    match info.captcha_type {
        CaptchaType::ReCaptcha => {
            println!("reCAPTCHA sitekey: {:?}", info.sitekey);
        }
        CaptchaType::CloudflareChallenge => {
            println!("Cloudflare challenge -- wait and retry");
        }
        _ => {}
    }
}
Web search
search(query) -> Result<Vec<SearchResult>, FetchError>
Search the web using DuckDuckGo. Returns structured results with title, URL, and snippet.
let results = session.search("rust programming language")?;
for r in &results {
    println!("{}: {} -- {}", r.title, r.url, r.snippet);
}
search_with(query, engine) -> Result<Vec<SearchResult>, FetchError>
Search with a specific engine.
use browsy_core::fetch::SearchEngine;

let results = session.search_with("browsy", SearchEngine::Google)?;
Available engines: SearchEngine::DuckDuckGo (default, most reliable) and SearchEngine::Google (may return CAPTCHAs for automated requests).
search_and_read(query, n) -> Result<Vec<SearchPage>, FetchError>
Search and fetch the top N results, returning each page's Spatial DOM alongside the search result metadata.
let pages = session.search_and_read("rust web scraping", 3)?;
for page in &pages {
    println!("{}:", page.result.title);
    if let Some(ref dom) = page.dom {
        println!("  {} elements, page_type: {:?}", dom.els.len(), dom.page_type);
    }
}
Behaviors
behaviors() -> Vec<JsBehavior>
Detects JavaScript behaviors from HTML attributes (onclick, data-toggle, data-bs-toggle, etc.). Returns trigger element IDs and inferred actions.
let behaviors = session.behaviors();
for b in &behaviors {
    println!("Element {} triggers {:?}", b.trigger_id, b.action);
}
Error handling
All fallible methods return Result<_, FetchError>. Error variants:
| Variant | Cause |
|---|---|
FetchError::InvalidUrl(msg) | URL could not be parsed |
FetchError::BlockedUrl(url) | URL matched a blocked pattern or is a private network address |
FetchError::Network(msg) | HTTP request failed (timeout, DNS, connection refused) |
FetchError::HttpError(status) | Non-2xx HTTP status code |
FetchError::ResponseTooLarge(size, max) | Response exceeded max_response_bytes |
FetchError::ActionError(msg) | Invalid interaction (element not found, wrong element type, no page loaded) |
Output Formats
browsy supports three output formats for the Spatial DOM: JSON (full fidelity), compact (minimal tokens), and delta (changes only). The choice depends on your token budget and whether you need machine-readable structure or LLM-friendly brevity.
JSON format
The full SpatialDom serialized as JSON. Every field, every element, complete fidelity.
let json = serde_json::to_string_pretty(&dom)?;
{
"url": "https://example.com",
"title": "Example",
"vp": [1920.0, 1080.0],
"scroll": [0.0, 0.0],
"page_type": "Login",
"suggested_actions": [
{
"action": "Login",
"username_id": 19,
"password_id": 21,
"submit_id": 34
}
],
"els": [
{
"id": 1,
"tag": "nav",
"role": "navigation",
"b": [0, 0, 1920, 60]
},
{
"id": 19,
"tag": "input",
"role": "textbox",
"ph": "Username or email address",
"type": "text",
"name": "login",
"label": "Username or email address",
"b": [480, 320, 960, 40]
},
{
"id": 21,
"tag": "input",
"role": "textbox",
"ph": "Password",
"type": "password",
"name": "password",
"label": "Password",
"b": [480, 380, 960, 40]
},
{
"id": 34,
"tag": "button",
"role": "button",
"text": "Sign in",
"b": [480, 440, 960, 44]
}
]
}
Optional fields (text, href, ph, val, name, label, input_type, hidden, checked, disabled, expanded, selected, required, alert_type) are omitted when absent, keeping the JSON compact. The page_type field is omitted when it is Other. The captcha field is omitted when no CAPTCHA is detected.
Use JSON when you need programmatic access to the full DOM structure, or when feeding the output to code rather than an LLM.
Compact format
A one-line-per-element text format designed for minimal token usage. This is the default output format in the MCP server and CLI.
use browsy_core::output::to_compact_string;

let compact = to_compact_string(&dom);
Each element is rendered as a bracketed line:
[id:tag "text" ->href]
Full example output:
[1:nav]
[5:h1 "Welcome"]
[19:input [login] "Username or email address" wide]
[21:input:password [password] "Password" wide]
[!25:a "Forgot password?" ->/reset]
[34:button "Sign in" wide]
[40:a "Create an account" ->/signup @bot]
Compact format rules
Basic structure: [id:tag ...] where id is the numeric element ID and tag is the HTML tag.
Input types: Non-text input types are appended after the tag: [21:input:password ...], [30:input:checkbox ...], [35:input:email ...]. Plain text inputs omit the type suffix.
Text content: Quoted strings show the element's text or placeholder: "Sign in", "Enter your email".
Links: Destinations shown with ->: [12:a "About" ->/about].
Form field names: Shown in square brackets: [login], [password], [email].
Checked state: [v] indicates a checked checkbox or radio button.
Required state: [*] indicates a required field.
Current value: [=value] shows the current value of a form field.
Hidden elements: Prefixed with ! to distinguish from visible elements: [!25:a "Forgot password?"].
Size hints: Form elements (input, button, textarea, select) include a width classification relative to viewport:
| Hint | Meaning |
|---|---|
narrow | Width < 15% of viewport |
wide | Width > 50% of viewport |
full | Width > 90% of viewport |
No hint is shown for elements between 15-50% of viewport width.
Position disambiguation: When multiple elements share the same (tag, text) tuple, a position tag is appended to disambiguate: @top-L, @top, @top-R, @mid-L, @mid, @mid-R, @bot-L, @bot, @bot-R, or @below (below the fold). Position tags are only added when needed -- unique elements have no position suffix.
The viewport is divided into a 3x3 grid for classification:
+--------+--------+--------+
| top-L | top | top-R |
+--------+--------+--------+
| mid-L | mid | mid-R |
+--------+--------+--------+
| bot-L | bot | bot-R |
+--------+--------+--------+
Compact format header
When served through the MCP server or CLI, compact output includes a metadata header:
title: GitHub Login
url: https://github.com/login
els: 47
---
[1:nav]
[5:h1 "Sign in to GitHub"]
...
Delta format
After the first page load, subsequent navigations can use delta output -- only the elements that changed. This dramatically reduces token usage for multi-step workflows.
use browsy_core::output::{diff, delta_to_compact_string};

let delta = diff(&old_dom, &new_dom);
let compact_delta = delta_to_compact_string(&delta);
The DeltaDom struct contains:
pub struct DeltaDom {
    pub changed: Vec<SpatialElement>, // Added or modified elements
    pub removed: Vec<u32>,            // IDs of removed elements
    pub vp: [f32; 2],                 // Viewport for size hints
}
Compact delta format uses + for added/changed elements and - for removed IDs:
-[3,7,12,15]
[+19:input "Search" wide]
[+20:button "Go"]
[+21:h2 "Results"]
[+22:a "First result" ->https://example.com]
Matching between old and new elements is done by content similarity (tag + text + placeholder + href + input type + bounds), not by ID. IDs are assigned sequentially and may differ between page loads.
Using delta in the Session API
let mut session = Session::new()?;
session.goto("https://example.com")?;
session.goto("https://example.com/about")?;

if let Some(delta) = session.delta() {
    let output = delta_to_compact_string(&delta);
    println!("{}", output);
}
Token comparison
Compact format uses approximately 58 characters per element on average, compared to 96-157 characters for JSON and accessibility-tree-based competitors. On a typical page with 80 elements:
| Format | Approximate tokens |
|---|---|
| Compact | ~1,200 |
| JSON | ~2,500 |
| Raw accessibility tree | ~4,000+ |
Delta format reduces this further on subsequent pages -- a navigation that changes 15 elements and removes 10 produces roughly 200 tokens instead of re-sending the full 1,200.
Choosing a format
| Scenario | Format |
|---|---|
| Programmatic consumption (code, not LLM) | JSON |
| LLM agent with normal context | Compact |
| LLM agent with tight token budget | Compact + filter_above_fold() |
| Multi-step browsing workflow | Compact for first page, delta for subsequent |
| Debugging / inspection | JSON |
MCP Server (Claude Code)
browsy runs as a Model Context Protocol (MCP) server, exposing its browser engine as tools that Claude Code (or any MCP client) can call directly.
Starting the server
browsy mcp
This launches browsy as a stdio-based MCP server. It creates a single persistent Session with cookie jar, navigation history, and form state.
Claude Code configuration
Add browsy to your claude_desktop_config.json:
{
"mcpServers": {
"browsy": {
"command": "browsy",
"args": ["mcp"]
}
}
}
The server advertises itself as browsy-mcp and exposes 14 tools.
Available tools
browse
Navigate to a URL and return the page content.
| Parameter | Type | Required | Description |
|---|---|---|---|
url | string | yes | URL to navigate to |
format | string | no | "compact" (default) or "json" |
scope | string | no | "all" (default), "visible", "above_fold", or "visible_above_fold" |
Returns the full Spatial DOM. In compact format, the output begins with a header block:
title: Example Domain
url: https://example.com
els: 12
---
[1:h1 "Example Domain"]
[2:p "This domain is for use in illustrative examples..."]
[3:a "More information..." ->https://www.iana.org/domains/example]
If a CAPTCHA is detected, a warning is prepended to the output:
CAPTCHA detected (ReCaptcha) -- this page requires human verification to proceed.
click
Click an element by its ID. Links navigate to new pages, buttons submit forms.
| Parameter | Type | Required | Description |
|---|---|---|---|
id | u32 | yes | Element ID to click |
Returns the resulting page DOM. Link clicks trigger navigation (fetching the href). Button clicks submit the enclosing form with all typed values and checked states. If a CAPTCHA is detected on the resulting page, a warning is included.
type_text
Type text into an input field or textarea by element ID.
| Parameter | Type | Required | Description |
|---|---|---|---|
id | u32 | yes | Element ID of the text input |
text | string | yes | Text to type into the input |
This stores the value in session state. The value is included in form submissions and reflected in subsequent get_page calls. Only works on <input> and <textarea> elements.
check
Check a checkbox or radio button by element ID.
| Parameter | Type | Required | Description |
|---|---|---|---|
id | u32 | yes | Element ID of the checkbox or radio button |
uncheck
Uncheck a checkbox or radio button by element ID.
| Parameter | Type | Required | Description |
|---|---|---|---|
id | u32 | yes | Element ID of the checkbox or radio button |
select
Select an option in a dropdown/select element.
| Parameter | Type | Required | Description |
|---|---|---|---|
id | u32 | yes | Element ID of the select element |
value | string | yes | Value to select |
get_page
Get the current page DOM with form state overlaid. Use after type_text, check, select, or uncheck to see the updated form values without re-fetching.
| Parameter | Type | Required | Description |
|---|---|---|---|
format | string | no | "compact" (default) or "json" |
scope | string | no | "all" (default), "visible", "above_fold", or "visible_above_fold" |
search
Search the web and return structured results with title, URL, and snippet.
| Parameter | Type | Required | Description |
|---|---|---|---|
query | string | yes | Search query |
engine | string | no | "duckduckgo" (default) or "google" |
Returns a JSON array of search results, each with title, url, and snippet fields.
back
Go back to the previous page in browsing history. No parameters. Returns the previous page's DOM.
login
Fill in a detected login form and submit it. Requires a page with a Login suggested action.
| Parameter | Type | Required | Description |
|---|---|---|---|
username | string | yes | Username or email |
password | string | yes | Password |
This is a compound action: it types the username into the detected username field, types the password into the password field, and clicks the submit button. Returns the resulting page DOM.
enter_code
Enter a verification or 2FA code into the detected code input field. Requires a page with an EnterCode suggested action.
| Parameter | Type | Required | Description |
|---|---|---|---|
code | string | yes | Verification or 2FA code |
Types the code into the detected input and clicks submit. Returns the resulting page DOM.
find
Find elements on the current page by text content or ARIA role.
| Parameter | Type | Required | Description |
|---|---|---|---|
text | string | no | Find elements containing this text |
role | string | no | Find elements with this ARIA role |
At least one of text or role must be provided. Returns a JSON array of matching elements.
tables
Extract structured table data from the current page. No parameters. Returns a JSON array of tables, each with headers (string array) and rows (array of string arrays).
page_info
Get page metadata without the full element list. No parameters. Returns:
{
"title": "Sign In - Example",
"url": "https://example.com/login",
"page_type": "Login",
"suggested_actions": [
{
"action": "Login",
"username_id": 5,
"password_id": 8,
"submit_id": 12
}
],
"alerts": [],
"pagination": null
}
When a CAPTCHA is detected, the response includes a captcha field with captcha_type and optional sitekey.
Example conversation flow
A typical agent interaction with a login-protected site:
- browse https://app.example.com -- page_type is Login, suggested_actions includes Login with field IDs.
- login with username and password -- the agent calls login directly, which fills and submits the form.
- The result page might be TwoFactorAuth with an EnterCode action.
- enter_code with the 2FA code -- fills the code input and submits.
- The result page is now Dashboard -- the agent can proceed with its task.
For pages without compound actions, the lower-level tools work:
- browse the URL.
- type_text to fill form fields by ID.
- check or select for checkboxes and dropdowns.
- get_page to verify the form state looks correct.
- click the submit button to submit.
CAPTCHA warnings
When browse or click returns a page detected as Captcha, a warning line is prepended to the output:
CAPTCHA detected (HCaptcha) -- this page requires human verification to proceed.
The page_info tool also surfaces CAPTCHA details in a structured captcha field. browsy cannot solve CAPTCHAs -- it detects and classifies them so the agent can decide how to proceed (request human help, use a third-party solver, or try a different approach).
Output format
In compact mode (the default), elements are rendered as:
[id:tag "text"]
With additional annotations:
- !id:tag -- hidden element (display:none, visibility:hidden, aria-hidden, or hidden attribute)
- [name] -- HTML name attribute
- [v] -- checked checkbox/radio
- [*] -- required field
- [=value] -- current value
- ->url -- href target
- narrow / wide / full -- size hint for form elements
- @top-L / @mid / @bot-R -- position hint (only shown to disambiguate duplicate elements)
REST API
browsy includes a built-in HTTP server that exposes the full Session API as REST endpoints. This is the primary integration point for non-Rust, non-Python, and non-MCP clients.
Starting the server
browsy serve --port 3847
The server listens on http://localhost:3847 by default. See CLI Usage for all flags.
Session management
The server manages multiple concurrent browsing sessions. Each session has its own cookie jar, navigation history, and form state.
Sessions are identified by the X-Browsy-Session header:
| Scenario | Behavior |
|---|---|
No X-Browsy-Session header | Server creates a new session and returns the token in the response header |
| Valid token in header | Existing session is reused |
| Invalid or expired token | Server creates a new session and returns the new token |
| Session idle > 30 minutes | Session expires and is cleaned up |
| Server at capacity (default: 100 sessions) | Returns 503 Service Unavailable |
Every response includes the X-Browsy-Session header. Clients should capture it from the first response and include it in all subsequent requests.
# First request -- capture the session token
TOKEN=$(curl -s -D- -o /dev/null http://localhost:3847/api/browse \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com"}' | grep -i x-browsy-session | tr -d '\r' | cut -d' ' -f2)
# Subsequent requests -- reuse the session
curl http://localhost:3847/api/page-info -H "X-Browsy-Session: $TOKEN"
CORS
The server sends CORS headers on all responses:
- Access-Control-Allow-Origin: *
- Access-Control-Allow-Headers: Content-Type, X-Browsy-Session
- Access-Control-Expose-Headers: X-Browsy-Session
This allows browser-based clients to call the API directly.
Endpoint reference
| Method | Path | Description |
|---|---|---|
POST | /api/browse | Navigate to a URL |
POST | /api/click | Click an element by ID |
POST | /api/type | Type text into an input |
POST | /api/check | Check a checkbox or radio |
POST | /api/uncheck | Uncheck a checkbox or radio |
POST | /api/select | Select a dropdown option |
POST | /api/search | Web search |
POST | /api/login | Fill and submit a login form |
POST | /api/enter-code | Enter a verification code |
POST | /api/find | Find elements by text or role |
POST | /api/back | Go back in history |
GET | /api/page | Get current page DOM |
GET | /api/page-info | Get page metadata |
GET | /api/tables | Extract table data |
GET | /health | Health check |
All POST endpoints accept Content-Type: application/json.
Endpoints
POST /api/browse
Navigate to a URL and return the Spatial DOM.
Request body:
| Field | Type | Required | Description |
|---|---|---|---|
url | string | yes | URL to navigate to |
format | string | no | "compact" (default) or "json" |
scope | string | no | "all" (default), "visible", "above_fold", or "visible_above_fold" |
curl http://localhost:3847/api/browse \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com"}'
Response: The Spatial DOM in the requested format. Compact format returns plain text; JSON format returns the full structured DOM.
# JSON format with only visible elements
curl http://localhost:3847/api/browse \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com", "format": "json", "scope": "visible"}'
POST /api/click
Click an element by its ID. Links navigate to new pages; buttons submit forms.
Request body:
| Field | Type | Required | Description |
|---|---|---|---|
id | integer | yes | Element ID to click |
curl http://localhost:3847/api/click \
-H "Content-Type: application/json" \
-H "X-Browsy-Session: $TOKEN" \
-d '{"id": 3}'
Response: The resulting page DOM (after navigation or form submission).
POST /api/type
Type text into an input field or textarea.
Request body:
| Field | Type | Required | Description |
|---|---|---|---|
id | integer | yes | Element ID of the text input |
text | string | yes | Text to type |
curl http://localhost:3847/api/type \
-H "Content-Type: application/json" \
-H "X-Browsy-Session: $TOKEN" \
-d '{"id": 5, "text": "user@example.com"}'
Response: Confirmation. Use GET /api/page to see the updated form state.
POST /api/check
Check a checkbox or radio button.
Request body:
| Field | Type | Required | Description |
|---|---|---|---|
id | integer | yes | Element ID |
curl http://localhost:3847/api/check \
-H "Content-Type: application/json" \
-H "X-Browsy-Session: $TOKEN" \
-d '{"id": 10}'
POST /api/uncheck
Uncheck a checkbox or radio button.
Request body:
| Field | Type | Required | Description |
|---|---|---|---|
id | integer | yes | Element ID |
curl http://localhost:3847/api/uncheck \
-H "Content-Type: application/json" \
-H "X-Browsy-Session: $TOKEN" \
-d '{"id": 10}'
POST /api/select
Select an option in a dropdown.
Request body:
| Field | Type | Required | Description |
|---|---|---|---|
id | integer | yes | Element ID of the select element |
value | string | yes | Value to select |
curl http://localhost:3847/api/select \
-H "Content-Type: application/json" \
-H "X-Browsy-Session: $TOKEN" \
-d '{"id": 12, "value": "en-US"}'
POST /api/search
Search the web and return structured results.
Request body:
| Field | Type | Required | Description |
|---|---|---|---|
query | string | yes | Search query |
engine | string | no | "duckduckgo" (default) or "google" |
curl http://localhost:3847/api/search \
-H "Content-Type: application/json" \
-d '{"query": "rust web framework"}'
Response:
[
{
"title": "Actix Web - Rust Web Framework",
"url": "https://actix.rs",
"snippet": "A powerful, pragmatic, and fast web framework for Rust."
}
]
POST /api/login
Fill and submit a detected login form. Requires a page with a Login suggested action loaded in the session.
Request body:
| Field | Type | Required | Description |
|---|---|---|---|
username | string | yes | Username or email |
password | string | yes | Password |
# First navigate to the login page
curl http://localhost:3847/api/browse \
-H "Content-Type: application/json" \
-H "X-Browsy-Session: $TOKEN" \
-d '{"url": "https://app.example.com/login"}'
# Then submit credentials
curl http://localhost:3847/api/login \
-H "Content-Type: application/json" \
-H "X-Browsy-Session: $TOKEN" \
-d '{"username": "user@example.com", "password": "secretpassword"}'
Response: The resulting page DOM after login submission.
POST /api/enter-code
Enter a verification or 2FA code. Requires a page with an EnterCode suggested action.
Request body:
| Field | Type | Required | Description |
|---|---|---|---|
code | string | yes | Verification or 2FA code |
curl http://localhost:3847/api/enter-code \
-H "Content-Type: application/json" \
-H "X-Browsy-Session: $TOKEN" \
-d '{"code": "847291"}'
Response: The resulting page DOM after code submission.
POST /api/find
Find elements on the current page by text content or ARIA role.
Request body:
| Field | Type | Required | Description |
|---|---|---|---|
text | string | no | Find elements containing this text |
role | string | no | Find elements with this ARIA role |
At least one of text or role must be provided.
# Find by text
curl http://localhost:3847/api/find \
-H "Content-Type: application/json" \
-H "X-Browsy-Session: $TOKEN" \
-d '{"text": "Sign In"}'
# Find by role
curl http://localhost:3847/api/find \
-H "Content-Type: application/json" \
-H "X-Browsy-Session: $TOKEN" \
-d '{"role": "button"}'
Response: JSON array of matching elements.
POST /api/back
Go back to the previous page in browsing history. No request body required.
curl -X POST http://localhost:3847/api/back \
-H "X-Browsy-Session: $TOKEN"
Response: The previous page's DOM.
GET /api/page
Get the current page DOM with form state overlaid. Use after type, check, select, or uncheck to see updated form values without re-fetching.
Query parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
scope | string | no | "all" (default), "visible", "above_fold", or "visible_above_fold" |
format | string | no | "compact" (default) or "json" |
curl "http://localhost:3847/api/page?format=json&scope=visible" \
-H "X-Browsy-Session: $TOKEN"
GET /api/page-info
Get page metadata without the full element list. No parameters.
curl http://localhost:3847/api/page-info \
-H "X-Browsy-Session: $TOKEN"
Response:
{
"title": "Sign In - Example",
"url": "https://example.com/login",
"page_type": "Login",
"suggested_actions": [
{
"action": "Login",
"username_id": 5,
"password_id": 8,
"submit_id": 12
}
],
"alerts": [],
"pagination": null
}
GET /api/tables
Extract structured table data from the current page. No parameters.
curl http://localhost:3847/api/tables \
-H "X-Browsy-Session: $TOKEN"
Response:
[
{
"headers": ["Name", "Price", "Stock"],
"rows": [
["Widget A", "$9.99", "In stock"],
["Widget B", "$14.99", "Out of stock"]
]
}
]
GET /health
Health check endpoint. No session required.
curl http://localhost:3847/health
Response:
{
"status": "ok"
}
Scopes
The scope parameter controls which elements are included in the output:
| Scope | Description |
|---|---|
all | All elements including hidden ones (default) |
visible | Only non-hidden elements |
above_fold | Only elements with top edge within the viewport height |
visible_above_fold | Non-hidden elements above the fold |
Output formats
The format parameter controls the response format:
| Format | Content-Type | Description |
|---|---|---|
compact | text/plain | Minimal token-efficient text format (default) |
json | application/json | Full structured Spatial DOM |
See Output Formats for details on both formats.
Error responses
Errors return JSON with an error field:
{
"error": "Element 999 not found"
}
| Status | Cause |
|---|---|
400 | Invalid request body or parameters |
404 | Element not found, no page loaded, or no matching action |
503 | Server at session capacity |
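As a sketch, a client can branch on the status code and surface the error field; Python requests is assumed, and the element ID is deliberately bogus to trigger a 404:
import requests

token = "..."  # captured from an earlier X-Browsy-Session response header

resp = requests.post(
    "http://localhost:3847/api/click",
    json={"id": 999},  # nonexistent element -> 404 with an "error" field
    headers={"X-Browsy-Session": token},
)
if resp.status_code != 200:
    print(resp.status_code, resp.json().get("error"))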
Example: complete login flow
# Start the server
browsy serve --port 3847 &
# Browse to login page (captures session token)
TOKEN=$(curl -s -D- http://localhost:3847/api/browse \
-H "Content-Type: application/json" \
-d '{"url": "https://app.example.com/login"}' \
| grep -i x-browsy-session | tr -d '\r' | cut -d' ' -f2)
# Check page type
curl -s http://localhost:3847/api/page-info \
-H "X-Browsy-Session: $TOKEN" | jq .page_type
# "Login"
# Submit credentials
curl -s http://localhost:3847/api/login \
-H "Content-Type: application/json" \
-H "X-Browsy-Session: $TOKEN" \
-d '{"username": "user@example.com", "password": "secret"}'
# Check if 2FA is needed
curl -s http://localhost:3847/api/page-info \
-H "X-Browsy-Session: $TOKEN" | jq .page_type
# "TwoFactorAuth"
# Enter 2FA code
curl -s http://localhost:3847/api/enter-code \
-H "Content-Type: application/json" \
-H "X-Browsy-Session: $TOKEN" \
-d '{"code": "847291"}'
# Now on the dashboard -- extract tables
curl -s http://localhost:3847/api/tables \
-H "X-Browsy-Session: $TOKEN" | jq .
A2A Protocol
browsy implements Google's Agent-to-Agent (A2A) protocol, enabling agent discovery and task delegation over HTTP. Any A2A-compatible agent can discover browsy's capabilities and delegate web browsing tasks to it.
Overview
A2A is a standard for agents to find and communicate with each other. browsy's A2A support consists of two parts:
- Agent card -- a JSON manifest at a well-known URL describing browsy's capabilities.
- Task execution -- an endpoint that accepts goals, executes them as browsing tasks, and streams status events back via SSE.
Both are served automatically by browsy serve.
browsy serve --port 3847
Agent card
The agent card is served at GET /.well-known/agent.json and describes browsy's identity and capabilities.
curl http://localhost:3847/.well-known/agent.json
Response:
{
"name": "browsy",
"description": "Zero-render browser engine for AI agents. Navigates, extracts, and interacts with web pages without rendering pixels.",
"url": "http://localhost:3847",
"version": "1.0",
"capabilities": {
"streaming": true,
"pushNotifications": false
},
"skills": [
{
"id": "web-browse",
"name": "Web Browsing",
"description": "Navigate to URLs, interact with pages, extract content, fill forms, and search the web.",
"tags": ["browse", "scrape", "extract", "search", "login", "forms"]
}
]
}
Agents discover browsy by fetching this card and inspecting the skills array. The streaming: true capability indicates that task responses are delivered as Server-Sent Events (SSE).
Task execution
POST /a2a/tasks
Submit a task for browsy to execute. The response is an SSE event stream with status updates.
Request body:
| Field | Type | Required | Description |
|---|---|---|---|
goal | string | yes | Natural language description of the task |
params | object | no | Structured parameters (see below) |
Params fields:
| Field | Type | Description |
|---|---|---|
url | string | Target URL to browse |
credentials | object | { "username": "...", "password": "..." } for login tasks |
search_query | string | Query string for search tasks |
extract | string | What to extract from the page (e.g., "tables", "links", "text") |
browsy infers the task intent from the goal text and params fields. Explicit params take priority over goal parsing.
Intent detection
browsy maps each task to one of these intents:
| Intent | Trigger | Behavior |
|---|---|---|
Search | search_query param, or goal contains "search" | Performs a web search, returns results |
Login | credentials param, or goal contains "login"/"sign in" | Navigates to URL, fills login form, submits |
Extract | extract param (not "tables"), or goal contains "extract"/"scrape" | Navigates to URL, returns page content |
ExtractTables | extract: "tables", or goal contains "table" | Navigates to URL, extracts structured table data |
FillForm | Goal contains "fill"/"form"/"submit" | Navigates to URL, interacts with form elements |
Browse | Default fallback | Navigates to URL, returns the Spatial DOM |
SSE event stream
The response uses Content-Type: text/event-stream. Each event is a JSON object with the following structure:
data: {"id":"task_abc123","status":"working","steps":[{"description":"Navigating to https://example.com"}]}
data: {"id":"task_abc123","status":"completed","steps":[{"description":"Navigating to https://example.com"},{"description":"Page loaded: Example Domain (3 elements)"}],"result":{"page_type":"Other","title":"Example Domain","elements":3}}
Event fields:
| Field | Type | Description |
|---|---|---|
id | string | Unique task identifier |
status | string | "working", "completed", or "failed" |
steps | array | List of { "description": "..." } objects showing progress |
result | object | Present when status is "completed". Contains extracted data |
error | string | Present when status is "failed". Describes what went wrong |
The stream always ends with a terminal event ("completed" or "failed").
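A minimal consumer sketch in Python (requests is an assumption; the event fields follow the table above):
import json
import requests

resp = requests.post(
    "http://localhost:3847/a2a/tasks",
    json={"goal": "Browse the Hacker News front page",
          "params": {"url": "https://news.ycombinator.com"}},
    stream=True,
)

for line in resp.iter_lines(decode_unicode=True):
    if not line or not line.startswith("data: "):
        continue  # skip blank separator lines between events
    event = json.loads(line[len("data: "):])
    if event["steps"]:
        print(event["status"], "-", event["steps"][-1]["description"])
    if event["status"] in ("completed", "failed"):
        print(event.get("result") or event.get("error"))
        break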
Examples
Browse a page
curl -N http://localhost:3847/a2a/tasks \
-H "Content-Type: application/json" \
-d '{
"goal": "Browse the Hacker News front page",
"params": { "url": "https://news.ycombinator.com" }
}'
Event stream:
data: {"id":"task_1","status":"working","steps":[{"description":"Navigating to https://news.ycombinator.com"}]}
data: {"id":"task_1","status":"completed","steps":[{"description":"Navigating to https://news.ycombinator.com"},{"description":"Page loaded: Hacker News (120 elements)"}],"result":{"page_type":"List","title":"Hacker News","elements":120}}
Search the web
curl -N http://localhost:3847/a2a/tasks \
-H "Content-Type: application/json" \
-d '{
"goal": "Search for Rust web frameworks",
"params": { "search_query": "rust web framework 2026" }
}'
Login to a site
curl -N http://localhost:3847/a2a/tasks \
-H "Content-Type: application/json" \
-d '{
"goal": "Login to the application",
"params": {
"url": "https://app.example.com/login",
"credentials": { "username": "user@example.com", "password": "secret" }
}
}'
Event stream:
data: {"id":"task_3","status":"working","steps":[{"description":"Navigating to https://app.example.com/login"}]}
data: {"id":"task_3","status":"working","steps":[{"description":"Navigating to https://app.example.com/login"},{"description":"Login page detected, submitting credentials"}]}
data: {"id":"task_3","status":"completed","steps":[{"description":"Navigating to https://app.example.com/login"},{"description":"Login page detected, submitting credentials"},{"description":"Login successful, redirected to Dashboard"}],"result":{"page_type":"Dashboard","title":"Dashboard - App"}}
Extract table data
curl -N http://localhost:3847/a2a/tasks \
-H "Content-Type: application/json" \
-d '{
"goal": "Extract the pricing table",
"params": {
"url": "https://example.com/pricing",
"extract": "tables"
}
}'
Extract page content
curl -N http://localhost:3847/a2a/tasks \
-H "Content-Type: application/json" \
-d '{
"goal": "Extract the main article text",
"params": {
"url": "https://example.com/blog/post",
"extract": "text"
}
}'
Fill a form
curl -N http://localhost:3847/a2a/tasks \
-H "Content-Type: application/json" \
-d '{
"goal": "Fill out the contact form with name John and email john@example.com",
"params": { "url": "https://example.com/contact" }
}'
Task status polling
A stub endpoint exists for polling task status by ID:
GET /a2a/tasks/{task_id}
curl http://localhost:3847/a2a/tasks/task_abc123
This returns the last known state of the task. Since tasks execute synchronously over SSE, polling is primarily useful for checking whether a task completed after a disconnection.
Error handling
When a task fails, the final SSE event includes an error field:
data: {"id":"task_5","status":"failed","steps":[{"description":"Navigating to https://invalid.example"}],"error":"Network error: DNS resolution failed"}
Common failure causes:
| Error | Cause |
|---|---|
| Network error | DNS failure, connection refused, timeout |
| CAPTCHA detected | Target page requires human verification |
| No login form found | Login intent but page has no detected login action |
| Element not found | Form interaction referenced a nonexistent element |
Framework Integrations
browsy provides native integrations for popular AI/agent frameworks in both Python and JavaScript/TypeScript. Each integration wraps browsy as framework-compatible tools, so agents can browse the web using their native tool-calling patterns.
JavaScript / TypeScript
The browsy-ai npm package provides integrations for LangChain.js, OpenAI, and Vercel AI SDK. Install the core package and whichever framework you use:
npm install browsy-ai # Core SDK
npm install browsy-ai @langchain/core # + LangChain.js
npm install browsy-ai openai # + OpenAI
npm install browsy-ai ai # + Vercel AI SDK
LangChain.js
import { getTools } from "browsy-ai/langchain";
const tools = getTools(); // -> 14 LangChain tool instances
OpenAI function calling
import { getToolDefinitions, handleToolCall } from "browsy-ai/openai";
const tools = getToolDefinitions();
const result = await handleToolCall("browsy_browse", { url: "https://example.com" });
Vercel AI SDK
import { browsyTools } from "browsy-ai/vercel-ai";
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";
const result = await generateText({
model: openai("gpt-4o"),
tools: browsyTools(),
prompt: "Go to example.com and summarize it",
maxSteps: 10,
});
See the full JavaScript / TypeScript guide for complete examples and API reference.
Python
Install browsy with the extras for your framework:
pip install browsy-ai[langchain] # LangChain tools
pip install browsy-ai[crewai] # CrewAI tool
pip install browsy-ai[openai] # OpenAI function calling
pip install browsy-ai[autogen] # AutoGen integration
pip install browsy-ai[smolagents] # HuggingFace smolagents
pip install browsy-ai[all] # All integrations
All Python integrations share a lazily-initialized Browser instance. You can pass your own Browser for custom viewport configuration.
LangChain
The LangChain integration provides individual tools that plug directly into LangChain agents and chains.
from browsy.langchain import get_tools
Available tools
| Tool class | Description |
|---|---|
BrowsyBrowseTool | Navigate to a URL, returns Spatial DOM |
BrowsyClickTool | Click an element by ID |
BrowsyTypeTextTool | Type text into an input field |
BrowsySearchTool | Web search via DuckDuckGo or Google |
BrowsyLoginTool | Fill and submit a login form |
BrowsyPageInfoTool | Get page metadata and suggested actions |
Quick start
from browsy.langchain import get_tools
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent
llm = ChatOpenAI(model="gpt-4o")
tools = get_tools()
agent = create_react_agent(llm, tools)
result = agent.invoke({
"messages": [{"role": "user", "content": "Go to news.ycombinator.com and list the top 5 stories"}]
})
Custom browser
Pass a Browser instance to control viewport size or other settings:
from browsy import Browser
from browsy.langchain import get_tools
browser = Browser(viewport_width=375, viewport_height=812)
tools = get_tools(browser=browser)
Using individual tools
from browsy.langchain import BrowsyBrowseTool, BrowsyClickTool
browse = BrowsyBrowseTool()
page = browse.invoke({"url": "https://example.com"})
click = BrowsyClickTool()
result = click.invoke({"id": 3})
CrewAI
The CrewAI integration wraps all browsy actions into a single tool that CrewAI agents can call.
from browsy.crewai import BrowsyTool
Quick start
from browsy.crewai import BrowsyTool
from crewai import Agent, Task, Crew
browsy_tool = BrowsyTool()
researcher = Agent(
role="Web Researcher",
goal="Find and summarize information from web pages",
backstory="You are an expert at navigating websites and extracting key information.",
tools=[browsy_tool],
verbose=True,
)
task = Task(
description="Go to https://news.ycombinator.com and summarize the top 3 stories.",
expected_output="A summary of the top 3 Hacker News stories with titles and URLs.",
agent=researcher,
)
crew = Crew(agents=[researcher], tasks=[task])
result = crew.kickoff()
print(result)
Tool actions
The BrowsyTool accepts a JSON string with an action field and action-specific parameters:
# Browse
browsy_tool.run('{"action": "browse", "url": "https://example.com"}')
# Click
browsy_tool.run('{"action": "click", "id": 3}')
# Type
browsy_tool.run('{"action": "type", "id": 5, "text": "hello"}')
# Search
browsy_tool.run('{"action": "search", "query": "rust web framework"}')
# Login
browsy_tool.run('{"action": "login", "username": "user@example.com", "password": "secret"}')
# Page info
browsy_tool.run('{"action": "page_info"}')
OpenAI function calling
The OpenAI integration provides tool definitions compatible with the OpenAI Chat Completions API and a dispatcher to handle tool calls.
from browsy.openai import get_tool_definitions, handle_tool_call
Tool definitions
get_tool_definitions() returns a list of OpenAI-compatible tool schemas:
from browsy.openai import get_tool_definitions
tools = get_tool_definitions()
# Returns list of {"type": "function", "function": {"name": ..., "parameters": ...}}
Handling tool calls
handle_tool_call(name, args) dispatches a tool call to browsy and returns the result as a string:
from browsy.openai import handle_tool_call
result = handle_tool_call("browsy_browse", {"url": "https://example.com"})
Complete example
import json
from openai import OpenAI
from browsy.openai import get_tool_definitions, handle_tool_call
client = OpenAI()
tools = get_tool_definitions()
messages = [
{"role": "user", "content": "Go to example.com and tell me what's on the page."}
]
# Initial request
response = client.chat.completions.create(
model="gpt-4o",
messages=messages,
tools=tools,
)
# Tool call loop
while response.choices[0].message.tool_calls:
msg = response.choices[0].message
messages.append(msg)
for tool_call in msg.tool_calls:
args = json.loads(tool_call.function.arguments)
result = handle_tool_call(tool_call.function.name, args)
messages.append({
"role": "tool",
"tool_call_id": tool_call.id,
"content": result,
})
response = client.chat.completions.create(
model="gpt-4o",
messages=messages,
tools=tools,
)
print(response.choices[0].message.content)
Available functions
| Function name | Parameters | Description |
|---|---|---|
browsy_browse | url, format?, scope? | Navigate to a URL |
browsy_click | id | Click an element |
browsy_type_text | id, text | Type into an input |
browsy_search | query, engine? | Web search |
browsy_login | username, password | Login to a site |
browsy_page_info | (none) | Get page metadata |
AutoGen
The AutoGen integration provides a BrowsyBrowser class compatible with Microsoft AutoGen's ConversableAgent.
from browsy.autogen import BrowsyBrowser
Quick start
from browsy.autogen import BrowsyBrowser
from autogen import ConversableAgent, UserProxyAgent
browser = BrowsyBrowser()
assistant = ConversableAgent(
name="web_assistant",
system_message="You help users browse the web and extract information.",
llm_config={"config_list": [{"model": "gpt-4o"}]},
)
# Register browsy tools with the agent
browser.register(assistant)
user = UserProxyAgent(
name="user",
human_input_mode="NEVER",
code_execution_config=False,
)
browser.register(user)
user.initiate_chat(
assistant,
message="Go to https://example.com and describe what you see.",
)
Custom browser
from browsy import Browser
from browsy.autogen import BrowsyBrowser
custom = Browser(viewport_width=1366, viewport_height=768)
browser = BrowsyBrowser(browser=custom)
Smolagents
The smolagents integration provides a tool compatible with HuggingFace's smolagents framework.
from browsy.smolagents import BrowsyTool
Quick start
from browsy.smolagents import BrowsyTool
from smolagents import CodeAgent, HfApiModel
tool = BrowsyTool()
agent = CodeAgent(
tools=[tool],
model=HfApiModel("Qwen/Qwen2.5-Coder-32B-Instruct"),
)
result = agent.run("Go to https://example.com and extract the main heading text.")
print(result)
Custom browser
from browsy import Browser
from browsy.smolagents import BrowsyTool
browser = Browser(viewport_width=1920, viewport_height=1080)
tool = BrowsyTool(browser=browser)
OpenClaw / SimpleClaw
The openclaw-browsy plugin integrates browsy as a first-class tool in OpenClaw and compatible frameworks like SimpleClaw. Unlike the Python integrations above, this is a TypeScript/Node.js plugin that manages its own browsy server process.
npm install openclaw-browsy
import { register } from "openclaw-browsy";
export default { register };
The plugin auto-starts a browsy serve process and injects 14 browsing tools into every agent. It can also intercept built-in Playwright browser tools for a transparent speed upgrade.
See the full OpenClaw / SimpleClaw integration guide for configuration, standalone usage, and custom orchestrator support.
Shared Browser instance
All integrations lazily initialize a Browser instance with default settings (1920x1080 viewport) if none is provided. The Browser instance is shared across all tool calls within the same integration, maintaining session state (cookies, history, form values) across interactions.
To share a single Browser across multiple integrations:
from browsy import Browser
from browsy.langchain import get_tools as get_langchain_tools
from browsy.openai import get_tool_definitions
browser = Browser(viewport_width=1920, viewport_height=1080)
# Both use the same session
langchain_tools = get_langchain_tools(browser=browser)
openai_tools = get_tool_definitions(browser=browser)
JavaScript / TypeScript
The browsy-ai npm package provides a TypeScript SDK for the browsy REST API, plus ready-made integrations for LangChain.js, OpenAI, and Vercel AI SDK.
Installation
npm install browsy-ai
The package uses ESM and requires Node.js 22+. Framework dependencies are optional peer dependencies — install only what you need.
Core SDK
The core SDK manages the browsy server process, HTTP communication, and per-agent session isolation.
import { BrowsyClient, BrowsyContext, ServerManager } from "browsy-ai";
BrowsyContext
The simplest way to use browsy. BrowsyContext is a facade that coordinates the client, server manager, and session manager.
import { BrowsyContext } from "browsy-ai";
const ctx = new BrowsyContext({ port: 3847 });
// Execute tool calls — server auto-starts, sessions auto-managed
const page = await ctx.executeToolCall("browse", { url: "https://example.com" });
console.log(page);
const info = await ctx.executeToolCall("pageInfo", {});
console.log(info);
BrowsyClient
Lower-level HTTP client for direct API calls. Use this when you manage the server and sessions yourself.
import { BrowsyClient } from "browsy-ai";
const client = new BrowsyClient(3847);
// Navigate
const res = await client.browse({ url: "https://example.com" });
console.log(res.body);
// Interact using the session from the response
await client.typeText({ id: 5, text: "hello" }, res.session);
await client.click({ id: 12 }, res.session);
// Extract data
const tables = await client.tables(res.session);
const info = await client.pageInfo(res.session);
Configuration
import { BrowsyContext } from "browsy-ai";
const ctx = new BrowsyContext({
port: 3847, // REST server port (default: 3847)
autoStart: true, // Auto-start browsy serve (default: true)
allowPrivateNetwork: false, // Allow private network URLs (default: false)
serverTimeout: 10_000, // Startup timeout in ms (default: 10000)
});
When autoStart is true, the SDK finds the browsy binary in your PATH (or via the BROWSY_BIN environment variable) and spawns browsy serve --port <port>.
Session isolation
Each agent gets its own isolated browsing session with independent cookies, history, and form state:
const ctx = new BrowsyContext();
// Different agents get different sessions
const page1 = await ctx.executeToolCall("browse", { url: "https://a.com" }, "agent-1");
const page2 = await ctx.executeToolCall("browse", { url: "https://b.com" }, "agent-2");
LangChain.js
npm install browsy-ai @langchain/core
import { getTools } from "browsy-ai/langchain";
Quick start
import { getTools } from "browsy-ai/langchain";
import { ChatOpenAI } from "@langchain/openai";
import { createReactAgent } from "@langchain/langgraph/prebuilt";
const tools = getTools({ port: 3847 });
const llm = new ChatOpenAI({ model: "gpt-4o" });
const agent = createReactAgent({ llm, tools });
const result = await agent.invoke({
messages: [{ role: "user", content: "Go to news.ycombinator.com and list the top 5 stories" }],
});
Custom context
Pass a BrowsyContext for full control:
import { BrowsyContext } from "browsy-ai";
import { getTools } from "browsy-ai/langchain";
const ctx = new BrowsyContext({ port: 9000, autoStart: false });
const tools = getTools(ctx);
Available tools
getTools() returns 14 LangChain tool instances:
| Tool name | Parameters | Description |
|---|---|---|
browsy_browse | url, format?, scope? | Navigate to a URL |
browsy_click | id | Click an element by ID |
browsy_type_text | id, text | Type into an input field |
browsy_check | id | Check a checkbox/radio |
browsy_uncheck | id | Uncheck a checkbox/radio |
browsy_select | id, value | Select a dropdown option |
browsy_search | query, engine? | Web search |
browsy_login | username, password | Log in using detected form |
browsy_enter_code | code | Enter 2FA/verification code |
browsy_find | text?, role? | Find elements by text or role |
browsy_get_page | format?, scope? | Get current page with form state |
browsy_page_info | — | Page metadata and suggested actions |
browsy_tables | — | Extract structured table data |
browsy_back | — | Go back in history |
OpenAI
npm install browsy-ai openai
import { getToolDefinitions, handleToolCall } from "browsy-ai/openai";
Quick start
import OpenAI from "openai";
import { getToolDefinitions, handleToolCall, createToolCallHandler } from "browsy-ai/openai";
const client = new OpenAI();
const tools = getToolDefinitions();
const messages = [
{ role: "user" as const, content: "Go to example.com and tell me what's there." },
];
let response = await client.chat.completions.create({
model: "gpt-4o",
messages,
tools,
});
// Tool call loop
while (response.choices[0].message.tool_calls?.length) {
const msg = response.choices[0].message;
messages.push(msg);
for (const toolCall of msg.tool_calls!) {
const args = JSON.parse(toolCall.function.arguments);
const result = await handleToolCall(toolCall.function.name, args);
messages.push({
role: "tool" as const,
tool_call_id: toolCall.id,
content: result,
});
}
response = await client.chat.completions.create({
model: "gpt-4o",
messages,
tools,
});
}
console.log(response.choices[0].message.content);
Bound handler
Use createToolCallHandler() to get a pre-bound handler:
import { getToolDefinitions, createToolCallHandler } from "browsy-ai/openai";
const tools = getToolDefinitions();
const handle = createToolCallHandler({ port: 3847 });
// In your tool call loop:
const result = await handle(toolCall.function.name, args);
Vercel AI SDK
npm install browsy-ai ai
import { browsyTools } from "browsy-ai/vercel-ai";
Quick start
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";
import { browsyTools } from "browsy-ai/vercel-ai";
const result = await generateText({
model: openai("gpt-4o"),
tools: browsyTools(),
prompt: "Go to news.ycombinator.com and list the top 5 stories",
maxSteps: 10,
});
console.log(result.text);
Custom context
import { BrowsyContext } from "browsy-ai";
import { browsyTools } from "browsy-ai/vercel-ai";
const ctx = new BrowsyContext({ port: 9000 });
const tools = browsyTools(ctx);
Zod schemas
All tool parameter schemas are exported as Zod objects for use in custom integrations:
import {
BrowseParams,
ClickParams,
TypeTextParams,
SearchParams,
TOOL_DESCRIPTIONS,
TOOL_SCHEMAS,
} from "browsy-ai";
// Use in your own tool definitions
const parsed = BrowseParams.parse({ url: "https://example.com" });
// Iterate over all tools
for (const { name, method, schema } of TOOL_SCHEMAS) {
console.log(name, TOOL_DESCRIPTIONS[name]);
}
Prerequisites
The SDK talks to a browsy REST server. You need the browsy CLI installed:
cargo install browsy
With autoStart: true (the default), the SDK starts the server automatically. With autoStart: false, start it manually:
browsy serve --port 3847
OpenClaw Integration
browsy integrates with OpenClaw as a first-class plugin, giving every agent fast, zero-render browsing capabilities without Playwright or Chromium.
Why use browsy in OpenClaw?
OpenClaw's built-in browser uses Playwright + CDP: ~300MB RAM, 2-5s per page. browsy handles 70%+ of agent browsing tasks at 10x speed and 60x less memory. The plugin auto-starts a browsy server and injects 14 browsing tools into every agent.
| Built-in Browser | browsy Plugin | |
|---|---|---|
| Engine | Chromium via Playwright | Zero-render Spatial DOM |
| Memory | ~300MB/page | ~5MB/page |
| Latency | 2-5s/page | <100ms/page |
| JS support | Full | Hidden content exposure |
| Setup | Bundled | npm install openclaw-browsy + browsy CLI |
Installation
# Install the OpenClaw plugin
npm install openclaw-browsy
# Install the browsy CLI (needed for the server)
cargo install browsy
Configuration
Add to your OpenClaw config:
{
"plugins": {
"openclaw-browsy": {
"port": 3847,
"autoStart": true,
"allowPrivateNetwork": false,
"preferBrowsy": true,
"serverTimeout": 10000
}
}
}
| Option | Default | Description |
|---|---|---|
port | 3847 | Port for the browsy REST server |
autoStart | true | Start browsy serve automatically on plugin init |
allowPrivateNetwork | false | Allow fetching private/internal network URLs |
preferBrowsy | true | Intercept built-in browser tool calls and redirect through browsy |
serverTimeout | 10000 | Timeout (ms) waiting for server startup |
Plugin registration
// openclaw.config.ts
import { register } from "openclaw-browsy";
export default { register };
The plugin registers four components following OpenClaw's standard pattern:
- preToolExecution hook -- intercepts built-in browser tools (browser, web_browser, playwright_browser, browse_web) and redirects them through browsy when preferBrowsy is enabled
- agent:bootstrap hook -- injects 14 browsy tools into every agent's toolset at startup
- browsy-server service -- manages the browsy serve process lifecycle (auto-start, health polling, shutdown)
- Gateway methods + CLI commands -- browsy.status, browsy.restart, /browsy-status, /browsy-sessions
Available tools
Every agent gets these 14 tools automatically:
| Tool | Parameters | Description |
|---|---|---|
browsy_browse | url, format?, scope? | Navigate to a URL |
browsy_click | id | Click an element by ID |
browsy_type_text | id, text | Type text into an input field |
browsy_check | id | Check a checkbox or radio button |
browsy_uncheck | id | Uncheck a checkbox or radio button |
browsy_select | id, value | Select a dropdown option |
browsy_search | query, engine? | Search the web (DuckDuckGo or Google) |
browsy_login | username, password | Log in using detected form fields |
browsy_enter_code | code | Enter a verification or 2FA code |
browsy_find | text?, role? | Find elements by text or ARIA role |
browsy_get_page | format?, scope? | Get current page DOM with form state |
browsy_page_info | — | Get page metadata and suggested actions |
browsy_tables | — | Extract structured table data |
browsy_back | — | Go back in browsing history |
How it works
The plugin is a pure proxy — it talks to browsy's REST API via fetch() and manages sessions:
Agent → browsy_browse("https://example.com")
→ Plugin ensures browsy server is running
→ Plugin gets/creates session for this agent
→ POST /api/browse with X-Browsy-Session header
→ browsy fetches, parses, and returns Spatial DOM
→ Plugin updates session token
→ Agent receives page content
Each agent gets its own isolated session with independent cookies, history, and form state.
SimpleClaw and other OpenClaw-compatible frameworks
The openclaw-browsy plugin works with any framework that implements the OpenClaw plugin API. This includes SimpleClaw and other lightweight agent orchestrators built on the OpenClaw standard.
SimpleClaw quick start
import { SimpleClaw } from "simpleclaw";
import { register } from "openclaw-browsy";
const claw = new SimpleClaw({
plugins: [{ register }],
config: {
"openclaw-browsy": {
port: 3847,
preferBrowsy: true,
},
},
});
// Agents automatically get browsy tools
const agent = claw.createAgent({
name: "researcher",
instructions: "You browse the web and extract information.",
});
const result = await agent.run("Search for 'Rust web frameworks' and summarize the top 3 results");
Standalone usage (no framework)
You can also use the browsy client directly without OpenClaw:
import { BrowsyContext } from "openclaw-browsy";
const ctx = new BrowsyContext({ port: 3847, autoStart: false });
// Assumes browsy serve is already running
const page = await ctx.executeToolCall("browse", { url: "https://example.com" });
console.log(page);
const info = await ctx.executeToolCall("pageInfo", {});
console.log(info);
Custom agent orchestrators
Any orchestrator that implements the four-method OpenClawPluginApi interface can use the plugin:
interface OpenClawPluginApi {
registerHook(name: string, handler: (...args: unknown[]) => unknown): void;
registerService(name: string, service: { start?: () => void | Promise<void>; stop?: () => void | Promise<void> }): void;
registerGatewayMethod(name: string, handler: (...args: unknown[]) => unknown): void;
registerCommand(name: string, handler: (...args: unknown[]) => unknown): void;
}
Call register(api) with your implementation and browsy tools become available to your agents.
preferBrowsy mode
When preferBrowsy is enabled (the default), the plugin intercepts calls to built-in browser tools and aborts them with a message directing the agent to use browsy_browse instead. This provides a transparent speed upgrade for agents that were previously using Playwright.
The intercepted tool names are:
- browser
- web_browser
- playwright_browser
- browse_web
To disable interception and run browsy alongside the built-in browser:
{
"openclaw-browsy": {
"preferBrowsy": false
}
}
When to fall back to a full browser
browsy handles server-rendered pages, forms, search, and data extraction. Fall back to Playwright for:
- JS-heavy SPAs — React, Vue, Angular apps that render entirely client-side
- Screenshots — when you need visual/pixel-level inspection
- Complex JS interactions — drag-and-drop, infinite scroll, WebSocket-driven UIs
- PDF generation — print-to-PDF workflows
With preferBrowsy: false, both browsy and the built-in browser are available. Agents can choose the right tool for each task.
Bundled skills
The plugin includes three runtime skills for common browsing patterns:
browse-and-extract
Navigate to a URL and extract data, automatically handling cookie consent and login walls.
web-research
Search the web, visit multiple pages, and compile a research summary with source attribution.
form-filler
Detect form fields using browsy's page intelligence, fill them with provided data, and submit.
Python Bindings
browsy provides Python bindings via PyO3. The API closely mirrors the Rust Session API.
Installation
pip install browsy-ai
The package ships a compiled native extension (_core.pyd / _core.so). No Rust toolchain required for installation from wheels.
Module contents
from browsy import Browser, Page, Element
| Class | Description |
|---|---|
Browser | A browsing session with cookie persistence and form state |
Page | A parsed page (the Spatial DOM) |
Element | A single element in the Spatial DOM |
Basic usage: parsing HTML
The Browser class can parse local HTML without network access:
from browsy import Browser
browser = Browser(viewport_width=1920, viewport_height=1080)
page = browser.load_html('<h1>Hello</h1><a href="/about">About</a>', 'https://example.com')
print(page.title) # ""
print(len(page)) # 2
for el in page.elements:
print(el.id, el.tag, el.text)
# 1 h1 Hello
# 2 a About
Browsing: navigating URLs
from browsy import Browser
browser = Browser()
page = browser.goto("https://example.com")
print(page.title) # "Example Domain"
print(page.url) # "https://example.com"
print(page.page_type()) # "Other"
Page properties and methods
page.title # str: page title
page.url # str: current URL
page.elements # list[Element]: all elements
page.visible() # list[Element]: non-hidden elements only
page.above_fold() # list[Element]: elements with top edge within viewport
page.get(id) # Element or None: lookup by ID
page.page_type() # str: "Login", "Search", "Article", "List", etc.
page.suggested_actions() # list[dict]: detected action recipes
page.alerts() # list[Element]: elements with alert_type set
page.tables() # list[dict]: extracted table data (headers + rows)
page.pagination() # dict or None: next/prev/pages links
page.to_json() # str: full JSON serialization
page.to_compact() # str: compact text format
len(page) # int: element count
Element properties
el.id # int: unique element ID
el.tag # str: HTML tag name
el.role # str or None: ARIA role (implicit or explicit)
el.text # str or None: visible text content
el.href # str or None: link target (resolved to absolute URL)
el.placeholder # str or None: placeholder text
el.value # str or None: current value
el.input_type # str or None: input type attribute
el.name # str or None: HTML name attribute
el.label # str or None: associated label text
el.alert_type # str or None: "alert", "error", "success", "warning"
el.disabled # bool or None
el.checked # bool or None
el.expanded # bool or None
el.selected # bool or None
el.required # bool or None
el.hidden # bool or None: True if element is hidden
el.bounds # tuple[int, int, int, int]: (x, y, width, height)
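These properties combine naturally with the Page filters above; a short sketch that lists the required, visible form fields on a page (the URL is illustrative):
from browsy import Browser

browser = Browser()
page = browser.goto("https://example.com/checkout")  # illustrative URL

# Required, visible data-entry fields with their labels and positions
for el in page.visible():
    if el.required and el.tag in ("input", "select", "textarea"):
        x, y, w, h = el.bounds
        print(el.id, el.label or el.placeholder or el.name, (x, y, w, h))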
Form interaction
browser = Browser()
page = browser.goto("https://example.com/login")
# Type into fields by element ID
browser.type_text(5, "user@example.com")
browser.type_text(8, "secretpassword")
# Check a "remember me" checkbox
browser.check(10)
# Select a dropdown option
browser.select(12, "en-US")
# Read the updated DOM with form state overlaid
page = browser.dom()
# Submit by clicking the submit button
page = browser.click(15)
Compound actions
For detected form patterns, compound actions handle the full workflow:
# Login (requires Login suggested action on current page)
page = browser.login("user@example.com", "password123")
# Enter verification code (requires EnterCode suggested action)
page = browser.enter_code("123456")
Search
# Search the web (DuckDuckGo by default)
results = browser.search("python web scraping")
for r in results:
print(r["title"], r["url"], r["snippet"])
Finding elements
# Find by text content (exact substring match)
elements = browser.find_by_text("Sign In")
# Find by text content (case-insensitive substring)
elements = browser.find_by_text_fuzzy("sign in")
# Find by ARIA role
buttons = browser.find_by_role("button")
headings = browser.find_by_role("heading")
links = browser.find_by_role("link")
# Find input by semantic purpose
password_input = browser.find_input_by_purpose("password")
email_input = browser.find_input_by_purpose("email")
search_input = browser.find_input_by_purpose("search")
# Supported purposes: "password", "email", "username", "code", "search", "phone"
# Find verification codes on the page
code = browser.find_verification_code() # str or None
Navigation
# Navigate to a URL
page = browser.goto("https://example.com")
# Click a link (navigates to its href)
page = browser.click(3)
# Go back
page = browser.back()
Suggested actions
page = browser.goto("https://example.com/login")
for action in page.suggested_actions():
print(action)
# {"action": "Login", "username_id": 5, "password_id": 8, "submit_id": 12}
Each action is a dictionary with an "action" key identifying the type and additional fields with element IDs. See the Action Recipes Reference for all variants.
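A minimal dispatch sketch over the page loaded above; the field names match the Action Recipes Reference, and the search query is illustrative:
for action in page.suggested_actions():
    kind = action["action"]
    if kind == "CookieConsent":
        page = browser.click(action["accept_id"])        # dismiss the banner first
    elif kind == "Login":
        page = browser.login("user@example.com", "secret")
    elif kind == "Search":
        browser.type_text(action["input_id"], "pricing")  # illustrative query
        page = browser.click(action["submit_id"])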
Viewport configuration
# Mobile viewport
browser = Browser(viewport_width=375, viewport_height=812)
# Desktop viewport (default)
browser = Browser(viewport_width=1920, viewport_height=1080)
The viewport dimensions affect CSS media query evaluation and layout computation, which in turn affects element positions and visibility.
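A quick way to see the effect is to parse the same HTML at both viewports and compare what is visible (the local file name is illustrative):
from browsy import Browser

html = open("page.html").read()  # illustrative local file

desktop = Browser(viewport_width=1920, viewport_height=1080)
mobile = Browser(viewport_width=375, viewport_height=812)

d = desktop.load_html(html, "https://example.com")
m = mobile.load_html(html, "https://example.com")

# Media queries and layout differ per viewport, so visibility can differ too
print("desktop visible:", len(d.visible()), "mobile visible:", len(m.visible()))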
CLI Usage
The browsy CLI provides three commands: fetch for URLs, parse for local HTML files, and serve for the REST API server.
Installation
cargo install browsy
Commands
fetch
Fetch a URL, compute the Spatial DOM, and print the result.
browsy fetch <URL> [OPTIONS]
| Flag | Description |
|---|---|
--json | Output as JSON instead of compact format |
--viewport <WxH> | Viewport size (default: 1920x1080) |
--no-css | Skip fetching external CSS stylesheets |
--visible-only | Only include visible (non-hidden) elements |
--above-fold | Only include elements above the viewport fold |
Examples:
# Compact output (default)
browsy fetch https://example.com
# JSON output
browsy fetch https://example.com --json
# Mobile viewport
browsy fetch https://example.com --viewport 375x812
# Skip external CSS for faster parsing
browsy fetch https://example.com --no-css
# Only visible above-fold elements
browsy fetch https://example.com --visible-only --above-fold
parse
Parse a local HTML file and print the Spatial DOM. No network requests are made (external stylesheets are not fetched).
browsy parse <FILE> [OPTIONS]
| Flag | Description |
|---|---|
--json | Output as JSON instead of compact format |
--viewport <WxH> | Viewport size (default: 1920x1080) |
Use - to read from stdin:
echo '<h1>Hello</h1>' | browsy parse -
curl -s https://example.com | browsy parse -
Examples:
# Parse a local file
browsy parse index.html
# Parse with JSON output
browsy parse index.html --json
# Parse from stdin
cat page.html | browsy parse -
serve
Start the REST API + A2A server.
browsy serve [OPTIONS]
| Flag | Description |
|---|---|
--port <PORT> | Port to listen on (default: 3847) |
--allow-private-network | Allow fetching private/LAN addresses |
Examples:
# Start on default port
browsy serve
# Custom port
browsy serve --port 8080
# Allow local development server access
browsy serve --allow-private-network
The server exposes a REST API and A2A protocol endpoints. See REST API and A2A Protocol.
Output formats
Compact format (default)
The compact format is designed for minimal token usage in LLM contexts:
title: Example Domain
url: https://example.com
vp: 1920x1080
els: 3
---
[1:h1 "Example Domain"]
[2:p "This domain is for use in illustrative examples in documents."]
[3:a "More information..." ->https://www.iana.org/domains/example]
The header shows the page title, URL, viewport dimensions, and element count. Each element line follows the pattern [id:tag "text"] with optional annotations:
- !id:tag -- hidden element
- id:input:password -- input type (when not "text")
- [name] -- HTML name attribute
- [v] -- checked
- [*] -- required
- [=value] -- current value
- ->url -- href
- narrow/wide/full -- width relative to viewport
- @region -- position (only when needed to disambiguate duplicates)
JSON format
The JSON format includes the full SpatialDom structure with all element properties. See the Architecture page for the complete schema.
MCP server mode
browsy also runs as an MCP server for use with Claude Code and other MCP clients. See MCP Server for details.
browsy mcp
Web Search
browsy includes built-in web search via DuckDuckGo and Google. No API keys or external services required -- it fetches search result pages directly and parses the HTML.
Search engines
| Engine | Endpoint | Reliability |
|---|---|---|
| DuckDuckGo | https://html.duckduckgo.com/html/ | High. Uses the HTML-only endpoint, no JavaScript needed. |
| Google | https://www.google.com/search | Variable. Google may return CAPTCHAs or block automated requests. |
DuckDuckGo is the default and recommended engine.
Rust API
Basic search
#![allow(unused)] fn main() { use browsy_core::fetch::{Session, SearchEngine}; let mut session = Session::new()?; let results = session.search("rust web scraping")?; for r in &results { println!("{}: {} -- {}", r.title, r.url, r.snippet); } }
Choosing a search engine
#![allow(unused)] fn main() { let results = session.search_with("rust web scraping", SearchEngine::Google)?; }
Search and read
Search and automatically fetch the top N result pages:
#![allow(unused)] fn main() { let pages = session.search_and_read("rust web scraping", 3)?; for page in &pages { println!("--- {} ---", page.result.title); if let Some(ref dom) = page.dom { println!(" Page type: {:?}", dom.page_type); println!(" Elements: {}", dom.els.len()); } else { println!(" (fetch failed)"); } } }
Each SearchPage contains the original SearchResult (title, URL, snippet) and an Option<SpatialDom> for the fetched page. Pages that fail to fetch have dom: None.
#![allow(unused)] fn main() { let pages = session.search_and_read_with( "rust web scraping", 5, SearchEngine::DuckDuckGo, )?; }
Python API
from browsy import Browser
browser = Browser()
# Basic search (DuckDuckGo)
results = browser.search("python asyncio tutorial")
for r in results:
print(r["title"], r["url"])
Search results are returned as a list of dictionaries, each with title, url, and snippet keys.
MCP API
The search tool accepts a query and optional engine:
{
"query": "browsy zero-render browser",
"engine": "duckduckgo"
}
Returns a JSON array of results:
[
{
"title": "browsy - Zero-render browser engine",
"url": "https://example.com/browsy",
"snippet": "A browser engine for AI agents..."
}
]
SearchResult struct
#![allow(unused)] fn main() { pub struct SearchResult { pub title: String, pub url: String, pub snippet: String, } }
How it works
DuckDuckGo
browsy fetches https://html.duckduckgo.com/html/?q=<query>, which returns a pure HTML page with no JavaScript. Results are extracted by finding <div class="result"> containers and parsing the title link (result__a), URL (result__url), and snippet (result__snippet). Redirect URLs are decoded from the uddg query parameter.
Google
browsy fetches https://www.google.com/search?q=<query>&num=10. Results are extracted using a structural pattern: anchor tags containing an <h3> descendant. The title comes from the h3 text, the URL from the anchor href (with /url?q= redirect decoding), and snippets from nearby div elements. The parser targets the #rso results container to skip ads and navigation.
Google results may be less reliable because Google actively detects and blocks automated requests. DuckDuckGo's HTML endpoint is specifically designed for non-JavaScript clients and is the recommended default.
Page Types Reference
browsy classifies every page into a PageType to help agents decide what to do next. The classification is based on structural heuristics applied to the Spatial DOM -- no machine learning, no external services.
Page types are evaluated in priority order. The first match wins.
PageType enum
#![allow(unused)] fn main() { pub enum PageType { Error, Captcha, Login, TwoFactorAuth, OAuthConsent, Inbox, EmailBody, Dashboard, Article, SearchResults, List, Search, Form, Other, // default } }
Detection criteria
| Page Type | Detection Criteria |
|---|---|
| Error | Title contains HTTP error codes (404, 500, 403, not found, error) OR page has elements with alert_type == "error". |
| Captcha | Title contains CAPTCHA keywords (captcha, verify you're human, robot, security check, just a moment, attention required) OR heading contains CAPTCHA phrases OR a CAPTCHA service (reCAPTCHA, hCaptcha, Turnstile, Cloudflare challenge) is detected in the HTML structure. |
| Login | Page has a visible <input type="password">. |
| TwoFactorAuth | Title or heading contains verification keywords (verification, enter code, security code, 2fa, two-factor, otp, one-time, passcode) AND page has a visible text/number/tel input. No password field present (that would be Login). |
| OAuthConsent | Title or heading contains OAuth keywords (authorize, allow access, grant permission, oauth, consent). |
| Inbox | Title contains inbox keywords (inbox, mail, messages) AND page has 10+ visible links. |
| EmailBody | Page text contains 3+ of the email markers: from:, to:, subject:, date:. |
| Dashboard | Title or heading contains dashboard keywords (dashboard, welcome back, overview) AND page has both a <nav> and <main> landmark. |
| Article | Page has 3+ headings AND enough long paragraphs (>100 chars). When the page has 20+ links, the threshold is 10 long paragraphs (vs 2 for low-link pages). Pages with 15+ headings must have a paragraph-to-heading ratio of at least 0.8 to distinguish articles (Wikipedia) from heading-heavy list pages (BBC News). |
| SearchResults | Page has a search input (visible or hidden) AND 8+ links AND search context: title/heading contains search-result keywords (search results, results for, search) OR URL contains search query parameters (?q=, ?query=, ?s=, ?search=, /search). |
| List | Page has 10+ visible links. Evaluated after Article and SearchResults. |
| Search | Page has a visible search input. Evaluated after List (many list pages have search bars in navigation). Also fires as a fallback when a page has fewer than 5 visible elements but has a hidden search input (common in JS-rendered search engines without JS execution). |
| Form | Page has 2+ visible data-entry inputs (excludes checkbox, radio, hidden, submit, button, and image inputs). |
| Other | Default when no heuristic matches. |
Evaluation order
The order matters. For example:
- A login page with a search bar in the nav is classified as Login (the password field check comes first), not Search.
- A search results page with many links is SearchResults, not List, because SearchResults is checked before List.
- An article with a search bar is Article, not Search, because Article is checked first.
- An error page with a login form is Error, because error checks come before Login.
Accessing page type
Rust
#![allow(unused)] fn main() { use browsy_core::output::PageType; let dom = browsy_core::parse(html, 1920.0, 1080.0); match dom.page_type { PageType::Login => println!("This is a login page"), PageType::Article => println!("This is an article"), _ => println!("Page type: {:?}", dom.page_type), } }
Python
page = browser.goto("https://example.com")
print(page.page_type()) # "Login", "Article", "Other", etc.
MCP
The page_info tool returns page_type as a string. The browse tool includes it in the JSON output format.
JSON serialization
PageType is serialized as a string. The field is omitted from JSON when the value is Other (via skip_serializing_if).
{
"page_type": "Login",
"title": "Sign In",
"url": "https://example.com/login"
}
Action Recipes Reference
browsy detects structured action patterns on each page and emits them as SuggestedAction variants. Each action provides element IDs that an agent can use directly with click, type_text, check, and select operations.
Actions are detected after page type classification. Multiple actions can coexist on a single page (a login page might also have a Search action for the nav bar and a CookieConsent action for a banner).
SuggestedAction enum
#![allow(unused)] fn main() { #[derive(Debug, Clone, Serialize, Deserialize)] #[serde(tag = "action")] pub enum SuggestedAction { Login { ... }, Register { ... }, Contact { ... }, FillForm { ... }, Search { ... }, EnterCode { ... }, Download { ... }, CaptchaChallenge { ... }, CookieConsent { ... }, Consent { ... }, SelectFromList { ... }, Paginate { ... }, } }
All actions are serialized with an "action" tag field for easy pattern matching.
Login
Detected when the page has a visible password input, a nearby text/email input, and a submit button.
{
"action": "Login",
"username_id": 5,
"password_id": 8,
"submit_id": 12,
"remember_me_id": 10
}
| Field | Type | Description |
|---|---|---|
username_id | u32 | Text or email input nearest to the password field (within 500px Y) |
password_id | u32 | The <input type="password"> element |
submit_id | u32 | Nearest submit button below the password field |
remember_me_id | Option<u32> | Checkbox with "remember" in its label or name |
When it fires: Page has a visible password input and a nearby username/email input. Does NOT fire if the page also has registration context (confirm password + registration keywords) -- Register takes priority in that case. When a page has both login and registration sections (like Hacker News), Login takes priority over Register.
Usage: The MCP login tool and Python browser.login() use this action internally. They type into username_id and password_id, then click submit_id.
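Driving the same recipe manually from Python looks roughly like this; the remember-me step is an optional extra, not something login() is documented to do:
# assumes: browser is a browsy.Browser with the login page loaded as `page`
login = next(a for a in page.suggested_actions() if a["action"] == "Login")

browser.type_text(login["username_id"], "user@example.com")
browser.type_text(login["password_id"], "secretpassword")
if login.get("remember_me_id") is not None:
    browser.check(login["remember_me_id"])
page = browser.click(login["submit_id"])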
Register
Detected on registration pages: password field plus either a confirm password field or registration keywords in the title/heading.
{
"action": "Register",
"email_id": 3,
"username_id": 4,
"password_id": 7,
"confirm_password_id": 9,
"name_id": 2,
"submit_id": 11
}
| Field | Type | Description |
|---|---|---|
email_id | Option<u32> | Email input |
username_id | Option<u32> | Username text input |
password_id | u32 | Primary password input |
confirm_password_id | Option<u32> | Second password input (confirm) |
name_id | Option<u32> | Full name text input |
submit_id | u32 | Submit button |
When it fires: Page has a visible password field AND either (a) two or more password fields (confirm password pattern) or (b) title/heading contains registration keywords (register, sign up, signup, create account, join, new account). Does not fire when login keywords are present alongside confirm password (dual login/register pages prefer Login).
Contact
Detected on contact forms: a textarea (message body) plus contact-related context in the title or headings.
{
"action": "Contact",
"name_id": 2,
"email_id": 4,
"message_id": 6,
"submit_id": 8
}
| Field | Type | Description |
|---|---|---|
name_id | Option<u32> | Name input |
email_id | Option<u32> | Email input |
message_id | u32 | Textarea element |
submit_id | u32 | Submit button |
When it fires: Page has a visible textarea AND title/heading contains contact keywords (contact us, contact form, get in touch, reach out, send us a message, inquiry).
FillForm
Generic form action for pages classified as Form that don't match a more specific pattern (Login, Register, Contact, Search).
{
"action": "FillForm",
"fields": [
{ "id": 3, "label": "First Name", "name": "first_name", "type": "text" },
{ "id": 5, "label": "Email Address", "name": "email", "type": "email" },
{ "id": 7, "label": "Phone", "name": "phone", "type": "tel" }
],
"submit_id": 10
}
| Field | Type | Description |
|---|---|---|
| fields | Vec<FormField> | Visible data-entry fields with labels |
| submit_id | u32 | Submit button |
Each FormField contains:
| Field | Type | Description |
|---|---|---|
| id | u32 | Element ID |
| label | Option<String> | Associated label text (from <label> or placeholder) |
| name | Option<String> | HTML name attribute |
| input_type | Option<String> | Input type attribute |
When it fires: Page type is Form (2+ data-entry inputs) AND no more specific form action (Login, Register, Contact, Search) was already detected.
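A sketch of driving a generic FillForm action, under the same assumed type_text/click signatures as the Login example above; value_for is a hypothetical helper that maps a field's label or name to the value the agent wants to enter.

if let Some(SuggestedAction::FillForm { fields, submit_id, .. }) = dom
    .suggested_actions
    .iter()
    .find(|a| matches!(a, SuggestedAction::FillForm { .. }))
{
    for field in fields {
        // value_for is hypothetical: decide what to type based on the label/name.
        if let Some(value) = value_for(field.label.as_deref(), field.name.as_deref()) {
            session.type_text(field.id, &value)?;
        }
    }
    session.click(*submit_id)?;
}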
Search
Detected when a search input is present on the page.
{
"action": "Search",
"input_id": 15,
"submit_id": 16
}
| Field | Type | Description |
|---|---|---|
| input_id | u32 | Search input element |
| submit_id | u32 | Submit button |
When it fires: Page has an input matching search criteria: type="search", role="searchbox", name="q", name contains "search", or placeholder contains "search". Prefers visible inputs but falls back to hidden ones (for JS-rendered search engines).
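Using the action from Rust follows the same pattern as Login: type into input_id, then click submit_id (same assumed session method signatures as above).

if let Some(SuggestedAction::Search { input_id, submit_id, .. }) = dom
    .suggested_actions
    .iter()
    .find(|a| matches!(a, SuggestedAction::Search { .. }))
{
    session.type_text(*input_id, "spatial dom")?;
    session.click(*submit_id)?;
}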
EnterCode
Detected on verification/2FA pages with code-related context.
{
"action": "EnterCode",
"input_id": 4,
"submit_id": 6,
"code_length": 6
}
| Field | Type | Description |
|---|---|---|
| input_id | u32 | Code input element (first input if multiple narrow digit inputs) |
| submit_id | u32 | Submit button |
| code_length | Option<usize> | Expected code length (set when 4-8 narrow inputs are detected) |
When it fires: Title or heading contains verification keywords AND the page has a visible text/number/tel input. Does not fire if a password field is present (that case is treated as Login). Detects separate-digit inputs (width < 60px, 4-8 inputs) and reports the code length.
Usage: The MCP enter_code tool and Python browser.enter_code() use this action internally.
Download
Detected when the page has links or buttons with download-related text or file extension hrefs.
{
"action": "Download",
"items": [
{ "id": 20, "text": "Download v2.1.0", "href": "https://example.com/release.zip" },
{ "id": 22, "text": "Download PDF", "href": "https://example.com/guide.pdf" }
]
}
| Field | Type | Description |
|---|---|---|
| items | Vec<DownloadItem> | Downloadable links/buttons |
Each DownloadItem contains:
| Field | Type | Description |
|---|---|---|
| id | u32 | Element ID |
| text | Option<String> | Link/button text |
| href | Option<String> | Download URL |
When it fires: Page has visible links or buttons where the text starts with "download" (and is short) or the href ends with a known file extension (.zip, .tar.gz, .dmg, .exe, .msi, .deb, .rpm, .pkg, .appimage, .pdf, .csv, .xlsx).
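An agent that only wants a particular file type can filter items by href. A minimal sketch using the fields documented above (assumes dom is the current SpatialDom):

if let Some(SuggestedAction::Download { items, .. }) = dom
    .suggested_actions
    .iter()
    .find(|a| matches!(a, SuggestedAction::Download { .. }))
{
    if let Some(pdf) = items
        .iter()
        .find(|item| item.href.as_deref().map_or(false, |h| h.ends_with(".pdf")))
    {
        println!("PDF download at element {}: {:?}", pdf.id, pdf.href);
    }
}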
CaptchaChallenge
Detected when a CAPTCHA service is found in the HTML structure or the page is classified as Captcha.
{
"action": "CaptchaChallenge",
"captcha_type": "ReCaptcha",
"sitekey": "6Le-wvkSAAAAABx7...",
"submit_id": 15
}
| Field | Type | Description |
|---|---|---|
| captcha_type | CaptchaType | Type of CAPTCHA detected |
| sitekey | Option<String> | Site key from the data-sitekey attribute |
| submit_id | Option<u32> | Submit/verify button |
When it fires: Page has a captcha field set (detected CAPTCHA service in HTML) OR page type is Captcha. See CAPTCHA Detection for details.
CookieConsent
Detected when the page has a cookie notice with accept/reject buttons.
{
"action": "CookieConsent",
"accept_id": 50,
"reject_id": 52
}
| Field | Type | Description |
|---|---|---|
| accept_id | u32 | Accept/agree button |
| reject_id | Option<u32> | Reject button (not always present) |
When it fires: Page has a substantial text block (>30 chars) mentioning cookies/GDPR AND a button with accept-related text (accept all, accept cookies, allow cookies, allow all, agree, got it, i understand, i agree).
Consent
Detected on OAuth/authorization consent pages with approve/deny buttons.
{
"action": "Consent",
"approve_ids": [30],
"deny_ids": [32]
}
| Field | Type | Description |
|---|---|---|
| approve_ids | Vec<u32> | Approve/allow/authorize buttons |
| deny_ids | Vec<u32> | Deny/cancel/decline buttons |
When it fires: Title or heading contains OAuth keywords (authorize, allow access, grant permission, oauth, consent) AND the page has buttons with approve or deny text.
SelectFromList
Detected on pages with many links arranged in a list-like pattern.
{
"action": "SelectFromList",
"items": [10, 14, 18, 22, 26]
}
| Field | Type | Description |
|---|---|---|
| items | Vec<u32> | One link ID per row (the first link in each row group) |
When it fires: Page has 5+ visible links that form 5+ distinct rows (links within 30px Y are grouped into the same row). The action provides the first link ID from each row as representative items.
Paginate
Detected when the page has next/previous navigation links or numbered page links.
{
"action": "Paginate",
"next_id": 100,
"prev_id": 98
}
| Field | Type | Description |
|---|---|---|
| next_id | Option<u32> | Next page link |
| prev_id | Option<u32> | Previous page link |
When it fires: Page has links with pagination text (next, prev, previous, >, >>, <, <<, and Unicode equivalents).
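Walking a paginated listing then becomes a simple loop: process the current page, follow next_id until it disappears. A sketch under the same assumed click(id) -> Result<SpatialDom> signature used earlier:

let mut dom = session.goto("https://example.com/results?page=1")?;
loop {
    // ... read dom.els for the current page here ...
    let next = dom.suggested_actions.iter().find_map(|a| match a {
        SuggestedAction::Paginate { next_id: Some(id), .. } => Some(*id),
        _ => None,
    });
    match next {
        Some(id) => dom = session.click(id)?, // assumed: click returns the new SpatialDom
        None => break,
    }
}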
Detection order
Actions are detected in this order:
- Register (or Login if no registration context)
- EnterCode
- Consent
- Contact
- Search
- SelectFromList
- CookieConsent
- Paginate
- FillForm (only if no more specific form action exists)
- Download
- CaptchaChallenge
Multiple actions can coexist. A login page with a cookie banner and nav search bar will have Login, CookieConsent, and Search actions simultaneously.
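Because actions coexist, an agent will often want to clear blocking UI before touching the main form. A sketch, again with the assumed click(id) -> Result<SpatialDom> signature:

// Dismiss a cookie banner first, if one was detected on this page.
if let Some(SuggestedAction::CookieConsent { accept_id, .. }) = dom
    .suggested_actions
    .iter()
    .find(|a| matches!(a, SuggestedAction::CookieConsent { .. }))
{
    dom = session.click(*accept_id)?;
}
// ...then look up Login or Search again in the refreshed dom.suggested_actions.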
CAPTCHA Detection
browsy detects CAPTCHAs from HTML structure alone -- no rendering, no image analysis, no JavaScript execution. Detection works by scanning the raw DOM tree for known CAPTCHA service indicators before the Spatial DOM is generated.
CaptchaType enum
pub enum CaptchaType {
    ReCaptcha,           // Google reCAPTCHA v2 or v3
    HCaptcha,            // hCaptcha
    Turnstile,           // Cloudflare Turnstile
    CloudflareChallenge, // Cloudflare JS challenge ("Just a moment...")
    ImageGrid,           // Custom image-grid CAPTCHA ("select all images containing...")
    TextCaptcha,         // Text-based CAPTCHA (type characters from an image)
    Unknown,             // CAPTCHA detected but type not identified
}
Detection signals
browsy scans the layout tree for these patterns:
Script sources
| Pattern | Detected as |
|---|---|
| src contains recaptcha or google.com/recaptcha | ReCaptcha |
| src contains hcaptcha.com | HCaptcha |
| src contains challenges.cloudflare.com/turnstile | Turnstile |
Iframe sources
| Pattern | Detected as |
|---|---|
| src contains recaptcha or google.com/recaptcha | ReCaptcha |
| src contains hcaptcha.com or newassets.hcaptcha.com | HCaptcha |
Div classes
| Pattern | Detected as |
|---|---|
| Class contains g-recaptcha | ReCaptcha |
| Class contains h-captcha | HCaptcha |
| Class contains cf-turnstile | Turnstile |
Div IDs
| Pattern | Detected as |
|---|---|
| ID contains challenge-running or cf-challenge | CloudflareChallenge |
Site key
Any element with a data-sitekey attribute has its value captured. This attribute is used by reCAPTCHA, hCaptcha, and Turnstile to embed the site key.
Title and heading keywords
Page type detection checks title and headings for CAPTCHA-related phrases. These trigger PageType::Captcha even without a known CAPTCHA service:
Title keywords: captcha, verify you're human, verify you are human, robot, security check, challenge, just a moment, attention required, are you human
Heading keywords: captcha, verify you're human, security check, are you human, complete the challenge, human verification
CaptchaInfo struct
pub struct CaptchaInfo {
    pub captcha_type: CaptchaType,
    pub sitekey: Option<String>,
}
The sitekey is populated when a data-sitekey attribute is found. It is the value needed by third-party CAPTCHA solving services.
CaptchaChallenge action
When a CAPTCHA is detected, the CaptchaChallenge suggested action is emitted:
SuggestedAction::CaptchaChallenge {
    captcha_type: CaptchaType,
    sitekey: Option<String>,
    submit_id: Option<u32>,
}
The submit_id is the nearest verify/submit/continue button, if one exists. When no known CAPTCHA service is detected but the page is classified as Captcha, browsy infers the type:
- 4+ image buttons on the page: ImageGrid
- Otherwise: Unknown
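The parse-level CaptchaInfo is also available without a Session: after a plain parse it sits on SpatialDom.captcha (see the detection pipeline below). A minimal check, assuming html holds the page source:

let dom = browsy_core::parse(html, 1920.0, 1080.0);

if let Some(info) = &dom.captcha {
    println!("CAPTCHA: {:?}", info.captcha_type);
    if let Some(key) = &info.sitekey {
        println!("sitekey: {key}");
    }
}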
Session methods
Rust
let mut session = Session::new()?;
let dom = session.goto("https://example.com")?;

// Check if the current page is a CAPTCHA
if session.is_captcha() {
    println!("CAPTCHA detected!");
}

// Get CAPTCHA details
if let Some(info) = session.captcha_info() {
    println!("Type: {:?}", info.captcha_type);
    if let Some(ref key) = info.sitekey {
        println!("Site key: {}", key);
    }
}
Python
browser = Browser()
page = browser.goto("https://example.com")

if page.page_type() == "Captcha":
    for action in page.suggested_actions():
        if action["action"] == "CaptchaChallenge":
            print(f"Type: {action['captcha_type']}")
            print(f"Site key: {action.get('sitekey')}")
MCP behavior
When the browse or click tools return a page detected as Captcha, the output is prefixed with a warning:
CAPTCHA detected (ReCaptcha) -- this page requires human verification to proceed.
The page_info tool includes the full CAPTCHA information:
{
"page_type": "Captcha",
"captcha": {
"captcha_type": "ReCaptcha",
"sitekey": "6Le-wvkSAAAA..."
},
"suggested_actions": [
{
"action": "CaptchaChallenge",
"captcha_type": "ReCaptcha",
"sitekey": "6Le-wvkSAAAA...",
"submit_id": 15
}
]
}
What browsy cannot do
browsy detects and classifies CAPTCHAs. It does not solve them. When a CAPTCHA is encountered, the agent has several options:
- Human-in-the-loop: Surface the CAPTCHA to a human operator.
- Third-party solver: Pass the captcha_type and sitekey to a CAPTCHA solving service (2captcha, Anti-Captcha, etc.), receive the solution token, and inject it.
- Alternative approach: Try a different URL, use an API instead of the web interface, or skip the blocked resource.
- Wait and retry: Some Cloudflare challenges resolve after a delay.
The sitekey in the CaptchaInfo is the value that third-party solving services typically require.
Detection pipeline
CAPTCHA detection happens at three stages:
- Tree scan (detect_captcha_from_tree): Before the Spatial DOM is generated, the layout tree is scanned for CAPTCHA service indicators (script/iframe sources, div classes/IDs, data-sitekey). This produces the CaptchaInfo stored on SpatialDom.captcha.
- Page type classification (detect_page_type): After the Spatial DOM is built, the page type heuristic checks for CAPTCHA signals: title keywords, heading keywords, and the presence of captcha on the SpatialDom. If any signal matches, the page is classified as PageType::Captcha.
- Action detection (detect_captcha_challenge_action): If captcha is set or the page type is Captcha, the CaptchaChallenge action is emitted with the type, sitekey, and submit button.
CSS Engine
browsy includes a CSS engine built from scratch in Rust. It handles selector matching, property parsing, variable resolution, calc() expressions, @media queries, and specificity ordering. The engine computes the subset of CSS properties needed for layout -- approximately 40 properties that affect bounding box computation.
Architecture
HTML ──> DomNode tree
│
├── <style> blocks ──> parse_stylesheet() ──> Vec<CssRule>
├── External <link> CSS ──> fetched + parse_stylesheet()
├── Inline style="" ──> parse_inline_style_with_vars()
│
└── compute_styles() ──> StyledNode tree (LayoutStyle per node)
│
└── Taffy layout ──> bounding boxes
Style computation walks the DOM tree, matching each element against all CSS rules by specificity. Inline styles override stylesheet rules. CSS custom properties (--var) inherit through the tree.
Selector matching
The selector engine supports these selector types:
| Selector | Example | Description |
|---|---|---|
| Tag | div, button | Matches element tag name |
| Class | .nav-item | Matches class attribute |
| ID | #header | Matches id attribute |
| Universal | * | Matches any element |
| Descendant | div p | Matches p inside any div ancestor |
| Child | div > p | Matches p that is a direct child of div |
| Pseudo-class | :hover, :first-child | Parsed but ignored for layout (no interaction state) |
| Attribute (exists) | [disabled] | Element has the attribute |
| Attribute (exact) | [type="submit"] | Attribute equals value |
| Attribute (word) | [class~="active"] | Whitespace-separated word match |
| Attribute (prefix) | [href^="/"] | Attribute starts with value |
| Attribute (suffix) | [src$=".png"] | Attribute ends with value |
| Attribute (contains) | [class*="btn"] | Attribute contains substring |
| Attribute (hyphen-prefix) | [lang|="en"] | Exact match or prefix with hyphen |
| Comma-separated | h1, h2, h3 | Union of selectors |
Specificity
Selectors are ordered by CSS specificity rules:
- ID selectors: weight 100
- Class selectors, attribute selectors, pseudo-classes: weight 10
- Tag selectors, universal: weight 1
Higher specificity rules override lower specificity rules. Equal specificity resolves by source order (later wins). Inline styles always win over stylesheet rules.
Property parsing
Supported properties
The engine parses approximately 40 layout-affecting CSS properties:
| Category | Properties |
|---|---|
| Box model | display, box-sizing, width, height, min-width, min-height, max-width, max-height |
| Spacing | margin (+ sides), padding (+ sides), border-width (+ sides) |
| Position | position, top, right, bottom, left |
| Flexbox | flex-direction, flex-wrap, flex-grow, flex-shrink, flex-basis, align-items, align-self, justify-content, gap |
| Grid | grid-template-columns, grid-template-rows, grid-column, grid-row |
| Typography | font-size, line-height |
| Visibility | visibility, overflow |
Shorthand properties are expanded: margin: 10px 20px expands to margin-top, margin-right, margin-bottom, margin-left. Similarly for padding, border-width, flex, and gap.
Dimension types
pub enum Dimension {
    Px(f32),        // Absolute pixels
    Percent(f32),   // Percentage of parent
    Calc(f32, f32), // calc() result: (px_component, percent_component)
    Auto,           // Auto sizing
}
The engine resolves em values against the element's computed font-size and rem values against the root font size (16px default).
var() resolution
CSS custom properties are collected during style computation and inherited through the DOM tree:
:root {
--primary-color: #333;
--spacing: 16px;
}
.container {
padding: var(--spacing);
color: var(--primary-color);
}
.card {
margin: var(--spacing-large, 24px); /* fallback value */
}
The var() resolver supports:
- Simple references: var(--name)
- Fallback values: var(--name, fallback)
- Nested var() references in fallbacks
calc() expressions
The calc() parser handles full arithmetic expressions with mixed units:
.element {
width: calc(100% - 32px);
margin: calc(16px + 1em);
padding: calc(2 * var(--spacing));
}
Supported operators: +, -, *, /. The parser respects operator precedence and handles parenthesized sub-expressions. Mixed px and % units are preserved as a Calc(px, percent) dimension and resolved during layout.
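Resolution of a mixed calc() value follows directly from the Calc(px, percent) representation above: take the percent component of the parent size, then add the pixel component. A minimal sketch of that arithmetic, not the engine's actual code:

// width: calc(100% - 32px) is stored as Calc(-32.0, 100.0) -- (px, percent).
fn resolve_calc(px: f32, percent: f32, parent: f32) -> f32 {
    parent * (percent / 100.0) + px
}

// With an 800px-wide parent: 800 * 1.0 + (-32) = 768px.
assert_eq!(resolve_calc(-32.0, 100.0, 800.0), 768.0);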
@media queries
The engine evaluates @media queries against the viewport dimensions provided at parse time:
@media (max-width: 768px) {
.sidebar { display: none; }
}
@media screen and (min-width: 1024px) {
.container { max-width: 1200px; }
}
Supported media features
| Feature | Example | Description |
|---|---|---|
| min-width | (min-width: 768px) | Viewport width >= value |
| max-width | (max-width: 1024px) | Viewport width <= value |
| min-height | (min-height: 600px) | Viewport height >= value |
| max-height | (max-height: 900px) | Viewport height <= value |
| width | (width: 1920px) | Exact viewport width |
| height | (height: 1080px) | Exact viewport height |
| orientation | (orientation: portrait) | Portrait or landscape |
| screen | screen | Always matches |
| print | print | Never matches |
| all | all | Always matches |
Multiple conditions joined with and are evaluated conjunctively. The screen and / all and prefix is stripped before evaluating conditions.
External stylesheets
When using the fetch feature (enabled by default), browsy automatically fetches external CSS linked via <link rel="stylesheet"> tags. Fetched CSS is parsed and merged with inline <style> blocks during style computation.
Resource limits prevent abuse:
- Maximum total CSS bytes (across all external stylesheets)
- Maximum bytes per individual stylesheet
- Blocked URL patterns (analytics, tracking, ad-related CSS)
- Private network and non-HTTP URL blocking
Layout engine
After style computation, browsy feeds the styled tree into Taffy (from the Dioxus project) for layout computation. Taffy handles:
- Flexbox: All flex container and flex item properties
- CSS Grid: Template columns/rows, explicit placement
- Block layout: Standard block flow with margins, padding, borders
Taffy returns bounding boxes (x, y, width, height) for every element, which browsy uses to build the Spatial DOM.
What is NOT supported
The CSS engine focuses on properties that affect element position and size. The following are intentionally not implemented:
- Visual properties: color, background, border-color, border-radius, box-shadow, opacity, z-index
- Transforms: transform, translate, rotate, scale
- Animations: animation, transition, @keyframes
- Pseudo-elements: ::before, ::after, ::placeholder (no content generation)
- Advanced selectors: :nth-child(), :not(), ~ (general sibling), + (adjacent sibling)
- Advanced grid: grid-auto-flow, grid-auto-rows, named grid areas, minmax() in some contexts
- Columns: column-count, column-width
- Table layout: table-layout, border-collapse
These omissions are by design. browsy computes where elements are and how large they are, not what they look like. The Spatial DOM output contains position and size data; color and visual styling are irrelevant for agent interaction.
Architecture
browsy is a zero-render browser engine. It converts raw HTML into a flat list of interactive and text elements with bounding boxes, page type classification, and suggested actions -- without rendering pixels or executing JavaScript.
Pipeline
HTML
│
├──────────────────────────────────────────────────────────────────┐
v │
DOM Parser (html5ever) │
│ │
v │
DomNode tree ──> External CSS fetch (reqwest) ──> merged CSS text │
│ │
v │
CSS Engine (browsy) │
├── Selector matching (tag, class, ID, attribute, combinators) │
├── Property parsing (var(), calc(), shorthands) │
├── @media query evaluation │
└── Specificity + cascade ordering │
│ │
v │
StyledNode tree (LayoutStyle per element) │
│ │
v │
Layout Engine (Taffy) │
├── Flexbox │
├── CSS Grid │
└── Block flow │
│ │
v │
LayoutNode tree (with bounding boxes) │
│ │
v │
Spatial DOM Generator (browsy) │
├── Element emission (interactive + text + landmark + img) │
├── CAPTCHA detection (from tree scan) │
├── Deduplication (wrapper skip) │
├── Hidden content preservation │
├── Text fallback chain (aria-label > title > img alt > svg title) │
├── Label association (<label for="id">) │
├── URL resolution (relative -> absolute) │
├── Page type classification │
└── Suggested action detection │
│ │
v │
SpatialDom │
├── els: Vec<SpatialElement> (flat list with IDs + bounds) │
├── page_type: PageType │
├── suggested_actions: Vec<SuggestedAction> │
├── captcha: Option<CaptchaInfo> │
└── title, url, viewport, scroll │
Entry point
The primary entry point is browsy_core::parse:
pub fn parse(html: &str, viewport_width: f32, viewport_height: f32) -> SpatialDom {
    let dom_tree = dom::parse_html(html);
    let styled = css::compute_styles_with_viewport(&dom_tree, viewport_width, viewport_height);
    let laid_out = layout::compute_layout(&styled, viewport_width, viewport_height);
    output::generate_spatial_dom(&laid_out, viewport_width, viewport_height)
}
For network-aware usage, Session::goto() fetches the HTML, resolves external CSS, and runs the full pipeline.
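A minimal end-to-end sketch of that network path, using the same Session API shown in the CAPTCHA section (error handling elided):

let mut session = Session::new()?;
let dom = session.goto("https://news.ycombinator.com")?;

println!("page type: {:?}", dom.page_type);
for el in dom.els.iter().take(10) {
    println!("[{}:{} {:?}]", el.id, el.tag, el.text);
}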
Project structure
crates/
core/ browsy-core library (the engine)
src/
lib.rs Entry point: parse(html, w, h) -> SpatialDom
dom/mod.rs HTML -> DomNode tree (thin wrapper around html5ever)
css/
mod.rs Style computation, CSS variable inheritance
selector.rs CSS selector matching engine
properties.rs CSS property parsing, var() resolution, calc()
layout/mod.rs Style tree -> Taffy -> bounding boxes
output/mod.rs SpatialDom generation, page type, actions, CAPTCHA
js/mod.rs Behavior detection from HTML attributes
fetch/
mod.rs HTTP fetching, form extraction, resource blocking
session.rs Session API, search, navigation, form interaction
tests/
css_layout.rs CSS + layout integration tests
output.rs Spatial DOM output tests
benchmark.rs Detection accuracy benchmark runner
corpus/ HTML snapshots with ground truth labels
cli/ browsy CLI binary
src/main.rs fetch and parse commands
mcp/ browsy MCP server
src/
lib.rs MCP tool definitions (14 tools)
main.rs stdio server entry point
python/ Python bindings (PyO3)
src/lib.rs Browser, Page, Element classes
browsy/__init__.py Python module
What is ours vs external
browsy depends on two external crates for foundational work:
| Crate | Role | What it does |
|---|---|---|
| html5ever (Mozilla/Servo) | HTML parsing | Converts raw HTML into a DOM tree. Handles malformed HTML, character encoding, and the full HTML5 parsing algorithm. |
| Taffy (Dioxus) | Layout computation | Computes bounding boxes from a style tree. Handles Flexbox, CSS Grid, and block layout. |
Everything else is built from scratch in browsy:
| Component | Description |
|---|---|
| CSS selector matching | Tag, class, ID, attribute selectors (7 operator types), descendant/child combinators, specificity ordering |
| CSS property parsing | ~40 layout properties, shorthand expansion, var() resolution with fallbacks, calc() with full expression parser |
| CSS variables | Custom property collection, inheritance through DOM tree |
| @media queries | min-width, max-width, min-height, max-height, orientation, screen/print |
| Spatial DOM output | Element emission, deduplication, landmark markers, text fallback chains, hidden content exposure, alert detection, table extraction |
| Page intelligence | Page type classification (14 types), suggested action detection (12 action types), CAPTCHA detection (7 CAPTCHA types), pagination detection, verification code extraction |
| Session API | Cookie persistence, navigation history, form state overlay, form submission, compound actions (login, enter_code) |
| Web search | DuckDuckGo and Google result parsing |
| Behavior detection | onclick/ARIA/Bootstrap pattern inference from HTML attributes |
Key design decisions
Hidden content exposure
Elements with display:none, visibility:hidden, aria-hidden="true", or the hidden attribute are NOT discarded. They appear in the Spatial DOM with hidden: true. This is intentional -- agents need to see dropdown menus, accordion panels, modal dialogs, tab content, and other JS-toggled content that is present in the HTML but not visible without JavaScript execution.
Landmark markers
HTML5 landmarks (<nav>, <header>, <footer>, <main>, <aside>, <section>, <form>) and elements with explicit landmark ARIA roles are emitted as structural markers with their role only -- no recursive text collection. Their children carry the actual content. This prevents a <nav> from emitting a giant concatenated string of all its link texts.
Text fallback chain
Interactive elements (links, buttons) that contain no text but only images or icons get their text from a fallback chain:
- aria-label attribute
- title attribute
- Child <img alt> text
- Child <svg><title> text
This ensures that icon-only buttons like a hamburger menu or close button have accessible text in the Spatial DOM.
SVG handling
SVG child elements are not emitted (they are visual, not semantic). However, <svg><title> text is extracted and stored as the SVG element's aria-label, making it available through the text fallback chain.
Deduplication
Wrapper elements that only wrap a single interactive child (like <li><a>..., <td><span>..., <p><a>...) are skipped. Only the meaningful child element is emitted. This prevents duplicate text in the output. When a wrapper has its own text that would not be captured by the child, it is emitted with only its own text.
Zero-size skip
Visible elements with zero width and height are skipped as layout artifacts. Hidden elements are always preserved regardless of size.
Element ID assignment
Element IDs are assigned sequentially (1, 2, 3, ...) during a single parse. IDs are NOT stable across page loads -- they are positional, not content-based. The delta diff system uses content keys (tag + text + href + bounds) rather than IDs to match elements across page transitions.
Testing
Integration tests
Tests live in crates/core/tests/ as integration tests:
cargo test -p browsy-core # all tests
cargo test -p browsy-core --test css_layout # CSS + layout
cargo test -p browsy-core --test output # Spatial DOM output
Detection benchmark
The crates/core/tests/corpus/ directory contains HTML snapshots of real websites with ground truth labels in manifest.json. The benchmark runner parses every snapshot and verifies:
- Correct page type classification
- Correct suggested action detection
- Valid element IDs in all actions (referencing real elements)
- Verification code extraction accuracy
cargo test -p browsy-core --test benchmark -- --nocapture
Adding a new test case:
- Harvest an HTML snapshot with the HARVEST_URL and HARVEST_NAME environment variables.
- Add the expected labels to corpus/manifest.json.
- Run the benchmark to confirm the failure.
- Fix the heuristics in output/mod.rs.
- Re-run the benchmark to confirm the fix with no regressions.
Output formats
JSON
Full structured output via serde_json. All optional fields use skip_serializing_if to keep the JSON compact.
Compact text
A minimal text format designed for LLM token efficiency:
[1:h1 "Page Title"]
[!2:div "Hidden content"]
[3:input:email [email] [*] "Enter email" wide]
[4:button "Submit" full]
[5:a "Link" ->https://example.com @top-R]
Each element is one line: [id:tag "text"] with annotations for type, name, state, size, href, and position.
Delta format
For page transitions, the delta format shows only what changed:
-[3,5,7]
[+8:h1 "New Heading"]
[+9:a "New Link" ->https://example.com]
Removed element IDs are prefixed with -, added/changed elements with +.