
Skill: Retrieve Site

The Retrieve Site skill fetches the HTML content of a public website and returns a sanitized version that agents can safely parse and reason about. This is commonly used when a user wants to reference an existing website as inspiration, extract content from a page, or analyze the structure of a competitor’s site.

Parameters:

  • url (string, required) — The full URL of the website to retrieve, including the protocol (e.g., https://example.com). Must be a publicly accessible HTTP or HTTPS URL.

A user might trigger this skill by saying:

“Take a look at https://example-bakery.com and use it as inspiration for my site.”

The agent invokes the skill with:

  • url: https://example-bakery.com

The skill fetches the page, sanitizes the HTML, and returns it to the agent. The agent can then analyze the structure, layout, and content to inform a subsequent Create Site call.

Another common scenario:

“Can you grab the text content from our company’s current homepage at https://acme-corp.com?”

The agent retrieves the page and extracts the relevant text content from the sanitized HTML to present to the user.
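The text-extraction step can be sketched with Python's standard-library HTML parser. This is an illustrative helper, not part of the skill itself; it assumes the input is the already-sanitized HTML returned by the skill (so no `<script>` contents remain to filter out):

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <style> contents.

    Scripts are assumed to be already stripped by the skill's sanitizer.
    """

    def __init__(self):
        super().__init__()
        self.parts = []
        self.in_style = False

    def handle_starttag(self, tag, attrs):
        if tag == "style":
            self.in_style = True

    def handle_endtag(self, tag):
        if tag == "style":
            self.in_style = False

    def handle_data(self, data):
        # Keep only non-whitespace runs of text outside <style> blocks.
        if not self.in_style and data.strip():
            self.parts.append(data.strip())


def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)
```

For example, feeding it a sanitized homepage yields a flat, space-joined string of the page's visible text, which the agent can then summarize or quote back to the user.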

The skill returns a JSON object containing the sanitized HTML:

{
  "url": "https://example-bakery.com",
  "html": "<!DOCTYPE html><html><head><title>Example Bakery</title>...</html>",
  "status_code": 200,
  "content_type": "text/html"
}
  • url — the URL that was fetched (after any redirects).
  • html — the sanitized HTML content of the page.
  • status_code — the HTTP status code returned by the target server.
  • content_type — the Content-Type header from the response.
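On the agent side, this response shape can be consumed with a small typed container. The class and function names below are illustrative, not part of the skill's published client:

```python
from dataclasses import dataclass


@dataclass
class RetrieveSiteResult:
    """Illustrative container mirroring the skill's JSON response fields."""

    url: str           # final URL after any redirects
    html: str          # sanitized HTML content
    status_code: int   # HTTP status from the target server
    content_type: str  # Content-Type header from the response


def parse_result(payload: dict) -> RetrieveSiteResult:
    # Index with [] so a missing required field fails loudly
    # instead of silently producing a partial result.
    return RetrieveSiteResult(
        url=payload["url"],
        html=payload["html"],
        status_code=payload["status_code"],
        content_type=payload["content_type"],
    )
```

Failing fast on missing fields keeps downstream steps (such as a Create Site call built from the HTML) from operating on incomplete data.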

This skill enforces strict protections against server-side request forgery (SSRF) attacks. The following restrictions are applied before any request is made:

  • Private IP ranges are blocked — requests to 10.x.x.x, 172.16.x.x-172.31.x.x, 192.168.x.x, 127.x.x.x, and 169.254.x.x are rejected.
  • Internal hostnames are blocked — requests to localhost, 0.0.0.0, and any hostname that resolves to a private IP are rejected.
  • Protocol restriction — only http:// and https:// protocols are allowed. file://, ftp://, gopher://, and other schemes are rejected.
  • DNS rebinding protection — the resolved IP address is validated after DNS resolution to prevent DNS rebinding attacks.
  • Redirect limits — a maximum of 5 redirects are followed. Each redirect target is re-validated against the same SSRF rules.
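The host-validation rules above can be sketched with Python's ipaddress module. This is a simplified illustration, not the skill's actual implementation; note in particular that a real client must also pin the validated IP address for the subsequent request, because re-resolving the hostname between the check and the fetch reopens the DNS-rebinding window:

```python
import ipaddress
import socket
from urllib.parse import urlparse


def is_safe_url(url: str) -> bool:
    """Reject URLs that could reach internal services (SSRF pre-check sketch)."""
    parsed = urlparse(url)
    # Protocol restriction: only http:// and https:// are allowed.
    if parsed.scheme not in ("http", "https"):
        return False
    host = parsed.hostname
    if host is None:
        return False
    try:
        # Resolve the hostname and validate every returned address.
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return False
    for info in infos:
        ip = ipaddress.ip_address(info[4][0])
        # Blocks 10.x, 172.16-31.x, 192.168.x (private), 127.x (loopback),
        # 169.254.x (link-local), plus reserved and unspecified addresses.
        if (ip.is_private or ip.is_loopback or ip.is_link_local
                or ip.is_reserved or ip.is_unspecified):
            return False
    return True
```

Checking every address returned by resolution (not just the first) matters because an attacker-controlled DNS record can interleave public and private IPs.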

The returned HTML is sanitized to remove potentially dangerous elements before being passed to the agent:

  • Script removal — all <script> tags and their contents are stripped.
  • Event handler removal — inline event handlers (onclick, onerror, etc.) are removed from all elements.
  • Iframe removal — <iframe> and <frame> elements are stripped.
  • External resource preservation — <link>, <img>, and other resource references are preserved but not fetched. The agent sees the references but does not load them.
Beyond sanitization, the following operational limits apply:

  • Timeout — the skill enforces a 15-second timeout on the HTTP request. Sites that do not respond within this window return a timeout error.
  • Size limit — responses larger than 5 MB are truncated; the agent receives the first 5 MB of HTML content.
  • Non-HTML content — if the URL returns a non-HTML Content-Type (e.g., JSON, PDF, image), the skill returns an error indicating that only HTML pages are supported.
  • Authentication — the skill does not support authenticated requests. Only publicly accessible pages can be retrieved.
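The script, event-handler, and iframe stripping described above can be sketched with a small html.parser subclass. This is a simplified illustration of the technique, not the skill's actual sanitizer (a production version would also handle comments, entity references, and malformed markup):

```python
from html.parser import HTMLParser

# Elements removed entirely, including their contents.
STRIPPED_TAGS = {"script", "iframe", "frame"}


class Sanitizer(HTMLParser):
    """Rebuilds HTML while dropping stripped elements and on* handlers."""

    def __init__(self):
        super().__init__()
        self.out = []
        self.skip_depth = 0  # > 0 while inside a stripped element

    def handle_starttag(self, tag, attrs):
        if tag in STRIPPED_TAGS:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        # Drop inline event handlers (onclick, onerror, ...), keep the rest.
        kept = [(k, v) for k, v in attrs if not k.lower().startswith("on")]
        attr_str = "".join(
            f' {k}="{v}"' if v is not None else f" {k}" for k, v in kept
        )
        self.out.append(f"<{tag}{attr_str}>")

    def handle_endtag(self, tag):
        if tag in STRIPPED_TAGS:
            if self.skip_depth:
                self.skip_depth -= 1
            return
        if not self.skip_depth:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)


def sanitize(html: str) -> str:
    parser = Sanitizer()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Note that resource references such as `<img src="...">` pass through untouched, matching the external-resource-preservation behavior above.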

POST /api/web/v1.0/retrieve-site

See the Skill Execution API for details on authentication and request format.
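A request to this endpoint carries the url parameter as a JSON body. The helper below only constructs the request pieces without sending them; the authentication header is a placeholder, since the real scheme is defined by the Skill Execution API:

```python
import json


def build_retrieve_site_request(url: str, token: str) -> tuple[str, dict, str]:
    """Builds (path, headers, body) for the Retrieve Site endpoint.

    The Bearer token header is a placeholder assumption; consult the
    Skill Execution API for the actual authentication format.
    """
    path = "/api/web/v1.0/retrieve-site"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {token}",  # placeholder auth scheme
    }
    body = json.dumps({"url": url})
    return path, headers, body
```

Any HTTP client can then POST the body to the path with those headers and parse the JSON response described above.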


Related skills: Create Site | Update Site | List Sites