fetch_via

Use this inside a product-sync parser to fetch a feed URL through Octocom's stable outbound IP when the source blocks the default one.

fetch_via_proxy(url: str, *, headers: dict = None, timeout: int = 60, method: str = "GET", data: bytes = None) -> bytes

Fetch a URL through Octocom's HTTP proxy. Routes the request through the same stable egress IP that our backend uses for product-feed scraping. Returns the response body as bytes. Raises urllib.error.HTTPError on a non-2xx response, and also raises on network failure.

Scope. This helper is only available in product-sync parsers. It is not injected into custom actions, condition providers, event handlers, or sidebar widgets — those contexts have proxy_request / proxy_session instead, which route through the same egress IP.

Why this exists

Parsers normally fetch via plain urllib.request.urlopen(), which goes out via our default outbound IP. That works for most public feeds, but some sources block well-known cloud IP ranges. You'll see signs like:

HTTPError 403
HTTPError 429
URLError <urlopen error timed out> on URLs that work in a browser
Empty bodies, captcha pages, or other "soft" blocks

When that happens, route through fetch_via_proxy. Sources are far more likely to allow Octocom's stable egress IP than our default outbound IP.

If the download_via_proxy MCP tool could fetch the feed during parser authoring, this helper will work at sync time — they share the same egress IP.
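
Hard failures raise, but the soft blocks listed above arrive as an ordinary 2xx response, so a parser has to inspect the body itself. A minimal sketch of such a check follows. The helper name looks_blocked and the marker strings are illustrative assumptions, not part of Octocom's API, and should be tuned to what the source actually serves.

```python
# Hypothetical helper: not part of Octocom's API.
# The marker list is an assumption; adjust it to the pages a given source serves.
SOFT_BLOCK_MARKERS = (b"captcha", b"access denied", b"attention required")


def looks_blocked(body: bytes) -> bool:
    """Return True if a 2xx response body looks like a soft block."""
    if not body.strip():
        # An empty (or whitespace-only) body is never a valid feed.
        return True
    # Only scan the first few KB so large feeds stay cheap to check.
    head = body[:4096].lower()
    return any(marker in head for marker in SOFT_BLOCK_MARKERS)
```

A True result can be treated like an HTTPError: discard the body and retry through fetch_via_proxy.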

Parameters

Name	Type	Description
`url`	`str`	Absolute URL to fetch.
`headers`	`dict[str, str]` or `None`	Optional request headers. Merged on top of a default `User-Agent`. Use this to pass `Authorization` or `Accept`.
`timeout`	`int`	Per-request timeout in seconds. Defaults to `60`. Bump to `120`–`200` for large XML feeds.
`method`	`str`	HTTP method. Defaults to `"GET"`.
`data`	`bytes` or `None`	Request body (for `POST`/`PUT`).
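
The method and data parameters are easiest to see in a POST. The sketch below sends a JSON payload and parses a JSON reply. The wrapper name post_json_via_proxy, the bearer-token auth, and the assumption that the endpoint speaks JSON are all hypothetical. fetch_via_proxy is passed in as the fetch argument so the function can also be exercised outside a parser.

```python
import json


def post_json_via_proxy(fetch, url, payload, *, token=None, timeout=60):
    """POST a JSON payload through the proxy and return the parsed JSON reply.

    `fetch` is the injected fetch_via_proxy callable, passed in explicitly
    so the function can be exercised outside a parser.
    """
    headers = {
        "Content-Type": "application/json",
        "Accept": "application/json",
    }
    if token is not None:
        # Assumption: the endpoint uses bearer-token auth.
        headers["Authorization"] = f"Bearer {token}"
    # `data` must be bytes, so serialise and encode the payload first.
    body = fetch(
        url,
        headers=headers,
        timeout=timeout,
        method="POST",
        data=json.dumps(payload).encode("utf-8"),
    )
    return json.loads(body)
```

Inside a parser the call would look like post_json_via_proxy(fetch_via_proxy, "https://example.com/api/search", {"page": 1}), where the URL is a placeholder.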

Returns

bytes — the response body. The whole body is returned in memory, so the caller decodes it or writes it to a file.
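
Because the return value is raw bytes, the caller chooses how to interpret it. The sketch below covers three common cases using only the standard library. The helper names and the UTF-8 default are assumptions for illustration.

```python
import json
import xml.etree.ElementTree as ET


def parse_json_body(body: bytes):
    # json.loads accepts bytes directly and detects UTF-8/16/32.
    return json.loads(body)


def parse_xml_body(body: bytes) -> ET.Element:
    # Passing bytes (not str) lets the parser honour the encoding
    # declared in the XML prolog.
    return ET.fromstring(body)


def decode_text_body(body: bytes, encoding: str = "utf-8") -> str:
    # Assumption: the feed is UTF-8 unless the source says otherwise.
    # errors="replace" keeps one bad byte from failing the whole sync.
    return body.decode(encoding, errors="replace")
```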

Policy

Default to urllib.request.urlopen(). It's simpler and one fewer hop. Switch to fetch_via_proxy only after seeing a blocked response. The standard pattern in the reference parsers is try direct, fall back to proxy:

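import urllib.error
import urllib.request


def _download(url, dest_path):
    headers = {"User-Agent": "Mozilla/5.0 (compatible; OctocomSync/1.0)"}
    try:
        # First attempt: direct fetch, streamed to disk in 256 KiB chunks.
        req = urllib.request.Request(url, headers=headers)
        with urllib.request.urlopen(req, timeout=180) as r:
            with open(dest_path, "wb") as f:
                while True:
                    chunk = r.read(1024 * 256)
                    if not chunk:
                        break
                    f.write(chunk)
        return
    except (urllib.error.HTTPError, urllib.error.URLError):
        # Blocked or unreachable: fall through to the proxy.
        pass
    # Fallback: fetch_via_proxy is injected into product-sync parsers.
    # Opening with "wb" overwrites any partial file from the direct attempt.
    body = fetch_via_proxy(url, headers=headers, timeout=200)
    with open(dest_path, "wb") as f:
        f.write(body)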

When you already know the feed blocks the default outbound IP (e.g. you saw a 401/403 during MCP-based authoring), it's fine to skip the direct attempt and call fetch_via_proxy first — every cold start would otherwise pay a wasted round-trip.

Examples

Authenticated XML feed

The origin returns 401 Unauthorized to anyone but a whitelisted Octocom IP. Route straight through the proxy and pass the basic-auth header:

AUTH_HEADER = "Basic <base64-credentials>"


def _download(url, dest_path):
    headers = {
        "User-Agent": "Mozilla/5.0 (compatible; OctocomSync/1.0)",
        "Authorization": AUTH_HEADER,
    }
    # The default outbound IP is known to get a 401, so skip the direct
    # attempt and go through the proxy.
    # fetch_via_proxy is injected into product-sync parsers.
    body = fetch_via_proxy(url, headers=headers, timeout=200)
    with open(dest_path, "wb") as f:
        f.write(body)

Paginated JSON API behind Cloudflare

The origin lets the proxy IP through but rate-limits the default outbound IP. The reference Example 3 parser uses this exact pattern:

import json
import urllib.error
import urllib.parse
import urllib.request


def _fetch_page(page_num):
    url = "https://example.com/api/products?" + urllib.parse.urlencode({"page": page_num})
    headers = {
        "User-Agent": "Mozilla/5.0 (compatible; OctocomSync/1.0)",
        "Accept": "application/json",
    }
    try:
        req = urllib.request.Request(url, headers=headers)
        with urllib.request.urlopen(req, timeout=30) as r:
            body = r.read()
    except (urllib.error.HTTPError, urllib.error.URLError):
        body = fetch_via_proxy(url, headers=headers, timeout=60)
    return json.loads(body).get("Products", [])
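
A per-page fetcher like _fetch_page still needs a loop around it. The sketch below is one way to drive it. The name iter_products, the empty-page stop condition, and the max_pages cap are assumptions, since the source does not say how this API signals the last page.

```python
def iter_products(fetch_page, *, start=1, max_pages=1000):
    """Yield products page by page until an empty page is returned.

    `fetch_page` is a callable like _fetch_page: page number in,
    list of products out.
    """
    # The max_pages cap guards against an API that never returns an empty page.
    for page_num in range(start, start + max_pages):
        products = fetch_page(page_num)
        if not products:
            # Assumption: an empty page means there is nothing left.
            return
        yield from products
```

In the parser this would be called as iter_products(_fetch_page), so each page keeps its own direct-then-proxy fallback.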
