fetch_via_proxy

Fetch a URL through Octocom's HTTP proxy (stable egress IP) instead of the Azure sandbox. Available in product-sync parsers.

fetch_via_proxy(url: str, *, headers: dict = None, timeout: int = 60, method: str = "GET", data: bytes = None) -> bytes

Routes the request through the same stable egress IP (74.242.171.127) that our backend uses for product-feed scraping. Returns the response body as bytes. Raises urllib.error.HTTPError on a non-2xx response, and also raises on network failure.

Scope. This helper is only available in product-sync parsers. It is not injected into custom actions, condition providers, event handlers, or sidebar widgets.


Why this exists

Parsers normally fetch via plain urllib.request.urlopen(), which goes out via the Azure sandbox's IP. That works for most public feeds, but some sources block well-known cloud IP ranges (Azure, AWS, GCP). You'll see signs like:

  • HTTPError 403
  • HTTPError 429
  • URLError <urlopen error timed out> on URLs that work in a browser
  • Empty bodies, captcha pages, or other "soft" blocks

When that happens, route through fetch_via_proxy. Sources are far more likely to allow Octocom's backend IP than a fresh Azure sandbox IP.
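
The symptoms listed above can be folded into a single check. The sketch below is one possible way to do that and is not part of the helper itself: `looks_blocked`, `SOFT_BLOCK_MARKERS`, and `BLOCK_STATUS_CODES` are hypothetical names, and the marker strings are assumptions that would need tuning per source.

```python
import urllib.error

# Assumed markers of a "soft" block page; adjust per source.
SOFT_BLOCK_MARKERS = (b"captcha", b"access denied")

# Status codes named in the list above as block signals.
BLOCK_STATUS_CODES = {403, 429}


def looks_blocked(body=None, error=None):
    """Return True if a direct fetch result looks like an IP block."""
    if error is not None:
        # Check HTTPError first: it is a subclass of URLError.
        if isinstance(error, urllib.error.HTTPError):
            return error.code in BLOCK_STATUS_CODES
        # Timeouts and connection failures surface as URLError.
        return isinstance(error, urllib.error.URLError)
    if not body:
        # An empty body from a feed URL is itself suspicious.
        return True
    # Only scan the start of the body; block pages are small HTML documents.
    head = body[:4096].lower()
    return any(marker in head for marker in SOFT_BLOCK_MARKERS)
```

A parser could call this on the result of the direct attempt and switch to fetch_via_proxy only when it returns True.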

If the download_via_proxy MCP tool could fetch the feed during parser authoring, this helper will work at sync time — they share the same egress IP.


Parameters

Name     Type                    Description
url      str                     Absolute URL to fetch.
headers  dict[str, str] or None  Optional request headers. Merged on top of a default User-Agent. Use this to pass Authorization or Accept.
timeout  int                     Per-request timeout in seconds. Defaults to 60. Bump to 120-200 for large XML feeds.
method   str                     HTTP method. Defaults to "GET".
data     bytes or None           Request body (for POST/PUT).
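
As an illustration of method, data, and headers used together, the sketch below sends a JSON body with POST. `post_json` is a hypothetical wrapper, the endpoint and payload are placeholders, and the `fetch` argument stands in for the injected fetch_via_proxy so the snippet can run outside a parser.

```python
import json


def post_json(fetch, url, payload, timeout=60):
    """POST a JSON payload through `fetch` and return the parsed JSON reply.

    `fetch` is expected to accept the fetch_via_proxy keyword arguments
    documented above (headers, timeout, method, data).
    """
    body = fetch(
        url,
        headers={
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
        timeout=timeout,
        method="POST",
        data=json.dumps(payload).encode("utf-8"),  # data must be bytes
    )
    return json.loads(body)


# Inside a product-sync parser this might be called as:
#   result = post_json(fetch_via_proxy, "https://example.com/api/search", {"page": 1})
```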

Returns

bytes: the response body. Decode it or write it to a file in the caller.
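
Because the return value is raw bytes, the caller decides how to interpret it. The following is a minimal sketch of two common cases, assuming UTF-8 text and an XML feed; `decode_text` and `count_items` are hypothetical helpers, not part of the API.

```python
import xml.etree.ElementTree as ET


def decode_text(body, encoding="utf-8"):
    """Decode a response body to str, replacing undecodable bytes."""
    return body.decode(encoding, errors="replace")


def count_items(body, tag="item"):
    """Parse an XML body and count elements named `tag`.

    ElementTree accepts bytes directly and honours the encoding
    declared in the XML prolog, so no manual decode is needed.
    """
    root = ET.fromstring(body)
    return sum(1 for _ in root.iter(tag))
```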


Policy

Default to urllib.request.urlopen(). It's simpler and one fewer hop. Switch to fetch_via_proxy only after seeing a blocked response. The standard pattern in the reference parsers is try direct, fall back to proxy:

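import urllib.error
import urllib.request


def _download(url, dest_path):
    headers = {"User-Agent": "Mozilla/5.0 (compatible; OctocomSync/1.0)"}
    try:
        # First attempt: direct from the sandbox IP, streamed to disk in 256 KiB chunks.
        req = urllib.request.Request(url, headers=headers)
        with urllib.request.urlopen(req, timeout=180) as r:
            with open(dest_path, "wb") as f:
                while True:
                    chunk = r.read(1024 * 256)
                    if not chunk:
                        break
                    f.write(chunk)
        return
    except (urllib.error.HTTPError, urllib.error.URLError):
        # Blocked or unreachable: fall through to the proxy.
        pass
    # Fallback: the proxy returns the whole body as bytes; "wb" overwrites any partial file.
    body = fetch_via_proxy(url, headers=headers, timeout=200)
    with open(dest_path, "wb") as f:
        f.write(body)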

When you already know the feed blocks the sandbox IP (e.g. you saw a 401/403 during MCP-based authoring), it's fine to skip the direct attempt and call fetch_via_proxy first — every cold start would otherwise pay a wasted round-trip.
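
One way to encode both orderings in a single helper is a flag that skips the direct attempt. This is a sketch under assumptions, not the reference parsers' code: `fetch_bytes`, `prefer_proxy`, and the injectable `direct` callable are hypothetical, and the `proxy` argument stands in for fetch_via_proxy.

```python
import urllib.error
import urllib.request


def _direct_fetch(url, headers, timeout):
    """Plain urllib fetch from the sandbox IP."""
    req = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(req, timeout=timeout) as r:
        return r.read()


def fetch_bytes(url, proxy, *, headers=None, timeout=60,
                prefer_proxy=False, direct=_direct_fetch):
    """Fetch `url`, trying direct first unless `prefer_proxy` is set.

    `proxy` should be fetch_via_proxy (or anything accepting the same
    url/headers/timeout arguments).
    """
    headers = headers or {}
    if not prefer_proxy:
        try:
            return direct(url, headers, timeout)
        except urllib.error.URLError:
            # HTTPError is a subclass of URLError, so this covers both.
            pass
    return proxy(url, headers=headers, timeout=timeout)
```

A feed known to block the sandbox would be fetched with prefer_proxy=True, so no sync pays for the doomed direct request.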


Examples

Authenticated XML feed

The origin returns 401 Unauthorized to anyone but a whitelisted Octocom IP. Route straight through the proxy and pass the basic-auth header:

AUTH_HEADER = "Basic <base64-credentials>"


def _download(url, dest_path):
    headers = {
        "User-Agent": "Mozilla/5.0 (compatible; OctocomSync/1.0)",
        "Authorization": AUTH_HEADER,
    }
    # The sandbox IP is not whitelisted, so skip the direct attempt
    # and go straight through the proxy.
    body = fetch_via_proxy(url, headers=headers, timeout=200)
    with open(dest_path, "wb") as f:
        f.write(body)

Paginated JSON API behind Cloudflare

The origin lets the proxy IP through but rate-limits the sandbox IP. The reference superhome parser uses this exact pattern:

import json
import urllib.error
import urllib.request


def _fetch_page(page_num):
    url = f"https://example.com/api/products?page={page_num}"
    headers = {
        "User-Agent": "Mozilla/5.0 (compatible; OctocomSync/1.0)",
        "Accept": "application/json",
    }
    try:
        # Direct attempt from the sandbox IP.
        req = urllib.request.Request(url, headers=headers)
        with urllib.request.urlopen(req, timeout=30) as r:
            body = r.read()
    except (urllib.error.HTTPError, urllib.error.URLError):
        # Rate-limited or blocked: retry the same request through the proxy.
        body = fetch_via_proxy(url, headers=headers, timeout=60)
    # json.loads accepts bytes, so no explicit decode is needed.
    return json.loads(body).get("Products", [])
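
The function above fetches a single page. A caller would typically loop until a page comes back empty. The sketch below assumes that an empty list marks the end of the data and that pages start at 1, neither of which is guaranteed for every API. `iter_products` and `max_pages` are hypothetical names, and `fetch_page` stands in for `_fetch_page`.

```python
def iter_products(fetch_page, start_page=1, max_pages=1000):
    """Yield products page by page until an empty page is returned.

    `max_pages` is a safety cap so a misbehaving API cannot loop forever.
    """
    for page_num in range(start_page, start_page + max_pages):
        products = fetch_page(page_num)
        if not products:
            return
        yield from products
```

In a parser this might be used as `for product in iter_products(_fetch_page): ...`.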
