fetch_via_proxy
Fetch a URL through Octocom's HTTP proxy (stable egress IP) instead of the Azure sandbox. Available in product-sync parsers.
fetch_via_proxy(url: str, *, headers: dict = None, timeout: int = 60, method: str = "GET", data: bytes = None) -> bytes

Fetch a URL through Octocom's HTTP proxy. Routes the request through the same stable egress IP (74.242.171.127) that our backend uses for product-feed scraping. Returns the response body as bytes. Raises on non-2xx (urllib.error.HTTPError) or network failure.
Scope. This helper is only available in product-sync parsers. It is not injected into custom actions, condition providers, event handlers, or sidebar widgets.
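A minimal usage sketch, assuming a parser that wants the feed as text and wants the status code when the origin refuses the request. The names fetch_text and fetch are hypothetical and not part of the helper's API. The fetch function is passed in as an argument so the snippet also runs outside the sandbox; inside a product-sync parser you would pass the injected fetch_via_proxy.

```python
import urllib.error


def fetch_text(fetch, url, *, encoding="utf-8", timeout=60):
    """Fetch `url` with `fetch` and decode the body.

    `fetch` is assumed to behave like fetch_via_proxy: it returns bytes on
    success and raises urllib.error.HTTPError on a non-2xx response.
    Returns (text, None) on success or (None, status_code) on an HTTP error.
    """
    try:
        body = fetch(url, headers={"Accept": "application/xml"}, timeout=timeout)
    except urllib.error.HTTPError as exc:
        # exc.code carries the HTTP status, e.g. 403 or 429.
        return None, exc.code
    # errors="replace" keeps one bad byte from aborting the whole decode.
    return body.decode(encoding, errors="replace"), None
```

Inside a parser the call would look like `text, status = fetch_text(fetch_via_proxy, "https://example.com/feed.xml")`, where the URL is a placeholder.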
Why this exists
Parsers normally fetch via plain urllib.request.urlopen(), which goes out via the Azure sandbox's IP. That works for most public feeds, but some sources block well-known cloud IP ranges (Azure, AWS, GCP). You'll see signs like:
- HTTPError 403
- HTTPError 429
- URLError <urlopen error timed out> on URLs that work in a browser
- Empty bodies, captcha pages, or other "soft" blocks
When that happens, route through fetch_via_proxy. Sources are far more likely to allow Octocom's backend IP than a fresh Azure sandbox IP.
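"Soft" blocks arrive as a successful response, so no exception is raised and the parser has to inspect the body itself. The sketch below is one possible heuristic, not something the helper provides: looks_blocked is a hypothetical name, and the marker strings are illustrative guesses that you would tune to the block pages you actually see.

```python
def looks_blocked(body: bytes) -> bool:
    """Heuristic check for a "soft" block: a 2xx response with no real feed in it."""
    if not body.strip():
        return True  # empty or whitespace-only body
    # Only scan the start of the body; block pages are short HTML documents.
    head = body[:2048].lower()
    # Illustrative markers only; this is not an exhaustive list.
    markers = (b"captcha", b"access denied", b"attention required")
    return any(marker in head for marker in markers)
```

A parser could treat `looks_blocked(body)` on a direct response the same way as an HTTPError and retry through fetch_via_proxy.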
If the download_via_proxy MCP tool could fetch the feed during parser authoring, this helper will work at sync time — they share the same egress IP.
Parameters
| Name | Type | Description |
|---|---|---|
| url | str | Absolute URL to fetch. |
| headers | dict[str, str] or None | Optional request headers. Merged on top of a default User-Agent. Use this to pass Authorization or Accept. |
| timeout | int | Per-request timeout in seconds. Defaults to 60. Bump to 120–200 for large XML feeds. |
| method | str | HTTP method. Defaults to "GET". |
| data | bytes or None | Request body (for POST/PUT). |
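To show how the parameters combine, here is a sketch that prepares the keyword arguments for an authenticated JSON POST. The names basic_auth_header and build_post_kwargs, the endpoint, the payload shape, and the 120-second timeout are all assumptions made for illustration; only the parameter names come from the table above.

```python
import base64
import json


def basic_auth_header(username: str, password: str) -> str:
    """Build the value of an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


def build_post_kwargs(payload: dict, username: str, password: str) -> dict:
    """Keyword arguments for a JSON POST through fetch_via_proxy."""
    return {
        "method": "POST",
        # `data` must be bytes, so serialize and encode the payload first.
        "data": json.dumps(payload).encode("utf-8"),
        "headers": {
            "Authorization": basic_auth_header(username, password),
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
        "timeout": 120,
    }
```

Inside a parser this would be used as `fetch_via_proxy("https://example.com/api/export", **build_post_kwargs({"page": 1}, "user", "secret"))`.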
Returns
bytes: the full response body, held in memory. The caller decodes it or writes it to a file.
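Because the return value is raw bytes, the caller decides how to parse it. As one possible approach for XML feeds, the sketch below wraps the bytes in io.BytesIO and walks them with xml.etree.ElementTree.iterparse, which avoids keeping the whole element tree alive. The function name iter_products and the product tag are assumptions about the feed's shape.

```python
import io
import xml.etree.ElementTree as ET


def iter_products(body: bytes, tag: str = "product"):
    """Yield one dict per matching element, mapping child tag to stripped text."""
    for _event, elem in ET.iterparse(io.BytesIO(body), events=("end",)):
        if elem.tag == tag:
            yield {child.tag: (child.text or "").strip() for child in elem}
            # Drop the element's children once they have been consumed.
            elem.clear()
```

The body itself is still fully in memory, so this only reduces the cost of the parsed tree, not of the download.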
Policy
Default to urllib.request.urlopen(). It's simpler and one fewer hop. Switch to fetch_via_proxy only after seeing a blocked response. The standard pattern in the reference parsers is try direct, fall back to proxy:

    import urllib.error
    import urllib.request

    def _download(url, dest_path):
        headers = {"User-Agent": "Mozilla/5.0 (compatible; OctocomSync/1.0)"}
        try:
            # First attempt: direct request from the sandbox, streamed to disk.
            req = urllib.request.Request(url, headers=headers)
            with urllib.request.urlopen(req, timeout=180) as r:
                with open(dest_path, "wb") as f:
                    while True:
                        chunk = r.read(1024 * 256)
                        if not chunk:
                            break
                        f.write(chunk)
            return
        except (urllib.error.HTTPError, urllib.error.URLError):
            # Blocked or unreachable from the sandbox IP; fall through to the proxy.
            pass
        # fetch_via_proxy is injected into product-sync parsers and returns bytes.
        body = fetch_via_proxy(url, headers=headers, timeout=200)
        with open(dest_path, "wb") as f:
            f.write(body)

When you already know the feed blocks the sandbox IP (e.g. you saw a 401/403 during MCP-based authoring), it's fine to skip the direct attempt and call fetch_via_proxy first; otherwise every cold start pays a wasted round-trip.
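One way to encode this policy is to keep a short list of hosts that are already known to block the sandbox and go proxy-first only for those. This is a sketch under assumptions: PROXY_FIRST_HOSTS, fetch_with_policy, and the two callables are hypothetical names. In a parser, direct_fetch would wrap urllib.request.urlopen and proxy_fetch would wrap fetch_via_proxy.

```python
import urllib.error
import urllib.parse

# Hypothetical list of hosts already known to block the sandbox IP.
PROXY_FIRST_HOSTS = {"feeds.example.com"}


def fetch_with_policy(url, direct_fetch, proxy_fetch):
    """Return the body as bytes, skipping the direct attempt for known-blocked hosts."""
    host = urllib.parse.urlsplit(url).hostname or ""
    if host in PROXY_FIRST_HOSTS:
        # Known block: avoid the wasted direct round-trip.
        return proxy_fetch(url)
    try:
        return direct_fetch(url)
    except (urllib.error.HTTPError, urllib.error.URLError):
        # Direct attempt was refused or failed; fall back to the proxy.
        return proxy_fetch(url)
```

Keeping the host list explicit makes it easy to see which feeds rely on the proxy and to remove an entry if the origin stops blocking.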
Examples
Authenticated XML feed
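The origin returns 401 Unauthorized to anyone but a whitelisted Octocom IP, so the request has to go through the proxy with the basic-auth header. The version below keeps the direct attempt from the standard pattern; because that attempt is expected to fail here, you can drop the try block and call fetch_via_proxy straight away: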

    import urllib.error
    import urllib.request

    AUTH_HEADER = "Basic <base64-credentials>"

    def _download(url, dest_path):
        headers = {
            "User-Agent": "Mozilla/5.0 (compatible; OctocomSync/1.0)",
            "Authorization": AUTH_HEADER,
        }
        try:
            req = urllib.request.Request(url, headers=headers)
            with urllib.request.urlopen(req, timeout=180) as r:
                with open(dest_path, "wb") as f:
                    while True:
                        chunk = r.read(1024 * 256)
                        if not chunk:
                            break
                        f.write(chunk)
            return
        except (urllib.error.HTTPError, urllib.error.URLError):
            pass
        body = fetch_via_proxy(url, headers=headers, timeout=200)
        with open(dest_path, "wb") as f:
            f.write(body)

Paginated JSON API behind Cloudflare
The origin lets the proxy IP through but rate-limits the sandbox IP. The reference superhome parser uses this exact pattern:

    import json
    import urllib.error
    import urllib.request

    def _fetch_page(page_num):
        url = f"https://example.com/api/products?page={page_num}"
        headers = {
            "User-Agent": "Mozilla/5.0 (compatible; OctocomSync/1.0)",
            "Accept": "application/json",
        }
        try:
            req = urllib.request.Request(url, headers=headers)
            with urllib.request.urlopen(req, timeout=30) as r:
                body = r.read()
        except (urllib.error.HTTPError, urllib.error.URLError):
            body = fetch_via_proxy(url, headers=headers, timeout=60)
        return json.loads(body).get("Products", [])

See also
- Product Sync Parsers: the contract this helper lives in.
- download_via_proxy MCP tool: the agent-side equivalent, useful while authoring a parser.