Product Sync Parsers

A product-sync parser is a small Python script that pulls a store's product feed and turns it into rows Octocom can search. You only need one when the store's catalog can't be connected through a standard integration (Shopify, WooCommerce, BigCommerce, etc.) — i.e. the products live in a custom XML/JSON feed or a bespoke API. It's written by a technical integrator or an LLM agent.

The parser runs on a schedule to populate a business's product catalog from its source feed. You write two functions; Octocom drives them in a loop, validates each mapped product against a schema, and writes the valid ones straight to the product index.

This page is the reference for the contract. For end-to-end examples, see the parsers in the sidebar:

Example 1 — single static XML feed (simplest shape)
Example 2 — multiple static XML feeds joined via in-memory lookup tables
Example 3 — paginated JSON API

The contract

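Your parser must define exactly two top-level functions: fetch_next_chunk, which returns the next batch of raw items from the feed, and map_one(raw, context), which maps one raw item to a product. The runtime looks them up by name in the module's globals() and calls them in a loop until the feed is exhausted.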
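As a rough sketch of the overall shape, assuming fetch_next_chunk receives the state dict and returns a list of raw items, a parser for a single static feed might look like the following. The feed contents and the raw field names (`sku`, `title`) are hypothetical, and the returned dict is deliberately incomplete; the real schema needs more fields than are shown here.

```python
# Hypothetical in-memory feed standing in for a real download.
_FAKE_FEED = [{"sku": "A1", "title": "Mug"}, {"sku": "B2", "title": "Bowl"}]


def fetch_next_chunk(state):
    """Return the next batch of raw items, or [] once the feed is exhausted."""
    parser = state.setdefault("parser", {"done": False})
    if parser["done"]:
        return []
    parser["done"] = True
    return list(_FAKE_FEED)  # single static feed: everything in one chunk


def map_one(raw, context):
    """Map one raw item to a flat product dict, or None to skip it."""
    if not raw.get("title"):
        return None
    # Illustrative subset of fields only.
    return {"id": raw["sku"], "name": raw["title"]}
```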

A few rules the runtime enforces:

fetch_next_chunk must be cheap-ish. Aim for each call to finish in under about 150 seconds: the sandbox enforces a hard per-call ceiling of 220 s, and the dry-run default is configurable. Make one chunk = one HTTP round-trip or one streamed batch from disk.
map_one must be pure. No I/O, no network, no global mutation. It runs once per raw item and is expected to be fast.
Two consecutive empty chunks end the run. A single empty chunk is treated as transient; the runtime will call fetch_next_chunk again. If you return [] twice in a row, the run finalises.
Five fetch errors also end the run. Errors raised from fetch_next_chunk are caught and counted; once fetch_errors >= 5 the run finalises with a failure sample.
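To exercise these stop conditions locally, a small stand-in driver can help. This is not the real runtime, only a sketch of the rules above: `run_parser_locally`, its `max_calls` guard, and its return statuses are hypothetical, schema validation is omitted, and it assumes the empty streak resets whenever a non-empty chunk arrives.

```python
def run_parser_locally(fetch_next_chunk, map_one, context=None, max_calls=1000):
    """Drive a parser until two empty chunks in a row or five fetch errors."""
    context = context or {}
    state = {
        "context": context,
        "cursor": context.get("cursor"),
        "empty_streak": 0,
        "fetch_errors": 0,
    }
    products = []
    for _ in range(max_calls):
        try:
            chunk = fetch_next_chunk(state)
        except Exception:
            # Errors are counted, not propagated; the fifth one ends the run.
            state["fetch_errors"] += 1
            if state["fetch_errors"] >= 5:
                return products, "failed"
            continue
        if not chunk:
            # One empty chunk is treated as transient; two in a row finish.
            state["empty_streak"] += 1
            if state["empty_streak"] >= 2:
                return products, "finished"
            continue
        state["empty_streak"] = 0
        for raw in chunk:
            mapped = map_one(raw, context)
            if mapped is not None:  # None means "skip this item"
                products.append(mapped)
    return products, "max_calls"
```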

The `state` dict

state is a module-level dict that the runtime owns and passes to fetch_next_chunk on every call. The same dict instance is reused across calls within a sync run, so anything you stash in it survives. The keys that matter when you're writing a parser:

Key	Type	Owner	Notes
`context`	dict	runtime	The same dict your `map_one(raw, context)` receives. See below.
`cursor`	any	both	Opaque per-run cursor. Initialised from `context.get('cursor')`. You may overwrite it.
`empty_streak`	int	runtime	Consecutive empty fetch results. ≥2 ends the run.
`fetch_errors`	int	runtime	Times `fetch_next_chunk` raised. ≥5 ends the run.

The runtime also tracks a handful of read-only counters you never set yourself:

You own anything else you put in state. The reference parsers all stash their per-parser state under state['parser']:
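A minimal sketch of that pattern, assuming a two-feed source where prices live in a separate feed: the first call builds a lookup table and stashes it, together with a read offset, under state['parser']. The feed data, `CHUNK_SIZE`, and the keys inside state['parser'] are all hypothetical.

```python
# Hypothetical stand-ins for two feeds; a real parser would download these.
_PRODUCTS = [{"sku": "A1", "title": "Mug"}, {"sku": "B2", "title": "Bowl"}]
_PRICES = [{"sku": "A1", "price": "9.50"}, {"sku": "B2", "price": "14.00"}]
CHUNK_SIZE = 1


def fetch_next_chunk(state):
    parser = state.get("parser")
    if parser is None:
        # Initialisation block: runs only on the first call of a sync run.
        parser = state["parser"] = {
            "offset": 0,
            "price_by_sku": {row["sku"]: row["price"] for row in _PRICES},
        }

    start = parser["offset"]
    batch = _PRODUCTS[start:start + CHUNK_SIZE]
    parser["offset"] = start + len(batch)

    # Join here so map_one stays pure and needs no lookup table of its own.
    return [
        {**item, "price": parser["price_by_sku"].get(item["sku"])}
        for item in batch
    ]
```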

You can also use state['cursor'] directly when a single opaque value (page number, since-id, watermark) is enough.
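For instance, a since-id style pagination could keep its position in state['cursor'] alone. The sketch below assumes a hypothetical API, simulated by `_load_after`, that returns items with an id greater than the cursor; the item shape and `PAGE_SIZE` are illustrative.

```python
PAGE_SIZE = 2
_ITEMS = [{"id": n} for n in range(1, 6)]  # hypothetical data, ids 1..5


def _load_after(since_id, limit):
    """Hypothetical API call: up to `limit` items with id > since_id."""
    return [item for item in _ITEMS if item["id"] > since_id][:limit]


def fetch_next_chunk(state):
    since_id = state.get("cursor") or 0  # cursor may start out as None
    items = _load_after(since_id, PAGE_SIZE)
    if items:
        # Advance the cursor only when something came back.
        state["cursor"] = items[-1]["id"]
    return items
```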

The `context` dict

context is built once per run from the business's product-sync configuration and passed to both init() (internally) and map_one(raw, context). Treat it as read-only.

When you call dryrun_product_sync(..., parserContext={...}) the extra keys are merged into context. Scheduled production runs don't get a parserContext — anything you need at runtime that isn't on the business config has to be in the parser source or a metafield on the business.

Output schema (`map_one` return value)

The runtime validates each non-None return value against the V4ProductInput schema (the full field set is documented below). The shape is flat — no nested variants, no nested prices, no nested collections. You collapse the source's structure into summary scalars + searchable text + an opaque rawJson blob that carries everything else.

Return None from map_one to skip an item without counting it as an error — useful for inactive items, gift cards, or filtered categories.
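The sketch below shows one possible map_one built only from fields named on this page (id, name, shortDescription, embeddableContent, rawJson). The raw field names (`sku`, `title`, `description`, `active`, `type`), the 200-character truncation, and the choice of SHA-256 for the id are assumptions; the full schema has more fields (prices, the stock booleans) that are left out here.

```python
import hashlib
import json


def map_one(raw, context):
    # Skip items the store does not want surfaced; None is not an error.
    if not raw.get("active", True) or raw.get("type") == "gift_card":
        return None

    name = (raw.get("title") or "").strip()
    description = (raw.get("description") or "").strip()

    # Hash only the source's stable identifier so the id never changes
    # between syncs (assumed here to be the SKU).
    product_id = hashlib.sha256(str(raw["sku"]).encode("utf-8")).hexdigest()

    return {
        "id": product_id,
        "name": name,
        "shortDescription": description[:200],
        # Plain natural-language text, not JSON, for the embedding model.
        "embeddableContent": "\n".join(p for p in (name, description) if p),
        # Everything else travels in the opaque blob.
        "rawJson": json.dumps(raw, ensure_ascii=False),
    }
```

Because the id depends only on the SKU, a price change in the feed maps to the same row on the next sync.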

Stock semantics: the three `has*Variants` booleans

These three booleans together describe what the customer can do with the product, and product search reads them to filter results. Get them right or the bot will either (a) tell customers an unavailable item is in stock, or (b) silently skip an item the store is happy to sell.

`hasInStockVariants`	`hasPreorderableVariants`	`hasOnSaleVariants`	Meaning
`true`	any	any	At least one variant physically has inventory. Search returns it for in-stock queries.
`false`	`true`	any	No inventory, but at least one variant is orderable (backorder / pre-order / made-to-order).
`false`	`false`	any	Nothing is sellable. The bot will not surface this product for purchase intent.
any	any	`true`	At least one variant is on sale. Used for "what's on sale" queries.

Be literal. hasInStockVariants is true only when the source says inventory exists. Don't fudge it to keep search happy — hasPreorderableVariants exists exactly so back-orderable items are still discoverable.

Surface the actual availability state in rawJson (e.g. include an availability field like "in stock" / "backorder" / "out of stock") so the bot can quote a concrete label when it hydrates, instead of guessing from the boolean.

Worked example for a feed with an explicit availability field per variant:
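One way to derive the three booleans, assuming each variant carries an `availability` string plus a `price` and an optional `sale_price`. The field names, the label values in `PREORDER_LABELS`, and the helper names are all hypothetical; adapt them to what the feed actually emits.

```python
# Labels that mean "no inventory, but still orderable" (assumed values).
PREORDER_LABELS = {"backorder", "preorder", "made to order"}


def _on_sale(variant):
    """True when a sale price is present and lower than the regular price."""
    price, sale = variant.get("price"), variant.get("sale_price")
    if price is None or sale is None:
        return False
    return float(sale) < float(price)


def stock_flags(variants):
    """Collapse per-variant availability into the three summary booleans."""
    labels = [(v.get("availability") or "").strip().lower() for v in variants]
    return {
        # Literal: true only when the source says inventory exists.
        "hasInStockVariants": any(label == "in stock" for label in labels),
        "hasPreorderableVariants": any(label in PREORDER_LABELS for label in labels),
        "hasOnSaleVariants": any(_on_sale(v) for v in variants),
    }
```

A product whose only orderable variant is on backorder comes out as not in stock but preorderable, which matches the second row of the table above.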

Runtime helpers

The runtime pre-injects a few things into the parser's globals. You don't need to import them.

`fetch_via_proxy(url, *, headers=None, timeout=60, method='GET', data=None) -> bytes`

Fetch a URL through Octocom's HTTP proxy (a stable egress IP) instead of the sandbox's IP. Use it when the origin blocks the sandbox — symptoms include HTTPError 403, HTTPError 429, or unexplained urlopen timeouts on URLs that work fine in a browser or via the download_via_proxy MCP tool.

See the full fetch_via_proxy helper docs for the standard try-direct-then-fallback pattern.
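As a rough illustration of a try-direct-then-fallback shape (the linked helper docs remain the authoritative version), the sketch below tries urllib first and falls back to the proxy on an HTTP or network error. `fetch_bytes` and its `proxy_fetch` parameter are hypothetical; `proxy_fetch` exists only so the function can be exercised outside the sandbox, where fetch_via_proxy is not injected.

```python
import urllib.error
import urllib.request


def fetch_bytes(url, timeout=60, proxy_fetch=None):
    """Fetch directly; on HTTP or network failure, retry through the proxy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read()
    except (urllib.error.URLError, TimeoutError):
        # HTTPError (403, 429, ...) is a subclass of URLError.
        # Inside the sandbox, fetch_via_proxy is pre-injected into globals.
        fallback = proxy_fetch or fetch_via_proxy
        return fallback(url, timeout=timeout)
```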

`V4ProductInput`

The schema class used to validate map_one's return value. You usually don't reference it directly: the runtime constructs it for you and reports validation failures in errors_sample.

Iterating on a parser

Sketch the parser by hand or with an LLM. The three reference parsers cover the common feed shapes — adapt them to the schema as you go.
Dry-run via the dryrun_product_sync MCP tool. It runs the parser end-to-end and produces a JSONL artifact in blob storage without touching the live catalog. The tool returns an executionId; poll get_product_sync_execution for the result, then download the JSONL via the returned SAS URL.
Inspect the output. Spot-check products you know — does hasInStockVariants match the source? Are prices sane? Does embeddableContent read like a description a human would search for?
Persist + go live via upsert_product_sync_parser. This both saves the source and (on the first call for a business) switches the business to the AI-generated product provider, enables auto-sync, and schedules the first real sync for the next scheduler tick (≤1 minute). On subsequent updates the schedule is left alone so you can iterate without immediately re-running production.
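For the inspection step, a small script can flag obvious problems before you spot-check by hand. This sketch assumes the downloaded JSONL holds one mapped product per line with the same keys map_one returns; that layout is an assumption, and `audit_jsonl` and its report keys are hypothetical.

```python
import json


def audit_jsonl(lines):
    """Count common problems in dry-run output, one JSON object per line."""
    report = {"total": 0, "empty_embeddable": 0, "missing_raw_json": 0, "unsellable": 0}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        row = json.loads(line)
        report["total"] += 1
        if not (row.get("embeddableContent") or "").strip():
            report["empty_embeddable"] += 1
        if not row.get("rawJson"):
            report["missing_raw_json"] += 1
        if not (row.get("hasInStockVariants") or row.get("hasPreorderableVariants")):
            report["unsellable"] += 1
    return report
```

Pass it an open file handle for the downloaded artifact, for example `audit_jsonl(open("dryrun.jsonl", encoding="utf-8"))`. A high `unsellable` count is not necessarily a bug, but it is worth comparing against the source feed.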

Common pitfalls

Non-deterministic id. Hashing volatile data (timestamps, prices, computed flags) means every sync writes a fresh row and the catalog grows unboundedly. Hash the source's stable id only.
Setting hasInStockVariants=true for backorder items. Tempting (you want the bot to recommend them) but wrong — the bot will claim they're in stock. The right answer is hasInStockVariants=false + hasPreorderableVariants=true. See the Stock semantics section above.
Forgetting rawJson. The bot reads it when it decides to mention a product; without it, the bot has only name + shortDescription to work with. Always include "rawJson": json.dumps(raw, ensure_ascii=False).
Dumping JSON into embeddableContent. The embedding model is trained on natural language. JSON tokens dilute the signal. Build a plain-text description instead.
Empty embeddableContent. The row will still upload, but the embedding job has nothing to work with and the product won't surface in semantic search. At minimum: the name and one of the descriptions.
Mutating the chunk's raw items in map_one. map_one should be pure; if you need to derive lookup tables, build them once in fetch_next_chunk's initialisation block.
Not clearing iter-parsed elements. When using xml.etree.ElementTree.iterparse, call elem.clear() after consuming each element or memory grows unboundedly on large feeds.
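The last pitfall can be sketched as follows, using the standard library's iterparse on a hypothetical feed whose items are `<product>` elements with flat child tags. The tag names and the `iter_products` helper are illustrative only.

```python
import io
import xml.etree.ElementTree as ET


def iter_products(source, tag="product"):
    """Stream items from an XML feed, releasing each element after use."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            item = {child.tag: (child.text or "").strip() for child in elem}
            # Drop the element's children and text so memory stays flat.
            elem.clear()
            yield item


# Usage with an in-memory feed; a real parser would pass a file or response.
sample = io.BytesIO(
    b"<feed><product><sku>A1</sku><title>Mug</title></product>"
    b"<product><sku>B2</sku><title>Bowl</title></product></feed>"
)
products = list(iter_products(sample))
```

Cleared elements stay attached to the root as empty shells, which is usually cheap enough; very large feeds may also need those references removed from the parent.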
