Extraction

Extraction uses JSON schemas built from element counters, CSS selectors, and attributes, and can save descriptors under a persist key.

Extraction turns a page into structured JSON. The schema is explicit, replayable, and can be saved under a persist key for reuse later.

Schema fields

Each leaf field can target one of these sources:

  • { element: 3 }
  • { selector: ".price" }
  • { source: "current_url" }

The top level must be a JSON object.

element and selector fields can also read an attribute:

{
  name: { element: 3 },
  href: { element: 4, attribute: "href" },
  canonicalUrl: { source: "current_url" }
}

URL-list attributes such as srcset, imagesrcset, and ping are normalized to a single value.
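The leaf shapes above can be written down as types and checked before a schema is sent. This is a minimal sketch, not OpenSteer's published typings: the type names and the isLeaf and isValidTopLevel helpers are hypothetical, and it assumes "current_url" is the only source value, since that is the only one shown here.

```typescript
// Hypothetical types modelling the leaf shapes described above.
// These names are illustrative, not OpenSteer's published typings.
type ElementLeaf = { element: number; attribute?: string };
type SelectorLeaf = { selector: string; attribute?: string };
type SourceLeaf = { source: "current_url" };
type Leaf = ElementLeaf | SelectorLeaf | SourceLeaf;

function isPlainObject(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// True when `value` carries exactly one well-formed source (element,
// selector, or source) and, if present, a string `attribute` that sits
// on an element or selector leaf only.
function isLeaf(value: unknown): value is Leaf {
  if (!isPlainObject(value)) return false;
  const hasElement = typeof value.element === "number";
  const hasSelector = typeof value.selector === "string";
  const hasSource = value.source === "current_url";
  const sourceCount = [hasElement, hasSelector, hasSource].filter(Boolean).length;
  if (sourceCount !== 1) return false;
  if ("attribute" in value) {
    return !hasSource && typeof value.attribute === "string";
  }
  return true;
}

// The top level must be a JSON object, so arrays and primitives are rejected.
function isValidTopLevel(schema: unknown): boolean {
  return isPlainObject(schema);
}
```

A check like this catches a malformed leaf locally, for example an attribute on a source field, before the schema reaches the extractor.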

Basic extraction

const data = await opensteer.extract({
  schema: {
    title: { element: 2 },
    price: { selector: ".price" },
    url: { source: "current_url" },
  },
});

Save and replay an extraction descriptor

Save the descriptor while you explore:

const data = await opensteer.extract({
  persist: "product summary",
  schema: {
    title: { element: 2 },
    price: { element: 5 },
    url: { source: "current_url" },
  },
});

Replay it later without resending the schema:

const replayed = await opensteer.extract({
  persist: "product summary",
});
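The save-then-replay contract can be pictured with a toy in-memory model. This is only a sketch of the calling pattern, not OpenSteer's implementation: DescriptorStore and its resolve method are hypothetical, the real descriptor presumably stores more than the raw schema, and treating an unknown key as an error is an assumption.

```typescript
// Toy in-memory model of save-and-replay; not OpenSteer's implementation.
type Schema = Record<string, unknown>;
type ExtractOptions = { persist?: string; schema?: Schema };

class DescriptorStore {
  private saved = new Map<string, Schema>();

  // Resolves which schema a call would run with.
  resolve(options: ExtractOptions): Schema {
    const { persist, schema } = options;
    if (schema !== undefined) {
      // Schema supplied: save it under the key when one is given.
      if (persist !== undefined) this.saved.set(persist, schema);
      return schema;
    }
    if (persist === undefined) {
      throw new Error("Provide a schema, a persist key, or both.");
    }
    const stored = this.saved.get(persist);
    if (stored === undefined) {
      // Assumption: an unknown key is an error; the real behavior may differ.
      throw new Error(`No descriptor saved under "${persist}".`);
    }
    return stored;
  }
}
```

The first call with both persist and schema records the descriptor; a later call with only persist looks it up by key.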

Arrays

Array fields use one or more sample row objects:

const data = await opensteer.extract({
  schema: {
    items: [
      {
        title: { element: 10 },
        price: { element: 11 },
        href: { element: 12, attribute: "href" },
      },
    ],
  },
});

If a page has multiple repeating row shapes, include multiple sample row objects in the array. OpenSteer treats each object as a variant when matching rows.
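As an illustration of multiple variants, the schema below gives two sample row objects: one for a row with a price and one for a row without. The counters are made up for the example and would come from a real snapshot; variantFields is a hypothetical helper that only inspects the schema locally.

```typescript
// Hypothetical counters: the first sample row has a price, the second does not.
const schema = {
  items: [
    {
      title: { element: 10 },
      price: { element: 11 },
      href: { element: 12, attribute: "href" },
    },
    {
      title: { element: 20 },
      href: { element: 21, attribute: "href" },
    },
  ],
};

// Lists the field names each sample row (variant) defines.
function variantFields(rows: Array<Record<string, unknown>>): string[][] {
  return rows.map((row) => Object.keys(row));
}
```

Listing the fields per variant is a quick way to confirm that each sample row describes a distinct shape before passing the schema to opensteer.extract.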

CLI usage

The CLI takes the schema as a positional JSON object:

opensteer extract '{"title":{"element":2},"url":{"source":"current_url"}}' \
  --workspace demo

Save the descriptor while extracting:

opensteer extract '{"title":{"element":2},"url":{"source":"current_url"}}' \
  --workspace demo \
  --persist "page summary"
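When the schema already exists as an object in a script, the positional JSON argument can be produced with JSON.stringify instead of being typed by hand. A minimal sketch, using only the extract subcommand and the --workspace and --persist flags shown above; buildExtractArgs is a hypothetical helper name.

```typescript
// Builds the argument list for the `opensteer extract` invocations shown above.
// Only the subcommand and flags that appear in this section are used.
function buildExtractArgs(
  schema: Record<string, unknown>,
  workspace: string,
  persist?: string,
): string[] {
  // The schema travels as a single positional JSON argument.
  const args = ["extract", JSON.stringify(schema), "--workspace", workspace];
  if (persist !== undefined) args.push("--persist", persist);
  return args;
}
```

Passing the resulting array to execFile from node:child_process skips the shell entirely, so the JSON needs no extra quoting.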

Good workflow

  1. Take a snapshot in extraction mode (snapshot extraction).
  2. Pick counters from the current snapshot.
  3. Write the smallest schema that proves the page shape.
  4. Save a persist key if you expect to reuse it.
  5. Rebuild the schema if the site layout changes substantially.