The outerproduct.agentic module provides agent-driven pipelines that turn unstructured or semi-structured document sources into typed structured features. The resulting datasets are fully compatible with op.reasoning.fit() and Trainer.configure(), so you can move from raw documents to a trained ReasoningModel without any manual feature engineering.
All public symbols for document tabularization live in the outerproduct.agentic.documents submodule.
outerproduct.agentic.documents
The documents submodule contains two functions, induce_schema and tabularize, that work together in a two-step pipeline:
induce_schema analyzes a sample of your documents and returns a DocumentSchema describing the tabular columns to extract.
tabularize applies that schema to your full document set and returns a DocumentDataset ready for training.
End-to-end example
The snippet below shows the complete flow from a list of raw documents to a trained ReasoningModel.
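As a rough illustration of the two-step flow, the sketch below uses local stand-in classes and functions rather than the OuterProduct SDK itself, so it runs without the package installed. The field names question, column_name, and questions, the rows attribute, and both function signatures are assumptions made for illustration; the real agent performs the analysis and extraction that the stubs skip.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


# --- Local stand-ins (NOT the OuterProduct SDK) ---------------------------
# Field and parameter names below are assumptions made for illustration.

@dataclass(frozen=True)
class DocumentQuestion:
    question: str      # hypothetical field: the natural-language question
    column_name: str   # hypothetical field: the output column name


@dataclass(frozen=True)
class DocumentSchema:
    questions: List[DocumentQuestion]  # hypothetical field name


class Dataset:
    """Stand-in for the base Dataset class."""

    def __init__(self, rows: List[dict]):
        self.rows = rows


class DocumentDataset(Dataset):
    """Stand-in: one row per document, one column per DocumentQuestion."""


def induce_schema(
    documents: Sequence[str],
    questions: Optional[Sequence[DocumentQuestion]] = None,
) -> DocumentSchema:
    # The real agent reads a sample of the documents. This stub only mirrors
    # explicit questions, or falls back to a single placeholder question.
    if questions is None:
        questions = [DocumentQuestion("What is this document about?", "topic")]
    return DocumentSchema(questions=list(questions))


def tabularize(documents: Sequence[str], schema: DocumentSchema) -> DocumentDataset:
    # The real agent answers every question for every document. This stub
    # leaves each cell as None so that only the resulting shape is shown.
    rows = [{q.column_name: None for q in schema.questions} for _ in documents]
    return DocumentDataset(rows)


# --- The two-step flow ----------------------------------------------------
documents = [
    "Agreement between A and B ...",
    "Agreement between C and D ...",
    "Agreement between E and F ...",
]
questions = [
    DocumentQuestion("What is the contract value in USD?", "contract_value_usd"),
]

schema = induce_schema(documents[:2], questions=questions)  # step 1: a sample
dataset = tabularize(documents, schema)                      # step 2: full set

# With the SDK, `dataset` would next be handed to op.reasoning.fit() or
# Trainer.configure(); their signatures are not covered here.
```

The schema is induced from a sample and then applied to the full collection, which is why the two calls receive different document lists.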
Functions
induce_schema
Analyzes a sample of documents and returns a DocumentSchema that describes the tabular columns to extract. When questions is provided, the schema is built around those specific questions; otherwise the agent infers a useful column set from the document content.
The documents to analyze. Each element may be a string (raw text) or a supported document object. The agent reads a representative sample to determine the schema.
An optional list of DocumentQuestion objects that specify exactly which questions the agent should answer for each document and what the resulting column should be named. When questions is None, the agent infers its own set of questions from the document content.
Returns a DocumentSchema object describing the inferred (or question-driven) tabular structure. Pass this to tabularize() to extract the columns from your full document set.
tabularize
Applies a DocumentSchema to a collection of documents and returns a DocumentDataset: a typed structured dataset where each row corresponds to one document and each column corresponds to a DocumentQuestion in the schema.
The full collection of documents to tabularize. These can be the same documents passed to induce_schema or a larger dataset that shares the same structure.
The DocumentSchema returned by induce_schema(). Defines the columns the agent will extract from each document.
Returns a DocumentDataset, a typed structured dataset derived from the source documents. Each row represents one document; each column represents one answered DocumentQuestion. Extends Dataset, so it can be passed directly to op.reasoning.fit() or Trainer.configure().
Classes
DocumentQuestion
Represents a single question that the agent answers for each document, and the name to give the resulting column in the tabular output.
The natural-language question the agent answers for every document in the dataset (e.g. "What is the contract value in USD?"). Phrase this as a specific, answerable question about the document content.
The name to assign to the output column in the tabularized dataset (e.g. "contract_value_usd"). Use lowercase snake_case for best compatibility with downstream tooling.
DocumentSchema
Describes the full set of columns to extract from a document collection. Returned by induce_schema() and consumed by tabularize().
The ordered list of DocumentQuestion objects that define the schema’s columns. Each entry maps to one output column in the DocumentDataset. When you pass explicit questions to induce_schema, this list mirrors them; when you let the agent infer the schema, it is populated automatically.
DocumentDataset
A typed tabular dataset produced by tabularize(). Extends the base Dataset class, so every method available on a Dataset (including passing it to op.reasoning.fit() or Trainer.configure()) works without any conversion step.
Inherits all properties and methods of Dataset. Each row corresponds to one source document; each column corresponds to one DocumentQuestion in the schema used to create it.
DocumentDataset is a drop-in replacement for any Dataset argument across the OuterProduct SDK. You can pass it directly to op.reasoning.fit(), Trainer.configure(), or any other API that accepts a Dataset.
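The drop-in behavior follows from ordinary subclassing. The sketch below uses local stand-in classes, not the SDK's own; the rows attribute, the column_names helper, and the describe function are hypothetical names introduced only to show that code written against the base class accepts the subclass unchanged.

```python
class Dataset:
    """Stand-in for the SDK's base Dataset class (not the real one)."""

    def __init__(self, rows):
        self.rows = rows  # hypothetical attribute: a list of dict rows

    def column_names(self):
        # Hypothetical helper: every column name seen across the rows.
        return sorted({name for row in self.rows for name in row})


class DocumentDataset(Dataset):
    """Stand-in for the dataset type produced by tabularize()."""


def describe(dataset: Dataset) -> str:
    # Hypothetical consumer written against the base class only.
    if not isinstance(dataset, Dataset):
        raise TypeError("expected a Dataset")
    columns = ", ".join(dataset.column_names())
    return f"{len(dataset.rows)} rows, columns: {columns}"


# Cells are left as None; only the shape matters for this illustration.
doc_dataset = DocumentDataset([
    {"contract_value_usd": None},
    {"contract_value_usd": None},
])

# The subclass instance passes the isinstance check with no conversion step.
summary = describe(doc_dataset)
```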