Engineering Blog: How Extropy Works
Extropy is a population simulation engine. You describe a population and a scenario in plain language, and Extropy produces thousands of statistically grounded synthetic agents who reason through that scenario independently, influence each other through social networks, and produce distributional behavioral forecasts. It is designed for one job: predicting how real populations will actually respond to events that haven't happened yet.
This post explains the engineering decisions behind Extropy. We use one example throughout: simulating the US population's response to the arrival of AGI.
Pipeline Overview
Extropy separates compile-time intelligence from runtime execution. At compile time, it builds explicit contracts for population, scenario context, and persona rendering. At runtime, it deterministically instantiates agents, builds their social graph, and simulates behavior over timesteps.
The core pipeline is: curate a base population, model scenario context, compile personas, sample agents at scale, generate the social network, then simulate behavioral dynamics over timesteps.
Each stage produces an inspectable artifact that the next stage consumes. This, coupled with a CLI for interaction, makes the system auditable via agentic harnesses. Expensive reasoning is done once, and execution stages can be rerun quickly with fixed seeds.
Curating a Base Population
We needed a generic system that can generate thousands of agents quickly while preserving diversity, realism, and meaningful outliers, for populations as broad as US adults or as specific as African traders in Guangzhou.
A census-only workflow was not enough. Census data is a strong anchor, but for niche or decision-specific populations it often becomes reductionist and requires heavy manual curation.
If you model a subpopulation by filtering from a broad census base, you often lose behavioral and structural variables needed for micro-realism. You match top-line demographic totals, but miss the cross-attribute structure that actually drives decisions. So we use a ground-up specification approach with external grounding data, instead of only slicing preexisting census tables.
At the core, the spec defines attributes and how each one is generated. Attributes fall into four groups: universal, population-specific, context-specific, and personality, each carrying distributions, constraints, and dependency links. For example, age is universal, technology_adoption is population-specific, and openness is a personality attribute.
Dependencies are the key reason this works. Real attributes are coupled: income with education, household structure with spending flexibility, and work schedule with financial resilience.
Attributes are not sampled in isolation. Each attribute starts from a base distribution and can then be adjusted through modifiers, which are explicit if-then rules applied when conditions are met. In the AGI example, modifiers can raise expected income for high-demand technical roles and lower it for roles with higher automation exposure. This preserves realistic heterogeneity instead of collapsing the population to a single average profile.
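A minimal sketch of modifier-based sampling under the assumptions above (the rule shapes, attribute names, and distribution parameters are illustrative, not Extropy's actual schema): an attribute starts from a base draw, then each matching if-then modifier adjusts it.

```python
import random

# Illustrative modifiers: (condition on the partially sampled agent,
# multiplier applied to the base income draw when the condition holds).
INCOME_MODIFIERS = [
    (lambda a: a["role"] == "ml_engineer", 1.6),       # high-demand technical role
    (lambda a: a["automation_exposure"] > 0.7, 0.8),   # high automation exposure
]

def sample_monthly_income(agent, rng):
    draw = rng.lognormvariate(8.6, 0.5)  # base monthly income distribution
    for condition, multiplier in INCOME_MODIFIERS:
        if condition(agent):
            draw *= multiplier
    return round(draw, 2)

rng = random.Random(42)
engineer = {"role": "ml_engineer", "automation_exposure": 0.2}
clerk = {"role": "data_entry_clerk", "automation_exposure": 0.9}
engineer_income = sample_monthly_income(engineer, rng)
clerk_income = sample_monthly_income(clerk, rng)
```

Both draws stay random, so heterogeneity survives; only the expected value shifts per rule, which is what keeps the population from collapsing to a single average profile.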
Derived attributes are deterministic when the relationship is arithmetic rather than probabilistic. For example, economic_buffer_months can be computed as liquid_savings / monthly_expenses once those upstream values are set.
Finally, dependencies are compiled into a strict sampling order, so each field is generated only after its prerequisites exist. This prevents dependency-order errors and sets up a stable and auditable base to build on.
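The compile step above amounts to a topological sort of the attribute dependency graph. A minimal sketch (attribute names are illustrative), using the standard library's `graphlib`:

```python
from graphlib import TopologicalSorter

# Each attribute maps to the prerequisites that must be sampled first.
DEPS = {
    "age": set(),
    "education": {"age"},
    "role": {"education"},
    "income": {"education", "role"},
    "liquid_savings": {"income", "age"},
    "monthly_expenses": {"income"},
    "economic_buffer_months": {"liquid_savings", "monthly_expenses"},
}

# Compile once into a strict sampling order; a cyclic dependency raises
# CycleError here, at compile time, instead of failing mid-sample.
SAMPLING_ORDER = list(TopologicalSorter(DEPS).static_order())
```

Every attribute appears after all of its prerequisites, so the derived `economic_buffer_months` is always computed last among these fields.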
Modeling Scenario Context
The scenario stage takes the base population and adds two layers: scenario-specific attributes and scenario dynamics. The base population provides stable structure. The scenario layer adds the information environment and decision context for this specific event.
Some attributes only matter in a specific scenario, so they do not belong in the base population. In the AGI example, this includes perceived AGI exposure, role-replacement anxiety, trust in frontier labs, and adaptation intent. These are researched and encoded as first-class attributes, then merged into the same dependency graph and sampling order as base attributes.
This separation is about composability. One base population can support multiple scenarios, each with its own context-specific assumptions.
The scenario defines how information enters the system through event metadata such as source, credibility, ambiguity, and framing. It also defines exposure channels and rules so different groups receive information through different paths, at different times, and with different probabilities.
This explicitly models information asymmetry. In real settings, people do not receive the same information at the same time or from equally trusted sources. That asymmetry changes behavior, so it has to be part of the scenario contract.
Scenarios can be static or evolving. Static scenarios model one primary event and its diffusion. Evolving scenarios add timeline events that introduce new information over time. This supports updates, reversals, and second-order effects instead of freezing context at timestep zero.
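A hypothetical sketch of what such a scenario contract could look like (all field names are illustrative, not Extropy's actual format): event metadata, exposure channels that route information to different groups at different times, and a timeline event that introduces new information mid-run.

```python
# Illustrative scenario contract for the AGI example.
scenario = {
    "primary_event": {
        "description": "Frontier lab announces an AGI milestone",
        "source": "frontier_lab",
        "credibility": 0.7,
        "ambiguity": 0.4,
        "framing": "breakthrough",
    },
    "exposure_channels": [
        # Early adopters hear first and with high probability.
        {"channel": "tech_media",
         "match": {"technology_adoption": "early"},
         "timestep": 0, "probability": 0.9},
        # Broadcast reaches everyone, later and less reliably.
        {"channel": "broadcast_news",
         "match": {},  # no filter: whole population eligible
         "timestep": 2, "probability": 0.5},
    ],
    "timeline_events": [
        {"timestep": 5, "info_epoch": 1, "force_rethink": True,
         "description": "Government announces a response plan"},
    ],
}
```

The two channels encode the information asymmetry directly: who hears, when, and with what probability are scenario data, not runtime accidents.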
The scenario also defines what gets measured. We focus on categorical and open-ended outcomes. Categorical outcomes keep the decision space explicit and trackable. Open-ended outcomes capture emergent reasoning that fixed options may miss. Option-level friction can also be encoded so actions reflect real execution difficulty, not just stated preference.
Household Configuration
When household context matters, the scenario carries household configuration including household-type distributions by age bracket, partner-correlation settings, and dependent-generation rules. This preserves decision context because many real decisions are made at household level, not by isolated individuals. This directly feeds scope-aware sampling in the next stage.
Persona Compilation
Raw structured attributes are not a good interface for downstream reasoning. The persona stage compiles a PersonaConfig that renders each sampled agent into a consistent first-person narrative.
The main design choice is compile once, apply everywhere. Persona rules are generated once per scenario and then reused deterministically for all agents, so there are no per-agent persona LLM calls at runtime.
Each attribute is assigned a treatment. Concrete treatment renders direct values when absolute numbers matter. Relative treatment renders position against population context when comparative standing matters more than raw value. Categorical and boolean fields also get deterministic first-person phrasing templates, so the same underlying state is expressed consistently across agents.
For example, if an agent has monthly_income = 6200, concrete treatment can render: "I earn about $6,200 per month." If that same agent has technology_optimism = 0.78, relative treatment can render: "I'm more optimistic about new technology than most people like me." Concrete keeps absolute magnitude when it matters. Relative keeps comparative meaning when raw scalars are less interpretable.
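A minimal rendering sketch of the two treatments (the templates, function, and percentile rule are illustrative assumptions, not Extropy's renderer): concrete formats the raw value, relative compares against a population reference sample.

```python
# Illustrative deterministic phrasing templates.
CONCRETE = {"monthly_income": "I earn about ${value:,.0f} per month."}
RELATIVE = {
    "technology_optimism":
        "I'm {qualifier} optimistic about new technology than most people like me.",
}

def render(attr, value, treatment, reference=None):
    if treatment == "concrete":
        return CONCRETE[attr].format(value=value)
    # Relative: position of this agent's value within the reference sample.
    rank = sum(v < value for v in reference) / len(reference)
    qualifier = "more" if rank >= 0.5 else "less"
    return RELATIVE[attr].format(qualifier=qualifier)

income_line = render("monthly_income", 6200, "concrete")
optimism_line = render("technology_optimism", 0.78, "relative",
                       reference=[0.31, 0.48, 0.55, 0.62, 0.90])
# income_line  → "I earn about $6,200 per month."
# optimism_line → "I'm more optimistic about new technology than most people like me."
```

Because the templates are compiled once per scenario, the same underlying state always yields the same sentence, with no per-agent LLM call.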
The renderer groups attributes into readable sections and prioritizes decision-relevant fields supplied by the scenario. The result is compact persona text that preserves signal, improves consistency, and gives the simulation stage a stable language layer over structured data.
Sampling at Scale
Sampling is where the compiled contracts become concrete agents. It consumes the merged population specification, household configuration, and persona configuration, then instantiates agents deterministically. With the same spec and seed, you get the same sampled population.
The sampler runs attribute generation in dependency order. Independent attributes are sampled directly from declared distributions. Conditional attributes are sampled from a base distribution and then adjusted by matching modifiers. Derived attributes are computed from formulas using already available fields. Hard numeric constraints are applied as clamping, and distribution parameters can also be formula-driven when bounds or means depend on upstream context.
When household semantics are active, sampling shifts from isolated individuals to structured household realization. A primary adult is sampled first, household type is selected from configured age-bracket distributions, then partner and dependent members are generated according to household rules. Attribute scope controls propagation: individual fields vary per person, household fields are shared, and partner-correlated fields are sampled with assortative correlation logic. In practice, partner attributes are not copied blindly or sampled independently. They are generated using configurable same-value or same-group probabilities, plus controlled offsets for numeric traits like age, so partner pairs preserve realistic correlation structure without collapsing into identical profiles.
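The partner-correlation logic described above can be sketched as follows (the probabilities and offset bands are illustrative assumptions): with some probability the partner stays in a tight band around the primary adult's value, otherwise a looser band applies, so pairs correlate without becoming clones.

```python
import random

def sample_partner_age(primary_age, rng, same_group_p=0.8,
                       tight_offset=4, loose_offset=12):
    # Most partners land near the primary adult's age; a minority draw
    # from a wider band, preserving realistic tails.
    offset = tight_offset if rng.random() < same_group_p else loose_offset
    return primary_age + rng.randint(-offset, offset)

rng = random.Random(3)
partner_ages = [sample_partner_age(38, rng) for _ in range(2000)]
```

Averaged over many households, partner age centers on the primary adult's age while individual pairs still vary, which is the assortative structure the sampler needs.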
After realization, the sampler runs deterministic reconciliation to enforce coherence. This aligns partner and marital consistency, household size and composition consistency, and shared household naming consistency across members and NPC context. These checks are not cosmetic. They prevent downstream network and simulation stages from inheriting broken household state.
A core design choice is preserving meaningful tails while blocking contradictions. Hard bounds allow realistic extremes, modifiers preserve structured heterogeneity, and constraints block impossible combinations. Post-sampling quality gates separate impossible from implausible outcomes. Impossible violations are hard failures. Implausible patterns are measured as reconciliation burden and surfaced as warnings or failures based on strictness settings. Condition-evaluation warnings are also tracked and can be promoted to hard failures in strict mode.
Network Generation and Topology Gating
The network stage turns sampled people into a usable interaction graph. Sampling gives us realistic profiles, but each agent is still isolated until we define who can influence whom. The scenario-local network configuration is the single source of truth for relationship semantics and topology targets across all studies, which keeps network behavior tied to scenario intent instead of hidden runtime defaults. The output is a seeded, reproducible graph that is deterministic under seed, inspectable in storage, and reusable across reruns.
We use a hybrid edge model rather than a single mechanism. Similarity edges capture soft social affinity from weighted attributes in the config, while structural edges encode hard ties that should exist regardless of similarity. We treat ties like partner and household as mandatory anchors, then add other channels under a controlled budget. This prevents the graph from drifting into either pure randomness or over-hardcoded structure. In practice, it gives us both realism and controllability.
At runtime, we do not compare every person to every other person up front. We first build a shortlist of plausible candidates using a few socially meaningful blocking attributes, then compute detailed similarity only inside those shortlists. If coverage diagnostics show the shortlist is too narrow, we expand candidates in staged steps. This keeps runtime predictable while preserving enough signal for community detection and calibration.
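The shortlist step is a standard blocking scheme. A minimal sketch (the blocking attributes and agent fields are illustrative): group agents by a few socially meaningful keys, then only compare pairs within the same block.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(agents, blocking_keys):
    # Bucket agents by their blocking-attribute tuple, then yield only
    # within-block pairs; cross-block pairs are never scored.
    blocks = defaultdict(list)
    for agent in agents:
        blocks[tuple(agent[k] for k in blocking_keys)].append(agent["id"])
    for members in blocks.values():
        yield from combinations(members, 2)

agents = [
    {"id": 0, "region": "west", "age_band": "25-34"},
    {"id": 1, "region": "west", "age_band": "25-34"},
    {"id": 2, "region": "east", "age_band": "25-34"},
    {"id": 3, "region": "west", "age_band": "25-34"},
]
pairs = sorted(candidate_pairs(agents, ["region", "age_band"]))
# Only same-block pairs survive: (0, 1), (0, 3), (1, 3).
```

Detailed similarity is then computed only over these shortlisted pairs, which is what keeps runtime predictable; widening the blocking keys is the staged-expansion lever when coverage diagnostics flag a shortlist as too narrow.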
After candidate similarity is computed, the generator runs a calibration loop. It samples edges, measures the realized topology, and adjusts within-community versus cross-community edge pressure toward configured targets. Structural edge budgets keep mandatory ties intact while similarity edges shape global structure. Repair passes raise clustering and improve connectivity using controlled local edits. Rewiring runs as a final refinement step.
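The calibration loop can be sketched as a simple feedback controller (a toy sketch under stated assumptions: the real loop measures graph topology, while `toy_realize` here is a stand-in where the cross-community edge ratio grows monotonically with the applied pressure).

```python
def calibrate(target, realize, steps=30, lr=1.0):
    # Repeatedly realize the topology under the current pressure and
    # nudge the pressure toward the configured target ratio.
    pressure = 1.0
    for _ in range(steps):
        realized = realize(pressure)
        pressure *= 1 + lr * (target - realized)
    return pressure

toy_realize = lambda p: p / (p + 3)  # monotone in pressure
pressure = calibrate(0.4, toy_realize)
```

The same shape applies to the real loop: sample edges, measure within- versus cross-community mix, adjust, repeat until the realized topology sits near the configured target.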
Quality control uses explicit topology gates. The final graph is checked against configured bounds for degree, clustering, modularity, connectivity, and minimum edge count. If rewiring worsens a previously passing graph, the system reverts to the pre-rewire graph. In strict mode, failed runs return non-zero and can be quarantined under a suffixed network id for inspection. This provides a hard go/no-go signal before simulation.
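A minimal gate-check sketch (the metric names and bounds are illustrative, not Extropy's configured values): each measured metric must land inside its configured range, and any violation is a hard no-go signal.

```python
# Illustrative topology gates: metric -> (lower bound, upper bound).
GATES = {
    "mean_degree": (4.0, 20.0),
    "clustering": (0.10, 0.60),
    "modularity": (0.20, 0.80),
    "largest_component_fraction": (0.95, 1.0),
}

def failed_gates(metrics, gates=GATES):
    # Empty list == go; any entries == no-go (non-zero exit in strict mode).
    return [name for name, (lo, hi) in gates.items()
            if not lo <= metrics[name] <= hi]

metrics = {"mean_degree": 8.2, "clustering": 0.31,
           "modularity": 0.46, "largest_component_fraction": 0.99}
verdict = failed_gates(metrics)
```

Returning the list of failing gates, rather than a bare boolean, is what makes quarantined runs inspectable: the suffixed network id can carry exactly which bounds were violated.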
Simulating Behavioral Dynamics
Simulation is where Extropy turns static artifacts into behavior over time. Each timestep follows a fixed phase order: apply exposures, run reasoning, run conversations (if enabled), record summary metrics, then checkpoint completion. Exposure itself is layered as seed rules, timeline events, then network propagation, which lets scenario information and peer diffusion coexist without collapsing into one channel. Timeline events tag each update with an info_epoch and can force agents to think again when new information arrives. Direct timeline exposure is capped by intensity, and network sharing is one-shot per source-target-position, so spread stays realistic instead of exploding. In practice, this gives you controlled cascades instead of all-agent spikes that destroy signal quality.
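The one-shot sharing rule above can be sketched with a small dedupe ledger (a minimal sketch; class and method names are illustrative): each (source, target, position) triple is delivered at most once, which bounds cascades without forbidding re-exposure from other neighbors.

```python
class ShareLedger:
    def __init__(self):
        self._seen = set()

    def try_share(self, source, target, position):
        # One-shot per source-target-position: duplicates are dropped.
        key = (source, target, position)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True

ledger = ShareLedger()
first = ledger.try_share("agent_12", "agent_40", "adapt_and_reskill")
repeat = ledger.try_share("agent_12", "agent_40", "adapt_and_reskill")
other_source = ledger.try_share("agent_07", "agent_40", "adapt_and_reskill")
```

The same target can still hear the same position from a different neighbor, so multi-touch thresholds remain meaningful while repeat spam from one source is suppressed.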
Reasoning is trigger-based, not all-agent every step. Agents reason when newly aware, when multi-touch thresholds are met, or when newer forced timeline epochs arrive; committed agents are protected from routine re-reasoning unless scenario dynamics explicitly escalate. The default two-pass design separates role-play generation from structured classification, which reduces central-tendency collapse and produces cleaner outcome extraction. Fidelity settings then control how deep runtime social dynamics go: low skips conversations, medium and high interleave bounded conversations after reasoning chunks with strict per-agent and per-timestep budgets.
State updates are modeled as public and private tracks so expression and behavior do not get conflated. Public sentiment and conviction can change each step, private state changes more slowly, and strongly held views do not flip easily unless new evidence is strong enough. Action friction is also encoded, so high-effort behavior changes require stronger conviction and social reinforcement than low-friction choices. Agents who do not reason in a timestep undergo controlled conviction decay, which prevents stale certainty from freezing the system. Runs stop by explicit conditions or convergence rules, with timeline-aware safeguards against premature stopping, and resume safely from timestep and chunk checkpoints. The result is a simulation runtime built for production reliability: clear state transitions, recoverable long runs, and outputs that are immediately usable for downstream analysis.
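The conviction-decay mechanic can be sketched as follows (the rate and floor are illustrative assumptions): agents that skip reasoning in a timestep drift toward lower certainty instead of staying frozen, down to a floor.

```python
def decay_conviction(conviction, rate=0.05, floor=0.10):
    # Multiplicative decay toward a floor; applied only to agents that
    # did not reason this timestep.
    return max(floor, conviction * (1 - rate))

conviction = 0.9
for _ in range(10):  # ten timesteps without reasoning
    conviction = decay_conviction(conviction)
```

After ten idle timesteps the agent is noticeably less certain but not reset, so a later timeline event or peer contact can plausibly trigger re-reasoning without whiplash.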
If you want to see this engine in action, we ran a full simulation of how the US population reacts to the Iran strikes in our US Attacks Iran study.