Introduction
Income-producing real estate is priced inefficiently relative to other asset classes. Each property is a unique bundle of location, condition, unit configuration, vintage, and local operating cost environment. Pricing any single asset requires assembling a bespoke assumption set before a return calculation is possible. The assumptions are the work; the math is mechanical.
This creates a research cost problem. When assumption-setting costs 2 to 4 hours per asset, investors screen fewer opportunities, fewer assets receive rigorous price discovery, and markets clear less efficiently. A typical institutional screening pipeline (data ingestion, rent estimation, operating assumptions, ranking, template population, return computation, interpretation) runs 30 to 50 analyst-hours for 265 listings narrowed to 10 underwritten deals.
We present an empirical case in which an autonomous system on the Apers platform completed this pipeline in a single session. The system was given a dataset and three instructions. It independently designed its own analytical methodology (scoring architecture, formula selection, ranking weights, underwriting defaults), proposed that methodology to the human for confirmation, and executed the full pipeline through AQ-001, a pre-defined quick acquisition screener for income-producing assets. The human made three decisions across the entire process. Two context window exhaustion events required restarts. The pipeline delivered 10 fully functional underwriting files with live sensitivity engines and diagnostic interpretation of the results.
Autonomous methodology design
The system was given a CSV of 265 multifamily listings and three instructions: screen everything, pick the top 10, underwrite them. It was not given a methodology.
The system independently proposed: a four-tier submarket rent estimation structure implemented as VLOOKUP formulas against a visible lookup table; a PERCENTRANK-based blended scoring function combining going-in cap rate (50% weight), inverted price per bed (30%), and year built (20%); a set of underwriting defaults (65% LTV, 6.75% rate, 30-year amortization, 3% NOI growth, 5-year hold, exit cap equal to going-in plus 50 basis points); and a state-file checkpointing pattern for recovery across context window failures.
Each of these is a design decision with alternatives the system could have chosen. PERCENTRANK over z-scores (bounded output, interpretable without statistical context). Formula-driven lookups over hardcoded values (editable post-handoff versus disposable). Blended multi-factor score over single-metric sort (cap rate alone ignores basis and asset quality). The system articulated the rationale for each choice and asked for confirmation before executing.
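In code terms, the confirmed scoring design reduces to something like the following sketch, with Excel's PERCENTRANK approximated in Python and illustrative field names standing in for the system's actual schema:

```python
# Minimal sketch of the blended score: approximate PERCENTRANK.INC,
# then apply the confirmed 50/30/20 weights. Field names are assumed.

def percentrank(values: list[float], x: float) -> float:
    """Approximate Excel PERCENTRANK.INC: (count of values below x)
    divided by (n - 1), clamped to [0, 1]."""
    below = sum(1 for v in values if v < x)
    return min(max(below / (len(values) - 1), 0.0), 1.0)

def blended_score(listing: dict, cap_rates: list[float],
                  prices_per_bed: list[float],
                  years_built: list[float]) -> float:
    """50% going-in cap rate, 30% inverted price per bed, 20% vintage."""
    return (
        0.50 * percentrank(cap_rates, listing["cap_rate"])
        + 0.30 * (1.0 - percentrank(prices_per_bed, listing["price_per_bed"]))
        + 0.20 * percentrank(years_built, listing["year_built"])
    )
```

The bounded [0, 1] output is what makes the weights directly interpretable: a blended score near 1 means the listing ranks near the top of the set on the weighted combination, with no statistical context required.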
The human confirmed the assumption set and ranking weights in a single exchange. No methodology was prescribed, revised, or corrected at this stage. The system designed the analytical pipeline; the human validated its inputs.
Screening, ranking, and underwriting pipeline
Ingestion and enrichment. The system parsed 265 listings, filtered to rows with complete data, and wrote the parsed dataset to a recovery state file. Rent estimation used the four-tier submarket lookup confirmed by the human. Operating expenses were modeled at a flat 55% ratio. All downstream calculations (NOI, cap rate, score) were implemented as live formulas referencing the lookup table, so changing one rent tier recalculates all 265 rankings.
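A minimal sketch of the formula-driven enrichment pattern, assuming openpyxl and an illustrative cell layout (the tier labels, rents, and column positions below are assumptions, not AQ-001's actual structure):

```python
# Sketch: a visible "Tiers" lookup table plus live VLOOKUP/NOI/cap-rate
# formulas per listing row. Columns A-C (tier, units, price) are assumed
# to be filled from the parsed CSV.
from openpyxl import Workbook

wb = Workbook()
tiers = wb.active
tiers.title = "Tiers"
for i, (tier, rent) in enumerate(
        [("A", 1850), ("B", 1500), ("C", 1200), ("D", 950)], start=1):
    tiers.cell(row=i, column=1, value=tier)   # illustrative tier rents
    tiers.cell(row=i, column=2, value=rent)

data = wb.create_sheet("Listings")
data.append(["Tier", "Units", "Price", "Rent/unit", "NOI", "Cap rate"])
for row in range(2, 267):  # 265 listings
    # Rent is a lookup, not a constant: editing Tiers!B recalculates all rows.
    data.cell(row=row, column=4,
              value=f"=VLOOKUP(A{row},Tiers!$A$1:$B$4,2,FALSE)")
    # NOI at the flat 55% expense ratio: annualized rent * (1 - 0.55).
    data.cell(row=row, column=5, value=f"=D{row}*B{row}*12*(1-0.55)")
    data.cell(row=row, column=6, value=f"=E{row}/C{row}")
wb.save("screen.xlsx")
```

Because the rent column holds a lookup rather than a constant, editing one rent in the Tiers sheet recalculates rent, NOI, cap rate, and ultimately every downstream score.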
Ranking and selection. The blended PERCENTRANK score ranked all 265 listings. The system extracted the top 10, performed a sanity check (verifying that high-scoring listings reflected plausible combinations of cap rate, basis, and vintage rather than data errors), and presented the selection before proceeding.
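An illustrative version of the selection and sanity check, with sample rows standing in for the scored dataset and assumed plausibility bounds:

```python
# Flag top-ranked rows whose metric combinations look more like data
# errors than genuine opportunities. Thresholds are illustrative.
listings = [
    {"id": "mf-017", "score": 0.91, "cap_rate": 0.076, "price_per_bed": 68_000},
    {"id": "mf-203", "score": 0.88, "cap_rate": 0.148, "price_per_bed": 9_500},
]
top = sorted(listings, key=lambda d: d["score"], reverse=True)[:10]
for deal in top:
    if not (0.02 < deal["cap_rate"] < 0.12 and deal["price_per_bed"] > 20_000):
        print(f"flag {deal['id']}: implausible cap/basis, likely a data error")
```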
Underwriting. Each of the top 10 was underwritten through AQ-001, producing stabilized NOI, going-in cap rate, DSCR, levered IRR, equity multiple, cash-on-cash yield, and pass/fail against hurdle rates. The 10-year cash flow projection and sensitivity matrix in each file remained fully functional. The file-duplication strategy used to achieve this is described in the next section.
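The return metrics follow mechanically from the confirmed defaults. A self-contained sketch with illustrative price and NOI (the exit is assumed here to be valued on forward-year NOI, which AQ-001 may handle differently):

```python
# Worked sketch of the confirmed defaults: 65% LTV, 6.75% rate, 30-year
# amortization, 3% NOI growth, 5-year hold, exit cap = going-in + 50 bps.

def pmt(principal: float, rate: float, years: int) -> float:
    """Level annual payment on a fully amortizing loan."""
    return principal * rate / (1 - (1 + rate) ** -years)

def balance(principal: float, rate: float, years: int, t: int) -> float:
    """Remaining loan balance after t annual payments."""
    p = pmt(principal, rate, years)
    return principal * (1 + rate) ** t - p * ((1 + rate) ** t - 1) / rate

def irr(cashflows: list[float]) -> float:
    """Bisection IRR; assumes one sign change in the cash flow series."""
    lo, hi = -0.99, 1.0
    npv = lambda r: sum(cf / (1 + r) ** t for t, cf in enumerate(cashflows))
    for _ in range(100):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if npv(mid) > 0 else (lo, mid)
    return (lo + hi) / 2

price, noi = 10_000_000, 550_000            # illustrative 5.5% going-in cap
loan, rate = 0.65 * price, 0.0675
debt_service = pmt(loan, rate, 30)
equity = price - loan

cfs = [-equity] + [noi * 1.03 ** t - debt_service for t in range(5)]
exit_cap = noi / price + 0.0050
cfs[-1] += noi * 1.03 ** 5 / exit_cap - balance(loan, rate, 30, 5)

print(f"DSCR:            {noi / debt_service:.2f}")
print(f"Cash-on-cash:    {cfs[1] / equity:.1%}")
print(f"Levered IRR:     {irr(cfs):.1%}")
print(f"Equity multiple: {sum(cfs[1:]) / equity:.2f}x")
```

At these illustrative numbers the deal covers debt service but lands near a 7% levered IRR and a 1.4x equity multiple, well short of the hurdle, which is the same failure pattern the actual screen surfaced.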
Diagnostics. The system did not report results as a table and stop. It identified the structural reason the deals failed (negative leverage spread: going-in yields below debt cost), flagged a modeling artifact (renovation budget applied to new construction), recommended a capital structure diagnostic (unlevered re-run to isolate asset quality), and stated the actionable conclusion with specific next steps for the operator.
Template integrity in pre-defined pricing models
The system's first underwriting attempt used tab-level reconstruction: nine new worksheet tabs inside a single AQ-001 copy, with template formulas written into each new tab. This approach failed.
AQ-001's sensitivity matrix depends on 25 helper cells that replicate the cash flow engine under varied assumptions. These were not copied to the new tabs. DSCR calculated correctly; IRR and equity multiple returned blanks on 9 of 10 tabs. The system produced nine partially functional underwriting tabs without detecting the incomplete output.
The human corrected: duplicate the entire file ten times instead. File-level duplication preserves every internal dependency. The system adapted, created ten copies, batch-populated input cells in parallel, and all ten files produced complete results.
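The corrected strategy, sketched below under assumed file names, sheet name, and input addresses (none of which are AQ-001's real layout):

```python
# Whole-file duplication: copy the template intact, then write only
# input cells. Deal data here is illustrative.
import shutil
from openpyxl import load_workbook

TEMPLATE = "AQ-001.xlsx"
deals = [
    {"name": "deal_01", "price": 9_200_000, "units": 48},
    {"name": "deal_02", "price": 7_750_000, "units": 36},
    # ... one entry per top-10 listing
]

for deal in deals:
    path = f"AQ-001_{deal['name']}.xlsx"
    shutil.copyfile(TEMPLATE, path)   # atomic copy: every helper cell survives
    wb = load_workbook(path)
    ws = wb["Inputs"]                 # assumed input sheet
    ws["B2"] = deal["price"]
    ws["B3"] = deal["units"]
    wb.save(path)
```

Nothing inside the workbook is touched except input cells, so the 25 helper cells behind the sensitivity matrix, and every other internal cross-reference, survive by construction.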
This finding generalizes. Institutional pricing models contain internal engines (sensitivity matrices, waterfall calculators, debt schedules, promote structures) with cross-references that break under partial reconstruction. Any autonomous system operating on pre-defined pricing models must treat the template as an atomic unit. File-level duplication is structurally safe. Tab-level or cell-level reconstruction introduces a failure surface proportional to the template's internal complexity.
Results
All 10 deals failed the 15% IRR / 1.75x equity multiple hurdle under levered assumptions. Going-in caps ranged from 2.8% to 7.6% against a 6.75% debt cost. Five deals could not cover debt service. Leverage was compressing rather than amplifying returns across the set.
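The arithmetic of the spread is compact. With illustrative numbers, a 65% LTV loan at a roughly 7.86% annual constant (6.75% over 30 years) against a 5.5% cap rate dilutes rather than amplifies the cash yield:

```python
# Negative leverage in two lines; all figures are illustrative.
cap, ltv, loan_constant = 0.055, 0.65, 0.0786
levered_yield = (cap - ltv * loan_constant) / (1 - ltv)
print(f"{levered_yield:.2%} levered vs {cap:.2%} unlevered")  # ~1.1% vs 5.5%
```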
An unlevered re-run (requested by the human, executed by the system across all 10 files simultaneously) isolated asset quality from capital structure and produced a different ranking than the levered case. The system identified the top performer under each capital structure, flagged a modeling artifact that distorted the ranking, and produced a diagnostic recommendation.
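Mechanically, the re-run is a one-cell change repeated across the ten files; a sketch assuming the file naming and input layout from the duplication example above:

```python
# Batch unlevered re-run: zero the LTV input in every copy.
import glob
from openpyxl import load_workbook

for path in glob.glob("AQ-001_deal_*.xlsx"):
    wb = load_workbook(path)
    wb["Inputs"]["B4"] = 0.0   # assumed LTV cell; 0% isolates asset quality
    wb.save(path)
```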
The specific market conclusions are secondary to the demonstration: 265 heterogeneous assets passed through a standardized pricing model with live sensitivity engines, and the output included not just return metrics but interpretive analysis a capital allocator could act on.
Human intervention analysis
Three human interventions across the full pipeline.
Intervention 1: Assumption confirmation. The system proposed rent tiers, OpEx ratio, ranking weights, and underwriting defaults. The human confirmed in a single exchange. Type: confirmatory. Had the assumptions been wrong, catching them here avoided a full methodology rebuild after 265 rows of formulas had been written.
Intervention 2: Template duplication correction. The system attempted tab-level reconstruction; the human diagnosed the failure and directed file-level duplication. Type: corrective. This restored template integrity for the nine deals whose IRR and equity multiple had returned blanks.
Intervention 3: Unlevered diagnostic pivot. The human requested a 0% LTV re-run to isolate asset quality from capital structure. Type: directive. This produced the cleaner analytical frame for a market where going-in caps sat below debt cost.
Everything else was autonomous: methodology design, formula selection, CSV parsing, data validation, rent-tier application, scoring, ranking, sanity checking, file duplication, input population, return computation, result interpretation, artifact identification, and recommendation generation.
Intervention 1 was confirmatory (system proposed, human approved). Intervention 2 was corrective (system failed, human diagnosed). Intervention 3 was directive (human reframed the analytical question). A production system that enforces atomic template duplication and automatically runs both levered and unlevered cases would reduce the human role to confirmation only.
Context window constraints and checkpointing
The LLM's context window was exhausted twice during execution, requiring session restarts. The first occurred during enrichment (holding 265 parsed listings while constructing formulas). The second occurred during the tab-level reconstruction attempt (writing template formulas to nine sheets while maintaining conversation history).
The system mitigated both failures using a state-file checkpointing pattern: at each phase boundary, it wrote intermediate results and confirmed decisions to disk. On restart, the next session resumed from the checkpoint without re-executing completed phases. The pattern is analogous to checkpointing in distributed compute, with the constraint being memory (context tokens) rather than compute time.
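A minimal sketch of the pattern, assuming a JSON state file with illustrative keys and placeholder phase logic:

```python
# Phase-boundary checkpointing: persist results to disk after each
# phase so a restarted session can resume without redoing work.
import json, os

STATE_PATH = "pipeline_state.json"
PHASES = ["parse", "enrich", "rank", "underwrite"]

def run_phase(name: str, state: dict):
    """Placeholder for the real phase logic (parsing, enrichment, ...)."""
    return f"{name} output"

def load_state() -> dict:
    if os.path.exists(STATE_PATH):
        with open(STATE_PATH) as f:
            return json.load(f)
    return {"done": [], "artifacts": {}}

state = load_state()
for phase in PHASES:
    if phase in state["done"]:
        continue                      # restarted session skips finished phases
    state["artifacts"][phase] = run_phase(phase, state)
    state["done"].append(phase)
    with open(STATE_PATH, "w") as f:  # checkpoint at the phase boundary
        json.dump(state, f)
```

On restart, load_state() returns the persisted record, the loop skips every phase already marked done, and execution resumes at the first unfinished phase.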
Current context windows (100K to 200K tokens) are sufficient for the computation but insufficient for holding the full dataset, template structure, and conversation history simultaneously on a 265-row pipeline. This is the primary technical constraint on scaling to larger datasets or more complex model templates. Larger context windows, structured memory systems, or pipeline decomposition into independently executable phases would address it. The state-file pattern is a practical mitigation but not a solution: it introduces latency, requires human re-engagement, and depends on the system writing comprehensive state before exhaustion occurs.
Implications for pricing efficiency in heterogeneous asset markets
The efficient pricing of heterogeneous real assets has been constrained by research cost per asset. When screening 265 listings manually, the rational response is to apply rough filters and underwrite 5 to 10 that pass. The remaining 255 receive no rigorous price discovery.
The pipeline described here applied the same analytical rigor to all 265 listings that a manual process would apply to 5. If autonomous systems reduce per-asset research cost by an order of magnitude, three consequences follow.
First, the optimal screening aperture widens. A team that screens one market per quarter could screen four with the same analyst headcount, at higher analytical resolution.
Second, pricing convergence accelerates. Heterogeneous assets resist efficient pricing because research costs create information asymmetry. If more buyers evaluate more properties at higher rigor, pricing converges toward fundamental value faster. The mechanism parallels what electronic trading brought to equities, except that the cost being compressed is research rather than execution: cheaper evaluation narrows spreads on less liquid assets.
Third, the locus of human judgment shifts. The binding constraint moves from "how many deals can we underwrite" to "how good are our assumptions." Practitioner expertise becomes most valuable at the input layer (validating rent tiers, adjusting OpEx for local conditions, overriding defaults for atypical properties) rather than at the computation layer.
Limitations
The rent estimation methodology used submarket-tier averages, not unit-level comps. Actual rents within a tier vary by 15% or more. A production deployment would source rental comps for the final candidates before underwriting.
Operating expenses were modeled as a flat ratio. Bottom-up expense modeling would produce materially different results on properties with recently reassessed taxes or deferred capital needs.
The system did not autonomously override the renovation budget default for new construction, a failure of contextual reasoning that distorted one deal's ranking.
Sensitivity of the ranking to assumption changes was not formally tested. The formula-driven workbook structure supports this analysis, but it was not performed.
Two context window exhaustion events required human re-engagement. The state-file pattern mitigated data loss but did not eliminate the disruption.
The study covers one asset class, one market, and one pricing model. Generalizability to other asset classes and model architectures (full institutional pro formas with waterfall structures and multi-tranche capital stacks) requires further study.
Conclusion
Pre-defined pricing models convert investment screening from a serial research task into a parallelizable one. An autonomous system that designs its own analytical methodology, compresses assumption-setting into a confirmable set, treats pricing models as atomic units during duplication, and produces diagnostic interpretation can compress a multi-week analyst pipeline into a single session with minimal human intervention.
We think this approach generalizes to any asset class where standardized pricing models exist and input data is structured but incomplete. The constraint that will determine adoption is assumption quality. The harder the assumptions are to standardize (office with 80-tenant rent rolls, development with entitlement and construction risk), the less compressible the pipeline. For asset classes with relatively uniform operating characteristics (net lease, self-storage, manufactured housing, small multifamily), the compression ratio demonstrated here should be achievable.
The implication is structural: the efficient pricing of heterogeneous real assets has been gated by research cost, not analytical capability. Removing that gate changes how many assets receive price discovery and how quickly markets reflect fundamental value.