We built a system that constructs institutional-grade financial models, audits them against industry practice, and self-corrects until the output meets institutional standards. No human intervenes in the repair cycle. On a complex multifamily opportunistic pro forma, the system went from a 64% audit score to 97.8% across three iterations.
The architecture is a form of reinforcement learning from AI feedback (RLAIF) applied to a specific vertical. One AI system builds the model. A second AI system evaluates it against 99 domain-specific checks. A third reasons over the evaluation and issues corrections. The loop repeats. The AI feedback is the entire supervision signal.
Two findings surprised us. First, the foundation model powering the evaluator matters as much as the one powering the builder, and the best builder isn't necessarily the best auditor. Second, the evaluator's own rigidity became visible on this deal: a model scoring 97.8% was still flagged UNRELIABLE. That result surfaces a distinction we think matters beyond real estate: the difference between a defined problem and a well-defined one.
The Test Case
AQ-141 is a multifamily opportunistic acquisition model: 80 units, distressed property, phased renovation, bridge-to-permanent debt, S-curve lease-up, GP/LP waterfall with preferred returns and promote structures. Ten interconnected sheets. The kind of spreadsheet an institutional real estate shop builds when underwriting a heavy-lift deal.
A single assumption change in this model, say a 50-basis-point move in exit cap rate, propagates through bridge debt draws, interest reserve depletion, permanent loan sizing, waterfall distributions, and return metrics simultaneously. It has every structural feature that makes LLM-generated spreadsheets fail: debt that starts as a construction-style facility and converts to permanent financing mid-hold, an interest reserve that depletes month-by-month and must be tracked against actual drawn balances, and a waterfall that distributes proceeds through four sequential tiers.
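The cap-rate sensitivity is easy to make concrete. A minimal sketch of direct capitalization, using a hypothetical $1.5M forward NOI (the deal's actual NOI isn't stated here):

```python
def exit_value(forward_noi, cap_rate):
    """Direct capitalization: exit value = forward NOI / exit cap rate."""
    return forward_noi / cap_rate

# Hypothetical forward NOI; a 50 bp cap-rate move shifts gross exit
# value by roughly 9%, before any debt payoff or waterfall effects.
base  = exit_value(1_500_000, 0.050)   # $30.0M
moved = exit_value(1_500_000, 0.055)   # ~$27.3M
```

That ~9% swing in gross value is then amplified or dampened by leverage and the waterfall tiers, which is why a single assumption touches every downstream sheet.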
How It Works
The system has five stages. The first two are linear. The last three form a loop.
Specification. A human writes the model spec: what asset class, what strategy, what each sheet should contain, how formulas should reference each other.
Blueprint. Claude Opus 4.6 reads the spec and produces a construction-ready blueprint. Every sheet, every section, every formula relationship, every design decision documented. The AQ-141 blueprint runs to several thousand words, specifying details like "Phase 2 draw formula should return $0 for months outside the phase window, not Boolean FALSE."
XL-2 construction. XL-2, our model construction engine, takes the blueprint and builds the spreadsheet. It has six modules that handle the translation from a natural-language specification to a functioning Excel workbook, and 47 inline verification checks that catch structural errors during construction: formula references that don't resolve, cross-tab inconsistencies, calculations that don't reconcile. These checks are domain-agnostic. They verify computational integrity.
AUReM evaluation. The completed model goes to AUReM, our audit framework, which runs 99 checks across 12 categories. Each check is severity-weighted: Critical (3×), Major (2×), Minor (1×). These checks encode CRE domain knowledge. Does the exit valuation use forward NOI or trailing? Is the management fee based on Effective Gross Income? Are replacement reserves below the NOI line? Does the bridge balance track holdback draws? The output is structured diagnostics: check ID, pass/fail, severity, affected cells, expected versus found values, estimated financial impact.
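The severity weighting reduces to a weighted pass rate. A sketch with a hypothetical record shape (the actual AUReM schema, which also carries expected/found values and dollar impact, isn't published here):

```python
from dataclasses import dataclass, field

@dataclass
class CheckResult:
    """Illustrative shape of one AUReM diagnostic record."""
    check_id: str
    passed: bool
    severity: str                       # "critical" | "major" | "minor"
    affected_cells: list = field(default_factory=list)

WEIGHTS = {"critical": 3, "major": 2, "minor": 1}

def weighted_score(results):
    """Severity-weighted pass rate: Critical 3x, Major 2x, Minor 1x."""
    total = sum(WEIGHTS[r.severity] for r in results)
    earned = sum(WEIGHTS[r.severity] for r in results if r.passed)
    return 100.0 * earned / total
```

Under this weighting, one failed Critical costs three times what a failed Minor does, which is why v0's 13 Critical failures dragged the score to 64.4% even with most checks passing.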
Contrastive repair. XL-2 receives the AUReM report and compares what the model does against what AUReM says it should do. The orchestrator classifies each failure (formula error, plan error, or interpretation error), generates targeted repair instructions, and re-executes. The repaired model goes back to AUReM.
The last three stages repeat until convergence.
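The loop's control flow is simple. A sketch, simplified to the threshold and Critical-flag conditions, where `build`, `audit`, and `repair` are stand-ins for XL-2 construction, AUReM evaluation, and contrastive repair (the report fields are assumptions):

```python
def build_audit_repair(spec, build, audit, repair,
                       pass_threshold=95.0, max_iterations=7):
    """Closed build-audit-repair loop: construct once, then audit and
    repair until the score clears the threshold with no Critical flags,
    or the iteration cap triggers human escalation."""
    model = build(spec)
    report = None
    for _ in range(max_iterations):
        report = audit(model)
        if report["score"] >= pass_threshold and report["criticals"] == 0:
            return model, report                   # converged
        model = repair(model, report)              # targeted re-execution
    return model, {"escalated": True, "last": report}  # human takes over
```

No human sits inside the loop; the human appears only at the spec stage and, if the cap is hit, at escalation.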
Three Iterations
v0: 64.4% (UNRELIABLE)
The first build had 13 Critical failures. Three categories accounted for most of the damage.
Debt mechanics scored 40%. The bridge loan balance dropped unexpectedly at month 3 because the system ignored $2.4M in holdback draws that should have been added to the outstanding balance over months 3 through 10. Every downstream interest calculation was wrong. The permanent debt service was dynamically re-sized each year using a formula that produced $389K instead of the correct fixed payment of $276K, overstating debt service by $113K annually.
Returns metrics scored 43%. The unlevered IRR formula pointed at the levered cash flow row, reporting 20.78% instead of the true 11–12%. The equity multiple formula had a sign error that produced 0.42× instead of 1.42×.
Cash flow waterfall scored 27%. No unlevered cash flow line existed. Replacement reserves were counted twice, once in operating expenses and again below NOI.
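For reference, the fixed debt service that v0 failed to hold constant follows the standard amortizing-payment formula. A sketch with illustrative terms (the deal's actual loan size and rate aren't given here):

```python
def fixed_annual_payment(principal, rate, years):
    """Standard amortizing payment: P * r / (1 - (1 + r)**-n).
    Permanent debt service is sized once at loan conversion and held
    constant for the term; re-deriving it each year was the v0 bug."""
    return principal * rate / (1 - (1 + rate) ** -years)
```

The v0 formula instead re-sized the payment annually from the then-current balance, which is how a constant $276K obligation became a drifting $389K one.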
v1: 91.6% (FAIL)
XL-2 resolved 11 of 13 Critical failures in one repair cycle. Bridge balance tracking, permanent debt service, IRR formulas, equity multiples, exit valuation, loan payoff, reserve double-counting, and a Boolean error in the renovation draw schedule were all corrected.
Two Critical issues remained: gross potential rent was still computed from a blended average rather than a unit-type-weighted calculation, and replacement reserves were positioned above the NOI line instead of below it. Score jumped 27 points.
v2: 96.0% (PASS)
The final repair fixed both remaining Criticals. GPR was rebuilt from unit-type counts times market rents. Reserves moved below the NOI line. The GP catch-up formula was corrected. All cross-checks passed.
Two Minor issues remained, neither affecting the investment thesis: the interest reserve went slightly negative ($8,179 shortfall at Month 19), and the stabilized expense ratio ran above the typical multifamily benchmark, consistent with the heavy-renovation property profile.
The Evaluator Matters: Claude Opus 4.6 vs. GPT 5.2
AUReM is a framework: 12 categories, 99 checks, defined methodology. But running those checks against a live spreadsheet requires a foundation model that can read cell references, trace formula chains across tabs, and judge whether a formula implements the correct economic logic. The evaluation quality depends on which model runs it.
We evaluated the same v2 model with both Claude Opus 4.6 and GPT 5.2. Claude scored it 96.0% (PASS). GPT scored the populated version 89.0% (UNRELIABLE) and the clean template 97.8% (also UNRELIABLE).
The divergence was concentrated in three areas where GPT was stricter.
Rent growth. GPT identified that annual rent growth doesn't flow through the 36-month lease-up period. Revenue stays flat for the first three years because the monthly lease-up formulas use fixed rents, and the annual pro forma just sums those months. Growth only kicks in at Year 4. Claude passed this check. GPT flagged it as Critical because the stated 3% growth assumption doesn't appear in early-year revenue.
Preferred return. GPT flagged that the waterfall accrues preferred return on original contributed capital for the full hold, without reducing the base as capital is returned. For this specific deal, where no capital comes back until exit, the distinction is economically irrelevant. But the formula wouldn't generalize to a deal with interim distributions.
Draw schedule. GPT flagged the renovation draws as straight-lined (equal monthly amounts) rather than following a realistic non-linear pattern. For this deal's scope (steady-pace unit interiors), straight-line is defensible. For ground-up construction, it would be wrong.
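The preferred-return divergence above comes down to which capital base accrues the pref. A minimal sketch with hypothetical figures (a simple, non-compounding pref for illustration):

```python
def pref_accrued(base_by_year, rate):
    """Simple (non-compounding) preferred return: accrue `rate` each
    year on whatever capital base is outstanding that year."""
    return sum(base * rate for base in base_by_year)

# Hypothetical: $10M contributed, 8% pref, 5-year hold.
# No interim return of capital: both bases give the same answer,
# which is why the flaw is invisible on AQ-141.
on_original = pref_accrued([10_000_000] * 5, 0.08)

# With $4M returned at end of year 2, the unreturned-capital
# base accrues less, and the two formulas diverge.
on_unreturned = pref_accrued([10_000_000, 10_000_000,
                              6_000_000, 6_000_000, 6_000_000], 0.08)
```

On this deal the two bases are identical until exit; on a deal with interim distributions they are not, which is exactly the generalization GPT was testing.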
The populated model scored lower than the template because deal-specific data makes certain issues arithmetically verifiable. When GPR cells show $1,230,000 across three years against a stated 3% growth rate, the evaluator can confirm the growth isn't flowing. In a blank template, those cells are zero and the evaluator can only assess formula structure.
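The kind of arithmetic check this enables can be sketched directly (function and tolerance are illustrative, not AUReM's actual check):

```python
def growth_flows(annual_gpr, stated_growth, tol=0.005):
    """Does a stated annual growth rate actually appear between
    consecutive yearly gross-potential-rent figures?"""
    for prev, cur in zip(annual_gpr, annual_gpr[1:]):
        if prev == 0:
            continue  # blank-template cells carry no arithmetic signal
        if abs(cur / prev - 1 - stated_growth) > tol:
            return False
    return True

# Flat $1,230,000 against a stated 3% assumption is falsifiable;
# a zeroed template is not.
```

On zeros the check has nothing to verify, so a template can only be judged on formula structure, which is how it outscored the populated model while sharing the same flaw.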
If the evaluator is too lenient, the system converges to a local optimum that a stricter auditor would reject. The builder and the auditor serve different roles. The best model for each may not be the same.
The Convergence Curve
The v0-to-v1 jump (+27 points) came from execution-level fixes: wrong cell references, broken balance tracking, formula errors. These are problems XL-2 repairs confidently because the AUReM feedback is precise and cell-level.
The v1-to-v2 gain (+4.4 points) came from classification and convention issues: where to position reserves, how to compute catch-up, what basis to use for origination fees. These require the orchestrator to understand CRE modeling conventions, not just formula mechanics.
The GPT evaluations show that further gains require rethinking structural design decisions: how rent growth propagates through a monthly lease-up engine, whether the waterfall tracks unreturned capital, whether draw schedules reflect realistic construction spending. These are plan-level corrections, not formula-level patches.
The template evaluation is the most interesting result. At 97.8%, the formula architecture scores near-perfect, but the status is still UNRELIABLE.
Defined vs. Well-Defined
The 97.8% UNRELIABLE result exposed a limitation in AUReM that generalizes to any RLAIF system operating in a structured domain.
AUReM's checks define what a correct model looks like. But "defined" and "well-defined" are different things. A well-defined problem has a single correct answer for every input. A defined problem has evaluation criteria that are clear enough to automate but that don't fully account for the heterogeneity of real-world inputs. AUReM operates in the gap.
AQ-141 is an opportunistic deal. Its cash flows are heterogeneous in ways that stabilized assets are not. Revenue is genuinely flat during a 36-month lease-up as units come online against an S-curve absorption function; the stated rent growth rate doesn't apply until the property stabilizes. No capital is returned until exit because the entire hold generates negative or minimal free cash flow during renovation, so preferred return on original capital and preferred return on unreturned capital produce identical results. Renovation draws are straight-lined because the scope is steady-pace unit interiors, not a ground-up construction project with front-loaded foundation work.
AUReM flags all three as Critical because it evaluates generalizability. If you hand this template to a different analyst for a different deal, will the formulas still be correct? A deal with interim distributions would need unreturned-capital tracking. A deal with a longer absorption window would need rent growth during lease-up. A ground-up project would need non-linear draw schedules.
The answer is that the formulas won't transfer, and that's why the status is UNRELIABLE despite the 97.8% score.
This matters for the RLAIF framing. The reward signal is only as good as its coverage of the output space. When the output space is homogeneous (stabilized assets, predictable cash flows), AUReM's 99 checks provide comprehensive supervision. When the output space is heterogeneous (opportunistic deals where different phases of the hold have different dynamics), the reward signal develops blind spots. The system can converge to a high score and still have the evaluator flag it as unreliable, because the evaluator's definition of correctness doesn't fully cover the territory.
We're considering a framework extension: a deal-profile tag that adjusts check behavior for known cash flow patterns. Opportunistic multifamily with phased renovation would suppress the rent growth check during the absorption period while keeping it active post-stabilization. The tag wouldn't lower the bar. It would make it context-aware. In RLAIF terms, this is conditioning the reward function on a categorical property of the input rather than applying a single function uniformly.
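The conditioning we have in mind can be sketched as a gate in front of each check. Profile names, check IDs, and context fields below are hypothetical, not a shipped taxonomy:

```python
# Profile-specific rules decide *when* a check applies, rather than
# removing it. Everything not listed here runs unconditionally.
SUPPRESSIONS = {
    "opportunistic_mf_phased_reno": {
        # rent growth only has to flow once the property stabilizes
        "rent_growth_flows": lambda ctx: ctx["month"] > ctx["stabilization_month"],
    },
}

def check_applies(profile, check_id, ctx):
    """Condition the reward function on a categorical property of the
    input: every check runs by default; profile rules gate the known
    exceptions instead of lowering the bar globally."""
    rule = SUPPRESSIONS.get(profile, {}).get(check_id)
    return True if rule is None else bool(rule(ctx))
```

The same check stays Critical post-stabilization; the profile tag only tells the evaluator which months it is allowed to judge.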
Convergence Criteria
We define convergence as three conditions met simultaneously: AUReM score above 95%, zero Critical flags, and score delta below 2 percentage points between consecutive iterations. The Claude-evaluated v2 meets all three. The GPT evaluation would need at least one more cycle.
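The three conditions compose into a single predicate. A sketch (the exact report fields are assumptions):

```python
def converged(score_history, critical_flags,
              pass_threshold=95.0, max_delta=2.0):
    """All three conditions at once: latest score above threshold,
    zero Critical flags, and the score moving less than `max_delta`
    percentage points since the previous iteration."""
    if len(score_history) < 2:
        return False
    return (score_history[-1] >= pass_threshold
            and critical_flags == 0
            and abs(score_history[-1] - score_history[-2]) < max_delta)
```

Requiring two consecutive scores means a single high reading can't terminate the loop; the score has to be both good and stable.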
Score oscillation (fixing issue A introduces issue B, fixing B re-breaks A) signals an internal conflict in the repair strategy. We cap the loop at 5–7 iterations. If convergence hasn't been reached, the system escalates to a human with the full diagnostic history.
What We Learned
A closed-loop system with no human in the build-audit-repair cycle can take an institutional financial model from 64% to 97.8% in three iterations. The feedback signal doesn't need to be learned from human preferences or pairwise rankings. It can be engineered from domain expertise.
The separation between builder and evaluator maps to a clean RLAIF decomposition. XL-2's inline checks verify computational integrity ("is this a valid spreadsheet"). AUReM's checks verify domain correctness ("is this a correct real estate model"). A model can pass all of the former while failing on the latter, because the math can be perfect while the methodology violates how institutional investors actually underwrite deals.
The evaluator model matters. Claude and GPT agree on most checks but diverge where convention meets spec compliance. We now run evaluations with both and flag divergences for human review. Model disagreement identifies where the reward signal is uncertain.
The boundary condition is the gap between defined and well-defined. AUReM's reward signal is defined: clear, automatable, grounded in practice. But when the output space includes deals with heterogeneous cash flow dynamics, the definition doesn't fully cover the territory. The reward model is right about the general case. The builder is right about the specific case. Reconciling the two requires making the reward function context-aware.
We think this decomposition (separate builder from evaluator, engineer the reward signal from domain expertise, use structured feedback for iterative improvement) transfers to other verticals where the output is a complex structured artifact with verifiable correctness criteria. Financial models are an extreme case: dozens of interconnected sheets, thousands of formula dependencies, conventions that can't be inferred from the math alone. But the architecture is domain-agnostic. The domain lives in the reward model.
The XL-2 paper and AUReM framework specification are available at apers.app/post.