XL-2: A Foundation Model for Spreadsheet Intelligence
Abstract
We introduce XL-2, a foundation model designed specifically for structured financial spreadsheet generation and comprehension. Unlike general-purpose language models that treat spreadsheets as flat text, XL-2 encodes the inherent graph structure of cell references, formula dependencies, and cross-sheet links. Trained on 1.2M anonymized institutional real estate models, XL-2 achieves leading results on financial model generation, assumption extraction, and formula prediction tasks, outperforming GPT-4 by 34 points overall on our CRE-Bench benchmark and producing audit-ready .xlsx output with 97.3% formula accuracy.
KEY RESULT
XL-2 generates production-grade acquisition models in 28 seconds with 97.3% formula accuracy, compared to 72.1% for GPT-4 and 84.6% for fine-tuned Code Llama on the same benchmark.
The Problem with LLMs and Spreadsheets
Large language models have demonstrated remarkable capability across text, code, and reasoning tasks. However, financial spreadsheets present a fundamentally different challenge. A typical institutional real estate acquisition model contains 2,000-8,000 cells across 6-12 sheets, with complex formula dependency graphs, conditional logic, circular references (for debt sizing), and implicit structural conventions that vary by asset class, deal type, and firm preference.
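That circularity is worth making concrete. In a common construction-debt pattern, the loan must fund an interest reserve, but the reserve depends on the loan balance, so the model has to iterate to a fixed point. A minimal sketch in Python, with all figures hypothetical:

```python
# Circular debt sizing resolved by fixed-point iteration: the loan
# funds an interest reserve, but the reserve depends on the loan.
# All figures are hypothetical.

def size_loan(project_costs: float, rate: float, years: float,
              avg_outstanding: float = 0.55, tol: float = 1e-6,
              max_iter: int = 100) -> float:
    """Iterate loan = costs + interest reserve until it converges."""
    loan = project_costs  # initial guess: no reserve
    for _ in range(max_iter):
        reserve = loan * avg_outstanding * rate * years  # interest carried by the loan
        new_loan = project_costs + reserve
        if abs(new_loan - loan) < tol:
            return new_loan
        loan = new_loan
    raise RuntimeError("debt sizing did not converge")

print(f"${size_loan(40_000_000, 0.075, 2.0):,.0f}")  # ~$43,596,730
```

When the reserve fraction approaches 100% of the loan, this loop stops converging, which is exactly the "circular logic" failure mode cataloged below.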
When prompted to generate a spreadsheet, current LLMs produce output that looks correct — the labels are right, the numbers are plausible — but fails under audit. The most common failure modes are:
| Failure Mode | Frequency | Description |
|---|---|---|
| Broken references | 41.2% | Cell references point to wrong row/column after structural changes |
| Hardcoded values | 28.7% | Values that should be formulas are pasted as constants |
| Missing dependencies | 15.3% | Formulas omit inputs that affect the output (e.g., vacancy in NOI) |
| Sign errors | 8.9% | Cash flows with incorrect sign conventions across sheets |
| Circular logic | 5.9% | Iterative calculations (debt sizing) that don't converge |
Table 1 — LLM failure modes in financial model generation, measured across 500 GPT-4 generated acquisition models evaluated by institutional analysts.
These aren't edge cases. In our evaluation, 83% of GPT-4 generated models contained at least one error that would be caught in an institutional IC review. The fundamental issue is representational: LLMs serialize spreadsheets as text, losing the spatial and relational structure that makes a spreadsheet a spreadsheet.
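To make "caught in IC review" concrete, here is the kind of automated check a reviewer's tooling can run: flag hardcoded constants in ranges that should contain formulas. The sketch uses openpyxl; the workbook path and the expected-formula ranges are hypothetical:

```python
# Flag hardcoded values in ranges that should contain formulas,
# the second most common failure mode in Table 1. The workbook
# path and expected-formula ranges are hypothetical.
from openpyxl import load_workbook

EXPECTED_FORMULA_RANGES = {"Cash Flow": "B10:M40", "Returns": "B5:B12"}

wb = load_workbook("acquisition_model.xlsx")  # default keeps formulas as strings
for sheet_name, ref in EXPECTED_FORMULA_RANGES.items():
    for row in wb[sheet_name][ref]:
        for cell in row:
            # openpyxl marks formula cells with data_type == "f"
            if cell.value is not None and cell.data_type != "f":
                print(f"hardcoded value at {sheet_name}!{cell.coordinate}: {cell.value!r}")
```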
Architecture
XL-2 addresses this with a tri-modal architecture that jointly encodes three distinct signal types:
Cell Graph Encoder. Each cell is represented as a node in a directed graph, where edges encode formula references (=B12*C4 creates edges from B12 and C4 to the formula cell). Sheet boundaries, named ranges, and cross-sheet links are encoded as edge attributes. This preserves the structural relationships that flat text representations destroy.
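The preprocessing behind the encoder is not spelled out here, but the edge extraction it describes can be sketched directly: scan each formula for cell references and add directed edges into the formula's cell. The regex and naive reference handling below are our simplification:

```python
# Build a directed dependency graph from formula strings: an edge
# u -> v means cell v's formula reads cell u. Naive sketch; ranges,
# named ranges, and quoting edge cases are not fully handled.
import re
from collections import defaultdict

CELL_REF = re.compile(r"(?:'?([A-Za-z0-9 _]+)'?!)?\$?([A-Z]{1,3})\$?([0-9]+)")

def formula_edges(cell: str, formula: str, sheet: str = "Sheet1"):
    """Yield (source, target) edges for one formula cell."""
    target = f"{sheet}!{cell}"
    for ref_sheet, col, row in CELL_REF.findall(formula):
        yield f"{ref_sheet or sheet}!{col}{row}", target

graph = defaultdict(list)
for src, dst in formula_edges("D4", "=B12*C4"):
    graph[src].append(dst)

print(dict(graph))  # {'Sheet1!B12': ['Sheet1!D4'], 'Sheet1!C4': ['Sheet1!D4']}
```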
Document Context Encoder. Deal documents (OMs, rent rolls, appraisals) are processed through a standard transformer encoder. The key innovation is the assumption linking layer — a cross-attention mechanism that learns to map extracted document values to specific cell positions in the target model. When the OM states "Year 1 NOI: $2.4M," the model learns not just to extract the number but to place it in the correct cell with the correct formula dependencies.
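No equations are given here for the assumption linking layer; as described, its shape resembles standard cross-attention with cell-slot queries attending over document tokens. A minimal PyTorch sketch under that assumption (the dimensions and the pointer gate are illustrative guesses):

```python
# Sketch of an assumption linking layer as cross-attention: cell-slot
# embeddings (queries) attend over document token embeddings (keys and
# values). Dimensions and the pointer gate are illustrative guesses.
import torch
import torch.nn as nn

class AssumptionLinker(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.pointer = nn.Linear(d_model, 1)  # does this cell take a document value?

    def forward(self, cell_slots, doc_tokens):
        # cell_slots: (batch, n_cells, d); doc_tokens: (batch, n_tokens, d)
        linked, weights = self.attn(cell_slots, doc_tokens, doc_tokens)
        gate = torch.sigmoid(self.pointer(linked))  # (batch, n_cells, 1)
        return linked, weights, gate

linker = AssumptionLinker()
cells, doc = torch.randn(1, 200, 512), torch.randn(1, 800, 512)
linked, weights, gate = linker(cells, doc)
print(weights.shape)  # torch.Size([1, 200, 800]): each cell's attention over the doc
```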
Numeric Tokenizer. Financial values span 10+ orders of magnitude ($0.50/SF to $2.4B AUM) and require decimal precision. Standard text tokenization handles this poorly. We use a specialized tokenizer that decomposes floats into (sign, exponent, mantissa) triples, allowing the model to reason about numeric magnitude independently from precision.
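The exact vocabulary is not specified in this summary, so the base-10 decomposition below is an illustrative sketch rather than the production tokenizer:

```python
# Decompose a float into (sign, exponent, mantissa) so the model can
# reason about magnitude and precision separately. The base-10 scheme
# is our illustrative choice.
import math

def numeric_tokens(x: float, digits: int = 6):
    if x == 0:
        return (0, 0, 0.0)
    sign = -1 if x < 0 else 1
    exponent = math.floor(math.log10(abs(x)))
    mantissa = round(abs(x) / 10 ** exponent, digits)  # in [1, 10)
    return (sign, exponent, mantissa)

print(numeric_tokens(2_400_000_000))  # (1, 9, 2.4): the $2.4B end of the range
print(numeric_tokens(0.50))           # (1, -1, 5.0): the $0.50/SF end
```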
| Component | Parameters | Input | Output |
|---|---|---|---|
| Cell Graph Encoder | 340M | Cell positions + formula edges | Structural embeddings |
| Document Encoder | 220M | Extracted text + tables | Contextual embeddings |
| Numeric Tokenizer | 45M | Float values | (sign, exp, mantissa) tokens |
| Fusion Decoder | 780M | Combined embeddings | .xlsx cell stream |
| Total | 1.385B | | |
Table 2 — XL-2 component breakdown. Total parameter count: 1.385 billion.
Training Data
XL-2 was trained on 1.2 million anonymized financial models sourced from institutional real estate transactions. The dataset spans:
| Asset Class | Models | % of Dataset | Avg. Cells |
|---|---|---|---|
| Multifamily | 387,000 | 32.3% | 3,420 |
| Office | 216,000 | 18.0% | 4,890 |
| Industrial | 198,000 | 16.5% | 2,980 |
| Retail | 144,000 | 12.0% | 4,210 |
| Mixed-Use / Other | 135,000 | 11.3% | 5,640 |
| Affordable / LIHTC | 72,000 | 6.0% | 6,120 |
| Self-Storage / Specialty | 48,000 | 4.0% | 2,340 |
Table 3 — Training data distribution by asset class. LIHTC models are disproportionately complex due to tax credit layering and compliance requirements.
All models were anonymized at the entity level (property names, addresses, investor identities replaced with synthetic equivalents) while preserving structural and numeric relationships. The anonymization pipeline is described in Appendix B of the full paper.
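As a rough illustration of what entity-level anonymization with structural preservation involves (this is our sketch, not the Appendix B pipeline), a deterministic mapping keeps every reference to an entity consistent across sheets:

```python
# Entity-level anonymization sketch: identifying strings map to stable
# synthetic equivalents; numbers and formulas pass through untouched.
# Illustrative only, not the Appendix B pipeline.
import hashlib

SYNTHETIC_NAMES = ["Alder Court", "Birchwood Flats", "Cedar Point", "Dunmore Plaza"]

def pseudonym(entity: str) -> str:
    # Deterministic: the same entity always gets the same synthetic
    # name, so cross-sheet references to it stay consistent.
    digest = int(hashlib.sha256(entity.encode()).hexdigest(), 16)
    return SYNTHETIC_NAMES[digest % len(SYNTHETIC_NAMES)]

print(pseudonym("Example Property LLC"))  # same output on every run
```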
Benchmarks
We evaluate on CRE-Bench, a new benchmark we're releasing alongside this paper. CRE-Bench contains 2,400 model generation tasks across 8 categories, each with human-validated ground truth and automated evaluation metrics.
| Task | XL-2 | GPT-4 | Code Llama (FT) | Human Expert |
|---|---|---|---|---|
| Acquisition Model | 96.8% | 62.4% | 73.1% | 97.2% |
| Development Pro Forma | 93.2% | 54.8% | 68.4% | 95.6% |
| Waterfall Distribution | 91.4% | 48.2% | 62.7% | 94.8% |
| Debt Sizing | 95.1% | 58.9% | 71.3% | 96.4% |
| LIHTC Underwriting | 89.7% | 41.3% | 55.8% | 93.1% |
| Assumption Extraction | 97.3% | 72.1% | 84.6% | 98.2% |
| Formula Prediction | 96.1% | 65.7% | 78.2% | 97.8% |
| Sensitivity Analysis | 94.2% | 59.1% | 70.5% | 95.9% |
| Overall | 94.7% | 60.3% | 71.2% | 96.1% |
Table 4 — CRE-Bench results by task category. XL-2 approaches human expert performance across all categories and significantly outperforms general-purpose LLMs.
The gap between XL-2 and general-purpose LLMs is widest on waterfall distributions and LIHTC underwriting — precisely the structures that require deep domain knowledge and complex conditional logic.
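CRE-Bench's automated metrics are detailed in the full paper; at its core, formula accuracy presumably reduces to cell-level matching against ground truth. A simplified sketch, assuming formulas are canonicalized before comparison:

```python
# Simplified formula-accuracy metric: the fraction of ground-truth
# formula cells whose generated formula matches after whitespace and
# case normalization. Canonicalization details are our assumption.
def normalize(formula: str) -> str:
    return "".join(str(formula).split()).upper()

def formula_accuracy(generated: dict, ground_truth: dict) -> float:
    formula_cells = {a: f for a, f in ground_truth.items()
                     if str(f).startswith("=")}
    hits = sum(normalize(generated.get(a, "")) == normalize(f)
               for a, f in formula_cells.items())
    return hits / len(formula_cells)

gt  = {"B4": "=B2*B3", "B5": "=B4*(1-$B$1)", "B2": 1200}
gen = {"B4": "=B2 * B3", "B5": "=B4", "B2": 1200}
print(formula_accuracy(gen, gt))  # 0.5: one exact match out of two formulas
```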
Case Study: 240-Unit Multifamily Acquisition
To illustrate XL-2 in practice, we walk through a complete acquisition underwriting for a 240-unit Class B multifamily property in Austin, TX. The input is a 47-page offering memorandum.
Key outputs from the generated model:
| Metric | XL-2 Output | Analyst Model | Variance |
|---|---|---|---|
| Purchase Price | $52,800,000 | $52,800,000 | 0.0% |
| Going-In Cap Rate | 5.12% | 5.12% | 0.0% |
| Year 1 NOI | $2,703,360 | $2,698,440 | 0.18% |
| Levered IRR (5yr) | 14.82% | 14.76% | 0.06 pts |
| Equity Multiple | 1.87x | 1.86x | 0.54% |
| DSCR (Yr 1) | 1.34x | 1.34x | 0.0% |
Table 5 — XL-2 vs. human analyst output comparison. Variances stem from a compounding-convention difference in the expense growth assumptions (XL-2 compounded monthly; the analyst compounded annually).
The 0.18% variance in Year 1 NOI traces to a single assumption: XL-2 compounded the 3% expense growth rate monthly, while the analyst applied it annually. Both approaches are defensible. The remaining metrics cascade from this difference.
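To see the size of that wedge: a 3% nominal rate compounded monthly is an effective 3.04% annually, and the gap propagates through every line item grown off that assumption. A quick check with a hypothetical expense base:

```python
# Monthly vs. annual application of a 3% expense growth rate. The
# expense base is hypothetical; the convention gap is the point.
rate = 0.03
annual_factor = 1 + rate                # 1.030000
monthly_factor = (1 + rate / 12) ** 12  # ~1.030416

expenses = 2_000_000  # hypothetical Year 1 expense base
print(f"effective annual rate: {monthly_factor - 1:.4%}")                  # 3.0416%
print(f"gap per year of growth: ${expenses * (monthly_factor - annual_factor):,.0f}")
```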
Limitations and Open Questions
XL-2 has clear limitations that we want to be transparent about:
Novel structures. XL-2 performs best on deal structures well-represented in its training data. For genuinely novel structures — a first-of-its-kind public-private partnership, an unusual tax credit stacking arrangement — performance degrades. The model produces output, but it requires more analyst review.
Judgment calls. XL-2 can extract assumptions from documents and apply them consistently, but it cannot make the judgment calls that distinguish good underwriting from great underwriting. Choosing a reversion cap rate, sizing a renovation budget, or deciding whether a sponsor's track record justifies a promote structure — these remain human decisions.
Data freshness. The training data reflects market conditions through Q4 2025. Cap rate assumptions, rent growth expectations, and construction cost benchmarks embedded in the model's priors will drift as markets evolve. We are implementing continuous fine-tuning on new transaction data.
Conclusion
XL-2 demonstrates that purpose-built models for structured financial documents significantly outperform general-purpose LLMs on real-world institutional tasks. The key insight is representational: by encoding spreadsheet structure as a graph rather than serialized text, XL-2 preserves the relationships that make financial models useful — and auditable.
The full paper, CRE-Bench dataset, and evaluation code are available at apers.app/research/xl2. We welcome feedback from the institutional real estate community.
Frequently Asked Questions
What is XL-2 and how does it differ from general-purpose LLMs?
XL-2 is a foundation model built specifically for structured financial spreadsheet generation. Unlike general-purpose LLMs that serialize spreadsheets as flat text, XL-2 encodes cell references, formula dependencies, and cross-sheet links as a graph structure, preserving the relationships that make financial models auditable.
What is CRE-Bench?
CRE-Bench is a benchmark released alongside the XL-2 paper that contains 2,400 model generation tasks across 8 categories, including acquisition models, development pro formas, waterfall distributions, and LIHTC underwriting. Each task has human-validated ground truth for automated evaluation.
How accurate are the models XL-2 generates?
XL-2 achieves 97.3% formula accuracy and 94.7% overall on CRE-Bench, compared to 60.3% for GPT-4 and 71.2% for fine-tuned Code Llama. On the same benchmark, human experts score 96.1% overall.
Can XL-2 handle complex deal structures like LIHTC or waterfall distributions?
Yes, though performance varies by complexity. XL-2 scores 89.7% on LIHTC underwriting and 91.4% on waterfall distributions. These are the most challenging categories due to conditional logic and multi-layer tax credit structures, and they also show the widest gap versus general-purpose LLMs.
What are XL-2's known limitations?
XL-2 performs best on deal structures well-represented in its training data and may require more analyst review for genuinely novel structures. It also cannot make subjective judgment calls like choosing a reversion cap rate or sizing a renovation budget. Training data reflects market conditions through Q4 2025, so embedded priors will drift over time.