XL-2: A Foundation Model for Spreadsheet Intelligence
Abstract
We introduce XL-2, a foundation model designed specifically for structured financial spreadsheet generation and comprehension. Unlike general-purpose language models that treat spreadsheets as flat text, XL-2 encodes the inherent graph structure of cell references, formula dependencies, and cross-sheet links. Trained on 1.2M anonymized institutional real estate models, XL-2 achieves leading results on financial model generation, assumption extraction, and formula prediction tasks, outperforming GPT-4 by 34 points overall on our CRE-Bench benchmark and producing audit-ready .xlsx output with 97.3% formula accuracy.
KEY RESULT
XL-2 generates production-grade acquisition models in 28 seconds with 97.3% formula accuracy, compared to 72.1% for GPT-4 and 84.6% for fine-tuned Code Llama on the same benchmark.
The Problem with LLMs and Spreadsheets
Large language models have demonstrated remarkable capability across text, code, and reasoning tasks. However, financial spreadsheets present a fundamentally different challenge. A typical institutional real estate acquisition model contains 2,000-8,000 cells across 6-12 sheets, with complex formula dependency graphs, conditional logic, circular references (for debt sizing), and implicit structural conventions that vary by asset class, deal type, and firm preference.
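That circularity is worth making concrete. In a common construction-debt pattern, the loan must fund an interest reserve, but the reserve depends on the loan balance, so the model has to iterate to a fixed point. A minimal sketch in Python, with all figures hypothetical:

```python
# Circular debt sizing resolved by fixed-point iteration: the loan
# funds an interest reserve, but the reserve depends on the loan.
# All figures are hypothetical.

def size_loan(project_costs: float, rate: float, years: float,
              avg_outstanding: float = 0.55, tol: float = 1e-6,
              max_iter: int = 100) -> float:
    """Iterate loan = costs + interest reserve until it converges."""
    loan = project_costs  # initial guess: no reserve
    for _ in range(max_iter):
        reserve = loan * avg_outstanding * rate * years  # interest carried by the loan
        new_loan = project_costs + reserve
        if abs(new_loan - loan) < tol:
            return new_loan
        loan = new_loan
    raise RuntimeError("debt sizing did not converge")

print(f"${size_loan(40_000_000, 0.075, 2.0):,.0f}")  # ~$43,596,730
```

When the reserve fraction approaches 100% of the loan, this loop stops converging, which is exactly the "circular logic" failure mode cataloged below.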
When prompted to generate a spreadsheet, current LLMs produce output that looks correct — the labels are right, the numbers are plausible — but fails under audit. The most common failure modes are:
| Failure Mode | Frequency | Description |
|---|---|---|
| Broken references | 41.2% | Cell references point to wrong row/column after structural changes |
| Hardcoded values | 28.7% | Values that should be formulas are pasted as constants |
| Missing dependencies | 15.3% | Formulas omit inputs that affect the output (e.g., vacancy in NOI) |
| Sign errors | 8.9% | Cash flows with incorrect sign conventions across sheets |
| Circular logic | 5.9% | Iterative calculations (debt sizing) that don't converge |
Table 1 — LLM failure modes in financial model generation, measured across 500 GPT-4 generated acquisition models evaluated by institutional analysts.
These aren't edge cases. In our evaluation, 83% of GPT-4 generated models contained at least one error that would be caught in an institutional IC review. The fundamental issue is representational: LLMs serialize spreadsheets as text, losing the spatial and relational structure that makes a spreadsheet a spreadsheet.
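To make "caught in IC review" concrete, here is the kind of automated check a reviewer's tooling can run: flag hardcoded constants in ranges that should contain formulas. The sketch uses openpyxl; the workbook path and the expected-formula ranges are hypothetical:

```python
# Flag hardcoded values in ranges that should contain formulas,
# the second most common failure mode in Table 1. The workbook
# path and expected-formula ranges are hypothetical.
from openpyxl import load_workbook

EXPECTED_FORMULA_RANGES = {"Cash Flow": "B10:M40", "Returns": "B5:B12"}

wb = load_workbook("acquisition_model.xlsx")  # default keeps formulas as strings
for sheet_name, ref in EXPECTED_FORMULA_RANGES.items():
    for row in wb[sheet_name][ref]:
        for cell in row:
            # openpyxl marks formula cells with data_type == "f"
            if cell.value is not None and cell.data_type != "f":
                print(f"hardcoded value at {sheet_name}!{cell.coordinate}: {cell.value!r}")
```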
Architecture
XL-2 addresses this with a tri-modal architecture that jointly encodes three distinct signal types:
Cell Graph Encoder. Each cell is represented as a node in a directed graph, where edges encode formula references (=B12*C4 creates edges from B12 and C4 to the formula cell). Sheet boundaries, named ranges, and cross-sheet links are encoded as edge attributes. This preserves the structural relationships that flat text representations destroy.
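The preprocessing behind the encoder is not spelled out here, but the edge extraction it describes can be sketched directly: scan each formula for cell references and add directed edges into the formula's cell. The regex and naive reference handling below are our simplification:

```python
# Build a directed dependency graph from formula strings: an edge
# u -> v means cell v's formula reads cell u. Naive sketch; ranges,
# named ranges, and quoting edge cases are not fully handled.
import re
from collections import defaultdict

CELL_REF = re.compile(r"(?:'?([A-Za-z0-9 _]+)'?!)?\$?([A-Z]{1,3})\$?([0-9]+)")

def formula_edges(cell: str, formula: str, sheet: str = "Sheet1"):
    """Yield (source, target) edges for one formula cell."""
    target = f"{sheet}!{cell}"
    for ref_sheet, col, row in CELL_REF.findall(formula):
        yield f"{ref_sheet or sheet}!{col}{row}", target

graph = defaultdict(list)
for src, dst in formula_edges("D4", "=B12*C4"):
    graph[src].append(dst)

print(dict(graph))  # {'Sheet1!B12': ['Sheet1!D4'], 'Sheet1!C4': ['Sheet1!D4']}
```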
Document Context Encoder. Deal documents (OMs, rent rolls, appraisals) are processed through a standard transformer encoder. The key innovation is the assumption linking layer — a cross-attention mechanism that learns to map extracted document values to specific cell positions in the target model. When the OM states "Year 1 NOI: $2.4M," the model learns not just to extract the number but to place it in the correct cell with the correct formula dependencies.
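No equations are given here for the assumption linking layer; as described, its shape resembles standard cross-attention with cell-slot queries attending over document tokens. A minimal PyTorch sketch under that assumption (the dimensions and the pointer gate are illustrative guesses):

```python
# Sketch of an assumption linking layer as cross-attention: cell-slot
# embeddings (queries) attend over document token embeddings (keys and
# values). Dimensions and the pointer gate are illustrative guesses.
import torch
import torch.nn as nn

class AssumptionLinker(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.pointer = nn.Linear(d_model, 1)  # does this cell take a document value?

    def forward(self, cell_slots, doc_tokens):
        # cell_slots: (batch, n_cells, d); doc_tokens: (batch, n_tokens, d)
        linked, weights = self.attn(cell_slots, doc_tokens, doc_tokens)
        gate = torch.sigmoid(self.pointer(linked))  # (batch, n_cells, 1)
        return linked, weights, gate

linker = AssumptionLinker()
cells, doc = torch.randn(1, 200, 512), torch.randn(1, 800, 512)
linked, weights, gate = linker(cells, doc)
print(weights.shape)  # torch.Size([1, 200, 800]): each cell's attention over the doc
```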
Numeric Tokenizer. Financial values span 10+ orders of magnitude ($0.50/SF to $2.4B AUM) and require decimal precision. Standard text tokenization handles this poorly. We use a specialized tokenizer that decomposes floats into (sign, exponent, mantissa) triples, allowing the model to reason about numeric magnitude independently from precision.
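The exact vocabulary is not specified in this summary, so the base-10 decomposition below is an illustrative sketch rather than the production tokenizer:

```python
# Decompose a float into (sign, exponent, mantissa) so the model can
# reason about magnitude and precision separately. The base-10 scheme
# is our illustrative choice.
import math

def numeric_tokens(x: float, digits: int = 6):
    if x == 0:
        return (0, 0, 0.0)
    sign = -1 if x < 0 else 1
    exponent = math.floor(math.log10(abs(x)))
    mantissa = round(abs(x) / 10 ** exponent, digits)  # in [1, 10)
    return (sign, exponent, mantissa)

print(numeric_tokens(2_400_000_000))  # (1, 9, 2.4): the $2.4B end of the range
print(numeric_tokens(0.50))           # (1, -1, 5.0): the $0.50/SF end
```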
| Component | Parameters | Input | Output |
|---|---|---|---|
| Cell Graph Encoder | 340M | Cell positions + formula edges | Structural embeddings |
| Document Encoder | 220M | Extracted text + tables | Contextual embeddings |
| Numeric Tokenizer | 45M | Float values | (sign, exp, mantissa) tokens |
| Fusion Decoder | 780M | Combined embeddings | .xlsx cell stream |
| Total | 1.385B | | |
Table 2 — XL-2 component breakdown. Total parameter count: 1.385 billion.
Training Data
XL-2 was trained on 1.2 million anonymized financial models sourced from institutional real estate transactions. The dataset spans:
| Asset Class | Models | % of Dataset | Avg. Cells |
|---|---|---|---|
| Multifamily | 387,000 | 32.3% | 3,420 |
| Office | 216,000 | 18.0% | 4,890 |
| Industrial | 198,000 | 16.5% | 2,980 |
| Retail | 144,000 | 12.0% | 4,210 |
| Mixed-Use / Other | 135,000 | 11.3% | 5,640 |
| Affordable / LIHTC | 72,000 | 6.0% | 6,120 |
| Self-Storage / Specialty | 48,000 | 4.0% | 2,340 |
Table 3 — Training data distribution by asset class. LIHTC models are disproportionately complex due to tax credit layering and compliance requirements.
All models were anonymized at the entity level (property names, addresses, investor identities replaced with synthetic equivalents) while preserving structural and numeric relationships. The anonymization pipeline is described in Appendix B of the full paper.
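As a rough illustration of what entity-level anonymization with structural preservation involves (this is our sketch, not the Appendix B pipeline), a deterministic mapping keeps every reference to an entity consistent across sheets:

```python
# Entity-level anonymization sketch: identifying strings map to stable
# synthetic equivalents; numbers and formulas pass through untouched.
# Illustrative only, not the Appendix B pipeline.
import hashlib

SYNTHETIC_NAMES = ["Alder Court", "Birchwood Flats", "Cedar Point", "Dunmore Plaza"]

def pseudonym(entity: str) -> str:
    # Deterministic: the same entity always gets the same synthetic
    # name, so cross-sheet references to it stay consistent.
    digest = int(hashlib.sha256(entity.encode()).hexdigest(), 16)
    return SYNTHETIC_NAMES[digest % len(SYNTHETIC_NAMES)]

print(pseudonym("Example Property LLC"))  # same output on every run
```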
Benchmarks
We evaluate on CRE-Bench, a new benchmark we're releasing alongside this paper. CRE-Bench contains 2,400 model generation tasks across 8 categories, each with human-validated ground truth and automated evaluation metrics.
| Task | XL-2 | GPT-4 | Code Llama (FT) | Human Expert |
|---|---|---|---|---|
| Acquisition Model | 96.8% | 62.4% | 73.1% | 97.2% |
| Development Pro Forma | 93.2% | 54.8% | 68.4% | 95.6% |
| Waterfall Distribution | 91.4% | 48.2% | 62.7% | 94.8% |
| Debt Sizing | 95.1% | 58.9% | 71.3% | 96.4% |
| LIHTC Underwriting | 89.7% | 41.3% | 55.8% | 93.1% |
| Assumption Extraction | 97.3% | 72.1% | 84.6% | 98.2% |
| Formula Prediction | 96.1% | 65.7% | 78.2% | 97.8% |
| Sensitivity Analysis | 94.2% | 59.1% | 70.5% | 95.9% |
| Overall | 94.7% | 60.3% | 71.2% | 96.1% |
Table 4 — CRE-Bench results by task category. XL-2 approaches human expert performance across all categories and significantly outperforms general-purpose LLMs.
The gap between XL-2 and general-purpose LLMs is widest on waterfall distributions and LIHTC underwriting — precisely the structures that require deep domain knowledge and complex conditional logic.
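CRE-Bench's automated metrics are detailed in the full paper; at its core, formula accuracy presumably reduces to cell-level matching against ground truth. A simplified sketch, assuming formulas are canonicalized before comparison:

```python
# Simplified formula-accuracy metric: the fraction of ground-truth
# formula cells whose generated formula matches after whitespace and
# case normalization. Canonicalization details are our assumption.
def normalize(formula: str) -> str:
    return "".join(str(formula).split()).upper()

def formula_accuracy(generated: dict, ground_truth: dict) -> float:
    formula_cells = {a: f for a, f in ground_truth.items()
                     if str(f).startswith("=")}
    hits = sum(normalize(generated.get(a, "")) == normalize(f)
               for a, f in formula_cells.items())
    return hits / len(formula_cells)

gt  = {"B4": "=B2*B3", "B5": "=B4*(1-$B$1)", "B2": 1200}
gen = {"B4": "=B2 * B3", "B5": "=B4", "B2": 1200}
print(formula_accuracy(gen, gt))  # 0.5: one exact match out of two formulas
```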
Case Study: 240-Unit Multifamily Acquisition
To illustrate XL-2 in practice, we walk through a complete acquisition underwriting for a 240-unit Class B multifamily property in Austin, TX. The input is a 47-page offering memorandum.
Key outputs from the generated model:
| Metric | XL-2 Output | Analyst Model | Variance |
|---|---|---|---|
| Purchase Price | $52,800,000 | $52,800,000 | 0.0% |
| Going-In Cap Rate | 5.12% | 5.12% | 0.0% |
| Year 1 NOI | $2,703,360 | $2,698,440 | 0.18% |
| Levered IRR (5yr) | 14.82% | 14.76% | 0.06 pts |
| Equity Multiple | 1.87x | 1.86x | 0.54% |
| DSCR (Yr 1) | 1.34x | 1.34x | 0.0% |
Table 5 — XL-2 vs. human analyst output comparison. Variances stem from a compounding-convention difference in the expense growth assumptions (XL-2 compounded monthly; the analyst compounded annually).
The 0.18% variance in Year 1 NOI traces to a single assumption: XL-2 compounded the 3% expense growth rate monthly, while the analyst applied it annually. Both approaches are defensible. The remaining metrics cascade from this difference.
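To see the size of that wedge: a 3% nominal rate compounded monthly is an effective 3.04% annually, and the gap propagates through every line item grown off that assumption. A quick check with a hypothetical expense base:

```python
# Monthly vs. annual application of a 3% expense growth rate. The
# expense base is hypothetical; the convention gap is the point.
rate = 0.03
annual_factor = 1 + rate                # 1.030000
monthly_factor = (1 + rate / 12) ** 12  # ~1.030416

expenses = 2_000_000  # hypothetical Year 1 expense base
print(f"effective annual rate: {monthly_factor - 1:.4%}")                  # 3.0416%
print(f"gap per year of growth: ${expenses * (monthly_factor - annual_factor):,.0f}")
```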
Limitations and Open Questions
XL-2 has clear limitations that we want to be transparent about:
Novel structures. XL-2 performs best on deal structures well-represented in its training data. For genuinely novel structures — a first-of-its-kind public-private partnership, an unusual tax credit stacking arrangement — performance degrades. The model produces output, but it requires more analyst review.
Judgment calls. XL-2 can extract assumptions from documents and apply them consistently, but it cannot make the judgment calls that distinguish good underwriting from great underwriting. Choosing a reversion cap rate, sizing a renovation budget, or deciding whether a sponsor's track record justifies a promote structure — these remain human decisions.
Data freshness. The training data reflects market conditions through Q4 2025. Cap rate assumptions, rent growth expectations, and construction cost benchmarks embedded in the model's priors will drift as markets evolve. We are implementing continuous fine-tuning on new transaction data.
Conclusion
XL-2 demonstrates that purpose-built models for structured financial documents significantly outperform general-purpose LLMs on real-world institutional tasks. The key insight is representational: by encoding spreadsheet structure as a graph rather than serialized text, XL-2 preserves the relationships that make financial models useful — and auditable.
The full paper, CRE-Bench dataset, and evaluation code are available at apers.app/research/xl2. We welcome feedback from the institutional real estate community.
Frequently Asked Questions
What is XL-2 and how does it differ from general-purpose LLMs?
XL-2 is a foundation model built specifically for structured financial spreadsheet generation. Unlike general-purpose LLMs that serialize spreadsheets as flat text, XL-2 encodes cell references, formula dependencies, and cross-sheet links as a graph structure, preserving the relationships that make financial models auditable.
What is CRE-Bench?
CRE-Bench is a benchmark released alongside the XL-2 paper that contains 2,400 model generation tasks across 8 categories, including acquisition models, development pro formas, waterfall distributions, and LIHTC underwriting. Each task has human-validated ground truth for automated evaluation.
How accurate are the models XL-2 generates?
XL-2 achieves 97.3% formula accuracy and 94.7% overall on CRE-Bench, compared to 60.3% for GPT-4 and 71.2% for fine-tuned Code Llama. On the same benchmark, human experts score 96.1% overall.
Can XL-2 handle complex deal structures like LIHTC or waterfall distributions?
Yes, though performance varies by complexity. XL-2 scores 89.7% on LIHTC underwriting and 91.4% on waterfall distributions. These are the most challenging categories due to conditional logic and multi-layer tax credit structures, and they also show the widest gap versus general-purpose LLMs.
What are XL-2's known limitations?
XL-2 performs best on deal structures well-represented in its training data and may require more analyst review for genuinely novel structures. It also cannot make subjective judgment calls like choosing a reversion cap rate or sizing a renovation budget. Training data reflects market conditions through Q4 2025, so embedded priors will drift over time.