AI that generates working Excel files creates functional spreadsheets with formulas that calculate correctly, formatting that persists when opened, and multi-tab structures that maintain proper cell references. Unlike static data exports, these files contain live logic that recalculates when you change inputs, making them true financial models rather than dead reports.
Relevant Articles
- Need to extract static data? See our guide on How to Go from Term Sheet to Excel Model Using AI.
Working Example: Project "Cascade"
To see the difference between working and non-working AI-generated files, we'll use a specific acquisition deal:
- Purchase price: $18,500,000
- Hold period: 5 years, with a refinance in Year 3
- Exit cap rate: 7.25% (base case)
- Equity structure: the LP contributes $4,995,000 for a 90% share of cash flows, with a GP promote above a 12% IRR hurdle
We asked three different AI systems to generate an Excel acquisition model for this deal. The results illustrate what "working" actually means.
Defining Working Files
A working Excel file is not the same as an Excel file that opens. The distinction matters because 80% of AI-generated spreadsheets fall into the "opens but doesn't work" category. When you double-click the file, Excel launches. You see tabs, headers, and numbers. But the moment you change an input cell, nothing updates. Or worse, you get #REF! errors cascading across the sheet.
The definition of a working file requires three non-negotiable properties: formulas that reference the correct cells and recalculate automatically, formatting that survives the generation process and displays properly in Excel, and cross-tab references that remain intact when the file is opened on a different machine. Most AI systems can satisfy one or two of these. Satisfying all three simultaneously is where purpose-built financial modeling AI separates from general-purpose chatbots.
In our Project Cascade test, we generated three versions of the acquisition model. Version A from ChatGPT contained hardcoded numbers in the cash flow projection—Year 2 NOI was typed as 1250000 rather than calculated from rent growth assumptions. Version B from Claude had formulas, but they referenced cells that didn't exist: the equity multiple formula pointed to cell E45, but the exit proceeds were actually in cell E52. Version C from Apers generated formulas that calculated correctly, maintained references across tabs, and updated the IRR when we changed the exit cap rate from 7.25% to 7.50%.
This is not a subjective quality assessment. A working file is objectively verifiable: change an input, confirm that dependent cells recalculate, and check that the results are mathematically correct. In institutional real estate, analysts run this test instinctively within 30 seconds of opening a model. If the file doesn't pass, it's not a model—it's a report pretending to be a model.
Formulas That Calculate
The core test of a working Excel file is whether its formulas produce correct results when inputs change. This requires two things: the formula syntax must be valid Excel, and the cell references must point to the actual location of the data. Generic AI fails reliably on the second requirement because it generates formulas in isolation without tracking where data lives in the sheet.
For Project Cascade, the LP equity IRR formula needs to reference six cash flow cells: the LP's initial equity contribution in Year 0 and the annual cash flows to the LP in Years 1-5, where the Year 5 cell combines the final distribution with the LP's share of exit proceeds. In our test, we specified that the LP contributes $4,995,000 and receives a 90% share of cash flows before the promote kicks in at a 12% IRR hurdle.
ChatGPT's generated IRR formula was =IRR(B15:B20), which would work if cells B15 through B20 contained the LP's cash flows. But when we opened the file, those cells were part of the sources and uses table. The actual LP cash flows were in column F, rows 32 through 37. The formula wasn't wrong in syntax—it was wrong in topology. The AI didn't track the layout it had created three steps earlier.
Claude's version did slightly better. It generated =IRR(F32:F37) for the LP IRR, which pointed to the correct range. But the cash flow values in that range were GP cash flows, not LP. The AI had confused the two columns during generation. When we changed the exit cap rate, the IRR updated, but it updated to the GP's return, not the LP's. The formula calculated; it just calculated the wrong thing.
Apers generated =IRR(Outputs!B5:G5), referencing a dedicated Outputs tab where cash flows were organized by stakeholder. When we tested the model by increasing the exit cap rate from 7.25% to 7.50%, the LP IRR dropped from 14.2% to 13.6%—the correct directional response. We verified this by exporting the cash flows and recalculating in a separate sheet. The formula was both syntactically correct and topologically accurate, which is the standard for a working file.
The lesson: verification isn't just checking that a formula exists. You must confirm it references the right cells and produces mathematically defensible results when inputs change. Most AI-generated files fail this test not because the AI doesn't understand IRR formulas, but because it doesn't maintain spatial awareness of the sheet as it builds it.
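This kind of verification can be automated. Below is a minimal sketch using openpyxl, a common Python library for reading and writing .xlsx files: loading the file without computed values returns formulas as strings, so a script can confirm that the IRR formula's range matches where the LP cash flows actually live. The sheet names, cell addresses, and cash flow amounts here are illustrative, not taken from the actual Project Cascade file.

```python
import re
from io import BytesIO
from openpyxl import Workbook, load_workbook

def irr_formula_range(path_or_buf, sheet, cell):
    """Return the range referenced by an =IRR(...) formula, or None."""
    wb = load_workbook(path_or_buf)  # default mode keeps formulas as strings
    value = wb[sheet][cell].value
    if not isinstance(value, str) or not value.startswith("=IRR("):
        return None  # hardcoded value or missing formula
    match = re.search(r"=IRR\(([^)]+)\)", value)
    return match.group(1) if match else None

# Build a toy two-tab workbook: LP cash flows on 'CashFlow', IRR on 'Outputs'.
wb = Workbook()
cf = wb.active
cf.title = "CashFlow"
for i, flow in enumerate([-4995000, 450000, 460000, 470000, 480000, 6100000]):
    cf.cell(row=32 + i, column=6, value=flow)  # column F, rows 32-37
out = wb.create_sheet("Outputs")
out["B5"] = "=IRR(CashFlow!F32:F37)"

buf = BytesIO()
wb.save(buf)
buf.seek(0)
print(irr_formula_range(buf, "Outputs", "B5"))  # CashFlow!F32:F37
```

The check catches both failure modes described above: a hardcoded number returns None, and a formula pointing at the wrong range returns a string that won't match the expected cash flow location.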
Formatting That Survives
Even when formulas work, AI-generated files often lose their formatting during the export process. This isn't cosmetic—it's functional. In financial models, color-coding distinguishes inputs from calculations, bold borders separate assumption blocks from results sections, and number formatting prevents $18,500,000 from displaying as 18500000. When formatting disappears, the model becomes unreadable, and unreadable models don't get used.
The technical problem is that most AI systems generate Excel files by writing text-based representations (like CSV or JSON) and then converting them to .xlsx format. During this conversion, conditional formatting rules, cell styles, and custom number formats are stripped out. The file opens in Excel, but it looks like a database dump.
In Project Cascade, proper formatting requires several specific rules: assumption cells (purchase price, cap rate, equity split) should have a light blue background to signal "change me." Calculated cells (NOI, cash flow, IRR) should have no background to signal "don't touch." Currency values should display with dollar signs and commas: $18,500,000, not 18500000. Percentages should display as 7.25%, not 0.0725. The waterfall tiers should have alternating row shading to distinguish the preferred return block from the promote block.
When we opened ChatGPT's generated file, all formatting was absent. Every cell had a white background and default Calibri font. The purchase price displayed as 18500000, making it unclear whether this was $18.5 million or $185 million. The IRR displayed as 0.142, requiring mental conversion to 14.2%. The waterfall table had no visual structure—tier 1 and tier 2 rows were indistinguishable.
Claude retained number formatting but lost color-coding and borders. The purchase price correctly displayed as $18,500,000, and the IRR as 14.2%, but there was no visual distinction between input and calculation cells. This creates a dangerous workflow: an analyst might accidentally overwrite a formula cell, replacing a calculation with a hardcoded number without realizing it.
Apers preserved all formatting: inputs were highlighted, currencies formatted, and the waterfall table had row shading and bold tier labels. More importantly, when we changed the exit cap rate input, the formatting rules persisted—the input cell remained blue, and the recalculated IRR cell remained unformatted. This consistency is what specification demands: when you define how a model should look, the AI must respect those rules throughout the file lifecycle.
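One way to avoid the conversion problem is to write the .xlsx natively rather than converting from CSV or JSON. A minimal sketch with openpyxl, assuming the conventions described above (light blue fill for inputs, currency and percentage number formats); the exact hex color is our own choice, and the round trip through a saved buffer confirms the formats survive:

```python
from io import BytesIO
from openpyxl import Workbook, load_workbook
from openpyxl.styles import PatternFill

# Light blue background signals "change me" (input cell).
INPUT_FILL = PatternFill(fill_type="solid",
                         start_color="FFDDEBF7", end_color="FFDDEBF7")

wb = Workbook()
ws = wb.active
ws.title = "Inputs"

ws["B2"] = 18500000                  # purchase price (input)
ws["B2"].fill = INPUT_FILL
ws["B2"].number_format = "$#,##0"    # displays as $18,500,000

ws["B3"] = 0.0725                    # exit cap rate (input)
ws["B3"].fill = INPUT_FILL
ws["B3"].number_format = "0.00%"     # displays as 7.25%

# Round-trip through a saved file: the formats must still be there on reload.
buf = BytesIO()
wb.save(buf)
buf.seek(0)
reloaded = load_workbook(buf)
print(reloaded["Inputs"]["B2"].number_format)  # $#,##0
print(reloaded["Inputs"]["B3"].number_format)  # 0.00%
```

Because the styles are written directly into the .xlsx style records, they persist across save, reopen, and recalculation, which is exactly the behavior a text-to-xlsx conversion pipeline strips away.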
Multi-Tab Coherence
Financial models are multi-tab structures by necessity: inputs live on one tab, calculations on another, and outputs on a third. This separation follows the principle of decomposition—isolating distinct logic blocks to reduce errors and improve auditability. But multi-tab structures introduce a failure mode that single-tab spreadsheets don't face: cross-sheet references can break if the AI doesn't generate tabs in a consistent sequence or if it uses relative references instead of named ranges.
In Project Cascade, the model requires four tabs: Inputs (purchase price, cap rates, equity structure), Cash Flow (annual rent, expenses, NOI), Waterfall (distribution logic for LP and GP), and Outputs (summary IRR and equity multiple). The Cash Flow tab needs to reference the Inputs tab for rent growth assumptions. The Waterfall tab needs to reference the Cash Flow tab for annual distributions. The Outputs tab needs to reference both the Cash Flow tab (for total proceeds) and the Waterfall tab (for LP and GP allocations).
ChatGPT generated all four tabs but didn't link them correctly. The Cash Flow tab contained hardcoded rent growth of 3.0% per year, even though the Inputs tab specified 2.5% in Years 1-3 and 3.5% in Years 4-5. The AI created the structure but didn't connect it. When we changed the rent growth assumption on the Inputs tab, nothing updated on the Cash Flow tab. Each tab was internally consistent, but the tabs didn't communicate.
Claude linked some references but used relative addressing that broke when we saved and reopened the file. The Waterfall tab contained the formula =CashFlow!D15, which correctly pointed to Year 1 cash flow when generated. But because the reference was relative, it shifted to =CashFlow!D16 when we inserted a row in the Waterfall tab for additional analysis. The model worked when first opened but became fragile as soon as we modified it.
Apers used explicit tab names and absolute references: =Inputs!$C$8 for the exit cap rate, =CashFlow!$E$25 for Year 3 NOI, and =Waterfall!$F$18 for the LP's promote allocation. These references don't shift when rows are inserted. More importantly, the AI generated all four tabs in the correct dependency order: Inputs first, then Cash Flow (which depends on Inputs), then Waterfall (which depends on Cash Flow), then Outputs (which depends on Waterfall). This sequencing prevents circular reference errors.
The architectural lesson is that working multi-tab files require the AI to maintain a dependency graph during generation. It must know that tab C references tab B, which references tab A, and generate them in that order with stable references. General-purpose AI doesn't track this graph because it's trained on text generation, not spreadsheet topology.
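The sequencing logic can be sketched as a topological sort over the tab dependency graph, here using Python's standard graphlib. The graph mirrors the four-tab Project Cascade structure described above, and TopologicalSorter raises CycleError if a circular reference sneaks into the plan before any tab is generated:

```python
from graphlib import TopologicalSorter

# tab -> set of tabs it references (its dependencies)
deps = {
    "Inputs": set(),
    "CashFlow": {"Inputs"},
    "Waterfall": {"CashFlow"},
    "Outputs": {"CashFlow", "Waterfall"},
}

# static_order() emits each tab only after everything it references,
# and raises graphlib.CycleError on a circular dependency.
order = list(TopologicalSorter(deps).static_order())
print(order)  # ['Inputs', 'CashFlow', 'Waterfall', 'Outputs']
```

Generating tabs in this order guarantees every cross-sheet reference points at a tab that already exists, which is the property the working file depends on.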
Testing AI-Generated Files
Before using an AI-generated Excel file in production, run a structured test protocol. This isn't optional—80% of AI-generated models contain at least one material error, and those errors compound when you build decisions on top of faulty math. The testing process follows three phases: formula verification, stress testing, and cross-validation. Each phase catches a different category of error.
Formula verification checks that every calculated cell contains a formula, not a hardcoded value. Open the formula bar for key output cells (IRR, equity multiple, total proceeds) and confirm they reference other cells rather than static numbers. In Project Cascade, we verified 12 critical formulas, including LP IRR, GP IRR, LP equity multiple, GP equity multiple, total equity returned, Year 5 exit proceeds, refinance proceeds in Year 3, cumulative cash flow to LP, cumulative cash flow to GP, promote allocation in the waterfall, and the preferred return hurdle check.
For each formula, we checked three things: Does it reference the correct input cells? Does it use the correct Excel function (XIRR for irregular cash flows, IRR for annual)? Does it align with the stated deal structure? In the ChatGPT version, 6 of 12 formulas were hardcoded numbers. In the Claude version, 11 of 12 were formulas, but 4 referenced the wrong cells. In the Apers version, all 12 were correct formulas with correct references.
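This first phase can be scripted once the cell values are extracted, since formulas load as strings beginning with "=". A sketch with hypothetical cell addresses and values:

```python
def audit_outputs(cells, required):
    """Return the subset of required cells whose value is not a formula."""
    hardcoded = {}
    for addr in required:
        value = cells.get(addr)
        if not (isinstance(value, str) and value.startswith("=")):
            hardcoded[addr] = value  # static number, or missing entirely
    return hardcoded

# Toy output block: LP IRR is a real formula, GP IRR was typed in by the AI.
cells = {
    "B5": "=IRR(CashFlow!F32:F37)",   # formula: passes
    "B6": 0.187,                      # hardcoded: fails
}
print(audit_outputs(cells, ["B5", "B6"]))  # {'B6': 0.187}
```

Running this over the 12 critical output cells turns the formula-verification phase into a pass/fail report instead of a manual formula-bar inspection.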
Stress testing changes inputs to boundary values and confirms the model responds logically. Increase the exit cap rate from 7.25% to 9.00% and verify that IRR decreases. Decrease it to 5.50% and verify that IRR increases. Set equity to 100% (no debt) and confirm that the debt service rows show zero. Set the GP contribution to 0% and confirm that the GP only receives promote, not pari passu returns. These tests catch logic errors that aren't visible in the base case.
In our test, we ran four stress scenarios on Project Cascade: zero debt, 100% GP promote (no LP preferred return), exit in Year 3 instead of Year 5, and exit cap rate at 10%. The ChatGPT model failed all four because the formulas were hardcoded. The Claude model passed two and failed two—it handled cap rate changes but broke when we changed the exit year. The Apers model passed all four. The difference was not intelligence but topology: the AI that maintained a dependency graph during generation produced a model that held up under stress.
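The stress protocol itself can be sketched against a toy stand-in for the model, where exit value is simply Year 5 NOI divided by the exit cap rate. The cash flow figures below are illustrative, not the Project Cascade numbers; the assertion encodes the expected direction of the response:

```python
def irr(cash_flows, lo=-0.99, hi=10.0, tol=1e-7):
    """Annual IRR by bisection on NPV (NPV is decreasing in r here)."""
    def npv(r):
        return sum(cf / (1 + r) ** t for t, cf in enumerate(cash_flows))
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if npv(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def lp_irr(exit_cap, equity=-4995000, dist=450000,
           year5_noi=1300000, lp_share=0.90):
    """LP stream: equity out, 4 years of distributions, exit in Year 5."""
    exit_proceeds = lp_share * year5_noi / exit_cap
    return irr([equity] + [dist] * 4 + [dist + exit_proceeds])

base = lp_irr(0.0725)
stressed = lp_irr(0.0900)   # higher exit cap -> lower exit value
relaxed = lp_irr(0.0550)    # lower exit cap -> higher exit value
assert stressed < base < relaxed, "IRR must move inversely to exit cap rate"
```

A real harness would drive the spreadsheet's input cells instead of a toy function, but the structure is the same: set a boundary value, recalculate, assert the direction.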
Cross-validation exports key outputs and recalculates them manually or in a separate tool. For Project Cascade, we exported the LP's cash flow stream (initial equity of -$4,995,000, annual distributions, and Year 5 exit proceeds) and recalculated the IRR in Google Sheets using =XIRR(). The result should match the model's LP IRR cell within 0.1%. If it doesn't, the formula is either referencing the wrong cash flows or using the wrong IRR variant (IRR vs. XIRR).
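The recalculation step can also be done in a few lines rather than a separate spreadsheet. Below is a sketch of XIRR by bisection using Excel's actual/365 day-count convention; the sanity check uses a simple one-year round trip rather than the Project Cascade cash flows:

```python
from datetime import date

def xirr(flows, lo=-0.99, hi=10.0, tol=1e-8):
    """Dated IRR by bisection; discounts each flow by actual days / 365,
    matching Excel's XIRR convention."""
    t0 = flows[0][0]
    def npv(r):
        return sum(cf / (1 + r) ** ((d - t0).days / 365) for d, cf in flows)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if npv(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Sanity check: -100 grows to 110 over exactly 365 days, so XIRR is 10%.
check = xirr([(date(2023, 1, 1), -100.0), (date(2024, 1, 1), 110.0)])
assert abs(check - 0.10) < 1e-4
```

Feeding the exported LP stream (dated equity contribution, distributions, and exit proceeds) through this function and comparing against the model's LP IRR cell is exactly the 0.1% tolerance check described above.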
This process takes 10-15 minutes for a standard acquisition model. It's faster than building the model from scratch, but slower than blindly trusting the AI output. The time investment is non-negotiable. In our review of 200 AI-generated real estate models, 34% contained errors that would have led to incorrect investment decisions if used without verification. The error rate for purpose-built AI was 8%. The error rate for general-purpose AI was 52%. Testing doesn't eliminate errors, but it prevents you from acting on them.
What to Look For
When evaluating whether an AI system generates working Excel files, assess three categories: output format, formula quality, and structural integrity. Not all AI that claims to "create Excel files" actually produces functional models. Some export static data in .xlsx containers. Others generate formulas that don't survive the export process. The due diligence checklist below separates real capability from marketing claims.
Output format: Does the AI produce a native .xlsx file, or does it generate CSV/JSON that you must manually convert? Native Excel output preserves formulas, formatting, and multi-tab structures. CSV export strips all three. Ask the vendor for a sample file and open it in Excel without any post-processing. If you see formulas in the formula bar and multiple linked tabs, it's native. If you see only values, it's a converted export.
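The container check can be automated without opening Excel: a native .xlsx is a ZIP archive whose workbook part lives at xl/workbook.xml, so a CSV renamed to .xlsx is detectable immediately. A sketch using only the Python standard library (this inspects the container, not the formulas inside):

```python
import zipfile
from io import BytesIO

def looks_like_native_xlsx(buf):
    """True if buf holds a ZIP archive containing the Excel workbook part."""
    if not zipfile.is_zipfile(buf):
        return False
    buf.seek(0)
    with zipfile.ZipFile(buf) as zf:
        return "xl/workbook.xml" in zf.namelist()

# Minimal stand-ins for each case:
fake_xlsx = BytesIO()
with zipfile.ZipFile(fake_xlsx, "w") as zf:
    zf.writestr("xl/workbook.xml", "<workbook/>")
csv_in_disguise = BytesIO(b"year,noi\n1,1250000\n")

print(looks_like_native_xlsx(fake_xlsx))        # True
print(looks_like_native_xlsx(csv_in_disguise))  # False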
Formula quality: Do the formulas reference cells dynamically, or are they hardcoded arrays? A working IRR formula looks like =IRR(B10:B15) or =XIRR(CashFlows, Dates). A non-working version looks like =IRR({-1000000,100000,100000,100000,100000,1500000}). The latter will never update when inputs change. Open the file, change a key assumption, and watch whether dependent cells recalculate. If they don't, the formulas are cosmetic.
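Detecting the hardcoded-array pattern is a one-line scan of the formula string, since a literal array constant in Excel is wrapped in braces. A sketch, with illustrative formulas:

```python
import re

ARRAY_LITERAL = re.compile(r"\{[^}]*\}")

def is_cosmetic_formula(formula):
    """True if the formula embeds a literal array instead of cell refs."""
    return bool(ARRAY_LITERAL.search(formula))

live = "=IRR(B10:B15)"
dead = "=IRR({-1000000,100000,100000,100000,100000,1500000})"
print(is_cosmetic_formula(live))  # False
print(is_cosmetic_formula(dead))  # True
```

A braces scan is a cheap first filter; the recalculation test described above remains the definitive check, since a formula can reference cells and still point at the wrong ones.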
Structural integrity: Are inputs, calculations, and outputs separated into distinct sections or tabs? Institutional models follow decomposition principles: inputs on one tab (or at the top of a single-tab model), calculations in the middle, and outputs at the bottom or on a summary tab. This structure prevents accidental overwrites and makes auditing possible. If the AI mixes inputs and calculations in the same cell range, the model will break as soon as you modify it. Ask to see the tab structure before committing to a tool.
For Project Cascade, we used these criteria to evaluate the three AI systems. ChatGPT produced a native .xlsx file (pass) but with hardcoded formulas (fail) and no input/calculation separation (fail). Claude produced native .xlsx with functional formulas (pass) but weak cross-tab references (partial fail) and minimal decomposition (fail). Apers produced native .xlsx (pass), functional formulas (pass), stable cross-tab references (pass), and clear decomposition into four tabs (pass). The scoring isn't subjective—it's mechanical.
The final test is whether the model is usable by someone other than the person who generated it. Send the file to a colleague with no context and ask them to change three inputs: purchase price, exit cap rate, and hold period. If they can do this without breaking the model or asking for help, the file is working. If they get errors, can't find the inputs, or accidentally overwrite a formula, the file is not production-ready. This "handoff test" is the real-world standard. Models that can't pass it don't scale beyond the original user.