ChatGPT can't build Excel pro formas because it lacks file output capability, generates formulas with broken references, and produces unverified calculations that appear correct but fail when tested. Unlike purpose-built financial modeling tools, ChatGPT hallucinates formula syntax, cannot verify its own outputs, and has no structural understanding of how pro formas link inputs to calculations.
Relevant Articles
- Need a working solution? See our complete guide on how to build Excel models with AI.
- Want to understand what verification actually means? Review our Verification framework.
The promise is simple: describe your real estate deal to ChatGPT, and it will generate a complete Excel pro forma. The reality is far different. Users report spending hours debugging formula errors, discovering that projected NOI calculations reference wrong cells, and finding that ChatGPT's "verified" outputs fail the most basic reconciliation tests.
This article examines why ChatGPT systematically fails at pro forma construction, focusing on five specific failure modes: hallucinated outputs, missing verification protocols, structural errors, formula syntax problems, and the fundamental inability to create actual files. The evidence comes from documented user attempts across 200+ modeling requests between January 2024 and January 2025.
Looking for the solution? See our article on How to Get AI to Build Excel Models for purpose-built alternatives.
Working Example: The Failed Development Pro Forma
To document these failures with concrete evidence, we'll reference a real attempt to use ChatGPT to build a pro forma for a ground-up multifamily development project, referred to below as Project Cypress:
The user requested a development budget, construction draw schedule, operating pro forma with lease-up ramp, and cash flow waterfall. ChatGPT generated what appeared to be a complete model across four iterations. Every single iteration failed when the user attempted to implement it in Excel.
What ChatGPT Actually Outputs
ChatGPT doesn't generate Excel files. It generates text that describes what an Excel file should contain. This distinction matters because what looks correct in a chat interface fails immediately when transferred to a spreadsheet.
In the Project Cypress example, ChatGPT provided output in this format:
Tab 1: Development Budget
A1: Line Item | B1: Cost | C1: % of Total
A2: Land Acquisition | B2: $3,200,000 | C2: =B2/$B$15
A3: Hard Costs | B3: $11,500,000 | C3: =B3/$B$15
...
A15: Total | B15: =SUM(B2:B14) | C15: =SUM(C2:C14)
The user copied these formulas directly into Excel. The first error appeared in cell C2: the reference $B$15 pointed to an empty cell because ChatGPT miscounted rows. The actual total was in B16, not B15. When the user corrected this and asked ChatGPT to revise, it changed the formula to =B2/B16 but dropped the absolute reference, so the divisor shifted to B17, B18, and so on when the formula was copied down.
The second error was more subtle. ChatGPT's "Total" formula was =SUM(B2:B14), which worked in the text representation. But the user's actual budget had 16 line items, not 13. ChatGPT never asked how many line items existed. It assumed a structure based on common development budgets and generated formulas for that assumed structure. The total was off by $1,850,000.
This pattern repeated across all four tabs. ChatGPT generated formulas that worked for its imagined version of the model, not for the actual structure the user was building. When the user provided the exact row count and structure, ChatGPT adjusted the formulas but introduced new errors in cell references elsewhere. After four iterations, the user abandoned ChatGPT and built the model manually in three hours.
The core problem: ChatGPT has no way to see the actual spreadsheet it's supposedly building. It generates formulas based on patterns from its training data, not on the specific structure you need. Every formula is a hallucination—an educated guess that may or may not match your file.
The Verification Gap (Why Hallucinations Happen)
ChatGPT can't build Excel pro formas that pass basic financial tests because it has no verification protocol. It cannot check whether its outputs are internally consistent, mathematically correct, or structurally sound. This is not a fixable prompt engineering problem—it's an architectural limitation.
In traditional financial modeling, verification happens at multiple checkpoints. After building a sources and uses table, you verify that total sources equal total uses. After constructing a cash flow waterfall, you verify that distributed cash equals available cash. After projecting NOI, you verify that revenue minus expenses equals the NOI figure used downstream. These are called "zero tests"—formulas that should equal zero if the model is correct.
ChatGPT never performs zero tests because it never sees the actual numbers. It generates formulas like =B50-B51 for your verification cell, but it doesn't calculate what B50 and B51 actually are. It can't tell you whether the result is zero or $50,000 off. When users ask "Is this correct?", ChatGPT responds based on whether the formula syntax looks correct, not whether the formula produces the right answer.
The Project Cypress case demonstrates this precisely. After ChatGPT generated the development budget, the user asked: "Does Total Sources equal Total Uses?" ChatGPT responded: "Yes, the model is balanced. Total Sources in cell B28 equals Total Uses in cell B45, both showing $18,500,000." But when the user built the actual spreadsheet, B28 showed $18,500,000 and B45 showed $18,635,000. The model was off by $135,000.
What happened? ChatGPT assumed both formulas would calculate to the input figure ($18,500,000 TDC) because that's what a balanced sources and uses should show. It didn't check whether the sum of individual line items actually equaled that figure. In this case, ChatGPT had included a $135,000 contingency line item in Uses that wasn't reflected in Sources. A human modeler would catch this immediately during verification. ChatGPT never verified.
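The reconciliation the user asked for takes only a few lines of code to run. Below is a minimal zero-test sketch in Python; the individual line items are hypothetical, but the totals and the $135,000 contingency mismatch mirror the Project Cypress numbers:

# Minimal zero-test sketch: totals must reconcile before the model is trusted.
sources = {
    "Senior Loan": 12_000_000,
    "LP Equity": 5_500_000,
    "GP Equity": 1_000_000,
}  # sums to $18,500,000
uses = {
    "Land Acquisition": 3_200_000,
    "Hard Costs": 11_500_000,
    "Soft Costs": 3_800_000,
    "Contingency": 135_000,
}  # sums to $18,635,000

diff = sum(sources.values()) - sum(uses.values())
if abs(diff) > 0.01:
    print(f"Zero test FAILED: Sources minus Uses = {diff:,.2f}")  # -135,000.00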
This verification gap is why experienced analysts describe ChatGPT's output as "plausible but wrong." The formulas look correct. The structure appears logical. But the numbers don't reconcile, and ChatGPT has no mechanism to discover this until you manually test every calculation. Purpose-built modeling tools run verification automatically after every change. ChatGPT runs verification never.
The problem compounds in multi-period models. A pro forma links cells across months or years: Month 2 cash flow depends on Month 1 ending cash, which depends on Month 1 operations, which depends on Month 1 occupancy assumptions. If any single cell reference is wrong, every downstream period shows incorrect numbers. ChatGPT generates hundreds of formulas with no way to verify whether they cascade correctly. Users discover errors by manually tracing precedents—exactly the tedious work they expected AI to eliminate.
Missing Model Structure
Pro formas follow a logical sequence: inputs drive calculations, calculations drive outputs, outputs link to downstream calculations. ChatGPT has no concept of this structure. It treats every cell as an isolated task, generating formulas without understanding how they fit into the larger model architecture.
Real estate pro formas typically follow this hierarchy:
- Inputs Tab: Purchase price, equity structure, interest rate, hold period, rent growth assumptions
- Operating Pro Forma: Rental income, expenses, NOI (calculated from inputs)
- Cash Flow: NOI minus debt service, CapEx, leasing costs (pulls from Operating Pro Forma)
- Returns Calculation: Equity multiple, IRR, cash-on-cash return (pulls from Cash Flow and Inputs)
Each tab references specific cells from previous tabs. The Operating Pro Forma's Year 1 base rent should reference the "In-Place Rent" cell from Inputs. The Cash Flow's debt service should reference the "Loan Amount" and "Interest Rate" from Inputs. The Returns calculation's initial equity should reference the "LP Equity" cell from Inputs. If any of these links break, the entire model produces garbage outputs.
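To make those links concrete, here is a minimal openpyxl sketch of that skeleton in Python (the sheet names, cell addresses, NOI stub, and interest-only debt service are illustrative assumptions, not a prescribed layout):

from openpyxl import Workbook

wb = Workbook()
inp = wb.active
inp.title = "Inputs"
inp["B3"] = 0.065        # interest rate
inp["B4"] = 12_000_000   # loan amount

op = wb.create_sheet("Operating Pro Forma")
op["B10"] = 1_100_000    # NOI stub; in a real model this is calculated from Inputs

cf = wb.create_sheet("Cash Flow")
cf["B2"] = "='Operating Pro Forma'!B10"   # NOI pulled from the Operating tab
cf["B3"] = "=Inputs!B4*Inputs!B3"         # interest-only debt service, kept simple
cf["B4"] = "=B2-B3"                       # cash flow after debt service
wb.save("linked_skeleton.xlsx")

Break any of those cross-tab references and cell B4 silently reports a wrong number, which is exactly the failure mode described next.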
ChatGPT generates formulas for each tab independently. It doesn't maintain a mental map of which Inputs cells correspond to which calculation cells. When asked to generate the Operating Pro Forma, it creates reasonable-looking formulas like =B5*(1+$C$5) for Year 2 rent, but it doesn't verify that B5 is actually Year 1 rent or that C5 is actually the rent growth assumption. It picks plausible cells based on typical spreadsheet layouts.
The Project Cypress model illustrated this failure across 12 different broken references. ChatGPT's Year 1 rent formula was =Inputs!B8*Inputs!B12, meant to calculate "Number of Units × Rent per Unit." But Inputs!B8 contained the construction loan interest rate (6.5%), and Inputs!B12 contained the preferred return (8%). The formula calculated 0.52% instead of $1,687,500. ChatGPT had no way to detect this because it never saw what values those cells contained.
When the user corrected the cell references and asked for the rest of the pro forma, ChatGPT generated new formulas that referenced the wrong version of the model structure. Its Year 2 formula was =C45*(1+C3), which worked in the corrected model but had been =C42*(1+C3) in the original. The reference cell changed from C42 to C45, but downstream formulas still pointed to C42. The model was internally inconsistent.
Professional modelers solve this by using named ranges or a structured reference table. Cell B5 gets named "Units" and B6 gets named "RentPerUnit," so the rent formula reads =Units*RentPerUnit instead of =B5*B6, and the Project Cypress error would have been obvious at a glance: it would have displayed as =LoanRate*PrefReturn. If you reorganize the Inputs tab, formulas still work. ChatGPT doesn't use named ranges because it would need to track dozens of names across multiple tabs, a task that requires maintaining state across the conversation, which current LLMs do poorly.
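A minimal sketch of the named-range approach with openpyxl (3.1+ dict-style API; the unit count and per-unit rent are hypothetical values chosen to reproduce the $1,687,500 figure above):

from openpyxl import Workbook
from openpyxl.workbook.defined_name import DefinedName

wb = Workbook()
inputs = wb.active
inputs.title = "Inputs"
inputs["B5"] = 125       # number of units (hypothetical)
inputs["B6"] = 13_500    # annual rent per unit (hypothetical)

# Workbook-scoped names point at the cells, so formulas can use words, not addresses.
wb.defined_names["Units"] = DefinedName("Units", attr_text="Inputs!$B$5")
wb.defined_names["RentPerUnit"] = DefinedName("RentPerUnit", attr_text="Inputs!$B$6")

pf = wb.create_sheet("Operating Pro Forma")
pf["B2"] = "=Units*RentPerUnit"   # still resolves correctly if the Inputs rows move
wb.save("named_ranges.xlsx")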
The structural gap means every ChatGPT-generated formula is brittle. Change one input location, and you must manually review every formula that might reference it. Add a row to your Operating Pro Forma, and every formula in your Cash Flow tab might now point to the wrong row. ChatGPT cannot help you fix this systematically because it doesn't understand the model's dependency graph.
Formula Errors and Broken References
Beyond structural problems, ChatGPT makes consistent syntax errors in the formulas themselves. These errors fall into three categories: incorrect Excel functions, malformed cell ranges, and invalid logical tests.
In the Project Cypress example, ChatGPT generated this formula for the construction draw schedule:
=IF(B2>0, MIN(B2, $E$5-SUM($B$2:B2)), 0)
This formula was meant to calculate: "Draw the lesser of the scheduled amount or the remaining budget." It's a common pattern in construction models. But the formula has two errors. First, the range $B$2:B2 is a mixed reference that expands as you copy it down, which is correct for a running total. However, ChatGPT generated the formula for row 3 while anchoring the range start at row 2, so the first draw's running total was actually $B$2:B3, which double-counted the Month 1 draw. The user discovered this when total draws exceeded the construction budget by exactly the Month 1 amount.
Second, the reference $E$5 pointed to a cell ChatGPT described as "Total Budget," but in the user's actual spreadsheet, total budget was in E2. ChatGPT apparently assumed a layout where inputs started in row 5. The user moved the reference, but then Month 6's formula still showed $E$5, not $E$2, because ChatGPT had generated each month's formula separately without parameterizing the budget reference.
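For comparison, a version consistent with the user's layout, assuming the total budget sits in E2, scheduled draws in column B, and calculated draws in column C starting in row 3 (these placements are illustrative), anchors the running total so it sums only the draws above the current row:

=IF(B3>0, MIN(B3, $E$2-SUM($C$2:C2)), 0)

Entered in C3 and copied down, the expanding range $C$2:C2 always stops one row short of the current one, so no month is double-counted.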
Logical test errors appear frequently in lease-up models, which use IF() statements to phase in rental income as units come online. ChatGPT generated:
=IF(A12>=$LeaseStart, B12*$Rent*MIN((A12-$LeaseStart)/12, 1), 0)
This formula attempts to ramp rental income over 12 months starting from a "lease start" date. But it references $LeaseStart and $Rent, named ranges that don't exist in the user's file (and the $ prefix isn't valid syntax for a defined name in any case). ChatGPT hallucinated these names based on best practices, assuming the user would create them. When Excel encountered the undefined names, it returned #NAME? errors.
Even when users explicitly tell ChatGPT to avoid named ranges and use only cell references, it often mixes both. One formula might use B8 while the next uses UnitsCount, forcing users to manually convert named ranges back to addresses or vice versa for consistency.
The third category involves array formulas and functions ChatGPT uses incorrectly. For a quarterly aggregation of monthly cash flows, ChatGPT suggested:
=SUMIFS($C$2:$C$49, $A$2:$A$49, ">=1/1/2024", "<=3/31/2024")
This formula has two problems. First, the dates are hard-coded as text inside the criteria, so how Excel parses them depends on the machine's regional date settings; the comparison can silently match the wrong period or nothing at all. Second, SUMIFS() requires every criterion to be paired with its own criteria range, but ChatGPT attached two criteria to a single range, which is invalid syntax. The correct formula repeats the date range once per criterion and builds the dates explicitly with DATE():
=SUMIFS($C$2:$C$49, $A$2:$A$49, ">="&DATE(2024,1,1), $A$2:$A$49, "<="&DATE(2024,3,31))
ChatGPT will generate the correct syntax if prompted multiple times, but it defaults to the malformed version because its training data includes many examples of imprecise formula descriptions. Stack Overflow discussions and Excel forums often use shorthand like "sum where date between X and Y," which ChatGPT translates literally into broken formulas.
Users report that 60-70% of ChatGPT-generated formulas work correctly when first entered, which feels impressive until you realize that 30-40% contain errors. In a 200-line pro forma with 20 formulas per line, that's 1,200-1,600 errors. Manual debugging takes longer than building the model correctly from the start.
No File Output Capability
The most fundamental limitation: ChatGPT cannot create Excel files. This seems obvious but bears emphasis because many users expect ChatGPT to generate a downloadable .xlsx file. In the standard chat interface, it cannot (Code Interpreter is a partial exception, discussed below). It can only generate text representations of what formulas should go in which cells.
Third-party tools exist that wrap ChatGPT with file generation—you describe a model in ChatGPT's interface, a script converts the output to Excel, and you download the result. But these tools introduce their own errors because they must parse ChatGPT's natural language output and convert it to structured spreadsheet data. If ChatGPT writes "Put the total in the last row," the script must guess which row is "last." If ChatGPT writes "Copy the formula down," the script must infer how many rows to fill.
In practice, these wrappers produce files with the same formula errors, broken references, and structural problems described in previous sections—plus new formatting errors. One popular tool generates Excel files where all numbers appear as text because it doesn't distinguish between =SUM(B2:B10) (a formula) and "=SUM(B2:B10)" (a text string). Users must manually convert every formula from text to an active formula, defeating the automation purpose.
Even if file generation worked perfectly, ChatGPT would still fail at pro forma construction because it cannot iterate on an actual file. Real modeling is iterative: you build the inputs tab, check the outputs, realize you forgot a row, add it, update downstream formulas, verify again. This workflow requires seeing the current state of the file, making a change, and seeing the new state. ChatGPT sees nothing—it generates formulas based on your description, not on the actual spreadsheet.
Code Interpreter (ChatGPT's Python environment) theoretically solves this by allowing ChatGPT to write Python scripts that generate Excel files using libraries like openpyxl. Some users have succeeded with this approach for simple models. But Code Interpreter inherits all of ChatGPT's verification and structure problems. It generates Python code that creates a spreadsheet, but it doesn't verify whether the spreadsheet works correctly. If the Python script writes ws['C2'] = '=B2/$B$15' and B15 is the wrong cell, ChatGPT won't catch it.
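To see why, here is a minimal sketch in Python of the kind of script Code Interpreter produces (the file name and cell addresses are illustrative). openpyxl stores formulas as strings and never evaluates them, so the script has no way to check its own output:

from openpyxl import Workbook, load_workbook

wb = Workbook()
ws = wb.active
ws["B2"] = 3_200_000           # Land Acquisition
ws["B3"] = 11_500_000          # Hard Costs
ws["B15"] = "=SUM(B2:B14)"     # written into the file, never calculated here
ws["C2"] = "=B2/$B$15"         # if B15 is the wrong row, nothing in this script objects
wb.save("budget.xlsx")

# Reading the file back confirms no result exists until Excel itself opens it:
cached = load_workbook("budget.xlsx", data_only=True)
print(cached.active["B15"].value)   # None: the file contains no cached calculation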
Moreover, Code Interpreter struggles with the formatting, conditional formatting, data validation, and named ranges that professional models require. It can write basic formulas into cells, but it cannot replicate the structure of an institutional-grade pro forma with input dropdown menus, error-checking alerts, and scenario toggle switches. Users end up with a file that calculates numbers but requires hours of manual formatting to be usable.
The file output gap is why purpose-built modeling tools exist. They don't generate Python scripts that generate Excel files. They directly manipulate Excel objects through APIs, ensuring formulas are actually formulas, references point to the intended cells, and verification tests run automatically. ChatGPT has no such capability and won't in its current architecture.
What Purpose-Built AI Does Differently
Purpose-built financial modeling AI tools solve these problems through three architectural differences: structured decomposition, continuous verification, and direct file integration. These are not prompt engineering tricks—they are fundamental design choices that change how AI interacts with spreadsheets.
Structured decomposition means breaking the modeling task into a defined sequence of subtasks, each with explicit outputs. Instead of asking AI to "build a pro forma," purpose-built tools guide users through: define inputs, specify calculation logic, map dependencies, generate formulas, run verification tests, output file. At each step, the tool validates outputs before proceeding. If inputs are incomplete, it requests the missing data. If calculations produce mismatches, it surfaces errors immediately.
Apers implements this through its Decomposition and Verification meta-skills. Decomposition breaks a complex model into isolated calculation blocks: one block for NOI, one for debt service, one for cash flow distribution. Each block has defined inputs and outputs. The AI generates formulas for one block, verifies internal consistency, then moves to the next block. This prevents the cascading reference errors that plague ChatGPT-generated models.
Verification runs automatically after every calculation block. The tool generates formulas, calculates results using the provided inputs, and checks whether results satisfy logical constraints. For a sources and uses table, it verifies total sources equal total uses. For a waterfall, it verifies distributed cash equals available cash. For an operating pro forma, it verifies that sum of line items equals stated totals. If verification fails, the AI revises formulas before proceeding.
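In pseudocode terms, the pattern looks something like the sketch below. This is a generic illustration of block-by-block verification, not Apers' actual implementation:

from typing import Callable

# (name, build step, zero test) triples; the zero test returns the residual,
# e.g. Total Sources minus Total Uses, which should be ~0 in a correct model.
Block = tuple[str, Callable[[], None], Callable[[], float]]

def build_model(blocks: list[Block], tolerance: float = 0.01) -> None:
    for name, build, zero_test in blocks:
        build()                    # write this block's formulas into the workbook
        residual = zero_test()     # recalculate and measure the reconciliation gap
        if abs(residual) > tolerance:
            raise ValueError(f"{name}: zero test failed by {residual:,.2f}")
        # proceed only once this block reconciles; later blocks can trust it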
This verification protocol is what ChatGPT fundamentally lacks. When you ask ChatGPT, "Is this correct?", it evaluates formula syntax, not formula results. Purpose-built tools evaluate results by executing formulas against actual inputs. The Project Cypress model that failed four times in ChatGPT would have failed immediately in a verification-first tool, triggering a correction before the user wasted time implementing broken formulas.
Direct file integration means the AI writes formulas directly into an Excel file through APIs, rather than generating text for users to copy manually. This eliminates transcription errors and allows the AI to reference actual cell contents. When generating a formula for cell C2, the tool knows that B2 contains a number (or formula) and B1 contains a header. It doesn't guess cell locations—it reads the current spreadsheet state and generates formulas that reference the correct cells.
Tools like Apers use Excel's COM API (on Windows) or AppleScript (on Mac) to manipulate spreadsheets programmatically. The AI generates a formula, the API writes it to the specified cell, Excel calculates the result, and the API reads the calculated value. If the value is incorrect (say, Sources ≠ Uses), the AI knows immediately and revises the formula. This closed-loop process is impossible with ChatGPT's text-only interface.
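On Windows, that closed loop can be expressed directly against Excel's COM API. A minimal sketch using the pywin32 package (the file path, sheet name, and cell addresses are illustrative, and this is not Apers' actual code):

import win32com.client  # pywin32

excel = win32com.client.Dispatch("Excel.Application")
wb = excel.Workbooks.Open(r"C:\models\project_cypress.xlsx")
ws = wb.Worksheets("Sources and Uses")

ws.Range("B46").Formula = "=B28-B45"   # zero test: Total Sources minus Total Uses
excel.Calculate()                       # Excel, not the AI, computes the result
diff = ws.Range("B46").Value

if abs(diff) > 0.01:
    print(f"Zero test failed: off by {diff:,.2f}")  # revise formulas, then re-test

wb.Close(SaveChanges=False)
excel.Quit()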
The architectural differences mean purpose-built tools don't just "work better"—they work categorically differently. ChatGPT generates plausible formulas based on pattern matching. Purpose-built tools generate verified formulas based on logical constraints. ChatGPT requires users to debug outputs manually. Purpose-built tools surface errors automatically. ChatGPT hallucinates cell references. Purpose-built tools read actual cell contents.
For users who need an actual working Excel file, not just a conversation about what an Excel file might contain, purpose-built tools are not an incremental improvement over ChatGPT—they are the only viable option. The Verification framework we teach in our guide addresses exactly this gap: how to structure AI interactions so outputs are tested, not assumed correct.