Scaling Core Earnings Measurement With Large Language Models

Large language models (LLMs) are capturing headlines and users as AI researchers push the limits of their capabilities. Those of us who work on understanding and analyzing corporations, whether as practitioners or researchers, are presented with two practical questions: What can these models do for us that wasn’t previously possible (their upside), and when, how, and why might they mislead (their risks)? The answers lie in the differences between LLMs and prior AI.

LLMs are, at base, neural networks trained through deep learning to predict sequences of words. This requires them to develop representations of the words’ meanings in context. Deep learning pioneer Ilya Sutskever offers an apt analogy to a detective novel: Predicting the revelation on the final page requires “understanding” the story and background (like the genre’s norms). With sufficient scale and tuning for instruction, LLMs can now answer questions and perform tasks with varying, but often remarkable, accuracy. Whether to call that “reasoning” or “merely” extraordinary pattern-matching is beyond the scope of this post. What is undeniable, though, is that LLMs can do work with natural language that previously required human judgment.

This has major implications for analyzing and understanding businesses. Previously, we had to choose between the depth of firm-specific research (“Warren Buffett-style”) vs. the breadth of quantitative analysis. The first is slow and subjective. The second is scalable and more neutral but misses crucial context. LLMs have the potential to bridge this gap because, like humans, they can “understand” texts, but, unlike us, they don’t need sleep or coffee breaks and aren’t swayed by personal interests.

In a new paper, Charles C.Y. Wang and I apply language models to estimate firms’ “core earnings” – the persistent profitability from main business activities, excluding one-time events and ancillary activities. We chose this setting because it is an important and thorny task for investors and especially suited to LLMs’ distinctive strengths.

U.S. public companies must report financial statements in accordance with Generally Accepted Accounting Principles (GAAP). Yet, GAAP’s measure of bottom-line profit, net income, is not the measure investors focus on to value shares. It includes the effects of various transitory items, like write-downs and one-off events. Equity investors, who own a share of all future profits, are interested in the persistent component, the “core” earnings. But that’s not a GAAP-defined or reported measure, so analysts and investors must calculate it themselves.
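For intuition, here is a stylized example with invented numbers (and ignoring tax effects): an analyst starts from GAAP net income, adds back one-time losses, and strips out one-time gains to approximate the core figure.

```python
# Stylized, invented numbers (in $ millions), ignoring tax effects:
# moving from GAAP net income toward an estimate of core earnings.
net_income = 420.0           # GAAP bottom line
goodwill_impairment = 150.0  # one-time write-down included in net income
gain_on_asset_sale = 35.0    # one-off gain included in net income

# Add back transitory losses; remove transitory gains.
core_earnings = net_income + goodwill_impairment - gain_on_asset_sale
print(core_earnings)  # 535.0 -- roughly the persistent component
```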

This is not a trivial task. A single expense reported on the income statement might combine recurring operational costs and one-time write-offs, requiring cross-referencing with the footnotes. The variance in reporting practices across firms (e.g., different choices of terminology or aggregation) and the inevitable judgment calls confound purely “algorithmic” approaches. And the task has grown more challenging over time, as accounting standards have increased the presence and complexity of nonrecurring items and financial disclosures have grown more bloated. While company management and equity analysts offer their own “adjusted” earnings measures, those have predictable biases and inconsistencies. In short, reliable measures of core earnings remain surprisingly elusive and costly to obtain.

We set out to determine whether and how language models could tackle this task and where they might fail, using GPT-4o, which OpenAI promotes as its state-of-the-art general model.

We began with an “out of the box” approach, providing the LLM with a definition of core earnings, the 10-K text, and instructions to estimate core earnings and show its work. We were interested in this approach for two reasons: (1) It hinges on the model’s “native” reasoning, without expert procedural oversight; (2) It mirrors an analyst delegating the entire process to ChatGPT with minimal oversight. We call this the “lazy analyst approach” as a mnemonic.
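For concreteness, here is a minimal Python sketch of what such a single-pass request might look like. The prompt wording is paraphrased from the description above, not the exact prompt from the paper, and the use of OpenAI's chat-completions API is an illustrative assumption.

```python
# Minimal sketch of the single-pass "lazy analyst" approach: one request
# containing a definition of core earnings, the 10-K text, and instructions
# to estimate core earnings and show the work. Prompt wording is illustrative.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

CORE_EARNINGS_DEFINITION = (
    "Core earnings: the persistent profitability from the firm's main "
    "business activities, excluding one-time events and ancillary activities."
)

def lazy_analyst_estimate(ten_k_text: str) -> str:
    """Delegate the entire analysis to the model in a single exchange."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": (
                f"{CORE_EARNINGS_DEFINITION}\n\n"
                "Using the 10-K filing below, estimate the firm's core earnings. "
                "Show your work, listing each adjustment you make to net income.\n\n"
                f"{ten_k_text}"
            ),
        }],
    )
    return response.choices[0].message.content
```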

We found that, under this approach, the LLM’s analysis had systematic and significant errors: it typically adjusted out interest expense (a recurring financing cost), stock-based compensation (a recurring employee cost), and depreciation/amortization (recurring allocations), all of which are recurring items that belong in core earnings. Our interpretation was that the model’s output largely followed the pattern of other financial analyses common in its training data (such as EBITDA analyses) rather than hewing to our provided definition. In short, the out-of-the-box approach failed.

Next, to determine the potential utility of the LLM for this task, we experimented on a small holdout sample and developed a refined approach. This involved giving the LLM rote procedural instructions for three serial passes over the 10-K: (1) identify unusual expenses/losses, (2) identify unusual income/gains, and (3) tabulate and aggregate into an adjusted earnings measure. We call this the “sequential prompt” as shorthand. The model’s output under this approach appeared sound to us (as professors who teach this subject). However, since our qualitative evaluations aren’t scalable or fully neutral, we next designed and implemented quantitative tests.
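A minimal sketch of how the three serial passes might be orchestrated, carrying the conversation forward between steps (again, the step wording below is paraphrased from the description above, not the exact prompts from the paper):

```python
# Minimal sketch of the three-pass "sequential prompt": each pass appends the
# model's prior answer to the conversation before issuing the next instruction.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

STEPS = [
    "Step 1: Identify all unusual or non-recurring expenses and losses in the "
    "10-K below, with amounts and where they are disclosed.",
    "Step 2: Identify all unusual or non-recurring income and gains, with "
    "amounts and where they are disclosed.",
    "Step 3: Starting from GAAP net income, tabulate the items from the prior "
    "steps and aggregate them into an adjusted (core) earnings figure.",
]

def sequential_estimate(ten_k_text: str) -> list[str]:
    """Run three serial passes over the filing and return each pass's output."""
    messages = [{"role": "user", "content": f"10-K filing:\n\n{ten_k_text}"}]
    replies = []
    for step in STEPS:
        messages.append({"role": "user", "content": step})
        response = client.chat.completions.create(model="gpt-4o", messages=messages)
        reply = response.choices[0].message.content
        replies.append(reply)
        messages.append({"role": "assistant", "content": reply})
    return replies
```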

While there is no accepted benchmark for “true core earnings” in any particular case (this was, after all, what motivated our study), we know the statistical properties a good core earnings measure should exhibit. Therefore, we can test whether and how well each candidate fits the bill. First, a good core earnings measure should be “smoother” than GAAP net income, with higher autocorrelation, if it removes transitory shocks. Second, its “adjustments” (i.e., relative to net income) should not be autocorrelated if they capture non-recurring items. Third, it should predict future earnings if it truly tracks the recurring component of earnings. Finally, it should explain market prices if it reflects the persistent earnings investors focus on for valuation.
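These properties are straightforward to check on a firm-year panel. The sketch below shows the basic logic; the column names (firm, year, net_income, core_est, market_value) and the simple pooled regressions are illustrative assumptions, not the paper’s exact specifications.

```python
# Minimal sketch of the four diagnostics on a firm-year panel (pandas +
# statsmodels). Column names and pooled OLS specifications are illustrative.
import pandas as pd
import statsmodels.formula.api as smf

def diagnostics(panel: pd.DataFrame) -> dict:
    df = panel.sort_values(["firm", "year"]).copy()
    df["adjustment"] = df["core_est"] - df["net_income"]
    for col in ["net_income", "core_est", "adjustment"]:
        df[f"{col}_lag"] = df.groupby("firm")[col].shift(1)
    df["net_income_lead"] = df.groupby("firm")["net_income"].shift(-1)

    def ac(col):
        # Pooled correlation of each variable with its within-firm lag.
        return df[[col, f"{col}_lag"]].dropna().corr().iloc[0, 1]

    return {
        # 1) Smoother than net income: higher autocorrelation of levels.
        "ac_net_income": ac("net_income"),
        "ac_core": ac("core_est"),
        # 2) Adjustments should show ~zero autocorrelation if truly transitory.
        "ac_adjustment": ac("adjustment"),
        # 3) Should predict next year's earnings better than net income does.
        "r2_predict_core": smf.ols("net_income_lead ~ core_est", df).fit().rsquared,
        "r2_predict_ni": smf.ols("net_income_lead ~ net_income", df).fit().rsquared,
        # 4) Should explain market valuations.
        "r2_value_core": smf.ols("market_value ~ core_est", df).fit().rsquared,
        "r2_value_ni": smf.ols("market_value ~ net_income", df).fit().rsquared,
    }
```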

Consistent with our qualitative observations, the core earnings measure produced by the out-of-the-box approach failed these empirical tests. Its adjustments showed high autocorrelation – repeating between periods – indicating they weren’t truly capturing non-recurring items. Further, it substantially exceeded net income on average, suggesting it tracked a higher profitability level (like EBITDA). Such a measure may be useful for other purposes but differs from core owners’ earnings.

In contrast, the sequential approach’s core earnings measure had the desired properties. It was smoother than net income but maintained a similar average level, indicating it tracked the same underlying construct. Its adjustments showed zero autocorrelation, suggesting the model successfully distinguished transitory from recurring amounts. Moreover, it did better in predicting future net income and explaining market valuations than net income itself.

The results from the sequential approach are striking and suggest that these tools may have enormous potential for reducing investors’ information-processing costs. While our analysis for our full validation sample was computationally heavy, on a per-firm basis, the costs are minimal – less than $1 and one minute for each 10-K. Further, our results may set a relatively low bar for these models’ potential value. A savvy analyst using ChatGPT to analyze a 10-K wouldn’t necessarily stop after just three exchanges and accept the output uncritically. Likewise, a quantitative investment fund analyzing contemporaneous filings (i.e., with a larger budget and fewer 10-Ks to analyze) could use more granular approaches – fine-tuning, cross-validating across prompts, etc. Meanwhile, language models (including newer “reasoning models” built on them) keep improving.

At the same time, our findings come with some important caveats. First, our findings are not unequivocally “pro-ChatGPT.” As notable as the sequential prompt’s success was the failure of the out-of-the-box approach. It is possible that errors of this type are inherent to the paradigm (autoregressive models trained on the internet corpus) and will not be remedied by future models. As professors who teach this subject matter, we were able to spot the errors, but the model’s analyses were highly fluent and might seem authoritative to others. Thus, our findings suggest that analysts still cannot fully delegate this work to LLMs, and, indeed, credulous use of them could introduce hard-to-detect errors. This provides a practical and timely lesson, as many workers feel tempted to offload some cognitive labor onto these tools.

Second, though the sequential prompt’s outputs appeared sound and showed desirable properties in large-sample tests, we cannot be sure the results produced by our approach would be valid in any specific case or meet the threshold of reliability appropriate for high-stakes decisions (e.g., concentrated investments). Our findings support only the claims we make and no more.

Third, given that GPT-4o was trained on data spanning our sample period, and several of our tests related the LLMs’ core earnings measures to future results, there is a natural concern about look-ahead bias: whether the model’s performance reflects memorized future data rather than “reasoning.” This concern applies to virtually any study of state-of-the-art LLMs on archival data, since the best-performing models all have relatively recent knowledge cutoffs and were trained on data that included the SEC’s EDGAR.

While we cannot rule out this issue entirely, we believe it does not undercut the significance of our findings. First, note that in our successful approach, the LLM is simply instructed, in separate passes, to tabulate unusual items without being told that they will be used for an adjusted earnings measure or how that result will be validated. Further, leading language models are evidently capable of interpreting financial terminology, and they are specifically fine-tuned for “instruction following.” We think it is likely that the model’s success in the sequential approach is attributable more to its ability to apply the requisite judgments to the provided context, as instructed, than to pure “memorization.” However, we cannot be certain. Interpreting neural networks’ “decision process” is challenging in general and impossible for GPT-4o, which is closed-source. At minimum, our findings suggest these models are useful for retrospective analysis of financial statement disclosures.

Our paper advances two important conversations.

First, we explore the potential and limitations of this new class of AI for core earnings measurement. This is a fundamental yet surprisingly fraught task in financial analysis, made increasingly difficult by changes in accounting standards and “disclosure bloat.” It could become standard practice to apply tools of this class to tasks like these.

Second, we hope to affect the direction of research on LLMs in financial and accounting analysis. Our study was motivated by careful consideration of these models’ distinctive capabilities and the new opportunities they present for our fields. Some research has examined LLMs’ performance on discrete tasks, such as arithmetic, declarative knowledge of facts, or quantitative prediction. But the greatest potential of language models lies in tasks suited to their distinctive strengths. We believe these are tasks that, like ours, involve integration of general background knowledge, human-like judgment, and reasoning over unstructured texts – work that mirrors (and may even partially supplant) that of white-collar knowledge workers.

This post comes to us from Matthew Shaffer, an assistant professor at the University of Southern California, and Charles C.Y. Wang, the Tandon Family Professor of Business Administration at Harvard Business School. It is based on their recent paper, “Scaling Core Earnings Measurement with Large Language Models,” available here.
