1. Executive Summary
We benchmarked the Hudson Labs Co-Analyst, ChatGPT (GPT-5), and Perplexity Pro on investment research tasks involving multi-period, multi-document data extraction. The evaluation covered 25 queries, each requiring structured retrieval of figures across multiple reporting periods (e.g., quarterly results spanning several years).
We used a strict binary scoring system, awarding one point for fully correct and complete answers and zero for any errors or omissions.
The Hudson Labs Co-Analyst correctly answered 21 of the 25 queries, far outperforming GPT-5 (17/25) and Perplexity Pro (0/25). The Co-Analyst delivered more precise and consistent multi-period extractions with fewer errors and less manual intervention.
This strict scoring reflects the high-stakes nature of investment workflows, where even a single incorrect number can materially affect decisions.
Key findings:
- Higher accuracy: The Co-Analyst correctly answered 21 of 25 queries, while GPT-5 and Perplexity Pro scored 17/25 and 0/25, respectively. Every answer provided by Perplexity Pro was incomplete or contained estimates based on trends. GPT-5 did better than Perplexity Pro, but still lagged behind the Co-Analyst.
- No hallucinations: The Co-Analyst never tried to predict or estimate values when data was not available. Perplexity Pro frequently introduced spurious figures “based on trend” to fill in the gaps.
- Operational reliability: The Co-Analyst required no retries or prompt adjustments, supporting smoother workflows. By contrast, GPT-5 and Perplexity Pro had to be explicitly instructed in every prompt to use only reliable sources such as SEC filings, company press releases, or earnings call transcripts. This adjustment was intended to give both systems the best possible chance to perform on par with the Co-Analyst’s retrieval behaviour.
2. Introduction
Large language models (LLMs) are increasingly being used in investment research workflows to accelerate data gathering, reduce manual work, and surface insights from complex financial disclosures. Tools like ChatGPT and Perplexity Pro have made it possible to query the web and corporate filings conversationally, raising the question of whether general-purpose models can reliably support high-stakes financial analysis tasks.
One of the most common and operationally intensive tasks for equity analysts and fundamental investors is extracting multi-period financial data from primary sources, such as SEC filings, company press releases, and earnings call transcripts. This involves locating figures buried in structured tables and narrative commentary, spanning multiple quarters or years, and aligning them into accurate, complete time series. Errors or omissions in this process can materially impact valuation models, investment theses, and decision-making.
While benchmarks like FinanceBench have evaluated LLMs on a broad set of financial question-answering tasks, few studies focus narrowly on the precision and reliability of multi-period data retrieval — a core requirement in professional investment workflows. This report addresses that gap by benchmarking the Hudson Labs Co-Analyst, GPT-5, and Perplexity Pro on a set of 25 queries that mirror real-world analyst tasks.
GPT-5 and Perplexity Pro were not given the underlying documents directly. Instead, they were explicitly instructed in every prompt to use only reliable primary sources – SEC filings, company press releases, or earnings call transcripts – when searching for and extracting information. This design ensures that their performance reflects their ability to retrieve and ground responses in authoritative disclosures, rather than relying on secondary summaries or heuristic estimates.
The evaluation examines each tool’s ability to retrieve relevant primary sources and extract accurate, complete data across multiple reporting periods, using a strict binary scoring rubric that reflects the minimal error tolerance of institutional investment settings.
3. Methodology
We created 25 queries that mirror common analyst tasks requiring time-series data collation from primary sources such as earnings releases and transcripts. Each query required structured extraction across multiple reporting periods, often spanning different documents.
We scored each response using a simple, binary system:
- 1 point for fully correct and complete answers (all periods, no hallucinations or omissions)
- 0 points for any errors, missing data, or hallucinations.
We recognize that this binary system is strict. However, our clients operate in high-stakes investment environments, where even a single incorrect figure can compromise an analysis or investment decision. The scoring reflects this reality rather than academic tolerance for partial credit.
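For concreteness, the rubric can be expressed as a short scoring function. The snippet below is a minimal sketch that assumes each response has already been checked period-by-period against the primary sources; the field names are illustrative and not part of our internal tooling.

```python
from dataclasses import dataclass

@dataclass
class PeriodCheck:
    """Manual verification result for one reporting period in a response."""
    period: str           # e.g. "Q3 2023" (illustrative label)
    present: bool         # the period was answered at all (no omission)
    value_correct: bool   # the extracted figure matches the primary source
    hallucinated: bool    # the figure was estimated or invented rather than sourced

def score_response(checks: list[PeriodCheck]) -> int:
    """Binary rubric: 1 only if every required period is present, correct, and sourced."""
    if not checks:
        return 0
    ok = all(c.present and c.value_correct and not c.hallucinated for c in checks)
    return 1 if ok else 0
```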
Unlike many controlled benchmarks, we did not provide the underlying documents to GPT-5 or Perplexity Pro. Both were expected to find the relevant documents online through their respective search or browsing capabilities and extract the required figures. To ensure fairness and focus the evaluation on retrieval and extraction, every prompt included explicit instructions to use only reliable primary sources such as SEC filings, company press releases, or earnings call transcripts. This instruction was added after initial testing showed that GPT-5 and Perplexity Pro often relied on secondary summaries and provided incomplete answers unless explicitly directed otherwise.
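For illustration, the sketch below shows the kind of source-restriction instruction appended to every prompt. The exact wording and the query placeholders (company, metric, periods) are representative assumptions, not a verbatim copy of the prompts used in the benchmark.

```python
# Representative prompt construction; wording is illustrative, not our exact prompts.
def build_prompt(company: str, metric: str, periods: str) -> str:
    return (
        f"Report {company}'s {metric} for {periods}, listing the figure for each period. "
        "Use only reliable primary sources such as SEC filings, company press releases, "
        "or earnings call transcripts."
    )

print(build_prompt("Example Corp", "quarterly revenue", "Q1 2022 through Q2 2025"))
```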
3.1 Comparison to FinanceBench
Our methodology draws on FinanceBench, a large-scale “open-book” benchmark that evaluates models on financial question answering using real SEC filings. FinanceBench tests models’ ability to retrieve, ground, and answer questions using evidence from filings, across over 10,000 questions. (FinanceBench, arXiv:2311.11944)
While FinanceBench covers a broad range of question types with graded scoring, our benchmark focuses on a narrower task — multi-period extraction — and applies a zero-tolerance scoring system to reflect production requirements. The retrieval setup is also aligned: like FinanceBench, GPT-5 and Perplexity Pro were expected to find the relevant documents themselves, and the Co-Analyst similarly retrieves relevant documents from our content library based on the query.
3.2 Scope and Limitations
- The exercise covers both multi-period retrieval and extraction; because the tools had to locate the relevant documents themselves, results do not measure extraction in isolation.
- Results represent a point-in-time snapshot; relative performance may change as models evolve.
- Prompts were deliberately simple to reflect how analysts typically interact with these tools under time pressure.
4. Results
4.1 Overall Performance
Across the 25 benchmark queries, the Hudson Labs Co-Analyst substantially outperformed both GPT-5 and Perplexity Pro, answering 21 queries correctly versus 17 for GPT-5 and none for Perplexity Pro.
This gap highlights the Co-Analyst’s strength in reliably retrieving the correct documents and accurately extracting multi-period data. Perplexity Pro consistently failed to locate the right information, often returning incomplete or fabricated outputs. GPT-5 performed better but still failed to retrieve figures for more queries than the Co-Analyst. This likely reflects gaps in its retrieval recall and extraction consistency — GPT-5 sometimes missed relevant disclosures within filings or failed to aggregate figures spread across multiple documents and periods.
4.2 Performance by Query Type
To better understand where each tool succeeds or fails, we grouped the 25 benchmark queries into three functional categories that reflect common analyst workflows:
- Standard GAAP Metrics – Core financial line items consistently disclosed in structured tables within SEC filings (e.g., revenue, operating income, cash balances, share repurchases). These tasks primarily test a model’s ability to locate and extract standardized tabular disclosures over multiple reporting periods.
- Company-Specific Disclosures – Non-GAAP and operational KPIs that are unique to each company (e.g., benefit ratios, segment-level performance, ATM program capital raises, lease metrics). These figures are often disclosed in footnotes, supplemental tables, or segment discussions, and require more domain-specific retrieval.
- Narrative & Guidance Figures – Metrics embedded in textual commentary, such as same-store NOI growth, utilization rates, or revenue guidance. These require identifying and parsing unstructured narrative sections, often across several quarters or years.
These categories are not mutually exclusive. In practice, some queries involve elements of more than one category (for example, a KPI disclosed both in a table and within narrative commentary). The categorization reflects the dominant retrieval and extraction challenge posed by each query.
The Co-Analyst achieved near-perfect accuracy on Standard GAAP Metrics, correctly answering 12 of 13 queries. These tasks typically involve structured tabular disclosures that are straightforward for a retrieval-focused system to locate and extract. GPT-5 performed reasonably well in this category, correctly answering 9 queries, while Perplexity Pro failed to return a single correct result.
On Company-Specific Disclosures, the Co-Analyst answered 5 of 6 queries correctly, demonstrating strength in handling less standardized, domain-specific information. GPT-5 answered 4 correctly, indicating it can retrieve some specialized figures when explicitly instructed to rely on primary sources, but still lags in completeness and consistency. Perplexity Pro again returned no correct answers.
Narrative & Guidance Figures proved more difficult for all tools, as expected. The Co-Analyst and GPT-5 each correctly answered 4 queries, respectively. These tasks require accurately identifying and extracting figures embedded in narrative sections across multiple periods – a process that general-purpose models often struggle to execute reliably. Perplexity Pro failed to return any correct responses in this category as well.
5. Conclusion
This benchmark demonstrates that the Co-Analyst substantially outperforms general-purpose LLMs on multi-period financial data extraction tasks that mirror real analyst workflows. Across 25 queries drawn from real SEC filings, press releases, and earnings call transcripts, the Co-Analyst consistently delivered accurate and complete extractions with minimal intervention, while GPT-5 and Perplexity Pro showed significant gaps in reliability.
GPT-5 performed better than Perplexity Pro, correctly answering a meaningful subset of queries when explicitly instructed to rely on primary sources. However, it still struggled to consistently retrieve all relevant disclosures and aggregate figures across multiple reporting periods—particularly for company-specific and narrative-derived metrics. Perplexity Pro failed to produce any fully correct responses, often returning incomplete or fabricated figures.
These results highlight the operational importance of high-recall retrieval and precise multi-period extraction in investment research. Even a single incorrect or missing figure can materially impact downstream analyses, and general-purpose models are not yet reliable enough to automate these workflows without extensive human verification. By contrast, the Co-Analyst’s domain-tuned retrieval and structured extraction pipeline enables it to meet the accuracy standards required in institutional investment settings.
Looking ahead, the Hudson Labs team is already working to advance the Co-Analyst’s capabilities to address increasingly complex analyst workflows — including broader coverage of guidance parsing, historical restatements, richer evidence citation, and further optimization of speed and cost. These enhancements are designed to reinforce the Co-Analyst’s position as the most accurate and reliable retrieval and extraction system for financial research, and to reduce the manual workload for investment professionals.




