AI Excels at Synthesis, Not Discovery - Reflections of an R&D Engineer

Today I ran the same complex financial question through three deep research systems powered by frontier AI models. Two of them came back with the same answer - within a thousand dollars of each other - following completely different analytical paths.

The task was a real one: estimating the fair value of a stake in a private company I was evaluating as an investment. It required pulling M&A comparable transactions, building probabilistic scenario frameworks, running DCF analysis, and weighting outcomes by exit probability. Not a toy prompt - the kind of structured analysis that a financial analyst might spend days on.

Two systems delivered comprehensive reports. One failed to produce usable output at all. But what struck me was not just that two models agreed - it was that they arrived at the same number through genuinely different reasoning chains. Different data sources, different analytical framings, same landing zone.

This was the first time I saw independent AI systems converge on the same answer through different methodologies. It told me something precise about what these systems can and cannot do.

A disclaimer before we go further: this is the perspective of one R&D engineer (me), who uses SOTA AI/ML tools daily for both professional work and personal analysis.

Three Deep Research Systems, One Question

I used three deep research systems: Claude (powered by Opus 4.6), ChatGPT (powered by GPT-5.3), and Gemini (using Gemini-3-Pro). Each received the same prompt - a structured 6-step analytical framework.

1
Fundamental
Analysis
→
2
Probabilistic
Scenarios
→
3
Expected Value
Calculation
→
4
Alternative
Valuation
→
5
Comparative
Decision Table
→
6
Sensitivity
Analysis

Gemini failed to produce meaningful results - Google servers got overwhelmed mid-analysis, which has been happening to me frequently lately. Disappointing, because usally deep research done by Gemini was better than ChatGPT and cheaper than Claude.

But Claude and ChatGPT both delivered full reports. And here is where it gets interesting.

Claude (Opus 4.6)

Approach: DCF + scenario weighting

Data: Sourced specific M&A transactions, built probability-weighted expected value from 4 scenarios

Output: ~240 lines, structured with sensitivity tables

Verdict: $X

ChatGPT (GPT-5.3)

Approach: Comparable M&A multiples + risk framework

Data: Sector dynamics, annualized cost comparison, different comparable set

Output: ~190 lines, range-based estimates with risk-adjusted ranges

Verdict: $X + ~$1k

Different data. Different chains of estimation. Same landing zone - within roughly a thousand dollars of each other.

I have run many similar analyses before - total cost of ownership comparisons, market sizing, technology evaluations. In most cases, the models diverge significantly, sometimes by an order of magnitude, and I have to manually triangulate the reasonable insights and this was the first time it was different.

Confidentiality note: Financial details are intentionally kept vague. The point is the convergence pattern, not the specific numbers.

Why They Converged: Linear Combination of Known Facts

The convergence is not magic - it reveals the exact modality where current SOTA LLMs excel: linear combination of already known facts.

Both systems searched the same universe of public data: company financials, M&A precedent transactions in the sector, analyst reports, regulatory filings. They applied established analytical frameworks - DCF, comparable multiples, scenario weighting and combined them into structured outputs.

They are excellent at transforming existing knowledge: finding relevant data points across vast data sets, applying known analytical methods, and deriving new relationships from existing information. When the answer is latent in publicly available data, independent systems will converge on it - because they are fundamentally doing the same thing: sophisticated recombination.

This is genuinely useful. I saved hours of analyst work and got two independently reasoned reports that I could cross-validate. For synthesis tasks, these tools are transformative.

But this capability has a precise boundary.

Where I Saw This Before: The BeeARD Hackathon

This pattern is not new to me. In April 2024, I participated in the BeeARD hackathon - a challenge focused on AI-driven hypothesis generation in rheumatology.

Our team built a multi-agent pipeline: knowledge graphs as input, specialized agents for ontology enrichment, parallel hypothesis generation, critic evaluation, decomposition into falsifiable statements, and iterative refinement. The full stack - research agents pulling from PubMed, BioRxiv, and biomedical knowledge bases, feeding into reasoning models that generated and refined scientific hypotheses.

The models connected existing facts brilliantly. They traced paths through knowledge graphs from autoimmune mechanisms through cytokine pathways to bone density outcomes. They generated hypotheses that were well-structured, referenced, and logically coherent.

But the hypotheses were sophisticated recombinations of existing knowledge, not genuinely novel insights. They connected dots that were already in the graph - they did not discover new dots.

Key finding from BeeARD: The winning team had domain expert PhDs in rheumatology who provided the creative leap in the framework required for better flow of domain-specific knowledge. Our multi-agent system was a solid generic engine, but lack of the proper domain knowledge and intuition was likely the gap between second and first place.

The Blocker: Synthesis Is Not Innovation

This same limitation is the main blocker preventing current AI models from doing real R&D work autonomously.

They can optimize within a defined space. They can gather low-hanging fruit. They can automate structured analysis and connect known facts across domains. But they cannot innovate in the way that matters most: they cannot generate the question that nobody has asked yet.

The financial analysis converged because the answer was latent in public data - the models just had to find it and combine it correctly. The hackathon hypotheses were bounded by the knowledge graph - the models could traverse existing connections but could not imagine connections that were not yet established.

A similar example landed just yesterday. OpenAI published a preprint titled “GPT-5.2 derives a new result in theoretical physics” - framed as AI producing novel scientific knowledge. But read the methodology - human physicists computed the scattering amplitudes for specific cases by hand, obtaining increasingly complex expressions.

GPT-5.2 then simplified those expressions and spotted a pattern across the base cases. An internal scaffolded model spent 12 hours producing a formal proof. The “discovery” was pattern recognition over human-generated results - exactly the kind of synthesis these models excel at. The creative leap - asking whether single-minus gluon amplitudes might be nonzero in half-collinear kinematics - came from human physicists who had been thinking about the problem for fifteen years. The AI was a powerful tool for the tedious step of simplification and generalization, not the source of the question.

This is the synthesis-discovery spectrum, and understanding where your task falls on it is the single most important thing for knowing how much to trust AI output.

SYNTHESIS
Current AI frontier
▼
DISCOVERY

Financial analysis
Literature review
Code generation
Data integration

Novel hypotheses
Research direction
Strategic vision
Creative leaps

What This Means in Practice

From my daily experience as an R&D engineer working with AI agents extensively - and again, this is one engineer’s perspective, not universal truth:

We move faster. Dramatically faster. Tasks that took days now take hours. But we move faster because we act as managers for fleets of AI agents, directing them where to look and what to optimize.

Without human direction, agent-driven work dissolves into local optimization loops. The agents will happily refine whatever they are pointed at, but they will not step back and ask “are we even solving the right problem?”

"We don't replace thinking. We manage fleets of AI agents and tell them where to look."

For me, the highest-leverage skill now is to build environments for agents that encapsulate my hypotheses and questions, so harnessed AI can materialize a concrete solution, that can be then tested, validated and improved.

When the task requires genuine novelty - a new research direction, a creative architectural choice, a hypothesis that challenges existing assumptions - that is still on you.

Practical rule of thumb: If you know which task you are doing - synthesis or discovery - you know exactly how much to trust the AI output. Synthesis tasks get high trust. Discovery tasks get verification and human judgment.

Conclusion

In my opinion, SOTA AI models are synthesis machines, not discovery machines. The convergence of two independent models on the same financial valuation anegdotally proves it: when the answer is latent in existing data, they will find it - reliably, efficiently, and through different analytical paths.

For now, the most valuable skill is not prompt engineering or agent orchestration. It is the ability to distinguish between tasks that need synthesis and tasks that need genuine novelty. Delegate the first. Own the second.

Keep your instinct for “what question should we even be asking” sharp. That is the one capability these systems cannot synthesize from existing data - because by definition, no one has written it down yet.