26/02/2025
Expert Research in the Era of Agentic AI
At Aperio, the decisive intelligence we gather helps clients to make timely decisions and mitigate risk. From conducting due diligence to investigating complex cross-border matters, clients often depend on open-source intelligence (OSINT) or human source intelligence (HUMINT) to shape their strategy. Skilled research involves knowing where and how to source reliable intelligence, how to uncover bias or contradictions, and how to interpret contextually subtle findings.
With rapid advances in AI – particularly large language models (LLMs) and so-called “agentic AI” – some believe PhD-level analysis is now available in a fraction of the time required for equivalent human-based research. This prospect raises the question: how can AI technologies be harnessed without sacrificing the rigour, scepticism and contextual understanding that define expert research?
This article traces the evolution of AI in the intelligence domain, examining how these technologies have improved, where they still fall short, and how intelligence professionals can blend them with traditional approaches to deliver dependable outcomes.
The Early Era of LLM-based AI
The rise of generative AI, powered by LLMs, transformed how we interact with text-based systems. The foundation was laid in 2017 when researchers at Google introduced the Transformer architecture, revolutionising natural language processing. Early models like BERT focused on understanding language, but it was models like OpenAI’s GPT-3 that showcased the full potential of text generation. Their zero-shot capabilities, where a user could ask almost anything without providing extensive labelled data, felt revolutionary compared to traditional machine learning systems that required task-specific training for each application.
Yet initial enthusiasm gave way to caution when factual inaccuracies, inexplicable omissions, and hallucinations became apparent. The models often prioritised linguistic fluency over factual precision. This gap was serious enough that intelligence practitioners quickly realised that for mission-critical research requiring rigour and verification, first-wave LLMs could not yet replace domain experts or human research workflows.
The Second Wave: The Rise of Instruction-Tuned Models
The next wave of LLMs wasn’t just about larger datasets or better reasoning – it was about making models follow human instructions effectively. The shift began with instruction-tuned models like OpenAI’s InstructGPT (which improved upon the original GPT-3 “davinci” models) and later GPT-3.5 and GPT-4, which used reinforcement learning from human feedback (RLHF) to refine responses. This marked the transition from raw text prediction to conversational AI, enabling models to follow step-by-step reasoning, contextualise answers, and engage in multi-turn dialogues – a leap forward in usability and reliability.
Accuracy improved, but certain core limitations remained. Even with advanced methods such as Context-Aware Decoding (CAD), Retrieval-Augmented Generation (RAG), variants of Chain-of-Thought prompting, and mechanisms to evaluate multiple potential answers, reliance on outdated or variable-quality data still produced fabricated answers and erroneous outputs.
Data bias and “black box” opacity persisted, too. Although Chain-of-Thought-style prompting made reasoning steps more transparent, practitioners found it difficult to vet each step for factual accuracy, not least because this reasoning forms part of the intellectual property of AI companies.
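To illustrate one of the mitigation techniques mentioned above, the sketch below shows a self-consistency-style check: sample several reasoned completions and keep the most common final answer. The generate() function is a hypothetical stand-in for a chat-completion API call, not any particular vendor’s interface.

```python
from collections import Counter

def self_consistent_answer(question, generate, n=5):
    """Sample n step-by-step completions and majority-vote the final line."""
    finals = []
    for _ in range(n):
        completion = generate(
            "Think step by step, then give your final answer "
            f"on the last line.\nQ: {question}",
            temperature=0.8,  # some sampling diversity across runs
        )
        finals.append(completion.strip().splitlines()[-1])
    # The answer most samples converge on is kept; disagreement across
    # samples is itself a useful warning sign that the model is guessing.
    return Counter(finals).most_common(1)[0][0]
```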
The Third Wave: “Agentic AI”
Enter the so-called “third wave”: “agentic AI”, featuring systems like OpenAI’s o3, Google’s Gemini 2.0, and DeepSeek’s R1. These models may be used to invoke Computer-Using Agents (CUAs) to perform iterative tasks such as scheduling automated web searches, comparing data across platforms, or consolidating relevant documents through multiple steps, mimicking a real human researcher. In theory, this approach can overcome some of the older models’ shortcomings: by actively planning and evaluating contradictory information, the AI approximates a more human-like problem-solving process, inching closer to thorough investigative research.
However, these AI systems raise new questions: how autonomous do we want AI to be in scouring private data or paywalled documents? How do we guarantee that an AI’s exploration doesn’t overstep legal boundaries, infringe on privacy, or inadvertently embed existing biases deeper into algorithms? How is an AI system able to accurately reflect the intuition and experience of a skilled researcher?
Key Challenges in AI-Driven Intelligence Research
Data Quality and Training Bottlenecks
A large language model’s output is only as good as the data on which it is trained. Curating reliable, representative data across subjects – while avoiding bias and misinformation – is an expensive and time-consuming endeavour, even with newer techniques such as reinforcement learning and reward modelling. Today’s internet, increasingly saturated with AI-generated content, risks creating echo chambers of self-referential text.
Developers have responded with smaller, specialised models and distillation techniques (where a large model “teaches” a smaller one), increased training data curation, and stricter guardrails. But these solutions can inadvertently suppress certain lines of reasoning or block certain facts, making the decision-making process opaque. “Garbage In, Garbage Out” still applies; if a model is trained on questionable sources or subject to poorly implemented corporate restrictions, it will produce equally questionable outputs, and its reasoning may be impaired as a result.
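As a rough illustration of the distillation idea mentioned above, the sketch below shows the classic soft-label loss in which a teacher model’s output distribution supervises a student, following Hinton et al.’s formulation. The surrounding training loop is omitted, and the code is illustrative rather than a description of any vendor’s pipeline.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """Soft-label distillation: penalise divergence between the softened
    teacher and student output distributions."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence, scaled by T^2 as in Hinton et al. (2015)
    return F.kl_div(log_student, soft_teacher,
                    reduction="batchmean") * temperature ** 2
```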
In intelligence work, the cost of being wrong is high. A single training set dominated by biased sources or incomplete archives could distort entire lines of inquiry. While “reasoning” LLM responses appear confident and fluent, real experts must evaluate whether the underlying data is accurate, timely, and relevant.
Availability of Relevant Data
The promise of agentic AI is that it can “dig deeper” into obscure corners of the web or specialised datasets beyond typical search engines. In principle, this can revolutionise investigations and due diligence, letting an AI sift sources such as corporate registries, multilingual legal filings, or local-language news.
However, emerging agentic AI tools tend to focus on B2C applications, like comparing airfares, scanning retail sites, booking holidays or automating online shopping. True investigative tasks – such as dissecting contradictory court records in multiple languages across different jurisdictions or piecing together evidence on an individual’s source of wealth – remain difficult and require specialised domain knowledge (including subject matter expertise in local regulations).
In jurisdictions with scant press freedom, analogue document registries or scarce official data, purely digital solutions fail to capture on-the-ground realities. Here, alternative approaches – such as gathering reliable HUMINT – remain a necessity. Even an AI that can supposedly spider through paywalls and government websites cannot produce intelligence where none exists, nor can it interpret local context without domain-specific nuance.
Generalisation vs. Specialisation
Many AI vendors have pursued mass-market relevance, training their models on wide-ranging general knowledge. This emphasises fluency. But intelligence practitioners often confront niche areas (e.g., targeted local investigations, specialised corporate structures, culturally nuanced contexts) where incomplete or contradictory data abounds.
As a result, general-purpose LLMs can drastically misinterpret or oversimplify specialised topics unless guided by experts who know how to frame requests, check facts, select sources, and interpret responses for subtle meaning. AI responses often fall back on data that is more visible and accessible on the surface web, even though this may be superficial, imprecise and lacking in authority. Although there is a move towards lighter-weight, more specialised models utilising more carefully curated datasets, the intelligence sector often needs deeply tailored solutions that are expensive to build, test, and maintain.
Out-of-Date or Sparse Knowledge
Even the largest models face an inherent time lag. LLMs typically rely on historical datasets, meaning if an event or development is too recent, the model might have no record of it at all. Retrieval Augmented Generation (RAG) and agentic web integration can partially address this gap by connecting queries to more current data sources, but persistent memory and reliable updating remain unresolved challenges.
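A minimal sketch of the RAG pattern described above follows: retrieve current documents first, then instruct the model to answer only from them. Both retrieve() and llm() are hypothetical stand-ins for a vector-store lookup and a chat-completion call respectively.

```python
def answer_with_rag(question, retrieve, llm, k=5):
    """Ground an answer in freshly retrieved passages rather than the
    model's (possibly stale) training data."""
    passages = retrieve(question, top_k=k)  # e.g. a vector-store lookup
    context = "\n\n".join(
        f"[{i + 1}] {p['text']} (source: {p['url']}, dated {p['date']})"
        for i, p in enumerate(passages)
    )
    prompt = (
        "Answer using ONLY the numbered sources below, citing them as [n]. "
        "If the sources do not contain the answer, say so rather than "
        "guessing.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )
    return llm(prompt)
```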
For intelligence around fast-unfolding situations – like emerging political crises or newly minted corporate structures – there is a distinct risk the AI will default to older data. Equally, if training data is thin, the LLM may introduce inferential or reasoning errors rather than admitting ignorance. A human expert is more likely to sense the absence of information and pursue additional avenues of inquiry, whether through HUMINT, specialised databases or alternative approaches.
The Need for Effective Prompt Engineering
Although newer AI models exhibit impressive understanding even with minimal instructions, precise prompts remain vital in guiding the model’s reasoning. The specialised field of prompt engineering helps the AI stay on track and interpret user needs correctly.
For instance, if an investigator wants to identify potential conflicts of interest in an overseas joint venture, the prompt must specify the relevant timeframe, location, search constraints, and document types. Otherwise, the AI might offer erroneous predictions or unvalidated data. Practitioners must also know how to interpret the AI’s Chain-of-Thought, distinguishing robust reasoning from speculative leaps. Unfortunately, with leading commercial reasoning models, although visibility of reasoning steps is improving (partially thanks to DeepSeek’s greater disclosure), intellectual property concerns still restrict access to the full record of AI reasoning steps.
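To make this concrete, a structured prompt for the joint-venture scenario might look something like the sketch below. The entities, dates and jurisdictions are invented placeholders; the point is the explicit constraints, not the specifics.

```python
# An entirely hypothetical structured prompt for the joint-venture
# scenario above; names, dates and jurisdictions are placeholders.
PROMPT = """You are assisting a due-diligence investigation.
Task: identify potential conflicts of interest involving directors of
[Company A] and its joint-venture partner [Company B].
Timeframe: January 2018 to present.
Jurisdictions: [Country X] and [Country Y] only.
Acceptable sources: official corporate registries, court filings and
regulated-press reporting; exclude social media and unattributed blogs.
Output: for each finding, state the claim, the supporting source, its
date, and your confidence. If evidence is missing or contradictory,
say so explicitly rather than inferring."""
```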
As with human communication, the more precise and contextually rich the instructions, the more likely the “conversation” will yield actionable insight. Prompt engineering, therefore, demands not only linguistic skill but also subject-matter expertise: you must know which questions to ask and how to evaluate whether the AI’s answer aligns with reality.
Looking Ahead
AI’s capacity to sift enormous data sets continuously and present coherent analysis in minutes is alluring. Researchers and vendors continue to refine advanced LLM architectures, exploring next-generation techniques such as refined Chain-of-Thought prompting, neuro-symbolic approaches that combine neural networks with logic rules, and advanced RAG techniques allowing LLMs to cite and interpret live information whilst incorporating advanced reasoning. Multi-agent collaboration, sketched below, can allow researchers to scour newsfeeds and databases, interpret and synthesise information using pattern analysis, and cross-check facts and figures by performing additional computations. Open-source experimentation, along with rigorous academic study, suggests future breakthroughs may tackle many remaining pitfalls.
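As a simplified illustration of that multi-agent pattern, the sketch below chains a gathering agent, a synthesising agent, and a cross-checking agent. The agent() function is a hypothetical wrapper around an LLM call with a role-specific system prompt; production systems would add tool access, retries and logging.

```python
def investigate(question, agent):
    """Chain three role-specialised agents; agent(role, task) is a
    hypothetical wrapper around an LLM call with a role-specific
    system prompt."""
    findings = agent(
        "researcher",
        f"Search newsfeeds and databases for: {question}. "
        "Return findings with sources and dates.",
    )
    summary = agent(
        "analyst",
        "Synthesise the findings below into key patterns, noting gaps "
        f"and contradictions:\n{findings}",
    )
    audit = agent(
        "verifier",
        "Cross-check every factual claim in the summary against the raw "
        f"findings and flag anything unsupported.\nFindings:\n{findings}\n"
        f"Summary:\n{summary}",
    )
    return summary, audit
```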
Still, intelligence aims for reliable, actionable insight. The surest path toward that outcome is blending traditional tradecraft – notably careful source vetting, context awareness, and interpretive scepticism – with AI’s speed and scalability. As new agentic AI solutions emerge, intelligence professionals who bring subject matter expertise, ethical guardrails, and a thorough understanding of AI’s technical constraints will be best prepared to harness its potential.
Conclusion
While AI offers transformative possibilities for intelligence-gathering – from swiftly mapping corporate ownership structures to real-time conflict tracking – it cannot replace the discernment and contextual reasoning that skilled human analysts offer. Progress in third-wave “agentic AI” is real, but so are persistent challenges related to data quality, governance, specialised domain knowledge, and ethical oversight.
Intelligence professionals should therefore see AI not as a simple shortcut but as an amplifier of human skill. By curating high-quality data, rigorously engineering prompts, verifying results, and adhering to robust governance policies, practitioners can deploy AI as a powerful new tool in research – without sacrificing the diligence that anchors credible intelligence.
Contact Details
Adrian Ford
Chief Executive Officer
adrian.ford@aperio-intelligence.com
References and Further Reading
Xu, Z. (2023). Context-aware Decoding Reduces Hallucination in Query-focused Summarization.
Lewis, P. et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.
Wei, J. et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.
Patil, A. (2025). Advancing Reasoning in Large Language Models: Promising Methods and Approaches.
Tran, K-T. et al. (2025). Multi-Agent Collaboration Mechanisms: A Survey of LLMs.