Innovation Posts Archive - Thomson Reuters Institute
https://blogs.thomsonreuters.com/en-us/innovation/
Thomson Reuters Institute is a blog from Thomson Reuters, the intelligence, technology, and human expertise you need to find trusted answers.

Strategic AI Integration: Balancing Business Value, Governance, and Human-Centered Adoption
https://www.thomsonreuters.com/en-us/posts/innovation/strategic-ai-integration-balancing-business-value-governance-and-human-centered-adoption/
Thu, 17 Apr 2025

Recently, Thomson Reuters leaders held a joint panel discussion with Deloitte, uncovering strategic approaches to implementing generative AI across organizations—emphasizing business-driven objectives, proactive governance frameworks, custom tool development, and human-centered adoption strategies that transform AI potential into tangible business performance. Below are a few key insights from the discussion:

  1. Gen AI as a Strategic Business Tool
    Audrey Ancion, Partner, AI Institute Canada Leader, Deloitte, discusses how generative AI should be viewed as a value-creating business tool rather than just a technology trend. She emphasizes an “insight-driven approach” at Deloitte, which balances data and technology with people, strategy, and process considerations. Audrey reinforces that successful AI implementation begins with clear business objectives rather than technology experimentation. Organizations should first identify the specific business needs or opportunities they want to address, and then select the appropriate technological solution—not the other way around.


  2. Proactive AI Governance in a Rapidly Evolving Regulatory Landscape
    David Wong, Chief Product Officer, Thomson Reuters, outlines the company's forward-thinking approach to AI governance. He describes how Thomson Reuters is not just focusing on current compliance requirements, but actively preparing for future regulations. Key elements of the strategy include implementing a comprehensive AI model governance process, maintaining a model registry, establishing rigorous asset management practices, and updating customer data privacy policies. David emphasizes a dual focus: ensuring full compliance with existing regulations while building flexible systems and infrastructure that can quickly adapt to new requirements as they emerge in this fast-changing regulatory environment.


  3. Leveraging AI Tools Across the Product Development Ecosystem
    David Wong discusses how AI tools like GitHub Copilot are enhancing their engineering teams’ productivity and product quality. He shares that Thomson Reuters is now exploring AI solutions for non-engineering roles including editors, content experts, and designers. He notes this initiative is still developing, but early successes have come from adopting vendor-created generative AI products that adapt large language models for specific business functions.


  4. Building Custom AI Tools for Editorial Excellence
    David Wong highlights Thomson Reuters' strategic investment in proprietary AI technology for editorial teams. He explains that while they leverage vendor solutions where possible, their specialized legal editorial needs require custom-built tools. David reveals they've launched six internal development projects this year, led by Leanne Blanchfield and editorial teams, to create Thomson Reuters-owned AI software specifically designed to enhance editor efficiency, speed, and accuracy—ultimately improving customer service.


  5. The “4 Ts” of Human-Centered AI Adoption at Thomson Reuters
    Chief People Officer Mary Alice Vuicic outlines Thomson Reuters' "4 Ts" framework for successful AI adoption across the organization. She emphasizes that human adoption—not technology—presents the greatest challenge. The framework includes:
    – Tone from the top, with CEO Steve Hasker championing AI as critical for the company, customers, and individual career resilience;
    – Training, with accessible programs for both technical and non-technical staff;
    – Tools, providing all employees access to Open Arena (Thomson Reuters' internal AI platform) with four leading language models; and
    – Time to experiment, encouraging daily AI use to build habits.

    Mary Alice reinforces that hands-on experience significantly reduces employee anxiety about AI, making it more accessible than previous technological transformations.


Legal AI Benchmarking: Evaluating Long Context Performance for LLMs
https://www.thomsonreuters.com/en-us/posts/innovation/legal-ai-benchmarking-evaluating-long-context-performance-for-llms/
Mon, 14 Apr 2025

The Importance of Long Context

In the legal profession, many daily tasks revolve around document analysis—reviewing contracts, transcripts, and other legal documents. CoCounsel, the leading AI legal assistant, counts the automation of document-centric tasks among its core legal capabilities. Users can upload documents, and CoCounsel can automatically perform various tasks on these documents, saving lawyers valuable time.

Legal documents are often extensive. Deposition transcripts can easily exceed a hundred pages, and merger and acquisition agreements can be similarly lengthy. Additionally, some tasks require the simultaneous analysis of multiple documents, such as comparing contracts or testimonies. To perform these legal tasks effectively, solutions must handle long documents without losing track of critical information.

When GPT-4 was first released in 2023, it featured a context window of 8K tokens, equivalent to approximately 6,000 words or 20 pages of text. To process documents longer than this, it was necessary to split them into smaller chunks, process each chunk individually, and synthesize the final answer. Today, most major LLMs have context windows ranging from 128K to over 1M tokens. However, the ability to fit 1M tokens into an input window does not guarantee effective performance with that much text. Often, the more text included, the higher the risk of missing important details. To ensure CoCounsel remains effective with long documents, we've developed rigorous testing protocols to measure long-context performance.
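As a rough illustration of the chunk-and-synthesize pattern described above, a minimal sketch follows. The word-based token estimate, chunk size, and the llm callable are illustrative assumptions, not CoCounsel's actual pipeline.

# Minimal sketch of the chunk-process-synthesize pattern for documents that
# exceed a model's context window. The token estimate, chunk size, and the
# `llm` callable are illustrative assumptions, not a production implementation.
from typing import Callable

def split_into_chunks(text: str, max_tokens: int = 8000, tokens_per_word: float = 1.3) -> list[str]:
    """Split text into word-based chunks that stay under an approximate token budget."""
    words = text.split()
    words_per_chunk = max(1, int(max_tokens / tokens_per_word))
    return [" ".join(words[i:i + words_per_chunk]) for i in range(0, len(words), words_per_chunk)]

def answer_over_long_document(question: str, document: str, llm: Callable[[str], str]) -> str:
    """Process each chunk independently, then synthesize the partial answers into one response."""
    partial_answers = [
        llm(f"Answer based only on this excerpt.\n\nExcerpt:\n{chunk}\n\nQuestion: {question}")
        for chunk in split_into_chunks(document)
    ]
    synthesis_prompt = "Combine these partial answers into one final answer:\n\n" + "\n---\n".join(partial_answers)
    return llm(synthesis_prompt)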

Why a Multi-LLM Strategy—and Trusted Testing Ground—Matter

At Thomson Reuters, we don’t believe in putting all our bets on a single model. The idea that one LLM can outperform across every task—especially in high-stakes professional domains—is a myth. That’s why we’ve built a multi-LLM strategy into the core of our AI infrastructure. It’s not a fallback. It’s one of our competitive advantages.

Some models reason better. Others handle long documents more reliably. Some follow instructions more precisely. The only way to know what’s best for any given task is to test them—relentlessly. And that’s exactly what we do.

Because of our rigor and persistence in this space, Thomson Reuters is a trusted, early tester and collaborator for leading AI labs. When major providers want to understand how their newest models perform in high-stakes, real-world scenarios, they turn to us. Why? Because we're uniquely positioned to pressure-test these models against the complexity, precision, and accountability that professionals demand—a position few others can match.

  • Our legal, tax, and compliance workflows are complex, unforgiving, and grounded in real-world stakes
  • Our proprietary content—including Westlaw and Reuters News—gives us gold-standard input data for model evaluation
  • Our SME-authored benchmarks and skill-specific test suites reflect how professionals actually work—not how a model demo looks on paper 

When OpenAI was looking to train and validate a custom model built on o1-mini, Thomson Reuters was among the first to participate. And whenever the next generation of long-context models hits the market, we are routinely among the early testers.

A multi-model strategy only works if you know which model to trust—and when. Benchmarking is how we turn optionality into precision.

This disciplined, iterative approach isn’t just the best way to stay competitive—it’s the only way in a space that’s evolving daily. The technology is changing fast. New models are launching, improving, and shifting the landscape every week. Our ability to rapidly test and integrate the best model for the job—at any moment—isn’t just a technical strategy. It’s a business advantage.

And all of this isn’t just about immediate performance gains. It’s about building the foundation for truly agentic AI—the kind of intelligent assistant that can plan, reason, adapt, and act across professional workflows with precision and trust. That future won’t be built on rigid stacks or static decisions. It will be built by those who can move with the market, test with integrity, and deliver products that perform in the real world.

RAG vs. Long Context

In developing CoCounsel’s capabilities, a significant question was whether and how to utilize retrieval-augmented generation (RAG). A common pattern in RAG applications is to split documents into passages (e.g., sentences or paragraphs) and store them in a search index. When a user requests information from the application, the top N search results are retrieved and fed to the LLM in order to ground the response.

RAG is effective when searching through a vast collection of documents (such as all case law) or when looking for simple factoid questions easily found within a document (e.g., specific names or topics). However, some complex queries require a more sophisticated discovery process and more context from the underlying documents. For instance, the query “Did the defendant contradict himself in his testimony?” requires comparing each statement in the testimony against all others; a semantic retrieval using that query would likely only return passages explicitly discussing contradictory testimony.

In our internal testing (more on this later), we found that inputting the full document text into the LLM's input window (and chunking extremely long documents when necessary) generally outperformed RAG for most of our document-based skills. This finding is supported by studies in the literature [1, 2]. Consequently, in CoCounsel 2.0 we leverage long context LLMs to the greatest extent possible to ensure all relevant context is passed to the LLM. At the same time, RAG is reserved for skills that require searching through a repository of content.

Comparing the Current Long-Context Models + Testing GPT-4.1

As discussed in our previous post, before deploying an LLM into production, we conduct multiple stages of testing, each more rigorous than the last, to ensure peak performance.

Our initial benchmarks measure LLM performance across key capabilities critical to our skills. We use over 20,000 test samples from open and private benchmarks covering legal reasoning, contract understanding, hallucinations, instruction following, and long context capability. These tests have easily gradable answers (e.g., multiple-choice questions), allowing for full automation and easy evaluation of new LLM releases.

For our long context benchmarks, we use tests from LOFT [3], which measures the ability to answer questions from Wikipedia passages, and NovelQA [4], which assesses the ability to answer questions from English novels. Both tests accommodate up to 1M input tokens and measure key long context capabilities critical to our skills, such as multihop reasoning (synthesizing information from multiple locations in the input text) and multitarget reasoning (locating and returning multiple pieces of information). These capabilities are essential for applications like interpreting contracts or regulations, where the definition of a term in one part of the text determines how another part is interpreted or applied.

We track and evaluate all major LLM releases, both open and closed source, to ensure we are using the latest and most advanced models, such as the newly updated GPT-4.1 model with its much-improved long context capabilities.

Skill-Specific Benchmarks

The top-performing LLMs from our initial benchmarks are tested on our actual skills. This stage involves iteratively developing (sometimes very complex) prompt flows specific to each skill to ensure the LLM consistently generates accurate and comprehensive responses required for legal work.

Once a skill flow is fully developed, it undergoes evaluation using LLM-as-a-judge against attorney-authored criteria. For each skill, our team of attorney subject matter experts (SMEs) has generated hundreds of tests representing real use cases. Each test includes a user query (e.g., “What was the basis of Panda’s argument for why they believed they were entitled to an insurance payout?”), one or more source documents (e.g., a complaint and demand for jury trial), and an ideal minimum viable answer capturing the key data elements necessary for the answer to be useful in a legal context. Our SMEs and engineers collaborate to create grading prompts so that an LLM judge can score skill outputs against the ideal answers written by our SMEs. This is an iterative process, where LLM-as-a-judge scores are manually reviewed, grading prompts are adjusted, and ideal answers are refined until the LLM-as-a-judge scores align with our SME scores. More details on our skill-specific benchmarks are discussed in our previous post.
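To make the grading loop above concrete, here is a simplified sketch of LLM-as-a-judge scoring against an SME-authored ideal answer; the grading prompt, pass/fail scale, and llm callable are illustrative assumptions rather than our production grading prompts.

# Simplified sketch of LLM-as-a-judge scoring against an SME-authored ideal answer.
# The grading prompt, pass/fail verdict, and llm() callable are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable

@dataclass
class SkillTest:
    query: str            # e.g., a user question about one or more source documents
    ideal_answer: str     # SME-authored minimum viable answer
    skill_output: str     # answer produced by the skill flow under test

GRADING_PROMPT = (
    "You are grading a legal AI answer against an ideal answer written by an attorney.\n"
    "Ideal answer:\n{ideal}\n\nCandidate answer:\n{candidate}\n\n"
    "Does the candidate contain every key element of the ideal answer? Reply PASS or FAIL."
)

def judge(test: SkillTest, llm: Callable[[str], str]) -> bool:
    """Ask the judge model whether the skill output covers the ideal answer's key elements."""
    verdict = llm(GRADING_PROMPT.format(ideal=test.ideal_answer, candidate=test.skill_output))
    return verdict.strip().upper().startswith("PASS")

def pass_rate(tests: list[SkillTest], llm: Callable[[str], str]) -> float:
    """Fraction of tests passed; in practice these scores are reviewed until they align with SME grades."""
    return sum(judge(t, llm) for t in tests) / len(tests)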

Our test samples are carefully curated by our SMEs to be representative of the use cases of our users, including context length. For each skill, we have test samples utilizing one or more source documents with a total input length of up to 1M tokens. Additionally, we have constructed specialized long context test sets where all test samples use one or more source documents totaling 100K–1M tokens in length. These long context tests are crucial because we have found that the effective context windows of LLMs, where they perform accurately and reliably, are often much smaller than their available context window.

In our testing, we have observed that the more complex and challenging a skill, the smaller an LLM’s effective context window for that skill. For more straightforward skills, where we search a document for one or a few data elements, most LLMs can accurately generate answers at input lengths up to several hundred thousand tokens. However, for more complex tasks, where many different data elements must be tracked and returned, LLMs may struggle with recall to a greater degree. Therefore, even with long context models, we still split documents into smaller chunks to ensure important information isn’t missed. 

When you look at the advertised context window for leading models today, don't be fooled into thinking this is a solved problem. Legal work is exactly the kind of complex, reasoning-heavy, real-world problem where the effective context window shrinks. Our challenge to the model builders: keep stretching and stress-testing that boundary!

Final Manual Review

All new LLMs undergo rigorous manual review by our attorney SMEs before deployment. Our SMEs can capture nuanced details missed by automated graders and provide feedback to our engineers for improvement. These SMEs further provide the final check to verify that the new LLM flow performs better than the previously deployed solution and meets the exacting standards for reliability and accuracy in legal use.

Looking Ahead: From Benchmarks to Agents

Our long-context benchmarking work is more than just performance testing — it’s a blueprint for what comes next. We’re not just optimizing for prompt-and-response AI. We’re laying the technical foundation for truly agentic systems: AI that can not only read and reason, but plan, execute, and adapt across complex legal workflows.

Imagine an AI assistant that doesn’t just answer a question, but knows when to dig deeper, when to ask for clarification, and how to take the next step — whether that’s reviewing a deposition, cross-referencing contracts, or preparing a case summary. That’s where we’re headed.

This next chapter requires everything we’ve built so far: long-context capabilities, multi-model orchestration, SME-driven evaluation, and deep integration into the professional’s real-world tasks. We’re closer than you think.

Stay tuned — more on that soon.


 ——-



  1. Li, Xinze, et al. "Long Context vs. RAG for LLMs: An Evaluation and Revisits." arXiv preprint arXiv:2501.01880 (2025).
  2. Li, Zhuowan, et al. "Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach." Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, 2024.
  3. Lee, Jinhyuk, et al. "Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?" arXiv preprint arXiv:2406.13121 (2024).
  4. Wang, Cunxiang, et al. "NovelQA: Benchmarking Question Answering on Documents Exceeding 200K Tokens." arXiv preprint arXiv:2403.12766 (2024).



AI in the Legal World: A Human-Centric Approach
https://www.thomsonreuters.com/en-us/posts/innovation/ai-in-the-legal-world-a-human-centric-approach/
Tue, 08 Apr 2025

In the rapidly evolving landscape of legal technology, concerns about artificial intelligence replacing human lawyers have grown increasingly prevalent. A new video featuring an AI expert from Thomson Reuters directly addresses these concerns with a clear message:

More than three-quarters (77%) of respondents overall said they believe AI will have a high or transformational impact on their work over the next five years, according to the Thomson Reuters Institute’s 2024 Future of Professionals report. This was 10 percentage points higher than in the 2023 report.

At the same time, fears about job loss within the legal industry persist. Indeed, the perception that AI will replace lawyers endures, even though fears that widespread AI adoption across professional work would lead to job loss have largely given way to an acknowledgement that humans need to be in the loop to keep AI work ethical and on track.

In fact, AI is NOT a lawyer. And human oversight of AI output in legal workflows is critical.

Jake Heller, Head of Legal AI Product Innovation at Thomson Reuters, explains how in this new video.

Increasingly, professionals believe that more AI-specialist and technology-related jobs will be created, according to the Thomson Reuters Future of Professionals research. Rather than viewing AI as a replacement for legal professionals, Heller encourages us to see it as a sophisticated tool—one that can process vast amounts of data and provide valuable insights for legal research and analysis.

What makes this perspective particularly compelling is the analogy that AI functions much like a junior associate fresh out of law school. It can assist, draft documents, and analyze information, but it fundamentally lacks the human qualities essential to legal practice—judgment, empathy, and understanding of human nuance. Just as a senior attorney would review a new lawyer’s work, AI output requires human oversight.

The solution is straightforward yet crucial—embed humans in the loop. This means having experienced lawyers and judges review and approve AI-generated work to ensure accuracy, ethical compliance, and alignment with established legal standards.

Perhaps most importantly, combining AI's efficiency with human expertise can enhance legal processes without compromising professional integrity. Legal professionals must embrace AI as a partner rather than viewing it as a threat. This perspective reminds us that technology should serve humanity, not the other way around.

Thomson Reuters Achieves ISO 42001 Certification for Responsible AI
https://www.thomsonreuters.com/en-us/posts/innovation/thomson-reuters-achieves-iso-42001-certification-for-responsible-ai/
Wed, 12 Mar 2025

Thomson Reuters has achieved the ISO/IEC 42001:2023 certification, a globally recognized standard for AI management systems. In a world where AI is transforming professional industries, trust is essential. This certification demonstrates our commitment to building AI solutions that professionals can rely on—ensuring legal, tax, risk, and compliance professionals have access to AI tools that are secure, transparent, and aligned with ethical best practices.

The achievement followed a rigorous 10-month process involving both internal and external audits focused on risk management, data governance, and responsible AI practices. This certification assures that Thomson Reuters’ Westlaw Precision and Edge solutions meet high industry standards, providing reliable and ethical tools for legal professionals.

For more on our approach to AI governance, visit our Trust Center.

Forecasting the Future of the Law Firm
https://www.thomsonreuters.com/en-us/posts/innovation/forecasting-the-future-of-the-law-firm/
Wed, 05 Mar 2025

In the ever-evolving landscape of the legal industry, law firm leaders find themselves at a critical juncture, facing unprecedented challenges and opportunities. As we engage in conversations with leaders from global law firms, AmLaw 200 firms, and major independent law firms, a common thread emerges: the pressing need for decisive action in an environment of rapid change.

At the forefront of this transformation is the rise of generative AI (GenAI), which promises to reshape the very foundations of legal practice. This technological revolution is not just another trend; it’s poised to become the most influential force shaping law firms over the next five years, surpassing even economic factors and geopolitical instability in its potential impact. Looking at the implications of GenAI on the law firm business model, it becomes clear that the time for passive observation has passed. In today’s legal industry, there is simply no room for bystanders.

The Future of the Law Firm white paper looks at the current environment and forecasts the next three waves of how AI-driven technologies will reshape the legal industry.

Wave 1: Optimization of legal workflows

Law firms are increasingly pressured to adopt AI to reduce costs as their clients embrace these technologies, leading to shifts in cost structures and hiring practices. Despite these changes, the demand for legal services continues to grow as clients face business disruptions due to AI, prompting the need for new legal support. As AI adoption becomes more widespread, it is expected to significantly impact pricing strategies and workforce dynamics in the legal industry.

Wave 2: Legal market disruption and law firm re-engineering

Law firms are adopting more technology and project management strategies, which leads to fewer lawyers being hired and a re-engineering of business models to stay competitive. Legal departments are keeping more work in-house, but smaller firms can leverage AI to handle larger, complex tasks. The competitive landscape is evolving, with middle-market firms feeling pressure as routine work becomes productized and new AI-enabled services enter the market.

Wave 3: Disruption of legal services landscape and AI winners emerge

The use of AI in the legal industry will lead to a shift in how clients interact with law firms, with top-tier firms focusing on high-stakes and complex matters, while smaller firms move up the value chain with AI-enabled solutions. New AI-powered delivery models and self-serve legal products will transform the way legal services are bought and delivered, potentially leading to consolidation in the middle of the market. Ultimately, AI will have a profound impact on the law firm of the future, but it will work best when it complements, rather than substitutes for, legal professionals.

It’s crucial for law firm leaders to recognize that the emergence of AI and GenAI signifies a real and fundamental shift in the legal landscape, impacting how legal work is done. As these technologies promise to transform law firm operations, firms already grappling with pricing, talent, and competition must proactively manage AI adoption.

By addressing the interconnected challenges of client communication, talent acquisition, and AI-driven service pricing, firms can navigate the coming changes and avoid being left behind in this technological revolution.

Download your full copy of The Future of the Law Firm white paper.

Beauty Is in the AI of the Beholder
https://www.thomsonreuters.com/en-us/posts/innovation/beauty-is-in-the-ai-of-the-beholder/
Wed, 26 Feb 2025

“Welcome to the era of the AI superlative. While the first two years of generative artificial intelligence (GenAI) development were an all-out sprint to create new models, establish proof-of-concept solutions, and define optimal use cases, the next phase to deliver increased efficiency and better work product to clients in the AI lifecycle will be dominated by marketing as well.”

Raghu Ramanathan, president of Legal Professionals at Thomson Reuters, opened with these statements and shared his view on industry benchmarks in an article on Above the Law titled Beauty is in the AI of the Beholder.

He noted that as more companies develop AI solutions and start-ups seek capital investment, customers will look for benchmarks to evaluate these tools. Thomson Reuters does see value in benchmarking; however, Ramanathan added that benchmarks must measure products the way they're designed to be used and should focus on results customers care about.

“The challenge is that one-dimensional metrics do not offer a reliable representation of the real value of GenAI in the legal research process,” stated Ramanathan. “No LLM-based legal research products in the market today provide answers with 100% accuracy, so users must engage in a two-step process of 1) getting the answer and 2) checking the answer for accuracy.”

Chief Legal Operations Officer Meredith Williams-Range from Gibson, Dunn & Crutcher LLP discussed how they are using and seeing results from AI-enabled resources. “There is a widespread misperception around how law firms are using AI and how we conduct legal research. We are not bringing in AI and saying: ‘Go do all the research and write a brief,’ and then replacing all of our junior associates with automated results. We’re using AI-enabled tools that are integrated directly into the research and drafting tools we were using already, and, as a result, we’re getting deeper, more nuanced, and more comprehensive insights faster. We have highly trained professionals doing sophisticated information analysis and reporting, augmented by technology.”

Read the full article on Above the Law, and as Ramanathan concludes, “the value of legal AI – of any technological innovation for that matter – is in how it gets used in the real world and how well all the different components come together to help lawyers do their jobs more effectively.”

Exploring AI's Influence on the Legal Profession: Insights from Frost Brown Todd
https://www.thomsonreuters.com/en-us/posts/innovation/exploring-ais-influence-on-the-legal-profession-insights-from-frost-brown-todd/
Thu, 20 Feb 2025

The legal profession is no stranger to change, though the change and its impact on the industry may be viewed very differently. But the rapid evolution of technology, particularly artificial intelligence (AI), is presenting a unique set of opportunities and challenges.

Raghu Ramanathan, president of Legal Professionals at Thomson Reuters, recently hosted his inaugural podcast episode with Cindy Thurston Bare, chief data and innovation officer, and Kayla Kotila, senior knowledge & research services manager, from Frost Brown Todd.

Highlights from the conversation include:

  • Adoption and Use Cases: Frost Brown Todd has approximately 250 legal professionals actively integrating generative AI into their daily operations with impressive results. This adoption is not limited to specific tasks; instead, AI is being woven into existing workflows, signaling a fundamental shift in how legal work is conducted.

  • Measuring Success and Adoption: Success is measured by client satisfaction and efficiency improvements, and the firm uses qualitative feedback and quantitative methods, like A/B testing, to assess AI’s impact. Adoption is widespread across different demographics, with varying use cases depending on experience levels. Junior lawyers are leveraging AI to streamline certain tasks, while partners are using it to enhance their recall and decision-making processes.
  • Client-Centric Innovation and Collaboration: One key measure of success for technological implementation lies in its ability to enhance client service. From expediting document review processes to uncovering critical insights for litigation, AI enables the firm to deliver faster, more accurate, and ultimately more valuable services. Open communication with clients regarding their AI initiatives is also key: sharing success stories, addressing concerns, and exploring potential applications collaboratively fosters trust and ensures that AI implementation aligns with client needs and expectations, benefiting both parties.

Technology continues to move forward, and law firms must prioritize a culture of innovation, continuous learning, and client-centricity to thrive in an increasingly complex and competitive environment. The future of law is not about replacing lawyers with machines but rather empowering them with the tools and knowledge to deliver exceptional legal service.

As part of the Clarity podcast series from the Thomson Reuters Institute, Ramanathan will speak with customers, industry experts and colleagues, bringing perspectives from legal leaders and subject matter experts shaping the industry. The conversations aim to highlight the innovations driving the legal profession as well as the people and organizations implementing new technologies and approaches to maintain a competitive edge in the rapidly changing market.

You can listen to the full conversation on either Apple Podcasts or Spotify.

Thomson Reuters Labs Showcases Innovation in AI and Legal Tech at EMNLP 2024
https://www.thomsonreuters.com/en-us/posts/innovation/thomson-reuters-labs-showcases-innovation-in-ai-and-legal-tech-at-emnlp-2024/
Fri, 14 Feb 2025

The Thomson Reuters Labs team recently made a significant impact at the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP) in Miami, demonstrating our leadership in applying cutting-edge AI technology to legal and business solutions. The TR Labs presence included three groundbreaking papers that showcased our commitment to innovation in legal tech, corporate tax, and international trade.

Pioneering Research Highlights

  1. Revolutionizing Global Trade Classification: TR Labs’ research on LLM-based Robust Product Classification in Commerce and Compliance introduces an innovative approach to automating product classification in international trade. This breakthrough work has already been implemented in Thomson Reuters CoCounsel platform, demonstrating the ability to quickly transform research into practical solutions that benefit customers.
  2. Advancing Legal AI Evaluation: The TR Labs team also presented a novel Automated Pointwise Evaluation Metric for Generated Long-Form Legal Summaries, addressing a critical gap in the field. The approach offers unprecedented accuracy in assessing AI-generated legal content, setting new standards for quality control in legal tech.
  3. Enhancing AI Reliability in Legal Applications: The TR Labs research on Measuring Groundedness in Legal Question-Answering Systems introduces new benchmarks for ensuring AI reliability in legal applications, reinforcing the Thomson Reuters commitment to delivering trustworthy solutions.

Industry Impact and Innovation

Our work generated significant interest from both academic and industry leaders, particularly regarding our successful integration of advanced AI research into practical applications. The conference highlighted Thomson Reuters’ position at the intersection of cutting-edge research and real-world implementation, demonstrating leadership in:

  • Development of AI agents for legal applications
  • Advancement in Retrieval Augmented Generation (RAG)
  • Enhanced LLM safety and security measures
  • Innovative approaches to mitigating AI hallucinations

Looking Forward

The conference reinforced Thomson Reuters' position as a pioneer in legal tech innovation, and the TR Labs research aligns with—and often leads—industry trends, particularly in developing practical applications of AI that address real-world challenges in the legal and business sectors.

Building Bridges Between Research and Practice

A distinctive aspect of the TR Labs presence at EMNLP was the demonstration of how theoretical research translates into practical applications. The presentations sparked engaging discussions with leading tech companies, including Amazon, underlining the universal relevance of our research in solving complex industry challenges.

Future Directions

The conference reinforced several key areas that will shape future innovation:

  • Enhanced focus on AI safety and security in legal applications
  • Development of more sophisticated evaluation metrics for legal AI
  • Continued advancement in automated legal research and analysis
  • Integration of multimodal AI capabilities in legal tech solutions

Thomson Reuters Labs continues to push the boundaries of what's possible in legal tech while maintaining focus on practical, reliable solutions that serve customer needs. For more information about Thomson Reuters Labs and their innovative work, visit the Thomson Reuters Labs website.

Thomson Reuters Best Practices for Benchmarking AI for Legal Research
https://www.thomsonreuters.com/en-us/posts/innovation/thomson-reuters-best-practices-for-benchmarking-ai-for-legal-research/
Wed, 12 Feb 2025

At Thomson Reuters, we do an enormous amount of AI testing in our efforts to improve our customers' ability to move through legal work faster and more effectively. We've noticed an increase in interest in AI testing generally, and in benchmarking AI applications for legal research specifically. We've learned a lot in our thousands of hours of AI testing; as such, we offer the following best practices for those interested in considering an updated or differentiated approach when testing or benchmarking AI for legal research.

1. Test for the results you care about most.

This would seem obvious, but we’ve seen a lot of confusion about it, and if we could only make one recommendation, this would be it. It’s foundational for all other recommendations.

If you cared most about determining how long it takes to drive from one place to another, you wouldn’t just measure highway time, you’d measure total door-to-door time. If you cared most about car maintenance costs, you wouldn’t just measure the cost and frequency of brake repairs and maintenance.

With the use of AI for legal research, there are no LLMs nor any LLM-based solutions that offer 100% accuracy. Because of that, all answers generated by large language models or LLM-based solutions, even if they use Retrieval Augmented Generation (RAG), must be independently verified.

Some assume verification is a simple matter of checking the sources cited in an AI answer, but this is incorrect. We’ve seen plenty of examples where an AI-generated answer is wrong, and the cited sources simply corroborate the wrong answer. Verification requires using additional tools (like a citator, statute annotations, etc.) to ensure the answer is correct.

This means every time an AI-generated answer is used for research, there is a three-step process the researcher must engage in: (1) review the answer, (2) review the cited material from the answer, and (3) use traditional research tools to make sure the answer and cited material are correct.

When we talk with researchers about research generally and this process specifically, what they care about most is (a) getting to a correct answer or understanding of the relevant law, and (b) the time it takes to get to that correct answer or understanding.

Because of this, the two most important measures are:

  • The percentage of times the researcher gets to the right answer using this three-step process, and
  • The time it takes to complete all three steps

Surprisingly, the percentage of errors in answers from step 1 can have very little impact on the percentage of correct answers the researcher reaches using all three steps, or on the time to complete those steps (unless errors are excessive), as long as citations and links to primary law are good and those primary resources are current and easily verified. Focusing on step one alone is like trying to figure out door-to-door times by measuring highway speeds only. It's not very useful.

For instance, which of the following systems would you rather use?

  • System where the initial AI answer is 92% accurate, but verification, on average, takes 18 minutes, and post-verification accuracy is 97%, or
  • System where the initial AI answer is 89% accurate, but verification, on average, takes 10 minutes, and post-verification accuracy is 99.9%

It’s a clear choice, but there is often a misplaced focus on measurement of the first step in the process to the exclusion of steps two and three. Measure what you care about most.

2. Use realistic, representative questions in your testing.

Presumably you want to evaluate AI for the typical legal research you or your organization does. For instance, if you look at the research your organization does and find the questions are roughly 20% simple questions, 60% medium complexity, and 20% very complex or difficult, and that roughly half are questions about IP law and half are about federal civil procedure, then a benchmark testing 90% simple questions about criminal law would not be very helpful to you.

At Thomson Reuters, we model our testing based on the real-world questions we see from our customers every month. For your own testing, focus on the question types that best represent the researchers you’re focused on.

Testing mostly simple questions with clear-cut answers is easiest, but if those types of questions don't represent what your users do most (and they don't represent most AI usage in Westlaw well), then the results are not particularly helpful. Similarly, if you primarily test overly complex, extremely difficult and nuanced questions—or trick questions—those can be useful for probing the limits of a system, but they tend not to be very helpful for most real-world decision making.

3. Test a lot of questions.

In our own testing, we’ve found that testing small sets of questions is rarely representative of actual performance with a larger set. Large language models can generate different responses each time, even with identical inputs. Additionally, if responses are long and complex, graders may disagree, even when judging identical responses. For just a quick general sense of direction, it’s fine to test with a sample of questions as small as 100 or so, but for comparing algorithms/LLMs against each other, we strongly recommend checking the results as you grade and testing until the measure of interest stabilizes. For example, if you are running a comparison between two systems to see which is preferred, you would test until the rate at which one system is preferred over the other stops changing dramatically with each new batch of questions. Another guide to the number of questions you should test is the confidence level and interval you want (see next section).

4. Calculate and report confidence levels and intervals.

Even with a relatively large set of questions, measurements of accuracy are only so precise. When using these measurements to make decisions, it's important to understand the degree or range of accuracy of the measurement, often referred to as the confidence level and confidence interval. You can think of confidence intervals and levels like the margin of error in surveys: they let you know how reliable or repeatable the measurement is expected to be.

For instance, suppose you tested AI accuracy on 200 questions. If you ran the test again with the same questions and answers but different evaluators, or used the same evaluators with a different random, representative sample of 200 questions, would you expect the exact same result? Typically, you wouldn't. You'd expect the result to fall within a certain range, so it's important to report that range along with the results so decision makers understand which differences between algorithms/LLMs are meaningful and which are not. The proper way to report this is with confidence intervals and levels. Using standard assumptions, when measuring an error rate of 10% from a sample of only 100 questions, you can be about 95% confident that the true error rate is between 4.1% and 15.9%. This is called a 95% confidence level, and the "+/- 5.9%" is the margin of error. If you measure an error rate of 10% from a sample of 500 questions, the 95% confidence interval would be between 7.4% and 12.6%, or 10% +/- 2.6%.
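The figures above follow from the standard normal-approximation (Wald) interval for a proportion; the short sketch below reproduces them under those same simplifying assumptions.

# Normal-approximation (Wald) confidence interval for a measured error rate,
# reproducing the figures quoted above. This uses the same simplifying
# assumptions described in the post and will understate uncertainty in practice.
import math

def wald_interval(error_rate: float, n_questions: int, z: float = 1.96) -> tuple[float, float]:
    """95% confidence interval (z = 1.96) for a proportion measured on n questions."""
    margin = z * math.sqrt(error_rate * (1 - error_rate) / n_questions)
    return error_rate - margin, error_rate + margin

print(wald_interval(0.10, 100))   # ~(0.041, 0.159) -> 10% +/- 5.9%
print(wald_interval(0.10, 500))   # ~(0.074, 0.126) -> 10% +/- 2.6%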

The basic power analysis used to estimate a confidence interval assumes a perfect means of detecting the outcome you are trying to measure. If there is some uncertainty in that detection, e.g., if two independent evaluators disagree about the outcome some percentage of the time, then the margin of error increases. A grading process or measurement that's unreliable ~5% of the time might increase the margin of error from 5.9% to 7.3% in our example above with 100 questions. It's important to note that there are various methods for calculating standard error, and these examples make simplifying assumptions that likely underestimate the confidence intervals observed in practice.

5. Use a combination of automated and manual evaluation efforts.

Having human evaluators pore through lengthy answers to complex questions can be difficult and time-consuming. Ideally, we would just have AI evaluate the accuracy and quality of answers generated by AI. This is sometimes referred to as LLM as judge. But in the same way that AI makes mistakes when generating an answer, it can also make mistakes when evaluating the quality of an answer against a gold-standard answer written by a human. In our experience, modern LLMs are pretty good at evaluating AI-generated answers against gold-standard answers when answers are clear and relatively short. With length and complexity, we’ve found the LLM as judge approach to be very unreliable.

For instance, research has shown that LLMs tend to struggle when evaluating responses to complex and challenging questions like those requiring expert knowledge, reasoning, and math.

Since most test sets will contain a sample of simple/easy/clear questions and answers, it makes sense to use AI for automated evaluation of these, then use human evaluators for the rest, at least until AI improves to the point where more can be automated.

6. For human grading, use two separate human evaluators for each answer, and have a third (ideally more experienced) evaluator to resolve conflicts.

For assessments like these, inter-rater reliability can be a real issue. In our own testing, we've found attorneys evaluating AI-generated answers for more complex legal research questions can disagree about the accuracy or quality of answers about 25% of the time, which makes single-grader evaluation unreliable. To improve reliability, we have two evaluators separately grade each answer, and where there are conflicts, we have a third, more experienced evaluator resolve the conflict.

7. When answers are wrong, investigate to see if the gold-standard answer might be wrong.

In the same way people make mistakes in evaluating answers, they can also make mistakes in coming up with the gold-standard answer for testing. In our experience, we’ve found some instances where the AI-generated answer was evaluated as incorrect when compared to the gold-standard answer, but when we dug into it further, it turned out the AI was correct and the person who put together the gold-standard answer was wrong. Sometimes AI makes mistakes and sometimes humans make mistakes – you should check both.

8. If evaluating multiple algorithms/LLMs/solutions, make sure the evaluators are blind to which algorithm/LLM/solution the answer was generated by.

In our evaluations we try to avoid human bias in grading. Sometimes an evaluator has had bad experiences or great experiences with a certain product or LLM in the past, and we don’t want them to bring that bias to the current evaluation, so when evaluating different solutions, we first strip away anything that would identify the source of the solution, so results are not biased by past positive or negative experiences.

9. Grade the value of answers in addition to making a binary determination of whether the answer has an error.

What’s right or wrong in an answer can vary enormously in terms of positive value and negative impact. For instance, consider the following answers:

A. Answer is correct in every way but is short and high level. It just gives a basic description of the legal issue as it relates to the question but doesn’t provide any references to primary or secondary law for verification, nor any nuance regarding exceptions or other considerations.

B. Answer is lengthy and nuanced, addressing multiple aspects of the question and discussing important exceptions that might apply, and it provides references with citations and links for verification. It's correct in every way except that the date in one of the citations is wrong, which is easily verified and corrected by clicking the link from the citation.

C. Answer is incorrect in every way and all its linked references point to primary law that simply corroborate the wrong answer.

If the evaluation is simply a binary view of the number of answers that contain an error, then answer A looks good and answers B and C look equally bad. In reality, answer C is far worse and more harmful than answer B, and answer B is likely much more valuable to the researcher than answer A.

In our evaluations, we’re looking for answer attributes that are helpful to researchers, like depth of the answer and quality of the references, and we don’t just evaluate errors in a binary way. We consider answers that are totally wrong to be far worse than answers with erroneous statements in otherwise correct and helpful answers. Similarly, we consider erroneous statements in answers based on whether they address the core questions or are tangential to it, and whether they’re contradicted in the answer or easily verified with the linked references. We’d like to eradicate all errors, of course, but some are more harmful than others.

10. Look for errors beyond gold-standard answers.

Often LLMs generate answers with information beyond the scope of a gold-standard answer. For instance, the gold-standard answer might specify that the response should state the answer to the question is no, explain that with X, Y, and Z, and specifically cite cases A & B and statute C.

The LLM-generated answer might state the answer is no and explain X, Y, and Z with references to A, B, and C, but it might also add a few statements about exceptions or related issues or an additional case or statute. Sometimes these additional statements are incorrect, even when everything else is correct. So, if an LLM-as-judge or human evaluator only looks at the gold-standard answer to see if the AI-generated answer is correct, that evaluation can miss errors in the additional material. This means evaluators need to do independent research beyond simply looking at the gold-standard answers to determine if an answer has an error.

11. Consider testing reliability.

LLMs often have some randomness built into them. Many have a temperature setting that can be used to minimize or eliminate this, making answers more consistent when asking the same question multiple times.

But some LLMs are better at this than others, and some integrated solutions that use LLMs in conjunction with other techniques, like RAG, don’t set temperature low to allow for more creativity in answers.

For big decisions you might be making, consider testing reliability by running the same question 20 times and seeing if any of the answers are substantially worse than the other answers to the same question.
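As a rough sketch of that reliability check, the code below re-runs the same question several times and flags answers that diverge sharply from the rest; the word-overlap similarity and thresholds are deliberately crude stand-ins for whatever grading method you actually use.

# Crude sketch of the reliability check described above: ask the same question
# multiple times and flag any answers that diverge sharply from the rest. The
# word-overlap similarity and thresholds are simple stand-ins for real grading.
from typing import Callable

def jaccard_similarity(a: str, b: str) -> float:
    """Word-level Jaccard overlap between two answers (0 = disjoint, 1 = identical)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 1.0

def flag_unstable_answers(question: str, ask: Callable[[str], str],
                          runs: int = 20, threshold: float = 0.5) -> list[int]:
    """Ask the same question `runs` times; return indices of answers that look like outliers."""
    answers = [ask(question) for _ in range(runs)]
    outliers = []
    for i, answer in enumerate(answers):
        similarities = [jaccard_similarity(answer, other) for j, other in enumerate(answers) if j != i]
        if sum(similarities) / len(similarities) < threshold:
            outliers.append(i)
    return outliers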

The above are our learnings from our extensive experience with AI, Gen AI, and LLMs over the past 30 years. At Thomson Reuters we put the customer at the heart of each of these decisions we make, and we are transparent that, at the point of use, all our AI-generated answers must be checked by a human.

As we work through testing our AI products, our teams do not follow each of these steps for every test we do; sometimes we prioritize speed over accuracy of testing, or vice versa, but we make sure we clearly understand the trade-offs in prioritizing some of these steps and communicate them within our teams. The bigger and more important the decision we're trying to make, the more of these steps we follow.

This is a guest post from Mike Dahn, head of Westlaw Product, and Dasha Herrmannova, senior applied scientist, from Thomson Reuters.

Law firms saw double-digit profit growth to close an incredible 2024: Thomson Reuters Law Firm Financial Index
https://www.thomsonreuters.com/en-us/posts/innovation/law-firms-saw-double-digit-profit-growth-to-close-an-incredible-2024-thomson-reuters-law-firm-financial-index/
Mon, 10 Feb 2025

Law firms closed out the year strong as they pushed innovation and key investment in technology and talent, according to the Q4 2024 Thomson Reuters Law Firm Financial Index (LFFI), powered by Financial Insights. Below are a few key takeaways from the report.

Double-Digit Profit Growth

Law firms ended 2024 on a high note, with the average firm experiencing double-digit profit growth for the year. In the fourth quarter alone, profits grew by an impressive 11.5%. This strong financial performance underscores the industry’s ability to adapt and thrive.

Transactional Practices Lead the Way

Transactional practices were a significant driver of growth in Q4, with corporates seeing a 4.0% increase and real estate growing by 3.0%. Counter-cyclical practices slowed after two years of record growth, counterbalancing the increase in transactional work. The report suggests that demand is expected to be more static in the first half of 2025 compared to 2024.

Investment in Technology and Knowledge Management

Law firms have been investing heavily in technology and knowledge management, leading to a 6.9% increase in overhead expenses in Q4. These investments are crucial for maintaining competitive advantage and improving client service. Direct expenses also rose by 6.2% as firms paid out sizeable bonuses to their associates and staff.

Embracing AI and Upskilling

Many law firms are strategically using AI and that is expected to drive sustainable productivity growth and position firms for continued success in 2025.

“Law firms continue to successfully navigate the dynamic landscape and demonstrate their commitment to serve their clients at the highest level by investing significantly in technology, including generative AI, and the upskilling of their people,” said Raghu Ramanathan, president of Legal Professionals, Thomson Reuters. “Law firms enter 2025 from a position of strength and appear ready to drive profitability. As the AI market matures, law firms can enhance job satisfaction, well-being, and work-life balance by using AI for routine tasks, freeing up time to focus on their clients’ complex needs and driving sustainable productivity growth.”

Download the full report for additional insights on the factors shaping the future of law firms.
