Screens Publishes Redlining GenAI Accuracy Study



As the movement to bring more transparency and understanding to legal genAI accuracy grows, US-based Screens has published another in-depth performance study, this time on its redlining capabilities.

As explored by Artificial Lawyer this summer, Screens – the sister company of contract-focused TermScout, created by CEO Otto Hanson and CTO Evan Harris – has already gone public with other aspects of its genAI accuracy (see here).

Now the company has turned to the redlining side of Screens, and once again it has set out in detail how it achieved its accuracy scores.

Below is an intro to the study, and you can see the full analysis here.

Big Picture

First, we are only ever going to develop shared benchmarks and protocols for genAI accuracy in legal work if people are open and share information. So, this is very helpful.

Now, as explored before here, every use case may have different accuracy needs and expectations. Redlining, however, is one of those where you really want to see high accuracy, because the genAI outputs are directly actionable: this is not like asking for a general summary of a document, this is ‘show me exactly what needs to change’ – hence accuracy really matters on this one.

That said, subjectivity can come into it. One lawyer’s redlines may be different to another’s. However, if you then add in a clearly-defined playbook, that changes the formula again, as the goal is to conform to the playbook.

Some may say that this is a company ‘marking its own homework’ – and it is. But the alternative at present is very little transparency in the market, and this site would much rather see companies publish their results, as long as they fully explain how they reached them, as is the case here. When companies do this it helps everyone to understand more, and no doubt the company in question would be happy to explain in more detail how the test can be repeated.

Also, as noted before, empirical science demands that any test can be repeated by anyone, with similar results. By publishing exactly how it achieved its numbers, Screens really helps in this regard.

Last thing. This site has mentioned that we need a compass, rather than a fixed map, for benchmarking accuracy. This is because LLMs will evolve, as will system prompting, RAG, and other techniques that drive improved results. Plus, our expectations will change as AI develops.

But public results help because they point us all in the right direction, i.e. for X use case, you probably need to reach Y level of results using Z approach. So, we may not have to declare 97% or any other percentage an absolute benchmark, but we all learn the best way to get to the level of results we need for that type of genAI-based output.

Overall, these result publications are really helpful and lift us all up in terms of understanding what is possible. Thanks to Otto and Evan.

Please see the summary below; the full analysis is here.

Screens Redlining Accuracy Report

By Otto Hanson, Co-founder and CEO of TermScout & Screens

In an effort to increase trust and transparency in the legal tech community, the team at Screens is pleased to share our latest AI accuracy evaluation report. It builds on our previous work, which focused on measuring the accuracy of Screens’ ability to identify whether a broad set of contract standards were met across disparate contract samples. Today, we expand the focus to an equally critical question: how effective is the Screens platform at suggesting accurate redlines to correct failed standards?

Contract redlining is a nuanced, high-stakes process, and understanding its reliability is vital for advancing AI’s role in legal tech. In this report, we evaluated the Screens platform’s ability to suggest redlines to correct failed standards. Screens achieved a 97.6% success rate in correcting failed standards with suggested redlines.

To conduct this evaluation, our team used the Screens platform to review 50 publicly available software terms of service contracts against a screen (playbook) from the public Screens Community: SaaS Savvy: Lower Value Purchases. This process focused on a single, objective metric: the percentage of failed standards that were successfully resolved by the suggested redlines. Here’s how the evaluation unfolded, with an illustrative sketch of the loop after the three steps:

1. Initial Screening: Each contract was screened against the standards in the public screen to identify failures.

2. Redline Application: All suggested redlines were applied to the contract to fix all the failed standards.

3. Re-Screening: The revised contract was screened again to determine if the failed standards were properly corrected.
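For illustration only, the three steps above can be expressed as a minimal Python sketch. The names here (screen, suggest_redlines, apply_redlines) are hypothetical placeholders, not the Screens API; the sketch simply shows how the single metric falls out of the screen, redline, re-screen cycle.

    def redline_resolution_rate(contracts, playbook, screen, suggest_redlines, apply_redlines):
        """Share of initially failed standards that pass after suggested redlines are applied."""
        initial_failures = 0
        resolved = 0
        for contract in contracts:
            # 1. Initial screening: which playbook standards does the contract fail?
            failed = [std for std in playbook if not screen(contract, std)]
            initial_failures += len(failed)

            # 2. Redline application: apply every suggested redline for the failed standards.
            revised = apply_redlines(contract, suggest_redlines(contract, failed))

            # 3. Re-screening: count the original failures that the revised contract now passes.
            resolved += sum(1 for std in failed if screen(revised, std))

        return resolved / initial_failures if initial_failures else 1.0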

Results:

  • Contracts Analyzed: 50
  • Initial Failures Identified: 534
  • Failures Resolved by Suggested Redlines: 521
  • Overall Accuracy: 97.6%
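The headline figure is simply the ratio of resolved to initial failures: 521 / 534 ≈ 0.976, i.e. 97.6%.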

While correcting failed standards is the primary goal of generating redlines, we also aim to optimize redlines for brevity, professionalism, and the likelihood of counterparty acceptance. Although this evaluation focuses on one dimension of accuracy, our broader efforts consider the interplay between precision, usability, and professional norms.

This evaluation demonstrates the capabilities of Screens in addressing one of the most important contract review challenges: generating accurate and actionable redlines. While no system is perfect, we’re proud of the work we’ve done as we continue to strive for exceptional accuracy in AI-powered contract review. 

We invite the community to review the full report, explore the methodology, and share feedback as we continue to explore what LLMs can achieve in legal tech. The full report also provides detailed instructions on how this experiment can be replicated by third parties wishing to confirm the results reported here.




