AI Checks Papers for Errors
# | #ai, #ai-tools, #forecasting
Elizabeth Gibney (Nature):
> Both the Black Spatula Project and YesNoError use large language models (LLMs) to spot a range of errors in papers, including ones of fact as well as in calculations, methodology and referencing.
>
> The systems first extract information, including tables and images, from the papers. They then craft a set of complex instructions, known as a prompt, which tells a ‘reasoning’ model — a specialist type of LLM — what it is looking at and what kinds of error to hunt for. The model might analyse a paper multiple times, either scanning for different types of error each time, or to cross-check results. The cost of analysing each paper ranges from 15 cents to a few dollars, depending on the length of the paper and the series of prompts used.
A good application for the current SOTA models. It is of course difficult to make these things work reliably at scale, but that is a small problem: the system only needs to be net-positive.
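To make the description above concrete, here is a minimal sketch of such a multi-pass pipeline, assuming the text extraction step is already done. The `call_llm` helper is a hypothetical stand-in for whichever reasoning-model API you use, and the error categories and prompts are illustrative, not the projects' actual ones.

```python
# Sketch of a multi-pass error-hunting pipeline over an extracted paper.
# `call_llm` is a placeholder for a real reasoning-model client; the
# categories and prompts are illustrative, not Black Spatula's actual ones.

ERROR_TYPES = {
    "calculation": "Recompute every derived number and flag mismatches.",
    "methodology": "Check that the methods support the stated conclusions.",
    "referencing": "Check that each citation supports the claim it backs.",
}

def call_llm(prompt: str) -> str:
    """Stand-in: plug in your model client here."""
    raise NotImplementedError("wire up a reasoning-model API")

def check_paper(paper_text: str, passes_per_type: int = 2) -> dict[str, list[str]]:
    """Run one scan per error type, repeated to cross-check results."""
    findings: dict[str, list[str]] = {}
    for error_type, instruction in ERROR_TYPES.items():
        reports = []
        for _ in range(passes_per_type):  # repeated passes to cross-check
            prompt = (
                "You are reviewing a scientific paper for errors.\n"
                f"Task: {instruction}\n"
                "List suspected errors with exact quotes; reply NONE if clean.\n\n"
                f"{paper_text}"
            )
            reports.append(call_llm(prompt))
        # Keep reports that flag something; a human still verifies each one.
        findings[error_type] = [r for r in reports if "NONE" not in r]
    return findings
```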
> The rate of false positives, instances when the AI claims an error where there is none, is a major hurdle. Currently, the Black Spatula Project’s system is wrong about an error around 10% of the time, says Gulloso. Each alleged error must be checked with experts in the subject, and finding them is the project’s greatest bottleneck, says Steve Newman, the software engineer and entrepreneur who founded the Black Spatula Project.
This seems like a small issue to me; it depends on the base rate of errors. Say there are 100,000 papers and 100 errors are found, of which 10 are false positives. That means 0.1% of the papers have to be hand-verified to find the 0.09% of papers with real errors. What's the alternative? Hand-verify all 100,000 papers? A sensible way to integrate AI systems is to think of them as aids rather than human replacements. Hallucination rates have been improving and I expect the false positive rate to go down linearly over the next few years. A time will come (it has come already for some) when it will be irresponsible not to have Claude at least check your work once before publishing.
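The triage arithmetic, worked through with the hypothetical counts above (only the ~10% false-positive rate comes from the article):

```python
# Back-of-the-envelope triage numbers from the hypothetical example above.
papers = 100_000   # papers screened (assumed)
flagged = 100      # papers the AI flags as containing an error (assumed)
fp_rate = 0.10     # ~10% of flags are false positives, per the article

true_errors = flagged * (1 - fp_rate)   # 90 real errors
review_share = flagged / papers         # fraction needing human review
error_share = true_errors / papers      # fraction actually erroneous

print(f"Hand-verify {review_share:.2%} of papers "
      f"to surface the {error_share:.2%} with real errors.")
# -> Hand-verify 0.10% of papers to surface the 0.09% with real errors.
```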