As companies lean on AI, a Microsoft study flags a growing risk
Artificial intelligence (AI) tools are quickly becoming co-workers, helping draft emails, edit code and even manage complex documents. But a new research paper suggests that handing over too much control to these systems could damage the very work they are meant to improve.
A study by Microsoft Research finds that large language models (LLMs) like ChatGPT and Claude can steadily degrade documents when asked to perform repeated editing tasks. In some cases, even the most advanced models “corrupt an average of 25% of document content” after extended use, the researchers said.
The findings raise questions about a growing workplace trend: delegating tasks to AI systems with minimal human oversight.
Promise and risk of AI delegation
The idea behind AI delegation is simple. Instead of manually editing files, users give instructions and let AI systems complete the task. This approach, sometimes called “delegated work” or “vibe coding,” is seen as a major shift in how knowledge work gets done. But it depends on trust.
“Delegation requires trust – the expectation that the LLM will faithfully execute the task without introducing errors into documents,” the researchers wrote.
That trust, the study suggests, may be premature. Using a benchmark called DELEGATE-52, the team tested 19 different AI models across 52 professional domains, from coding and accounting to music notation and textile design. The goal was to create real-world workflows where documents are edited repeatedly over time.
“Our findings show that current LLMs introduce substantial errors when editing work documents, with frontier models (Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4) losing on average 25% of document content over 20 delegated interactions, and an average degradation across all models of 50%,” the study said.
Small errors, big consequences
One of the key findings is that AI systems don’t always fail in obvious ways. Instead, they introduce what the researchers describe as “sparse but severe errors that silently corrupt documents.”
These could be simple mistakes like a wrong number or a missing sentence. But when the document is edited repeatedly, the errors pile up and change the final output.
Across all models tested, the average degradation reached about 50% by the end of long workflows. Even top-tier systems performed poorly over time.
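The arithmetic of compounding corruption can be sketched with a back-of-the-envelope model. The per-edit error rate below is a hypothetical figure chosen for illustration, not a number from the study, and the assumption that each edit independently corrupts a fixed fraction of the surviving content is a simplification of how real degradation behaves:

```python
# Toy model of compounding document corruption across delegated edits.
# Assumes each edit independently corrupts a fixed fraction of whatever
# content is still intact -- an illustrative simplification.

def surviving_fraction(per_edit_error: float, edits: int) -> float:
    """Fraction of original content still intact after `edits` rounds."""
    return (1 - per_edit_error) ** edits

# A seemingly small 1.4% error rate per edit compounds to roughly 25%
# total loss after 20 interactions: 1 - 0.986**20 is about 0.246.
loss = 1 - surviving_fraction(0.014, 20)
print(f"Content lost after 20 edits: {loss:.1%}")
```

The point of the sketch is that no single edit looks alarming; it is the repetition across a long workflow that turns a small per-step error rate into substantial loss.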
“Current LLMs are unreliable delegates,” the paper said, noting that performance drops as interactions increase.
Why longer workflows make things worse
The study highlights a critical issue: AI systems struggle with long, multi-step tasks. While many models perform well on short interactions, their accuracy declines sharply when tasks are chained together.
“Short-term performance… is not always predictive of long-horizon performance,” the researchers found.
This matters because most real-world work involves multiple steps: documents are edited again and again, not just once. The problem grows with large, complex files, since more steps mean more chances for mistakes, and those mistakes add up over time.
One might expect that giving models access to tools such as code execution or file-editing utilities would make them more accurate. The study found the opposite.
Models that used tools showed slightly worse results. The reason is partly technical: tool use increases the amount of data the model has to process, making it harder to maintain consistency across steps.
Not all domains are equal
The research also shows that AI performance varies depending on the type of task. Structured and rule-based domains, like programming, fare much better. In fact, coding was the only area where most models could reliably handle delegated workflows.
In contrast, tasks involving natural language or specialised formats, such as financial records or creative documents, saw much higher error rates.
What does that mean for workplaces?
The findings come at a time when companies are increasingly integrating AI into daily operations. From drafting reports to managing data, these tools are often used with minimal human review. The study suggests that the approach may need rethinking.
Users “still need to closely monitor LLM systems as they operate,” the researchers warned, especially in high-stakes tasks.
Despite these shortcomings, the researchers note that progress is rapid: newer models show significant improvements over earlier versions, even if they are not yet ready for full delegation.