AI has a management problem, not a data problem

Prolific is a Business Reporter client

New research shows the biggest bottleneck in AI development isn’t cost or compute – it’s clear communication.

When we asked more than 120 AI practitioners what’s slowing them down, cost ranked last. Not the compute bill, not the annotation budget, not the price of expert time. What topped the list was something the industry has spent far less time thinking about: the ability to communicate clearly with the humans being asked to shape these systems.

That finding cuts against years of received wisdom about what AI development actually costs. It also points towards a problem that no amount of infrastructure spending can fix.

The practitioners we surveyed spanned AI engineers, researchers, product managers and business leaders across organisations actively building and deploying AI systems. For them, the top bottleneck for advancing AI safety and alignment wasn’t compute, and it wasn’t theoretical gaps in how we understand these models either. It was the quality of the human feedback used to train them, followed closely by the difficulty of measuring whether training is even working in the first place.

To understand why, it helps to look at how the role of humans in AI development has changed. The original mental model was built for a different era. Early AI systems needed volume: thousands of people clicking through image classification tasks, labelling whether emails were spam, choosing which output sounded more fluent. The work was repetitive, the instructions were binary and the humans doing it were largely interchangeable. Intelligence was a raw material to be purchased in bulk.

That model is breaking down. The tasks that matter now aren’t binary. When you’re building a system to reason through ambiguous legal scenarios, navigate a multi-step clinical workflow or generate code for a production environment, you need a different kind of human input entirely. You need someone who can tell you whether the model is actually reasoning correctly, not just whether its output sounds plausible.

Our data reflects this shift. The top human contributions that AI teams rely on today are designing evaluation methodologies and subject matter expert validation; nearly half of respondents cited each as a primary input. The humans shaping these models have moved well past labelling. They’re designing the tests that determine whether a model reasons correctly in the first place.

Here’s what the industry hasn’t fully reckoned with: it doesn’t yet know how to work with these people. When we asked practitioners to identify the biggest difficulty in involving humans in their AI workflows, the least-cited challenge was cost; only around one in six respondents flagged it. The top challenges were communicating tasks clearly enough for expert contributors to actually perform them, and finding people with the right combination of domain knowledge and contextual understanding in the first place. These are more management problems than compute problems.

We call this the instruction gap: the loss of signal between what an engineer needs and what an expert contributor can deliver without proper context. In the era of simple labelling, instructions were clean and objective. A picture either contained a stop sign or it didn’t.
Today, explaining to a cardiologist how to evaluate a model’s reasoning about arrhythmia, or to a securities lawyer how to assess whether a contract review agent is flagging the right clauses, requires a genuine transfer of operational knowledge. When that transfer fails, the data comes back noisy. And noisy data, fed into a sophisticated model over time, produces a less capable and misaligned system.

The stakes of getting this wrong are rising in direct proportion to how ambitious the systems are becoming. Nearly two-thirds of practitioners in our survey identified AI agents and autonomous systems as the primary growth area for 2026. These are systems that do more than just generate text in response to a prompt. They plan, decide and act. They book appointments, execute transactions, triage documents and navigate interfaces on your behalf.

The standard for reliability is categorically different from anything the industry has shipped before. A chatbot that produces a confused response is simply an inconvenience. An agent making decisions in a healthcare, legal or financial context without properly calibrated human guidance is something else entirely. The more autonomous these systems become, the more their behaviour depends on the precision of the human signal used to train them.

There’s no shortcut around this. The companies beginning to close the instruction gap aren’t doing it by spending more. They’re doing it by treating the management of human expertise as a first-class engineering problem, one that deserves the same operational rigour as model architecture, infrastructure and deployment pipelines.

What this looks like in practice is less exciting than a new training technique, but considerably more impactful. Before a domain expert ever sees a model output, the best teams invest significant time in calibration: explaining not just what the task is, but why it matters, how the model will eventually be deployed, what failure looks like in context and which edge cases the team is most worried about.

A cardiologist evaluating a diagnostic model needs to know whether it’s being built for triage in an emergency department or for routine screening in a primary care setting. Those are different tasks, requiring different judgments. Without that context, even the most qualified contributor is working in the dark. This is the instruction gap in practice.

This process has to be ongoing, not a one-off onboarding. The best human feedback systems build in regular loops where contributors can flag ambiguous instructions, engineers can tighten their task design based on observed disagreements, and the gap between what was intended and what was understood gets progressively smaller. Right now, most teams don’t have this. They write instructions once, ship them to contributors and treat the resulting data as ground truth. The noise that enters the pipeline at that stage doesn’t announce itself. It just quietly degrades the model’s judgment over time.
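To make “observed disagreements” concrete, here is a minimal Python sketch of one way such a loop might surface items for instruction review. The data, names and threshold are illustrative assumptions, not a description of any particular platform’s API; the underlying idea is simply that persistent disagreement between qualified raters on the same item is often a symptom of ambiguous instructions rather than a genuinely hard example.

```python
from collections import Counter

# Hypothetical ratings: each evaluation item judged by several experts.
# (Illustrative data and identifiers only.)
ratings = {
    "item_017": ["pass", "pass", "pass"],
    "item_042": ["pass", "fail", "fail"],
    "item_108": ["fail", "pass", "unsure"],
}

AGREEMENT_THRESHOLD = 0.8  # assumed cut-off; tune per task


def agreement(labels):
    """Fraction of raters who chose the item's most common label."""
    top_count = Counter(labels).most_common(1)[0][1]
    return top_count / len(labels)


for item_id, labels in ratings.items():
    score = agreement(labels)
    if score < AGREEMENT_THRESHOLD:
        # Route back to the task designer: disagreement here may mean
        # the instructions, not the experts, are the problem.
        print(f"{item_id}: agreement {score:.2f} -> flag instructions for review")
```

In a real pipeline, flagged items would feed a revision cycle: clarify the brief, re-run the items and watch agreement climb. That is roughly what closing the instruction gap looks like in data terms.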
There’s also a recruiting problem that no operational process fully solves, but that better processes can significantly reduce. Finding a credentialed expert is one thing. Finding a credentialed expert who can also translate their tacit professional knowledge into explicit, consistent evaluations is considerably harder.

The teams doing this well have stopped treating expert recruitment as a one-off procurement exercise and started treating it more like hiring: building relationships with domain networks, investing in contributor development and creating the conditions for genuine expertise to be expressed rather than just extracted.

None of this is glamorous. But as AI systems take on more consequential roles, the quality of the human judgment used to shape them stops being a nice-to-have and becomes a direct determinant of whether those systems are safe to deploy at all.

The industry has spent years asking how to make models more capable. The more pressing question now is how to make the humans guiding them more effective. Because the model on the other side of the process is only ever as good as the clarity of the people who shaped it.

Find out more about Prolific for AI.
