AI Is a Long Way From Replacing Software Coders

ChatGPT and other large language models (LLMs) are undeniably magical. No one really understands how they work or can reliably predict what text they will generate. They are like human brains in that respect. But in other respects, LLMs are not at all like human brains, and recent studies confirm their weaknesses. LLMs train on unimaginable amounts of text and then generate convincing text based on statistical patterns they discover. However, they have no way of relating the text they input or output to the physical world. Consequently, they have no common sense or grasp of causality. They have no way of assessing whether a factual statement is true or false or whether a logical argument is persuasive.

LLM vs. human brain

Human brains, in contrast, train on the physical world. We know what happens when a person uses a wooden bat to hit a thrown baseball because we have seen it happen, and we don't need to see it happen a billion times. We can confidently predict what will happen if a thrown baseball is hit with an aluminum bat, a fiberglass hockey stick, or a wooden broomstick, even if we have never seen it happen. Ditto a baseball hit with a T-shirt, a piece of paper, or a foam water toy. We also know that the color of the objects doesn't matter, and that we don't want to be hit by a baseball or a bat.

We have little understanding of how our magical brains know such things. But it is clear that they learned from watching the world we live in, not from discovering statistical patterns in textual databases. It is also clear that ChatGPT and other LLMs will never rival our brains until they are able to understand what words mean and how words relate to the physical world. This is the fundamental reason why LLMs are not about to take over jobs that require logical reasoning, critical thinking, common sense, and notions of causality, particularly when mistakes are expensive.

Why can't LLMs code better than humans?

It might seem that coding is one thing that LLMs could do better than humans. After all, writing computer code that sorts names alphabetically doesn't seem to require any critical thinking or knowledge of the physical world.

Many CEOs seem to agree, though their claims are undeniably self-serving. Microsoft CEO Satya Nadella has said that 30% of Microsoft's code is already AI-generated, while Salesforce CEO Marc Benioff has said that AI now does up to 50% of all work, including coding. Meta CEO Mark Zuckerberg has claimed that within a year AI will write 50% of Meta's code, and Amazon Web Services CEO Matt Garman says most coders could stop coding soon. AI startup CEOs are even more optimistic: "Anthropic CEO Dario Amodei said in May that half of all entry-level jobs could disappear in one to five years, resulting in U.S. unemployment of 10% to 20%."

LLMs can certainly find and replicate code far faster than humans, but they hallucinate during coding just as generative AI does with text, and some of those hallucinations can cause serious damage, such as deleting databases. For instance, within days of each other this month, two AI coding assistants, Replit's agent and Google's Gemini CLI, destroyed company data. Google's Gemini CLI "destroyed user files while attempting to reorganize them," while "Replit's AI coding service deleted a production database despite explicit instructions not to modify code."

AI coding assistants may also be unable to break unique, complex assignments into small tasks that fit together logically. They are apt to do poorly at communicating with clients and collaborators. In addition, anyone who has done nontrivial programming (and we have, many times) knows that debugging is typically the biggest challenge. Because they do not understand what computer code does or is intended to do, LLMs often find it nigh impossible to identify coding errors that are errors of logic rather than syntax. If LLMs cannot be trusted to debug the code they generate, then humans will have to take over, and the only thing harder than debugging your own code is debugging someone else's code. Thus the hours LLMs save by generating code they found elsewhere may well be swamped by the days humans lose debugging the LLM-recycled code.
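To make the logic-versus-syntax distinction concrete, here is a small hypothetical Python sketch (our own illustration, not code drawn from any study or product discussed here). It contains no syntax errors and runs without complaint, yet it quietly returns the wrong answer whenever data are missing; spotting the bug requires knowing what the code is supposed to do.

```python
def average_ignoring_missing(values):
    """Intended behavior: average only the non-missing (non-None) entries."""
    total = 0.0
    for v in values:
        if v is not None:
            total += v
    # Logic bug: divides by the length of the whole list, including the
    # None entries that were skipped above. No syntax error, no crash,
    # just a quietly wrong answer whenever any values are missing.
    return total / len(values)


print(average_ignoring_missing([10, 20, 30]))          # 20.0 (correct)
print(average_ignoring_missing([10, None, 20, None]))  # 7.5  (should be 15.0)
```

A compiler or linter flags syntax errors automatically; this kind of mistake is caught only by a reviewer, human or otherwise, who understands the intent behind the code.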
At present, LLMs are not more efficient, despite our assumptions

A study conducted by Model Evaluation & Threat Research (METR) found that, even though coders believed that AI was saving them time, it actually ended up taking 19% more time: "The study observed 16 experienced developers across 246 real tasks on mature open-source projects that they were already familiar with, using popular tools from Cursor Pro and Claude 3.5/3.7." The study was much different from previous ones, which "often rely on software development benchmarks for AI, which sometimes misrepresent real-world tasks."

The study found that while developers spent less time writing code, they spent more time prompting, considering LLM suggestions, and, most importantly, reviewing and debugging the AI-generated code. The results even surprised the researchers. Before beginning the tests, one of the lead authors wrote that he expected "a 2x speed up, somewhat obviously."

METR also did a study on improvements in AI for software coding, with a summary published in Nature and in IEEE Spectrum. This study devised a metric called the "task-completion time horizon": "It's the amount of time human programmers would take, on average, to do a task that an LLM can complete with some specified degree of reliability, such as 50 percent. A plot of this metric for general-purpose LLMs going back several years shows clear exponential growth, with a doubling period of about seven months."

A doubling every seven months is very rapid, much faster than the 12-to-24-month doubling associated with Moore's Law. However, 50% reliability is far too low for any real-world task where errors have substantial costs; even 99% or 99.9% might be too low. The study used this low 50%-reliability bar because "individual benchmarks saturate increasingly quickly, and we lack a more general, intuitive, and quantitative way to compare between different benchmarks, which prevents meaningful comparison between models of vastly different capabilities (e.g., GPT-2 versus o1)." In other words, if they had required more reliability, diminishing returns would have appeared and compressed the performance differences across systems. Yet our real-world objective is not to compare unreliable systems but to see whether any system is reliable enough to be trusted with real-world tasks.
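For a sense of what a seven-month doubling period implies, here is a brief back-of-the-envelope sketch. The starting horizon of one hour is purely hypothetical; only the doubling periods (roughly seven months for the METR metric, 12 to 24 months for Moore's Law) come from the discussion above.

```python
# Compound an illustrative starting time horizon at two doubling rates.
# The 1-hour starting value is hypothetical; the doubling periods are the
# ~7 months reported by METR and the 24-month end of the Moore's Law range.
START_HOURS = 1.0

for months in (0, 12, 24, 36):
    metr_pace = START_HOURS * 2 ** (months / 7)    # ~7-month doubling
    moore_pace = START_HOURS * 2 ** (months / 24)  # 24-month doubling
    print(f"after {months:2d} months: "
          f"7-month doubling ≈ {metr_pace:5.1f} h, "
          f"24-month doubling ≈ {moore_pace:4.1f} h")
```

The compounding looks dramatic on paper, but, as noted above, it is measured against a 50% reliability bar that real-world work rarely tolerates.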
The critical "messiness score"

The METR researchers also considered "messy" tasks "that more resembled ones in the 'real world'" and found that "Large Language Models are more challenged by tasks that have a high 'messiness' score." Tellingly, a "high messiness score" was not very high. The mean messiness score among the tasks they considered was only 3.2 out of 16, and none of the tasks had a messiness score above 8. "For comparison, a task like 'write a good research paper' would score between 9/16 and 15/16, depending on the specifics of the task." In other words, the tests did not involve the challenging tasks that humans routinely do in their work, such as real coding problems, and the LLMs still struggled.

The bottom line is that LLMs excel at simple coding tasks, but they are still too unreliable to be used without extensive human supervision on complex tasks where mistakes are expensive. They are getting better, but they are not nearly ready to take over programming from humans.