Fallibility is a hard problem, even for people. We all make mistakes. And Large Language Models (LLMs) like GPT-4 make bigger mistakes than we do, or so it would seem.
There has been significant technical progress in making LLMs less error-prone, but far more improvement is needed.
The article “Wrong Turn at Taipei: The crash of Singapore Airlines flight 006” describes an October 2000 accident in which a flight crew attempted to take off from a closed runway in Taipei; 83 of the 179 people on board died. The root cause involved several compounding human errors, and, as the author argues, the expectation that humans will perform perfectly in every case is fatally flawed. In more detail:
In any field involving humans, achieving the highest possible level of safety requires that barriers be erected to prevent human errors from leading to disaster, because the expectation that humans will perform perfectly in every case is fatally flawed. Certainly one would expect (commercial) pilots to line up with the correct runway 999,999 times out of a million. The remaining 1 in a million only has to happen once to leave hundreds dead and others with lifelong injuries. Furthermore, with the volume of flights being carried out 24 hours a day, 365 days a year, for year after year, that small chance eventually becomes an inevitability, unless redundancy is added to the system.
In the case of flight 006 there was human redundancy: three experienced, credentialed pilots on the flight deck and many more people involved in airfield management and operation. The disaster happened anyway.
Humans make many mistakes, of many different types. Beyond reading about the flight 006 disaster:
Watch “The Monkey Business Illusion” video by Daniel J. Simons on YouTube and count how many times the players wearing white pass the ball. Did you get the right answer? Selective attention can be a killer.
Explore the body of literature on cognitive biases, and the many ways in which people demonstrate failures in logic.
Humans are fallible and AI is fallible, but let’s not get glib.
LLMs like GPT-4 can’t be trusted, and their output must be carefully inspected for the serious logical errors they can generate. People come to trust other people based on direct and indirect experience (observation of behavior) under a range of conditions. Your LLM-based tools, assistants, systems, and applications need to be tested over and over again to earn the trust of users and of others affected by the LLM’s results.
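To make “tested over and over again” concrete, here is a minimal Python sketch of a repeated-evaluation loop. The call_llm and looks_correct functions are hypothetical placeholders, not real APIs; a real harness would wrap whatever model client and domain-specific checks your application actually uses.

```python
# A minimal sketch of a repeated-evaluation harness, assuming hypothetical
# call_llm() and looks_correct() stand-ins for a real model API and a real
# domain-specific check.

import random


def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM API call."""
    # Simulate a model that answers correctly most of the time, but not always.
    return "4" if random.random() > 0.02 else "5"


def looks_correct(prompt: str, answer: str) -> bool:
    """Hypothetical stand-in for whatever correctness check your domain allows."""
    return answer.strip() == "4"


def estimate_failure_rate(prompt: str, trials: int = 1000) -> float:
    """Ask the same question many times and report how often the answer fails the check."""
    failures = sum(
        not looks_correct(prompt, call_llm(prompt)) for _ in range(trials)
    )
    return failures / trials


if __name__ == "__main__":
    rate = estimate_failure_rate("What is 2 + 2?")
    print(f"Observed failure rate over 1000 trials: {rate:.1%}")
```

Even a crude loop like this makes the flight 006 lesson visible: a small per-call error rate, multiplied by enough calls, eventually becomes a certainty.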
Pity the fool who learns to trust an LLM and fails to check whether it has finally failed today.
The consequences of a possible error should determine whether the user has to check the quality of the content an LLM creates. That decision should not be left to chance or trust, at least not at this stage of the technology’s development.
Trust me on this!
It should be possible to create a 'consequences' index for AI-generated content, similar to the readability score that Grammarly assigns, to assess the impact of a failed recommendation. This could be an app that runs alongside the AI; a rough sketch of the idea follows.
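The Python sketch below shows one way such an index might work. The keyword lists, weights, and thresholds are illustrative assumptions only, not a validated rubric; a real app would likely use a trained classifier rather than keyword matching.

```python
# An illustrative sketch of a 'consequences' index. The keyword lists, weights,
# and thresholds are assumptions chosen for demonstration, not a validated rubric.

HIGH_STAKES_TERMS = {"dosage", "diagnosis", "wire transfer", "liability", "runway"}
MEDIUM_STAKES_TERMS = {"contract", "tax", "medication", "deadline"}


def consequences_index(text: str) -> int:
    """Score a piece of AI-generated text from 0 (low stakes) to 10 (high stakes)."""
    lowered = text.lower()
    score = 4 * sum(term in lowered for term in HIGH_STAKES_TERMS)
    score += 2 * sum(term in lowered for term in MEDIUM_STAKES_TERMS)
    return min(score, 10)


def review_policy(text: str) -> str:
    """Map the index to a recommendation: the severity of a possible error,
    not accumulated trust, decides how much human checking is required."""
    score = consequences_index(text)
    if score >= 7:
        return "High stakes: require expert review before use."
    if score >= 3:
        return "Medium stakes: spot-check against a reliable source."
    return "Low stakes: light review is acceptable."


if __name__ == "__main__":
    draft = "Recommended dosage: 500 mg twice daily, depending on the diagnosis."
    print(consequences_index(draft), "-", review_policy(draft))
```

However it is implemented, the point is the same: the higher the cost of a wrong answer, the more scrutiny the AI’s output must receive before anyone acts on it.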