
The Limitations of AI in Formal Reasoning

Image: DALL·E illustration (2024-10-19) of an abstract, stylised human brain divided into two halves, the left half rendered as machinery.
Recent advancements in Large Language Models (LLMs) show promise, but the models struggle with logical reasoning due to token bias, where small input changes can lead to significantly different outputs. This fragility raises ethical concerns about their reliability, particularly in critical fields like medicine, law, and public policy. The reliance of LLMs on pattern recognition rather than formal reasoning calls for deeper scrutiny and improvements in AI development.

Recent advancements in artificial intelligence (AI), particularly in Large Language Models (LLMs), have demonstrated notable achievements across various domains, including natural language processing and mathematical reasoning. However, questions arise regarding the reliability of these models in performing genuine logical reasoning tasks:

… most LLMs still struggle with logical reasoning. While they may perform well on classic problems, their success largely depends on recognizing superficial patterns with strong token bias, thereby raising concerns about their actual reasoning and generalization abilities.

Jiang et al., 2024.
The code and data for the study by Jiang et al. are open-sourced: A Peek into Token Bias.

'Token bias' in AI refers to the unintended favouring or disfavouring of certain words, phrases, or tokens during the training or operation of the model. Examples of this are 'sex bias' (e.g., assuming a doctor is male but a nurse female) or 'cultural bias' (e.g., assuming Irish people are drunkards).
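As a concrete illustration, here is a minimal sketch of how one might probe for token bias: hold the logical structure of a question fixed, vary only the surface tokens (the name and profession), and compare the answers. The 'query_model' helper is hypothetical, a placeholder for whichever LLM API is being tested; the word problem is my own illustrative example, not an item from either study.

```python
# Minimal token-bias probe: keep the problem's logic fixed and vary only
# surface tokens (here, the name and profession), then compare the answers.
# 'query_model' is a hypothetical placeholder; substitute any LLM API call.

def query_model(prompt: str) -> str:
    # Placeholder: replace with a real call to your model of choice.
    return "<model answer for: " + prompt + ">"

TEMPLATE = (
    "{name}, a {profession}, sees 12 patients before lunch and half as many "
    "after lunch. How many patients does {name} see in total?"
)

variants = [
    {"name": "James", "profession": "doctor"},
    {"name": "Mary", "profession": "doctor"},
    {"name": "James", "profession": "nurse"},
    {"name": "Mary", "profession": "nurse"},
]

for v in variants:
    answer = query_model(TEMPLATE.format(**v))
    print(v["name"], v["profession"], "->", answer)

# The correct answer (18) is independent of the name and profession;
# any variation across these four answers is evidence of token bias.
```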

What the Jiang et al. study and a recent study by AI researchers at Apple show is that small changes in input tokens can drastically alter model outputs, indicating a strong 'token bias' and suggesting that these models are highly sensitive and fragile. Additionally, in tasks requiring the correct selection of multiple tokens, the probability of arriving at an accurate answer decreases exponentially with the number of tokens or steps involved.
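The compounding effect behind that last point can be shown with a back-of-the-envelope calculation: if each step (or token selection) is correct with probability p, independently, then an n-step chain is entirely correct with probability p to the power n, which shrinks exponentially. The numbers below are purely illustrative, not measurements from either study.

```python
# Illustrative only: if each step succeeds independently with probability p,
# the chance of an entirely correct n-step chain is p ** n.
for p in (0.99, 0.95, 0.90):
    for n in (5, 10, 20, 50):
        print(f"p={p:.2f}, steps={n:2d} -> chain accuracy ~ {p ** n:.3f}")
# Even at 95% per-step accuracy, a 50-step chain succeeds only about 7.7% of the time.
```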

This inherent unreliability in complex reasoning is highly problematic given that the deployment of AI technologies, particularly LLMs, is becoming ubiquitous across sectors. With an increasing number of employees leaning on these tools for the creation of written documents, and leaders making decisions based on these documents, it matters whether LLMs have genuine logical reasoning capabilities or are merely relying on probabilistic pattern-matching.

What may seem like a merely epistemological or academic quibble fundamentally matters because there are far-reaching implications as AI systems become increasingly integrated into decision-making processes in critical fields like medicine, law, and education. If AI systems cannot perform robust formal reasoning, their deployment in areas requiring precision may be ethically untenable.

Epistemology, Formal Reasoning, and LLMs

Epistemology, the branch of philosophy concerned with the nature and scope of knowledge, provides a useful framework for understanding the limitations of LLMs in performing formal reasoning. Classical epistemologists like Immanuel Kant (1724–1804) distinguished between a priori knowledge (knowledge independent of experience) and a posteriori knowledge (knowledge dependent on experience). LLMs, by design, operate on a posteriori principles, learning from vast datasets through pattern recognition. While this allows them to perform well on many tasks, it does not equip them to handle a priori reasoning, which is essential for formal logic and mathematics.

By mathematics, I'm not talking about the math most of us do each day. Splitting the bill at a restaurant ("who ordered that bottle of Château Latour?"), reading a balance sheet in a board meeting, or trying, and failing, to calculate what delay to use on the washing machine so it finishes when I arrive home. I'm talking about mathematical logic—set theory, model theory, recursion theory, proof theory, and constructive mathematics.

Recent work by AI researchers at Apple highlights that LLMs struggle with even basic logical reasoning tasks, such as recognising the correct sequence of steps to solve mathematical problems when numerical values or the number of clauses in a question change.
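A rough sketch of the kind of perturbation that work describes: the same word problem is re-instantiated with different numerical values, optionally with an irrelevant clause appended, and the model is expected to produce the same chain of reasoning regardless. The template and the distracting clause below are my own illustrative examples, assumed for the sketch, not items drawn from the GSM-Symbolic benchmark itself.

```python
import random

# Illustrative GSM-Symbolic-style perturbation: same problem structure,
# different numbers, optionally an irrelevant (distracting) clause.
TEMPLATE = (
    "Liam picks {a} apples on Saturday and {b} apples on Sunday. "
    "He gives {c} apples to a friend."
)
IRRELEVANT = " Five of the apples he picked were slightly smaller than average."
QUESTION = " How many apples does Liam have left?"

def make_variant(add_distractor: bool = False) -> tuple[str, int]:
    a, b = random.randint(5, 50), random.randint(5, 50)
    c = random.randint(1, a + b)
    body = TEMPLATE.format(a=a, b=b, c=c)
    if add_distractor:
        body += IRRELEVANT  # irrelevant to the arithmetic, yet often misleads
    return body + QUESTION, a + b - c  # (prompt, ground-truth answer)

prompt, answer = make_variant(add_distractor=True)
print(prompt, "| expected:", answer)
```

A model reasoning formally would answer every such variant correctly; a model pattern-matching on surface features sees its accuracy drop as the numbers and clauses drift from its training distribution.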

That LLMs are highly sensitive to changes in input tokens suggests that their reasoning process is not grounded in formal logic but rather in probabilistic pattern-matching. This presents an epistemic limitation: that LLMs lack the ability to abstract universal principles from specific examples, a key component of formal reasoning.

From an epistemological standpoint, this reliance on inductive reasoning—pattern recognition from large datasets—contrasts sharply with the deductive reasoning required for complex problem-solving. This distinction raises significant questions about the reliability of LLMs in domains where deductive reasoning is crucial.

Ethical Implications of Deploying AI

The ethical implications of deploying AI systems like LLMs extend beyond technical concerns into the realm of moral philosophy. Kant's deontological ethics (morality of an action), which emphasises duty and the adherence to universal moral laws, offers a lens through which to critique the deployment of LLMs in tasks that require reliable formal reasoning. According to Kant, moral actions are those performed out of duty, and which follow rational principles or morally just motivations, rather than being contingent on specific outcomes.

As demonstrated in the GSM-Symbolic analysis, LLMs exhibit a significant performance drop when confronted with tasks that deviate from their training data patterns. The failure to recognise irrelevant clauses in mathematical problems or the incorrect processing of changed numerical values suggests that LLMs cannot yet meet the rational standard required for high-stakes applications.

… these models attempt to perform a kind of in-distribution pattern-matching, aligning given questions and solution steps with similar ones seen in the training data. As no formal reasoning is involved in this process, it could lead to high variance across different instances of the same question.

Mirzadeh et al., 2024.

Add the utilitarian ethics of thinkers such as John Stuart Mill (1806–1873) into the mix—which focuses on maximising overall happiness—and the deployment of such unreliable AI systems could result in negative outcomes that outweigh the perceived benefits. In practical terms, this could lead to misdiagnoses in healthcare, legal misjudgements, or faulty decision-making in public policy.

Moreover, Aristotle's notion of phronesis (practical wisdom) underscores the importance of context-sensitive judgment, particularly in ethical decision-making. LLMs, which operate based on pre-trained data without the ability to engage in real-time, context-sensitive reasoning, fall short of this. An inability to discern the relevance of certain pieces of information means LLMs lack the practical wisdom necessary to handle complex, real-world scenarios.

Technological Determinism and the Risks of Over-Reliance

Technological determinism, Thorstein Veblen's (1857–1929) theory that technology shapes society's development in deterministic ways, provides a cautionary perspective on the current trajectory of AI. As society becomes increasingly reliant on AI for decision-making, there is a risk that the limitations of LLMs will be overlooked in favour of efficiency and cost-effectiveness. The ethical question here is whether leaders should allow such technologies to influence critical decisions when their epistemic foundation is fundamentally flawed.

Martin Heidegger's (1889–1976) critique of technology in The Question Concerning Technology highlights the danger of viewing technology as an end in itself, rather than as a tool to serve human purposes. Heidegger warns that when technology becomes the primary means through which we understand the world, we risk reducing complex human experiences and decisions to mere data points. The current trend of using LLMs for reasoning in high-stakes domains exemplifies this risk. By treating LLMs as reliable tools for tasks that require formal reasoning, leaders may inadvertently prioritise technological convenience over ethical responsibility. Put another way, technology is a means to revealing the truth; it is not truth itself.

For LLMs to be ethically and epistemologically viable in tasks requiring formal reasoning, significant advancements in AI research are needed. Current LLMs, as shown by their performance on benchmarks like GSM8K and GSM-Symbolic, operate primarily through inductive pattern recognition. However, true formal reasoning requires deductive logic and the ability to apply universal principles to novel problems.

Philosophers such as Karl Popper (1902–1994) have argued that scientific progress is driven not by the accumulation of data but by the falsification of hypotheses. This approach must inform future developments in AI, where models are trained not only to recognise patterns but to actively test and falsify their reasoning processes. Such advancements would bring AI closer to achieving the level of reasoning required for high-stakes applications, aligning technological capabilities with ethical standards.

As a society we are well advised to take considerable interest in which lobby groups or activists are influencing our leaders. Yet, we fail in our ethical stewardship if we do not take a similar interest in the training, bias, and moral influences on the AI systems that inform the decision making of our leaders.

Good night, and good luck.

Further Reading

Heidegger, M. (2013). The Question Concerning Technology and Other Essays. (W. Lovitt, Trans.), New York: HarperCollins Publishers. (Original work published 1954)

Jiang, B., Xie, Y., Hao, Z., … Roth, D. (2024). A Peek into Token Bias: Large Language Models Are Not Yet Genuine Reasoners.

Mirzadeh, I., Alizadeh, K., Shahrokhi, H., Tuzel, O., Bengio, S., & Farajtabar, M. (2024). GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models. Retrieved October 13, 2024, from http://arxiv.org/abs/2410.05229.

Schaeffer, R., Miranda, B., & Koyejo, S. (2023). Are emergent abilities of large language models a mirage?

Shi, F., Chen, X., Misra, K., … Zhou, D. (2023). Large language models can be easily distracted by irrelevant context. In A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, & J. Scarlett, eds., Proceedings of the 40th International Conference on Machine Learning, Vol. 202, PMLR, pp. 31210–31227.
