New Research Unveils AI Limitations and Model Complexities
A series of recent publications on arXiv, all dated May 28, 2026, has brought to light significant intrinsic challenges in the understanding, evaluation, and safety of artificial intelligence models, ranging from spoken language systems to Vision Transformers (ViTs) and Large Language Models (LLMs). These studies underscore the growing complexity of a rapidly evolving field and the urgent need for more robust metrics and methodologies.
What happened
Several research groups have published in-depth analyses of current AI limitations and evaluation methodologies. One study challenges the use of global token perplexity as an evaluation metric for generative spoken language models. The paper “On the Fallacy of Global Token Perplexity in Spoken Language Model Evaluation” argues that directly applying text perplexity to speech ignores fundamental differences between the two modalities, potentially leading to an underestimation of spoken language characteristics. This suggests that current methods might not fully capture the quality and coherence of speech models.
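For readers unfamiliar with the metric, the following is a minimal sketch of the standard definition of perplexity (the exponential of the mean negative log-likelihood per token) and of what "global" pooling across utterances means. It is not taken from the paper and does not reproduce its critique or any alternative it may propose; the function names and the log-probability values are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import math


def perplexity(token_log_probs):
    """Standard perplexity: exp of the mean negative natural-log probability."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


def global_perplexity(utterances):
    """Pool the tokens of all utterances into a single average (assumed meaning of 'global')."""
    pooled = [lp for utterance in utterances for lp in utterance]
    return perplexity(pooled)


# Hypothetical per-token log-probabilities for two utterances.
utterances = [
    [math.log(0.5)] * 4,   # four tokens, each assigned probability 0.5
    [math.log(0.25)] * 2,  # two tokens, each assigned probability 0.25
]

for i, utterance in enumerate(utterances):
    print(f"utterance {i}: {perplexity(utterance):.2f}")  # 2.00, then 4.00
print(f"global: {global_perplexity(utterances):.2f}")      # 2.52
```

In this toy case the pooled figure sits between the two per-utterance values and is weighted toward the longer sequence, which is one reason a single global number can hide differences between individual samples.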
Another paper, “Differential syntactic and semantic encoding in LLMs”, examined how syntactic and semantic information is encoded in the inner-layer representations of Large Language Models, with a focus on DeepSeek-V3. The authors found that by averaging the hidden-representation vectors of sentences that share a syntactic structure or a meaning, it is possible to obtain vectors that capture a significant proportion of this information. The study offers insight into how LLMs process language, a fundamental step towards greater interpretability.
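The averaging idea can be illustrated with a small sketch. This is not the authors' code or their exact procedure: the three-dimensional vectors below are made-up stand-ins for real hidden states, and the helper names are hypothetical. The sketch only shows how averaging vectors from sentences that share a property yields a single "centroid" vector, and how a new vector can then be compared with each centroid.

```python
import math


def mean_vector(vectors):
    """Element-wise average of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


# Made-up "hidden states" for sentences sharing structure A and structure B.
group_a = [[1.0, 0.1, 0.0], [0.9, -0.1, 0.2], [1.1, 0.0, -0.2]]
group_b = [[0.0, 1.0, 0.1], [0.1, 0.9, -0.1], [-0.1, 1.1, 0.0]]

centroid_a = mean_vector(group_a)  # averaging cancels sentence-specific noise
centroid_b = mean_vector(group_b)

# A held-out vector, assumed here to come from a structure-A sentence.
new_vector = [0.8, 0.2, 0.1]
print("similarity to A:", round(cosine(new_vector, centroid_a), 3))
print("similarity to B:", round(cosine(new_vector, centroid_b), 3))
```

In an actual analysis the vectors would be extracted from a chosen layer of the model, and the share of information captured would be measured with whatever procedure the paper defines, which this sketch does not attempt to replicate.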
Concurrently, the research “On the Intrinsic Limits of Transformer Image Embeddings in Non-Solvable Spatial Reasoning” addressed the intrinsic limitations of Vision Transformers (ViTs) in spatial reasoning tasks, such as mental rotation. While ViTs excel in semantic recognition, they exhibit systematic failures in spatial tasks. The study argues that this limitation is not solely due to data scale but arises from the intrinsic circuit complexity of the architecture itself, identifying a fundamental computational bottleneck.
Finally, to address gaps in medical safety evaluation for LLMs, researchers introduced JMedEthicBench, the first multi-turn conversational benchmark for evaluating medical safety in Japanese Large Language Models. The paper “JMedEthicBench: A Multi-Turn Conversational Benchmark for Evaluating Medical Safety in Japanese Large Language Models” emphasizes that existing benchmarks are predominantly English-centric and rely on single-turn prompts, which are inadequate for multi-turn clinical consultations. JMedEthicBench is based on 67 guidelines from the Japan Medical Association, offering an essential tool for safe AI implementation in healthcare.
Why it matters
Collectively, these studies highlight a fundamental truth: the advancement of AI requires a deeper understanding of its internal mechanisms and intrinsic limitations. The fallacy of global perplexity in spoken language means that current evaluations may have misjudged the capabilities of speech models, with implications for applications like virtual assistants or text-to-speech systems. Understanding how LLMs encode syntax and semantics is crucial for building models that are more robust, less prone to bias, and easier to interpret, all fundamental aspects of responsible AI. The limitations of Vision Transformers in spatial reasoning remind us that AI is not a universal solution, and that the architecture itself can impose insurmountable boundaries, requiring hybrid or entirely new approaches for certain tasks.
The introduction of JMedEthicBench is particularly significant. As AI expands into critical sectors like healthcare, the need for language- and culture-specific safety benchmarks becomes urgent. Medical safety cannot be assessed with generic metrics or only in English; it requires meticulous attention to cultural nuances and the complex interactions that characterize clinical dialogue. This is an essential step to ensure that AI, especially in healthcare contexts, is not only performant but also ethical and safe for all users.
The HDAI perspective
These research findings reinforce Human Driven AI's belief that technological progress must be accompanied by critical analysis and a deep understanding of its implications. It's not just about improving performance, but about ensuring AI is safe, fair, and transparent. The discovery of intrinsic limitations and the need for more sophisticated evaluation metrics underscore the importance of a human-centric approach to AI governance. The creation of benchmarks like JMedEthicBench, which account for linguistic and cultural specificities, is a prime example of how ethical AI must be integrated into the design and evaluation phases rather than treated as an afterthought. This approach, which places humans at the center of technological development, is at the core of the Human Driven AI vision and of the themes we will address at the HDAI Summit 2026 in Pompeii.
What to watch
It will be crucial to observe how the research community responds to these findings by developing new metrics and architectures that overcome current limitations. Integrating these insights into regulatory frameworks, such as the EU AI Act, will be fundamental to guiding AI development and deployment that is truly responsible and safe for society. The continued emphasis on interpretability and safety, especially in high-risk sectors, will define the future of AI innovation in Italy and globally.

