The European Union Artificial Intelligence Act, formally Regulation (EU) 2024/1689, entered into force on August 1, 2024, with its phased implementation commencing shortly thereafter. It represents a bold attempt to harmonize AI governance across the bloc, prioritizing safety, transparency, and fundamental rights while attempting to foster innovation. Yet, scarcely a year into its rollout, the Act reveals profound limitations in its technology-specific provisions, particularly those governing large language models and general-purpose AI systems. These elements, designed to address risks from high-compute models, now appear misaligned with the explosive market shifts in AI capabilities. In this post, we dissect how provisions like compute-based risk thresholds, transparency mandates, and documentation requirements have been outpaced by advancements in model scale, efficiency, and deployment paradigms, rendering the Act unviable in its current form for regulating the frontier of AI development.
To appreciate the Act's shortcomings, consider its foundational structure. The regulation adopts a risk-based approach, categorizing AI systems into unacceptable-risk, high-risk, limited-risk, and minimal-risk tiers. For general-purpose AI models, which encompass large language models capable of generating text, code, images, or other content, the Act introduces tailored obligations under Chapter V. These models are defined as those trained on vast datasets via self-supervision at scale, exhibiting generality across tasks (Article 3(63)). Providers must maintain technical documentation detailing architecture, training processes, and performance metrics (Article 53(1)(a)), ensure compliance with EU copyright laws during data mining (Article 53(1)(c), referencing Directive (EU) 2019/790), and publish summaries of training data content (Article 53(1)(d)). This framework aims to mitigate harms like bias amplification or misinformation dissemination, especially from generative outputs.
A pivotal technology-specific mechanism is the systemic risk designation for high-impact general-purpose AI models. The Act presumes systemic risk if training compute exceeds 10^25 floating-point operations (FLOPs) (Article 51(2)), a threshold intended to capture models with broad societal influence. Providers of such models face heightened duties: conducting adversarial testing and red-teaming to identify risks (Article 55(1)(a)), reporting serious incidents (Article 55(1)(c)), and cooperating with the AI Office for mitigation (Recital 115). The Commission can adjust this threshold via delegated acts to reflect technological progress (Article 51(3)), acknowledging potential obsolescence. Additional criteria for designation include data quality, user base size, and multimodal capabilities (Annex XIII).
When drafted, 10^25 FLOPs represented the bleeding edge, exemplified by GPT-3's estimated 10^23 FLOPs in 2020. But market realities have surged ahead. By June 2025, over 30 publicly announced models from developers like OpenAI, Google, Meta, and Anthropic exceed this threshold, including GPT-4, Grok-2, Llama 3.1 (405B parameters), and Claude 3.5. Training costs for these models run into the hundreds of millions of dollars, yet proliferation continues at roughly two new above-threshold models per month, a pace already reached in 2024. Compute trends show training FLOPs for frontier models growing 4-5x annually since 2019, far outstripping the Act's static benchmark. Efficiency gains compound this: algorithmic improvements and hardware optimizations mean models achieve comparable performance with fewer FLOPs. For instance, DeepSeek R1 matches 85% of GPT-4o's capabilities using just 30% of the compute, disrupting the cost assumptions embedded in the Act.
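To make the mismatch concrete, here is a minimal back-of-the-envelope sketch using the common ~6 × parameters × tokens approximation for dense-transformer training compute. The Llama 3.1 figures reuse the 405B-parameter, 15T-token numbers cited later in this post, and the 4.5x annual growth factor is simply the midpoint of the trend described above, not an official projection.

```python
# Back-of-the-envelope check of the Act's 10^25 FLOPs presumption.
# Training compute for a dense transformer is commonly approximated
# as ~6 * parameters * training tokens. Figures are illustrative.

THRESHOLD_FLOPS = 1e25  # Article 51(2) presumption of systemic risk


def training_flops(params: float, tokens: float) -> float:
    """Approximate training compute for a dense transformer."""
    return 6 * params * tokens


# Llama 3.1 405B, trained on ~15T tokens (figures cited in this post).
llama = training_flops(405e9, 15e12)
print(f"Llama 3.1 405B: ~{llama:.1e} FLOPs "
      f"(~{llama / THRESHOLD_FLOPS:.1f}x the threshold)")

# If frontier training compute grows ~4.5x per year, a threshold fixed
# at the 2024 frontier becomes routine within a couple of product cycles.
frontier = THRESHOLD_FLOPS
for year in range(2024, 2028):
    print(f"{year}: frontier at ~{frontier:.1e} FLOPs")
    frontier *= 4.5
```

On these rough numbers, a single openly released model already sits several times above the presumption, which is precisely why the threshold no longer isolates a "select few" systems.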
Consider the technical underpinnings of these efficiency leaps. Mixture-of-Experts (MoE) architectures, as seen in models like Mixtral-8x7B, activate only a subset of parameters per inference query, reducing active compute by up to 75% while maintaining or exceeding dense model performance on benchmarks like MMLU. Sparse models, leveraging techniques such as conditional computation or dynamic pruning, further minimize FLOPs; for example, Google's Sparse Transformer variants cut inference costs by 50% without accuracy loss. Quantization, which reduces precision from 32-bit floating point to 8-bit or even 4-bit integers, has become standard, with tools like GPTQ enabling post-training compression that slashes memory use by 75% and speeds up inference 4x on commodity hardware. These innovations mean that models once presumed systemic under the Act's threshold can now be replicated or surpassed with sub-threshold compute, rendering the metric obsolete.
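A toy calculation illustrates the point, assuming a hypothetical 47B-parameter MoE with 8 experts and top-2 routing (roughly the shape of Mixtral-8x7B, though the split between shared and expert parameters is glossed over here):

```python
# Why compute- and size-centric metrics understate capability: a top-k
# Mixture-of-Experts layer only exercises k of its E experts per token,
# and low-bit quantization shrinks the memory footprint without
# changing the parameter count. All numbers are hypothetical.


def moe_active_fraction(num_experts: int, top_k: int) -> float:
    """Fraction of expert parameters touched per token."""
    return top_k / num_experts


def weight_memory_gb(params: float, bits_per_weight: int) -> float:
    """Approximate memory needed to store the weights, in GB."""
    return params * bits_per_weight / 8 / 1e9


params = 47e9  # hypothetical total parameter count

print(f"Experts active per token: {moe_active_fraction(8, 2):.0%}")
print(f"Weights at fp32: {weight_memory_gb(params, 32):.0f} GB")
print(f"Weights at int8: {weight_memory_gb(params, 8):.0f} GB (4x smaller)")
print(f"Weights at int4: {weight_memory_gb(params, 4):.0f} GB (8x smaller)")
```

Only a quarter of the expert parameters are touched per token, and the quantized weights fit on far cheaper hardware; neither effect is visible to a regulator counting training FLOPs.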
This mismatch erodes the Act's viability. The threshold, meant to target a select few high-impact models, now encompasses routine deployments, overwhelming regulators. Providers notify the Commission if thresholds are met (Article 52), triggering evaluations for risks like discriminatory content generation (Recital 110). Yet, with models like Gemini 2.5 Pro and o1-mini surpassing GPT-4 scales via augmented inference compute (where reasoning chains during inference add effective FLOPs without inflating training totals), the FLOPs metric fails to capture emergent behaviors or real-world deployment risks. Open-source models exacerbate this: once released, weights are irrevocable, defying the Act's recall provisions (Recital 164). Llama 3.1's variants, trained on 15T tokens at ~10^25 FLOPs, illustrate how fine-tuning democratizes access, bypassing systemic designations. Federated learning, where models train across distributed devices without centralizing data, further evades compute-centric scrutiny, as seen in Apple's on-device personalization achieving 90% of cloud performance with edge FLOPs orders of magnitude lower.
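The inference-compute blind spot can also be made concrete. The sketch below uses the usual ~2 FLOPs-per-parameter-per-token approximation for a forward pass; the model size, chain-of-thought length, and query volume are hypothetical, chosen only to show that a model trained just under the presumption can accumulate more compute in deployment than it ever used in training.

```python
# Training compute (what Article 51(2) measures) versus cumulative
# inference compute from long reasoning chains (which it never sees).
# Forward-pass cost is approximated as ~2 FLOPs per parameter per token.
# All deployment figures below are hypothetical.


def inference_flops(params: float, tokens_per_query: float, queries: float) -> float:
    return 2 * params * tokens_per_query * queries


training_compute = 8e24          # hypothetical: just under the 1e25 presumption
lifetime_inference = inference_flops(
    params=70e9,                 # hypothetical mid-size reasoning model
    tokens_per_query=20_000,     # long chain-of-thought per query
    queries=5e9,                 # queries served over the deployment lifetime
)

print(f"Training compute:           {training_compute:.1e} FLOPs")
print(f"Lifetime inference compute: {lifetime_inference:.1e} FLOPs")
```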
Transparency requirements fare no better against market dynamics. Providers must summarize training data (Article 53(1)(d)), respecting opt-outs under copyright directives. But datasets now span trillions of tokens, often synthetic or multimodal, complicating provenance. Models like Sora (video) and Imagen 3 (images) generate content from prompts, not just text, yet the Act's focus on text-and-data mining overlooks these modalities (Recital 105). By the end of 2025, multimodal systems dominate: GPT-4o processes text, images, and code at 92% efficiency per compute dollar. The Act mandates marking synthetic content (Article 50(2)), but enforcement lags as tools evolve to evade watermarks. Synthetic data generation, using models like DALL-E 3 to create training corpora, reduces reliance on real-world data and sidesteps transparency altogether, yet the Act lacks provisions for auditing synthetic datasets' quality or bias propagation.
Documentation obligations, requiring architecture details and performance metrics (Annex XI), assume static models. However, agentic AI systems like Auto-GPT, which autonomously chain tasks, introduce a dynamism unaddressed in the text. Agents built on base models that exceed the thresholds amplify risks like unintended escalation, yet fall outside explicit provisions. X posts highlight this: "The EU AI Act is already outdated" (Luiza Jarovsky), echoing industry calls for pauses amid delays in the codes of practice. The Act's voluntary codes (Article 56) were due by May 2025 but faced postponement, with CEOs from Mercedes-Benz and Airbus urging a two-year "clock-stop" given the complexity.

Enforcement challenges compound the obsolescence. The AI Office, operational from August 2025, oversees systemic models but is reportedly understaffed. Member states must designate national authorities by 2 August 2025 (Article 70(2)), yet reports indicate delays in guidelines and standards. Penalties of up to €35 million or 7% of turnover apply (Article 99), but uncertainty persists on GPAI enforcement until 2026.
Delving deeper into technical obsolescence, consider edge AI and distributed computing. The Act's centralized focus ignores how edge deployment, running inference on devices such as smartphones equipped with neural processing units (NPUs), reduces effective compute. Qualcomm's Snapdragon X Elite, with a 45 TOPS NPU, enables on-device models matching cloud performance at roughly 10% of the power, evading data center thresholds. RISC-V architectures, open-source and customizable, further democratize hardware, with SiFive's chips achieving 50% better efficiency than ARM for AI workloads. These shifts mean systemic risks can manifest at the edge, unmonitored by the Act's provisions.
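A crude calculation makes the gap visible. The threshold concerns training compute, but the same arithmetic shows why compute spread across millions of edge devices never registers in a centralized metric; the sketch treats the 45 TOPS rating as sustained throughput, which real workloads never achieve.

```python
# How long a single 45 TOPS on-device NPU would need to accumulate
# 10^25 operations, even running flat out around the clock.

NPU_OPS_PER_SECOND = 45e12             # 45 TOPS rating
THRESHOLD_OPS = 1e25                   # the Act's training-compute presumption
SECONDS_PER_YEAR = 365.25 * 24 * 3600

device_years = THRESHOLD_OPS / (NPU_OPS_PER_SECOND * SECONDS_PER_YEAR)
print(f"Device-years at full utilization: {device_years:,.0f}")   # ~7,000
```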
Global divergences highlight the unviability. While the EU tightens, the US under Trump dismantles barriers, and China surges with models like DeepSeek R1. The Act's extraterritorial reach (Article 2) burdens non-EU firms, while commentary on X ("EU's AI Act stumbles out of the gate") calls for simplification amid competitiveness concerns. Mario Draghi's report deems it "onerous," hampering EU growth. Civil society groups, meanwhile, warn that lobbying by US Big Tech has watered the Act down, undermining whistleblower protections and emergency protocols.
For firms I advise, this means proactive adaptation: reassess models against evolving thresholds, invest in dynamic documentation, and lobby for delegated acts. The Act's intent (safeguarding rights while fostering innovation) is laudable, but its rigidity invites circumvention. In a nutshell, this may be one of those rare situations where the EU should consider throwing the baby out with the bathwater: if the baby is the EU AI Act and the bathwater is the obsolete provisions it has established, the two are by now too entangled to separate.