How to Reduce Non-Determinism and Hallucinations in Large Language Models (LLMs)
In recent months, two separate pieces of research have shed light on two of the most pressing issues in large language models (LLMs): their non-deterministic nature and their tendency to hallucinate. Both phenomena have a direct impact on the reliability, reproducibility, and practical usefulness of these technologies.
On the one hand, Thinking Machines, led by former OpenAI CTO Mira Murati, has published a paper proposing ways to make LLMs return the exact same answer to the exact same prompt every time, effectively defeating non-determinism. On the other hand, OpenAI has released research identifying the root cause of hallucinations and suggesting how they could be significantly reduced.
Let’s break down both findings and why they matter for the future of AI.
The problem of non-determinism in LLMs
Anyone who has used ChatGPT, Claude, or Gemini will have noticed that when you type in the exact same question multiple times, you don’t always get the same response. This is what’s known as non-determinism: the same input does not consistently lead to the same output.
In some areas, such as creative writing, this variability can actually be a feature; it helps generate fresh ideas. But in domains where consistency, auditability, and reproducibility are critical — such as healthcare, education, or scientific research — it becomes a serious limitation.
Why does non-determinism happen?
The most common explanation so far has been a mix of two technical issues:
- Floating-point numbers: computer systems round decimal numbers, which can introduce tiny variations.
- Concurrent execution on GPUs: calculations are performed in parallel, and the order in which they finish can vary, changing the result.
However, Thinking Machines argues that this doesn’t tell the whole story. According to their research, the real culprit is batch size.
When a model processes multiple prompts at once, it groups them into batches (or “carpools”). If the system is busy, the batch is large; if it’s quiet, the batch is small. These variations in batch size subtly change the order of operations inside the model, and because floating-point addition is not associative, a different order can produce a marginally different result. That tiny numerical shift can be enough to change which word is predicted next, and once one word changes, the rest of the response can diverge completely.
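To see why addition order matters at all, note that floating-point arithmetic is not associative: summing the same numbers in a different order can give a slightly different result. The snippet below is a minimal illustration in plain Python, not the model’s actual GPU kernels:

```python
# Floating-point addition is not associative: grouping the same numbers
# differently can change the result, because intermediate values get rounded.
a, b, c = 0.1, 1e16, -1e16

left_grouping = (a + b) + c   # 0.1 is absorbed into 1e16, then cancelled away
right_grouping = a + (b + c)  # the large values cancel first, so 0.1 survives

print(left_grouping)                    # 0.0
print(right_grouping)                   # 0.1
print(left_grouping == right_grouping)  # False
```

An LLM performs billions of such additions inside its reductions (sums, averages, softmax denominators). If the batch size changes how those reductions are split across the GPU, the rounding differences can occasionally be large enough to flip which token wins.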
Thinking Machines’ solution
The key, they suggest, is to keep internal processes consistent regardless of batch size. Their paper outlines three core fixes:
- Batch-invariant kernels: ensure operations are processed in the same order, even at the cost of some speed (see the sketch after this list).
- Consistent mixing: use one stable method of combining operations, independent of workload.
- Ordered attention: slice input text uniformly so the attention mechanism processes sequences in the same order each time.
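To make the first fix concrete, here is a toy sketch, not Thinking Machines’ actual kernels, contrasting a reduction whose grouping depends on current server load with a batch-invariant one that always splits the work into the same fixed-size chunks, so the addition order never varies:

```python
import math

def load_dependent_sum(values, num_workers):
    """Toy reduction whose grouping depends on how busy the server is:
    the values are split across `num_workers` chunks, each chunk is summed,
    then the partial sums are combined. Different worker counts mean a
    different addition order, hence possibly different rounding."""
    chunk = math.ceil(len(values) / num_workers)
    partials = [sum(values[i:i + chunk]) for i in range(0, len(values), chunk)]
    return sum(partials)

def batch_invariant_sum(values, chunk=256):
    """Toy batch-invariant reduction: the chunk size is fixed, so the
    addition order is identical regardless of how many requests share the GPU."""
    partials = [sum(values[i:i + chunk]) for i in range(0, len(values), chunk)]
    return sum(partials)

# The same activations reduced under different simulated server loads:
values = [1e8 + (-1) ** i * 1e-3 * (i % 7) for i in range(10_000)]

print({load_dependent_sum(values, w) for w in (1, 3, 8, 64)})  # typically several distinct values
print({batch_invariant_sum(values) for _ in range(4)})         # always exactly one value
```

The real work is doing this for matrix multiplications, normalisation, and attention on a GPU without giving up too much speed, which is exactly the trade-off the paper accepts.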
The results are striking: in an experiment with the Qwen3-235B model, applying these methods produced 1,000 identical completions to the same prompt, rather than dozens of unique variations.
This matters because determinism makes it possible to audit, debug, and above all, trust model outputs. It also enables stable benchmarks and easier verification, paving the way for reliable applications in mission-critical fields.
The problem of hallucinations in LLMs
The second major limitation of today’s LLMs is hallucination: confidently producing false or misleading answers, such as inventing a historical date or attributing a theory to the wrong scientist.
Why do models hallucinate?
According to OpenAI’s paper, hallucinations aren’t simply bugs; they are baked into the way we train LLMs. There are two key phases where this happens:
- Pre-training: even with a flawless dataset (which is impossible), the objective of predicting the next word naturally produces errors. Generating the right answer is harder than checking whether an answer is right.
- Post-training (reinforcement learning): models are fine-tuned to be more “helpful” and “decisive”. But current metrics reward correct answers while penalising both mistakes and admissions of ignorance. The result? Models learn that it’s better to bluff with a confident but wrong answer than to say “I don’t know”.
This is much like a student taking a multiple-choice exam: leaving a question blank guarantees zero, while guessing gives at least a chance of scoring. LLMs are currently trained with the same incentive structure.
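Putting rough numbers on that incentive (my own back-of-the-envelope illustration, not figures from the paper): under today’s typical grading, a right answer earns 1 point and everything else, wrong or blank, earns 0, so guessing never does worse than abstaining.

```python
def expected_score_today(p_correct, abstain):
    """Expected score under binary grading: 1 point for a correct answer,
    0 for a wrong answer, 0 for leaving the question blank."""
    return 0.0 if abstain else p_correct

# A model that is only 25% sure (say, a blind guess among four options):
print(expected_score_today(0.25, abstain=False))  # 0.25 -> bluffing pays
print(expected_score_today(0.25, abstain=True))   # 0.0  -> honesty scores nothing
```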
OpenAI’s solution: behavioural calibration
The proposed solution is surprisingly simple yet powerful: teach models when not to answer. Instead of forcing a response to every question, set a confidence threshold.
- If the model is, for instance, more than 75% confident, it answers.
- If not, it responds: “I don’t know.”
This technique is known as behavioural calibration. It aligns the model’s stated confidence with its actual accuracy.
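As a rough sketch of the idea, not OpenAI’s implementation, and assuming we can extract some confidence estimate from the model (here a crude proxy built from hypothetical per-token log-probabilities), the decision rule is just a threshold check:

```python
import math

def answer_with_calibration(question, generate, threshold=0.75):
    """Toy behavioural-calibration wrapper.

    `generate` is a hypothetical interface that returns the model's answer
    together with the log-probabilities of the tokens it produced; real APIs
    expose this information in different ways. If the model's estimated
    confidence falls below the threshold, we abstain instead of guessing.
    """
    answer, token_logprobs = generate(question)
    # Crude proxy: geometric mean of the per-token probabilities.
    confidence = math.exp(sum(token_logprobs) / len(token_logprobs))
    if confidence >= threshold:
        return answer
    return "I don't know."
```

The hard part in practice is the confidence estimate itself; per-token probabilities are only a rough proxy, and alternatives such as self-reported confidence or agreement across repeated samples are commonly used.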
Crucially, this requires rethinking benchmarks. Today’s most popular evaluations grade answers as simply right or wrong, with no credit for admitting uncertainty. OpenAI suggests a three-tier scoring system:
- +1 for a correct answer
- 0 for “I don’t know”
- –1 for an incorrect answer
This way, honesty is rewarded and overconfident hallucinations are discouraged.
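A small worked example of how this flips the incentive (my own illustration of the +1 / 0 / –1 scheme above): the expected score for answering is p − (1 − p) = 2p − 1, so answering only pays once the model is more than 50% confident, and a harsher penalty for mistakes pushes that break-even confidence even higher.

```python
def expected_score(p_correct, abstain, wrong_penalty=-1.0):
    """Expected score under the three-tier scheme: +1 for a correct answer,
    0 for saying "I don't know", `wrong_penalty` for an incorrect answer."""
    if abstain:
        return 0.0
    return p_correct * 1.0 + (1.0 - p_correct) * wrong_penalty

for p in (0.25, 0.50, 0.75):
    print(p, expected_score(p, abstain=False), expected_score(p, abstain=True))
# 0.25 -> -0.5 vs 0.0 : abstaining wins
# 0.50 ->  0.0 vs 0.0 : indifferent
# 0.75 -> +0.5 vs 0.0 : answering wins
```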
Signs of progress
Some early users report that GPT-5 already shows signs of this approach: instead of fabricating answers, it sometimes replies, “I don’t know, and I can’t reliably find out.” Even Elon Musk praised this behaviour as an impressive step forward.
The change may seem small, but it has profound implications: a model that admits uncertainty is far more trustworthy than one that invents details.
Two sides of the same coin: reliability and trust
What makes these two breakthroughs especially interesting is how complementary they are:
- Thinking Machines is tackling non-determinism, making outputs consistent and reproducible.
- OpenAI is addressing hallucinations, making outputs more honest and trustworthy.
Together, they target the biggest barrier to wider LLM adoption: confidence. If users — whether researchers, doctors, teachers, or policymakers — can trust that an LLM will both give reproducible answers and know when to admit ignorance, the technology can be deployed with far greater safety.
Conclusion
Large language models have transformed how we work, research, and communicate. But for them to move beyond experimentation and novelty, they need more than just raw power or creativity: they need trustworthiness.
Thinking Machines has shown that non-determinism is not inevitable; with the right adjustments, models can behave consistently. OpenAI has demonstrated that hallucinations are not just random flaws but the direct result of how we train and evaluate models, and that they can be mitigated with behavioural calibration.
Taken together, these advances point towards a future of AI that is more transparent, reproducible, and reliable. If implemented at scale, they could usher in a new era where LLMs become dependable partners in science, education, law, and beyond.