Elon’s xAI STUNS With Grok 4 ‘Smartest AI In The World’ (full replay)

TLDR;

This video introduces Grok 4, the latest AI model from xAI, highlighting its advanced reasoning capabilities, superhuman academic performance, and real-world applications. It also discusses upcoming improvements in multimodal understanding and video generation, as well as the release of Grok 4 through the API for developers.

Grok 4 achieves near-perfect scores on graduate-level exams across all disciplines, surpassing most human experts.
The model utilizes advanced tools and multi-agent collaboration to solve complex problems and predict market outcomes.
xAI is focusing on improving Grok's multimodal capabilities, coding abilities, and video generation to create more versatile and practical AI applications.

Introduction to Grok 4 [0:00]

The video begins with an introduction to Grok 4, the latest AI model from xAI, positioned as a groundbreaking creation set to redefine the future. It emphasizes the rapid advancement of artificial intelligence, comparing its learning to that of a human but at a vastly accelerated pace. Grok 4 is described as capable of achieving perfect scores on the SAT and near-perfect results on graduate-level exams like the GRE across various disciplines, even with unseen questions. The model's reasoning capabilities are highlighted as superhuman, challenging the notion that AI cannot reason effectively.

Training and Development of Grok 4 [2:17]

The discussion shifts to the training and development process of Grok 4, noting a tenfold increase in training compared to its predecessors. A significant amount of compute is dedicated to reasoning and reinforcement learning (RL). Grok 2 is compared to a high school student, illustrating the rapid progress in the field. The development of Colossus, a supercomputer with 100,000 H100 GPUs, is mentioned as crucial for pre-training the model. The ability to collect verifiable outcome rewards enables the model to think from first principles and correct its own mistakes, leading to improved reasoning.
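As a rough illustration of what a "verifiable outcome reward" can look like, the sketch below scores a sampled answer by comparing it with a known reference answer. This is not xAI's training code; the function names, the string normalization, and the 0/1 reward values are all assumptions chosen for illustration.

```python
def normalize(answer: str) -> str:
    """Lowercase the answer and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(answer.strip().lower().split())


def outcome_reward(model_answer: str, reference_answer: str) -> float:
    """Hypothetical verifiable reward: 1.0 if the final answer matches the reference, else 0.0."""
    return 1.0 if normalize(model_answer) == normalize(reference_answer) else 0.0


def mean_reward(samples: list[str], reference_answer: str) -> float:
    """Average reward over several sampled answers to the same problem."""
    if not samples:
        raise ValueError("need at least one sampled answer")
    return sum(outcome_reward(s, reference_answer) for s in samples) / len(samples)


if __name__ == "__main__":
    # Four sampled answers to one problem whose reference answer is "42".
    print(mean_reward(["42", "41", " 42 ", "7"], "42"))
```

The point of such a reward is that it can be checked automatically, so only the final outcome is scored rather than the style of the reasoning.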

Humanity's Last Exam Benchmark [4:08]

The presenters discuss the Humanity's Last Exam (HLE) benchmark, a challenging test curated by subject matter experts, comprising 2,500 problems across mathematics, natural sciences, engineering, and humanities. Most models initially achieved single-digit accuracy on this benchmark. Examples of the complex problems include those related to natural transformations in category theory, electrocyclic reactions in organic chemistry, and distinguishing between closed and open syllables from a Hebrew source text. Grok 4 is described as performing at a postgraduate level in all subjects, surpassing most PhDs in academic questions.

Capabilities and Limitations of Grok 4 [7:02]

The discussion continues with Grok 4's capabilities, noting that it solved a quarter of the HLE problems without tools. Adding tool use to the model significantly improved its performance. While the current tools are primitive compared to those used at companies like Tesla and SpaceX, plans are in place to provide Grok with more powerful tools, including accurate physics simulators. The ultimate goal is to combine Grok with humanoid robots like Optimus to interact with the real world and test hypotheses. The importance of AI safety and truth-seeking is emphasized, with the aim of instilling good values in the AI.

Compute, Tools, and Real-World Interaction [10:51]

The conversation addresses the requirements for advancing AI, including compute power, the right tools, and the ability to interact with the physical world. The potential for AI to drive economic growth is discussed, with projections of economies thousands or millions of times larger than the current one. The discussion touches on the Kardashev scale and the potential for civilization to reach Kardashev 1 and 2 levels. The need to solve the data bottleneck and find challenging RL problems is highlighted, as current test questions are becoming trivial for AI. Reality is presented as the ultimate judge of AI's reasoning abilities, with the ability to invent new technologies and improve existing designs being key.

Multi-Agent Collaboration and Grok 4 Versions [14:47]

The presenters introduce the concept of multi-agent collaboration, where multiple AI agents work in parallel to solve problems. This approach has enabled Grok 4 to solve over 50% of the text-only subset of the HLE problems. Grok 4 is available in two versions: a single-agent version and Grok 4 Heavy, which utilizes multiple agents. The agents work independently, compare their work, and share solutions to arrive at the best result.
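The "work independently, then compare" pattern described above can be sketched as follows. This is only an illustrative stand-in: the video does not specify how Grok 4 Heavy's agents reconcile their answers, so the majority vote used here is an assumption, and the `Agent` type, `solve_with_agents` function, and stub agents are hypothetical names rather than part of any xAI API.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

# Hypothetical agent interface: takes a question, returns a candidate answer.
Agent = Callable[[str], str]


def solve_with_agents(question: str, agents: List[Agent]) -> str:
    """Run every agent in parallel, then pick the most common answer.

    Majority voting is an assumed reconciliation step, not the method
    described in the video. Ties go to the earliest answer returned.
    """
    if not agents:
        raise ValueError("need at least one agent")
    with ThreadPoolExecutor(max_workers=len(agents)) as pool:
        # Each agent works independently on the same question.
        answers = list(pool.map(lambda agent: agent(question), agents))
    # Compare the candidate answers and keep the one most agents agree on.
    return Counter(answers).most_common(1)[0][0]


if __name__ == "__main__":
    # Stub agents standing in for real model calls.
    stub_agents: List[Agent] = [lambda q: "4", lambda q: "5", lambda q: "4"]
    print(solve_with_agents("What is 2 + 2?", stub_agents))
```

A real system would likely let agents exchange intermediate reasoning rather than only final answers; voting is simply the smallest example that shows independent work followed by a comparison step.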

Demonstrations of Grok 4 in Action [16:46]

The video includes live demonstrations of Grok 4 solving a math problem from the HLE and predicting the World Series odds using Polymarket. Another demonstration shows Grok 4 generating a visualization of two black holes colliding, referencing undergraduate texts and real-world data. The model's ability to understand and answer complex questions is showcased, such as finding the xAI employee with the weirdest profile photo. Grok 4's ability to create a timeline of X posts detailing changes in scores over time is also demonstrated.

Performance on Multimodal Subsets and Other Benchmarks [23:13]

The discussion addresses Grok 4's performance on multimodal subsets, noting a slight dip in numbers due to weaker image understanding capabilities. Improvements in this area are expected with the upcoming version 7 of the foundation model. The model's performance on other reasoning benchmarks, such as GPQA, AIME 25, and various coding and math exams, is highlighted, often showing a significant lead over other models. The goal is for Grok to eventually get every answer right and provide explanations for ambiguous questions.

Voice Mode and API Release [27:20]

The video introduces improvements to Grok's voice mode, including reduced latency and new voices with exceptional naturalness and prosody. A demonstration of the new British voice, Eve, is presented, showcasing its ability to engage in natural conversations and perform tasks like singing an opera about Diet Coke. The release of Grok 4 through the API is announced, allowing developers to build various applications.

API Performance and Real-World Applications [30:51]

The performance of the Grok 4 API is discussed, highlighting its success on the ARC-AGI v2 benchmark, where it achieved 15.8% accuracy, double that of the second-place model. Its intelligence per dollar is also emphasized. Andon Labs shares their experience using Grok 4 on Vending-Bench, an AI simulation of a business scenario involving vending machines. Grok 4 ranked number one, doubling the net worth compared to other models, demonstrating its ability to formulate and adhere to a strategy over long periods.

Early Adopters and Future Developments [34:50]

Early adopters of the Grok 4 API, such as the Arc Institute, are using it to automate research flows and analyze millions of experiment logs. Grok 4 is also being used in the financial sector and is available on hyperscalers. A video game designer, Denny, used Grok 4 to create a first-person shooter game in four hours, highlighting its ability to automate asset sourcing. Future developments include improving video understanding and tool use to enable Grok to play and assess video games. The training of a video model with over 100,000 GB200s is planned, with expectations of spectacular video generation capabilities.

Future Focus: Coding, Multimodal Capabilities, and Video Generation [38:55]

The video concludes by recapping the introduction of Grok 4 and its advanced reasoning capabilities. The focus for future development includes creating fast and smart coding models, improving multimodal capabilities, and enhancing video generation. The goal is to enable Grok to hear and see the world like humans, unlocking new applications. The presenters envision a future with infinite content on the X platform, where users can intervene and create their own adventures.