Brief Summary
This video tests the DeepSeek R1 model, a large language model (LLM) with 671 billion parameters, against a comprehensive rubric. The model demonstrates impressive capabilities in coding, logic, and reasoning, while also exhibiting built-in censorship on certain topics. The video highlights the power of test-time compute and the model's human-like thinking process.
- The model successfully generates code for a Snake game and a Tetris game in Python, showcasing its ability to plan and execute complex tasks.
- It demonstrates strong reasoning skills by solving logic puzzles and interpreting instructions correctly.
- The model exhibits censorship, refusing to answer questions about sensitive topics like Tiananmen Square and Taiwan's independence.
Model Testing: DeepSeek R1
The video begins by showcasing the DeepSeek R1 model running on Vultr's cloud infrastructure. The model is given a simple warm-up task: counting the number of "R"s in the word "strawberry." It displays a human-like internal monologue, thinking out loud and reasoning back and forth before arriving at the correct answer of three.
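The check itself is trivial to verify in Python, which is what makes it a useful probe of the model's reasoning rather than its knowledge:

```python
# Verify the "strawberry" check the video opens with.
word = "strawberry"
print(word.count("r"))  # 3
```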
Coding: Snake Game
The video then tests the model's coding abilities by asking it to write the Snake game in Python. The model first outlines the steps it will take, demonstrating a thoughtful approach to problem-solving. It then generates the code, which works flawlessly on the first try. The model even provides instructions on how to play the game, showcasing its ability to provide comprehensive output.
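The video shows the generated game running rather than walking through its source, so the sketch below is only a rough illustration of what a minimal grid-based Snake loop looks like (assuming the pygame package is installed); it is not the model's actual output.

```python
import random
import pygame

CELL, COLS, ROWS = 20, 30, 20           # grid geometry (illustrative values)

pygame.init()
screen = pygame.display.set_mode((CELL * COLS, CELL * ROWS))
clock = pygame.time.Clock()

snake = [(COLS // 2, ROWS // 2)]        # list of (col, row) cells, head first
direction = (1, 0)                      # start moving right
food = (random.randrange(COLS), random.randrange(ROWS))

running = True
while running:
    for event in pygame.event.get():
        if event.type == pygame.QUIT:
            running = False
        elif event.type == pygame.KEYDOWN:
            turns = {pygame.K_UP: (0, -1), pygame.K_DOWN: (0, 1),
                     pygame.K_LEFT: (-1, 0), pygame.K_RIGHT: (1, 0)}
            if event.key in turns:
                direction = turns[event.key]

    # Advance the head; wrap around the edges to keep the sketch short.
    head = ((snake[0][0] + direction[0]) % COLS,
            (snake[0][1] + direction[1]) % ROWS)
    if head in snake:                   # ran into itself: game over
        running = False
    snake.insert(0, head)
    if head == food:                    # grow and respawn the food
        food = (random.randrange(COLS), random.randrange(ROWS))
    else:
        snake.pop()                     # otherwise just move forward

    screen.fill((0, 0, 0))
    for col, row in snake:
        pygame.draw.rect(screen, (0, 200, 0), (col * CELL, row * CELL, CELL, CELL))
    pygame.draw.rect(screen, (200, 0, 0), (food[0] * CELL, food[1] * CELL, CELL, CELL))
    pygame.display.flip()
    clock.tick(10)                      # 10 moves per second

pygame.quit()
```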
Coding: Tetris Game
The video then presents a more challenging coding task: writing the Tetris game in Python. The model again thinks through the problem, weighing different approaches and potential pitfalls. It takes several minutes to generate the code, highlighting the time required for complex tasks. The final output is 179 lines of code that successfully implement the Tetris game, showcasing the model's impressive capabilities.
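The model's 179-line program is not reproduced in the video, but as a hedged illustration, the core data structures such a program needs (tetromino cells, a playfield grid, rotation, and collision checks) might look roughly like this:

```python
# Tetromino shapes as (col, row) cell offsets (illustrative subset of the seven pieces).
SHAPES = {
    "I": [(0, 0), (1, 0), (2, 0), (3, 0)],
    "O": [(0, 0), (1, 0), (0, 1), (1, 1)],
    "T": [(0, 0), (1, 0), (2, 0), (1, 1)],
    "L": [(0, 0), (0, 1), (0, 2), (1, 2)],
}

COLS, ROWS = 10, 20                       # standard playfield size
grid = [[0] * COLS for _ in range(ROWS)]  # 0 = empty, 1 = locked block


def rotate(cells):
    """Rotate a piece 90 degrees by mapping (col, row) -> (row, -col), then shift back to the origin."""
    rotated = [(row, -col) for col, row in cells]
    min_c = min(c for c, _ in rotated)
    min_r = min(r for _, r in rotated)
    return [(c - min_c, r - min_r) for c, r in rotated]


def collides(cells, offset_col, offset_row):
    """True if the piece at the given offset leaves the playfield or overlaps a locked block."""
    for col, row in cells:
        c, r = col + offset_col, row + offset_row
        if c < 0 or c >= COLS or r < 0 or r >= ROWS or grid[r][c]:
            return True
    return False


# A falling piece drops one row per tick until it would collide, then locks into the grid.
piece, col, row = rotate(SHAPES["T"]), 4, 0
while not collides(piece, col, row + 1):
    row += 1
for c, r in piece:
    grid[r + row][c + col] = 1
print(f"Rotated T piece locked with its top at row {row}")
```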
Logic and Reasoning
The video moves on to testing the model's logic and reasoning skills with a problem about whether an envelope satisfies postal size restrictions. The model interprets the instructions correctly, converts units, and considers different scenarios before arriving at the correct answer.
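The exact dimensions and postal limits from the prompt are not restated in the video summary, so the numbers below are hypothetical placeholders; the sketch only illustrates the unit conversion and range check the reasoning requires.

```python
MM_PER_INCH = 25.4

# Hypothetical postal limits in inches (placeholders, not the video's actual figures).
MIN_H, MAX_H = 3.5, 6.125     # allowed height range
MIN_L, MAX_L = 5.0, 11.5      # allowed length range

# Hypothetical envelope measured in millimetres.
height_mm, length_mm = 200, 275
height_in = height_mm / MM_PER_INCH   # ~7.87 in
length_in = length_mm / MM_PER_INCH   # ~10.83 in

mailable = MIN_H <= height_in <= MAX_H and MIN_L <= length_in <= MAX_L
print(f"height {height_in:.2f} in, length {length_in:.2f} in -> mailable: {mailable}")
```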
Trick Question
The video presents a trick question: "How many words are in your response to this prompt?" The model recognizes the self-referential nature of the question and attempts to count the words in its response. It acknowledges the difficulty of producing an accurate count, then generates a final output that correctly states the number of words it contains.
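The difficulty is that the stated count has to agree with the sentence that states it. A small fixed-point search illustrates the idea; the phrasing and function name are illustrative, not the model's actual approach.

```python
def word_count_response(max_iterations=20):
    """Search for a response whose stated word count matches its actual word count."""
    claimed = 1
    for _ in range(max_iterations):
        response = f"This response contains exactly {claimed} words."
        actual = len(response.split())
        if actual == claimed:
            return response
        claimed = actual          # try again with the real count
    return None

print(word_count_response())      # -> "This response contains exactly 6 words."
```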
Killer Problem
The video presents a logic puzzle about killers in a room. The model demonstrates its ability to think through the problem step-by-step, considering different interpretations and potential ambiguities. It arrives at the correct answer, showcasing its ability to handle complex reasoning tasks.
Marble Puzzle
The video presents a simple puzzle about a marble placed in a glass cup. The model solves it correctly, demonstrating its ability to understand the instructions and reason through the scenario they describe.
Number Comparison
The video presents a straightforward task: comparing the numbers 9.11 and 9.9. The model correctly concludes that 9.9 is the larger number.
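The comparison is easy to verify numerically, which is what makes it a useful sanity check:

```python
a, b = 9.11, 9.9
print(b > a)       # True -- 9.9 equals 9.90, which is greater than 9.11
print(max(a, b))   # 9.9
```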
Censorship
The video explores the censorship built into the DeepSeek R1 model by asking about sensitive topics such as Tiananmen Square and Taiwan's independence. The model refuses to answer these questions. The video also notes that US models are censored in certain areas of their own.
Sentence Generation
The final test asks the model to generate 10 sentences that each end with the word "apple." The model completes the task successfully, showing that it can follow precise instructions while generating creative text.
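The constraint is simple to verify programmatically; the sentences below are placeholders standing in for the model's output, and the check is the point of the sketch.

```python
# Placeholder sentences standing in for the model's output.
sentences = [
    "For a quick snack she reached for an apple.",
    "The teacher's desk held a single shiny apple.",
]

for s in sentences:
    ends_correctly = s.rstrip(".!?").strip().lower().endswith("apple")
    print(f"{ends_correctly}: {s}")
```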
The video closes with a thank-you to Vultr for providing the GPU resources needed to run the DeepSeek R1 model, and encourages viewers to check out Vultr and use the code "bman300" to receive $300 in free credits.