Brief Summary
This video evaluates various AI coding assistants based on their instruction following, accuracy, and code quality. The presenter uses detailed prompts and unit tests to measure performance, and introduces a points-based ranking system to compare the tools. Key highlights include the significant improvement of GitHub Copilot, the strong performance of Open Code, and the continued dominance of Claude Code. The presenter also shares subjective rankings for personal use in the coming month, emphasizing the importance of individual preferences and needs when choosing an AI coding assistant.
- GitHub Copilot showed significant improvement.
- Open Code performed strongly.
- Claude Code remains dominant.
Introduction
The presenter introduces a new evaluation of AI coding assistants, expanding the list to include Codex CLI, Gemini CLI, Kilo Code, Open Code, Aider, and warp.dev, for a total of 17 tools tested. The evaluation focuses on instruction following, accuracy, and code quality, using detailed prompts and unit tests. The presenter uses LLMs as judges, applying strict criteria to assess performance.
Evaluation Methodology
The presenter measures instruction following by providing detailed prompts and assessing whether the AI coding assistants accurately execute them. Code quality is evaluated through a series of unit tests and judged by LLMs against strict criteria. The presenter uses Claude Code to automate much of the testing and grading process. The evaluations involve larger tasks such as fixing bugs in existing projects and implementing new features based on detailed specifications.
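To make the approach concrete, here is a minimal sketch of what such a harness might look like: run the generated project's unit tests, have an LLM judge grade instruction following and code quality, and roll everything up into a point total. The weights, point scale, function names, and `judge` callable are illustrative assumptions, not the presenter's actual setup.

```python
import subprocess

# Hypothetical rubric weights and point scale -- illustrative only, not the
# presenter's actual scoring criteria.
WEIGHTS = {"unit_tests": 0.5, "instruction_following": 0.3, "code_quality": 0.2}
POINT_SCALE = 20_000  # chosen so totals land near the ranges seen in the charts

def unit_test_score(project_dir: str) -> float:
    """Run the project's pytest suite; 1.0 if every test passes, 0.0 otherwise."""
    result = subprocess.run(["python", "-m", "pytest", "-q", project_dir],
                            capture_output=True, text=True)
    return 1.0 if result.returncode == 0 else 0.0

def judge_score(judge, transcript: str, criterion: str) -> float:
    """Ask an LLM judge to grade one criterion from 0-10 and normalize to 0-1.

    `judge` is any callable that takes a prompt string and returns the model's
    text reply (a thin wrapper around whichever LLM API serves as the judge).
    """
    prompt = (f"Grade this coding-agent session for {criterion} on a 0-10 scale "
              f"using strict criteria. Reply with only the number.\n\n{transcript}")
    return max(0.0, min(10.0, float(judge(prompt).strip()))) / 10.0

def total_points(project_dir: str, transcript: str, judge) -> int:
    """Combine unit-test results and judge grades into a single point total."""
    scores = {
        "unit_tests": unit_test_score(project_dir),
        "instruction_following": judge_score(judge, transcript, "instruction following"),
        "code_quality": judge_score(judge, transcript, "code quality"),
    }
    return round(POINT_SCALE * sum(WEIGHTS[k] * s for k, s in scores.items()))
```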
Ranking System
The presenter has moved to a points-based ranking system to compare the AI coding assistants. The points are designed to show the variance between the tools, and the system is updated monthly to incorporate new tests and improvements. This month, Claude Sonnet 4 is the primary model being tested due to its consistent performance. Codex CLI is tested with GPT-4.1, and Gemini CLI with Gemini 2.5 Pro.
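As a rough sketch of how per-tool totals could be rolled into the monthly chart (using only scores cited later in this summary; the data structure and function here are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class Result:
    tool: str    # e.g. "Codex CLI (GPT-4.1)"
    points: int  # total points across the month's tests

def monthly_leaderboard(results: list[Result]) -> list[Result]:
    """Sort tools by total points, highest first, for the month's chart."""
    return sorted(results, key=lambda r: r.points, reverse=True)

# Scores cited in the video (other entries omitted):
cited = [
    Result("Codex CLI (GPT-4.1)", 1_700),
    Result("Gemini CLI (Gemini 2.5 Pro)", 8_780),
    Result("Kilo Code", 15_714),
]
for rank, result in enumerate(monthly_leaderboard(cited), start=1):
    print(f"{rank}. {result.tool}: {result.points:,}")
```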
Top Performers
Claude Code is the top performer, achieving the highest score ever recorded in these evaluations. Open Code takes second place, praised for doing an exceptional job at a low cost. GitHub Copilot shows a remarkable turnaround, moving from nearly unusable to a solid performer and securing third place.
Low Performers
Codex CLI with GPT-4.1 scores poorly at 1,700, while Gemini CLI with Gemini 2.5 Pro scores 8,780, also considered not very good. The presenter finds it difficult to determine whether the poor performance is due to the AI coding agent or the underlying model. Kilo Code, Roo Code, and Cline score almost identically at around 15,714, highlighting the consistency of the tests.
Overall Chart Analysis
The overall chart shows that most AI coding assistants score between 14,000 and 17,000. Claude Code with Sonnet 4 is number one, followed by Open Code with Sonnet 4 and GitHub Copilot with Sonnet 4. Trae performs well, consistently scoring high in the tests. Zed is in line with Claude Code with Opus, although Claude Code with Opus was expected to perform better. Warp.dev is a new entry, performing slightly better than Roo Code. Augment Code's agentic capabilities are noted as needing improvement. Aider surprises with a better-than-expected performance, while Void performs poorly with Sonnet 4.
Kilo Code Discussion
The presenter expresses concerns about Kilo Code's marketing and approach, noting that it seems to combine features from other tools without bringing anything new. The presenter is also skeptical of Kilo Code's claim to offer OpenRouter access without the 5% markup. While acknowledging that forking is part of open source, the presenter is worried about Kilo Code's overall strategy and finds it hard to recommend at this point.
Additional Insights
The presenter highlights GitHub Copilot's significant improvement and emphasizes the use of personal keys during testing. Zed is noted as being expensive to run due to its iterative looping, especially when using personal keys without built-in limits. Augment Code's limitations in handling large projects and its strength in planning and ideation are reiterated. The presenter also compares the current results with those from June, noting the significant jump in GitHub Copilot's performance.
Subjective Rankings
The presenter shares personal subjective rankings for the coming month, placing Claude Code at number one, followed by Open Code and Augment Code. Augment Code is moved down from second to third place due to a decrease in usage, though its context engine remains highly valued. Open Code is praised for its CLI-based interface and ability to switch between different models, making it a potential candidate for automating workflows. The presenter emphasizes that these rankings are based on personal preferences and usage patterns.
Concluding Remarks
The presenter concludes by reiterating the value of Claude Code and expressing excitement about experimenting with Open Code for automating marketing evaluations and other workflows. The presenter also plans to explore local models with Open Code and to test Devstral as a local model for development agents. Viewers are encouraged to share their thoughts and experiences in the comments and to join the Discord community.