No Priors Ep. 124 | With SurgeAI Founder and CEO Edwin Chen

Brief Summary

Edwin Chen, founder and CEO of SurgeAI, discusses the company's focus on high-quality human data for AI training and evaluation. SurgeAI, a bootstrapped company that surpassed $1 billion in revenue, emphasizes the importance of quality over quantity in data and the necessity of human input even as models become more advanced. Chen also touches on the competitive landscape of frontier models, the significance of human evaluation in benchmarking, and SurgeAI's future directions in public research and industry education.

  • SurgeAI focuses on delivering high-quality data for AI model training and evaluation, differentiating itself from competitors by emphasizing technology and quality measurement.
  • Human evaluation is crucial for benchmarking and ensuring models align with desired objectives, as automated metrics can be misleading.
  • Despite advancements in AI, human feedback remains essential for aligning models with real-world objectives and addressing unexpected outputs.

Edwin Chen Introduction

Edwin Chen, the founder and CEO of SurgeAI, is introduced. SurgeAI is a bootstrapped human data startup that has surpassed $1 billion in revenue and serves prominent clients such as Google, OpenAI, and Anthropic. The discussion covers what high-quality human data means, the role of humans as models become more advanced, benchmark hacking, the frontier model landscape, and the importance of environment quality for reinforcement learning.

Overview of SurgeAI

SurgeAI operated somewhat under the radar until recently. The company has surpassed $1 billion in revenue with a team of just over 100 people. The founding thesis was rooted in the power of human data to advance AI, with a strong emphasis on ensuring the highest possible data quality. SurgeAI has been around for five years, starting in 2020. Before SurgeAI, Edwin Chen worked at Google, Facebook, and Twitter, where he repeatedly ran into the problem of obtaining the data needed to train machine learning models.

Why SurgeAI Bootstrapped Instead of Raising Funds

SurgeAI chose to bootstrap because it was profitable from the start and didn't need external funding. Edwin Chen expresses a dislike for the Silicon Valley trend of raising money for the sake of raising it, rather than focusing on building a product that solves a real problem. He believes founders should first focus on building their product and only consider raising money if they encounter financial problems. He also questions the need for early-stage startups to hire many people, suggesting that founders should initially focus on building the product themselves with a small, hands-on engineering team.

Explaining SurgeAI’s Product

SurgeAI's primary product is the data it delivers to companies for training and evaluating their AI models. This data can take various forms, such as coding solutions, unit tests, or preference data. For example, if a frontier lab wants to improve its model's coding abilities, SurgeAI gathers coding data, which may include writing coding solutions, creating unit tests, or determining preferences between different pieces of code. In addition to data, SurgeAI also delivers insights to its customers, including loss patterns and failure modes.
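
As a rough, hypothetical sketch of what a single preference-data record of this kind might look like (the field names are illustrative, not SurgeAI's actual schema):

```python
from dataclasses import dataclass

# Hypothetical sketch of a preference-data record for training a model on human preferences.
# Field names are illustrative only, not SurgeAI's actual schema.
@dataclass
class PreferenceExample:
    prompt: str        # the task given to the model
    response_a: str    # candidate completion A
    response_b: str    # candidate completion B
    preferred: str     # "a" or "b", chosen by a human annotator
    rationale: str     # the annotator's notes on why (correctness, style, failure modes)

example = PreferenceExample(
    prompt="Write a Python function that merges two sorted lists.",
    response_a="def merge(a, b): return sorted(a + b)",
    response_b="def merge(a, b): ...",  # an incomplete attempt
    preferred="a",
    rationale="Response A is correct and runnable; B is a stub.",
)
print(example.preferred, "-", example.rationale)
```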

Differentiating SurgeAI from Competitors

SurgeAI differentiates itself from competitors by not being just a body shop. Many other companies in the space simply supply warm bodies and lack any real technology. SurgeAI believes that quality matters most, and it has built a platform with technology to measure the quality of the data its workers and annotators generate. The company emphasizes that in generative AI there is effectively no ceiling on the quality that can be achieved.

Measuring the Quality of SurgeAI’s Output

SurgeAI measures the quality of its output through a combination of human evaluation and model-based evaluation. The company gathers various signals about its annotators, the work they perform, and their activity on the site, feeding this data into machine learning algorithms. This approach is analogous to how Google Search or YouTube evaluates the quality of web pages or videos, using a multitude of signals to determine if content is high quality or spammy.
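
The sketch below illustrates the general idea rather than SurgeAI's actual system: a handful of made-up per-task signals are fed into a simple classifier that estimates whether a piece of annotator work is high quality.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative only: combine per-task signals about an annotator's work into a
# quality score, loosely analogous to how a search engine combines many signals
# into a spam/quality classifier. Signal names and values are hypothetical.
signal_names = ["time_on_task_sec", "agreement_with_gold", "chars_edited", "past_accuracy"]

# Toy training data: rows are completed tasks, columns are the signals above;
# labels mark whether a reviewer judged the work high quality (1) or not (0).
X = np.array([
    [420, 0.95, 310, 0.92],
    [35,  0.40,  12, 0.61],
    [610, 0.88, 240, 0.97],
    [20,  0.35,   5, 0.55],
])
y = np.array([1, 0, 1, 0])

model = LogisticRegression().fit(X, y)

new_task_signals = np.array([[300, 0.90, 180, 0.85]])
print("estimated quality:", model.predict_proba(new_task_signals)[0, 1])
```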

Role of Scalable Oversight at SurgeAI

SurgeAI conducts internal research on AI alignment, specifically in the field of scalable oversight. Scalable oversight involves models and humans working together to produce data that is better than either could achieve alone. For instance, when writing an SAT-style essay, a model might generate a basic draft, which a human then edits and refines. SurgeAI focuses on building the right interfaces and tools to combine people with AI effectively and make them more efficient.
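
A minimal sketch of that model-drafts, human-refines loop, with placeholder functions standing in for the model call and the annotation interface (neither is a real API):

```python
# Toy sketch of scalable oversight: the model does the heavy initial lifting,
# and a human editor refines the result. Both functions are stand-ins.

def generate_draft(prompt: str) -> str:
    """Placeholder for a model call that produces a first draft."""
    return f"[model draft for: {prompt}]"

def collect_human_edits(draft: str) -> str:
    """Placeholder for an annotation interface where a human refines the draft."""
    return draft + " [refined by a human editor]"

def produce_example(prompt: str) -> dict:
    draft = generate_draft(prompt)       # fast, cheap first pass
    final = collect_human_edits(draft)   # human corrects, polishes, fact-checks
    return {"prompt": prompt, "draft": draft, "final": final}

print(produce_example("Write a short persuasive essay about public libraries."))
```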

Challenges of Building Rich RL Environments

Building rich reinforcement learning (RL) environments is complex, and they cannot simply be generated synthetically. For example, simulating the world of a salesperson requires modeling interactions with Salesforce, Gmail, Slack, Excel sheets, Google Docs, and PowerPoint presentations. These environments need to simulate real-world events and stay consistent across all of their elements. Generating the underlying data, such as thousands of Slack messages and hundreds of emails, requires significant thought and sophistication to keep everything realistic and congruent. There is no ceiling on the realism or complexity that is useful here, since more richness lets models learn more effectively.
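
To make the shape of such an environment concrete, here is a toy, Gym-style sketch with mocked enterprise tools and a sparse terminal reward. The tool behaviors and reward are invented for illustration and are far simpler than a realistic environment would need to be.

```python
# Toy "sales agent" environment with mocked enterprise tools. A realistic
# environment would need consistent, richly simulated Slack threads, inboxes,
# and CRM state; everything here is a minimal stand-in.

class SalesAgentEnv:
    def __init__(self):
        self.crm = {"Acme Corp": {"stage": "lead"}}           # stand-in for Salesforce records
        self.inbox = ["Acme Corp asked for a pricing sheet"]  # stand-in for Gmail
        self.slack = []                                       # stand-in for Slack messages
        self.done = False

    def reset(self):
        self.__init__()
        return {"crm": self.crm, "inbox": list(self.inbox), "slack": list(self.slack)}

    def step(self, action: dict):
        """Example action: {"tool": "send_email", "to": "Acme Corp", "body": "..."}"""
        reward = 0.0
        if action["tool"] == "send_email":
            self.inbox.append(f"Sent to {action['to']}: {action['body']}")
        elif action["tool"] == "update_crm" and action.get("stage") == "closed_won":
            self.crm[action["account"]]["stage"] = "closed_won"
            reward, self.done = 1.0, True  # sparse terminal reward for closing the deal
        obs = {"crm": self.crm, "inbox": list(self.inbox), "slack": list(self.slack)}
        return obs, reward, self.done, {}

env = SalesAgentEnv()
obs = env.reset()
obs, reward, done, info = env.step(
    {"tool": "update_crm", "account": "Acme Corp", "stage": "closed_won"}
)
print(reward, done)
```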

Predicting Future Needs for Training AI Models

The demand for training AI models will likely encompass RL environments, expert reasoning traces, and other kinds of data. RL environments alone may not suffice: their trajectories are often very long and rich, and a single terminal reward may not capture all the work that goes into solving a complicated goal. A combination of different data types will therefore likely be necessary.
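
As a hedged illustration of what richer supervision might look like, a training record could pair a long trajectory with per-step expert notes in addition to the terminal reward; the fields below are invented, not any lab's actual format.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical record combining a sparse terminal reward with denser per-step
# supervision (expert notes and step-level ratings). Field names are illustrative.
@dataclass
class TrajectoryExample:
    task: str
    steps: List[dict] = field(default_factory=list)  # each: {"action", "expert_note", "step_rating"}
    terminal_reward: float = 0.0                     # did the agent ultimately succeed?

example = TrajectoryExample(
    task="Close the Acme Corp deal",
    steps=[
        {"action": "send_email", "expert_note": "Answers the pricing question directly", "step_rating": 1},
        {"action": "update_crm", "expert_note": "Correctly records the new stage", "step_rating": 1},
    ],
    terminal_reward=1.0,
)
print(len(example.steps), "supervised steps, terminal reward", example.terminal_reward)
```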

Role of Humans in Data Generation

Human feedback will not stop being valuable, even as AI advances. Synthetic data is useful for supplementing human effort, but it often falls short on quality: many customers find that only a small percentage of synthetic data is actually useful, and high-quality human data can be worth more. Models also think differently from humans, so external human input is needed to align them with real-world objectives. For example, some frontier models produce random characters in their responses, a lack of self-consistency that requires human correction.
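
A toy sketch of the filtering this implies, where only synthetic examples clearing a high quality bar are kept; the scores and threshold are made up.

```python
# Illustrative only: keep just the fraction of synthetic examples that clears a
# high quality bar, mirroring the observation that only a small percentage of
# synthetic data tends to be useful.

synthetic_batch = [
    {"text": "A well-grounded, self-consistent answer.",  "quality_score": 0.91},
    {"text": "An answer padded with rambling filler.",    "quality_score": 0.42},
    {"text": "An answer ending in random characters @@#", "quality_score": 0.08},
]

QUALITY_THRESHOLD = 0.85  # hypothetical bar
kept = [ex for ex in synthetic_batch if ex["quality_score"] >= QUALITY_THRESHOLD]
print(f"kept {len(kept)} of {len(synthetic_batch)} synthetic examples")
```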

Importance of Human Evaluation for Quality Data

Academic and industry benchmarks are easily hacked and may not accurately gauge performance. The alternative is proper human evaluation, where evaluators take the time to fact-check responses, verify that instructions were followed, and assess writing quality. This matters because training models against superficial metrics is akin to training them on clickbait, which harms model progress.
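
One hypothetical way to structure such an evaluation is a simple rubric record like the one below; the dimensions mirror those mentioned above, but the exact fields and scale are invented.

```python
# Hypothetical rubric for one human evaluation of a model response.
# Fields and the 1-5 scale are illustrative, not an established standard.

evaluation = {
    "prompt": "Summarize this article in three bullet points.",
    "response_id": "resp_001",
    "factuality": 4,             # were the claims checked and accurate?
    "instruction_following": 5,  # did it actually produce three bullet points?
    "writing_quality": 3,        # clarity, structure, tone
    "notes": "Accurate, but the third bullet is wordy.",
}

overall = sum(evaluation[k] for k in ("factuality", "instruction_following", "writing_quality")) / 3
print(f"overall score: {overall:.2f}")
```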

SurgeAI’s Work Toward Standardization of Human Evals

SurgeAI is actively involved in helping frontier labs understand their models through constant evaluation and by surfacing areas for improvement. While much of this work is currently internal, SurgeAI aims to externalize these efforts to educate the broader landscape on the capabilities of different models, including which models excel at coding, which follow instructions best, and which are prone to hallucination.

What the Meta/ScaleAI Deal Means for SurgeAI

The Meta/ScaleAI deal has been beneficial for SurgeAI, as it reinforces the importance of high-quality data. Some legacy teams using lower-quality data solutions may have had negative experiences with human data, leading them to avoid it. By promoting the use of high-quality data, SurgeAI believes it can improve model progress across the industry.

Edwin’s Underdog Pick to Catch Up to Big AI Companies

If Edwin Chen were to bet on an underdog catching up to OpenAI, Anthropic, and DeepMind, he would choose xAI. He believes xAI is hungry and mission-oriented, which gives it unique advantages.

The Future Frontier Model Landscape

More frontier models will open up over time, and these models will not be commodities. Each has its own focus and unique strengths: for example, GitHub Copilot excels at coding, OpenAI has a consumer focus because of ChatGPT, and Grok operates under a different set of principles. This diversity gives each model provider different strengths, and users will switch between models depending on their specific needs.

Future Directions for SurgeAI

SurgeAI is excited about its public research push, especially given that many frontier labs are no longer publishing their research. This lack of transparency can lead to negative incentives and concerning trends in the industry. For example, researchers may be pressured to focus on metrics that improve leaderboard rankings, even if it means sacrificing factuality or instruction following. SurgeAI aims to educate the industry and steer it in a better direction through its own research and publications.

What Does High Quality Data Mean?

High-quality data goes beyond simply meeting basic requirements. For example, training a model to write a poem about the moon requires more than just hiring people to write eight-line poems that contain the word "moon." It requires capturing the richness and subjectivity of poetry, recognizing that there are thousands of ways to write a poem about the moon. High-quality data embraces human intelligence and creativity, allowing models to learn deeper patterns about language and the world.
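
The toy check below illustrates the gap: a naive validator that only verifies the surface requirements will happily accept a lifeless poem.

```python
# Illustration of why "meets the spec" is not the same as "high quality": a naive
# checker that only verifies surface requirements accepts a repetitive, craftless poem.

def passes_naive_spec(poem: str) -> bool:
    lines = poem.strip().splitlines()
    return len(lines) == 8 and "moon" in poem.lower()

dull_poem = "\n".join(["The moon is in the sky tonight."] * 8)  # eight identical lines
print(passes_naive_spec(dull_poem))  # True, even though the poem has no real craft
```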

Conclusion

The core topics have been covered, and Edwin Chen is thanked for his participation and congratulated on SurgeAI's progress.
