Canva and the Thundering Herd

TL;DR

This podcast episode features an interview with Simon Newton, Head of Platforms at Canva, discussing their first public incident report. The incident involved a confluence of factors, including network issues with their edge provider, a long origin fetch, and a performance issue related to a metrics library. The discussion covers how Canva responded to the incident, the processes and tools they use, and the lessons they learned.

  • Canva's incident involved multiple contributing factors, including network issues, a long origin fetch, and a metrics library performance issue.
  • Canva's incident response includes automated Slack room creation, Zoom calls with disabled chat, and a dedicated incident coordinator team.
  • Key learnings from the incident include the need to practice controls on edge providers and the importance of user-friendly error messages.

Introduction [0:00]

The host introduces season 2 of The VOID podcast, now available in video format on YouTube. They mention the podcast has a sponsor, Uptime Labs, which specializes in immersive incident response training. The episode features Simon Newton, Head of Platforms at Canva, discussing their first public incident report.

Guest Introduction [1:17]

Simon Newton introduces himself as the Head of Platforms at Canva, where he manages the teams responsible for the edge and gateway, as well as cloud resources. He explains that Canva is a visual communication company that provides tools for creating visual content, including whiteboards, presentations, and social media posts.

Motivation for Publishing a Public Incident Report [3:09]

Canva has been publishing incident reports internally since 2017. The decision to publish a public incident report reflects the company's evolution and increasing adoption by enterprises, which have different customer requirements and expectations. Publishing the report demonstrates Canva's commitment to transparency and benefits the broader industry by sharing learnings from failures.

Summary of the Incident [5:21]

Simon provides a summary of the incident, explaining that it was caused by a combination of factors. Canva's editor is a single-page app that is deployed multiple times a day. During a deployment, network issues with their edge provider occurred, but the provider's automated mitigation system failed due to stale configuration. The result was a very long origin fetch for a JavaScript asset: the edge coalesced every waiting request behind that single fetch, so when it finally completed, a large number of clients loaded the editor at the same moment and fired their API calls simultaneously, a thundering herd that overwhelmed the gateways.
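
The episode describes this mechanism in prose only. As a rough illustration, here is a minimal Go sketch of edge-style request coalescing, where one slow origin fetch holds every concurrent request for the same asset and then releases them all at once; every type and name here is invented, not Canva's or the provider's actual code.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// fetchResult holds the outcome of the single in-flight origin fetch;
// waiters block on done until it closes.
type fetchResult struct {
	done chan struct{}
	body string
}

type coalescer struct {
	mu       sync.Mutex
	inflight map[string]*fetchResult
}

func (c *coalescer) fetch(key string, origin func() string) string {
	c.mu.Lock()
	if r, ok := c.inflight[key]; ok {
		c.mu.Unlock()
		<-r.done // block with no timeout until the in-flight fetch finishes
		return r.body
	}
	r := &fetchResult{done: make(chan struct{})}
	c.inflight[key] = r
	c.mu.Unlock()

	r.body = origin() // the one slow origin fetch holds every waiter...
	close(r.done)     // ...then releases all of them at the same instant

	c.mu.Lock()
	delete(c.inflight, key)
	c.mu.Unlock()
	return r.body
}

func main() {
	c := &coalescer{inflight: map[string]*fetchResult{}}
	var wg sync.WaitGroup
	for i := 0; i < 1000; i++ { // 1,000 clients asking for the new JS bundle
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.fetch("editor.js", func() string {
				time.Sleep(2 * time.Second) // stand-in for a multi-minute fetch
				return "bundle"
			})
			// In the real incident, this is the moment each client booted the
			// editor and fired its API calls: all at once, into the gateways.
		}()
	}
	wg.Wait()
	fmt.Println("all clients released simultaneously")
}
```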

Discovery and Escalation [10:33]

The first sign of the incident was a drop in search traffic about 10 minutes after the deployment. As the search team investigated, other service owners reported similar issues. When the origin fetch completed, the gateway and edge teams were paged for massive gateway failures. The incident was then upgraded to a SEV0, and the incident coordinator was activated.

Incident Response Process [11:41]

Canva's incident response process involves service on-call teams triaging alerts and opening incidents. For SEV0 and SEV1 incidents, the incident coordinator rotation is activated. A representative from customer support is also assigned to collect incoming user reports and feed status updates back to the support organization. Post-incident, reports are written, and AI is used to extract common themes.

Incident Coordinator Role [13:50]

The incident coordinator (IC) team is a dedicated reliability function at Canva. When not on call, they focus on improving the incident process, looking for patterns in incidents, and planning for large launch events. They handle capacity planning, risk assessment, and mitigation for each launch.

Executive Involvement [15:14]

During significant incidents, Canva's founders may join the incident channel. Typically, a senior leader acts as the primary communication channel to the founders, providing updates and context.

Cloudflare's Role [15:51]

The network configuration issue with Cloudflare, where traffic was going over the public internet instead of their private backbone, was discovered later during the investigation.

Unfolding of the Incident [16:27]

The incident unfolded rapidly. By the time teams realized the issue was broader than search, the origin fetch had completed, and the gateways began to fail. This quick progression made it difficult to anticipate the severity of the situation in real-time.

Parallel Efforts During the Incident [17:26]

During the incident, multiple work streams were initiated in parallel. One team contacted vendors, following well-practiced escalation procedures. Another engineer profiled the gateway and identified a performance issue related to a change in the metrics library, which was causing contention and reducing gateway capacity.
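
The episode doesn't say what tooling was used to profile the gateway. As one illustration of the technique, a Go service can expose lock-contention data through the standard net/http/pprof handlers:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // side-effect import: registers the /debug/pprof/* handlers
	"runtime"
)

func main() {
	// Sample every mutex contention event so /debug/pprof/mutex has data.
	runtime.SetMutexProfileFraction(1)
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```

With that in place, `go tool pprof http://localhost:6060/debug/pprof/mutex` ranks call stacks by time spent blocked on locks, which is exactly the kind of evidence that surfaces a contended metrics path.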

Metrics Library Issue [19:27]

The performance issue was traced to a change in the metrics library that inadvertently put metric registration behind a lock, reducing the capacity of the gateways. The fix for this issue was already merged and scheduled for deployment shortly after the incident.
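
The report doesn't include the library code, but the general shape of the regression and its fix might look like this hypothetical Go sketch, where resolving a counter through a locked registry on every request serializes all gateway goroutines:

```go
package main

import (
	"sync"
	"sync/atomic"
)

// registry is a hypothetical metrics registry, not Canva's library;
// it only mirrors the pattern described in the episode.
type registry struct {
	mu       sync.Mutex
	counters map[string]*int64
}

// Counter looks up (or registers) a counter by name under a global lock.
func (r *registry) Counter(name string) *int64 {
	r.mu.Lock()
	defer r.mu.Unlock()
	c, ok := r.counters[name]
	if !ok {
		c = new(int64)
		r.counters[name] = c
	}
	return c
}

var defaultRegistry = &registry{counters: map[string]*int64{}}

// Regression shape: resolving the counter inside the handler takes the
// registry lock on every request, so all goroutines contend on one mutex.
func handleRequestSlow() {
	atomic.AddInt64(defaultRegistry.Counter("gateway_requests_total"), 1)
}

// Fix shape: resolve once at startup; the hot path is a lock-free add.
var requests = defaultRegistry.Counter("gateway_requests_total")

func handleRequestFast() {
	atomic.AddInt64(requests, 1)
}

func main() {
	handleRequestSlow()
	handleRequestFast()
}
```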

Load Balancer Issues and Mitigation [22:02]

The load balancers fronting the gateway's ECS containers were overwhelmed, so the system could only be stabilized by shedding load. Since there was no fine-grained control over incoming demand, country-level controls at the edge were used to block traffic and display a status message. Europe, which was at peak load at the time, was re-enabled first, followed by the rest of the world.
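
The actual control lived at the edge provider rather than in Canva's gateways. Purely as an illustration of country-level load shedding, a gateway-side version might look like this sketch (CF-IPCountry is a real Cloudflare request header; the country list and everything else are invented):

```go
package main

import (
	"log"
	"net/http"
)

// blocked holds the country codes currently being shed; arbitrary here.
var blocked = map[string]bool{"US": true, "BR": true}

func shedByCountry(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Cloudflare forwards the client's country code as CF-IPCountry.
		if blocked[r.Header.Get("CF-IPCountry")] {
			// Illustrative user-friendly status message, not Canva's wording.
			http.Error(w, "We're having technical issues and are working on it. Please try again soon.",
				http.StatusServiceUnavailable)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	ok := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	log.Fatal(http.ListenAndServe(":8080", shedByCountry(ok)))
}
```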

Remote Incident Response [24:50]

Canva has a hybrid work setup, with most engineers in time zones spanning New Zealand to the west coast of Australia. When an incident is triggered, a Slack room and a Zoom call are created automatically. The Zoom chat is disabled to funnel all communication through the Slack room for better record-keeping.
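
The episode doesn't detail the automation itself. As a sketch of the Slack half only, a bot could open the incident channel through Slack's conversations.create Web API method; the channel naming scheme here is invented, and the Zoom side and response parsing are omitted:

```go
package main

import (
	"bytes"
	"encoding/json"
	"net/http"
	"os"
)

// createIncidentChannel opens a dedicated Slack channel for an incident
// via the conversations.create Web API method (a real Slack endpoint).
func createIncidentChannel(id string) error {
	body, _ := json.Marshal(map[string]any{"name": "inc-" + id})
	req, err := http.NewRequest("POST",
		"https://slack.com/api/conversations.create", bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Authorization", "Bearer "+os.Getenv("SLACK_BOT_TOKEN"))
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	return resp.Body.Close() // a real tool would check Slack's "ok" field
}

func main() { _ = createIncidentChannel("example-incident") }
```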

Post-Incident Analysis and Reporting [26:35]

Canva does not have a dedicated incident analysis team. Incident coordinators work with service teams to create post-incident reports (PIRs). These reports are linked in the Slack room, and the state of the incident is captured in a timeline.

Sharing and Utilizing Incident Reports [27:57]

Incident reports are published in the Slack room and reviewed in a weekly meeting. A summary of incidents is also presented in the monthly engineering leads meeting to ensure visibility and identify common themes. Action items from the reports are created in Jira and prioritized in team backlogs.

Surprising Aspects of the Incident [29:48]

The most surprising aspect was that the edge provider coalesced requests behind an in-flight origin fetch indefinitely, with no timeout. Every waiting client was held until the slow fetch finished and then released at the same moment, which is what allowed the thundering herd to develop.
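
The missing guard, sketched here hypothetically rather than as the provider's actual fix, is to bound how long a coalesced waiter will block before failing fast:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// await blocks on an in-flight fetch's done channel, but gives up at the
// context deadline instead of waiting forever. Hypothetical sketch.
func await(ctx context.Context, done <-chan struct{}) error {
	select {
	case <-done:
		return nil
	case <-ctx.Done():
		return errors.New("origin fetch exceeded coalescing timeout; failing fast")
	}
}

func main() {
	done := make(chan struct{}) // never closes: simulates a stuck origin fetch
	ctx, cancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
	defer cancel()
	fmt.Println(await(ctx, done)) // errors quickly instead of piling up waiters
}
```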

Automation Making Things Worse [31:04]

The incident highlighted how automation can sometimes worsen situations. For example, the edge provider's automated mitigation system failed due to stale configuration. In response, Canva implemented processes to freeze autoscaling during incidents and allow humans to take control.
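
The episode describes the freeze as a process change, not code. One hypothetical shape for it is an autoscaling loop that checks an incident flag before acting; all names and structure here are invented:

```go
package main

import (
	"sync/atomic"
	"time"
)

// frozen is flipped by incident tooling so that humans keep control of
// capacity while an incident is in progress.
var frozen atomic.Bool

func autoscaleLoop(desired func() int, apply func(int)) {
	for range time.Tick(30 * time.Second) {
		if frozen.Load() {
			continue // incident declared: leave capacity decisions to humans
		}
		apply(desired())
	}
}

func main() {
	frozen.Store(true) // e.g. set automatically when a SEV0 is declared
	go autoscaleLoop(func() int { return 10 }, func(n int) { /* resize fleet */ })
	time.Sleep(time.Minute)
}
```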

Biggest Learning from the Incident [34:10]

The biggest learning was the importance of practicing controls on the edge provider's network. Canva now keeps a user-friendly canned status message ready to go and regularly practices applying these controls. They also emphasize exercising emergency and failure modes daily so that they work when they're actually needed.

Preparedness and Drills [36:13]

Canva conducts various drills, including wheel of misfortune-style exercises and larger business continuity drills. These drills help teams prepare for different failure scenarios.

Conclusion [36:48]

The host thanks Simon for sharing Canva's internal processes and incident report. They emphasize the benefit of sharing these experiences with the broader industry. Simon expresses hope that others can learn from their experiences and improve together.

Date: 6/19/2025 Source: www.youtube.com