Stress Testing With BlazeMeter: How The New York Times Prepared for the Election Season
In this post, we’ll share the customer story of how one of the most widely recognized news sources in the world prepared for the most anticipated and watched US elections in modern history—and what strategies for Stress Testing were implemented to prepare their systems.
The New York Times’ Shesh Patel, Engineering Manager, and Prathamesh Nagle, Lead Software Engineer, spoke to the BlazeMeter team to discuss how the news powerhouse ensured site performance and stability before and during the 2020 elections. You can watch the entire discussion here.
Stress Testing in 2018: Lessons Learned
The first time The New York Times performed stress testing as an operational maturity exercise was two years earlier, during the 2018 midterm season. That effort was largely reactive: the news organization started stress testing only a month before the 2018 midterm election results. Their goal at the time was to achieve the best scaling they could in the moment to handle the incoming traffic.
Shesh and his team quickly decided that for the 2020 Election, they would be more proactive. They started conducting stress testing nearly a year before the 2020 election.
Load Testing Goals Before Major Events
Shesh outlined two major goals for preparing for the 2020 elections:
1. Identifying Bottlenecks to Ensure Stability
First, the team wanted to identify any bottlenecks that would not allow scalability during the election results week. By identifying them on time and building a resiliency plan, they could ensure their ability to remain a key destination for reliable, up-to-date election results.
2. System Performance Resiliency
Shesh’s team “also wanted to exercise resiliency. What this means is although you can try as much as you want, there will always be a hiccup, but when there are hiccups in a system performance degradation or completely knocked out, we want to make sure that we exercise our resiliency plan.”
Stress Testing Challenges
When setting out to achieve their goals, the team knew they were facing some challenges.
The First Challenge: Traffic
The 2020 election was anticipated to be the most viewed in modern history. The team expected heavy traffic not only on election night but also in several spikes throughout the week, along with sustained traffic. The preparation also differed from past years due to the ambiguity around this time.
“If you have to put a number, we received roughly 10x traffic for 2016,” Shesh said. “For the midterm 2018, we almost got 40x for some of the applications… So the biggest question was: what’s in our door for 2020?”
In the left graph below, the peak represents the typical spike The New York Times sees on election results night, and around any event that goes along with the election. Preparing for and handling that spike is a unique challenge.
The Second Challenge: Infrastructure
In 2016, The New York Times had an on-prem architecture. But since then “we migrated to the cloud, several applications were deployed into Cloud either through a Serverless, Kubernetes, or Virtual machines. Today, the architecture is completely different from that time, and was never tested before for such a level of traffic.”
The Third Challenge: Growth
“It’s an interesting and positive challenge: Since the number of subscribers is an all-time high, more subscribers lead to more users. This increase generates a higher than normal level of traffic through various sites due to the subscription growth.”
With these goals and challenges outlined, the New York Times set out to build and execute a testing plan.
Stress Testing Prerequisites
1. Stress Testing Tools
The first executive decision was about tooling and choosing the right platform for such a massive stress test. To create the script, they used Apache JMeter, an open source, Java-based application. But “we knew the existing machines were not capable of generating the high number of users and requests on the 2020 Election night,” Shesh said, which is why they chose to scale with BlazeMeter.
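To make the sizing problem concrete, here is a minimal sketch of the arithmetic behind "how many load engines do we need?". It is not taken from the NYT's setup: the baseline traffic and per-engine capacity are hypothetical placeholders, and only the 10x and 40x multipliers come from the figures Shesh quoted above.

```python
import math


def engines_needed(baseline_rps: float, multiplier: float, rps_per_engine: float) -> int:
    """Return how many load engines are needed to generate baseline_rps * multiplier."""
    if baseline_rps <= 0 or multiplier <= 0 or rps_per_engine <= 0:
        raise ValueError("all inputs must be positive")
    target_rps = baseline_rps * multiplier
    # Round up: a fractional engine still means provisioning a whole one.
    return math.ceil(target_rps / rps_per_engine)


# Hypothetical numbers, for illustration only (not from the article).
BASELINE_RPS = 2_000     # assumed normal-day requests per second
RPS_PER_ENGINE = 500     # assumed sustainable output of one load engine

if __name__ == "__main__":
    for multiplier in (10, 40):  # traffic multiples quoted for 2016 and 2018
        count = engines_needed(BASELINE_RPS, multiplier, RPS_PER_ENGINE)
        print(f"{multiplier}x baseline needs {count} engines")
```

Even with modest assumed inputs, the engine count grows quickly with the traffic multiple, which is the kind of gap between local machines and the required load that the team described.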
2. Testing Environment
Then, they chose the environment. Shesh made an interesting choice. “We all agreed that our staging will not be able to replicate our production system… so we did all the stress testing into the production environment.” This was a bold move, and it took a great collaborative effort, from engineering to newsroom leadership, to allow them to perform testing in production and learn from it.
3. Team and Partner Collaboration
Twenty teams, alongside multiple providers such as cloud providers and CDNs, participated in the testing, each with different timelines. They all needed to be aligned to eliminate surprises and avoid major testing events.
It was key to have constant communication to enable preparation and a seamless execution. In addition, they created a failover plan, in case an extreme load crashed their production environment and affected their users and partners.
Test Planning and Execution
1. Test Scenarios
It was key for the New York Times to understand the business use case of each and every application. They spent time understanding their applications, what they were doing, and their internal and external dependencies.
When they created their JMeter script, they ensured they could mimic all the requests getting generated when visiting nytimes.com. This means ensuring that all relevant API endpoints in internal systems were reached.
In addition, traffic was expected from various locations. Therefore, they simulated traffic from various US, European and Asian regions. They designed their scripts to test the applications’ performance across cloud providers, whether these applications were deployed in the same provider or in different ones. These features are also supported in BlazeMeter.
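As an illustration of planning geographically distributed load, the sketch below splits a total number of virtual users across regions in proportion to integer weights. The region names, the 60/25/15 split, and the user count are assumptions made for the example; the article does not state the actual distribution the NYT used, and this is plain Python rather than any BlazeMeter or JMeter API.

```python
def split_users(total_users: int, weights: dict[str, int]) -> dict[str, int]:
    """Split total_users across regions in proportion to integer weights.

    Uses largest-remainder rounding so the parts always sum to total_users.
    """
    weight_sum = sum(weights.values())
    if total_users < 0 or weight_sum <= 0 or any(w < 0 for w in weights.values()):
        raise ValueError("total_users and weights must be non-negative, with a positive weight sum")

    allocation: dict[str, int] = {}
    remainders: dict[str, int] = {}
    for region, weight in weights.items():
        # Integer division gives the floor share; the remainder ranks who gets leftovers.
        allocation[region], remainders[region] = divmod(total_users * weight, weight_sum)

    leftover = total_users - sum(allocation.values())
    for region in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        allocation[region] += 1
    return allocation


if __name__ == "__main__":
    # Hypothetical split across the three region groups named in the article.
    plan = split_users(10_000, {"us": 60, "europe": 25, "asia": 15})
    print(plan)
```

Integer weights keep the arithmetic exact, so the per-region counts always add up to the requested total regardless of rounding.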
2. Test Frequency
The New York Times conducted 7-8 stress tests throughout 2020, each up to two hours long.
3. Coping Mechanisms
It was important for the NYT to launch the tests during the daytime, so all relevant teams could provide support and monitor the systems and applications. Unlike their tests in 2018, this time they asked teams not to scale up. They wanted to identify their bottlenecks and establish a baseline: the maximum traffic their applications could handle without severely degraded performance.
In addition, tests were not stopped when there was an issue. Instead, their leadership proposed that each project or team supporting an application would have limited time to identify and fix the issue while the test was in progress, which helped them test their organization’s incident management plan during the stress testing.
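The baseline idea described in this section can be sketched as a small analysis over stepped load results: take the highest load level reached before latency or error rate first crosses a threshold. The `StepResult` type, the thresholds, and the sample measurements below are all hypothetical; the article does not say which metrics or limits the NYT used to define "severely degraded".

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class StepResult:
    rps: int                # load level applied during this step
    p95_latency_ms: float   # 95th percentile response time observed
    error_rate: float       # fraction of failed requests, 0.0 to 1.0


def max_sustainable_rps(
    steps: Iterable[StepResult], max_p95_ms: float, max_error_rate: float
) -> Optional[int]:
    """Return the highest RPS reached before the first step that breaches a threshold.

    Returns None if even the lowest load step is already degraded.
    """
    best = None
    for step in sorted(steps, key=lambda s: s.rps):
        if step.p95_latency_ms > max_p95_ms or step.error_rate > max_error_rate:
            break  # first degraded step: everything above it is out of bounds
        best = step.rps
    return best


if __name__ == "__main__":
    # Made-up measurements for illustration only.
    results = [
        StepResult(1000, 180.0, 0.001),
        StepResult(2000, 220.0, 0.002),
        StepResult(4000, 410.0, 0.004),
        StepResult(8000, 1900.0, 0.070),
    ]
    print(max_sustainable_rps(results, max_p95_ms=500.0, max_error_rate=0.01))
```

Stopping at the first breach, rather than taking the maximum passing step overall, avoids reporting a higher load level that only passed by chance after the system had already started to degrade.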
4. Analysis and Optimization
The team wanted to learn from every stress test that took place, and performed a “blameless post-mortem,” to make improvements along the way. This ensured the systems were capable of handling the load by the time of the election results.
Stress Testing with BlazeMeter
BlazeMeter provided The New York Times with the following capabilities for their stress testing:
“We did not have to worry about any infrastructure related issues, and simply uploaded the script created, and BlazeMeter provided hundreds of engines within minutes,” Shesh said.
The ability to select load from US, Europe or Asia regions in any of the supported cloud providers gave the New York Times on-demand flexibility for their designed scenarios.
BlazeMeter’s summary, aggregate and log report proved to be helpful for the New York Times, because they were able to monitor how their systems were performing when the tests were running.
Another feature they used was Private Locations. These are agents that can be configured to run in their internal network, with the same features as the agents in the public cloud. The ability to use all the reports and logs to monitor the tests, even though the agents were deployed in their own network, was crucial to the team.
For example, in one of their scenarios, they used the Private Locations feature for testing an application that was not deployed in production. First, this team tested their script in dev and staging environments within their internal network. Then, upon getting the desired results, they enabled the application to be ready for the production execution.
Dynamic Control of RPS
The NYT was able to change the number of requests generated throughout the test without stopping the test. They were able to simulate or generate the pattern they were expecting to see during election night.
In BlazeMeter, you can change the RPS (Requests per Second) dynamically by using the Live Remote Control feature. This feature also allows you to change other specific properties in your test while running in the platform.
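To show what an election-night load pattern might look like as data, the sketch below models a target RPS schedule as a list of (seconds, RPS) points with linear interpolation between them. This is plain Python, not the BlazeMeter Live Remote Control API; the `rps_at` helper and the schedule values are hypothetical, and simply describe the kind of curve an operator could follow when adjusting RPS by hand during a run.

```python
from bisect import bisect_right

# Each point is (seconds since test start, target requests per second).
Schedule = list[tuple[float, float]]


def rps_at(schedule: Schedule, t: float) -> float:
    """Return the target RPS at time t by linear interpolation.

    The schedule must be sorted by strictly increasing time. Times before the
    first point or after the last point are clamped to the nearest endpoint.
    """
    times = [point[0] for point in schedule]
    if t <= times[0]:
        return schedule[0][1]
    if t >= times[-1]:
        return schedule[-1][1]
    i = bisect_right(times, t)          # index of the first point strictly after t
    (t0, r0), (t1, r1) = schedule[i - 1], schedule[i]
    return r0 + (r1 - r0) * (t - t0) / (t1 - t0)


if __name__ == "__main__":
    # Hypothetical pattern: steady traffic, a sharp spike, a plateau, then decay.
    election_night: Schedule = [
        (0, 1_000),
        (600, 1_000),
        (900, 10_000),
        (1_800, 10_000),
        (2_400, 3_000),
    ]
    for second in (0, 750, 1_200, 2_100):
        print(second, rps_at(election_night, second))
```

Expressing the pattern as a few control points keeps it easy to review with other teams before the test and to compare against the load that was actually applied afterwards.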
Stress Testing Results
Following this thorough planning and execution, the New York Times achieved a few key learnings:
- They learned to optimize the auto-scaling of their infrastructure, whether it was DB consumption, caching, or other components.
- They learned about cloud features and constraints, which allocate resources depending on the application and the region in which it was deployed. This enabled them to optimize their resources.
As a result, the NYT was able to enjoy a successful election season, with no crashes.
To get started with your own stress testing before major events, sign up to BlazeMeter for free now.