How Cruise Uses BlazeMeter API Monitoring as its First Line of Defense for Kubernetes Cluster Health

ABOUT CRUISE

Cruise is a San Francisco-based autonomous vehicle company offering self-driving services that help people and goods reach their destinations with greater safety and efficiency.

Cruise’s platform engineers maintain the tooling and processes for collecting and monitoring data, including metrics, logging and incident management. In addition, they manage over 60 Kubernetes clusters, which are used by Cruise’s developers to run production workloads and services.

“BlazeMeter is often the first tool to catch incidents. It’s solid and reliable and an essential part of our infrastructure engineering stack.”

 

THE CHALLENGE

Cruise’s platform engineers maintain 60+ Kubernetes clusters, on which developers build and run services. Cruise needed a solution that would monitor these clusters for uptime, so that developers can rely on them for both development and production workloads.

In addition to ongoing cluster health, Cruise also needed a solution that would enable developers to monitor their own services. Finally, they were looking for a solution that could support their internal network, where most services are run. However, they discovered that most API monitoring services only support external APIs.

THE SOLUTION

Cruise’s engineering team has been using BlazeMeter’s API monitoring solution for three years as its primary tool for uptime detection. Considered their “first line of defense”, BlazeMeter monitors Cruise’s internal APIs and clusters at all times.

Cruise uses BlazeMeter in two primary ways:

1. BlazeMeter for Platform Engineers

The platform engineers on the team use BlazeMeter to monitor the Kubernetes clusters that run production workloads. A standard set of ten (10) tests runs per cluster to ensure it is up and running, covering ingresses, Nginx, STO features, authorization and functionality. The configuration of these tests is managed by a standard command script, so an administrator only needs to add one line to the cluster setup scripts to create and start the tests, which run on BlazeMeter’s on-premise agents.
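
As a rough illustration of how a standard test set might be templated per cluster, here is a minimal Python sketch. It is not Cruise's script and does not use BlazeMeter's API. The names `MonitorTest` and `build_cluster_tests`, the URL pattern, and the `example.internal` domain are all hypothetical, and the areas listed are simply the ones named above, not the actual ten tests.

```python
from dataclasses import dataclass
from typing import List

# Areas named in the case study; the real set of ten tests is not published.
CHECK_AREAS = ["ingress", "nginx", "sto", "authorization", "functionality"]


@dataclass(frozen=True)
class MonitorTest:
    """Hypothetical description of one scheduled uptime test."""
    cluster: str
    name: str
    url: str
    interval_minutes: int


def build_cluster_tests(cluster: str, interval_minutes: int = 5) -> List[MonitorTest]:
    """Expand the shared template into one test definition per area."""
    return [
        MonitorTest(
            cluster=cluster,
            name=f"{cluster}-{area}",
            # Assumed URL pattern, for illustration only.
            url=f"https://{area}.{cluster}.example.internal/healthz",
            interval_minutes=interval_minutes,
        )
        for area in CHECK_AREAS
    ]


if __name__ == "__main__":
    # The "one line" an administrator adds would amount to a call like this.
    for test in build_cluster_tests("cluster-a"):
        print(test.name, test.url, f"every {test.interval_minutes} min")
```

Keeping the template in one place means every cluster gets the same checks under predictable names, which is what makes a single added line in a setup script sufficient.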

Tests are typically run every five minutes. A total of 2.2 million calls are made every day across the account.
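
To put the schedule in perspective, the following back-of-the-envelope Python sketch works out how many test runs the cluster checks alone would produce per day. It assumes exactly 60 clusters, ten tests per cluster and a uniform five-minute interval; the function names are illustrative only.

```python
MINUTES_PER_DAY = 24 * 60


def runs_per_day(interval_minutes: int) -> int:
    """Number of times one test fires per day at a fixed interval."""
    return MINUTES_PER_DAY // interval_minutes


def cluster_test_runs_per_day(clusters: int, tests_per_cluster: int,
                              interval_minutes: int) -> int:
    """Total daily test runs across all clusters."""
    return clusters * tests_per_cluster * runs_per_day(interval_minutes)


if __name__ == "__main__":
    print(runs_per_day(5))                       # 288
    print(cluster_test_runs_per_day(60, 10, 5))  # 172800
```

Under these assumptions the cluster checks account for roughly 172,800 test runs per day. The 2.2 million figure is account-wide, so it presumably also includes the developers' own service monitors, and a single test run may issue more than one call.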

Cruise takes advantage of BlazeMeter’s integration with Datadog: metrics and status are propagated from BlazeMeter to Datadog. When an incident is detected, PagerDuty notifies the correct team or individuals (about 450 individuals could currently be notified). The appropriate team can then fix the issue before the situation worsens and possibly impacts internal and external customers. Teams can inspect the results of the failed tests within BlazeMeter to gain a deeper understanding of what caused the incident.

2. BlazeMeter for Developers

In addition to the system administration monitoring of Kubernetes clusters, Cruise developers use BlazeMeter to monitor their services. Cruise has made it simple to set this up. Every time a new service is deployed to Kubernetes, the developer simply has to add one line of code to create and configure the testing of their services. This has become part of Cruise’s standard operating procedure.

THE RESULTS

Cruise relies on BlazeMeter for its API monitoring needs. The account is organized into over 200 buckets and is accessed by over 450 users. At that scale, consistency in how tests are set up and organized is key to continuing to scale with Cruise’s needs.

BlazeMeter tests run continuously, monitoring cluster health and uptime at all times, and are the first to notify Cruise’s teams of a service outage, ahead of other tools. This enables issues to be quickly triaged and fixed before they further impact services or customers.

In addition, BlazeMeter helps detect other types of issues. Because BlazeMeter agents run from different regional networks, a failed test can also point to a network incident, which helps the engineering team ensure uptime and service availability. And because BlazeMeter supports on-premise agents, internal services can be monitored effectively as well.

CUSTOMER FEEDBACK

“BlazeMeter is our first line of defense. We catch incidents early and fix them before they have an impact.”

“BlazeMeter enables us to fire and forget. We have confidence that our Kubernetes clusters are healthy.”

Marko Kudjerski, Senior Site Reliability Engineer, and Wil Reichert, Senior Cloud Engineer, at Cruise
