Making Your Performance Engineering More Efficient - Part 1
Performance or load tests can produce a sea of data which is sometimes overwhelming to analyze. In this series of blog posts, I will present methodical practices that can help you as performance engineers become more efficient. These practices are based on my experience. I've been active in the performance engineering industry for 17 years now. I've performance tested and tuned many different web, mobile and Internet of Things applications for a variety of companies. So all my blogs and webinars draw on real field experience, not just theory.
I'm not here to say certain processes are right or wrong, but I do want to share some of my creativity. First, I will start out with the difference between performance engineering and performance reporting, and why there is no replacement for humans as performance engineers. Get the full stack of practices and tips from my webinar, here.
Performance Engineering vs. Performance Reporting
Performance reporting quantifies the scalability of an application. These kinds of reports deliver value to all stakeholders, but reporting uses inexact data such as averages and percentiles to determine whether SLAs are met. SLAs are agreements on the performance criteria of transactions. For example, under a load of 1800 concurrent users, 90% of the transactions had response times of under 1.5 seconds. This is the type of information that would be found in a performance report.
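To make the reporting side concrete, here is a minimal sketch of a percentile-based SLA check like the one in the example above. The function names and the sample response times are made up for illustration, and the nearest-rank percentile method is my assumption; your load tool may calculate percentiles differently.

```python
import math

def percentile_nearest_rank(values, pct):
    """Return the pct-th percentile using the nearest-rank method."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]

def meets_sla(response_times_s, pct=90, threshold_s=1.5):
    """True when the pct-th percentile response time is under the threshold."""
    return percentile_nearest_rank(response_times_s, pct) < threshold_s

# Hypothetical response times in seconds for one transaction.
samples = [0.4, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.4, 3.8]

print("90th percentile:", percentile_nearest_rank(samples, 90))
print("SLA met:", meets_sla(samples))
print("average:", sum(samples) / len(samples))
print("max:", max(samples))
```

Notice that the SLA passes even though one request took 3.8 seconds. The report is correct, but the outlier only shows up in the raw data.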
Performance engineering goes beyond this type of quantification. It exposes the scalability ceiling by isolating the limiting resource. Methodical performance engineering identifies the bottleneck. Then, this bottleneck can be alleviated via tuning, configuration of servers, code profiling for efficiency, re-architecting the deployment or a number of other ways depending on the type of bottleneck discovered.
Determining root causes in scalability issues requires exact data. Therefore, performance engineering is based on raw data and absolutes. You need the whole picture, the lows and the highs, in order to isolate and correlate scalability limitations.
Human Performance Engineers over Engineering Tools
A trained analytical performance engineer can quickly identify trends, spot anomalies, separate out the noise in busy graphs, differentiate between symptoms and root causes, prove or disprove theories, isolate bottlenecks and deliver actionable results. Tools can generate loads, collect metrics and graphs, correlate results, and generate reports. But it takes an analytical mind to understand and interpret this data. There are tools that can help with pattern recognition or categorize data based on standard deviations. However, it really takes an analytical mind to determine whether that information is relevant or useful. A human. Not a tool.
Now that we’ve determined the importance of performance engineering and why humans should be analyzing data, here are some best practices I’ve accumulated over the years:
1. Identify Tier-Based Engineering Transactions
Engineering scripts contain a single transaction that targets a specific tier of your deployment. Monitoring the frontend KPIs of engineering transactions (TPS, or transactions per second, and response times) will save you a great deal of time in identifying root-cause bottlenecks. Degradation in a specific engineering transaction helps isolate the tier of the deployment on which you need to concentrate your efforts.
Every deployment is unique but here are some examples:
- Web tier: A transaction that GETs a static, non-cached file.
- App tier: A transaction that executes a method and creates objects but stops there and does not go to the DB tier.
- DB tier: A transaction that requires a query from the DB, and another that definitely does an update.
Take your time and isolate tier-based transactions, as these will help you in the analysis phase. If you are unsure which transactions hit which tiers, ask the development or supporting infrastructure team. Collaboration is key.
I recommend you make each of these engineering transactions its own script, so you can graph out its own TPS and response time values independently of all the other business transactions. Also, add a pause between executions of these engineering scripts to space out the intervals of execution and therefore create a consistent sampling rate.
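As a rough illustration of the pacing idea, the sketch below runs a single engineering transaction at a fixed interval, so each sample starts on schedule no matter how long the previous one took. Everything here is hypothetical: `run_paced`, the placeholder transaction and the interval are stand-ins, and in practice you would use the pacing feature of your own load tool.

```python
import time

def run_paced(transaction, interval_s, iterations,
              clock=time.monotonic, sleep=time.sleep):
    """Run transaction() once per interval.

    Returns a list of (start_offset_s, response_time_s) tuples.
    """
    results = []
    t0 = clock()
    for i in range(iterations):
        # Wait until the scheduled start so the sampling rate stays constant
        # even when response times vary.
        delay = (t0 + i * interval_s) - clock()
        if delay > 0:
            sleep(delay)
        started = clock()
        transaction()
        results.append((started - t0, clock() - started))
    return results

def web_tier_get():
    """Placeholder for a real request, e.g. a GET of a static file."""
    time.sleep(0.01)

# Short interval so the demo finishes quickly; a real script would use seconds.
for offset, rt in run_paced(web_tier_get, interval_s=0.05, iterations=3):
    print(f"start={offset:.2f}s response_time={rt:.3f}s")
```

If a transaction takes longer than the interval, this sketch simply starts the next one right away rather than skipping a sample.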
2. Monitor KPIs Cleverly
KPI monitoring is what solves bottlenecks. Frontend KPIs give you the current capacity, like user load, TPS, response time and error rates. Monitored KPIs are the ones that tell the performance story of why the application scales to that capacity.
Hit Rates and Free Resources are two very illuminating KPIs for every “server”, and both can tell us performance stories. Therefore, we want to monitor them.
The hit rate will trend with the workload. As the workload increases with a ramping load test, so does the hit rate. These rates are monitored by APMs (monitoring solutions).
Hit Rate Type Examples:
- For every OS: TCP connection rate
- Webserver: Requests per Second
- Messaging: Enqueue/Dequeue Count
- DB: Queries per Second
Remember that each deployment is unique, so you will need to decide what qualifies as a good Hit Rate per server for you, and hook up the required monitoring.
The next set of KPIs are Free Resources. I tend to use free resources instead of used resources because “free” will trend inversely with the workload, which makes bottlenecks visually easier to identify on a graph. However, sometimes a free counter is not available for a resource. That’s OK, just use the used metric instead. Also, if the target resource has queueing strategies, be sure to add a queued counter to show waiting requests.
Free Resources Type Examples:
- OS: CPU average IDLE
- Webserver: Waiting requests
- APP server: free worker threads
- Messaging: Enqueue/Dequeue Wait time
- DB: Free connections in thread pool
To determine relevant KPIs or hook them in, start by studying an architectural diagram of the deployment. Every touch point where the data is received or transformed is a potential bottleneck and therefore a candidate for monitoring. The more relevant KPIs you have, the clearer the performance story.
Now it’s time to prove your KPIs' worth. Assuming you have built a rock-solid performance test harness, it’s time to spin up a load test using both the user workflow AND those engineering scripts.
Set up a slow ramping test (adding one user every 45 seconds up to, let's say, 200 virtual users - this is not a goal test). Once this test is complete, graph out all your monitored KPIs and make sure that each one has either a direct or an inverse relationship to the TPS/workload reported by the load tool. Have patience here and graph out everything. The information you collect from this test is worth its weight in gold.
Here, you are exercising the application in order to validate that your KPIs trend with the workload. If a KPI doesn’t budge or make sense, it gets tossed out. Plain and simple.
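I do this check by eye on the graphs, but if you want a quick numeric first pass, something like the sketch below can flag candidates. It uses a Pearson correlation between each KPI and TPS. The 0.8 cutoff, the function names and the sample series are all assumptions for illustration, not a rule, and a number like this does not replace looking at the graph.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 when either series is flat."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a KPI that never budges carries no trend
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)

def classify_kpi(tps, kpi, min_abs_r=0.8):
    """Label a KPI 'direct', 'inverse' or 'toss' relative to TPS."""
    r = pearson(tps, kpi)
    if r >= min_abs_r:
        return "direct"
    if r <= -min_abs_r:
        return "inverse"
    return "toss"

# Hypothetical samples taken at the same timestamps during a slow ramp.
tps = [10, 20, 30, 40, 50]
kpis = {
    "web_requests_per_sec": [12, 22, 31, 43, 52],  # hit rate
    "cpu_idle_pct": [90, 80, 71, 60, 49],          # free resource
    "mystery_counter": [5, 5, 5, 5, 5],            # never moves
}

for name, series in kpis.items():
    print(name, "->", classify_kpi(tps, series))
```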
Also, set up your monitoring interval to collect 3 values per sustained load. In this case, since we are increasing the load every 45 seconds, you will want to have the load tool sampling every 15 seconds. The reason for 3 is that when graphed out, 3 sustained data points give a plateau on the graphed line, while 1 will give a peak. Plateaus are trends. More on this in the next blog.
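The arithmetic behind those numbers is simple, and the short sketch below spells it out so you can plug in your own ramp. The helper names are hypothetical, and the total duration assumes every load level, including the last one, is held for one full step.

```python
def sampling_interval_s(step_duration_s, samples_per_step=3):
    """Interval that yields the wanted number of samples per load level."""
    return step_duration_s / samples_per_step

def ramp_plan(max_users, step_duration_s, samples_per_step=3):
    """Summarize a one-user-per-step ramp test."""
    return {
        "interval_s": sampling_interval_s(step_duration_s, samples_per_step),
        # Assumes each user level is held for one full step.
        "total_duration_s": max_users * step_duration_s,
        "total_samples": max_users * samples_per_step,
    }

plan = ramp_plan(max_users=200, step_duration_s=45)
print(plan["interval_s"])             # 15.0 seconds between samples
print(plan["total_duration_s"] / 60)  # 150.0 minutes of test time
print(plan["total_samples"])          # 600 samples per KPI
```

So the slow ramp described above runs for about two and a half hours, which is why I said to have patience.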
Catch unanticipated resources. Perhaps not all of the resources are caught during the review of the architecture diagram, so next, spin up a fast ramping load test. Again, we don’t care about the results. This is just an investigation to see what processes and OS activities spin up. If you notice an external process and have no idea what it is doing, ask! It could be a KPI candidate to add to your harness.
3. Clear the Clutter - Reduce the Number of Transactions You Analyze
Now that we are getting into the analysis phase, we are going to significantly reduce the number of transactions that we will graph out and actually use for analysis. The reason is that there are likely 25, 50, maybe hundreds of tagged business transactions. This is too many to analyze efficiently.
All of these business transactions are using shared resources of the deployment. So you are going to pick a few to avoid analysis paralysis. Which ones? I recommend you choose your transactions based on your unique application.
From the results of your upcoming single user load test, choose the landing page, the login, the business transaction that has the highest response time, and the transaction with the lowest response time.
You will also include and graph out ALL the engineering transactions. The number of engineering transactions will differ depending on how many tiers are in the deployment. If the deployment has 5 tiers, that’s 5 engineering transactions.
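The selection described above can be sketched in a few lines. The transaction names and response times below are invented, and I am assuming the slowest and fastest picks come from the business transactions only, with the engineering transactions added afterwards.

```python
def select_for_analysis(single_user_rt, landing, login, engineering):
    """Pick the transactions to graph during analysis.

    single_user_rt maps transaction name -> response time (seconds)
    from the single user load test.
    """
    business = {name: rt for name, rt in single_user_rt.items()
                if name not in engineering}
    slowest = max(business, key=business.get)
    fastest = min(business, key=business.get)
    picks = [landing, login, slowest, fastest, *engineering]
    # Drop duplicates (e.g. login is also the slowest) but keep the order.
    return list(dict.fromkeys(picks))

# Hypothetical single user results.
single_user_rt = {
    "landing_page": 0.8,
    "login": 1.1,
    "search": 2.4,
    "view_cart": 0.3,
    "checkout": 1.9,
    "eng_web_static_get": 0.05,
    "eng_app_method": 0.12,
    "eng_db_query": 0.20,
}
engineering = ["eng_web_static_get", "eng_app_method", "eng_db_query"]

selected = select_for_analysis(single_user_rt, "landing_page", "login",
                               engineering)
print(selected)
```

Here eight tagged transactions drop to seven graphed lines, which is not much of a saving, but with 50 or 100 business transactions the list would still be this short.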
Now, instead of analyzing the sum total of all transactions executing in a load test emulating a realistic load, you are graphing out only a subset and the response times graph will be less chaotic and far easier to analyze. Of course, when you are creating performance reports, you will include response times for all of the business transactions.