Don't Fly Blind: Why You Need Log Data When Load Testing
Load testing without logs is a bit like flying a plane without ever looking at your dashboard
By Trevor Parsons, Co-founder and Chief Scientist, Logentries
I spent many years completing a PhD in distributed systems with a research group called the 'Performance Engineering Lab,' so load testing has been close to my heart ever since :) In fact, Logentries was originally spawned from close work on a number of research projects with IBM’s Performance and System Test teams. These teams were responsible for load and stress testing very large enterprise applications and needed a better way to understand whether their load tests had passed or failed.
Test teams are regularly faced with questions from management after a load test has completed: “Did we pass?”, “Can we ship?”
And very often a measured response from the test team is: “Give us some time to check the logs to make sure we didn’t miss anything.”
Due to the nature of load testing, however, the volume of data produced during long (e.g. 7-day) test runs or during stress tests can be enormous and very difficult to analyze properly without the right tools. Logentries was designed to take large volumes of log data and very quickly identify whether that data contains errors, exceptions, warnings etc., so that test teams can quickly understand system behavior during test runs.
However, every load test starts with a load testing technology like BlazeMeter, where you can design and shape your tests so that they stress your system in line with your expected usage patterns. BlazeMeter also gives you a great pulse on how your system is performing during a load test - for example, BlazeMeter’s dashboard will show you response times, throughput rates, HTTP status codes etc. This is where load testing and log management fit so well together… if something looks awry from your BlazeMeter dashboard, you can use Logentries to investigate the issue and figure out what caused it.
In fact, in a recent webinar with BlazeMeter we explored some typical scenarios where Logentries can be used alongside BlazeMeter to get to the bottom of typical issues that the BlazeMeter dashboard catches during a test run. For example, investigating what caused slow response times or HTTP status codes that point to an undesirable response from the server.
BlazeMeter Dashboard Showing Test Results
However, to truly understand how your system has behaved under load you will want to look a little further under the hood, even if everything looks OK in your BlazeMeter dashboard. It’s not uncommon for subtle issues or ‘cracks’ to appear in your system as you begin to turn up the heat in terms of system load, for example:
- Exceptions thrown, caught and handled: Exceptions may start to be thrown by your system as load increases significantly. These may not be immediately obvious via the BlazeMeter dashboard, as your system might be catching and handling them appropriately, or they may be non-critical or on a part of the system that is not user-facing. That being said, you want to know about them, as they are often symptoms of more critical issues.
- Capacity thresholds breached: It can make sense to run your system within particular system capacity bounds. For example: your system might run fine with CPU at 95%, but if load increases only slightly you might start to max out and start dropping requests. Your load test might have passed but maybe only by the skin of your teeth :)
- The dreaded memory leak: It’s not uncommon to perform long-running load tests to catch issues like memory leaks, which might otherwise only be identified when your system grinds to a halt with an out-of-memory error. Again, with memory leaks, everything might look fine from a user’s perspective. For example: response times look good, the system is functioning as expected etc. However, available memory might be slowly diminishing with each user request. If you are not analyzing your memory consumption you may never spot this during a load test, and it might only rear its ugly head after your system has been live for a few weeks - resulting in a major outage.
- Internal queues or buffers growing: Not unlike the memory leak issue above, software systems today often have their own internal queues or buffers to handle varying load in a system. If these begin to grow they may not affect customers immediately, as they are usually designed to handle sudden spikes in load. However, if they continue to grow at some point they may fill up and can cause issues that start to affect your users.
4 Tips for looking under the hood during load tests:
Look beyond the HTTP status code - Track exceptions and errors: HTTP status codes, such as 503s, will tell you when the server didn’t provide a valid response. However, you should also track application-level exceptions, warnings and errors so that you can truly understand what is happening under the hood. In Logentries you can use Tags to highlight, track and visualize these types of events in your application logs.
Logentries: Tagging Critical Issues During Load Testing
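To make the idea of tracking application-level events concrete, here is a minimal sketch in plain Python that counts tagged events in a batch of log lines. It does not use the Logentries API or its Tags feature; the `TAG_PATTERNS` names, the regular expressions and the sample log format are all assumptions chosen for illustration, so adjust them to your own logs.

```python
import re
from collections import Counter

# Hypothetical tag patterns; these are assumptions, not Logentries defaults.
TAG_PATTERNS = {
    "error": re.compile(r"\bERROR\b"),
    "warning": re.compile(r"\bWARN(?:ING)?\b"),
    # Matches class-style names such as NullPointerException: or ValueError:
    "exception": re.compile(r"\b\w+(?:Exception|Error)\b:"),
}


def tag_counts(lines):
    """Count how many log lines match each tag pattern."""
    counts = Counter()
    for line in lines:
        for tag, pattern in TAG_PATTERNS.items():
            if pattern.search(line):
                counts[tag] += 1
    return counts


# Made-up sample lines in an assumed "timestamp LEVEL message" format.
sample = [
    "2024-01-01 12:00:00 INFO request served in 120ms",
    "2024-01-01 12:00:01 WARN connection pool at 90%",
    "2024-01-01 12:00:02 ERROR java.lang.NullPointerException: user was null",
    "2024-01-01 12:00:03 INFO request served in 95ms",
]

if __name__ == "__main__":
    for tag, count in sorted(tag_counts(sample).items()):
        print(f"{tag}: {count}")
```

Comparing these counts between a baseline run and a high-load run is one simple way to spot exceptions that are being caught and handled without ever showing up as a bad HTTP status code.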
Monitor resource usage and available capacity: How close to the limit are you running? How do you know how far away from the edge you actually are? During load testing it’s always a good idea to keep an eye on resource usage metrics such as CPU, Network, Memory etc. It can also be a good idea to set bounds within which you believe it is ‘safe’ for your system to run, or to always run your system with some extra capacity so that you can handle sudden spikes in load. Even if you are taking advantage of auto-scaling, it may take some time to bring up additional resources, and it’s almost always a good idea to leave some ‘head room’ so you’re not flying too close to the sun. Logentries will capture server resource usage information and stream it into a log file, so you can cross-correlate it with your access logs and server requests and then visualize both in a dashboard. During load tests you can use this data to keep one eye on resource usage, and it will allow you to better plan for capacity requirements.
Tracking Server Monitoring Information
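As a rough illustration of checking head room after a test run, the sketch below pulls CPU readings out of resource-usage log lines and reports the peak, the remaining head room and how often a chosen ‘safe’ bound was exceeded. The `cpu=<percent>` field format and the 80% default are assumptions made for this example, not the format Logentries writes.

```python
import re

# Assumed key=value field in each resource-usage log line, e.g. "cpu=87.5".
CPU_FIELD = re.compile(r"\bcpu=(\d+(?:\.\d+)?)")


def cpu_samples(lines):
    """Extract CPU percentages from log lines that contain a cpu= field."""
    samples = []
    for line in lines:
        match = CPU_FIELD.search(line)
        if match:
            samples.append(float(match.group(1)))
    return samples


def headroom_report(samples, safe_limit=80.0):
    """Summarize how close the run came to saturating the CPU."""
    if not samples:
        raise ValueError("no CPU samples to analyze")
    peak = max(samples)
    breaches = sum(1 for s in samples if s > safe_limit)
    return {
        "peak": peak,
        "headroom": 100.0 - peak,
        "breach_ratio": breaches / len(samples),
    }


# Made-up sample lines for illustration only.
log_lines = [
    "12:00:00 cpu=50.0 mem=40.2",
    "12:01:00 cpu=70.0 mem=41.0",
    "12:02:00 cpu=95.0 mem=43.5",
    "12:03:00 cpu=85.0 mem=44.1",
]

if __name__ == "__main__":
    print(headroom_report(cpu_samples(log_lines), safe_limit=80.0))
```

A run that passes on response times but spends half its samples above the safe bound is the "skin of your teeth" case described earlier.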
Get to know your heap: Load tests are a great way to flush out potential memory leaks. By tracking GC time and total heap size you can track how your application memory usage is trending. If memory used is constantly creeping upwards and GC time is getting larger and larger AND your load is holding steady you may be experiencing a memory leak. While this might not be immediately affecting system performance or your users' experience in any significant way right now, if left untreated it may result in your system falling over at some point down the road. GC times can be logged and visualized to keep track of them, as well as total memory used (either at the OS or at the application level).
Collecting Raw GC Time
GC Time over Time (Nanoseconds)
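One way to turn "memory is creeping upwards while load holds steady" into a check is to fit a straight line to each series and compare the slopes. The sketch below does this with an ordinary least-squares slope; the function names and both thresholds are hypothetical and would need tuning for a real system, and a positive result is only a hint to investigate, not proof of a leak.

```python
def trend_slope(times, values):
    """Least-squares slope of values against times (units of value per second)."""
    n = len(times)
    if n < 2 or n != len(values):
        raise ValueError("need at least two samples of equal length")
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    variance = sum((t - mean_t) ** 2 for t in times)
    if variance == 0:
        raise ValueError("timestamps must not all be identical")
    return covariance / variance


def possible_leak(times, heap_mb, load_rps,
                  min_heap_slope=0.01, max_load_slope=0.01):
    """Flag a run where heap usage trends upward while load stays flat.

    Both thresholds are illustrative assumptions, not recommended values.
    """
    heap_rising = trend_slope(times, heap_mb) > min_heap_slope
    load_steady = abs(trend_slope(times, load_rps)) <= max_load_slope
    return heap_rising and load_steady


if __name__ == "__main__":
    times = [0, 60, 120, 180]          # seconds since test start
    heap_mb = [500, 510, 520, 530]     # heap used after each GC, in MB
    load_rps = [200, 200, 200, 200]    # requests per second
    print(possible_leak(times, heap_mb, load_rps))
```

The same slope function can be applied to logged GC times to see whether collections are taking longer as the test progresses.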
Log your application’s key performance metrics and leading performance indicators: What is important for you from a performance perspective? What can cause your system to grind to a halt? It is important to understand which metrics are most crucial for knowing whether your application is functioning properly. By way of example, a real-time data processing system might make use of internal queues to handle increases in load. In such situations you might want to understand (1) Is data being queued because the CPU is overloaded? (2) How far behind is the processing queue? (3) Is there enough room in my internal buffers to continue with this load? (4) Are users affected right now, or will they be affected if this trend continues? Here, logging the queue size and how it is trending, and then analyzing these metrics during your load tests, gives you a level of predictability. You can then plan how your system should react when, for example, buffers are filling up.
Combining load testing and logging lets you stress your system over time and also see under the hood, so you can figure out which parts of your system are overheating and why. Load testing without logs is a little like flying a plane without ever looking at your dashboard.
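For the queue example, a small sketch of the "how long until the buffer fills" question might look like the following. It assumes you already log `(timestamp, queue_depth)` pairs and know the buffer capacity; the function name and the simple first-to-last growth estimate are illustrative choices, not a prescribed method.

```python
def seconds_until_full(samples, capacity):
    """Estimate seconds until a queue reaches capacity at its current growth rate.

    samples: list of (timestamp_seconds, queue_depth) tuples ordered by time.
    Returns None when the queue is not growing over the sampled window.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (t_first, depth_first), (t_last, depth_last) = samples[0], samples[-1]
    if t_last <= t_first:
        raise ValueError("samples must span a positive time interval")
    # Average growth rate over the window, in items per second.
    rate = (depth_last - depth_first) / (t_last - t_first)
    if rate <= 0:
        return None
    return (capacity - depth_last) / rate


if __name__ == "__main__":
    # Made-up queue depths logged once a minute during a load test.
    logged = [(0, 100), (60, 400), (120, 700)]
    print(seconds_until_full(logged, capacity=1000))
```

Logging this estimate alongside the raw queue depth gives you a leading indicator: users are not affected yet, but you can see how long the current load can be sustained.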