Technical Case Study: Peak Time Breakdown

Since launching Streamcheck a year and a half ago, we have seen countless cases of ailing streams. During that time we've learned to identify the telltale signs of specific problems that can afflict a streaming operation. The earlier one can zero in on the problem, the quicker it can be fixed.

    • The background
    • Focusing in on the problem
    • Bring on the traceroutes
    • Conclusion


The background


Presented here is a case study of a very typical server overload. The company in question is a well-known news broadcaster, and the stream in question is a high-bitrate live audio feed of business news, targeted mostly at office listeners.

Streamcheck began monitoring this stream in November of 2001. The results caused concern right away. Although the overall connection success rate was still in the 90s, the StreamQ was in the C to D range. Such a low StreamQ means that almost half of the first 60 seconds of the stream was spent either buffering or rebuffering. Both Streamcheck and the client knew that such poor performance would cost the stream some of its audience. Since the client was preparing to begin streaming ad insertion, it was important to understand and fix this problem.

Figure 1

As Figure 1 shows, the quality dropped dramatically between 8am and 4pm, the office listening hours. The effect was also absent on weekend days (Nov 10, 11, 17 and 18 in Figure 1), further supporting the idea that the load of office-time listeners was bringing down the quality.

The next question was whether this problem was location-specific. If one of the Streamcheck Scanners had a bad route to the servers, it could bring down the whole average. Figure 2 depicts the connection success rate by location and shows that the 12 locations used in the test saw the same availability. Notice that if this were the only data you looked at, you might conclude that the quality wasn't so bad: availability per location averages about 97%. This illustrates the importance of doing a time-pattern breakdown like the one provided by the StreamQ grid.
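To illustrate why the grouping matters, here is a minimal sketch that summarizes the same set of checks two ways. The record layout, the sample values, and the `mean_by` helper are all hypothetical and are not taken from the Streamcheck data or software; the point is only that a per-location average can look healthy while a per-hour average exposes the peak-time problem.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical check records (illustrative values only):
# (location, hour_of_day, connected, frustration_seconds)
checks = [
    ("Alexandria", 3, True, 5.0),
    ("Alexandria", 10, True, 22.0),
    ("Germany", 3, True, 6.0),
    ("Germany", 10, True, 20.0),
]

def mean_by(records, key_index, value_index):
    """Average one field of the records, grouped by another field."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key_index]].append(float(rec[value_index]))
    return {key: mean(values) for key, values in groups.items()}

# Grouped by location, every scanner connects and nothing looks wrong.
availability_by_location = mean_by(checks, 0, 2)

# Grouped by hour, the office-hours degradation stands out.
frustration_by_hour = mean_by(checks, 1, 3)
```

In this toy data set both locations show 100% availability, while the average frustration time at 10am is several times the 3am figure.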

Figure 2



Focusing in on the problem


So what's the cause of the poor StreamQ? Is the player spending too much time trying to connect to the streaming server? Too much time buffering? Rebuffering? Figure 3 has some answers. This chart is built by combining the results from a given hour-long period across all the days of the test, so the first bar on the left shows the frustration time profile for all checks done between midnight and 12:59am on all 14 days. The problematic hours are obvious and agree with the StreamQ grid. During the peak hours, the buffering time (in green) roughly doubles, from around 4 seconds to around 7 seconds. Rebuffering increases to a dreadful 8 seconds on average. And the most dramatic jump is in connect time, which goes from below one second to an average of 6 seconds!
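The hour-of-day aggregation described above can be sketched as follows. This is an assumption-laden illustration, not Streamcheck's implementation: the tuple layout, the `hourly_profile` function name, and the sample values are hypothetical (the peak-hour row simply reuses the rounded figures quoted above).

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

def hourly_profile(checks):
    """Average connect, buffer and rebuffer time per hour of day.

    checks: iterable of (timestamp, connect_s, buffer_s, rebuffer_s).
    Checks from different days land in the same bucket when they share
    an hour of day, which is what combines the days of the test.
    """
    buckets = defaultdict(list)
    for ts, connect_s, buffer_s, rebuffer_s in checks:
        buckets[ts.hour].append((connect_s, buffer_s, rebuffer_s))

    profile = {}
    for hour, rows in sorted(buckets.items()):
        # zip(*rows) turns the list of rows into three columns.
        connect, buffering, rebuffering = (mean(col) for col in zip(*rows))
        profile[hour] = {
            "connect": connect,
            "buffer": buffering,
            "rebuffer": rebuffering,
        }
    return profile

# Hypothetical sample: two off-peak checks on different days, one peak check.
sample = [
    (datetime(2001, 11, 12, 0, 15), 0.5, 4.0, 1.0),
    (datetime(2001, 11, 13, 0, 45), 1.0, 4.0, 3.0),
    (datetime(2001, 11, 12, 10, 5), 6.0, 7.0, 8.0),
]
profile = hourly_profile(sample)
```

Each entry in `profile` corresponds to one bar of a chart like Figure 3, with the three components stacked.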

Figure 3

Again, it was necessary to make sure that location-specific problems weren't affecting our analysis. Figure 4 shows the same frustration time profile, sorted by location. Although there was some fluctuation in connect time (connecting from Germany took twice as long as from Alexandria), buffering and rebuffering were very close across the board.

Figure 4

So back to the main problems. Buffering and rebuffering are indicative of insufficient and inconsistent bandwidth, respectively. They can also be indicative of overloaded media servers. High connect times can be symptoms of several problems: insufficient bandwidth, poorly performing DNS, load-balancer problems, overloaded servers, and so on. These can be hard problems to diagnose. However, these problems were occurring repeatedly during peak listening hours, which allowed us to speculate that the problem was one of three things:
  • General Internet congestion at peak hours
  • Insufficient bandwidth at the stream host's data center
  • Overloaded streaming servers


Bring on the traceroutes


So how did we determine which one of these three problems was the root cause? Traceroutes. Every time a stream is checked, the Streamcheck Scanners record DNS lookup times and perform a client-to-server traceroute. If a specific Internet segment were congested, we would expect to see unacceptably high ping times in the middle of the traceroute; if the stream host had insufficient bandwidth for these streams, we would expect to see them at the host's server gateway. A review of traceroute data during the peak hours revealed acceptable ping times throughout the traceroute, so we ruled out both congestion and insufficient bandwidth. That left overloaded streaming servers.
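The reasoning above amounts to asking where along the path the slow hops first appear. Here is a minimal sketch of that decision, assuming each traceroute has been reduced to a list of per-hop round-trip times in milliseconds. The function name, the 200 ms threshold, and the choice to treat the last two hops as the host's edge are hypothetical; the case study does not state what Streamcheck considered an acceptable ping time.

```python
def classify_traceroute(hop_rtts_ms, threshold_ms=200.0, edge_hops=2):
    """Classify a traceroute by where latency first exceeds the threshold.

    Returns "path_ok" when no hop is slow, "host_edge" when the first slow
    hop is within the last `edge_hops` hops (suggesting a bandwidth problem
    at the stream host), and "mid_path" otherwise (suggesting congestion on
    an intermediate Internet segment).
    """
    slow = [i for i, rtt in enumerate(hop_rtts_ms) if rtt > threshold_ms]
    if not slow:
        return "path_ok"
    first_slow = slow[0]
    if first_slow >= len(hop_rtts_ms) - edge_hops:
        return "host_edge"
    return "mid_path"

# Hypothetical traces (values in milliseconds).
healthy = classify_traceroute([10, 20, 30, 40, 45])
congested = classify_traceroute([10, 20, 350, 360, 370])
starved = classify_traceroute([10, 20, 30, 40, 400])
```

A result of `"path_ok"` during peak hours, as in this case study, points away from the network and toward the servers themselves.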

Furthermore, a more detailed analysis of the traceroute data showed that every stream was being served from the same IP address. Since we had been informed by the client that streams were being served from a load-balanced server farm, this was concerning.
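A check of this kind can be as simple as counting the distinct server addresses seen across many stream checks. The sketch below is an illustration under stated assumptions: the function name and the `min_distinct` parameter are hypothetical, and the addresses come from the documentation-only 192.0.2.0/24 range rather than from the client's network.

```python
def looks_load_balanced(server_ips, min_distinct=2):
    """Return True if the checks were served by at least `min_distinct` addresses."""
    return len(set(server_ips)) >= min_distinct

# Hypothetical observations: the server address recorded for each check.
single_server = ["192.0.2.10"] * 50
server_farm = ["192.0.2.10", "192.0.2.11", "192.0.2.12"] * 10

single_ok = looks_load_balanced(single_server)
farm_ok = looks_load_balanced(server_farm)
```

A single address is not proof of a fault on its own, since a load balancer can present one virtual IP in front of many servers, so a result like this is a prompt to ask the service provider rather than a diagnosis.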



Conclusion


Several discussions with the client's service provider led to the correction of a problem with their load-balancing infrastructure. With the increased server capacity available, the client's stream performance improved dramatically during peak hours. Furthermore, we were able to reduce connect times overall by advising our client to remove a dynamically generated HTTP metafile that was being served from their site. The client's web servers had proved inadequate when handling a high volume of requests during peak hours, so we suggested replacing the dynamic metafile with a direct RTSP link to the streaming provider.

In the end, we were able to improve availability during peak hours to over 96% and StreamQ to A- or better. And the best news of all was that the client didn't have to spend additional money on hardware or stream hosting at all.

 
 
Streamcheck: The Streaming Metrics Provider
::: © 2003 Streamcheck. All rights reserved. :::