The Case of the Recurring Network Timeout

Crossposted from the now defunct Ticketmaster tech blog (RIP tech.ticketmaster.com), and co-authored with Audyn Espinoza

At Ticketmaster we’re passionate about monitoring our production systems. As a result, we occasionally come across interesting issues affecting our services that would otherwise go unnoticed. Unfortunately, monitoring only indicates the symptoms of what’s wrong and not necessarily the cause. Digging in deeper and getting to the root cause is a whole different ball game. This is one such example.

This story starts out with the observation that one of our web service calls had an unusually high number of timeouts. It was particularly unusual because the web service in question typically responds in about 50 ms, and the client times out at 1 s. To add to that, the metrics at the web service level were still reporting a 99th percentile response time in the 50 ms range. The issue had to be in the network between the client and the service.

We took a closer look at the metrics on the client side and a pattern emerged that we had missed earlier:

Time chart of web service response times as observed at the client


For a given cluster, the timeouts were occurring every minute at the same second mark. For example, on cluster A, timeouts would occur at 5:02:27, 5:03:27, 5:04:27, and at 5:02:55, 5:03:55, 5:04:55 on another cluster. While perplexing and a great data point, we were still nowhere close to the root cause. It was time for tcpdump.
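One way to make this kind of pattern jump out of raw client logs is to bucket the timeout timestamps by their second within the minute. The sketch below is illustrative only: the timestamps (including the date) are made up to mirror the cluster A example, and `second_mark_histogram` is a hypothetical helper, not part of our tooling.

```python
from collections import Counter
from datetime import datetime


def second_mark_histogram(timestamps):
    """Count how many timeouts fall on each second-of-the-minute (0-59)."""
    return Counter(ts.second for ts in timestamps)


# Hypothetical timeout timestamps, modeled on the cluster A example above.
timeouts = [
    datetime(2013, 1, 1, 5, 2, 27),
    datetime(2013, 1, 1, 5, 3, 27),
    datetime(2013, 1, 1, 5, 4, 27),
    datetime(2013, 1, 1, 5, 4, 41),  # an unrelated, one-off timeout
]

histogram = second_mark_histogram(timeouts)
dominant_second, count = histogram.most_common(1)[0]
print(f"{count} of {len(timeouts)} timeouts occurred at second :{dominant_second:02d}")
```

If one second mark dominates the histogram for a cluster, something periodic is almost certainly involved.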

We started snooping the chatter on the client node and on a web service node and waited patiently for the second mark where the timeouts occur.

tcpdump -i eth0 -s 0 -xX port 8080 and src <client_address> and dst <service_address>

Packet trace captured at client

Packet trace captured at server

Notice that the client sends the SYN packet, waits for the SYN/ACK from the server, doesn’t get it, and re-sends the SYN packet 3s later (3s is the default timeout for re-transmission). However, the server behaves correctly and sends the SYN/ACK in a timely manner, which the client does not receive. It was time to bring in the big guns – enter Audyn Espinoza, our resident network expert.
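The symptom in the client trace boils down to a simple check: a SYN that is sent again on the same connection means no SYN/ACK arrived in time. Here is a minimal sketch of that check over already-parsed packet records. The `Packet` tuple, its field names, and the addresses and ports are assumptions made for illustration; this is not tcpdump's output format, and parsing real captures is left out.

```python
from typing import NamedTuple


class Packet(NamedTuple):
    time: float  # seconds since capture start
    src: str     # "ip:port" of the sender
    dst: str     # "ip:port" of the receiver
    flags: str   # "S" for SYN, "SA" for SYN/ACK, "A" for ACK


def find_syn_retransmits(packets):
    """Return (connection, gap_seconds) for every SYN that was sent again."""
    last_syn = {}
    retransmits = []
    for p in sorted(packets, key=lambda pkt: pkt.time):
        if p.flags != "S":
            continue
        conn = (p.src, p.dst)
        if conn in last_syn:
            retransmits.append((conn, round(p.time - last_syn[conn], 3)))
        last_syn[conn] = p.time
    return retransmits


# Hypothetical client-side capture: the first SYN never gets a reply.
client_trace = [
    Packet(0.000, "10.0.1.5:41000", "10.0.3.9:8080", "S"),
    Packet(3.000, "10.0.1.5:41000", "10.0.3.9:8080", "S"),
    Packet(3.001, "10.0.3.9:8080", "10.0.1.5:41000", "SA"),
    Packet(3.001, "10.0.1.5:41000", "10.0.3.9:8080", "A"),
]

for conn, gap in find_syn_retransmits(client_trace):
    print(f"SYN retransmitted on {conn[0]} -> {conn[1]} after {gap}s")
```

A gap of roughly 3s between the two SYNs matches the retransmission timeout mentioned above, and with a 1s client timeout the call has already failed by then.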

Based on the information already gathered, we determined that the problem occurred specifically when traversing the hardware firewall devices. This was true whether we tested within the same data center or between data centers, which ruled out the WAN MPLS as a possible culprit. Considering that the problem was always related to establishing a TCP connection, we immediately suspected the firewall layer. The fact that firewalls (and load balancers) are the only kind of network devices that understand the preamble and the end of a TCP network connection factored into our suspicions. How could a router or switch randomly drop only the SYN/ACK packet and not normal data packets if routers and switches cannot tell the difference? Possible, but very improbable. We needed to go deeper and perform packet analysis on these devices to find the culprit.

With that said, we began our search by capturing data using OPNET along with Wireshark. Initially we started out by capturing traffic between systems across the two data centers. We immediately noticed that the SYN packet traversed the firewall successfully, but the return SYN/ACK packet was dropped by the firewall. The flow can be seen in the network diagram below:

Network Diagram between client and server where timeouts were observed

The same test was performed from the opposite direction with the same result. We could see the system properly responding with a SYN/ACK packet, but it was dropped by its local firewall. Obviously, we validated that the necessary ACL was in place, so something outside of the policy was dropping this packet.

While we now had circumstantial evidence that the issue had to be the firewall, we needed to be convinced. For instance, there was the possibility – however remote – that the issue could have been at the MPLS. So, we decided to perform another test, this time within the same data center where the packets did not cross the MPLS. In addition, we performed the network trace at the port-channels that connect directly to the physical firewall. It involved a client from Network1 attempting to establish a TCP connection to a server on Network3. If you look at the picture below, there are 3 separate VLANs on the layer3 switch and 2 Virtual firewalls that the client needs to pass through in order to reach the server.

Network diagram for detailed packet analysis

The purple line shows the SYN packet originating from the client and it crosses all 3 networks successfully. We were able to see all three instances of the same packet in our Wireshark captures.

We would expect to see three instances of the SYN/ACK packet as well, one on each network, but we only saw one instance in our captures. This told us that Virtual Firewall “B” was not sending it out into Network 2.

Three seconds later, when TCP’s RTO timer expired, the client retransmitted the SYN packet. Based on our packet captures, this time the TCP connection was successfully established: we saw 3 SYN packet instances traversing all 3 networks as well as 3 SYN/ACK packet instances.
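The counting argument above can be written down as a small sketch: record on which network each packet instance was captured, then walk the return path to find the last place the SYN/ACK was seen. The observation lists below are hypothetical stand-ins for what we read off the Wireshark captures, and `last_network_seen` is an illustrative helper, not a Wireshark or OPNET feature.

```python
from collections import Counter

# Capture points in the order the SYN/ACK travels from server back to client.
RETURN_PATH = ["Network3", "Network2", "Network1"]


def last_network_seen(observations, flags):
    """Return the last network on the return path where `flags` was captured."""
    seen = {net for net, f in observations if f == flags}
    last = None
    for net in RETURN_PATH:
        if net not in seen:
            break
        last = net
    return last


# Hypothetical observations as (network, TCP flags) pairs.
first_attempt = [
    ("Network1", "SYN"), ("Network2", "SYN"), ("Network3", "SYN"),
    ("Network3", "SYN/ACK"),  # never shows up on Network2 or Network1
]
retransmission = [
    ("Network1", "SYN"), ("Network2", "SYN"), ("Network3", "SYN"),
    ("Network3", "SYN/ACK"), ("Network2", "SYN/ACK"), ("Network1", "SYN/ACK"),
]

for name, attempt in [("first attempt", first_attempt), ("retransmission", retransmission)]:
    counts = Counter(flags for _, flags in attempt)
    print(f"{name}: {counts['SYN']} SYN, {counts['SYN/ACK']} SYN/ACK, "
          f"SYN/ACK last seen on {last_network_seen(attempt, 'SYN/ACK')}")
```

When the SYN/ACK is last seen on Network3, the device between Network3 and Network2 is where it disappeared.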

This was conclusive and indisputable evidence that the cause of the problem existed at the firewall layer. Now that we knew the firewall was indeed our culprit, we needed to focus on it to find the smoking gun.

After an hour’s worth of looking at the firewall stats, we found that every 60 seconds SNMP was hogging the CPU. We were able to see this at the system level of the firewall (not at the virtual firewall level). This is exactly the same time that spikes were observed on the web service graphs.

The root cause of this problem turned out to be an SNMP bug within the firewall code. To completely solve the TCP connection timeout problem, we disabled SNMP polling on both the firewalls and the SNMP poller.

This is a great example of how packet analysis and our monitoring tools helped Ticketmaster identify and fix an issue.

A devops lesson from Michael Connelly’s Black Echo

Michael Connelly offers up a cautionary devops tale of what can happen when your alerts are too sensitive or generate too much noise:

“The vault’s sensor alarm had repeatedly been going off all week. [The thieves], with their digging and their drills, must have been tripping the alarms. Four straight nights the cops are called out along with the manager. Sometimes three times in one night. They don’t find anything and begin to think it’s the alarm. The sound-and-movement sensor is off balance. So the manager calls the alarm company and they can’t get anybody out until after the holiday weekend, you know, Labor Day. So this guy, the manager—”

“Turns the alarm off.” Bosch finished for her.

“You got it. He decides he isn’t going to get called out each night during the weekend. He’s supposed to go down to the Springs to his time-share condo and play golf. He turns the alarms off. Of course, he no longer works for WestLand National.”

From The Black Echo – the first book in Connelly’s brilliant Harry Bosch series.

The Wisdom of the Crowd

From Wired magazine, I came across this fascinating online experiment, where Stanford researcher Erik Steiner is soliciting guesses from the Internet about how many coins are in the pictured coin jar. You can participate and submit your own guess before December 8th here.

I’ll be curious to see how this experiment pans out. His early update on the findings is interesting:

First, thanks for your participation. Second, some early returns…
So far, it turns out that the most accurate guessers are the people who spend the least amount of time thinking about it. Somewhat surprisingly, those that answered “I actually did some math” are the least accurate, on average.

At the risk of exposing my own confirmation bias, I’m not that surprised by the early findings as I suspect it is intuition – gut feel or what Kahneman calls System 1 thinking – at work. System 2 probably fails because there isn’t enough information to analytically come up with a solution.

My interest in the wisdom of the crowd is not just one of pop-science fascination, but I’ve always wondered about its applicability in forecasting large software projects. In a way, the agile world adopts crowd-sourced estimates with techniques like sprint poker and story point estimation. However, those are typically analytical exercises (System 2) and finer grained i.e., at the story level. Story point estimates can of course then be aggregated to come up with an estimate for the entire project. But, for very large projects – think Obamacare or larger – getting a backlog with enough detail and estimating each story can itself be a significant undertaking. And that is where I would be curious to look at research around crowd sourcing estimates for large software projects.

This is how I picture the experiment being structured: Engineers, product managers and program managers in an organization are provided with the project description and a way to anonymously provide a guesstimate. Maybe they are even instructed not to discuss the project amongst themselves before providing an estimate, so as to not bias[1] their individual estimates. Perhaps a control question to reveal their biases[2] would also be in order. This would not work in small organizations as you wouldn’t have enough of a “crowd” to crowd-source. The aggregated estimate (mean, geometric mean[3]?) would then have to be compared against the traditionally calculated estimate or tracked against actual project completion.
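To get a feel for how much the choice of aggregate matters, here is a small sketch using Python's standard `statistics` module (`geometric_mean` needs Python 3.8 or later). The estimates are invented for illustration and are not data from any real project or study.

```python
from statistics import geometric_mean, mean, median

# Hypothetical anonymous guesstimates, in person-months, including one
# wildly pessimistic outlier.
estimates = [18, 24, 20, 30, 22, 200]

print(f"mean:           {mean(estimates):.1f}")
print(f"median:         {median(estimates):.1f}")
print(f"geometric mean: {geometric_mean(estimates):.1f}")
```

With a single large outlier in the mix, the arithmetic mean gets pulled furthest toward it, the geometric mean less so, and the median least of all, which is why the choice of aggregate would need to be part of the experiment design.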

Even if unsuccessful, these experiments could have interesting results. Do engineers tend to be more or less accurate than program managers? Do experienced engineers tend to do better or worse than less experienced engineers at forecasting? Software estimation is notoriously hard and error prone, and, if successful, a crowd-sourced estimate could provide another useful data point to aid long term planning.

  1. The wisdom of the crowd has been proven to break down in the face of shared biases and social influence, which makes it particularly tricky to apply it to an organization where everyone typically shares the same biases. Kahneman talks about this some more in this interview. ↩
  2. Another study shows that the bias can be eliminated by identifying the independent thinkers or the not-so-easily influenced and aggregating their estimates. They even propose a way to identify the independent thinkers. ↩
  3. There are contradictory studies on the validity of the wisdom of the crowd as this post discusses a study where the geometric mean was used to massage away the wildly inaccurate guesses of the crowd. ↩

Weekly Reading List

Engineering Serendipity: On how creativity is sparked by your networks, and face-to-face communication.

Distributed Systems and the End of the API: Chas Emerick discusses issues with distributed systems and how to solve them.

The Story of the Hearing Glove: When Norbert Wiener presented a device that could translate sound to touch, people couldn’t wait to try it out. Testing didn’t go as planned.

Juking The Stats

You juke the stats and majors become colonels

If you’re a fan of the HBO show The Wire[1], “juking the stats” would be a familiar concept. In the show, Baltimore city cops – under pressure from management to improve crime numbers – resort to short term tactics that get better numbers but don’t necessarily reduce crime. Reclassifying crimes to lower categories, increasing the arrest rate by arresting for minor offenses, and underreporting crimes are all part of the playbook. And as Pryzbylewski – a former cop who becomes a teacher – later finds out, the same story repeats itself in the city schools. Under pressure from the state to improve standardized test scores, teachers focus on teaching for the tests rather than actually educating their students.

Juking the stats is, however, not just a great sound bite on a TV show. It is an all too real issue that plagues organizations – public and private sector alike[2][3]. Performance measurements introduce perverse incentives, and it is human nature for people, when measured, to optimize for the metric against which they are being judged[4].

The world of software engineering is no stranger to this problem. Software engineering and its management is a complex beast and relative to other engineering disciplines is still in its infancy. We are still figuring out effective ways to track and measure performance. Most methods are far from perfect and suffer from unintended consequences.

In some agile organizations – especially those that are new to agile – measuring team performance by their sprint velocity has become common practice. Far too often, this leads to teams – under pressure to deliver the committed story points in that sprint – unintentionally cutting corners on critical aspects like quality and testing only to pay the price later[5].

Large engineering programs require teams to report status on a weekly basis, typically as red, yellow or green or some variation thereof. The stigma attached to reporting one’s status as red can lead to teams suppressing problems. Being honest about these issues ahead of time could have fixed them, but the pressure[6] to not report red means these issues remain buried until it’s too late.

We do not put red on the board

In less mature organizations, QA teams are sometimes incentivized by the number and priority of bugs that they open. This invariably leads to bug priority inflation and battles with the development teams. Low team morale is an inevitable side effect.

Then, there is the possibly apocryphal tale of IBM incentivizing programmers by lines of code only to result in programmers intentionally writing verbose code.

In all of these cases, you see teams, when pressured by poorly designed incentives and metrics, lose sight of the long term goals and focus on the short term statistics – sometimes overtly, but usually inadvertently. Qualitative attributes like software quality, good design and resilience end up taking a back seat. Measuring and tracking performance is a good thing and is essential for continuous improvement. However, it’s just as important to be aware that, more often than not, unintended consequences may rear their ugly heads. When they do, it is imperative that leaders react and be prepared to either fix the metrics or dump them entirely.

“Don’t matter how many times you get burnt, you just keep doin’ the same.” – Bodie[7]

  1. If you’re not, you should be. Apart from having a great storyline and an excellent cast of characters, it is rich with lessons in economics, management and human behavior. ↩
  2. Crime Report Manipulation Is Common Among New York Police, Study Finds – NYT ↩
  3. Criticism for standardized testing as a measure for driving school funding as collected by Wikipedia ↩
  4. There are a number of examples of the unintended consequences of perverse incentives at play. One of the more interesting examples from Bill Bryson’s A Short History of Nearly Everything is the story of paleontologists paying the locals for each fossil fragment they turn in. The paleontologists later find that the locals were smashing larger bone fragments into smaller pieces to maximize gain and in the process rendering the fossils worthless. The authors of Freakonomics collect a few more examples in this NYT article. ↩
  5. Joshua Kerievsky makes an interesting case for doing away with story points. ↩
  6. albeit imagined ↩
  7. Epigraph from “Time after Time”, season 3, episode 1 of The Wire ↩

Weekly Reading List

How Complex Systems Fail: A great, but short paper by Dr. Richard Cook, about how complex systems fail, drawn from his experience in health care. Bonus: Dr. Cook’s presentation at Velocity Conf.

The Trouble With Harvard: The Ivy League is broken and only standardized tests can fix it

I’m Chevy Chase and You’re Not: An excerpt from Saturday Night: A Backstage History of Saturday Night Live

Dear Novak, Love Roger: Novak Djokovic and Roger Federer exchange emails during the US Open Final – another masterpiece by Brian Phillips.

Why You Should Refrigerate Tomatoes and Ignore Anyone Who Says Otherwise: Serious Eats at it again, challenging conventional culinary wisdom with experimentation and data.

Lawrence in Arabia

Lawrence of Arabia 1919 Painting by Augustus John

I recently read Lawrence in Arabia: War, Deceit, Imperial Folly and the Making of the Modern Middle East – Scott Anderson’s history of the Middle East circa World War I with T. E. Lawrence as the central character. While the book revolves around Lawrence, Anderson also introduces a coterie of peripheral characters who were just as influential in shaping the history of the time.

Unlike what you might expect from a book on history, the book maintains a breezy pace and is a page turner. Anderson also presents a balanced appraisal of the characters involved – Lawrence in particular, which given his legendary status couldn’t have been easy. What makes the book particularly engrossing and relevant is how consequential those events that occurred nearly a hundred years ago were – you could draw a direct line from the events described in the book to the state of the Middle East and maybe even the world today. A must read for a history buff and highly recommended even otherwise.

Federer is back… maybe

Great to see Federer back in top form against Del Potro. Not all of it was pretty, but there were moments where the Federer of old shone through. Next up Nadal. My prediction: they split the first two sets and Nadal runs away with the third, though I’ll be rooting for Federer all the way.