Winkle-Daddy
Mar 10, 2007
I'm hoping someone can answer a quick question about Cisco routers. I know that mtr is unreliable when reporting response times from a Cisco device that is your network gateway, because the router punts probes addressed to itself to a separate control-plane CPU (I think?) and only answers them when load is low enough to do so. This results in really erratic response times. Or at least that's my "I'm only passingly familiar with Cisco hardware" understanding.

Now let's say I'm looking for routes to the gateway that might be problematic. My approach was simply to sample mtr data over a long period of time (48 hours) and hope the noise averages out. What I saw was the following:

Host A -> Gateway -> Host B results in up to 500-700ms extremes in the data collected.
Host C -> Gateway -> Host B results in 40-60ms extremes in the data collected.

The extreme data points happen 15-20 times per hour, and the standard deviation across the Host A path is much larger. It should be noted that these are very direct routes; what I illustrated is the entire path, aside from some top-of-rack switches. My sample rate is 1-second intervals, and both tests run at the same time.
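
(For reference, a minimal sketch of this kind of 1-second sampling loop, in Python. The gateway address and log file name are placeholders, and it assumes a Unix-like host with the standard ping utility on PATH; it is an illustration of the methodology, not the actual tooling used in the tests above:)

[code]
#!/usr/bin/env python3
# Sketch of the sampling described above: probe the gateway once per second
# and append each RTT to a log for later analysis. The gateway address and
# log file name are placeholders.
import re
import subprocess
import time

GATEWAY = "192.0.2.1"        # placeholder: substitute the real gateway IP
LOGFILE = "gateway_rtt.log"  # placeholder log file

def sample_rtt(host):
    """Send one ICMP echo; return the RTT in ms, or None on loss/timeout."""
    out = subprocess.run(["ping", "-c", "1", "-W", "1", host],
                         capture_output=True, text=True).stdout
    m = re.search(r"time=([\d.]+) ms", out)
    return float(m.group(1)) if m else None

with open(LOGFILE, "a") as log:
    while True:
        rtt = sample_rtt(GATEWAY)
        log.write("%d\t%s\n" % (time.time(), rtt if rtt is not None else "loss"))
        log.flush()
        time.sleep(1)  # 1-second sample interval, matching the tests above
[/code]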

I brought my findings to the network team, but they dismissed them outright with "mtr is unreliable when used to measure performance of Cisco devices." Am I wrong to think these paths should be consistently unreliable? As in, both paths should have their traffic de-prioritized in the same way, so the mtr data should be all over the place in both cases, but the extremes should be similar in both tests.

The other thing to note is that Host C is actually a VM going through UCS, while Host A and Host B are both physical hosts. I'm just trying to assess whether this difference indicates a problem on one path and not the other.

e: the gateway in question is a beefy 7K router.



Winkle-Daddy
Mar 10, 2007

Slickdrac posted:

Is "gateway" the same exact IP/interface?
Yes

Slickdrac posted:

Do they go through the same or different switches?
Different; there's a switch at the top of every rack, and the UCS is in a different row from the physical host. Each goes through its top-of-rack switch to the gateway.

Slickdrac posted:

Is this an always-live, active network, or do the times become stable, lower, and similar to each other during off hours?
It's always live.

Slickdrac posted:

If you hit between devices within the same subnet, do you still see triple digit extremes from A?
If they're on the same subnet we do not see those extremes, but then they also do not go through the gateway (obviously). The same-subnet case has been tested less: those tests have only been run for 60-90 minutes at a time, but we have not seen a single outlier in them. Generally one hour is enough to collect a handful of outlier response times on the gateway paths.

Slickdrac posted:

Is the overall average time in the single digits or very low double digits? If the answers are yes/yes, same, minimal change, yes, and no, then it could be a bad cable to Host A or somewhere along its path. If any of those answers deviate, then it could just be the amount of traffic hitting one of the devices/interfaces on the line (or still a cable problem, because no one expects Layer 1 except helpdesk support; higher-tier support overthinks too much).

Instead of giving vague generalities, I'll provide some real numbers. These reflect the overall figures over a 48-hour period (each individual hour, which we also broke out, looks almost exactly the same):

Host A (physical) => Host B (physical), the data is for ping time to gateway
Longest Response: 670.9ms
Shortest Response: 0.3ms
Average Response: 1.27ms
Standard Deviation: 5.15ms

Host C (virtual) => Host B (physical), the data is for ping time to gateway
Longest Response: 42.7ms
Shortest Response: 0.3ms
Average Response: 0.74ms
Standard Deviation: 1.35ms
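
(For reference, a minimal sketch of how summary figures like these can be produced from a log of RTT samples, such as the one written by the loop sketched earlier. The outlier threshold is an arbitrary illustration, not a number from this thread:)

[code]
#!/usr/bin/env python3
# Sketch: summarize a log of RTT samples (timestamp<TAB>rtt-in-ms per line)
# into the kind of figures quoted above. The threshold is made up.
import statistics

THRESHOLD_MS = 100.0  # arbitrary cutoff for counting "extreme" samples

with open("gateway_rtt.log") as f:
    rtts = [float(line.split()[-1]) for line in f if "loss" not in line]

print("Longest Response:   %.1fms" % max(rtts))
print("Shortest Response:  %.1fms" % min(rtts))
print("Average Response:   %.2fms" % statistics.mean(rtts))
print("Standard Deviation: %.2fms" % statistics.stdev(rtts))
print("Samples over %dms:  %d"
      % (THRESHOLD_MS, sum(r > THRESHOLD_MS for r in rtts)))
[/code]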

So again, my assumption about how the ICMP traffic is de-prioritized leads me to believe this indicates a problem: given the amount of data collected over multiple days, I would expect the standard deviations from both tests to be about the same.

edit: Because I'm bad at stating my specific question, here it is: does this data indicate a likely problem, or is pinging a gateway via expired-TTL probes so unreliable that there is really no way to tell without bringing other tools in? (I'm testing several other things right now, because while I believe this may indicate an issue, I do not believe it indicates a very large one.)
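
(One cheap cross-check along these lines: compare pings sent *to* the gateway, which its control-plane CPU must answer, against pings sent *through* it to a host on the far side, which exercise only the forwarding path. If only the former shows the spikes, that is consistent with control-plane de-prioritization rather than a path problem. A hedged sketch in Python; both addresses are placeholders, and this is an illustration rather than anything run in this thread:)

[code]
#!/usr/bin/env python3
# Sketch: compare RTTs *to* the gateway (answered by its control-plane CPU)
# with RTTs *through* it (pure forwarding). Addresses are placeholders.
import re
import statistics
import subprocess
import time

GATEWAY = "192.0.2.1"       # placeholder: the gateway itself
FAR_SIDE = "198.51.100.10"  # placeholder: a host reached via the gateway

def sample_rtt(host):
    out = subprocess.run(["ping", "-c", "1", "-W", "1", host],
                         capture_output=True, text=True).stdout
    m = re.search(r"time=([\d.]+) ms", out)
    return float(m.group(1)) if m else None

def collect(host, n=300):
    """Collect n samples at roughly 1-second intervals, dropping losses."""
    samples = []
    for _ in range(n):
        r = sample_rtt(host)
        if r is not None:
            samples.append(r)
        time.sleep(1)
    return samples

for name, host in (("to gateway", GATEWAY), ("through gateway", FAR_SIDE)):
    data = collect(host)
    print("%-16s max=%.1fms stdev=%.2fms"
          % (name, max(data), statistics.stdev(data)))
[/code]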

Powercrazy posted:

You said in the other thread that Host A and Host C were on different subnets. This opens up a huge pool of potential causes for the latency. So, to answer your questions:
"Am I wrong to think these should be consistently unreliable?" Yes. The specific reasons why are difficult, maybe even impossible, to pin down.
"Is this actually indicative of a problem, or is this expected given the nature of Cisco devices?" It is not by itself indicative of a problem, and it is also not unique to Cisco devices, as most network stacks prioritize control traffic differently than traffic passing through them.

Thanks! I don't know if the additional data I posted above adds any context that could help in one way or another.


Winkle-Daddy
Mar 10, 2007

Slickdrac posted:

It looks like it's functioning fairly normally, potentially, without knowing just how much traffic and throughput the network and the switches are seeing. Random spikes of traffic are going to eat up clock time and create large numbers. If I were your engineer, I would just load up my SNMP monitoring and glance over interface errors, CPU utilization, and interface utilization. It doesn't seem like anything terribly odd, but I always donate a good two minutes when a single person raises a question of possible speed issues, because I'm nice, and because I don't want to look like a total rear end later when a cable/interface is starting to fail or a device is approaching overload.

But getting occasional triple-digit pings isn't terribly odd if there's a heavily utilized device in the path. We have a massive fiber ring that links four offices to each other and the data center; it'll ping consistently at 5-15 ms, but on occasion it will spike up to 200, and rarely it will stop off for donuts and come back in four-figure land. It just depends on the amount of traffic going through at that particular moment. Nothing to worry about unless a two-minute sanity check reveals the start of what could become a larger issue.
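
(A minimal sketch of the two-minute SNMP glance described above, shelling out to net-snmp's snmpget from Python. The host, community string, and interface index are placeholders; CPU utilization would need a vendor-specific MIB, which is omitted here:)

[code]
#!/usr/bin/env python3
# Sketch of the quick SNMP sanity check described above: pull interface
# error and traffic counters via net-snmp's snmpget. Host, community
# string, and ifIndex are placeholders; CPU load needs a vendor MIB.
import subprocess

GATEWAY = "192.0.2.1"  # placeholder
COMMUNITY = "public"   # placeholder read-only community string
IF_INDEX = 1           # placeholder ifIndex of the interface to check

OIDS = [
    "IF-MIB::ifInErrors.%d" % IF_INDEX,
    "IF-MIB::ifOutErrors.%d" % IF_INDEX,
    "IF-MIB::ifHCInOctets.%d" % IF_INDEX,   # 64-bit traffic counters
    "IF-MIB::ifHCOutOctets.%d" % IF_INDEX,
]

for oid in OIDS:
    result = subprocess.run(["snmpget", "-v2c", "-c", COMMUNITY, GATEWAY, oid],
                            capture_output=True, text=True)
    print((result.stdout or result.stderr).strip())
[/code]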

Cool, thanks for the suggestions! We are running performance tests and it turns out trying to characterize network performance is a hard problem :downs:
