I'm hoping someone can answer a quick question about Cisco routers. I know that mtr is unreliable when reporting response times from a Cisco device that is your network gateway, because the router punts ICMP aimed at itself to a different CPU (I think?) and only answers when its load is low enough. This results in really erratic response times. Or at least that's my "I'm only passingly familiar with Cisco hardware" understanding.

Now let's say I'm looking for routes to the gateway that might be problematic. My approach was to simply sample mtr data over a long period of time (48 hours) and then hope the data normalizes some. What I saw was the following:

Host A -> Gateway -> Host B: extremes of up to 500-700ms in the data collected.
Host C -> Gateway -> Host B: extremes of 40-60ms in the data collected.

The extreme data points happen 15-20 times per hour, and the standard deviation across the Host A path is much larger. It should be noted that these are very direct routes; what I illustrated is it, aside from some switches at the top of the rack. My sample rate is 1-second intervals, and both tests run at the same time.

I brought my findings to the network team, but they dismissed them outright as "mtr is unreliable when used to measure performance of Cisco devices." Am I wrong to think these should be consistently unreliable? That is, both paths' traffic should be de-prioritized in the same way, so the mtr data should be all over the place in both tests, but the extremes should be similar.

The other thing to note is that Host C is actually a VM going through UCS, while Host A and Host B are both physical hosts. I'm just trying to assess whether this difference is indicative of a problem through one path and not the other.

e: the gateway in question is a beefy 7K router.

Winkle-Daddy fucked around with this message at 19:59 on Nov 4, 2015
# ¿ Nov 4, 2015 19:52 |
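For anyone wanting to reproduce this kind of measurement, here is a minimal sketch of the statistics side in Python (the function name and the synthetic samples are mine, not from the thread). It summarizes a list of 1-second RTT samples the same way the numbers later in the thread are reported: longest, shortest, average, and standard deviation.

```python
import math

def summarize(rtts_ms):
    """Summarize a list of round-trip times (in milliseconds):
    longest, shortest, mean, and population standard deviation."""
    n = len(rtts_ms)
    mean = sum(rtts_ms) / n
    variance = sum((x - mean) ** 2 for x in rtts_ms) / n
    return {
        "longest": max(rtts_ms),
        "shortest": min(rtts_ms),
        "average": mean,
        "stddev": math.sqrt(variance),
    }

# Synthetic example: one hour of 1-second samples, mostly ~1 ms,
# with a single 500 ms spike of the kind described in the thread.
samples = [1.0] * 3599 + [500.0]
print(summarize(samples))
```

Collecting the samples themselves could be done with `mtr` in report mode or plain `ping` at a 1-second interval; the sample values above are placeholders, not the thread's real data.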
Slickdrac posted:
Is "gateway" the same exact IP/interface?
Do they go through the same or different switches?
Is this an always live and active network, or do the times during off hours become stable, lower, and similar?
If you ping between devices within the same subnet, do you still see triple-digit extremes from A?
Is the overall average time in the single digits / very low double digits?

If the answers are yes/yes, same, minimal change, yes, no, then it could be a bad cable to Host A or in its pathway. If any of those answers deviate, then it could just be the amount of traffic hitting one of the devices/interfaces on the line (or still a problem with the cable, because no one expects Layer 1 except helpdesk support; higher support overthinks too much).

Instead of just giving vague generalities, I'll provide some real numbers. These reflect the overall averages over a 48-hour period (each individual hour is characterized almost exactly the same; we checked that, too):

Host A (physical) => Host B (physical), ping time to the gateway:
Longest response: 670.9ms
Shortest response: 0.3ms
Average response: 1.27ms
Standard deviation: 5.15ms

Host C (virtual) => Host B (physical), ping time to the gateway:
Longest response: 42.7ms
Shortest response: 0.3ms
Average response: 0.74ms
Standard deviation: 1.35ms

So again, my assumption about the way ICMP traffic is de-prioritized leads me to believe this indicates a problem: I would expect the standard deviation from both tests to be about the same, simply due to the amount of data collected over multiple days.

edit: Because I'm bad at stating my specific question, it is: does this data indicate a likely problem, or is gateway pinging (with expired TTL) such that there is really no way to tell without getting other tools involved?

(I'm testing several other things right now because, while I believe this may indicate an issue, I do not believe it indicates a very large one.)

Powercrazy posted:
You said in the other thread that Host A and Host C were on different subnets. This opens up a huge pool of potential causes for the latency. So to answer your questions.

Thanks! I don't know if the additional data I posted above adds any context that could help one way or the other.

Winkle-Daddy fucked around with this message at 22:40 on Nov 4, 2015
# ¿ Nov 4, 2015 21:15 |
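One way to make the "both paths should be equally unreliable" argument concrete is to count extreme samples per hour on each path and compare: if control-plane de-prioritization on the gateway affected both paths equally, the spike rate and magnitude should look similar on both. A rough sketch (the 100 ms threshold and the synthetic sample lists are placeholders, not the real 48-hour captures):

```python
def spikes_per_hour(rtts_ms, threshold_ms, interval_s=1.0):
    """Count samples above threshold_ms, normalized to spikes per hour,
    assuming one sample every interval_s seconds."""
    hours = len(rtts_ms) * interval_s / 3600.0
    spikes = sum(1 for x in rtts_ms if x > threshold_ms)
    return spikes / hours

# Synthetic stand-ins for one hour of 1-second samples: path A spikes
# to ~600 ms about 18 times, path C only ever reaches ~40 ms.
one_hour = 3600
path_a = [1.0] * (one_hour - 18) + [600.0] * 18
path_c = [0.7] * (one_hour - 18) + [40.0] * 18

print(spikes_per_hour(path_a, threshold_ms=100.0))  # -> 18.0
print(spikes_per_hour(path_c, threshold_ms=100.0))  # -> 0.0
```

With the thread's real numbers, path A's maximum (670.9ms) is roughly 15x path C's (42.7ms) and its standard deviation is roughly 4x larger, which is the asymmetry the question is about.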
Slickdrac posted:
It looks like it's functioning fairly normally, potentially; without knowing just how much traffic and throughput the network and the switches are seeing, random spikes of traffic are going to eat up clock time and create large numbers. If I were your engineer, I would just load up my SNMP monitoring and do a glance over interface errors, CPU utilization, and interface utilization. It doesn't seem like anything terribly odd, but I always donate a good two minutes when a single person raises a question of possible speed issues, because I'm nice, and because I don't want to look like a total rear end later when a cable/interface is starting to fail or a device is approaching overload levels.

Cool, thanks for the suggestions! We are running performance tests, and it turns out trying to characterize network performance is a hard problem.
# ¿ Nov 5, 2015 00:06 |