|
MagnumOpus posted:<intermittent network failures> Ideas?

1. Asymmetric routing. It's quite common, and if you run mtr and watch packets get dropped at roughly a 50% duty cycle across a connection while you have two primary network paths available, you're looking at this as a fundamental problem. This is what oftentimes occurs between two different physical networks, like across WANs and BGP where you advertise AS paths and sometimes the other peer doesn't quite respect your routing prefixes. AWS does respect this unlike many others, so request them, dammit. I used mtr to diagnose this problem live as it happened. Sadly, I'm not even a network admin, and using it taught our network architects a new tool to use (yeah... that's not a good sign when your random-rear end contracted devops guy is figuring poo poo out for your supposedly best network guys).

2. As mentioned above, mismatched MTUs. Note that AWS VMs use an MTU of 9001 by default, and despite being off by one they can chunk 1500-byte multiples fine, but having to convert a lot can result in packet fragmentation problems that ultimately translate into retransmits and longer packet reassembly times.

3. Just check your TTLs to make sure they're not expiring once in a while on a really, really, really complicated network. Had a user on a 40+ hop network complaining that he couldn't get to AWS VMs reliably because it was so slow. Half his packets were dropping because an ancient network (literally almost as old as me) was shoehorned onto a random-rear end backbone and so forth, and the TTL was just plain running out.

4. If you're using ping to AWS (doubtful, you're with an OpenStack provider), AWS has told me they're supposed to drop somewhere around 10% of ping traffic for performance reasons - check that your provider is not doing traffic shaping or anything similar to cause this.

5. Our instances (running VMware) are on severely overprovisioned clusters and drop pings randomly from the underlying hardware just plain not keeping up, to the point where our software HA solution is more of a liability: it detects 3 ping failures and tries to fail over, and that's about when it fails back, so we get all sorts of inconsistent-state problems. Check /var/log/dmesg for kernel messages.

For some general ideas, Brendan Gregg's book and his website have all sorts of solid methodologies for "figure out wtf is going wrong" and "why is poo poo so slow?" problems. To more directly answer your monitoring question, you seem to need event correlation alongside your network monitoring. We're just running Graphite with Sensu grabbing NIC metrics and shoving them onto the AMQP bus, and I map different time series together onto the time domain and look for patterns. A lot of this tends to just plain suck because our infrastructure has a serious case of clock skew - ntpd doesn't even work and half our clocks are off by 4+ minutes - but looking at how your TCP stack behaves alongside the rest of your system state is handy when running applications. In most cases, threatening to drop your provider because you're having intermittent network problems will almost always get them on the phone and trying to diagnose your issue right away. You can improve your chances of faster resolution by providing network analysis for the vendor's network folks trying to eliminate the above issues (mtr - newer versions support MPLS labels btw - sar, maybe nmap for its peculiar traceroute methods, TCP statistics from tcpdump, etc.).
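To put numbers on the MTU mismatch point above: a 9001-byte jumbo datagram that has to cross a 1500-byte MTU path gets sliced into a pile of IP fragments, and losing any one of them costs you the whole datagram. A back-of-the-envelope sketch (the 20-byte IPv4 header is the only assumption; real fragmentation also rounds payloads to 8-byte offsets, which happens to come out even here):

```python
import math

def fragments(payload: int, mtu: int, ip_header: int = 20) -> int:
    """Number of IPv4 fragments needed to carry `payload` bytes of
    IP payload across a link with the given MTU (each fragment
    repeats the IP header)."""
    per_fragment = mtu - ip_header
    return math.ceil(payload / per_fragment)

# A full-size 9001-byte jumbo datagram (8981 bytes of payload after
# the IP header) crossing a 1500-byte MTU path becomes 7 fragments;
# lose any one of them and the whole datagram has to be retransmitted.
print(fragments(8981, 1500))  # → 7
```

Seven chances to lose one packet instead of one is exactly where the "intermittent" flavor of these failures comes from.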
|
# ¿ Dec 5, 2015 05:37 |
|
Vulture Culture posted:I've almost never found time series to be useful for anything in recent memory -- though I once burned 3,000 IOPS and 400 GB of disk space on Graphite chasing down a single NFS performance regression in an IBM software update (have you run across the mountstats collector for Diamond? That was me, for this problem) -- but I agree completely about Brendan Gregg and his USE method, and about the need for correlation. At the most basic level, this means system clocks corrected to within a second or two, and reasonable log aggregation to help determine exactly what's going on in a distributed system. The Graphite bits are super useful if you find yourself looking at actual NIC errors, but if you're divorced from the physical hardware in a private cloud, you'll see dwindling returns. There's no way I'd have found a lot of problems without tcpdump / Wireshark: bad NAT configurations, misbehaving firewalls, traffic shapers gone mad, and everything else in your usual enterprise network of madness. AWS offering Flow Logs would be great if I could get any drat access to actually use the feature, instead of having to do crazypants things like e-mailing legal to ask if I can directly dump and share info off a BGP switch with Amazon support. But our cloud maturity level is pretty bad, so I suspect we won't find one thing wrong with an AWS service for every 400 things that are our fault. Our poor Amazon account rep. Collecting which errors start showing up at what times is helpful when you're trying to at least rule out obvious problems like cron jobs or vMotion, and when you want to quickly share error stats at certain points in time with third parties, including your cloud provider.
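The "which errors show up when" correlation boils down to bucketing timestamped events into fixed windows and eyeballing co-occurrence, which is also why clocks need to agree to within a second or two: skew lands events in the wrong bucket. A toy sketch (the event names and timestamps are made up for illustration):

```python
from collections import Counter

def bucket(events, width=60):
    """Map (timestamp, label) events into fixed-width time buckets
    so spikes from different sources can be lined up side by side."""
    out = {}
    for ts, label in events:
        out.setdefault(ts // width * width, Counter())[label] += 1
    return out

events = [(100, "nic_err"), (110, "ping_loss"), (115, "nic_err"),
          (400, "ping_loss")]
for window, counts in sorted(bucket(events).items()):
    print(window, dict(counts))
```

If "nic_err" and "ping_loss" keep landing in the same windows as your cron or vMotion schedule, you have your correlation without needing anything fancier than this.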
|
# ¿ Dec 8, 2015 06:20 |
|
Tab8715 posted:Extracurricular question, but when it comes to massive web-based SaaS applications like Facebook, Salesforce, and Apple iCloud, what are they using for their Directory Service?
|
# ¿ Dec 26, 2015 06:12 |
|
Sounds like someone customized that Horizon portal..... poorly.
|
# ¿ Jan 12, 2016 19:39 |
|
Wait... no backups? Although ironically, my group trying to set up backups for our Chef installation (fewer than 10k nodes across thousands of users, and people think we're getting overloaded first thing, lol) has caused more serious outages and failures than if we had never bothered with backups. Hell, we had Netbackup running (which, also ironically, was causing some of the outages) that could have restored our systems, so we had double backups going for a while. The same goes for our Chef HA setup: it's caused more availability loss than anything else, due to split-brain problems whenever our really, really flaky ESX cluster (running way, way overprovisioned on everything) randomly drops ICMP packets a few times in a row even though no vMotions are happening.
|
# ¿ Jan 13, 2016 19:44 |
|
Half the places I've seen using OpenStack come from a VMware-ish or traditional infrastructure history where every host is treated as a pet, so everyone should already be in the habit of running high-availability storage and networking to some degree. If people are starting to deploy OpenStack thinking it magically fixes all of that for you, I dunno if they just became brain damaged or what.
|
# ¿ Jan 14, 2016 05:34 |
|
I've learned to just say "Yes, we can do it, but... here's what you're trading off" and document it carefully, while I silently work on counter-measures / contingencies for the inevitable failure of the bad design. If a place can't get the big-picture system design right, there's a fair chance they're not going to get even basic stuff like HA networking and storage right, and you're really on your own. Replication factor of 1? Absolutely fine for your goal of cost savings, Mr. SVP! I'll check with the backup guys in the meantime that we've added all of these to the backup inclusions, if that's alright with you?

Vulture Culture posted:This is a great reason to have devs responsible for operating their own software

Also, gently caress the developers that think reading highscalability.com posts makes them a goddamn CCIE and VCDX architect.
|
# ¿ Jan 14, 2016 23:02 |
|
MagnumOpus posted:For example, they were recently stumped by the eventually-consistent nature of Cassandra, forcing a massive refactor of their app, and don't even get the PM started on how badly they missed the mark on estimating SSD storage costs.

cheese-cube posted:Yeah I've read that article and unfortunately it looks like system routes cannot be overridden. Was hoping that there was a way to do it but oh well. Thanks

Ants posted:It's the Microsoft Excel of the virtualization stacks.

Ixian posted:If you are googling "eventually consistent" at 3am after the poo poo has hit the fan and splattered around the room you just might be an Openstack user.
|
# ¿ Jan 15, 2016 03:23 |
|
I don't think that's possible. I know I can do it with the API in VMware's stuff, but in EC2 you have to launch an instance to create or clone one, and launching means it gets put into the pending and then running states. The official lifecycle document from AWS pretty much means that an instance has no state before Pending: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-lifecycle.html The only sneaky way I could think of is to create an AMI that immediately shuts the machine down before the bootloader kicks in the first time you launch it, but I suspect that won't help because it might have to be put into the Running state first for that to take effect. If you need to queue up a bunch of instances to handle something like a spike load, it's easy to forget that you have to let AWS know ahead of time so that your ELBs don't get run over by a freight train. I think that applies even to internal ELBs.
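For what it's worth, the lifecycle doc linked above reduces to a small state machine, and "no state before Pending" falls straight out of it. A simplified sketch (rebooting and a couple of minor edges omitted):

```python
# Simplified EC2 instance lifecycle, per the AWS user guide linked above.
TRANSITIONS = {
    "pending": {"running"},
    "running": {"stopping", "shutting-down"},
    "stopping": {"stopped"},
    "stopped": {"pending", "terminated"},   # start again, or terminate
    "shutting-down": {"terminated"},
    "terminated": set(),
}

def can_reach(src: str, dst: str) -> bool:
    """Graph search over the lifecycle: is dst reachable from src?"""
    seen, frontier = set(), {src}
    while frontier:
        state = frontier.pop()
        if state == dst:
            return True
        seen.add(state)
        frontier |= TRANSITIONS[state] - seen
    return False

# Every path to "running" goes through "pending"; nothing comes before it.
print(can_reach("pending", "running"), can_reach("stopped", "running"))
```

Which is why the only way to end up with a stopped instance is to go through running first and shut it down from there.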
|
# ¿ Apr 15, 2016 23:05 |
|
Even more roughly: block storage is meant to be used as if it's attached to a single, mutually exclusive consumer (usually one entity, like a virtual machine), while object storage is generally meant to be accessed from remote locations and should support access semantics appropriate for those use cases. At least that explanation was sufficient for people who weren't familiar with AWS, in my experience.
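For the AWS-unfamiliar crowd, the two access models can be caricatured in a few lines; this is a deliberately naive sketch, not any particular backend's API:

```python
class BlockDevice:
    """Block semantics: one attached consumer, random-access reads and
    writes at byte (really sector) offsets, like a local disk."""
    def __init__(self, size):
        self.data = bytearray(size)
    def write(self, offset, buf):
        self.data[offset:offset + len(buf)] = buf
    def read(self, offset, length):
        return bytes(self.data[offset:offset + length])

class ObjectStore:
    """Object semantics: whole-object GET/PUT by key over the network;
    no in-place updates -- you replace the entire object."""
    def __init__(self):
        self.objects = {}
    def put(self, key, body):
        self.objects[key] = bytes(body)
    def get(self, key):
        return self.objects[key]

disk = BlockDevice(1024)
disk.write(512, b"journal")              # in-place update at an offset
store = ObjectStore()
store.put("backups/db.tar.gz", b"...")   # replace-whole-object only
```

The filesystem-on-a-disk versus bucket-of-blobs distinction follows directly from those two interfaces.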
|
# ¿ Aug 23, 2016 14:48 |
|
Google isn't particularly well known for being hands-on and personal with their support. On the other hand, scaling support the way AWS does really sucks and is really expensive, so I'd wager it'll primarily go to enterprise accounts in practice.
|
# ¿ Aug 23, 2016 16:29 |
|
Novo posted:When I set up 2FA I always enroll both my phone and my tablet using the same QR code, in case one of them dies.

The problem with a shared MFA token stuck in a vault is that you can't necessarily revoke access from everyone who has seen it, even after the emergency is over. You'll also need a way to confirm or enforce rotation / revocation of existing MFA tokens if your security people are stringent. I still had all my MFA keys after I left my last job, and my account logins were all disabled once I lost access to my e-mail, but if it's your AWS root account you may not want to disable it completely outright (although AWS will tell you that you totally should go whole hog with IAM roles out the wazoo everywhere and not bother using root accounts). Root account credentials for AWS accounts at my last place were stored at datacenters using HSMs (there were 90+ AWS accounts - not exactly change between the couch cushions).
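The reason you can't selectively "revoke" a shared QR code is that the code is just a seed: every device (or person) that ever scanned it derives identical one-time codes forever until you rotate the seed itself. A minimal RFC 6238 TOTP sketch makes that concrete:

```python
import hmac, hashlib, struct

def totp(secret: bytes, unix_time: int, digits: int = 6, step: int = 30) -> str:
    """RFC 6238 TOTP: HMAC-SHA1 over the time-step counter,
    dynamically truncated to a short decimal code."""
    counter = struct.pack(">Q", unix_time // step)
    digest = hmac.new(secret, counter, hashlib.sha1).digest()
    offset = digest[-1] & 0x0F
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)

# RFC 6238 test vector: anyone holding this seed gets the same answer.
print(totp(b"12345678901234567890", 59, digits=8))  # → 94287082
```

Anyone holding the seed bytes gets the same output for the same clock, which is why rotating the seed, not per-device revocation, is the only real remedy.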
|
# ¿ Sep 15, 2016 21:54 |
|
Internet Explorer posted:Not sure I understand why this is the case?
|
# ¿ Sep 15, 2016 22:42 |
|
Internet Explorer posted:Sorry, I guess I am just not understanding. How does a hardware authentication token not give out one time use keys? Doesn't that defeat the purpose?
|
# ¿ Sep 16, 2016 01:13 |
|
The other important question is "what do you already know that you think may or may not be relevant?" And secondly, do you even have an interest in those topics in the first place?
|
# ¿ Dec 21, 2016 01:30 |
|
If you're in an enterprise house of horrors, you can also use Shibboleth with Nginx (specifically via FastCGI) to perform Single Sign-On: https://github.com/nginx-shib/nginx-http-shibboleth The primary advantage of something like Shibboleth is that you can hand off the authn/authz headaches to teams of people that eat them for breakfast, while you maintain your apps and services separately from that layer. At a previous place in 2014, even the enterprise security offerings for an ELK stack weren't good enough, and we wound up deferring everything to AD / LDAP with ACL mappings. Of course, this was before they released X-Pack and such, but given how rudimentary the roadmap looked for enterprise-gently caress-cool-things shops, it wasn't going to pan out.
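For reference, wiring that module up looks roughly like the below. This is a sketch based on the nginx-http-shibboleth README: the location names, socket path, and upstream are placeholders, so check the README before copying anything.

```nginx
# Sketch per the nginx-http-shibboleth README; paths, socket
# location, and upstream names here are placeholders, not gospel.
location /secure/ {
    shib_request /shibauthorizer;   # auth subrequest to the Shibboleth FastCGI authorizer
    shib_request_use_headers on;    # expose attributes to the app as request headers
    proxy_pass http://app_backend;
}

location = /shibauthorizer {
    internal;
    include fastcgi_params;
    fastcgi_pass unix:/run/shibboleth/shibauthorizer.sock;
}
```

The app behind proxy_pass never talks SAML; it just reads the attribute headers the authorizer injects, which is the whole hand-off-to-the-identity-team appeal.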
|
# ¿ Jan 3, 2017 17:39 |
|
A while ago I think someone here wanted to know how to start AWS instances in a stopped state, and while researching an unrelated issue I found that you can launch an instance with userdata that makes the cloud-init boot script start up and immediately shut down. You cannot get an instance without incurring some form of cost one way or another, but by using something like an instance store (provided you don't mind the first-write penalty, or are OK with paying for one of the more expensive pre-warmed SSDs to avoid the first 10+ minutes of initialization) you can minimize the cost of keeping pre-baked, lukewarm AWS instances around. Not sure if anyone else's place has the horrific habit of blowing crazy amounts of money on instances that mostly sit doing nothing on the most expensive setups possible, but I'm still amazed at how many places can blow through $1MM / mo at maybe 10% resource utilization and hardly blink, even though they're typically cost-center orgs.
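The start-then-immediately-stop trick is just user-data; a sketch, assuming a Linux AMI where cloud-init executes user-data as root on first boot (on an EBS-backed instance the default instance-initiated shutdown behavior is stop, not terminate):

```shell
#!/bin/bash
# EC2 user-data: cloud-init runs this once, as root, on first boot.
# The instance goes pending -> running -> stopping -> stopped,
# leaving a pre-baked, lukewarm instance you can start again later.
shutdown -h now
```

You still pay for the brief running time and the EBS volume while it sits stopped, but that beats paying for a fleet idling at full price.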
|
# ¿ Feb 24, 2017 14:50 |
|
Crap like that is why I'm dreading deploying a brand-new bare-metal production OpenStack Newton environment. We have contracts that literally forbid us from putting customer data in AWS, plus hard dependencies on appliances that are literally hardware-only, so the best we can do is manage some bare metal and deploy Kubernetes, Swarm, or... OpenStack. In the end I'll be deploying Kubernetes on top of OpenStack and letting Kubernetes figure out how to deal with the unreliable stack below it. I figure that'll take about two years, which will be a good time to quit or die of alcohol poisoning.
|
# ¿ Mar 16, 2017 00:07 |
|
Have you confirmed that the DHCP server is seeing the request on its interfaces? Is the reply making it through firewalls and gateways? What are your port mirroring / broadcast / promiscuous interface settings in your stack?
|
# ¿ Mar 22, 2017 00:49 |