necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost

MagnumOpus posted:

<intermittent network failures> Ideas?
This happens everywhere I've been, and after fighting these issues one by one for months across what is supposedly one of the largest enterprise networks in the world, it usually turns out to be something really asinine in the end. Here's what I've found as culprits before:

1. Asymmetric routing. It's quite common, but if you run mtr and watch packets get dropped at roughly a 50% duty cycle across a connection, and you have two primary network paths available, you're looking at this as the fundamental problem (see the sketch after this list). It often shows up between two different physical networks, like across WANs with BGP, where you advertise AS paths and the other peer doesn't quite respect your routing prefixes. AWS does respect this, unlike many others, so ask them, dammit. I used mtr to diagnose this problem live as it happened. Sadly enough, I'm not even a network admin, and using it taught our network architects a new tool (yeah.... that's not a good sign when your random-rear end contracted devops guy is figuring poo poo out for your supposedly best network guys)
2. As mentioned above, a mismatched MTU. Note that AWS VMs use an MTU of 9001 by default, and despite being off by one they can chunk multiples of 1500 fine, but having to convert a lot of traffic can cause packet fragmentation problems that ultimately translate into retransmits and longer packet reassembly times.
3. Check your TTLs to make sure they're not occasionally expiring inside a really, really, really complicated network. I had a user on a 40+ hop network complaining that he couldn't reach AWS VMs reliably because everything was so slow. Half his packets were dropping across an ancient network (literally almost as old as me) shoehorned onto a random-rear end backbone and so forth, and the TTL was just plain running out.
4. If you're using ping against AWS (doubtful, since you're with an OpenStack provider): AWS has told me they're supposed to drop somewhere around 10% of ping traffic for performance reasons - check that your provider isn't doing traffic shaping or anything similar that could cause this.
5. Our instances (running on VMware) sit on severely overprovisioned clusters and drop pings randomly because the underlying hardware just plain can't keep up, to the point where our software HA solution is more of a liability: it detects 3 ping failures and tries to fail over, and that's about when it fails back, so we get all sorts of inconsistent-state problems. Check /var/log/dmesg for kernel messages.
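
Since mtr keeps coming up, here's a minimal sketch of how I'd automate the check in item 1 - run mtr in report mode and flag hops sitting near that 50% loss duty cycle. Assumes mtr is installed; the target host is a placeholder:

code:
#!/usr/bin/env python3
"""Rough sketch: flag hops whose loss hovers near 50%, the asymmetric
routing signature from item 1. Assumes mtr is installed."""
import subprocess

def lossy_hops(target, cycles=100):
    out = subprocess.run(
        ["mtr", "--report", "--report-cycles", str(cycles), "--no-dns", target],
        capture_output=True, text=True, check=True).stdout
    suspects = []
    for line in out.splitlines():
        parts = line.split()
        # report rows look like: "2.|-- 10.0.0.1  48.0%  100  ..."
        if len(parts) > 2 and parts[0].endswith("|--") and parts[2].endswith("%"):
            loss = float(parts[2].rstrip("%"))
            if 30.0 <= loss <= 70.0:  # "roughly a 50% duty cycle"
                suspects.append((parts[1], loss))
    return suspects

if __name__ == "__main__":
    for hop, loss in lossy_hops("example.com"):  # placeholder target
        print(f"hop {hop}: {loss:.1f}% loss - check for a second path")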

For some general ideas, Brendan Gregg's book and his website have all sorts of solid methodologies for "figure out wtf is going wrong" and "why is poo poo so slow?" problems.

To more directly answer your monitoring question, you seem to need event correlation alongside your network monitoring. We're just running Graphite, with Sensu grabbing NIC metrics and shoving them onto the AMQP bus, and I map different time series onto the same time domain and look for patterns. A lot of this tends to just plain suck because our infrastructure has a serious case of the :downs: - ntpd doesn't even work and half our clocks are off by 4+ minutes - but looking at how your TCP stack behaves alongside the rest of your system state is handy when running applications.
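
As a sketch of what "map different time series onto the same time domain" looks like in practice: Graphite's render API can hand back JSON, so you can align two metrics on shared timestamps and compute a quick correlation. The Graphite host and metric names below are placeholders, and statistics.correlation needs Python 3.10+:

code:
#!/usr/bin/env python3
"""Sketch: pull two metrics from Graphite's render API and compute a
Pearson correlation after aligning them on shared timestamps."""
import json, statistics, urllib.parse, urllib.request

GRAPHITE = "http://graphite.example.com"  # hypothetical host

def fetch(target, window="-2hours"):
    qs = urllib.parse.urlencode({"target": target, "from": window, "format": "json"})
    with urllib.request.urlopen(f"{GRAPHITE}/render?{qs}") as resp:
        series = json.load(resp)[0]["datapoints"]  # [[value, timestamp], ...]
    return {ts: v for v, ts in series if v is not None}

def pearson(a, b):
    # align on shared timestamps only - skewed clocks ruin everything
    keys = sorted(set(a) & set(b))
    return statistics.correlation([a[k] for k in keys], [b[k] for k in keys])

if __name__ == "__main__":
    retrans = fetch("servers.web01.tcp.retransmits")   # hypothetical metrics
    latency = fetch("servers.web01.app.p99_latency")
    print(f"correlation: {pearson(retrans, latency):.2f}")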


In most cases, threatening to drop your provider because you're having intermittent network problems will get them on the phone trying to diagnose your issue right away. You can improve your chances of a faster resolution by providing network analysis for the vendor's network folks so they can rule out the issues above (mtr - newer versions support MPLS labels, btw - sar, maybe nmap for its peculiar traceroute methods, TCP statistics from tcpdump, etc.)
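
If you'd rather hand the vendor a tidy bundle than paste terminal output into a ticket, something like this rough sketch works - it just shells out to the tools above and tars up the results. The target and exact commands are assumptions; adjust for what's actually installed on your boxes:

code:
#!/usr/bin/env python3
"""Sketch: capture the output of a few network diagnostics into one
tarball to attach to a vendor ticket. Commands/target are placeholders."""
import pathlib, subprocess, tarfile, time

TARGET = "api.example.com"  # hypothetical endpoint you're fighting with
CMDS = {
    "mtr.txt": ["mtr", "--report", "--report-cycles", "60", TARGET],
    "sar_net.txt": ["sar", "-n", "DEV", "1", "30"],
    "ss_stats.txt": ["ss", "-s"],  # quick TCP retransmit/state summary
}

def collect(outdir="netdiag"):
    d = pathlib.Path(outdir)
    d.mkdir(exist_ok=True)
    for name, cmd in CMDS.items():
        try:
            out = subprocess.run(cmd, capture_output=True, text=True,
                                 timeout=120).stdout
        except (FileNotFoundError, subprocess.TimeoutExpired) as e:
            out = f"failed: {e}\n"
        (d / name).write_text(out)
    bundle = f"netdiag-{int(time.time())}.tar.gz"
    with tarfile.open(bundle, "w:gz") as tar:
        tar.add(d)
    return bundle

if __name__ == "__main__":
    print("wrote", collect())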


necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost

Vulture Culture posted:

I've almost never found time series to be useful for anything in recent memory -- though I once burned 3,000 IOPS and 400 GB of disk space on Graphite chasing down a single NFS performance regression on an IBM software update (have you run across the mountstats collector for Diamond? That was me, for this problem) -- but I agree completely about Brendan Gregg and his USE method, and the need for correlation. At the most basic level, this means system clocks corrected to within a second or two, and reasonable log aggregation to help determine exactly what's going on in a distributed system. The Graphite bits are super-useful if you find yourself looking at actual NIC errors, but if you're divorced from the physical hardware in a private cloud, you'll see dwindling returns.

I feel like an old neckbeard, but I've been relying more and more on sar/sysstat recently and less on stuff like collectd and Graphite. It's certainly a lot easier to scale.
Time series have been helpful for identifying patterns with any kind of regularity or direct correlation. A VM was having problems constantly at around 3 am, and we found a BackupExec job running alongside our VM-internal rsync grab-all backup job: disk I/O would completely stall while we took a triple whammy of drbd and network saturation as the CPU got overwhelmed with all that crap going on (a load of 45 on an 8 vCPU box in prod is scary, man). We wound up disabling the job entirely, and because our internal network is garbagetastic for our own drat needs, we have to wait about another 6 months for a major network redesign (again - they have to do this almost every year) just to get two stupid VNICs running on two different VLANs.

There's no way I'd have found a lot of problems without tcpdump / Wireshark - bad NAT configurations, misbehaving firewalls, traffic shapers gone mad, and everything else in your usual enterprise network of madness. AWS offering Flow Logs would be great if I could get any drat access to actually use the feature, instead of having to do crazypants things like e-mailing LEGAL to ask whether I can directly dump and share info off a BGP switch with Amazon support. But our cloud maturity level is pretty bad, so I suspect we won't find one thing wrong with an AWS service for every 400 things that are our fault. Our poor Amazon account rep :smith:

Collecting which errors start showing up at what times is helpful when you're trying to at least rule out obvious causes like cron jobs or vMotion, and when you want to quickly share error stats from specific points in time with third parties, including your cloud provider.

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost

Tab8715 posted:

Extracurricular question, but when it comes to massive web-based SaaS applications like Facebook, Salesforce, and Apple iCloud, what are they using for their directory service?
This is specific to Salesforce, but everything I know about Salesforce's primary platform, after talking to various employees over the years, is that they actually don't have that many servers (< 1500) and that they don't even use virtualization, so my guess is that software-wise they're still probably fine with LDAP. I know they aren't using CA SiteMinder (the only alternative to Active Directory worth a drat that gives any consideration to dinosaur enterprise needs and enterprise feature checkboxes). And to add to the above, Salesforce's primary tech stack was built around hiring Oracle engineers (Benioff is ex-Oracle himself, after all), so they went with Oracle because it made the most sense for them. Salesforce's scaling problems look almost nothing like the ones facing most B2C platforms, and they have far more regulatory concerns on behalf of their customers - dealing with oftentimes tech-clueless regulators is easiest if you just pull crap into a traditional enterprise RDBMS like Oracle and treat it like a typical ETL platform with the data warehousing that's the bread and butter of Oracle's business.

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost
Sounds like someone customized that Horizon portal..... poorly.

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost
Wait... no backups?

Although, ironically, my group trying to set up backups for our Chef installation (fewer than 10k nodes across thousands of users, and people think we're going to be the first thing overloaded, lol) has caused more serious outages and failures than if we had never bothered with backups at all. Hell, we had Netbackup running (which, also ironically, was causing some of the outages) and could have restored our systems from it, so we had double backups going for a while. The same goes for our Chef HA setup: it's caused more availability loss than anything else, thanks to split-brain problems on a really, really flaky ESX cluster that's way, way overprovisioned on everything, whenever we drop ICMP packets a few times in a row even though no vMotions are happening.

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost
Half the places I've seen using OpenStack come from a VMware-ish or traditional infrastructure background where every host is treated as a pet, so everyone should already be running highly available storage and networking to some degree out of habit. If people are deploying OpenStack thinking it magically fixes all of that for you, I dunno if you just became brain damaged or what.

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost
I've learned to just say "Yes, we can do it, but... here's what you're trading off," to document it carefully, and to silently work on counter-measures / contingencies for the inevitable failure of the bad design. If a place can't get high-level system design right, there's a fair chance they won't get even basic stuff like HA networking and storage right either, and you're really on your own. Replication factor of 1? Absolutely fine for your goal of cost savings, Mr. SVP! I'll check with the backup guys in the meantime that we've added all of these to the backup inclusions, if that's alright with you?

Vulture Culture posted:

This is a great reason to have devs responsible for operating their own software
Yeah, and this is why I wound up doing ops for a software team in the end: I had to make up for all the other software engineers who couldn't wrap their heads around (and, more importantly, didn't want to care about) why you shouldn't just run JBODs, unmask all LUNs on the SANs, stop iptables, disable SELinux and AppArmor on a public-facing web server, or have flat 10.0.0.0/8 subnets everywhere. Going fast by removing safety measures seems to be fashionable among developers who don't care, because "I made it compile and pass my meaningless unit tests with like 20% coverage" is the modus operandi of developer-centric start-ups, and because they're paid stupidly high wages they feel they're an authority on everything besides just plain code. Nowadays I'm banging my head against ops engineers who don't understand software from this century or how to plan infrastructure for "all these newfangled clouds."

Also, gently caress the developers that think reading highscalability.com posts means they're a goddamn CCIE and VCDX architect.

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost

MagnumOpus posted:

For example, they were recently stumped by the eventually-consistent nature of Cassandra, forcing a massive refactor of their app, and don't even get the PM started on how badly they missed the mark on estimating SSD storage costs.
In my world, if you're an application architect making really big decisions, you should really, really, really pay attention to the nature of your loving persistence layers and know exactly what you're trading off when you move between different database solutions. That's plain incompetence / laziness and has nothing to do with whether they're decent at operational considerations or not. If you've ever read anything about NoSQL, you know that almost everybody is trading strong consistency for eventual consistency - another example of how skimming blog posts, with "hands-on experience" from some BS applications in the 90s, and calling yourself a Big Data architect is fashionable in tech (start-up hipsters do this poo poo too).
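
For the Cassandra case specifically, the trade-off isn't even hidden - consistency is a per-query knob in the driver. A minimal sketch with the DataStax Python driver (the cassandra-driver package); host, keyspace, and table names are all made up:

code:
"""Sketch: Cassandra's consistency is tunable per statement, so
"eventually consistent" is a choice you make, not a surprise."""
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["cassandra.example.com"])  # hypothetical node
session = cluster.connect("app_keyspace")     # hypothetical keyspace

# ONE: fast, but a read may not see a write that just landed elsewhere
fast = SimpleStatement("SELECT balance FROM accounts WHERE id = %s",
                       consistency_level=ConsistencyLevel.ONE)

# QUORUM reads + QUORUM writes overlap, so you get read-your-writes
safe = SimpleStatement("SELECT balance FROM accounts WHERE id = %s",
                       consistency_level=ConsistencyLevel.QUORUM)

row = session.execute(safe, ["acct-123"]).one()  # hypothetical row key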

cheese-cube posted:

Yeah I've read that article and unfortunately it looks like system routes cannot be overridden. Was hoping that there was a way to do it but oh well.
Those routes look pretty reasonable. In fact, they're exactly the routes I'm afraid of losing in our enterprise-as-all-goddamn-hell :downs: network routing configuration in AWS, which we're about to duplicate in Azure. An Internet subnet, a cloud-local subnet, and your private route - that's what a privileged VM should look like. It gets dicey if you want NAT instances acting as edge routers and IDS like most enterprises admittedly do, but in those cases I'd override the routes locally via something like iptables (or whatever the hell the Windows equivalent is) instead of letting anything outside the instance take precedence. Almost everything I've heard from people using Azure boils down to "we just want cheaper Windows VMs connected to the corporate network than what corporate IT can give us - oh dear god, please, anything but those" (myself included; our internal private cloud is miserable and destitute).

Thanks Ants posted:

It's the Microsoft Excel of the virtualization stacks.
Excel is easy to use, runs a great deal of the world economy, and has a fantastic user interface that very, very quickly shows its limitations before people can take it too far (and if you do take it too far, you ignored the warnings and very much deserve the rear end-pounding that comes from trying to run the equivalent of Big Data OLTP or OLAP on a single loving file). Excel is used extensively by small businesses and scales pretty well until your needs really are unique and require enough scale to warrant something serious (MS Access occupies a weird place, honestly, but I've worked with Access DBs enough to know they were definitely what people used before the era of Google Forms feeding Sheets). I had a customer that was running an $80M+ datacenter off of Excel - they wanted their configuration management, monitoring, everything to use Excel as their CMDB. I don't like CMDBs, but holy christ, please pick something you won't be modifying and read-locking throughout the day while your orchestration engine falls over trying to figure out "did I gently caress up?"

Ixian posted:

If you are googling "eventually consistent" at 3am after the poo poo has hit the fan and splattered around the room you just might be an Openstack user.

Yeah yeah, blame the user, not the software, I get it. Managing external IT and software disasters is pretty much half my job though, and I see this way too often. Software is a tool, and if you're a tool using it, any system can go bad - but something about this stack brings it out more than others. Probably not OpenStack's fault, just how it is.

I've been assisting the Fortune 500 and the federal government in various places for the better part of a decade, and the amount of incompetence associated with the OpenStack implementers out there is mostly a reflection of the fact that OpenStack has been fantastic at addressing the technical concerns and overall (typically brain damaged) use cases and needs of these organizations... without addressing any of the non-technical problems that gave these places such terrible requirements in the first place: requirements that can't scale, are extremely costly, and are generally destined for failure in every sense of the word. I do not envy OpenStack engineers in any way. It's going to be a long battle, and by the time anyone has a really great, wonderful experience with OpenStack at scale, we'll be on public clouds that probably meet the OpenStack API specs, meaning the literal only reason left to go with OpenStack will be "you own it" - and that's only required by certain government regulations based on archaic motivations that will hopefully die within our lifetime.

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost
I don't think that's possible. I know I can do it with the API in VMware's stuff, but in EC2 you have to launch an instance to create or clone one, and launching means it goes into the pending and then running states. The official lifecycle document from AWS pretty much says you don't get any instance state before Pending: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-lifecycle.html The only sneaky way I can think of is to create an AMI that immediately shuts the machine down before the bootloader kicks in the first time you launch it, but I suspect that won't help because it probably has to enter the Running state first for that to kick in.

If you need to queue up a bunch of instances to handle something like a load spike, it's easy to forget that you have to let AWS know ahead of time so that your ELBs don't get run over by a freight train. I think that applies even to internal ELBs.

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost
Even more roughly: block storage is meant to be used as if it's attached to a single, mutually exclusive device (usually one entity, like a virtual machine), while object storage is generally meant to be accessed from remote locations and should support access semantics appropriate to those use cases. At least that explanation has been sufficient for people unfamiliar with AWS in my experience.
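
A toy illustration of that distinction, with a hypothetical attached EBS volume and a hypothetical S3 bucket (needs boto3 and, for the device read, root on the instance):

code:
"""Sketch: a block volume shows up as a local device owned by one
attachment; an object store is hit over the network by any client
with credentials. Device path and bucket/key names are made up."""
import boto3

# Block storage: the EBS volume is attached to *this* instance, the
# kernel sees it as a disk, and you read raw bytes (or mount an fs).
with open("/dev/xvdf", "rb") as dev:  # hypothetical attached volume
    first_sector = dev.read(512)

# Object storage: any client anywhere with creds fetches whole objects
# over HTTPS; you replace objects rather than writing bytes in place.
s3 = boto3.client("s3")
body = s3.get_object(Bucket="my-bucket",            # hypothetical bucket
                     Key="backups/db.dump")["Body"].read()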

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost
Google isn't particularly well known for hands-on, personal support. On the other hand, scaling support the way AWS does really sucks and is really expensive, so I'd wager it'll primarily go to enterprise accounts in practice.

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost

Novo posted:

When I set up 2FA I always enroll both my phone and my tablet using the same QR code, in case one of them dies.
The Google Authenticator app syncs across all my devices in my experience. I'm not sure that's actually secure, TBQH, because it implies the secret seed and other factors are synchronized to another location away from your device.

The problem with a shared MFA token stuck in a vault is that you can't necessarily revoke access to it from whoever used it, even after the emergency is over. You'll also need a way to confirm or enforce rotation / revocation of existing MFA tokens if your security people are stringent. I still had all my MFA keys after I left my last job; my account logins were all disabled once I lost access to my e-mail, but if it's your AWS root account you may not want to disable it completely outright (although AWS will tell you that you totally should go whole hog with IAM roles out the wazoo everywhere and not use root accounts at all).

Root account credentials for AWS accounts at my last place were stored in datacenters using HSMs (there were 90+ AWS accounts - not exactly change between the couch cushions).

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost

Internet Explorer posted:

Not sure I understand why this is the case?
The MFA secret that gets handed out is not a one-time-use code for accessing the root account for a static period of time, unless you automatically reset MFA on the account after each use. Security wants one and only one person to have root at any given moment, with irrefutability and authenticity. A shared root account that someone - or several someones - can access without re-authenticating is a Bad Thing. The other option, of course, is to continuously rotate the root password, but I got some weird response to that approach despite it being a lot more sensible than rotating freakin' MFA.
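
To make the "not one-time-use" point concrete: TOTP is just an HMAC over the current 30-second counter with a static shared seed, so anyone who ever scanned the QR code can mint valid codes until the seed itself is rotated. A stdlib-only sketch of RFC 6238 (the seed below is a made-up example):

code:
"""Sketch: why a shared MFA seed can't really be "revoked" - the code
is derived from a static secret plus the clock, nothing more."""
import base64, hashlib, hmac, struct, time

def totp(secret_b32, digits=6, step=30, t=None):
    key = base64.b32decode(secret_b32, casefold=True)
    counter = struct.pack(">Q", int((time.time() if t is None else t) // step))
    mac = hmac.new(key, counter, hashlib.sha1).digest()
    offset = mac[-1] & 0x0F  # dynamic truncation per RFC 4226
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)

# two "devices" that scanned the same QR code always agree:
seed = "JBSWY3DPEHPK3PXP"  # example seed, not a real one
assert totp(seed) == totp(seed)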

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost

Internet Explorer posted:

Sorry, I guess I am just not understanding. How does a hardware authentication token not give out one time use keys? Doesn't that defeat the purpose?
I didn't mean a hardware token above, but a soft token where, for example, several users could scan the same QR code. A literal hardware token is unique and can obviously be revoked. Most of the security / compliance principles I've seen invoked for MFA-everywhere also involve guaranteeing that all access and use of privileged accounts is traceable to a human operator and that access can be revoked on demand. A root AWS account says nothing about which human logged in with it in the first place.

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost
The other important part is: "what do you already know that you think may or may not be relevant?" And secondly, do you even have an interest in those topics in the first place?

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost
If you're in an enterprise house of horrors, you can use Shibboleth with Nginx (specifically via FastCGI) to perform Single Sign-On, too: https://github.com/nginx-shib/nginx-http-shibboleth The primary advantage of something like Shibboleth is that you can potentially hand off the authn/authz headaches to teams of people who eat that stuff for breakfast, while you maintain your apps and services separately from that layer.

At a previous place in 2014, even the enterprise security offerings for an ELK stack weren't good enough, and we wound up deferring everything to AD / LDAP with ACL mappings. Of course, this was before they released X-Pack and such, but given how rudimentary the roadmap looked for enterprise-gently caress-cool-things shops, it wasn't going to pan out.

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost
A while ago, I think someone here wanted to know how to start AWS instances in a stopped state. While researching an unrelated issue, I found that you can launch an instance with user data that tells the cloud-init boot script to start up and immediately shut down. You cannot start an instance without incurring some cost one way or another, but by using something like an instance store (provided you don't mind the first-write penalty, or are OK with one of the more expensive pre-warmed SSDs, to avoid the first 10+ minutes of initialization) you can minimize the cost of pre-baked, lukewarm AWS instances.
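
Roughly, the trick looks like this in boto3 - launch with user data that powers the box off at first boot, and set the shutdown behavior so it lands in "stopped" rather than "terminated". The AMI ID and instance type are placeholders:

code:
"""Sketch of the "launch it pre-stopped" trick: user data runs at first
boot and powers the box off; the shutdown-behavior flag makes that a
stop instead of a terminate. You still pay for the boot minutes + EBS."""
import boto3

USER_DATA = """#!/bin/bash
shutdown -h now
"""

ec2 = boto3.client("ec2")
resp = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",           # hypothetical AMI
    InstanceType="t3.medium",                  # placeholder type
    MinCount=1, MaxCount=1,
    UserData=USER_DATA,                        # boto3 base64-encodes this
    InstanceInitiatedShutdownBehavior="stop",  # shutdown -> stopped state
)
print(resp["Instances"][0]["InstanceId"], "will boot once, then stop")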

Not sure if anyone else's shop has the horrific habit of blowing crazy amounts of money on instances that mostly sit doing nothing on the most expensive setups possible, but I'm still amazed at how many places can blow through $1MM / mo at maybe 10% resource utilization and hardly blink, even though they're typically cost-center orgs.

necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost
Crap like that is why I'm loathing deploying a brand-new bare-metal production OpenStack Newton environment. We have contracts that literally forbid us from putting customer data in AWS, plus hard dependencies on appliances that are literally hardware-only, so the best we can do is manage some bare metal and deploy Kubernetes, Swarm, or... OpenStack. In the end I'll be deploying Kubernetes on top of OpenStack and letting Kubernetes figure out how to deal with the unreliable stack below it. I figure that'll take about two years, which will be a good time to quit or die of alcohol poisoning.


necrobobsledder
Mar 21, 2005
Lay down your soul to the gods rock 'n roll
Nap Ghost
Have you confirmed that the DHCP server is seeing the request on its interfaces? Is the reply making it through firewalls and gateways? What are your port mirroring / broadcast / promiscuous interface settings in your stack?
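
If tcpdump output is a pain to eyeball, here's a rough scapy sketch for that first question - sniff the DHCP server's interface and print the message types it actually sees. Needs root and the scapy package; the interface name is an assumption:

code:
"""Sketch: answer "does the DHCP server even see the DISCOVER" by
sniffing BOOTP/DHCP on its interface and printing message types."""
from scapy.all import sniff
from scapy.layers.dhcp import DHCP

TYPES = {1: "DISCOVER", 2: "OFFER", 3: "REQUEST", 5: "ACK", 6: "NAK"}

def show(pkt):
    if DHCP in pkt:
        # options is a list of (name, value) tuples plus 'end'/'pad' pads
        opts = dict(o for o in pkt[DHCP].options
                    if isinstance(o, tuple) and len(o) == 2)
        print(pkt.sprintf("%IP.src% -> %IP.dst%"),
              TYPES.get(opts.get("message-type"), "other"))

sniff(iface="eth0",  # assumed interface name
      filter="udp and (port 67 or port 68)", prn=show, store=False)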
