Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

Bhodi posted:

Roll call, who's bought into any of this, and for how much? Spill your shame here.
Through no fault of my own, I've touched almost every major cloud technology in a real production or pre-production capacity in the last 18 months. AWS, Google Compute Engine, SoftLayer, OpenStack, and all the supporting tooling that goes along with it. (No Azure yet.) On the one hand, :suicide:. On the other hand, I've got a pretty solid perspective by now on what the strengths and weaknesses are of all these platforms relative to one another. I'll do a writeup soon.

MagnumOpus posted:

Couple years back my team built a multi-DC private cloud with VMWare ESXi for infrastructure and a combination of Chef and in-house microservices supplying the platform layer.
"Hand-fed cattle."

Vulture Culture fucked around with this message at 20:37 on Feb 20, 2015

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

Docjowles posted:

"Runs many core production services on OpenStack" guy checking in :catdrugs: Our OpenStack guru actually just gave his resignation today, so this should be fun. I understand how our environment works but not quite at his level, so, woohoo?
Because Reasons, I'm looking at OpenStack AZs that are half a blade chassis wide. We're gonna be drinking heavily over the Internet at each other.

high six posted:

I think a lot of it needlessly complicates things where it doesn't need to be used and causes a lot of unneeded issues.
On the other hand, a decent PaaS (like Cloud Foundry if you insist on hosting it yourself) does a really terrific job of keeping you from having to maintain thousands of unique VM instances just because everyone wants to run some stupid pet PHP image gallery or equivalently dumb app. So, yeah, it's about using the right tool for the job. Cloud infrastructures give you a truck full of new tools.

What's really transformative about cloud-computing technology is that it empowers all the different departments of the business to leverage code and automation in whatever way is valuable to them, without being bottlenecked by a central IT department that's getting pulled in fifty different directions by forty different departments. It frees up IT to be a strategic partner for the business, rather than just a cost center. For most applications outside core LoB, this is more important than five-nines uptime.

Vulture Culture fucked around with this message at 23:17 on Feb 20, 2015

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

Bhodi posted:

Or, if the system is particularly complex, some elements can be automated, some can't or haven't been. Like, the database piece and initial configuration is manual but bringing up additional workers to handle load is automated.
There's a risk management piece to certain kinds of automation too. I think everyone who's tried clustering MySQL or PostgreSQL on DRBD using something like Heartbeat or Pacemaker back in the day, with automated failover turned on, has run into the failover scenario where the database and disk flap back and forth between hosts until the entire underlying file structure is completely, unrecoverably corrupt. MongoDB (lol) is theoretically easy to scale out to many nodes by using node discovery in something like Chef, but some drivers (Node) are stupid and will try to connect to database copies that are still initializing, causing timeouts.

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

MagnumOpus posted:

That's the main problem with it. Like Puppet/Chef it lets you describe your networks and services well enough but the deployments are kludgy as gently caress. And when the deployment doesn't work right, triage is a nightmare because the mess of scripts tend to leave artifacts all over the place.

Also you can't use package managers in your nodes because it compiles everything on the master.
Oh boy it's just like a loving Rocks cluster but even worse

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

adorai posted:

Just wait until everyone starts doing it.
It's not really a problem in most other environments (GCE, Azure, etc.), because nobody else oversubscribes CPU resources to the level that Amazon does.

re: OnMetal, there's no hourly billing option like SoftLayer, only monthly. What's the loving point?

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.
OpenStack nerds: is there any way to have an instance automatically terminate on shutdown, like in EC2?

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

Docjowles posted:

In EC2 you can set an instance so that if it is shut down (like "shutdown -h" at the command line), it automatically terminates instead of just shutting down as it normally would. He's asking if you can do that in OpenStack. Without explicitly calling "nova delete".
Bingo. We could easily have a service stop handler on this VM image that blows up the instance immediately during shutdown, but I was hoping for something built into the compute or orchestration stack already. Not having any kind of credentials baked into this VM image is a better option for us. If you're familiar with what we do, it's obvious why.
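
For reference, the EC2 side of this is just a per-instance attribute; a minimal boto3 sketch (region and instance ID are placeholders):
code:
# Make "shutdown -h" inside the guest terminate the instance instead of stopping it
import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')
ec2.modify_instance_attribute(
    InstanceId='i-0123456789abcdef0',
    InstanceInitiatedShutdownBehavior={'Value': 'terminate'},
)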

Docjowles posted:

I don't think OS has that feature, though I could easily be wrong. What we've done when we need automation that's not provided by OpenStack itself is write a little Python app that listens to the rabbit queues and takes appropriate action. You could write a handler that listens for the instance stop message and automatically fires a delete command when one comes through, for example.
I was thinking of writing a scrubber that runs every couple of minutes (and still might, to catch edge cases), but this is a really really good idea I never thought of. Thanks!

(A generic stream processing service for OpenStack that gets messages and fires webhooks might be even nicer to have someday.)
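
To sketch out what that listener might look like -- assuming pika 0.x-style callbacks, a stock 'nova' topic exchange, the 'notifications.info' routing key, and the 'compute.instance.power_off.end' event type, all of which depend on your oslo.messaging notification settings:
code:
# Sketch only: consume Nova notifications and delete instances that report a power-off
import json

import pika
from novaclient import client as nova_client

nova = nova_client.Client('2', 'admin', 'password', 'admin',
                          'http://keystone.example.com:5000/v2.0')

def on_message(channel, method, properties, body):
    envelope = json.loads(body)
    # oslo.messaging wraps the real notification in an 'oslo.message' JSON string
    msg = json.loads(envelope['oslo.message']) if 'oslo.message' in envelope else envelope
    if msg.get('event_type') == 'compute.instance.power_off.end':
        instance_id = msg['payload']['instance_id']
        print('Deleting stopped instance %s' % instance_id)
        nova.servers.delete(instance_id)
    channel.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters('rabbit.example.com'))
channel = connection.channel()
channel.queue_declare(queue='instance-reaper')
channel.queue_bind(queue='instance-reaper', exchange='nova',
                   routing_key='notifications.info')
channel.basic_consume(on_message, queue='instance-reaper')
channel.start_consuming()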

Vulture Culture fucked around with this message at 18:54 on Mar 31, 2015

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

MagnumOpus posted:

We're currently looking into doing something along these lines with Stackstorm. Will be a while before I can give a field report though, we're still slowly unfucking the damage these devs did in the six months they were "doing DevOps" before it occurred to them to hire some people actually trained in operations.
This is really awesome, thanks for this!

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

StabbinHobo posted:

so appreciative, thanks

if I may keep going, how is some kind of console or out of band access handled? for instance, lets assume I typo something in my kickstart, how do i get on that console and hit alt+f2 to see what went wrong?

edit: oh poo poo, if customers are sharing a vlan... can I kickstart? (pxe boot?) I assume you have to handle dhcp then, not me? do you allow for configurable parameters like next-server then?
You won't be using Kickstart or PXE directly, you'll use OpenStack's facilities for uploading and distributing basic system images to your bare-metal hosts. Like other cloud services, you'll do your post-install configuration at first boot using cloud-init. I believe OnMetal uses a configuration drive rather than http://169.254.169.254 to distribute the metadata for customizing your install, but it's transparent either way -- you'll set the metadata at server creation time through the API, and cloud-init will make decisions based on that data.
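
To make the "set metadata at creation, read it back with cloud-init" flow concrete, here's a hedged python-novaclient sketch -- the image/flavor IDs, metadata keys, and cloud-config file are all hypothetical:
code:
# Attach arbitrary metadata and user-data at boot; cloud-init reads both back
# from the config drive (or metadata service) on first boot
from novaclient import client as nova_client

nova = nova_client.Client('2', 'admin', 'password', 'admin',
                          'https://identity.example.com:5000/v2.0')

server = nova.servers.create(
    name='onmetal-worker-01',
    image='IMAGE_UUID',                           # placeholder image ID
    flavor='FLAVOR_ID',                           # placeholder flavor ID
    meta={'role': 'worker', 'cluster': 'blue'},   # arbitrary key/value metadata
    userdata=open('cloud-config.yaml').read(),    # consumed by cloud-init
    config_drive=True,
)
print(server.id)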

I don't think you get any kind of console access at all with OnMetal; their documentation strongly implies that these are only available on virtual servers.

Vulture Culture fucked around with this message at 02:56 on Apr 7, 2015

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.
https://www.youtube.com/watch?v=ZY8hnMnUDjU

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

Bhodi posted:

Like the guy said, systems guys are terrible at writing APIs and unfortunately, systems guys are the only ones writing cloud software. I really think that cloudfoundry has the right idea with their wholesale stealing appropriation compatibility with AWS / EC2.

They did it the best, they are the largest, and thus the de-facto standard.
Did I miss a big announcement where Cloud Foundry got EC2-like IaaS features? Last I heard related to IaaS and CF was that Mirantis, the only OpenStack vendor that gets it, joined the Cloud Foundry Foundation back in April. Their focus has been PaaS and they've been mostly ignoring the IaaS side, outside of fuzzy lines in between like containerized apps.

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

Bhodi posted:

This came through my twitter feed today and I just have to laugh

A complicated and expensive looking contraption that won't actually work due to engineering issues? Perfect representation.
I keep running across these weird design decisions in OpenStack as I go further down the rabbit hole. I moved our Cinder storage (Ceph backend) to a private jumbo-frames network with new IPs the other day, and after reconfiguring all the clients, I couldn't figure out why it wasn't attaching to the volumes. It turns out that, despite having named multi-backend support enabled in Cinder, it had encoded the old IP addresses of my Ceph mons into the database. But not the Cinder database, where you might expect that as volume metadata -- the Nova database, on each individual block device attachment, buried in a JSON field in a database that gets converted verbatim into a section of libvirt.xml (???). Luckily I had a DB dump lying around that I could grep through, or I never would have found that poo poo.
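
If anyone ends up chasing the same thing: the stale addresses were sitting in Nova's block_device_mapping table, so a query roughly like this is the non-grep version (column names are from a Juno-era schema and the mon IP is hypothetical):
code:
-- Find attachments whose cached Ceph connection info still points at an old mon
SELECT id, instance_uuid, connection_info
FROM block_device_mapping
WHERE deleted = 0
  AND connection_info LIKE '%10.0.0.11%';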

But it's just like that Summit video says -- there are so many weird architectural decisions in OpenStack that arose strictly from people sneaking changes into weird places so they wouldn't have to deal with some other person.

Vulture Culture fucked around with this message at 22:09 on May 31, 2015

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

evol262 posted:

It's code reviewed before merge, but a lot of the weird decisions are because of this sense that openstack isn't really a platform, it's just a bunch of independent pieces talking to each other over a message bus with AAA handled by one common piece.

From the perspective of a developer who's never actually worked on a production environment and never had to migrate hardware or make big architectural changes, this stuff makes sense. Because the tests are run in autoprovisioned instances from Jenkins, not your extant environment that got a storage change.

And Cinder is just an API, so Cinder knowing anything about VMs at all means it's exceeding scope and doing more than providing anonymous storage, much less what Nova may be doing with them.

For Nova, telling a VM to go talk to Cinder every time introduces a dependence on it being available and adds extra traffic when it comes up. If the value is hardcoded, who cares? Less API calls. And if it doesn't work, you can just terminate and reprovision, right?

In some way, this actually removes all the dependence. But it doesn't make logical sense, and debugging it is a nightmare. Cloudstack and Eucalyptus do this better. "Real" openstack platforms that treat it as an integrated thing and not keystone+whatever could do this better but probably never will because vendors who are part of the openstack foundation don't want to see it work perfectly or add every possible rfe.

VMs shouldn't get the wrong volumes, though. Your provider probably hosed up cloning a LUN somewhere.
From a human factors perspective, this is a great explanation of why OpenStack is trash

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

squeakygeek posted:

It was bad enough setting up Eucalyptus. I can only imagine how painful OpenStack is.
Once you have OpenStack running correctly, it's mostly non-objectionable, except that logging/metrics/etc. are a roll-your-own kind of solution. The main issue is that getting to that point can be a 3-4 month ordeal for a single engineer, especially if you're learning storage/network layers like Ceph and Open vSwitch on top of everything else. Once you have that, your logs are still probably not parsed/formatted correctly in your logging solution of choice. It's overkill for most companies that aren't service providers or otherwise actually earning revenue from OpenStack in some fashion.

The worst part is that OpenStack's reference architecture diagrams, especially ones from vendor slides, aren't even correct. We set up a multi-master Percona XtraDB/Galera (MySQL) cluster, because this seems to be an incredibly common deployment choice. Turns out that because of the way that OpenStack (especially Nova) uses SELECT ... FOR UPDATE, you can't have more than one node taking writes even though object IDs are mostly UUID rather than auto-increment -- for most services, you have to separate out your reader/writer connections like traditional master/slave topologies.

Vulture Culture fucked around with this message at 15:48 on Jun 22, 2015

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

Buffer posted:

That's good to know, because it seems like the docs everywhere are just "do maria+galera, you'll be fine." But a lot of openstack docs are "it will work this way... in vagrant." Soo... anyway, how are you mitigating that? HAProxy in front with two vips, one for read, one for write?
Pretty much. One vIP, two ports. The writer backend is configured to prefer the local host with the other two as backups, and vice versa on the reader backend.
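
A simplified sketch of that layout -- one HAProxy shown instead of one per controller, hypothetical addresses, writes on 3306 and reads on 3307:
code:
# One vIP, two ports: writes always land on a single Galera node, reads spread out
listen galera-writer
    bind 10.0.0.10:3306
    mode tcp
    option mysql-check user haproxy
    server db1 10.0.0.11:3306 check
    server db2 10.0.0.12:3306 check backup
    server db3 10.0.0.13:3306 check backup

listen galera-reader
    bind 10.0.0.10:3307
    mode tcp
    option mysql-check user haproxy
    server db2 10.0.0.12:3306 check
    server db3 10.0.0.13:3306 check
    server db1 10.0.0.11:3306 check backup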

We're also a single-tenant organization, so we got rid of the DbQuotaDriver, which eliminates at least half of the lock contention via SELECT ... FOR UPDATE on the Nova database.
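
Dropping the DbQuotaDriver is a one-line nova.conf change on a Juno/Kilo-era deployment (the option name moved around in later releases, so treat this as a sketch):
code:
# /etc/nova/nova.conf
[DEFAULT]
quota_driver = nova.quota.NoopQuotaDriver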

This is a hilarious issue because if you watch the performance videos from the OpenStack summits, you know all about this problem (it's been covered by at least Percona and Mirantis in two different talks at two different summits), but it seems like for political reasons they don't actually go into any of this poo poo in the docs.

Vulture Culture fucked around with this message at 16:51 on Jul 15, 2015

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.
Some great slides on Kubernetes 1.0 courtesy of Rajdeep Dua with VMware (!), if container scheduling is your thing:

http://www.slideshare.net/rajdeep/introduction-to-kubernetes

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.
Some more Kubernetes resources as I find them:

Kelsey Hightower of CoreOS has published a brief Google Compute Engine lab from his talk at OSCON for getting a very basic Kubernetes configuration up and running:
https://github.com/kelseyhightower/intro-to-kubernetes-workshop

The first two chapters of his upcoming Kubernetes: Up and Running book can also be found here:
https://tectonic.com/assets/pdf/Kubernetes_Up_and_Running_Preview.pdf

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

Mr Shiny Pants posted:

I mean, every TCO study I've seen concludes that you pay more for the cloud in the long run.
ahahahaha look at this guy

Running your own physical hardware is like running your own electrical grid -- there are certain kinds of businesses it makes sense for, but you'd better be prepared to rationalize it nowadays.

Mr Shiny Pants posted:

Meanwhile hardware is getting cheaper and cheaper and you need less and less of it because it does get faster.
This is also why every decent service provider (AWS, Azure, GCE) continuously gets cheaper and adds newer, faster hardware types.

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

Mr Shiny Pants posted:

Sorry we haven't all jumped on the cloud bandwagon, maybe it is different in the states.
I'm laughing because you're talking about "the TCO" like it's remotely the same for any two companies. Everyone has vastly different methods of operating their IT services -- some people have 10,000 square feet of datacenter space that they've already paid for, and some companies have a wiring closet with a Netgear switch in it. Some companies run strategic IT in tandem with the front-end line-of-business, and others operate it as a cost center to try to squeeze out better efficiency than cloud at the cost of business agility and service focus. Any study that tells you what "the TCO" is for cloud -- in either direction -- is pulling a fast one on you. You need to calculate this for yourself.

StabbinHobo posted:

yea but, and I don't have a graph to prove this just a feel, it doesn't seem to be tracking moores law at all. 3 years of amazon price cuts might add up to a 50% price/performance improvement, but a fresh new generation of hardware will be 2x at least.

would love it if someone did the data work to prove me wrong/right
Most of the costs of cloud don't follow a 1:1 relationship with hardware, the same way that they don't in an on-premises setting. There's overhead around the cloud platform to manage all those resources, the people who have to write, maintain and operate that platform, network transit costs, etc. There's no way you're ever going to get that to 1:1 parity until Skynet self-assembles the machines into a cluster that runs themselves and teleports bits to their destinations on the other end of the wire.

Even running everything yourself, you have to deal with:

  • Physical space, power, cooling for a datacenter, obviously
  • Support staff to operate the datacenter, rack and stack hardware, track assets, develop and operate policy/practice around virtualization platforms, navigate the interdepartmental politics of "why can't I have a 32-core VM," and chase other "strategic" CIO issues like utilization-based internal billing -- time that would be better spent actually integrating with the line of business to make money
  • Support staff to handle those support staff -- management/vendor relations, purchasing, shipping/receiving, accounts payable/receivable, HR to handle these employees upon employees
  • Physical facilities around all those support staff -- offices, parking, bathrooms, Internet pipes, wi-fi access points (and more network engineers to manage those previous two things)
  • Opportunity cost when an IT vendor slips a schedule or botches a delivery (and the project managers to mitigate that risk, and the further loss of business agility from forcing yourself into a waterfall delivery model)
  • Handling whatever half-assed self-service provisioning you're bolting on top to keep Shadow IT from taking over the company from the outside in (though Cloud Foundry is getting rather good)
  • Overhead of physically integrating all of these facilities and support staff into any future mergers/acquisitions

None of this stuff is remotely free, and every dumb little complication detracts a lot from the business's ability to just be a business.

If you're fundamentally a logistics company, awesome, but there's a pretty big case to be made for "I don't want to deal with this poo poo." Cloud is not for everyone, but there's this big group of people clinging onto the legacy IT on-premises model like the people holding onto penny stocks appreciating 2% a year over the last decade because they're not actually losing money. Great, but wouldn't you still rather sell those and invest the capital into an asset that will actually make you money instead?

Vulture Culture fucked around with this message at 20:45 on Aug 15, 2015

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

MagnumOpus posted:

Whether cloud is right for you depends on a lot of factors. Can you share some about your deployment?

- Are you using database systems that are designed to scale vertically or horizontally? OLAP or OLTP workloads or both?
- Do you have spike utilization during busy hours or is your profile more stable throughout the day?
- Got private/regulated data?
- How prepared is your org for doing DevOps work? This is a big one that is often overlooked; all that elasticity (generally) only pays off if you're willing to implement and maintain systems that actually scale without constant live tinkering.
Another big one: how much data do you have, and what are you doing with it? An organization with, say, 10 PB of general-use genomics data accessed over network filesystems is probably not a great candidate for cloud. Other data warehousing activities might be better in EMR/Redshift or something.

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

Mr Shiny Pants posted:

Now that wasn't so hard now was it. ;)

Seeing Adorai's postings in the past I assumed we are not talking about a closet with a single switch.
Even then, the key questions aren't "should we use cloud?" They are "where can we strategically outsource operations?" and "at what point should we pay someone else to run [thing X]?"

Cloud vendors are aware that there's a big piece of the pie left uneaten in legacy/enterprise. I assure you, they're working hard to fix that problem.

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

KS posted:

Even if you have a bunch of legacy systems that don't fit the cloud model, getting them out of your building and making power/cooling/network someone else's problem is usually a really easy sell. I've had a few places where the combined bill from a colo is less than what it costs to just power server room critical load and HVAC at the corporate office.
Good colos are one thing, but big cloud networks at their most unreliable are still generally better than random ad-hoc datacenters these days.

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

Thanks Ants posted:

Last week I spent a couple of hours explaining AWS to someone developing web apps, I explained what part EC2, S3, RDS, VPC, Elastic Beanstalk etc. played in the overall solution, showed them some documentation as it related to Wordpress in terms of where to store static content. A pretty good overview I thought.

Today I'm getting emails telling me that Bitnami LAMP stacks are ready to use :eng99:
If AWS was easy enough for mere mortals to comprehend, we wouldn't have DigitalOcean nipping at their heels.

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

StabbinHobo posted:

if you wanted to run a cassandra ring or a xtradb cluster which three cloud providers would you run it across? I guess basically who are the top 3 where the product is similar enough to get stuff working and not poo poo for some other reason.
Are you running standard MySQL M/M replication, or Galera?

Why do you need three different cloud providers instead of three regions on the same provider?

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

Docjowles posted:

edit: xtradb cluster is Percona's fork of galera IIRC
XtraDB Cluster is a MySQL distribution that includes XtraDB and Galera, among other things. Including Galera doesn't mean you're required to use it.

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

Docjowles posted:

True, although I don't know why you'd bother to use the cluster version if you weren't going to cluster.
It's easier to just install their bundle than to integrate XtraDB into your distro's standard MySQL packages, though MariaDB includes XtraDB either way

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

Ryaath posted:

Image baking is going to lead to some real poo poo in my company... how do you do it well (hook it to app CI, etc.)? Most of our development is lovely java webapps.
If you're already doing some kind of configuration-as-code with Puppet, Chef, etc., then this is super-easy with tools like Packer (http://www.packer.io).
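
A minimal Packer template for the Java webapp case might look something like this -- the region, source AMI, and Chef run list are placeholders, and the chef-solo provisioner could just as easily be masterless Puppet or a shell script:
code:
{
  "builders": [{
    "type": "amazon-ebs",
    "region": "us-east-1",
    "source_ami": "ami-xxxxxxxx",
    "instance_type": "t2.small",
    "ssh_username": "ubuntu",
    "ami_name": "java-webapp-{{timestamp}}"
  }],
  "provisioners": [{
    "type": "chef-solo",
    "cookbook_paths": ["cookbooks"],
    "run_list": ["recipe[java-webapp]"]
  }]
}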

Ryaath posted:

Not accessing each and every vm directly is a shift for us.... Log aggregating I get, but do people just not monitor the underlying vms of their services? Or do I just accept that I'm going to attach a floating ip to each vm?
If you're already not controlling your IP assignments, something like Nagios is going to be a disaster for you anyway. Something agent-based that can self-register hosts, like Sensu, is a better fit, though there are enterprise tools that might suit your Java webapps even better.

Ryaath posted:

We're missing some key services... dnsaas mostly, but the lbaas also isn't 'production-ready' (the haproxy objects only run on 1 controller and don't fail over)... do I just cry til these get added in what I assume will be 3 years?
LBaaS doesn't have HA baked-in, but that doesn't mean you can't make it work with a little bit of effort. http://blog.anynines.com/openstack-neutron-lbaas/

If you're okay talking to the OpenStack compute API, it's really, really easy to automate HAProxy or an F5 BigIP or whatever your preferred load balancing technology is from something as simple as a Python script (I'm told there's a Powershell API client now, thanks to Rackspace). Don't rely on DNS for dynamic services.
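
As a sketch of the "just script it" approach -- the naming convention, network name, and backend port are all assumptions:
code:
# Pull instances matching a naming convention from Nova and emit HAProxy server lines
from novaclient import client as nova_client

nova = nova_client.Client('2', 'admin', 'password', 'admin',
                          'https://identity.example.com:5000/v2.0')

lines = ['backend web']
for server in nova.servers.list(search_opts={'name': 'web-'}):
    # server.networks maps network name -> list of IPs; 'private' is hypothetical
    for ip in server.networks.get('private', []):
        lines.append('    server %s %s:8080 check' % (server.name, ip))

print('\n'.join(lines))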

Vulture Culture fucked around with this message at 15:58 on Nov 9, 2015

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.
Smokeping is a good start if you think there's a problem at the physical layer, but pings will almost never reveal the kinds of problems you expect them to in production. Hitting a single endpoint probably won't reveal anything about asymmetrically misconfigured link aggregates, because you'll always be taking the same network path. Small ping packet sizes won't reveal anything related to mismatched MTUs along the network causing unexpected fragmentation. A monitor that doesn't notice out-of-order delivery won't catch UDP packets getting round-robined across paths and arriving out of sequence at an application that can't cope with it. And a ping every second or two certainly won't trigger any meddling QoS policies, and won't reveal anything in particular about links that are saturated under production load.

As a practical example, here's the kind of dumb bullshit you'll run into in some cloud networks, and no quantity of pings will ever detect it for you:
https://code.google.com/p/google-compute-engine/issues/detail?id=87

A better option is to come up with some kind of test suite that's representative of your production workload, start a packet capture (tcpdump ring buffer is an awesome option), run it until you see the issue, then inspect the network traffic. Are you seeing packets randomly arriving out of order at your endpoint? Are you receiving fragmented packets that you expect not to be fragmented? Is there significant latency between certain packets leaving the one system and arriving at the other? Is some traffic just plain missing? The best place to start is to just analyze a basic packet capture and see what Wireshark's UI flags in red. Your local network device error counters (and dmesg) are also your friend.
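
For the ring buffer piece specifically, something along these lines keeps a rolling capture window on disk without filling the volume (interface and sizes are arbitrary):
code:
# Keep ten 100 MB capture files, overwriting the oldest as you go
tcpdump -i eth0 -s 0 -C 100 -W 10 -w /var/tmp/prod-capture.pcap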

necrobobsledder posted:

For some general ideas, Brendan Gregg's book and his website have all sorts of solid methodologies for "figure out wtf is going wrong" and "why is poo poo so slow?" problems.

To more directly answer your monitoring question, you seem to need event correlation alongside your network monitoring. We're just running Graphite with Sensu grabbing NIC metrics and shoving them onto the AMQP bus and I map different time series together onto the time domain and look for patterns. A lot of this tends to just plain suck because our infrastructure has a serious case of :downs: a lot because ntpd doesn't even work and half our clocks are off by 4+ minutes, but looking into how your TCP stack behaves with the rest of your system state is handy when running applications.
I've almost never found time series to be useful for anything in recent memory -- though I once burned 3,000 IOPS and 400 GB of disk space on Graphite chasing down a single NFS performance regression on an IBM software update (have you run across the mountstats collector for Diamond? That was me, for this problem) -- but I agree completely about Brendan Gregg and his USE method, and the need for correlation. At the most basic level, this means system clocks corrected to within a second or two, and reasonable log aggregation to help determine exactly what's going on in a distributed system. The Graphite bits are super-useful if you find yourself looking at actual NIC errors, but if you're divorced from the physical hardware in a private cloud, you'll see dwindling returns.

I feel like an old neckbeard, but I've been relying more and more on sar/sysstat recently and less on stuff like collectd and Graphite. It's certainly a lot easier to scale.

Vulture Culture fucked around with this message at 05:09 on Dec 7, 2015

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

necrobobsledder posted:

(load of 45 on an 8 vCPU box in prod is scary, man)
My personal record is 253 on a quad-core running TSM

e: wait, it was dual quad-core

Vulture Culture fucked around with this message at 07:01 on Dec 8, 2015

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

Tab8715 posted:

Extra-curricular question, but when it comes to massive web-based SaaS applications like Facebook, Salesforce, and Apple iCloud, what are they using for their Directory Service?

Active Directory doesn't make sense because it's too slow for such an enormous deployment and being web-centric Kerberos/NTLM aren't a good fit. I know many will point to Azure AD but all of these services existed before AAD.

What do they use?
Almost nobody is using anything off the shelf. Most applications built to scale are storing user information in something equally built to scale for their exact use cases, and that's usually the (often NoSQL) datastore that drives the rest of their platform. Facebook uses TAO, which they talked about here. Apple recently purchased FoundationDB, and it's not an unreasonable assumption that their cloud services use it under the hood. Salesforce proper uses Oracle under the covers for mostly legacy reasons, but their subsidiaries like Heroku make heavy use of PostgreSQL, MongoDB, and others for storing user information.

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.
Route53 alias record TTLs are set to the TTL of the target.

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.
If my OpenStack servers weren't blades I would literally pull them out from the rack on their rails and poo poo inside them

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

evol262 posted:

It's because their admins didn't bother configuring a resilient backend for glance, or their storage policies weren't set up (or were misconfigured) for swift.
Or the cluster is set up with instance storage by default, which is sensible for many environments. AWS does the same thing if you launch an instance store AMI.

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

evol262 posted:

But then we have to assume that all the VMs were running on a single compute node, and that a single disk failure killed that node (or at least /var), which is even dumber.
Dumber in the sense that people are obviously doing the wrong thing with it (either expectations haven't been communicated, or people aren't listening regardless), but there's nothing at all wrong with this approach for cattle when you're running OpenStack as designed. The instances blow up, and Heat spins new ones. Why double your disk costs for no reason? If you need individual instances to be resilient against disk failures, you have Cinder.

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

MagnumOpus posted:

I feel like this is the biggest change that occurred in the web ops world in terms of its effect on my day-to-day. It used to be that no one else in the org cared about what we were doing on an infrastructure level because they knew it was a nightmare realm they dared not enter. Since the rise of everything "cloud" now we've got armies of web devs who know just enough to be dangerous. My job is now running around trying to keep assholes from loving up long-term plans by applying half-understood cloudisms to their designs. Just about every day I find a new thing that makes me go cross-eyed with rage, and on the days I don't it's because instead I spent 3 hours arguing a web dev down from his 20% understanding of platform and infrastructure concepts.
This is a great reason to have devs responsible for operating their own software

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.
Can I create a big pile of stopped instances on EC2 without powering them on (and incurring the hourly charge) first?

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

necrobobsledder posted:

I don't think that's possible. I know I can do it with the API in VMware's stuff, but in EC2 you have to launch an instance to create or clone one, and launching means it gets put into pending and then running state. The official lifecycle document from AWS pretty much means that you don't get any state of an instance before the Pending state. http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-lifecycle.html Only thing I could think of as a sneaky way possibly is to create an AMI that will immediately shut down the machine before the bootloader kicks in the first time you launch it, but I suspect that won't help because it might have to be put into the Running state first for that to kick in.

If you need to queue up a bunch of instances to be able to handle something like a spike load it's easy to forget that you have to let AWS know so that your ELBs don't get run over by a freight train. I think that applies for even internal ELBs.
That's not a concern; there are no load balancers involved on these. I just want some people here to be able to power servers on to add capacity without doing anything funny (auto-scaling groups are a really poor fit for stateful services).

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.
I'm firing these up in bulk through Terraform, so there's no big deal if I have to actually start them in response to demand. It would have been nice to be able to hand our CTO or whoever instructions to just power a bunch of stuff on, though.

(ASGs don't work for our use case for a litany of reasons I'm not going to get into.)

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

Internet Explorer posted:

Hopefully quick HIPAA / Cloud question. I don't deal with HIPAA but something I heard does not pass the smell test.

If a cloud provider has admin access to a Windows VM on their infrastructure, is it possible for them to be HIPAA compliant? I have a hard time believing that they would be without going through the same paperwork required to share HIPAA data from the owners of that data.
There's a standard for transfer of liability which is covered under the Business Associate Contracts section, so it's very possible to be HIPAA compliant if the contract language meets the standards under §164.314(a)(1). There may be specific implementation details which don't pass the smell test in other ways. For example, "a cloud provider has admin access" sounds to me like they have a single default admin account which would not fulfill the HIPAA data access auditing requirements. If a business is aware of ways the Business Associate/Covered Entity is not holding up their end of the data privacy standards, there is specific language in §164.314(a)(1)(ii) determining how an organization should address that to avoid their own complicity/liability.

Vulture Culture fucked around with this message at 19:54 on May 18, 2016

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

Bhodi posted:

VPC *IS* for site to site connections.
No it isn't. Are you confusing the VPC with the Virtual Private Gateway?

good jovi posted:

Does anyone here have any experience setting up a VPN endpoint in an AWS VPC? Everything I've been able to find seems to be aimed at site-to-site connections, rather than just something for developers to connect to. It looks like this involves running some 3rd party software appliance, rather than being built in to AWS itself. Any recommendations there?
Correct, the VPC itself does not supply a VPN gateway, and the Virtual Private Gateway product is oriented towards IPsec-only site-to-site connections. If you want to connect up random endpoints, you'll need something that supports L2TP+IPsec or PPTP. I've used Openswan in the past with some fiddling, but nowadays I'd probably use a VyOS (Vyatta fork) appliance or something else that handles the Openswan/Strongswan+xl2tpd+pppd stuff a little more transparently. If you control the endpoints, you might want to just use something easy like OpenVPN. And you'll want to turn off Source/Destination Check on your instance, just like with a NAT gateway on your VPC, or you won't be able to push traffic back out through it to your clients.
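
The Source/Destination Check bit is a single API call; a boto3 sketch with a placeholder instance ID:
code:
# Let the VPN instance forward traffic for addresses other than its own,
# same as you would for a NAT instance
import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')
ec2.modify_instance_attribute(
    InstanceId='i-0abc1234def567890',
    SourceDestCheck={'Value': False},
)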

Vulture Culture fucked around with this message at 01:24 on May 21, 2016
