Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

MagnumOpus posted:

That's the main problem with it. Like Puppet/Chef it lets you describe your networks and services well enough but the deployments are kludgy as gently caress. And when the deployment doesn't work right, triage is a nightmare because the mess of scripts tend to leave artifacts all over the place.

Also, you can't use package managers on your nodes because it compiles everything on the master.
Oh boy it's just like a loving Rocks cluster but even worse

Fiendish Dr. Wu
Nov 11, 2010

You done fucked up now!
I signed up for Building Cloud Apps with Microsoft Azure – Part 1 on edX. It's a 4-week class, part 1 of 3, that starts on the 31st. Figured this thread would be a good place to recruit / discuss.

quote:

This course will walk you through a patterns-based approach to building real-world cloud solutions. The patterns apply to the development process as well as to architecture and coding practices.
 
The concepts are illustrated with concrete examples, and each module includes links to other resources that provide more in-depth information. The examples and the links to additional resources are for Microsoft frameworks and services, but the principles illustrated apply to other web development frameworks and cloud environments as well.

What you will learn:

Use scripts to maximize efficiency.
Set up branching structures in source control.
Automate build and deployment with each source control check-in.
Keep web tier stateless.
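
(As a concrete illustration of that last bullet: "keep web tier stateless" in practice usually means pushing session state into a shared store so any instance behind the load balancer can serve any request. A minimal sketch, assuming Flask and Redis; the host name and cart endpoint are made up for illustration:)

    import uuid

    import redis
    from flask import Flask, request, jsonify, make_response

    app = Flask(__name__)
    # Session/cart state lives in Redis, not in web-server memory, so any
    # instance behind the load balancer can serve any request.
    store = redis.StrictRedis(host='redis.internal', port=6379)  # hypothetical host

    @app.route('/cart', methods=['POST'])
    def add_to_cart():
        sid = request.cookies.get('sid') or uuid.uuid4().hex
        key = 'cart:%s' % sid
        store.rpush(key, request.form['item'])
        store.expire(key, 3600)  # cart lives an hour, regardless of which node touched it
        resp = make_response(jsonify(items=store.llen(key)))
        resp.set_cookie('sid', sid)
        return resp

    if __name__ == '__main__':
        app.run()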

incoherent
Apr 24, 2004

01010100011010000111001
00110100101101100011011
000110010101110010
Does anyone have a good referral for an AWS consultant who primarily handles SMBs and can work with poo poo networking hardware to get a VPC up and running?

StabbinHobo
Oct 18, 2002

by Jeffrey of YOSPOS
anybody using rackspace's "OnMetal" yet?

it seems like the only way I could ever go cloud; ec2's network latency and noisy-neighbor/cpu-steal garbage is just unbearable

Less Fat Luke
May 23, 2003

Exciting Lemon

StabbinHobo posted:

anybody using rackspace's "OnMetal" yet?

it seems like the only way I could ever go cloud; ec2's network latency and noisy-neighbor/cpu-steal garbage is just unbearable
I'm running a pretty big deployment (a few hundred instances) on EC2 and I don't have either network latency or stealing problems. Only memory-optimized instances ever get steal time in our metrics, and anything gen-3 or gen-4 has incredibly low-latency network performance thanks to SR-IOV. If you're in either of the c3 or c4 generations with an HVM image it's actually remarkably close to the metal. What instance types/sizes are you using and what metrics are you seeing?
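
If you'd rather sanity-check steal yourself than trust one-minute CloudWatch averages, it's cheap to sample it straight off the guest. A minimal sketch reading /proc/stat on Linux (field order per proc(5); the 1-second interval mirrors the resolution being argued about here):

    import time

    def cpu_times():
        # First line of /proc/stat: cpu user nice system idle iowait irq softirq steal ...
        with open('/proc/stat') as f:
            fields = [float(x) for x in f.readline().split()[1:]]
        return sum(fields), fields[7]  # (total jiffies, steal jiffies)

    def steal_percent(interval=1.0):
        total1, steal1 = cpu_times()
        time.sleep(interval)
        total2, steal2 = cpu_times()
        return 100.0 * (steal2 - steal1) / (total2 - total1)

    if __name__ == '__main__':
        while True:
            print('steal: %.2f%%' % steal_percent())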

StabbinHobo
Oct 18, 2002

by Jeffrey of YOSPOS
do you graph the 90th or 99th percentile of all your stuff at a 1 second resolution?

edit: I would love to find out that c4 has solved this, but everyone I ask is looking at 1 minute averages in cloudwatch

evol262
Nov 30, 2010
#!/usr/bin/perl

StabbinHobo posted:

do you graph the 90th or 99th percentile of all your stuff at a 1 second resolution?

edit: I would love to find out that c4 has solved this, but everyone I ask is looking at 1 minute averages in cloudwatch

Is your workload actually that performance sensitive? Can you not scale it horizontally? Maybe cloud is not for you.

Less Fat Luke
May 23, 2003

Exciting Lemon

StabbinHobo posted:

do you graph the 90th or 99th percentile of all your stuff at a 1 second resolution?

edit: I would love to find out that c4 has solved this, but everyone I ask is looking at 1 minute averages in cloudwatch
We dump all kinds of metrics to Splunk and New Relic and usually look at 95th and 99th percentile graphs. Those metrics cover inter-instance calls between our applications, and aside from some variance outside our applications, we don't see any latency problems when staying in the same zone. Across AZs there are minor fluctuations, but that's expected.

Those events are logged as they come in, and there's probably never a second without a bunch of requests for every host involved.

What metric specifically are you talking about with 1-minute averages in CloudWatch? There's no "network latency" metric, as that's pretty generic and measuring it means you need a common destination (unless you're saying there's just added latency for all network activity).
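
For a crude apples-to-apples number, you can also time something from every host to one common destination and look at the tail rather than the average. A rough sketch using plain TCP connect times; the target host is a placeholder for whatever shared endpoint you pick:

    import math
    import socket
    import time

    def percentile(samples, p):
        samples = sorted(samples)
        k = max(int(math.ceil(p / 100.0 * len(samples))) - 1, 0)
        return samples[k]

    def tcp_connect_ms(host, port=80):
        start = time.time()
        socket.create_connection((host, port), timeout=2).close()
        return (time.time() - start) * 1000.0

    # 'common-target.internal' is a placeholder for whatever shared endpoint you pick
    samples = [tcp_connect_ms('common-target.internal') for _ in range(500)]
    print('p95: %.2f ms  p99: %.2f ms' % (percentile(samples, 95), percentile(samples, 99)))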

Edit: Also Splunk is magic, if you want me to throw together something to monitor across some instances I'd be happy to try anything out!

Less Fat Luke fucked around with this message at 00:54 on Mar 29, 2015

StabbinHobo
Oct 18, 2002

by Jeffrey of YOSPOS

evol262 posted:

Is your workload actually that performance sensitive?
yes, at millions of users and thousands of requests per second pretty much every ms matters (well, really every 10)

quote:

Can you not scale it horizontally?
scaling horizontally solves for capacity not latency

quote:

Maybe cloud is not for you.
I know (so far), that's why I'm asking if anyone's tried onmetal.

Less Fat Luke posted:

Those events are logged as they come and there's probably never a second without a bunch of requests for every host involved.
hundreds of instances doing <10 req/s, somebody's got an overpaid "architect" on staff

Less Fat Luke
May 23, 2003

Exciting Lemon

StabbinHobo posted:

yes, at millions of users and thousands of requests per second pretty much every ms matters (well, really every 10)

scaling horizontally solves for capacity not latency

I know (so far), that's why I'm asking if anyone's tried onmetal.

hundreds of instances doing <10 req/s, somebody's got an overpaid "architect" on staff
You're kind of being confrontational while I'm offering to help! We have 15 million users, and looking at New Relic, one of the example groups right now is doing 6.12k requests per minute at 4.09ms per response on average. These are generally non-cacheable endpoints that are personalized per user, which involves database calls.

Most of our traffic is fairly cacheable so we heavily use Varnish in EC2 as well as CloudFront (and some Akamai).

Despite you being a dick, my offer to test stuff for you still stands :)

My "never less than 10/sec" was pointing out that even our least-used hosts are still capturing endpoint requests and latency.

evol262
Nov 30, 2010
#!/usr/bin/perl

StabbinHobo posted:

yes, at millions of users and thousands of requests per second pretty much every ms matters (well, really every 10)

scaling horizontally solves for capacity not latency

I know (so far), that's why I'm asking if anyone's tried onmetal.
Is the latency that critical to end users, or is it internal to the infrastructure?

And you missed the point a little. Scaling horizontally also lets you spread the load and schedule on a flatter, larger infrastructure, where CPU usage isn't the end of your world because nothing ends up scheduled somewhere with high utilization.

You don't need to be condescending. It's unlikely that your app is a special snowflake that just can't bear what other apps do with a little rearchitecture.

Obviously steal is bad, but local interrupts are just as nasty sometimes. Are you running realtime?

What are you expecting to gain from metal that you can't get from adding more instances and caching?

StabbinHobo
Oct 18, 2002

by Jeffrey of YOSPOS
i'll just take all that as a 'no'

Proud Christian Mom
Dec 20, 2006
READING COMPREHENSION IS HARD
:rolleyes:

theperminator
Sep 16, 2009

by Smythe
Fun Shoe

evol262 posted:

What are you expecting to gain from metal that you can't get from adding more instances and caching?

He clearly said "low latency"; it doesn't matter how you scale out, it won't drop the latency below whatever the infrastructure's minimum is.

Apps can't always be restructured to suit, either. I've had customers with requirements like his that meant they needed bare-metal hardware too. It happens.

evol262
Nov 30, 2010
#!/usr/bin/perl

theperminator posted:

He clearly said "low latency"; it doesn't matter how you scale out, it won't drop the latency below whatever the infrastructure's minimum is.

Apps can't always be restructured to suit, either. I've had customers with requirements like his that meant they needed bare-metal hardware too. It happens.

It's a useless requirement unless you specify whether it's intra- or extra-environmental latency, which is why I asked.

Hoping to gain on latency by going to metal is reasonable if you can't handle it in some eventing layer. Moving off-site is a no-go if it's end users or something outside the local environment. That's why I asked.

Also, "minimum" latency for an environment doesn't play in much. You absolutely can restructure to deal with "latency" even if you can't muck with the app code, by doing what Netflix does and rescheduling instances which are seeing steal. Obviously their problems are different, but if the latency issue is steal, that's solvable.

wwb
Aug 17, 2004

FWIW we split responsibilities between orchestration tools like chef and app deployment. We render unto chef the basics of the environment -- the underlying OS services and core apache / nginx / mysql configurations and the like. We then leave apps and our CI servers responsible for talking to source control and pushing apps out to servers.

This keeps things a bit more transportable, keeps developers out of dealing with chef and the like, and seems to work pretty well, at least for our workflow and scenario -- loads of little apps owned by different groups.

Bhodi
Dec 9, 2007

Oh, it's just a cat.
Pillbug
If I had to guess, he's doing some sort of VoIP thing where low jitter is critically important. At Vonage, the primary proprietary VoIP server we ran was extremely sensitive because it was built in C and tuned over a number of years to run on and rely on bare metal. I was able to stick it in EC2 but we never got the performance we were hoping for.

It was good enough, though, because it was still cheaper than sticking a datacenter in low-subscriber companies, but it wasn't optimal. If Rackspace's bare metal had been around at that point we'd have beelined for it.

It is a bit silly that you have to turn up/turn down instances that are underperforming and kind of game the system - but that's cloud for you.

Bhodi fucked around with this message at 15:34 on Mar 29, 2015

adorai
Nov 2, 2002

10/27/04 Never forget
Grimey Drawer

Bhodi posted:

It is a bit silly that you have to turn up/turn down instances that are underperforming and kind of game the system - but that's cloud for you.
Just wait until everyone starts doing it.

Less Fat Luke
May 23, 2003

Exciting Lemon
I never have to do that. We have a bunch of alarms for steal time being too high or performance dropping, and they only fire when doing CPU-intensive tasks on memory-optimized instances (which kind of makes sense). It's still worth monitoring for both, though.

Originally we were setting things up to auto-terminate under-performing instances but now it happens so rarely that we just have a report sent to the ops team if anything like that shows up.
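
If anyone wants that kind of steal alarm without Splunk or New Relic, pushing steal as a custom CloudWatch metric from each instance and alarming on it works too. A minimal boto3 sketch; the namespace and metric name are arbitrary choices, not AWS built-ins:

    import boto3

    cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')

    def publish_steal(instance_id, steal_pct):
        # steal_pct would come from sampling /proc/stat as sketched earlier in the thread
        cloudwatch.put_metric_data(
            Namespace='Custom/EC2',  # arbitrary namespace, not an AWS built-in
            MetricData=[{
                'MetricName': 'CPUSteal',
                'Dimensions': [{'Name': 'InstanceId', 'Value': instance_id}],
                'Value': steal_pct,
                'Unit': 'Percent',
            }],
        )

    # A CloudWatch alarm on Custom/EC2 CPUSteal (say, > 5% for 5 consecutive minutes)
    # can then page the ops team or trigger whatever replacement automation you like.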

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

adorai posted:

Just wait until everyone starts doing it.
It's not really a problem in most other environments (GCE, Azure, etc.), because nobody else oversubscribes CPU resources to the level that Amazon does.

re: OnMetal, there's no hourly billing option like SoftLayer, only monthly. What's the loving point?

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.
OpenStack nerds: is there any way to have an instance automatically terminate on shutdown, like in EC2?

cliffy
Apr 12, 2002

Full disclosure: I work on the OnMetal product at Rackspace. Mainly doing Openstack-related, open source development.

StabbinHobo posted:

anybody using rackspace's "OnMetal" yet?

I am! In a way, at least. Do you have any specific, quantifiable questions?

I'll try to avoid making subjective claims like: "It's great!", because I'm obviously biased.


Misogynist posted:

It's not really a problem in most other environments (GCE, Azure, etc.), because nobody else oversubscribes CPU resources to the level that Amazon does.

re: OnMetal, there's no hourly billing option like SoftLayer, only monthly. What's the loving point?

Actually, usage is metered down to the minute. You can see the hourly rates here, under the OnMetal heading. The messaging on the main OnMetal landing page leaves a bit to be desired. It only talks about monthly rates, but you're definitely billed for instances by the minute.

Misogynist posted:

OpenStack nerds: is there any way to have an instance automatically terminate on shutdown, like in EC2?

'nova delete', if you're using the python-novaclient, should do what you're asking. See also: http://developer.openstack.org/api-ref-compute-v2.html . Specifically 'delete server' under the 'Servers' heading. Unless I'm misunderstanding what you're trying to do.

Docjowles
Apr 9, 2009

In EC2 you can set an instance so that if it is shut down (like "shutdown -h" at the command line), it automatically terminates instead of just shutting down as it normally would. He's asking if you can do that in OpenStack, without explicitly calling "nova delete".
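
(For reference, the EC2 side of that is the instance's shutdown-behavior attribute, settable at launch or afterwards. A quick boto3 sketch; the instance ID is a placeholder:)

    import boto3

    ec2 = boto3.client('ec2', region_name='us-east-1')

    # Flip an existing instance so a guest-initiated "shutdown -h" terminates it
    ec2.modify_instance_attribute(
        InstanceId='i-0123456789abcdef0',  # placeholder
        InstanceInitiatedShutdownBehavior={'Value': 'terminate'},
    )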

I don't think OS has that feature, though I could easily be wrong. What we've done when we need automation that's not provided by OpenStack itself is write a little Python app that listens to the rabbit queues and takes appropriate action. You could write a handler that listens for the instance stop message and automatically fires a delete command when one comes through, for example.
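
A rough sketch of that listener using pika is below. The exchange, topic, and event_type strings are common defaults but vary by deployment and release, so treat them as placeholders to check against your own nova notification config, and in real life you'd filter so you only reap your own instances:

    import json
    import subprocess

    import pika

    RABBIT_HOST = 'rabbit.internal'   # placeholder; use nova's rabbit host/credentials
    EXCHANGE = 'nova'                 # assumed default notification exchange
    TOPIC = 'notifications.info'      # assumed default notification topic
    STOP_EVENTS = ('compute.instance.power_off.end',
                   'compute.instance.shutdown.end')  # verify against your release

    def on_message(channel, method, properties, body):
        msg = json.loads(body)
        if 'oslo.message' in msg:     # unwrap the oslo.messaging envelope if present
            msg = json.loads(msg['oslo.message'])
        if msg.get('event_type') in STOP_EVENTS:
            instance_id = msg['payload']['instance_id']
            # Shelling out keeps the sketch short; python-novaclient would be cleaner.
            # Filter by image/metadata here so you only reap your own instances.
            subprocess.call(['nova', 'delete', instance_id])

    connection = pika.BlockingConnection(pika.ConnectionParameters(host=RABBIT_HOST))
    channel = connection.channel()
    channel.queue_declare(queue='instance-reaper')
    channel.queue_bind(exchange=EXCHANGE, queue='instance-reaper', routing_key=TOPIC)
    # pika >= 1.0 signature; older pika takes (callback, queue=...)
    channel.basic_consume(queue='instance-reaper', on_message_callback=on_message, auto_ack=True)
    channel.start_consuming()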

Docjowles fucked around with this message at 16:43 on Mar 31, 2015

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

Docjowles posted:

In EC2 you can set an instance so that if it is shut down (like "shutdown -h" at the command line), it automatically terminates instead of just shutting down as it normally would. He's asking if you can do that in OpenStack, without explicitly calling "nova delete".
Bingo. We could easily have a service stop handler on this VM image that blows up the instance immediately during shutdown, but I was hoping for something built into the compute or orchestration stack already. Not having any kind of credentials baked into this VM image is a better option for us. If you're familiar with what we do, it's obvious why.

Docjowles posted:

I don't think OS has that feature, though I could easily be wrong. What we've done when we need automation that's not provided by OpenStack itself is write a little Python app that listens to the rabbit queues and takes appropriate action. You could write a handler that listens for the instance stop message and automatically fires a delete command when one comes through, for example.
I was thinking of writing a scrubber that runs every couple of minutes (and still might, to catch edge cases), but this is a really really good idea I never thought of. Thanks!

(A generic stream processing service for OpenStack that gets messages and fires webhooks might be even nicer to have someday.)
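
The every-couple-of-minutes scrubber is only a few lines with python-novaclient, for what it's worth. Same caveats: the auth arguments are placeholders, and you'd filter on image or metadata rather than reaping every SHUTOFF instance:

    import time

    from novaclient import client

    # Placeholder credentials; newer novaclient versions want a keystoneauth session instead
    nova = client.Client('2', 'user', 'password', 'project', 'http://keystone:5000/v2.0')

    while True:
        for server in nova.servers.list():
            # Check metadata/image here so you only reap instances that opted in
            if server.status == 'SHUTOFF':
                server.delete()
        time.sleep(120)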

Vulture Culture fucked around with this message at 18:54 on Mar 31, 2015

MagnumOpus
Dec 7, 2006

Misogynist posted:

(A generic stream processing service for OpenStack that gets messages and fires webhooks might be even nicer to have someday.)

We're currently looking into doing something along these lines with Stackstorm. Will be a while before I can give a field report though, we're still slowly unfucking the damage these devs did in the six months they were "doing DevOps" before it occurred to them to hire some people actually trained in operations.

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

MagnumOpus posted:

We're currently looking into doing something along these lines with Stackstorm. Will be a while before I can give a field report though, we're still slowly unfucking the damage these devs did in the six months they were "doing DevOps" before it occurred to them to hire some people actually trained in operations.
This is really awesome, thanks for this!

MagnumOpus
Dec 7, 2006

Misogynist posted:

This is really awesome, thanks for this!

I'd be curious to hear what you think about it!

MagnumOpus
Dec 7, 2006

Anyone using Foreman for orchestration?

We're evaluating tools and I'm most familiar with Chef + mess of scripts for orchestration. Looking for a better way that will handle orchestration from the tenant (OpenStack) up, with a caveat that it needs to gracefully integrate extensions for auto-scaling logic we might write.

evol262
Nov 30, 2010
#!/usr/bin/perl

MagnumOpus posted:

Anyone using Foreman for orchestration?

We're evaluating tools and I'm most familiar with Chef + mess of scripts for orchestration. Looking for a better way that will handle orchestration from the tenant (OpenStack) up, with a caveat that it needs to gracefully integrate extensions for auto-scaling logic we might write.

I have nothing bad to say about foreman at all. It was pretty puppet-centric in the past, but that's all better now.

Docjowles
Apr 9, 2009

I've been hoping to deploy Foreman for like the last year. They have a decent SaltStack plugin (which is what we use for config management), too. Unfortunately it requires a newish version of Salt, and there are a couple of ridiculous bugs blocking us from upgrading. They're all marked as fixed in the next Salt release, so I'm hopeful that we can finally start testing Foreman sometime soon.

Bhodi
Dec 9, 2007

Oh, it's just a cat.
Pillbug
Satellite server 6 uses foreman and people seem to like it. It's anecdotally a good choice for what it does.

MagnumOpus
Dec 7, 2006

I haven't read enough about Foreman, or Satellite either, but I will definitely look deeper into both now. It looks like we're locked into a masterless Puppet architecture for the foreseeable future, so if Foreman fits together nicely that would be best.

I just added a secure secrets storage story to our backlog this afternoon. What's the new hotness there? We have a semi-directive to keep the platform loosely IaaS-agnostic, so integrating it with Keystone is out. I'm looking at both Conjur, which would fit really well into our environment assuming they can produce an OpenStack image, and blackbox, which would snap right onto our team's existing "secrets all up in repos" workflow but doesn't feel very durable.

evol262
Nov 30, 2010
#!/usr/bin/perl

MagnumOpus posted:

I haven't read enough about Foreman, or Satellite either, but I will definitely look deeper into both now. It looks like we're locked into a masterless Puppet architecture for the foreseeable future, so if Foreman fits together nicely that would be best.

I just added a secure secrets storage story to our backlog this afternoon. What's the new hotness there? We have a semi-directive to keep the platform loosely IaaS-agnostic, so integrating it with Keystone is out. I'm looking at both Conjur, which would fit really well into our environment assuming they can produce an OpenStack image, and blackbox, which would snap right onto our team's existing "secrets all up in repos" workflow but doesn't feel very durable.

I've never used any of the secret sharing, so no comment there.

Last time I installed foreman, it wanted to be an external classifier and optionally a puppetmaster. Never tried it masterless, but I would think it then becomes an overblown image deployment tool, which is maybe what you want.

MagnumOpus
Dec 7, 2006

evol262 posted:

I've never used any of the secret sharing, so no comment there.

Last time I installed foreman, it wanted to be an external classifier and optionally a puppetmaster. Never tried it masterless, but I would think it then becomes an overblown image deployment tool, which is maybe what you want.

It sort of is. Our OpenStack provider relationship requires us to use specific base images that we can't branch or anything, so we have to do all the configuration once the VM comes up. Not ideal, I know, but :corp lyfe:.

Edit: This is not to say there aren't reasons. Again, it's possibly an overreaction to concerns about being IaaS-agnostic, but those are the directives I have to work with.

StabbinHobo
Oct 18, 2002

by Jeffrey of YOSPOS

cliffy posted:

Full disclosure: I work on the OnMetal product at Rackspace. Mainly doing Openstack-related, open source development.


I am! In a way, at least. Do you have any specific, quantifiable questions?

I'll try to avoid making subjective claims like: "It's great!", because I'm obviously biased.

awesome! hi

are there extra software layers involved in the host-to-host networking? like, in a colo setup its code -> kernel ip stack -> ethernet driver -> 1 - 4 switches -> ethernet driver -> ip stack -> code. what (if any) extra hops/layers does onmetal have? since there's no dom0, how do you handle network/customer segmentation? any chance its just plain-old-vlans?

how is ironic? I'm still in cobbler/kickstart land, and #including all of openstack's... accoutrement seems like... I guess no ones ever been able to coherently pitch it to me without rambling about the benefits of being able to run hundreds of vm's, which I consider a full blown anti-pattern.

how do you guys clean the disks between customers? similarly, how do you check for things like ssd wear?

evol262
Nov 30, 2010
#!/usr/bin/perl

StabbinHobo posted:

awesome! hi

are there extra software layers involved in the host-to-host networking? like, in a colo setup its code -> kernel ip stack -> ethernet driver -> 1 - 4 switches -> ethernet driver -> ip stack -> code. what (if any) extra hops/layers does onmetal have? since there's no dom0, how do you handle network/customer segmentation? any chance its just plain-old-vlans?

how is ironic? I'm still in cobbler/kickstart land, and #including all of openstack's... accoutrement seems like... I guess no ones ever been able to coherently pitch it to me without rambling about the benefits of being able to run hundreds of vm's, which I consider a full blown anti-pattern.

how do you guys clean the disks between customers? similarly, how do you check for things like ssd wear?

Onmetal will still go through Neutron and whatever segmentation rackspace uses for that (vxlan or GRE, probably, though plain vlans are an option). It's really unlikely that they're using plain nova networks.

Rackspace probably has openflow/neutron switches from Cisco or juniper, so it won't need to run all the way to a neutron controller, but all of those pieces still matter.

The advantage of ironic is that you can deploy images and get all the cloud-init bits, so the same image running in virt somewhere is running on metal, and it ties into heat and everything for autoscaling and formations and tenant networks, and... That's the pitch. It is not a replacement for cobbler or the foreman discovery image. It extends openstack.

The same openstack patterns still apply to ironic.

cliffy
Apr 12, 2002

StabbinHobo posted:

awesome! hi

Greetings!

StabbinHobo posted:

are there extra software layers involved in the host-to-host networking? like, in a colo setup its code -> kernel ip stack -> ethernet driver -> 1 - 4 switches -> ethernet driver -> ip stack -> code. what (if any) extra hops/layers does onmetal have? since there's no dom0, how do you handle network/customer segmentation? any chance its just plain-old-vlans?

I'm not a networking expert. That said, I'm told there are no extra layers than what you laid out.

Customer network segmentation is currently handled by $cisco_magic, but these are not true isolated networks. Isolated networks are considered a critical feature, so we should have them sooner rather than later. When we do roll them out they will likely be VXLAN-based.

StabbinHobo posted:

how is ironic? I'm still in cobbler/kickstart land, and #including all of openstack's... accoutrement seems like... I guess no ones ever been able to coherently pitch it to me without rambling about the benefits of being able to run hundreds of vm's, which I consider a full blown anti-pattern.

It works for us, but afaict OnMetal is literally the only production bare-metal cloud product using Ironic. Other companies may be using it for servicing in-house deployments. It helps that we have two Ironic core developers on our team.

To deploy Ironic you do need a few accompanying OpenStack services, which at a minimum are: Nova (Compute scheduling service), Glance (Image service), and Keystone (Identity service). The VM ramblings probably come from the fact that Nova is designed to schedule/provision VMs using drivers which communicate with various hypervisors. Nova treats Ironic like a hypervisor; you could consider Ironic a hypervisor which completely vacates the machine once the machine gets provisioned, though the specifics of Ironic's behavior depend on which bare-metal driver you have configured within Ironic to manage machines.

StabbinHobo posted:

how do you guys clean the disks between customers? similarly, how do you check for things like ssd wear?

OnMetal uses the Ironic Python Agent driver in Ironic to manage machines, so you can see exactly what we do to erase disks here:
https://github.com/openstack/ironic-python-agent/blob/master/ironic_python_agent/hardware.py#L429

The short of it is we use ATA enhanced secure erase where available, and fall back to ATA secure erase when the enhanced version is not available.

We currently collect SMART data, but doing further analysis and generating actions on said data is a work in progress.
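
For anyone curious what that fallback looks like outside the linked hardware.py, it's roughly the hdparm sequence below. This is a simplified sketch rather than the actual OnMetal code: the device and password are placeholders, real code checks the drive's reported capabilities instead of relying on try/except, and hdparm will refuse drives in a frozen security state:

    import subprocess

    def secure_erase(dev='/dev/sda', password='erase-me'):
        # A security password has to be set before either erase command is accepted
        subprocess.check_call(['hdparm', '--user-master', 'u',
                               '--security-set-pass', password, dev])
        try:
            # Preferred: enhanced erase, which also wipes reallocated/spare areas
            subprocess.check_call(['hdparm', '--user-master', 'u',
                                   '--security-erase-enhanced', password, dev])
        except subprocess.CalledProcessError:
            # Fall back to plain ATA secure erase if the enhanced variant isn't supported
            subprocess.check_call(['hdparm', '--user-master', 'u',
                                   '--security-erase', password, dev])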

StabbinHobo
Oct 18, 2002

by Jeffrey of YOSPOS
so appreciative, thanks

if I may keep going, how is some kind of console or out of band access handled? for instance, lets assume I typo something in my kickstart, how do i get on that console and hit alt+f2 to see what went wrong?

edit: oh poo poo, if customers are sharing a vlan... can I kickstart? (pxe boot?) I assume you have to handle dhcp then, not me? do you allow for configurable parameters like next-server then?

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

StabbinHobo posted:

so appreciative, thanks

if I may keep going, how is some kind of console or out of band access handled? for instance, lets assume I typo something in my kickstart, how do i get on that console and hit alt+f2 to see what went wrong?

edit: oh poo poo, if customers are sharing a vlan... can I kickstart? (pxe boot?) I assume you have to handle dhcp then, not me? do you allow for configurable parameters like next-server then?
You won't be using Kickstart or PXE directly; you'll use OpenStack's facilities for uploading and distributing basic system images to your bare-metal hosts. Like other cloud services, you'll do your post-install configuration at first boot using cloud-init. I believe OnMetal uses a configuration drive rather than http://169.254.169.254 to distribute the metadata for customizing your install, but it's transparent either way -- you'll set the metadata at server creation time through the API, and cloud-init will make decisions based on that data.
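
A small illustration of that transparency from the instance side: cloud-init normally does this for you, but reading the metadata yourself looks roughly like this. The config-drive mount point is an assumption; adjust it to wherever the drive actually gets mounted:

    import json
    import os

    try:                              # Python 2 / 3 compatibility
        from urllib2 import urlopen
    except ImportError:
        from urllib.request import urlopen

    CONFIG_DRIVE = '/mnt/config/openstack/latest/meta_data.json'  # assumed mount point
    METADATA_URL = 'http://169.254.169.254/openstack/latest/meta_data.json'

    def instance_metadata():
        # Prefer the config drive if it's mounted; otherwise fall back to the metadata service
        if os.path.exists(CONFIG_DRIVE):
            with open(CONFIG_DRIVE) as f:
                return json.load(f)
        return json.loads(urlopen(METADATA_URL).read().decode('utf-8'))

    meta = instance_metadata()
    print(meta.get('name'), meta.get('meta', {}))  # 'meta' holds the key/value pairs set at creation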

I don't think you get any kind of console access at all with OnMetal; their documentation strongly implies that these are only available on virtual servers.

Vulture Culture fucked around with this message at 02:56 on Apr 7, 2015

MagnumOpus
Dec 7, 2006

Anyone using Cloud Foundry, and if so, what are you doing for system metrics? I see that Collector has been deprecated, but the Firehose system that is replacing it is still very new and the only interfaces ("nozzles") I can find around are prototypes.
