ILikeVoltron
May 17, 2003

I <3 spyderbyte!

1000101 posted:

You probably can't answer this question even if you know the answer but I'll ask anyway.

What is Red Hat professional services using to deploy with? I can't imagine they're using the OSP installer for everything, are they?

This times 1000. They cannot be using OSP unless those guys are popping happy pills daily to make up for the soul-crushing weight of having to work through that deployment tool.


ILikeVoltron
May 17, 2003

I <3 spyderbyte!

evol262 posted:

I'd guess that they are, though I don't know for sure. Since so few customers who "need" openstack actually need openstack (there's a very high "I need openstack" to "gimme rhev" conversion ratio), it's hard to estimate.

I'm not involved on the operations side internally, but I'm positive we use OSP for our internal and external stuff backed by openstack, though. Dogfooding is a big deal, and the internal outage mailing lists are very explicit about "we're going down for 6 hours on $date because we're updating from OSP X to OSP Y"

Can you explain what needs to happen before it scales past 5 machines? I've got a deployment out there that's around 30 physical nodes and the thing runs like poo. If I launch 10 VMs the APIs start to fail and some of the VMs fail to launch. I've got almost 100% defaults except for password and ceph ports in the OSP hostgroup params.

ILikeVoltron
May 17, 2003

I <3 spyderbyte!

evol262 posted:

It's hard to say without knowing what's failing.

This is almost always neutron. Are worker threads enabled? I don't think OSP enables them by default, and Neutron behaves like poo poo without them.

How many NICs do you have? There's not a ton of message queue or database traffic (not enough to worry about with 30 nodes), but your guests are segmented into vxlans on a different physical NIC, right?

How many identity/keystone instances are you running? OSP defaults to one. Running it on every node and fronting with haproxy is ideal.

Is glance mapped from some fast storage? How do your NICs look when you're doing this? If you time out waiting for neutron to create a port, look there. If you crap out because glance is taking 100% of your bandwidth, configure nova's instances_path to be a mountpoint. Ceph or gluster are ideal (so you can shove images in from glance-api on any node).

But you'd have to be a little more specific than "APIs start to fail" to get specific suggestions, and I'm not an expert in every component...

If I had to guess, I think it's keystone authentication tokens; the APIs that fail are either Cinder-related or Nova-related. The problem originally seemed to be the database backend, and Keystone is basically just a REST front end on a database. There have been some errors in the logs, but it's mostly "I can't do the thing I tried to do after 3 attempts"; google-fu doesn't pull up anything, and the OSP config is basically defaults like I stated earlier.

There are 5 NICs per host, all running on UCS, with a dual 10gig backend configured in A-B failover. I never looked into the Neutron worker threads; I'll be sure to check that out (config sketch below). Guest traffic isn't an issue yet, as it's only maybe 5-10 instances. This is a brand new install. And yeah, VXLAN for tenant networks.

So we're running 3x controllers, each with 128 gigs of RAM, 56 cores, and a pair of RAID1 7k SAS drives. Glance is mapped to a trio of ceph-backed storage nodes, each with 10 disks, or 30 OSDs total. I've seen this setup hit 7000 IOPS, and during testing (trying to reproduce this problem) we hardly hit 1000 IOPS.

The NICs are all separated by type, so there's: Management, Storage Clustering, Cluster Management, Tenant, External, Public API, Storage.
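
For my own notes, this is roughly what I expect to be poking at in neutron.conf (a sketch only; these are the standard upstream option names, and I'd have to check what defaults OSP actually ships):
code:
[DEFAULT]
# number of separate API worker processes serving neutron-server requests
api_workers = 8
# number of RPC worker processes consuming from the message queue
rpc_workers = 8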

ILikeVoltron
May 17, 2003

I <3 spyderbyte!

Vulture Culture posted:

I don't know anything about OSP specifically, but I'm positive you have a database problem.

OpenStack uses locking and SELECT ... FOR UPDATE extensively when allocating resources (vCPUs, memory, fixed/floating IPs, etc.). This fails transactions frequently, especially in multi-writer MySQL configurations, because of the way that Galera processes transactions. Most of the OpenStack components are configured to retry when this condition results in a conflict, but under heavy lock contention they can just sit there and spin forever with none of the updates ever finishing. As a first step, if you're using MySQL, make sure you're directing all of your writes through a single MySQL server node. (You can use different primary writers for Nova and Neutron to help scale, if you need, but make sure all your Nova writes go through the same node and all your Neutron writes go through the same node.) Make sure your database is tuned for writes as tightly as it will go. Strongly consider running your Nova and Neutron databases from SSD, as this will make the commits much faster and decrease the incidence of this problem.

Most of the database load from that SELECT ... FOR UPDATE issue is quota management. If you're running in a single-tenant organization, or you otherwise don't care about quotas, you can switch your Nova and Neutron configurations from the DbQuotaDriver to the NoopQuotaDriver, effectively disabling quotas.

I'll be glad to check this - again, this is defaults from the OSP installer. It does have 3x MySQL boxes created using a pacemaker cluster. One guy seems to be the master, because his process list hovers between 1000 and 1600 (and one of our first steps was to up that limit from the default of 1000). The other two mysqld process lists show 4-5, sleeping or waiting for binlogs IIRC. What we see is that the 'master' doesn't seem to be reporting any locks directly. The mysqld log doesn't look too ugly, other than reporting that it can't raise the max number of open files: "[Warning] Could not increase number of max_open_files to more than 1024 (request: 1835)". I think at one point we turned on the slow log and found it very weird: it would go from 2-4 second queries straight to a 30 second query and then roll over (I think that's the HAProxy timeout for API requests).
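
For the open-files warning, I'm assuming the usual fix applies: raise the file-descriptor limit for the mysqld process. A sketch of the systemd version (if pacemaker is starting mysqld outside systemd, this drop-in won't apply and the ulimit has to be raised wherever the resource agent launches it, which is a question for support):
code:
# /etc/systemd/system/mariadb.service.d/limits.conf
[Service]
LimitNOFILE=16384

# reload units and restart during a maintenance window
systemctl daemon-reload
systemctl restart mariadb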

evol262 posted:

This is 5-10 instances total? I thought you meant "starting 10 instances within 3 seconds makes some API fall over", which can often be blamed on Neutron.

Are the errors consistent? Same service failing every time? Or from the same hosts? Or to the same controllers? That could give a jumping-off point, at least, but it sounds like your architecture started off right and isn't to blame (even though I like LACP better than A-B failover).

So I can reproduce this easily when I start 10 instances from the CLI or GUI. Just select CentOS-whatever, launch 10 smalls with Cinder-backed storage, and boom: 9 times out of 10 at least 2-3 error out and fail. I can usually get it to fail just doing straight nova-backed instances as well.

To break the problem down a little more, I'm able to do this with two controllers running in the same cluster (I've taken one down, and varied which one I take down). As for which services fail, sometimes it's during block device mapping, sometimes not.

Also, just for clarity: when I ask about "scaling past 5 machines" I mean hosts. Like, a really basic single-controller install with 2-3 compute hosts.

Also, thank you both for raising some great questions.

ILikeVoltron
May 17, 2003

I <3 spyderbyte!

Vulture Culture posted:

I don't know anything about OSP specifically, but I'm positive you have a database problem.

OpenStack uses locking and SELECT ... FOR UPDATE extensively when allocating resources (vCPUs, memory, fixed/floating IPs, etc.). This fails transactions frequently, especially in multi-writer MySQL configurations, because of the way that Galera processes transactions. Most of the OpenStack components are configured to retry when this condition results in a conflict, but under heavy lock contention they can just sit there and spin forever with none of the updates ever finishing. As a first step, if you're using MySQL, make sure you're directing all of your writes through a single MySQL server node. (You can use different primary writers for Nova and Neutron to help scale, if you need, but make sure all your Nova writes go through the same node and all your Neutron writes go through the same node.) Make sure your database is tuned for writes as tightly as it will go. Strongly consider running your Nova and Neutron databases from SSD, as this will make the commits much faster and decrease the incidence of this problem.

Most of the database load from that SELECT ... FOR UPDATE issue is quota management. If you're running in a single-tenant organization, or you otherwise don't care about quotas, you can switch your Nova and Neutron configurations from the DbQuotaDriver to the NoopQuotaDriver, effectively disabling quotas.
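
Noted on the quota drivers. I haven't tried it yet, but I'm assuming the nova side of that switch looks roughly like this (sketch only; I'd want to confirm the neutron-side option before touching it):
code:
[DEFAULT]
# swap the default DbQuotaDriver for the no-op driver, effectively disabling quotas
quota_driver = nova.quota.NoopQuotaDriver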

So support asked me to run an InnoDB status dump during or after the failures. It was 3011 lines of output, most of which look like the following:
code:
MySQL thread id 981139, OS thread handle 0x7ef9eeefc700, query id 35323377 192.168.x.x keystone sleeping
---TRANSACTION 2241C14, not started

Sometimes it's nova, sometimes it's keystone, sometimes it's neutron.
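
Next time it happens I'm going to try narrowing it down with something like this instead of eyeballing 3000 lines (sketch; these are standard information_schema tables, nothing OSP-specific):
code:
-- transactions that have been open for more than 30 seconds
SELECT trx_mysql_thread_id, trx_state, trx_started, trx_query
FROM information_schema.innodb_trx
WHERE trx_started < NOW() - INTERVAL 30 SECOND;

-- lock waits currently in progress
SELECT * FROM information_schema.innodb_lock_waits;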

ILikeVoltron
May 17, 2003

I <3 spyderbyte!


Either of you guys know of an oslo.db bug in Juno with idle timeouts? My Red Hat support dude is still chasing engineers trying to understand what's happening. I cannot for the life of me understand why this is a problem on a vanilla deployment. We're 5 weeks in at this point.

ILikeVoltron
May 17, 2003

I <3 spyderbyte!

Vulture Culture posted:

We had the problem you're talking about in Kilo, and it went away once we identified and worked around the MySQL issue where you can't actually do multi-master in Nova with Galera.

I just got told to edit nova.conf, and add this to all nodes:
code:
[database]
idle_timeout = 300
But after reading what you guys are saying, and talking it over with a colleague, I'm not so sure now.

I just checked the defaults... 1 hour, man this is so janky.
code:
[database]
idle_timeout=3600
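
If we do roll the change out, it'll probably be something like this on each node rather than hand-editing (sketch; crudini is what I'd reach for on these boxes, and treat the service restart command as an assumption, plain systemctl restarts of the nova units work too):
code:
# set the override on every controller/compute node
crudini --set /etc/nova/nova.conf database idle_timeout 300
# restart the nova services so the new pool setting is picked up
openstack-service restart nova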

ILikeVoltron fucked around with this message at 07:08 on Aug 28, 2015

ILikeVoltron
May 17, 2003

I <3 spyderbyte!

Vulture Culture posted:

Back up -- what does your database architecture and configuration look like?

It's literally a vanilla install from Red Hat's OSP 6. They deploy a wsrep-based (Galera) trifecta for the database, using HAProxy as the front end (with server pinning) and replication. It's running on 3 x 56-core boxes with roughly 128 gigs of memory each (all bare metal) and a pair of disks in RAID 1. We're talking about 10 VMs total on the system (across, say, 28-ish nova hosts).

ILikeVoltron
May 17, 2003

I <3 spyderbyte!

evol262 posted:

I'm not really an OSP person (upstream only there, both ends on RHEV), but I have an old (Juno) RDO deployment (upstream of RHOS at the time, now RHEL-OSP) running on two Dell 9020s with more VMs than that.

Do you have a case #? Can you PM it to me? I can at least go look at the sosreports and bits they've made you attach.

Sent! Thanks for taking a look at this.

ILikeVoltron
May 17, 2003

I <3 spyderbyte!

chutwig posted:

We front our MySQL cluster with HAProxy, with the MySQL backend in backup mode so that only one instance gets traffic, for exactly that reason.

Do you happen to know how this is configured? Like, how did you tell the other members of the MySQL cluster to be in backup mode?

ILikeVoltron
May 17, 2003

I <3 spyderbyte!

chutwig posted:

It's configured on HAProxy only; MySQL has no awareness of it. "Backup mode" in the context of HAProxy means that HAProxy will only send traffic to a single backend instead of spreading it around the different backends: https://github.com/bloomberg/chef-bcpc/blob/master/cookbooks/bcpc/templates/default/haproxy-head.cfg.erb#L34-L44

Our setup leans heavily on the notion of a master virtual IP that floats around the cluster, i.e., there's one control plane node (head node) that is more equal than all the rest, and that's whichever one is holding the VIP. That node serves MySQL and HAProxy exclusively and also runs cronjobs that should only run on one system, everything else that goes through the VIP is distributed by HAProxy to all the control plane nodes. You'll need something like that for our jankity MySQL+HAProxy workaround to work for you.
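
As I read it, the backup-mode version would look something like this (a sketch based on the linked config, not what's actually deployed here):
code:
listen galera
  bind x.x.x.x:3306
  mode tcp
  option httpchk
  # first server takes all traffic; the others only see it if host1 is marked down
  server host1 x.x.x.x:3306 check port 9200
  server host2 x.x.x.x:3306 check port 9200 backup
  server host3 x.x.x.x:3306 check port 9200 backup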

OK, so the haproxy config out of Red Hat's installer uses "stick on dst" and "timeout server 90m", like so:

code:
listen galera
  bind x.x.x.x:3306
  mode  tcp
  option  tcplog
  option  httpchk
  option  tcpka
  stick  on dst
  stick-table  type ip size 2
  timeout  client 90m
  timeout  server 90m
  server host1 x.x.x.x:3306  check inter 1s port 9200 on-marked-down shutdown-sessions
  server host2 x.x.x.x:3306  check inter 1s port 9200 on-marked-down shutdown-sessions
  server host3 x.x.x.x:3306  check inter 1s port 9200 on-marked-down shutdown-sessions
I'm more interested in how you're doing replication between the MySQL hosts, because after some disk tests it seems to me that this matters:

code:
mysql -e "SHOW STATUS LIKE 'wsrep_local_recv_queue_avg';"
+----------------------------+----------+
| Variable_name              | Value    |
+----------------------------+----------+
| wsrep_local_recv_queue_avg | 0.545683 |
+----------------------------+----------+
The server running the mongo primary node for ceilometer seems to spike up to 1.5 or so, and then we start to see some failures. This could be nothing, but one of my colleagues thinks we could be seeing IO stalls causing replication to stall, which causes things to wait and eventually fail.
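
We're also going to start watching the flow-control counters alongside the receive queue, since flow control is how Galera throttles the whole cluster when one node falls behind (sketch; these are standard wsrep status variables):
code:
mysql -e "SHOW STATUS LIKE 'wsrep_flow_control_paused';"
mysql -e "SHOW STATUS LIKE 'wsrep_flow_control_sent';"
mysql -e "SHOW STATUS LIKE 'wsrep_local_send_queue_avg';"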

ILikeVoltron
May 17, 2003

I <3 spyderbyte!

BangersInMyKnickers posted:

The function of these virtual arrays really isn't that different from conventional controllers; the scale for everything (cache size, cache persistence time) is just orders of magnitude larger. I am well aware of the implications of this (almost no read ops hitting platter, extreme optimization of writes by buffering through SSD first) and am a proponent of the tech, but the point remains that 6 spindles offer a tiny amount of backend disk performance for a virtual cluster that could easily support 200+ VMs. The danger with these systems is that they work extremely well, until they don't. And you hit that wall really, really hard, because you're running at SSD speed right up until the SSD cache is exhausted, and by that time the platters are already saturated. More conventional storage architectures give you more warning when you're approaching that saturation point, as storage latencies creep up in a more controlled manner. I'm not saying it's bad technology or that you shouldn't use it, but you need to know what kind of IO you're throwing at it, because there are a number of scenarios where you can make it catastrophically explode in your face, and in a much worse manner than the conventional "bunch o spindles" setups.

Isn't the glory of that platform that you could just buy an additional host and no longer be at that wall?

ILikeVoltron
May 17, 2003

I <3 spyderbyte!

evol262 posted:

I unironically use manageiq to keep track of VMs across RHEV, vSphere, openstack, and aws.

Heathen! Heretic! My god why would you do that to yourself.... the horror, oh loving horror!

I did a week of training on cloudforms and do not have many nice things to say, other than it can do a whole bunch of stuff in a very complicated way!

ILikeVoltron
May 17, 2003

I <3 spyderbyte!

Thermopyle posted:

I'd like to automate the creation of VMs in Vagrant style. These would be Linux machines with a desktop environment. Probably Ubuntu 16.04. Would need to install various software and configure various things in the VM.

I've found some google results for using Vagrant for this purpose, but it doesn't seem like a very "supported" use case. Googling "automated VM creation" leads me to various sorts of results.

I don't care if this is virtualbox or vmware.

Currently, I use a base snapshot of a vmware machine, but I think I'd like to be able to do things programmatically.

What's the One True Way of doing this?

What exactly are you trying to automate? Creating the VM? Installing the software? Building images to launch through vagrant? Each of these has a different answer.

If you're just trying to have a VM you run on your desktop that you can delete and recreate easily, vagrant is a great solution: "vagrant up; vagrant destroy -f". If you're trying to automate the software install, then you want something like puppet/chef/salt/etc., and if you're trying to create the image that you launch out of vagrant (or many other platforms), you want packer.

ILikeVoltron
May 17, 2003

I <3 spyderbyte!

jaegerx posted:

terraform or packer

I legit don't understand what you're asking or answering here?

Thermopyle posted:

Creating the VM, installing the software.

Likely vagrant plus some form of provisioning: https://www.vagrantup.com/docs/provisioning/. I'd personally use puppet because it's what I know, but if you're weak on that (or chef, ansible, salt, etc.), you could just write some shell scripts that kick off when the vagrant VM is launched (i.e. yum install foo / apt-get install foo).
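
A minimal sketch of what that looks like (box name and packages are just placeholders; I haven't tested this exact file):
code:
# Vagrantfile
Vagrant.configure("2") do |config|
  # any desktop-capable box works; ubuntu/xenial64 (16.04) is just an example
  config.vm.box = "ubuntu/xenial64"

  config.vm.provider "virtualbox" do |vb|
    vb.memory = 4096
    vb.cpus = 2
    vb.gui = true   # show the console so the desktop environment is usable
  end

  # shell provisioner runs on the first "vagrant up" (or "vagrant provision")
  config.vm.provision "shell", inline: <<-SHELL
    apt-get update
    apt-get install -y ubuntu-desktop firefox
  SHELL
end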

ILikeVoltron
May 17, 2003

I <3 spyderbyte!

Bhodi posted:

I guess I consider container images to be similar enough to binary artifacts to lump them together. I think we're splitting hairs at this point. You're right on opinionated, though!

No splitting hairs, you're just wrong. Containers != VMs. VM Images != Containers in any way.

ILikeVoltron
May 17, 2003

I <3 spyderbyte!

The Nards Pan posted:

I'm having a strange problem with my home qemu/KVM lab running on Ubuntu desktop 17.04 (although this issue was around in 16.04 and 16.10 too). Since I'm using my laptop as the host and it's often connected to my wifi, which doesn't support bridged connections, I have it set up with a NAT virtual network which also provides DHCP for the guests. If I let my guests use the DNS address that the DHCP provides, which is the network address of the NAT network (192.168.100.1), some websites won't connect while others work fine. I get a response from nslookup for the sites that don't work, but they immediately return a server-not-found page if I try to access them through a web browser from the guest. If I manually set the DNS to Google's, everything works just fine. I can't seem to find a pattern of what works and what doesn't either - microsoft.com and mozilla.com are no good, google.com and somethingawful.com work fine. I get the same results from nslookup using 192.168.100.1 or 8.8.8.8 as DNS on my guest, and the same result from nslookup on the host using my ISP DNS.

I may have messed up something a few months ago when I was trying to cudgel the virtual network into working as a bridge over wifi, but I can't find anything else that I did then that I haven't undone. I was running a DNS server on one of the guests with all the other guests pointed at that so I didn't notice.

Can anyone think of something that might be causing this?

Are you trying to NAT from and to the same network? That is, does the NAT virtual network you're using have the same IP scope as your "main" or primary network?
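
Worth comparing what libvirt thinks the NAT network looks like against what your wifi hands out (sketch; the libvirt network is usually called "default", but yours may be named differently):
code:
# show the NAT network definition, including its <ip> range and dnsmasq settings
virsh net-dumpxml default

# compare against the host's own addresses and routes
ip -4 addr show
ip route

# and see what the NAT network's dnsmasq actually answers for a broken domain
nslookup microsoft.com 192.168.100.1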

ILikeVoltron
May 17, 2003

I <3 spyderbyte!

Paul MaudDib posted:

So I remember reading that Docker had a really ludicrous security model (running applications as root by default and/or needing to be run as root a lot of the time, or something like that).

Is there a container system with a bit more of a reasonable security model? FreeBSD doesn't seem like they'd do that bullshit, do jails work reasonably well? How about LXC?

Obviously a full-on hypervisor is the way to go if you really want to totally sandbox everything, but that's fairly heavyweight.

rkt claims to be a 'very secure way to run containers' and can import docker images for portability. I haven't played with it much, and the vagrant VM bombed on me the last time I tried it, but given another year I think it might overtake docker.
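
Pulling a docker image into it is roughly this, if I remember right (sketch from memory; the --insecure-options flag is needed because docker registries don't do rkt's signature verification):
code:
# fetch and run an image straight from Docker Hub
sudo rkt run --insecure-options=image docker://nginx

# or just fetch it into the local store first
sudo rkt fetch --insecure-options=image docker://centos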

ILikeVoltron
May 17, 2003

I <3 spyderbyte!

Roargasm posted:

why this over puppet?

It's not at odds with puppet; it can easily be used in addition to puppet. Packer has a puppet provisioner built in.

Anyway, to answer the question you maybe were asking: packer is good at one thing, building images. Puppet isn't good at building images; it's good at enforcing desired state. Which means you're still building some glue around puppet (say, jenkins or whatever) to launch the VM, run puppet, and then make an image. Packer does all of that itself; it's the obvious tool for making images (hence, packing them!).
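
A rough sketch of a packer template with a shell provisioner (untested; the ISO URL/checksum are placeholders, it omits the boot_command/preseed plumbing you'd need for a fully unattended install, and you'd swap in the puppet provisioner if you'd rather converge with that):
code:
{
  "builders": [{
    "type": "virtualbox-iso",
    "iso_url": "http://example.com/ubuntu-16.04-server-amd64.iso",
    "iso_checksum_type": "sha256",
    "iso_checksum": "PLACEHOLDER",
    "guest_os_type": "Ubuntu_64",
    "ssh_username": "vagrant",
    "ssh_password": "vagrant",
    "shutdown_command": "sudo shutdown -P now"
  }],
  "provisioners": [{
    "type": "shell",
    "inline": [
      "sudo apt-get update",
      "sudo apt-get install -y build-essential"
    ]
  }],
  "post-processors": [{
    "type": "vagrant"
  }]
}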

ILikeVoltron
May 17, 2003

I <3 spyderbyte!
I'm going to throw out there that if you're looking for a 10+ core Xeon, don't make the mistake I did and order it new; that poo poo is hella cheap on eBay if you can deal with shipping from HK or China.

Intel Xeon E5 2630 V4 ES QHVK 2.1Ghz 25MB 10Core 20threads 14nm 85W CPU - $189 on ebay.

ILikeVoltron
May 17, 2003

I <3 spyderbyte!

nicky_glasses posted:

Anyone doing any vmware automation not in Powershell? The vmware SDK documentation is painfully obtuse and picking any bindings outside of PS is very difficult as there are no books or tutorials as far as I can tell for Python or even others, except for Java which I don't care to learn.

Their APIs and docs are mostly poo poo and remind me of working with Active Directory over COM.

ILikeVoltron
May 17, 2003

I <3 spyderbyte!

evol262 posted:

virt-manager is essentially a frontend to libvirt (like virsh).

The primary "value add" is that it can handle a bunch of finicky migration stuff, easy device passthrough (including to running VMs), managing block storage, CPU flags, et al. It's a lot more complicated, not a little.

qemu can do all of that, if you want to manage iscsi storage pools yourself and memorize 90001 flags, though

It's nearly impossible to overstate how painful the command-line options to qemu are; just find a system with a running VM and `ps aux | grep qemu`, and that process's flags will all be visible (and painful).
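
For a taste of the difference, this is roughly what the two ends look like (the qemu line is heavily abridged and from memory, not a real process listing):
code:
# one line with virt-install (libvirt generates the flag soup for you)
virt-install --name testvm --memory 2048 --vcpus 2 \
  --disk size=20 --cdrom /iso/centos7.iso --os-variant centos7.0

# vs. the kind of thing ps shows for the underlying qemu process (abridged)
qemu-system-x86_64 -name testvm -machine pc-i440fx-2.3,accel=kvm -m 2048 -smp 2 \
  -drive file=/var/lib/libvirt/images/testvm.qcow2,if=none,id=drive-virtio-disk0,format=qcow2 \
  -device virtio-blk-pci,drive=drive-virtio-disk0 \
  -netdev tap,id=hostnet0 -device virtio-net-pci,netdev=hostnet0 \
  -vnc 127.0.0.1:0 ...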


ILikeVoltron
May 17, 2003

I <3 spyderbyte!

SlowBloke posted:

I never considered FC to be interesting until I started working with it. Unlike iSCSI, it either works perfectly or everything is hosed. The native multipathing is a nice extra.

Perfectly linear link aggregation is really nice too; the idea that to add bandwidth you just add a port is awesome.
