1000101
May 14, 2003

BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY FRUITCAKE!

evol262 posted:

The RDO community forums are probably better support, unfortunately.

We know the installer sucks and that Mirantis' is better (I think I've talked about it here before, too). Staypuft is actually good if you're running Foreman, though

You probably can't answer this question even if you know the answer but I'll ask anyway.

What is Red Hat professional services using to deploy with? I can't imagine they're using the OSP installer for everything, are they?

ILikeVoltron
May 17, 2003

I <3 spyderbyte!

1000101 posted:

You probably can't answer this question even if you know the answer but I'll ask anyway.

What is Red Hat professional services using to deploy with? I can't imagine they're using the OSP installer for everything, are they?

This times 1000. They cannot be using OSP unless those guys are popping happy pills daily to make up for the soul-crushing weight of having to work through that deployment tool.

evol262
Nov 30, 2010
#!/usr/bin/perl

1000101 posted:

You probably can't answer this question even if you know the answer but I'll ask anyway.

What is Red Hat professional services using to deploy with? I can't imagine they're using the OSP installer for everything, are they?

I'd guess that they are, though I don't know for sure. Since so few customers who "need" openstack actually need openstack (there's a very high "I need openstack" to "gimme rhev" conversion ratio), it's hard to estimate.

I'm not involved on the operations side internally, but I'm positive we use OSP for our internal and external stuff backed by openstack, though. Dogfooding is a big deal, and the internal outage mailing lists are very explicit about "we're going down for 6 hours on $date because we're updating from OSP X to OSP Y"

BangersInMyKnickers
Nov 3, 2004

I have a thing for courageous dongles

Wicaeed posted:

How do you guys reconcile the fact that sometimes in vCenter a VM may show that it is relatively underutilized, but the VM owner says that they can't run their app properly because of resource contention issues?

I'm starting at a new position managing something like 1000 VMs; about half are staging/QA environments where the VM owners are howling that they don't have the resources to do their jobs properly.

That doesn't jibe with what I'm seeing in vCenter and vCOps, whose reports show that the VMs in question are actually underutilized.

Always start at the OS level and see what the problem software is seeing. If you're on Windows, the resource monitor is laid out in four tabs (CPU, memory, disk, and network) and you should go through them one by one looking at what is causing a bottleneck. Is the CPU or a single vCPU pegged out? Is the disk thrashing all the time with the IO queue maxed out? Is it pushing a bunch of network traffic? Is physical memory usage maxed out and causing page file thrashing? Those are the key areas, and each application uses resources a little differently, so you have to take it on a case-by-case basis, but there are some general rules to follow.

CPU - If the VM is maxing out its CPU all the time (or in bursts and is latency sensitive) then check to see if the software is multithreaded. If it is, adding a second vCPU to the VM is a fairly safe bet. If not, going from 1 to 2 vCPUs may still help a bit by allowing the busy application to take an entire core while the other services take the other, lightly loaded one (but you're going to eat a lot of overhead doing this and performance improvements will be small, maybe 2-3% depending on what your idle CPU usage on the VM is). Then you move out of the VM to the host and see if its CPU requests are being fulfilled or if it's stuck in scheduling hell because all the other VMs are crowding it out. Check VM CPU ready time, latency %, and usage. If the first two are high, that often indicates overcrowding or too many oversized VMs. In any case, at that point you're looking at tuning the existing VMs to get them to play nicer with each other, swapping out CPUs for something with more cores/frequency, or dropping in more hosts.
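
The ready-time math is easy enough to sanity-check by hand; a rough sketch in Python, assuming the realtime chart's default 20-second sample interval (the summation value is what vCenter shows, in milliseconds):

def cpu_ready_percent(ready_ms, sample_interval_s=20, num_vcpus=1):
    # Convert a vCenter "CPU ready" summation value (ms per sample interval)
    # into a percentage. Sustained values much over ~5% per vCPU usually
    # mean the scheduler is struggling to find the VM a free core.
    return ready_ms / (sample_interval_s * 1000.0 * num_vcpus) * 100.0

# e.g. 4000 ms of ready time in a 20 s realtime sample on a 2-vCPU VM
print(cpu_ready_percent(4000, num_vcpus=2))  # -> 10.0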

Memory - Memory is tricky because unused memory is wasted, so you want the utilization rates fairly high. Again, start in the OS and look at the physical memory usage. Windows starts looking for things to dump into the page file around 90% utilization and REALLY pushes into it at 95%, so try to keep that stuff around 85% at the high end. Page file swapping is the killer here and you want to avoid it. Many applications need a certain number of MB per session or whatever and can be easy to scale and size, but databases can be a pain in the rear end. A DB working correctly is always going to try to take your system to ~90% memory utilization because it's trying to cache DB pages into RAM for quicker access. The key here is cache hit rates on the DB: MSSQL gives that to you as a Performance Monitor stat, and Oracle has some sizing stuff that will tell you exactly how much memory you should be allocating to achieve such and such cache hit rates, but you'll most likely need to work with the VM owner or a DBA on that. If not properly configured, a DB instance will keep consuming all the physical RAM you give it until an entire copy of its databases is loaded there, which is a terrible way to use RAM.
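
If you want to watch the in-guest side of that without clicking through perfmon, a quick sketch with psutil (the same calls work on Windows and Linux) is enough to flag the ~85-90% zone:

import psutil  # pip install psutil

def memory_pressure():
    # Report physical memory and swap/page file usage so you can spot the
    # ~85-90% range where Windows starts leaning on the page file.
    vm = psutil.virtual_memory()
    sw = psutil.swap_memory()
    print("physical: %.1f%% used (%.1f of %.1f GiB)"
          % (vm.percent, vm.used / 2.0**30, vm.total / 2.0**30))
    print("swap/page file: %.1f%% used" % sw.percent)
    if vm.percent >= 90:
        print("warning: guest is probably paging; look at DB cache sizing or add vRAM")

memory_pressure()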

On the host side for memory, you're probably going to over-provision because that's how you really get VMware to pay for itself. With default settings, VMs get allocated RAM in large 2 MB pages, which will quickly get your hosts up to the warning threshold for memory usage. That's okay. Once you start flirting with that limit, the large pages that aren't being used start getting broken up into standard 4 KB pages which are hashed and deduped, along with some other zero-page reclaiming and compression stuff that really lets you stretch the memory you have. A word of warning: in the name of security, VMware added a unique salt into each VM's page hashing, which effectively breaks the dedupe feature, so you have to go into the advanced config of your hosts to tell it to use the same salt for everything so VMs can dedupe pages against each other instead of just themselves. The memory killer on hosts is again vswap (page file) thrashing, so watch that metric, and if you have sustained vswap churn then you need to add more RAM to the hosts or do a bunch of tuning. Also, set up resource pools (default High, Medium, Low should be fine) and make sure your dev stuff goes in Low, so that if something bad happens, like a host dying, memory and CPU pressure get pushed onto the less important stuff while your production systems are spared the brunt of the impact.
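
For reference, the per-VM salting is an advanced host option; as far as I know it's Mem.ShareForceSalting, and setting it to 0 brings back the old cross-VM sharing behaviour (double-check the KB for your exact build before flipping it):

esxcli system settings advanced set -o /Mem/ShareForceSalting -i 0
esxcli system settings advanced list -o /Mem/ShareForceSalting   # confirm it took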

Disk - The blue line on the resource monitor disk graphs is your IO queue depth, and when it's up, that means whatever is going on is waiting on disk. Some things are disk intensive and take time, but if disk performance isn't acceptable then you're going to need some upgrades or to get creative with SSD/RAM vmdk caching. Check storage latency for your datastores and make sure that Storage I/O Control is on for everything. If you have multiple pools of disk available then consider doing some storage vMotions to level the load out. Otherwise, upgrade upgrade upgrade.

Network - Check the bandwidth on the VM, check the total bandwidth on the host, and see if you're maxing anything out. If you have two VMs that love to talk to each other, consider sticking them together as a vApp or with an affinity rule so the network traffic gets processed internally through the host instead of hitting the switch/wire. If you need more bandwidth, add more NICs and aggregate links.

ILikeVoltron
May 17, 2003

I <3 spyderbyte!

evol262 posted:

I'd guess that they are, though I don't know for sure. Since so few customers who "need" openstack actually need openstack (there's a very high "I need openstack" to "gimme rhev" conversion ratio), it's hard to estimate.

I'm not involved on the operations side internally, but I'm positive we use OSP for our internal and external stuff backed by openstack, though. Dogfooding is a big deal, and the internal outage mailing lists are very explicit about "we're going down for 6 hours on $date because we're updating from OSP X to OSP Y"

Can you explain what needs to happen before it scales past 5 machines? I've got a deployment out there that's around 30 physical nodes and the thing runs like poo. If I launch 10 VMs the APIs start to fail and some of the VMs fail to launch. I've got almost 100% defaults except for password and ceph ports in the OSP hostgroup params.

evol262
Nov 30, 2010
#!/usr/bin/perl

ILikeVoltron posted:

Can you explain what needs to happen before it scales past 5 machines? I've got a deployment out there that's around 30 physical nodes and the thing runs like poo. If I launch 10 VMs the APIs start to fail and some of the VMs fail to launch. I've got almost 100% defaults except for password and ceph ports in the OSP hostgroup params.

It's hard to say without knowing what's failing.

This is almost always neutron. Are worker threads enabled? I don't think OSP enables them by default, and Neutron behaves like poo poo without them.

How many NICs do you have? There's not a ton of message queue or database traffic (not enough to worry about with 30 nodes), but your guests are segmented into vxlans on a different physical NIC, right?

How many identity/keystone instances are you running? OSP defaults to one. Running it on every node and fronting with haproxy is ideal.

Is glance mapped from some fast storage? How do your NICs look when you're doing this? If you time out waiting for neutron to create a port, look there. If you crap out because glance is taking 100% of your bandwidth, configure nova's instances_path to be a mountpoint. Ceph or gluster are ideal (so you can shove images in from glance-api on any node).

But you'd have to be a little more specific than "APIs start to fail" to get specific suggestions, and I'm not an expert in every component...
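
For what it's worth, the knobs above live in the service configs; roughly this (option names are from the Juno/Kilo-era docs and the paths are the usual packaged defaults, so treat it as a sketch, not gospel):

# /etc/neutron/neutron.conf
[DEFAULT]
api_workers = 4      # rule of thumb: something like one per core
rpc_workers = 4

# /etc/nova/nova.conf
[DEFAULT]
instances_path = /var/lib/nova/instances   # make this a ceph/gluster/fast-storage mount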

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

BangersInMyKnickers posted:

Always start at the OS level and see what the problem software is seeing. If you're on Windows, the resource monitor is laid out in four tabs (CPU, memory, disk, and network) and you should go through them one by one looking at what is causing a bottleneck. Is the CPU or a single vCPU pegged out? Is the disk thrashing all the time with the IO queue maxed out? Is it pushing a bunch of network traffic? Is physical memory usage maxed out and causing page file thrashing? Those are the key areas, and each application uses resources a little differently, so you have to take it on a case-by-case basis, but there are some general rules to follow.

This is all reasonable advice, but never "start at the operating system level" -- start with the user. As in, actually talk to them and figure out what they're seeing. Especially if you're talking to dev and QA teams, it's very likely that the system is underutilized or outright idle 95% of the time, but the 5% of the time they actually need the system (say, preparing a test build) it runs too slowly for people to get their job done. Most internal applications are bursty by nature. If you don't understand the access and workflow patterns, you're bound to make dumb decisions. Let the user help you help them. Build a timeline and make sure you're looking at the right things.

Secondly, don't try to monitor CPU or memory usage from inside a VM (swap is fine) unless you're using the VMware-specific performance counter extensions that are bundled with VMware Tools. The VM is guessing what its resource utilization is, but it really has no idea -- the entire point is that the VM has been abstracted away from the hardware, right? The hypervisor knows the difference between 100% of a CPU and contention with another VM, and it knows the difference between what looks to the OS like reasonable memory usage and something running up against a resource limit set on the hypervisor.

ILikeVoltron
May 17, 2003

I <3 spyderbyte!

evol262 posted:

It's hard to say without knowing what's failing.

This is almost always neutron. Are worker threads enabled? I don't think OSP enables them by default, and Neutron behaves like poo poo without them.

How many NICs do you have? There's not a ton of message queue or database traffic (not enough to worry about with 30 nodes), but your guests are segmented into vxlans on a different physical NIC, right?

How many identity/keystone instances are you running? OSP defaults to one. Running it on every node and fronting with haproxy is ideal.

Is glance mapped from some fast storage? How do your NICs look when you're doing this? If you time out waiting for neutron to create a port, look there. If you crap out because glance is taking 100% of your bandwidth, configure nova's instances_path to be a mountpoint. Ceph or gluster are ideal (so you can shove images in from glance-api on any node).

But you'd have to be a little more specific than "APIs start to fail" to get specific suggestions, and I'm not an expert in every component...

I think it's keystone authentication tokens, if I had to guess; the APIs that fail are either Cinder-related or Nova-related. The problem originally seemed to be the database backend, and Keystone is basically just a REST front end and a database. There have been some errors in the logs, but it's mostly "I can't do the thing I tried to do after 3 attempts"; google-fu doesn't pull up anything, and the OSP config is basically defaults, like I stated earlier.

There are 5 NICs per host, all running on UCS, with a dual 10-gig backend configured in A-B failover. I never looked into the Neutron worker threads; I'll be sure to check that out. Guest traffic isn't an issue yet, as it's only maybe 5-10 instances. This is a brand new install. And yeah, VXLAN for tenant networks.

So we're running 3x controllers, each with 128 gigs of RAM, 56 cores, and a pair of RAID1 7k SAS drives. Glance is mapped to a trio of ceph-backed storage nodes, each with 10 disks, or 30 total OSDs; I've seen this hit 7000 IOPS, and during testing (while trying to reproduce this problem) we hardly hit 1000 IOPS.

The NICs are all separated by type, so there's: Management, Storage Clustering, Cluster Management, Tenant, External, Public API, Storage.

evol262
Nov 30, 2010
#!/usr/bin/perl

ILikeVoltron posted:

I think it's keystone authentication tokens, if I had to guess; the APIs that fail are either Cinder-related or Nova-related. The problem originally seemed to be the database backend, and Keystone is basically just a REST front end and a database. There have been some errors in the logs, but it's mostly "I can't do the thing I tried to do after 3 attempts"; google-fu doesn't pull up anything, and the OSP config is basically defaults, like I stated earlier.

There are 5 NICs per host, all running on UCS, with a dual 10-gig backend configured in A-B failover. I never looked into the Neutron worker threads; I'll be sure to check that out. Guest traffic isn't an issue yet, as it's only maybe 5-10 instances. This is a brand new install. And yeah, VXLAN for tenant networks.

So we're running 3x controllers, each with 128 gigs of RAM, 56 cores, and a pair of RAID1 7k SAS drives. Glance is mapped to a trio of ceph-backed storage nodes, each with 10 disks, or 30 total OSDs; I've seen this hit 7000 IOPS, and during testing (while trying to reproduce this problem) we hardly hit 1000 IOPS.

The NICs are all separated by type, so there's: Management, Storage Clustering, Cluster Management, Tenant, External, Public API, Storage.

This is 5-10 instances total? I thought you meant "starting 10 instances within 3 seconds makes some API fall over", which can often be blamed on Neutron.

Are the errors consistent? Same service failing every time? Or from the same hosts? Or to the same controllers? That could give a jumping-off point, at least, but it sounds like your architecture started off right and isn't to blame (even though I like LACP better than A-B failover).

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

ILikeVoltron posted:

Can you explain what needs to happen before it scales past 5 machines? I've got a deployment out there that's around 30 physical nodes and the thing runs like poo. If I launch 10 VMs the APIs start to fail and some of the VMs fail to launch. I've got almost 100% defaults except for password and ceph ports in the OSP hostgroup params.

I don't know anything about OSP specifically, but I'm positive you have a database problem.

OpenStack uses locking and SELECT ... FOR UPDATE extensively when allocating resources (vCPUs, memory, fixed/floating IPs, etc.). This fails transactions frequently, especially in multi-writer MySQL configurations, because of the way that Galera processes transactions. Most of the OpenStack components are configured to retry when this condition results in a conflict, but under heavy lock contention they can just sit there and spin forever with none of the updates ever finishing. As a first step, if you're using MySQL, make sure you're directing all of your writes through a single MySQL server node. (You can use different primary writers for Nova and Neutron to help scale, if you need, but make sure all your Nova writes go through the same node and all your Neutron writes go through the same node.) Make sure your database is tuned for writes as tightly as it will go. Strongly consider running your Nova and Neutron databases from SSD, as this will make the commits much faster and decrease the incidence of this problem.
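
If your installer fronted Galera with HAProxy, the usual way to force a single writer is to mark every backend except one as a backup; a minimal sketch, with made-up names and addresses:

# haproxy.cfg (illustrative)
listen galera-writer
    bind 192.0.2.10:3306
    mode tcp
    option tcpka
    server controller1 192.0.2.11:3306 check
    server controller2 192.0.2.12:3306 check backup
    server controller3 192.0.2.13:3306 check backup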

Most of the database load from that SELECT ... FOR UPDATE issue is quota management. If you're running in a single-tenant organization, or you otherwise don't care about quotas, you can switch your Nova and Neutron configurations from the DbQuotaDriver to the NoopQuotaDriver, effectively disabling quotas.
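
On the Nova side that's a one-line change (driver path as of the Kilo-era docs; Neutron has an equivalent option, but double-check the exact driver name for your release):

# /etc/nova/nova.conf
[DEFAULT]
quota_driver = nova.quota.NoopQuotaDriver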

ILikeVoltron
May 17, 2003

I <3 spyderbyte!

Vulture Culture posted:

I don't know anything about OSP specifically, but I'm positive you have a database problem.

OpenStack uses locking and SELECT ... FOR UPDATE extensively when allocating resources (vCPUs, memory, fixed/floating IPs, etc.). This fails transactions frequently, especially in multi-writer MySQL configurations, because of the way that Galera processes transactions. Most of the OpenStack components are configured to retry when this condition results in a conflict, but under heavy lock contention they can just sit there and spin forever with none of the updates ever finishing. As a first step, if you're using MySQL, make sure you're directing all of your writes through a single MySQL server node. (You can use different primary writers for Nova and Neutron to help scale, if you need, but make sure all your Nova writes go through the same node and all your Neutron writes go through the same node.) Make sure your database is tuned for writes as tightly as it will go. Strongly consider running your Nova and Neutron databases from SSD, as this will make the commits much faster and decrease the incidence of this problem.

Most of the database load from that SELECT ... FOR UPDATE issue is quota management. If you're running in a single-tenant organization, or you otherwise don't care about quotas, you can switch your Nova and Neutron configurations from the DbQuotaDriver to the NoopQuotaDriver, effectively disabling quotas.

I'll be glad to check this - again, this is defaults from the OSP installer. It does have 3x MySQL boxes created as a pacemaker cluster. One guy seems to be the master because his process list hovers between 1000 and 1600 (and one of our first steps was to up this number from the default 1000). The other two mysqld process lists show 4-5, sleeping or waiting for binlogs, iirc. What we see is that the 'master' doesn't seem to be reporting any locks directly. The mysqld log doesn't look too ugly other than reporting it can't change the max number of open files: "[Warning] Could not increase number of max_open_files to more than 1024 (request: 1835)". I think at one point we turned on the slow log and found it very weird: it would go from 2-4 second queries straight into a 30 second query and then roll over (I think that's the HAProxy timeout for API requests).
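
For what it's worth, I think that open-files warning is just the service's ulimit; if mysqld is still started through a systemd unit (assuming EL7 MariaDB here; Pacemaker-managed Galera may want it set in the resource agent instead), the usual fix is a drop-in along these lines:

# /etc/systemd/system/mariadb.service.d/limits.conf
[Service]
LimitNOFILE=65535

# then: systemctl daemon-reload && systemctl restart mariadb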

evol262 posted:

This is 5-10 instances total? I thought you meant "starting 10 instances within 3 seconds makes some API fall over", which can often be blamed on Neutron.

Are the errors consistent? Same service failing every time? Or from the same hosts? Or to the same controllers? That could give a jumping-off point, at least, but it sounds like your architecture started off right and isn't to blame (even though I like LACP better than A-B failover).

So I can reproduce this easily when I start 10 instances from the CLI or GUI. Just select CentOS-whatever, launch 10 smalls with Cinder-backed storage, and boom, 9/10 times at least 2-3 error out and fail. I can usually get it to fail just doing straight nova-backed instances as well.

In terms of breaking the problem down a little more, I'm able to do this with two controllers running in the same cluster (I've taken one down, and changed which one I take down). As for which services fail, sometimes it's during block device mapping, sometimes not.

Also, just for clarity: when I ask about "scaling past 5 machines" I mean hosts. Like, a really basic single-controller install with 2-3 compute hosts.

Also, thank you both for raising some great questions.

BangersInMyKnickers
Nov 3, 2004

I have a thing for courageous dongles

Vulture Culture posted:

This is all reasonable advice, but never "start at the operating system level" -- start with the user. As in, actually talk to them and figure out what they're seeing. Especially if you're talking to dev and QA teams, it's very likely that the system is underutilized or outright idle 95% of the time, but the 5% of the time they actually need the system (say, preparing a test build) it runs too slowly for people to get their job done. Most internal applications are bursty by nature. If you don't understand the access and workflow patterns, you're bound to make dumb decisions. Let the user help you help them. Build a timeline and make sure you're looking at the right things.

Secondly, don't try to monitor CPU or memory usage from inside a VM (swap is fine) unless you're using the VMware-specific performance counter extensions that are bundled with VMware Tools. The VM is guessing what its resource utilization is, but it really has no idea -- the entire point is that the VM has been abstracted away from the hardware, right? The hypervisor knows the difference between 100% of a CPU and contention with another VM, and it knows the difference between what looks to the OS like reasonable memory usage and something running up against a resource limit set on the hypervisor.

You need to be looking inside the VM at the OS/application level to know if something is threaded properly to benefit from multiple cores. You can't tell unless you're down at that level, and it dramatically changes how you address the bottleneck.

ILikeVoltron
May 17, 2003

I <3 spyderbyte!

Vulture Culture posted:

I don't know anything about OSP specifically, but I'm positive you have a database problem.

So I was asked by support to run an innodb_status during or after the failures. It was 3011 lines of output, most of which looks like the following:
"MySQL thread id 981139, OS thread handle 0x7ef9eeefc700, query id 35323377 192.168.x.x keystone sleeping
---TRANSACTION 2241C14, not started"

Sometimes it's nova, sometimes it's keystone, sometimes it's neutron.

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

ILikeVoltron posted:

So I was asked by support to run an innodb_status during or after the failures. It was 3011 lines of output, most of which looks like the following:
"MySQL thread id 981139, OS thread handle 0x7ef9eeefc700, query id 35323377 192.168.x.x keystone sleeping
---TRANSACTION 2241C14, not started"

Sometimes it's nova, sometimes it's keystone, sometimes it's neutron.

Are you running keystone in UUID or PKI auth mode? UUID also slams the database with this crap.
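
If it's UUID and nothing is pruning the token table, it grows without bound and every authenticated call gets slower. The usual stopgap is cronning keystone-manage token_flush, something like this (path and schedule are just an example):

# /etc/cron.d/keystone-token-flush
0 * * * * keystone /usr/bin/keystone-manage token_flush >> /var/log/keystone/token-flush.log 2>&1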

Moey
Oct 22, 2010

I LIKE TO MOVE IT
Not sure if this has been asked yet, but has anyone migrated their VMFS datastores to VVols?

Potato Salad
Oct 23, 2014

nobody cares


If there's a better place to post this, tell me and I'll go elsewhere. Seeing as there isn't a server hardware thread, here goes:

I'm interested in booting a few ESXi hosts on Dell M520 modules from an internal USB flash drive / SD card (both are available). Purpose: free up the pair of server-grade SSDs for a flash read cache. Before I go and do so, however, I was wondering whether there are server-grade USB flash drives / SD cards that can take the internal temps of a server -- it gets hot back there. I'm not particularly concerned about read/write burnout, but of course higher reliability is a plus. My budget is pretty much unlimited.

Pile Of Garbage
May 28, 2007



Potato Salad posted:

If there's a better place to post this, tell me and I'll go elsewhere. Seeing as there isn't a server hardware thread, here goes:

I'm interested in booting a few ESXi hosts on Dell M520 modules from an internal USB flash drive / SD card (both are available). Purpose: free up the pair of server-grade SSDs for a flash read cache. Before I go and do so, however, I was wondering whether there are server-grade USB flash drives / SD cards that can take the internal temps of a server -- it gets hot back there. I'm not particularly concerned about read/write burnout, but of course higher reliability is a plus. My budget is pretty much unlimited.

Most vendors provide USB flash drives for exactly this purpose as an option (I've installed them in IBM HS22 blades before). Speak to Dell or your VAR.

Pile Of Garbage fucked around with this message at 17:36 on Aug 10, 2015

Potato Salad
Oct 23, 2014

nobody cares


cheese-cube posted:

Most vendors provide USB flash drives for exactly this purpose

Heh, no poo poo. Thanks!

bull3964
Nov 18, 2000

DO YOU HEAR THAT? THAT'S THE SOUND OF ME PATTING MYSELF ON THE BACK.


Been running for 2.5 years on the internal SD cards on our R620s without an incident. They are redundant too, so I don't have to worry about one crapping out.

some kinda jackal
Feb 25, 2003

 
 
OK this is going to be a really weird question but is there any way to get a VM to update notes on itself in vCenter?

Like I'd love to set up a script to parse apache's conf and update notes on which vhosts live on that server. I'm guessing you probably don't want a VM to mess with its own vmx, so theoretically I'm thinking something like a central server (or even the VCSA itself) pulling down apache confs from each web server then running PowerCLI scripts based on some parsing?

Dr. Arbitrary
Mar 15, 2006

Bleak Gremlin
VMware ESXi 6:
Is the easiest way to update virtual machine files still to do a storage vmotion?

Pile Of Garbage
May 28, 2007



Martytoof posted:

OK this is going to be a really weird question but is there any way to get a VM to update notes on itself in vCenter?

Like I'd love to set up a script to parse apache's conf and update notes on which vhosts live on that server. I'm guessing you probably don't want a VM to mess with its own vmx, so theoretically I'm thinking something like a central server (or even the VCSA itself) pulling down apache confs from each web server then running PowerCLI scripts based on some parsing?

Unless there's an off-the-shelf solution other than VRA, that's pretty much what I'd do, except in the opposite direction: have a script which runs on each server that publishes the Apache confs to a centralised location which is then parsed by a PowerCLI script on a separate server. The advantage of the push method is that you can make the script a part of your standard build so you don't need to re-jig the PowerCLI script when you deploy a new server.
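
If you'd rather not do the vCenter half in PowerCLI, the same thing works with pyVmomi; a rough sketch of the Notes-updating side (the hostnames, credentials, input file and its format are all stand-ins you'd swap for your own):

#!/usr/bin/env python
# Rough sketch: set each VM's vCenter Notes field from a "vmname: vhost1, vhost2"
# text file produced by whatever publishes the Apache confs. The pyVmomi calls
# (SmartConnect, CreateContainerView, ReconfigVM_Task) are real; everything
# else (paths, credentials, the input format) is made up for illustration.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

def load_notes(path="/srv/apache-confs/vhosts.txt"):
    notes = {}
    with open(path) as f:
        for line in f:
            if ":" in line:
                name, vhosts = line.split(":", 1)
                notes[name.strip()] = "vhosts: " + vhosts.strip()
    return notes

def main():
    ctx = ssl._create_unverified_context()  # lab only; use real certs in production
    si = SmartConnect(host="vcsa.example.lab", user="administrator@vsphere.local",
                      pwd="changeme", sslContext=ctx)
    try:
        content = si.RetrieveContent()
        view = content.viewManager.CreateContainerView(
            content.rootFolder, [vim.VirtualMachine], True)
        notes = load_notes()
        for vm in view.view:
            wanted = notes.get(vm.name)
            if wanted and vm.config and vm.config.annotation != wanted:
                vm.ReconfigVM_Task(vim.vm.ConfigSpec(annotation=wanted))
    finally:
        Disconnect(si)

if __name__ == "__main__":
    main()

Cron that on the utility box after the confs sync and the Notes fields stay current.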

Wicaeed
Feb 8, 2005
Started a new job recently at a shop running ~60 hosts (1000 VMs). Most of their hosts are running vSphere 5.1 and vCSA 5.5; however, they are 99% Linux (they don't even use AD for auth).

Is there still no way to run vSphere Update Manager without running Windows?

Pile Of Garbage
May 28, 2007



Wicaeed posted:

Started a new job recently at a shop running ~60 hosts (1000 VMs). Most of their hosts are running vSphere 5.1 and vCSA 5.5; however, they are 99% Linux (they don't even use AD for auth).

Is there still no way to run vSphere Update Manager without running Windows?

Do you mean using vSphere Update Manager to install OS updates on the Linux guests? If so, have a read of KB2018695 and check the compatibility guide to see whether your guest OSes are supported.

some kinda jackal
Feb 25, 2003

 
 

cheese-cube posted:

Unless there's an off-the-shelf solution other than VRA, that's pretty much what I'd do, except in the opposite direction: have a script which runs on each server that publishes the Apache confs to a centralised location which is then parsed by a PowerCLI script on a separate server. The advantage of the push method is that you can make the script a part of your standard build so you don't need to re-jig the PowerCLI script when you deploy a new server.

This sounds workable. I'm going to set up a utility server (since I have to anyway, NTP and such) so I'll throw powerCLI on there. I might be tempted to segregate it more if it wasn't just a lab environment.

Thx!

Tev
Aug 13, 2008

cheese-cube posted:

Do you mean using vSphere Update Manager to install OS updates on the Linux guests? If so, have a read of KB2018695 and check the compatibility guide to see whether your guest OSes are supported.

I think he means having the VUM service run on something other than a Windows machine. And the answer to that is "not yet" :(

adorai
Nov 2, 2002

10/27/04 Never forget
Grimey Drawer
This is probably a weird question, but does anyone know what bankonit.com's cloud runs on? We are acquiring an institution that uses them and I am interested in getting this answered without directly asking someone. Assuming they run openstack, is it relatively easy to export a VM from one openstack provider to another? Is it as easy as it would be with VMware?

evol262
Nov 30, 2010
#!/usr/bin/perl

adorai posted:

This is probably a weird question, but does anyone know what bankonit.com's cloud runs on? We are acquiring an institution that uses them and I am interested in getting this answered without directly asking someone. Assuming they run openstack, is it relatively easy to export a VM from one openstack provider to another? Is it as easy as it would be with VMware?

No. I mean, in theory libvirt can migrate it, but it really depends on the openstack release, and the libvirt vsphere driver is terrible (if they're running on vmware), so your hope should be that their VMs were built with config management. The glance images are easy to move. Cinder volumes are a little harder, but not much. Moving guest definitions is possible with some database hacking. Moving running guests is basically a no.
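
Moving an image is basically just download-and-reupload with the glance client (old glanceclient syntax from memory, and the ID/name here are made up):

# on the source cloud
glance image-download --file centos7.qcow2 <source-image-id>
# on the destination cloud
glance image-create --name centos7 --disk-format qcow2 --container-format bare --file centos7.qcow2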

DevNull
Apr 4, 2007

And sometimes is seen a strange spot in the sky
A human being that was given to fly

Since my Fling has not been published yet, I will plug this one:
https://labs.vmware.com/flings/esxi-embedded-host-client

My only problem with this being a Fling is that it should be a fully supported part of ESX. Engineers have been pushing for something like this for years internally, and that is the only reason this happened. It started off as two hackathon projects: one for just a webMKS console (VMRC), and another for basic control of your VM and host. They merged the two projects after the hackathon, and then spent a year fighting to get more support from the company. It finally got enough support that it was handed off to some other guys who cleaned it up and made it shippable as a Fling.

Tev
Aug 13, 2008

DevNull posted:

Since my Fling has not been published yet, I will plug this one:
https://labs.vmware.com/flings/esxi-embedded-host-client

My only problem with this being a Fling is that it should be a fully supported part of ESX. Engineers have been pushing for something like this for years internally, and that is the only reason this happened. It started off as two hackathon projects: one for just a webMKS console (VMRC), and another for basic control of your VM and host. They merged the two projects after the hackathon, and then spent a year fighting to get more support from the company. It finally got enough support that it was handed off to some other guys who cleaned it up and made it shippable as a Fling.

I'm excited about this one, and I'm assuming it's just a way for VMware to test it out in the wild and get some feedback before making it standard. Can you talk about the Fling you're working on?

Kachunkachunk
Jun 6, 2011
Feedback seems positive on Reddit as well as the Flings page linked above. Here's the Reddit thread: https://www.reddit.com/r/vmware/comments/3gpvea/html5_web_client_technical_preview/
Edit: I'm high. They're not talking about the UI at all. Just the flings page did. I need a break from the computer, methinks.

DevNull
Apr 4, 2007

And sometimes is seen a strange spot in the sky
A human being that was given to fly

Tev posted:

I'm excited about this one, and I'm assuming it's just a way for VMware to test it out in the wild and get some feedback before making it standard. Can you talk about the Fling you're working on?

I'm not sure what the plan is with it, but I know a lot of engineers would love to have it as part of the basic install. It makes it far easier to do our job if we only care about a single ESX system. There is also some politics involved, as the people who did the Fling are not part of the group that does the vCenter web client. I guess some people feel that this Fling is stepping on their toes. Then again, those same people came up with a solution for a single host that involved installing some Adobe crap on a Windows machine that connected to an ESX box. The VIB is only 2.2 MB, so I think getting that into the image wouldn't be too difficult.

As for my Fling. I will hold off another day or two until it is released, but I will say it has nothing to do with ESX.

1000101
May 14, 2003

BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY FRUITCAKE!

DevNull posted:

I'm not sure what the plan is with it, but I know a lot of engineers would love to have it as part of the basic install. It makes it far easier to do our job if we only care about a single ESX system. There is also some politics involved, as the people who did the Fling are not part of the group that does the vCenter web client. I guess some people feel that this Fling is stepping on their toes. Then again, those same people came up with a solution for a single host that involved installing some Adobe crap on a Windows machine that connected to an ESX box. The VIB is only 2.2 MB, so I think getting that into the image wouldn't be too difficult.

As for my Fling. I will hold off another day or two until it is released, but I will say it has nothing to do with ESX.

Oddly I can't seem to hit labs.vmware.com anymore without getting an access denied.

As for the toes of the webclient team, I think they need to be stomped on if not outright cut off. Even the 6.0 web client is a steaming pile of fragile poo poo that makes me have to force-kill my browser every hour or so.

DevNull
Apr 4, 2007

And sometimes is seen a strange spot in the sky
A human being that was given to fly

1000101 posted:

Oddly I can't seem to hit labs.vmware.com anymore without getting an access denied.

As for the toes of the webclient team, I think they need to be stomped on if not outright cut off. Even the 6.0 web client is a steaming pile of fragile poo poo that makes me have to force-kill my browser every hour or so.

No comment. I completely agree.

They also blocked Workstation from being able to browse datastores and register existing VMs from a datastore, for the same reason. Yeah, you know your code is poo poo when you have to worry about other software in your own company making you look bad.

1000101
May 14, 2003

BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY FRUITCAKE!

DevNull posted:

No comment. I completely agree.

They also blocked Workstation from being able to browse datastores and register existing VMs from a datastore, for the same reason. Yeah, you know your code is poo poo when you have to worry about other software in your own company making you look bad.

VMware has an event called PTAB where they bring product managers in to give us roadmaps for where various products are going and so we can give them feedback. They always seem genuinely surprised when we tell them (typically ~40-50 partners in the room) in no uncertain terms how frustrating the web client is to work with.

Someone's doing a whole lot of lying over there to someone.

mayodreams
Jul 4, 2003


Hello darkness,
my old friend

1000101 posted:

Oddly I can't seem to hit labs.vmware.com anymore without getting an access denied.

Try another browser. Chrome did this for me but Safari and Firefox worked.

DevNull
Apr 4, 2007

And sometimes is seen a strange spot in the sky
A human being that was given to fly

Tev posted:

Can you talk about the Fling you're working on?

https://labs.vmware.com/flings/vnc-server-and-vnc-client

OK, now I can talk about it. :)

Probably boring to most people, but a lot of us use this for working remotely. I spent the last week in Seattle connected to my Linux desktop in Palo Alto with it, and work from home with it all the time as well. I've also used it to play Skyrim over hotel wireless during a convention with lovely networking. It was showing some artifacts with the compression during that, but still pretty impressive. It works really well with low bandwidth. It's the same VNC code that your VMRC connection uses now. There was a bunch of politics that kept this from launching for 7 months, so we already have a bunch of plans for another release. The main things we want to add are a UI and support for a Mac client.

Potato Salad
Oct 23, 2014

nobody cares


DevNull posted:

https://labs.vmware.com/flings/vnc-server-and-vnc-client

OK, now I can talk about it. :)

Probably boring to most people, but a lot of us use this for working remotely. I spent the last week in Seattle connected to my Linux desktop in Palo Alto with it, and work from home with it all the time as well. I've also used it to play Skyrim over hotel wireless during a convention with lovely networking. It was showing some artifacts with the compression during that, but still pretty impressive. It works really well with low bandwidth. It's the same VNC code that your VMRC connection uses now. There was a bunch of politics that kept this from launching for 7 months, so we already have a bunch of plans for another release. The main things we want to add are a UI and support for a Mac client.

Holy balls, bookmarked & installed. I'll start using it tomorrow during a flight.

DevNull
Apr 4, 2007

And sometimes is seen a strange spot in the sky
A human being that was given to fly

Potato Salad posted:

Holy balls, bookmarked & installed. I'll start using it tomorrow during a flight.

It makes me happy to see this excitement. I hate to admit it, but the code shipped there is 7 months old. *sigh* Life with a huge company. A ton of code has changed since then that we could have shipped, but it would have taken another few weeks to deal with open source tracking, and we just wanted to get a 1.0 out the door. We are probably going to put a new version out in 3 months or so. While I no longer work on this code, I still use it and want to see it succeed. We now have a lot more support from management, so hopefully we can make this even better.

evol262
Nov 30, 2010
#!/usr/bin/perl
Will the next release be under an open license? This looks really nice, but I'm wondering about integrating the server with other stuff...
