Bob Morales
Aug 18, 2006


Just wear the fucking mask, Bob

I don't care how many people I probably infected with COVID-19 while refusing to wear a mask, my comfort is far more important than the health and safety of everyone around me!

Two years ago you would have gone with 6.5, as it had been out for a bit. Now I'm not so sure it would be a big deal.

It'd be like going to 6.7 instead of 7.0 right now.

What else are you using with it? I know my manager hates 6.7 because it doesn't integrate with our older Nimble array like it used to.


Internet Explorer
Jun 1, 2005





It's been a while, but isn't that just the HTML5 interface that doesn't have plug-ins? I thought the Nimble stuff should still work in the Flash client.

Bob Morales
Aug 18, 2006


Just wear the fucking mask, Bob

I don't care how many people I probably infected with COVID-19 while refusing to wear a mask, my comfort is far more important than the health and safety of everyone around me!

We upgraded our Nimble software, so the newest HTML5 plugin doesn't have the same functionality.

The Flash client is deprecated now, so that plugin doesn't get updated.

I guess it's more bumbles' fault, but whatever. You just have to pop into the Nimble interface to create datastores now instead of being able to do it in vSphere.

Moey
Oct 22, 2010

I LIKE TO MOVE IT
Yeah, I played with the Nimble vCenter plug-in years ago; I just use the web interface on the arrays to do that. It's not difficult at all.

SlowBloke
Aug 14, 2017

Wicaeed posted:

I have somewhat stupidly volunteered myself for a VMware upgrade Project of our aged vCenter 6.0 installation.

The advisor recommendations are saying we should install the 6.5.0 GA version of vCenter, but I don't see any mention of vCenter 6.7.

We do have some older hosts that can only go to 6.0.0 U2 version of VMware, however these should be compatible with vCenter 6.7 according to the VMware docs.

Am I missing anything super obvious as to why 6.7 wouldn't be showing as a recommended upgrade for us?

I do have a VMW support ticket created as well, just figured SA may have a quicker turnaround than VMware support nowadays...

There is no major issue going 6.0 to 6.7 (unless you went with an external PSC or you are running the vCenter install on Windows); there are issues going directly from 5.5 to 6.7 (you need an intermediate 6.5 step to keep host compatibility). The current VMware upgrade path can be found at https://www.vmware.com/resources/compatibility/sim/interop_matrix.php#upgrade&solution=2 (enter vCenter in the text field).
Also, never do an upgrade with a GA build; always wait for at least Update 1. The current VCSA 6.7 build is Update 3g.
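If it helps, here's a quick PowerCLI sanity check of what you're actually running before you map it against the interop matrix (the vCenter address is a placeholder, swap in your own):

code:
# Check current vCenter and host versions/builds before planning the upgrade.
# "vcsa.example.local" is a placeholder -- point it at your own vCenter.
Connect-VIServer -Server "vcsa.example.local"
$global:DefaultVIServer | Select-Object Name, Version, Build
Get-VMHost | Select-Object Name, Version, Build | Sort-Object Version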

TheFace
Oct 4, 2004

Fuck anyone that doesn't wanna be this beautiful

Wicaeed posted:

I have somewhat stupidly volunteered myself for a VMware upgrade Project of our aged vCenter 6.0 installation.

The advisor recommendations are saying we should install the 6.5.0 GA version of vCenter, but I don't see any mention of vCenter 6.7.

We do have some older hosts that can only go to 6.0.0 U2 version of VMware, however these should be compatible with vCenter 6.7 according to the VMware docs.

Am I missing anything super obvious as to why 6.7 wouldn't be showing as a recommended upgrade for us?

I do have a VMW support ticket created as well, just figured SA may have a quicker turnaround than VMware support nowadays...

Go to 6.7, and even if you go 6.5 you sure as poo poo shouldn't go GA! LATEST UPDATES ALWAYS!!!

greatapoc
Apr 4, 2005
We've just recently bought a bunch of new Dell R640 servers to replace our aging HP c7000 blade chassis hosting our Hyper-V infrastructure. We're experiencing poor network performance on the new servers, though, and I'm trying to track down the source of it. Our original environment was 2012 R2 hosts. We had to in-place upgrade these to 2016 to raise the functional level before adding the new 2019 Dell hosts to the cluster and then live migrating all the guests over to the new hardware. We've still got some VMs stuck on one of the old hosts, but we have a plan for that. Anyway, this gives us something to compare against for the poor network performance we're seeing.

Differences I can see between the old host and the new: jumbo frames disabled on the new, receive and transmit buffers set to 512 on the new and auto on the old. The old host has three 1Gb NICs in an LACP team to two Cisco switches; the new host has two 10Gb NICs in an LACP team to the same switches. An iperf test from a VM on the old host to my PC saturates the 1Gbps link into my PC, but the new host only pushes about 500Mbps. Transfer between VMs on the same host is around 2Gbps; transfer between VMs on different hosts is around 600Mbps. VMQ is enabled, and the NICs are Intel X710s on the Dells.

Not sure what else to mention. I was going to try changing the NIC team from LACP to "Switch Embedded Teaming" tonight in an outage window as well as enabling jumbo frames to see if it makes a difference. Does anyone have any ideas of things to look at?
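For anyone who wants the specifics, the settings above can be pulled on each host with something like this (adapter names like "NIC1"/"NIC2" are placeholders, and I haven't double-checked every column name):

code:
# Compare NIC and team settings between the old and new Hyper-V hosts.
Get-NetAdapter | Format-Table Name, InterfaceDescription, LinkSpeed, Status

# Jumbo packet, buffer and VMQ-related keywords on the physical NICs:
Get-NetAdapterAdvancedProperty -Name "NIC1", "NIC2" |
    Where-Object DisplayName -Match "Jumbo|Buffers|Virtual Machine Queues" |
    Format-Table Name, DisplayName, DisplayValue

# VMQ and RSS state per adapter:
Get-NetAdapterVmq | Format-Table Name, Enabled, BaseProcessorNumber, MaxProcessors
Get-NetAdapterRss | Format-Table Name, Enabled, Profile

# Teaming mode and load balancing algorithm on the LBFO team:
Get-NetLbfoTeam | Format-Table Name, TeamingMode, LoadBalancingAlgorithm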

BangersInMyKnickers
Nov 3, 2004

I have a thing for courageous dongles

The X710s are unstable trashfires, especially on bonded links. I spent almost a year chasing my tail on them, trying every combination of driver and firmware imaginable and going through every level of Dell and Intel support they could throw me at, until they finally relented and replaced the NDCs with QLogics that worked flawlessly. I will say that the X710 silicon is fast as hell and can pull off full line speed on 64k frames, but that doesn't mean diddly when the loving thing is getting reinitialized five times a second because it's panicking and flapping the link. You can improve the situation by getting jumbo frames back on and cranking up your ring buffers to max (I think it was 4096, maybe 8192? 512 is way too small for a virtual host).
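On your Hyper-V hosts that tuning looks roughly like this; the exact RegistryKeyword names depend on the driver, so check what Get-NetAdapterAdvancedProperty reports first (adapter names are placeholders, values are just where I'd start):

code:
$nics = "NIC1", "NIC2"

# Re-enable jumbo frames (the value is driver-specific; 9014 is common for Intel):
Set-NetAdapterAdvancedProperty -Name $nics -RegistryKeyword "*JumboPacket" -RegistryValue "9014"

# Crank the ring buffers up toward the driver maximum (4096 here):
Set-NetAdapterAdvancedProperty -Name $nics -RegistryKeyword "*ReceiveBuffers" -RegistryValue "4096"
Set-NetAdapterAdvancedProperty -Name $nics -RegistryKeyword "*TransmitBuffers" -RegistryValue "4096"

# Note: each change briefly resets the adapter, so do it in a window.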

The QLogics were slower (they could only handle 3-4Gbps on 1500-byte frames, but still hit line speed on jumbo; newer models are better), but they didn't panic, and my LACP interface uptime was measured in days instead of seconds, which was an acceptable tradeoff.

e: I can't believe they're still selling them, to be honest. The senior engineer at Dell I finally got through to said they were having no end of problems with them and were about ready to drop Intel NICs as an option. This was years ago; I figured they would have finally sorted out the issues or made a new version of a quad-port 10GbE interface. If you need to stick with Intel, try the X722 instead; hopefully they got their poo poo together. X710s are old, I was dealing with this poo poo back in 2014/2015.

BangersInMyKnickers fucked around with this message at 14:56 on May 6, 2020

greatapoc
Apr 4, 2005

BangersInMyKnickers posted:

The x710's are unstable trashfires, especially on bonded links.

Well that's just bloody great; we just took delivery of the things two weeks ago. Looks like we might have to go back to Dell and ask for something else. The thing is, we've never had any problem with it dropping the connection; the port-channels are rock solid and haven't missed a beat. The performance is just terrible.

Thanks Ants
May 21, 2004

#essereFerrari


Tell Dell that they need to ship you new mezzanine cards with different NICs on or you won't pay the invoice

Methanar
Sep 26, 2013

by the sex ghost
X710 is great because it puts ESXi closer to its natural state: PSOD.


Also, I once discovered that my non-LACP bonded interfaces had been flapping several times a second for about a year on one of our databases.

Also, sometimes the cards would just not be recognized as plugged in at all, even by the BMC, until you restarted 10 times.

gently caress datacenters

Methanar fucked around with this message at 00:53 on May 7, 2020

greatapoc
Apr 4, 2005
Touch wood it looks like I may have fixed it but I'm not sure exactly which part did it.

Removed the team and recreated it (still using LACP)
Enabled jumbo frames on both NICs
Increased receive and transmit buffers to 4096
Added reg key HKLM\SYSTEM\CurrentControlSet\Services\VMSMP\Parameters\TenGigVmqEnabled=1 (VMQ was already enabled on the VMs)
Rebooted host

iperf and file transfers are now flying like they should, but failover cluster manager is throwing up its hands, so I need to do more with that.
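(For reference, that registry change is just this in PowerShell; the path is the one quoted above, and the reboot was still part of the fix.)

code:
# The registry tweak from the list above: create/overwrite the DWORD value.
New-ItemProperty -Path "HKLM:\SYSTEM\CurrentControlSet\Services\VMSMP\Parameters" `
    -Name "TenGigVmqEnabled" -Value 1 -PropertyType DWord -Force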

Edit: Here's a capture from where I have it running on one of the Dells then live migrate it to the one I've just (hopefully) fixed.

[ 4] 2.00-3.00 sec 54.5 MBytes 456 Mbits/sec
[ 4] 3.00-4.00 sec 25.9 MBytes 217 Mbits/sec
[ 4] 4.00-5.00 sec 49.8 MBytes 419 Mbits/sec
[ 4] 5.00-6.00 sec 43.4 MBytes 364 Mbits/sec
[ 4] 6.00-7.00 sec 48.2 MBytes 405 Mbits/sec
[ 4] 7.00-8.00 sec 49.4 MBytes 414 Mbits/sec
[ 4] 8.00-9.00 sec 39.8 MBytes 334 Mbits/sec
[ 4] 9.00-12.49 sec 7.50 MBytes 18.1 Mbits/sec
[ 4] 12.49-12.49 sec 0.00 Bytes 0.00 bits/sec
[ 4] 12.49-12.49 sec 0.00 Bytes 0.00 bits/sec
[ 4] 12.49-13.00 sec 2.62 MBytes 43.1 Mbits/sec
[ 4] 13.00-14.00 sec 111 MBytes 929 Mbits/sec
[ 4] 14.00-15.00 sec 112 MBytes 936 Mbits/sec
[ 4] 15.00-16.00 sec 110 MBytes 919 Mbits/sec
[ 4] 16.00-17.00 sec 110 MBytes 922 Mbits/sec
[ 4] 17.00-18.00 sec 105 MBytes 880 Mbits/sec
[ 4] 18.00-18.58 sec 56.5 MBytes 813 Mbits/sec

greatapoc fucked around with this message at 01:28 on May 7, 2020

Potato Salad
Oct 23, 2014

nobody cares


loving nobody gets LACP right

Orchestrate static aggregation channels instead of counting on LACP to do it for you

Potato Salad
Oct 23, 2014

nobody cares


LACP exists to cause you more pain and suffering and production losses than the time it takes to set up channels/aggregation by hand, every single time

It is worth taking a moment to just write some logic to set up static aggregation orchestration with whatever tools you use to manage your switches and your compute

Potato Salad fucked around with this message at 11:11 on May 7, 2020

BangersInMyKnickers
Nov 3, 2004

I have a thing for courageous dongles

greatapoc posted:

Touch wood it looks like I may have fixed it but I'm not sure exactly which part did it.

Removed the team and recreated it (still using LACP)
Enabled jumbo frames on both NICs
Increased receive and transmit buffers to 4096
Added reg key HKLM\SYSTEM\CurrentControlSet\Services\VMSMP\Parameters\TenGigVmqEnabled=1 (VMQ was already enabled on the VMs)
Rebooted host

iperf and file transfers are now flying like they should, but failover cluster manager is throwing up its hands, so I need to do more with that.

Edit: Here's a capture from where I have it running on one of the Dells then live migrate it to the one I've just (hopefully) fixed.

[ 4] 2.00-3.00 sec 54.5 MBytes 456 Mbits/sec
[ 4] 3.00-4.00 sec 25.9 MBytes 217 Mbits/sec
[ 4] 4.00-5.00 sec 49.8 MBytes 419 Mbits/sec
[ 4] 5.00-6.00 sec 43.4 MBytes 364 Mbits/sec
[ 4] 6.00-7.00 sec 48.2 MBytes 405 Mbits/sec
[ 4] 7.00-8.00 sec 49.4 MBytes 414 Mbits/sec
[ 4] 8.00-9.00 sec 39.8 MBytes 334 Mbits/sec
[ 4] 9.00-12.49 sec 7.50 MBytes 18.1 Mbits/sec
[ 4] 12.49-12.49 sec 0.00 Bytes 0.00 bits/sec
[ 4] 12.49-12.49 sec 0.00 Bytes 0.00 bits/sec
[ 4] 12.49-13.00 sec 2.62 MBytes 43.1 Mbits/sec
[ 4] 13.00-14.00 sec 111 MBytes 929 Mbits/sec
[ 4] 14.00-15.00 sec 112 MBytes 936 Mbits/sec
[ 4] 15.00-16.00 sec 110 MBytes 919 Mbits/sec
[ 4] 16.00-17.00 sec 110 MBytes 922 Mbits/sec
[ 4] 17.00-18.00 sec 105 MBytes 880 Mbits/sec
[ 4] 18.00-18.58 sec 56.5 MBytes 813 Mbits/sec

Make sure you kick the tires on live migrations between hosts on the new 10Gb interfaces once you have them all up. The problems I was seeing didn't manifest until I was regularly moving traffic at multi-Gbps rates. Being choked down by the old server interfaces could be masking issues.


My guess is that jumbo frames were the thing that did the most good. The buffer size won't be an issue at those speeds, especially on an iperf test. Likely the old server NICs can't handle full Gbps rate at 1500 MTU, and that's where your bottleneck was. For all their faults, the X710 can handle 1Gbps at 1500 MTU.

BangersInMyKnickers fucked around with this message at 13:52 on May 7, 2020

1000101
May 14, 2003

BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY FRUITCAKE!

Potato Salad posted:

LACP exists to cause you more pain and suffering and production losses than the time it takes to set up channels/aggregation by hand, every single time

It is worth taking a moment to just write some logic to set up static aggregation orchestration with whatever tools you use to manage your switches and your compute

What are people actually getting wrong? There's not a whole lot to set up on LACP beyond timers (for which most platforms only have one option) and whether the interfaces are going to actively send LACP PDUs or not. I've probably seen more people get static link aggregation wrong, where maybe one side has the wrong load distribution algorithm set. I think the dvSwitch itself supports something like 26 different options, not all of which exist on all switching platforms.

That said, I almost never bother with link aggregation to hypervisors anymore. 10Gb is cheap, and source-based load distribution doesn't require upstream switch configuration.

Potato Salad
Oct 23, 2014

nobody cares


everyone's lacp implementation is awful, from "why do i have 1/n packet loss for n links" to literally unusable

BangersInMyKnickers
Nov 3, 2004

I have a thing for courageous dongles

1000101 posted:

What are people actually getting wrong? There's not a whole lot to set up on LACP beyond timers (for which most platforms only have one option) and whether the interfaces are going to actively send LACP PDUs or not. I've probably seen more people get static link aggregation wrong, where maybe one side has the wrong load distribution algorithm set. I think the dvSwitch itself supports something like 26 different options, not all of which exist on all switching platforms.

That said, I almost never bother with link aggregation to hypervisors anymore. 10Gb is cheap, and source-based load distribution doesn't require upstream switch configuration.

Last I fought with LACP on ESXi (years ago, when 6.5 was new), the host would only do slow PDUs, which makes for an unacceptable time to detect a fault and down the bad link. If you ran the host in passive mode and the switch in active with fast PDUs, the host would ignore the switch's parameters and continue to use slow PDUs/long timeouts, and that mismatch would cause the upstream switch to flap the link because PDU timeouts were being exceeded. The only solution was to run an esxcli script every time the host got rebooted to manually force the vDSwitch onto fast PDUs using unsupported commands. I came to find that this issue had been present since 4.x, when vDSwitches first came out, and they just didn't give a poo poo about fixing it.

With that said, once you manually forced it onto fast PDUs, everything functioned as expected and failovers were snappy and well within tolerances, even for the storage fabric, with some minimal stalling on the VMs.

Pile Of Garbage
May 28, 2007



Potato Salad posted:

everyone's lacp implementation is awful, from "why do i have 1/n packet loss for n links" to literally unusable

Care to elaborate? I ask because I've never once had issues configuring LACP between devices of any vendor from 2x1Gb up to 4x10Gb.

greatapoc
Apr 4, 2005

greatapoc posted:

Touch wood it looks like I may have fixed it but I'm not sure exactly which part did it.

Removed the team and recreated it (still using LACP)
Enabled jumbo frames on both NICs
Increased receive and transmit buffers to 4096
Added reg key HKLM\SYSTEM\CurrentControlSet\Services\VMSMP\Parameters\TenGigVmqEnabled=1 (VMQ was already enabled on the VMs)
Rebooted host

iperf and file transfers are now flying like they should, but failover cluster manager is throwing up its hands, so I need to do more with that.

So it looks like I spoke too soon on this one. Although iperf and file transfers were a lot better, once we moved SQL over to it, some applications couldn't connect and others were showing very slow queries. It appears we've fixed it by disabling RSC on the virtual switch.
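If anyone hits the same thing, the change was along these lines -- this assumes Server 2019 hosts, the vSwitch name is a placeholder, and it's worth double-checking the exact RSC parameter/property names on your build:

code:
# Check the software RSC state on the virtual switches:
Get-VMSwitch | Select-Object Name, SoftwareRscEnabled

# Disable software RSC on the virtual switch:
Set-VMSwitch -Name "vSwitch-LAN" -EnableSoftwareRsc $false

# Optionally check/disable RSC on the physical adapters as well:
Get-NetAdapterRsc
Disable-NetAdapterRsc -Name "NIC1", "NIC2"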

bad boys for life
Jun 6, 2003

by sebmojo
What are people doing for capacity planning for large VMware deployments? vROps doesn't seem to scale to our infrastructure size. I'm getting recommendations to look at Turbonomic and Veeam ONE, but I'm not sure if any of you have experience with SP-level VMware deployments and have something better.

TheFace
Oct 4, 2004

Fuck anyone that doesn't wanna be this beautiful

bad boys for life posted:

What are people doing for capacity planning for large VMware deployments? vROps doesn't seem to scale to our infrastructure size. I'm getting recommendations to look at Turbonomic and Veeam ONE, but I'm not sure if any of you have experience with SP-level VMware deployments and have something better.

How big is your deployment that vROps can't scale to it? If vROps can't do it, Veeam ONE is definitely going to fall on its face; I can't speak for Turbonomic.

Maneki Neko
Oct 27, 2000

bad boys for life posted:

What are people doing for capacity planning for large VMware deployments? vROps doesn't seem to scale to our infrastructure size. I'm getting recommendations to look at Turbonomic and Veeam ONE, but I'm not sure if any of you have experience with SP-level VMware deployments and have something better.

This seems like a fine question for an account manager/partner resources?

Potato Salad
Oct 23, 2014

nobody cares


Not gonna lie, you're the first person I've bumped into this year deploying SP infrastructure on VMware at "vROps isn't enough" scale

TheFace
Oct 4, 2004

Fuck anyone that doesn't wanna be this beautiful
Between extra-large nodes and clustering, vROps scales to something like 180,000-200,000 objects, I think. That is assuming you can actually deploy XL nodes (24 vCPUs is a bit of an ask, especially when VMware guidelines suggest being able to fit it in a single socket).

I've never seen Veeam ONE handle an environment that large without being horrifically slow.

abelwingnut
Dec 23, 2002


probably a pretty basic problem, but i'm having issues with virtualbox.

specifically, i'm trying to run an os x vm on a windows 10 machine. no matter what i do, i cannot escape the mouse or keyboard once i boot up the vm. i press the host key combination to decapture myself from the window, but nothing. i'm wondering if it has to do with the fact i'm using a usb wireless keyboard and a usb wireless mouse? i say that because whenever i type on the vm it is a bit...choppy and stuttered. so i'm wondering if there's some connection issue happening?

in any case, i've tried changing the host key combination from right control to right shift, and it makes no difference. like, once i start the vm and it loads, i can only access that vm. ctrl+alt+del can't get me out, nothing can.

any ideas? i also loaded both usb input devices in the vm's settings. really not sure what's going on.

wolrah
May 8, 2006
what?

abelwingnut posted:

probably a pretty basic problem, but i'm having issues with virtualbox.

specifically, i'm trying to run an os x vm on a windows 10 machine. no matter what i do, i cannot escape the mouse or keyboard once i boot up the vm. i press the host key combination to decapture myself from the window, but nothing. i'm wondering if it has to do with the fact i'm using a usb wireless keyboard and a usb wireless mouse? i say that because whenever i type on the vm it is a bit...choppy and stuttered. so i'm wondering if there's some connection issue happening?

in any case, i've tried changing the host key combination from right control to right shift, and it makes no difference. like, once i start the vm and it loads, i can only access that vm. ctrl+alt+del can't get me out, nothing can.

any ideas? i also loaded both usb input devices in the vm's settings. really not sure what's going on.

If you have attached the USB devices to the guest, that's why they're not able to escape: they're literally being disconnected from the host and passed through to the guest while it's running. Don't do that unless it's what you actually want (it only really makes sense if you're trying to do a "two workstations, one PC" style setup). USB passthrough is for other devices that need to appear as directly connected to the guest.
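If you'd rather check or undo it from the command line, something like this works (the VM name is a placeholder, and I'm going from memory on the exact VBoxManage output format):

code:
# List the VM's USB device filters (these are what grab the host keyboard/mouse):
& VBoxManage showvminfo "macos-vm" | Select-String -Pattern "USB", "Index", "Name"

# Remove the offending filter by index (filters are numbered from 0) so the
# keyboard and mouse stay attached to the host:
& VBoxManage usbfilter remove 0 --target "macos-vm"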

abelwingnut
Dec 23, 2002


ohhhhhh, got it. yeah, that worked--thanks.

now to try and figure out why this drat thing won't connect to icloud.

LongSack
Jan 17, 2003

Two questions about VMware and shared folders. For both, the host is Windows 10 and the guest is CentOS 7, running under the latest version of VMware Workstation Pro (updated today).

1. I have a shared folder “always enabled”. Yet, every time I boot the VM, the /mnt/hgfs directory is empty. I have to do VM->Options->Shared Folders->Disable->OK followed by VM->Options->Shared Folders->Always Enabled->OK to get the mount point back. Is there a fix for this?

2. The above shared folder is on an external USB drive. If I haven't accessed it in a while and then go back to it, the entire VM freezes solid for a minute or more. I suspect this is because the drive has spun down and the VM is waiting for it to spin back up, but that's just a guess. Any ideas how to fix or prevent this?

TIA

Not Wolverine
Jul 1, 2007
I'm not sure this is the best thread, but I think it's the most likely to have knowledgeable people: is Ragnar Locker a big problem? If you don't know, Ragnar Locker is a type of ransomware released last month; allegedly it installs VirtualBox, runs a Windows XP VM, and then uses the VM to read, write, and encrypt all of your files. Supposedly this is extremely difficult to detect because all of the file access is done by VirtualBox instead of a Ragnar.exe process.

What I want to know is: how are file permissions typically handled on a virtual machine? Assume I want to set up a VM and play some MP3s stored on the host machine's drive. Does the host only see that VirtualBox is reading my MP3s, or is it possible for the host to see that napster.exe inside the VM is what's actually reading them? Can I use file permissions so that only the owner on the host can read my MP3s? Could I configure permissions so that only the owner on the host and a VM user can read the files, perhaps by cloning the user ID between the host and the virtual machine? I think Ragnar Locker is likely going to require some changes to how VMs handle permissions and file access, but is this a problem that will require significant changes that make VMs more difficult to configure?

Thanks Ants
May 21, 2004

#essereFerrari


If a VM can escape VirtualBox and access system files outside of any folders shared with that VM, then that's a horrific bug, surely

Zorak of Michigan
Jun 10, 2006

It installs the malware payload VM and gives it access to all your drives, so the sandbox isn't really a barrier.

Not Wolverine
Jul 1, 2007

Zorak of Michigan posted:

It installs the malware payload VM and gives it access to all your drives, so the sandbox isn't really a barrier.

The VM doesn't make things significantly worse; my understanding is it just makes the malware harder to detect, since instead of seeing a malware process accessing your files, you just see the VirtualBox process accessing them, which most IT departments might view as a normal, non-threatening process.

Wibla
Feb 16, 2011

Most Oracle products should probably be marked as a threatening process.

BangersInMyKnickers
Nov 3, 2004

I have a thing for courageous dongles

they threaten my personal well-being that's for sure

GrandMaster
Aug 15, 2004
laidback
Hey all,
I was hoping to get some advice on what people are doing with regard to DR testing in HCI/stretched vSAN environments?

Our organisation is large government so there's a ton of compliance regulation for DR/BC testing, but the old controlled failover methods we used to use aren't really applicable anymore due to stretched vSAN.

Powering off a VM, moving it to the alternate site and powering it back on doesn't actually test anything since we run active/active DCs. We already know the network segments work and VMs can run anywhere.

And since it's a mission critical environment with risk to life, we can't just drop an entire DC to test the witness/failover functionality. Outage windows are strictly controlled, and we need to test on an app by app basis.

Anyway, if anyone has any ideas here I'm really interested, since I'm drawing a blank as to how else we can effectively "prove" the DR actually works.

Zorak of Michigan
Jun 10, 2006

We struggle with similar problems. One strategy we tried with mixed success was to replicate to additional systems inside an island of containment, which we could then isolate for test purposes. We kept finding new endpoints we needed inside the IOC and vowing to get it right next time.

GrandMaster
Aug 15, 2004
laidback
We can't run isolated; the typical strategy is to fail over, run from the DR site for a week, then fail back and get a big green tick from the auditors.
It's easy in legacy environments, because we have *steps* we can prove we completed, and there's an outage. A live cross-site vMotion looks like nothing even happened, haha.

I completely understand it's a process issue, not a technical one, but government moves slooooow and auditors are insistent.

TheFace
Oct 4, 2004

Fuck anyone that doesn't wanna be this beautiful

GrandMaster posted:

Hey all,
I was hoping to get some advice on what people are doing with regard to DR testing in HCI/stretched vSAN environments?

Our organisation is large government so there's a ton of compliance regulation for DR/BC testing, but the old controlled failover methods we used to use aren't really applicable anymore due to stretched vSAN.

Powering off a VM, moving it to the alternate site and powering it back on doesn't actually test anything since we run active/active DCs. We already know the network segments work and VMs can run anywhere.

And since it's a mission critical environment with risk to life, we can't just drop an entire DC to test the witness/failover functionality. Outage windows are strictly controlled, and we need to test on an app by app basis.

Anyway, if anyone has any ideas here I'm really interested, since I'm drawing a blank as to how else we can effectively "prove" the DR actually works.

I know this may not be possible due to budget constraints, but if you have a mission-critical environment with risk to life, there really should be a (typically smaller) dev environment configured similarly enough to test things on. This goes beyond the failure conditions of a stretched vSAN cluster, as you should be testing updates, changes to apps, etc. all in a dev environment.

Testing vSAN stretched clustering is fairly easy if you have an environment that you're actually willing to test on:
https://storagehub.vmware.com/t/vsan-6-7-u-3-proof-of-concept-guide-1/stretched-cluster-with-wts-failover-scenarios-1/


GrandMaster
Aug 15, 2004
laidback
Yeah, it's a very good point. Obviously the failover stuff was all tested during commissioning, and we have multiple non-prod environments for all the critical apps, but it's all on the same hardware. The higher-ups wanted ONE BIG CLUSTER and completely disregarded advice to split the non-prod environments onto physically separate hardware like we used to have pre-HCI.

On another note, has anyone had a positive experience with the VxRail VCF upgrades? It was sold to us as "one click will upgrade the entire environment", but in reality it's "one support case will be required for every single component, which inevitably fails during the upgrade process". From memory it's taken about 15 support cases and 3 months for our management cluster upgrade, and it's not even finished yet.

God I hate VxRail haha
