Bob Morales
Aug 18, 2006


Just wear the fucking mask, Bob

I don't care how many people I probably infected with COVID-19 while refusing to wear a mask, my comfort is far more important than the health and safety of everyone around me!



Two years ago you would have gone with 6.5, as it had been out for a bit. Now I'm not so sure it would be a big deal.

It'd be like going to 6.7 instead of 7.0 right now.

What else are you using with it? I know my manager hates 6.7 because it doesn't integrate with our Nimble (which is older) like it used to.


Internet Explorer
Jun 1, 2005


It's been a while, but isn't that just the HTML5 interface that doesn't have plug-ins? I thought the Nimble stuff should still work in the Flash client.

Bob Morales
Aug 18, 2006


Just wear the fucking mask, Bob

I don't care how many people I probably infected with COVID-19 while refusing to wear a mask, my comfort is far more important than the health and safety of everyone around me!



We upgraded our Nimble software, so the newest HTML5 plugin doesn't have the same functionality.

The Flash client is deprecated now, so that plugin doesn't get updated.

I guess it's more bumbles fault but whatever. You just have to pop into the Nimble to create datastores now instead of being able to do it in vSphere.

Moey
Oct 22, 2010

I LIKE TO MOVE IT


Yeah, I played with the Nimble vCenter plug in years ago, I just use the web interface on the arrays to do that. It's not difficult at all.

SlowBloke
Aug 14, 2017


Wicaeed posted:

I have somewhat stupidly volunteered myself for a VMware upgrade Project of our aged vCenter 6.0 installation.

The advisor recommendations are saying we should install the 6.5.0 GA version of vCenter, but I don't see any mention of vCenter 6.7.

We do have some older hosts that can only go to 6.0.0 U2 version of VMware, however these should be compatible with vCenter 6.7 according to the VMware docs.

Am I missing anything super obvious as to why 6.7 wouldn't be showing as a recommended upgrade for us?

I do have a VMW support ticket created as well, just figured SA may have a quicker turnaround than VMware support nowadays...

There is no major issue going 6.0 to 6.7 (unless you went with an external PSC or you are running the vCenter install on Windows); there are issues going directly from 5.5 to 6.7 (you need an intermediate 6.5 step to keep host compatibility). The current VMware upgrade path can be found at https://www.vmware.com/resources/compatibility/sim/interop_matrix.php#upgrade&solution=2 (insert vCenter in the text field).
Also, never do an upgrade with a GA build; always go at least Update 1. The current VCSA 6.7 build is Update 3g.

TheFace
Oct 4, 2004

Fuck anyone that doesn't wanna be this beautiful


Wicaeed posted:

I have somewhat stupidly volunteered myself for a VMware upgrade Project of our aged vCenter 6.0 installation.

The advisor recommendations are saying we should install the 6.5.0 GA version of vCenter, but I don't see any mention of vCenter 6.7.

We do have some older hosts that can only go to 6.0.0 U2 version of VMware, however these should be compatible with vCenter 6.7 according to the VMware docs.

Am I missing anything super obvious as to why 6.7 wouldn't be showing as a recommended upgrade for us?

I do have a VMW support ticket created as well, just figured SA may have a quicker turnaround than VMware support nowadays...

Go to 6.7, and even if you go 6.5 you sure as poo poo shouldn't go GA! LATEST UPDATES ALWAYS!!!

greatapoc
Apr 4, 2005


We've just recently bought a bunch of new Dell R640 servers to replace our aging HP c7000 blade chassis hosting our Hyper-V infrastructure. We're experiencing poor network performance on the new servers, though, and I'm trying to track down the source of it. Our original environment was 2012 R2 hosts. We had to do an in-place upgrade of these to 2016 to raise the functional level before adding the new 2019 Dell hosts to the cluster and then live migrating all the guests over to the new hardware. We've still got some VMs stuck on one of the old hosts, but we have a plan around that. Anyway, this gives us something to work with and compare for the poor network issues we're seeing.

Differences I can see between the old host and the new: jumbo frames disabled on the new, receive and transmit buffers set to 512 on the new and auto on the old. The old host has three 1Gb NICs in a LACP team to two Cisco switches; the new host has two 10Gb NICs in a LACP team to the same switches. An iperf test from a VM on the old host to my PC saturates the 1Gbps link into my PC, but the new host only pushes about 500Mbps. Transfer between VMs on the same host is around 2Gbps; transfer between VMs on different hosts is around 600Mbps. VMQ is enabled, and the NICs are Intel X710 on the Dells.

Not sure what else to mention. I was going to try changing the NIC team from LACP to "Switch Embedded Teaming" tonight in an outage window as well as enabling jumbo frames to see if it makes a difference. Does anyone have any ideas of things to look at?
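Before the outage window it might be worth dumping the relevant settings on an old and a new host for a side-by-side diff. A rough PowerShell sketch of that check (the adapter name is a placeholder, and the `RegistryKeyword` names can vary slightly by driver):

```powershell
# Placeholder adapter name -- substitute the actual X710 port name on the host
$nic = "SLOT 2 Port 1"

# Jumbo frames and ring buffer sizes as the driver currently reports them
Get-NetAdapterAdvancedProperty -Name $nic `
    -RegistryKeyword "*JumboPacket","*ReceiveBuffers","*TransmitBuffers" |
    Format-Table DisplayName, DisplayValue

# VMQ state and CPU assignment per adapter
Get-NetAdapterVmq -Name $nic |
    Format-Table Name, Enabled, BaseProcessorNumber, MaxProcessors

# RSS settings, which also matter at 10Gb
Get-NetAdapterRss -Name $nic | Format-List Name, Enabled, Profile
```

Running the same three cmdlets against the old 2016 host gives a direct comparison of what actually differs, rather than guessing from the GUI.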

BangersInMyKnickers
Nov 3, 2004

I have a thing for courageous dongles



The X710s are unstable trashfires, especially on bonded links. I spent almost a year chasing my tail on them, trying every combination of driver and firmware imaginable and going through every level of Dell and Intel support they could throw me at, until they finally relented and replaced the NDCs with QLogics that worked flawlessly. I will say that the X710 silicon is fast as hell and can pull off full line speed on 64k frames, but that doesn't mean diddly when the loving thing is getting reinitialized five times a second because it's panicking and flapping the link. You can improve the situation by getting jumbo frames back on and cranking up your ring buffers to max (I think it was 4096, maybe 8192? 512 is way too small for a virtual host).

The QLogics were slower (they could only handle 3-4Gbps on 1500 frames, but still hit line speed on jumbo; newer models are better), but they didn't panic, and my LACP interface uptime was measured in days instead of seconds, which was an acceptable tradeoff.

e: I can't believe they're still selling them, to be honest. The senior engineer at Dell I finally got through to said they were having no end of problems with them and were about ready to drop Intel NICs as an option. This was years ago; I figured they would have finally sorted out the issues or made a new version of a quad-port 10GbE interface. If you need to stick with Intel, try the X722 instead; hopefully they got their poo poo together. X710s are old, I was dealing with this poo poo back in 2014/2015

BangersInMyKnickers fucked around with this message at 13:56 on May 6, 2020

greatapoc
Apr 4, 2005


BangersInMyKnickers posted:

The x710's are unstable trashfires, especially on bonded links.

Well that's just bloody great, we just took delivery of the things two weeks ago. Looks like we might have to go back to Dell and ask for something else. The thing is we've never had any problem with it dropping the connection, the port-channels are rock solid and haven't missed a beat. The performance is just terrible.

Thanks Ants
May 21, 2004

Bless You Ants, Blants



Tell Dell that they need to ship you new mezzanine cards with different NICs on or you won't pay the invoice

Methanar
Sep 26, 2013
ASK ME ABOUT NOT TIPPING DELIVERY DRIVERS, OR ABOUT MY DIET OF CANNED BABY CORN AND CHICKEN NUGGETS

X710 is great because it puts ESXi closer to its natural state: PSOD


Also I once discovered that my non-lacp bonded interfaces had been flapping several times a second for like a year on one of our databases.

Also sometimes the cards would just not be recognized as plugged in at all, even by the BMC until you restarted 10 times.

gently caress datacenters

Methanar fucked around with this message at 23:53 on May 6, 2020

greatapoc
Apr 4, 2005


Touch wood it looks like I may have fixed it but I'm not sure exactly which part did it.

Removed the team and recreated it (still using LACP)
Enabled jumbo frames on both NICs
Increased receive and transmit buffers to 4096
Added reg key HKLM\SYSTEM\CurrentControlSet\Services\VMSMP\Parameters\TenGigVmqEnabled=1 (VMQ was already enabled on the VMs)
Rebooted host

iperf and file transfers are now flying like they should, but failover cluster manager is throwing up its hands, so I need to do more with that.

Edit: Here's a capture from where I have it running on one of the Dells then live migrate it to the one I've just (hopefully) fixed.

[ 4] 2.00-3.00 sec 54.5 MBytes 456 Mbits/sec
[ 4] 3.00-4.00 sec 25.9 MBytes 217 Mbits/sec
[ 4] 4.00-5.00 sec 49.8 MBytes 419 Mbits/sec
[ 4] 5.00-6.00 sec 43.4 MBytes 364 Mbits/sec
[ 4] 6.00-7.00 sec 48.2 MBytes 405 Mbits/sec
[ 4] 7.00-8.00 sec 49.4 MBytes 414 Mbits/sec
[ 4] 8.00-9.00 sec 39.8 MBytes 334 Mbits/sec
[ 4] 9.00-12.49 sec 7.50 MBytes 18.1 Mbits/sec
[ 4] 12.49-12.49 sec 0.00 Bytes 0.00 bits/sec
[ 4] 12.49-12.49 sec 0.00 Bytes 0.00 bits/sec
[ 4] 12.49-13.00 sec 2.62 MBytes 43.1 Mbits/sec
[ 4] 13.00-14.00 sec 111 MBytes 929 Mbits/sec
[ 4] 14.00-15.00 sec 112 MBytes 936 Mbits/sec
[ 4] 15.00-16.00 sec 110 MBytes 919 Mbits/sec
[ 4] 16.00-17.00 sec 110 MBytes 922 Mbits/sec
[ 4] 17.00-18.00 sec 105 MBytes 880 Mbits/sec
[ 4] 18.00-18.58 sec 56.5 MBytes 813 Mbits/sec
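For anyone hitting the same thing later, those steps translate roughly to the following PowerShell. Team and NIC names are placeholders, the 9014 jumbo value is driver-dependent, and the registry path is the one from the post:

```powershell
# Recreate the LACP team (names here are hypothetical)
Remove-NetLbfoTeam -Name "Team1" -Confirm:$false
New-NetLbfoTeam -Name "Team1" -TeamMembers "NIC1","NIC2" `
    -TeamingMode Lacp -LoadBalancingAlgorithm Dynamic -Confirm:$false

# Enable jumbo frames and max out the ring buffers on both members
# (keyword names and allowed values vary by driver)
foreach ($nic in "NIC1","NIC2") {
    Set-NetAdapterAdvancedProperty -Name $nic -RegistryKeyword "*JumboPacket"     -RegistryValue 9014
    Set-NetAdapterAdvancedProperty -Name $nic -RegistryKeyword "*ReceiveBuffers"  -RegistryValue 4096
    Set-NetAdapterAdvancedProperty -Name $nic -RegistryKeyword "*TransmitBuffers" -RegistryValue 4096
}

# The registry key from the post, then a reboot to pick everything up
Set-ItemProperty -Path "HKLM:\SYSTEM\CurrentControlSet\Services\VMSMP\Parameters" `
    -Name "TenGigVmqEnabled" -Value 1 -Type DWord
Restart-Computer
```

This is a sketch of the sequence described above, not a tested runbook; recreating the team will drop host networking, so it belongs in the same outage window.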

greatapoc fucked around with this message at 00:28 on May 7, 2020

Potato Salad
Oct 23, 2014

Nobody Cares




loving nobody gets LACP right

Orchestrate static aggregation channels instead of counting on LACP to do it for you

Potato Salad
Oct 23, 2014

Nobody Cares




Lacp exists to cause you more pain and suffering and production losses than the time it takes to set up channels/aggregation up by hand, every single time

It is worth taking a moment to just write some logic to set up static aggregation orchestration with whatever tools you use to manage your switches and your compute

Potato Salad fucked around with this message at 10:11 on May 7, 2020

BangersInMyKnickers
Nov 3, 2004

I have a thing for courageous dongles



greatapoc posted:

Touch wood it looks like I may have fixed it but I'm not sure exactly which part did it.

Removed the team and recreated it (still using LACP)
Enabled jumbo frames on both NICs
Increased receive and transfer buffers to 4096
Added reg key HKLM\SYSTEM\CurrentControlSet\Services\VMSMP\Parameters\TenGigVmqEnabled=1 (VMQ was already enabled on the VMs)
Rebooted host

iperf and file transfers are now flying like they should but failover cluster manager is throwing up it's hands so I need to do more with that.

Edit: Here's a capture from where I have it running on one of the Dells then live migrate it to the one I've just (hopefully) fixed.

[ 4] 2.00-3.00 sec 54.5 MBytes 456 Mbits/sec
[ 4] 3.00-4.00 sec 25.9 MBytes 217 Mbits/sec
[ 4] 4.00-5.00 sec 49.8 MBytes 419 Mbits/sec
[ 4] 5.00-6.00 sec 43.4 MBytes 364 Mbits/sec
[ 4] 6.00-7.00 sec 48.2 MBytes 405 Mbits/sec
[ 4] 7.00-8.00 sec 49.4 MBytes 414 Mbits/sec
[ 4] 8.00-9.00 sec 39.8 MBytes 334 Mbits/sec
[ 4] 9.00-12.49 sec 7.50 MBytes 18.1 Mbits/sec
[ 4] 12.49-12.49 sec 0.00 Bytes 0.00 bits/sec
[ 4] 12.49-12.49 sec 0.00 Bytes 0.00 bits/sec
[ 4] 12.49-13.00 sec 2.62 MBytes 43.1 Mbits/sec
[ 4] 13.00-14.00 sec 111 MBytes 929 Mbits/sec
[ 4] 14.00-15.00 sec 112 MBytes 936 Mbits/sec
[ 4] 15.00-16.00 sec 110 MBytes 919 Mbits/sec
[ 4] 16.00-17.00 sec 110 MBytes 922 Mbits/sec
[ 4] 17.00-18.00 sec 105 MBytes 880 Mbits/sec
[ 4] 18.00-18.58 sec 56.5 MBytes 813 Mbits/sec

Make sure you kick the tires on live migrations between hosts on the new 10gig interfaces once you have them all up. The problems I was seeing didn't manifest until I was regularly moving traffic at multi-gbps rates. Being choked down by the old server interfaces could be masking issues.


My guess is that jumbo frames were the thing that did the most good. The buffer size won't be an issue at those speeds, especially on an iperf test. Likely the old server NICs can't handle full gigabit rate on 1500 MTU and that's where your bottleneck was. For all their faults, the X710 can handle 1Gbps at 1500 MTU.

BangersInMyKnickers fucked around with this message at 12:52 on May 7, 2020

1000101
May 14, 2003

BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY FRUITCAKE!

Potato Salad posted:

Lacp exists to cause you more pain and suffering and production losses than the time it takes to set up channels/aggregation up by hand, every single time

It is worth taking a moment to just write some logic to set up static aggregation orchestration with whatever tools you use to manage your switches and your compute

What are people actually getting wrong? There's not a whole lot to set up on LACP beyond timers (for which most platforms only have one option) and whether the interfaces are going to actively send LACP PDUs or not. I've probably seen more people get static link aggregation wrong, where maybe one side has the wrong load distribution algorithm set. I think the dvswitch itself supports something like 26 different options, of which not all exist on all switching platforms.

That said, I almost never bother with link aggregation to hypervisors anymore. 10 gig is cheap and source-based load distribution doesn't require upstream switch configuration.

Potato Salad
Oct 23, 2014

Nobody Cares




everyone's lacp implementation is awful, from "why do i have 1/n packet loss for n links" to literally unusable

BangersInMyKnickers
Nov 3, 2004

I have a thing for courageous dongles



1000101 posted:

What are people actually getting wrong? There's not a whole lot to set up on LACP beyond timers (for which most platforms only have one option) and whether the interfaces are going to actively send LACP PDUs or not. I've probably seen more people get static link aggregation wrong, where maybe one side has the wrong load distribution algorithm set. I think the dvswitch itself supports something like 26 different options, of which not all exist on all switching platforms.

That said, I almost never bother with link aggregation to hypervisors anymore. 10 gig is cheap and source-based load distribution doesn't require upstream switch configuration.

Last I fought with LACP on ESXi (years ago, when 6.5 was new), the host would only do slow PDUs, which makes for an unacceptable time to detect a fault and down the bad link. If you ran the host in passive mode and the switch in active with fast PDUs, the host would ignore the switch's parameters and continue to use slow PDUs/long timeouts, and that mismatch would cause the upstream switch to flap the link because PDU timeouts were being exceeded. The only solution was to run an esxcli script every time the host got rebooted to manually force the vdswitch on to fast PDUs using unsupported commands. I came to find that this issue has been present since 4.x, when vdswitches first came out, and they just didn't give a poo poo about fixing it.

With that said, once you manually forced it on to fast PDUs, everything functioned as expected and failovers were snappy and well within tolerances, even for storage fabric, with some minimal stalling on the VMs

Pile Of Garbage
May 28, 2007





Potato Salad posted:

everyone's lacp implementation is awful, from "why do i have 1/n packet loss for n links" to literally unusable

Care to elaborate? I ask because I've never once had issues configuring LACP between devices of any vendor from 2x1Gb up to 4x10Gb.

greatapoc
Apr 4, 2005


greatapoc posted:

Touch wood it looks like I may have fixed it but I'm not sure exactly which part did it.

Removed the team and recreated it (still using LACP)
Enabled jumbo frames on both NICs
Increased receive and transfer buffers to 4096
Added reg key HKLM\SYSTEM\CurrentControlSet\Services\VMSMP\Parameters\TenGigVmqEnabled=1 (VMQ was already enabled on the VMs)
Rebooted host

iperf and file transfers are now flying like they should but failover cluster manager is throwing up it's hands so I need to do more with that.

So it looks like I spoke too soon on this one. Although iperf and file transfers were a lot better, once we moved SQL over to it, some applications couldn't connect to it and others were showing very slow queries. It appears we've fixed it by disabling RSC on the virtual switch.
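For reference, that RSC change can be sketched in PowerShell. On Server 2019, software RSC is on per virtual switch by default; the switch name below is a placeholder:

```powershell
# Check software RSC state on each virtual switch
Get-VMSwitch | Format-Table Name, SoftwareRscEnabled

# Disable it on the affected switch ("External" is a placeholder name)
Set-VMSwitch -Name "External" -EnableSoftwareRsc $false
```

The change takes effect without a reboot, which makes it an easy thing to toggle when chasing weird latency on specific workloads like SQL.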

bad boys for life
Jun 6, 2003

Nothing happens to anybody which he is not fitted by nature to bear.

What are people doing for capacity planning for large VMware deployments? vROps doesn't seem to scale to our infrastructure size. I'm getting recommendations to look at Turbonomic and Veeam ONE, but I'm not sure if any of you have experience at SP-level VMware deployments and have something better.

TheFace
Oct 4, 2004

Fuck anyone that doesn't wanna be this beautiful


bad boys for life posted:

What are people doing for capacity planning for large VMware deployments? vROps doesn't seem to scale to our infrastructure size. I'm getting recommendations to look at Turbonomic and Veeam ONE, but I'm not sure if any of you have experience at SP-level VMware deployments and have something better.

How big is your deployment that vROps can't scale that large? If vROps can't do it, Veeam ONE is definitely going to fall on its face; I can't speak for Turbonomic.

Maneki Neko
Oct 27, 2000



bad boys for life posted:

What are people doing for capacity planning for large VMware deployments? vROps doesn't seem to scale to our infrastructure size. I'm getting recommendations to look at Turbonomic and Veeam ONE, but I'm not sure if any of you have experience at SP-level VMware deployments and have something better.

This seems like a fine question for an account manager/partner resources?

Potato Salad
Oct 23, 2014

Nobody Cares




Not gonna lie, you're the first person I've bumped into this year deploying SP infrastructure on VMware at "vROps isn't enough" scale

TheFace
Oct 4, 2004

Fuck anyone that doesn't wanna be this beautiful


Between extra-large nodes and clustering, vROps scales to some 180,000-200,000 objects, I think. That is assuming you can actually deploy XL nodes (24 vCPU is a bit of an ask, especially when VMware guidelines suggest being able to fit it in a single socket).

I've never seen Veeam ONE handle an environment that large without being horrifically slow.

abelwingnut
Dec 23, 2002



probably a pretty basic problem, but i'm having issues with virtualbox.

specifically, i'm trying to run an os x vm on a windows 10 machine. no matter what i do, i cannot escape the mouse or keyboard once i boot up the vm. i press the host key combination to decapture myself from the window, but nothing. i'm wondering if it has to do with the fact i'm using a usb wireless keyboard and a usb wireless mouse? i say that because whenever i type on the vm it is a bit...choppy and stuttered. so i'm wondering if there's some connection issue happening?

in any case, i've tried changing the host key combination from right control to right shift, and it just does nothing else. like, once i start the vm and it loads, i can only access that vm. ctrl+alt+del can't get me out, nothing can.

any ideas? i also loaded both usb input devices in the vm's settings. really not sure what's going on.

wolrah
May 8, 2006
what?


abelwingnut posted:

probably a pretty basic problem, but i'm having issues with virtualbox.

specifically, i'm trying to run an os x vm on a windows 10 machine. no matter what i do, i cannot escape the mouse or keyboard once i boot up the vm. i press the host key combination to decapture myself from the window, but nothing. i'm wondering if it has to do with the fact i'm using a usb wireless keyboard and a usb wireless mouse? i say that because whenever i type on the vm it is a bit...choppy and stuttered. so i'm wondering if there's some connection issue happening?

in any case, i've tried changing the host key combination from right control to right shift, and it just does nothing else. like, once i start the vm and it loads, i can only access that vm. ctrl+alt+del can't get me out, nothing can.

any ideas? i also loaded both usb input devices in the vm's settings. really not sure what's going on.

If you have attached the USB devices to the guest that's why they're not able to escape, they're literally being disconnected from the host and passed through to the guest when it's running. Don't do that unless that's what you actually want (only really makes sense if you're trying to do a "two workstations, one PC" style setup). USB passthrough is for other devices that you need to have appear as directly connected to the guest.

abelwingnut
Dec 23, 2002



ohhhhhh, got it. yea, that's worked--thanks.

now to try and figure out why this drat thing won't connect to icloud.

LongSack
Jan 17, 2003



Two questions about VMware and shared folders. For both, the host is Windows 10 and the guest is CentOS 7 running under VMware Workstation Professional, latest version (updated today).

1. I have a shared folder "always enabled". Yet, every time I boot the VM, the /mnt/hgfs directory is empty. I have to do VM->Options->Shared Folders->Disable->OK followed by VM->Options->Shared Folders->Always Enabled->OK to get the mount point back. Is there a fix for this?

2. The above shared folder is on an external USB drive. If I haven't accessed it in a while and then go back to it, the entire VM freezes solid for up to a minute or more. I suspect that this is because the drive has been spun down and it's waiting for the drive to spin up, but that's just a guess. Any ideas how to fix / prevent this?

TIA

Not Wolverine
Jul 1, 2007

by Fluffdaddy


I'm not sure this is the best thread, but I think it's the most likely to have knowledgeable people: is Ragnar Locker a big problem? If you don't know, Ragnar Locker is a type of ransomware released last month. Allegedly it installs VirtualBox and runs a VM with Windows XP, then uses the VM to read, write, and encrypt all of your files. Supposedly this is extremely difficult to detect because all of the file access is done from VirtualBox instead of a Ragnar.exe process.

What I want to know is how file permissions are typically handled on a virtual machine. Assume I want to set up a VM and play some MP3s stored on the host machine's drive: does the host machine only see that VirtualBox is reading my MP3s, or is it possible for the host to see that napster.exe from within the VM is actually what is reading the MP3? Can I use file permissions so that only the owner on the host can read my MP3s? Could I configure file permissions so that only the owner on the host and a VM user can read the files, and could this be done by cloning the user ID for both the host and the virtual machine? I think Ragnar Locker is likely going to require some changes to how VMs handle permissions and file access, but is this a problem that is going to require significant changes that will make VMs more difficult to configure?

Thanks Ants
May 21, 2004

Bless You Ants, Blants



If a VM can escape Virtualbox and access system files outside of any folders shared within that VM then that's a horrific bug surely

Zorak of Michigan
Jun 10, 2006

Waiting for his chance

It installs the malware payload VM and gives it access to all your drives, so the sandbox isn't really a barrier.

Not Wolverine
Jul 1, 2007

by Fluffdaddy


Zorak of Michigan posted:

It installs the malware payload VM and gives it access to all your drives, so the sandbox isn't really a barrier.
The VM doesn't make things significantly worse. My understanding is it just makes the malware harder to detect, since instead of seeing a malware process accessing your files, you just see the VirtualBox process accessing your files, which most IT departments might view as a normal, non-threatening process.

Wibla
Feb 16, 2011


Most Oracle products should probably be marked as a threatening process.

BangersInMyKnickers
Nov 3, 2004

I have a thing for courageous dongles



they threaten my personal well-being that's for sure

GrandMaster
Aug 15, 2004
laidback

Hey all,
I was hoping to get some advice on what people are doing in regards to DR Testing in HCI/Stretched vSAN environments?

Our organisation is large government so there's a ton of compliance regulation for DR/BC testing, but the old controlled failover methods we used to use aren't really applicable anymore due to stretched vSAN.

Powering off a VM, moving it to the alternate site and powering it back on doesn't actually test anything since we run active/active DCs. We already know the network segments work and VMs can run anywhere.

And since it's a mission critical environment with risk to life, we can't just drop an entire DC to test the witness/failover functionality. Outage windows are strictly controlled, and we need to test on an app by app basis.

Anyways, if anyone has any ideas here I'm really interested since I'm coming up with blanks as to how else we can effectively "prove" the DR actually works.

Zorak of Michigan
Jun 10, 2006

Waiting for his chance

We struggle with similar problems. One strategy we tried with mixed success was to replicate to additional systems inside an island of containment, which we could then isolate for test purposes. We kept finding new endpoints we needed inside the IOC and vowing to get it right next time.

GrandMaster
Aug 15, 2004
laidback

We can't run isolated, the typical strategy is failover and run from the DR site for a week then fail back and get a big green tick from the auditors.
It's easy in legacy environments, because we have *steps* to do that we can prove we completed and there's an outage. A live cross-site vmotion looks like nothing even happened haha.

I completely understand it's a process issue, not a technical one but government moves slooooow and auditors are insistent.

TheFace
Oct 4, 2004

Fuck anyone that doesn't wanna be this beautiful


GrandMaster posted:

Hey all,
I was hoping to get some advice on what people are doing in regards to DR Testing in HCI/Stretched vSAN environments?

Our organisation is large government so there's a ton of compliance regulation for DR/BC testing, but the old controlled failover methods we used to use aren't really applicable anymore due to stretched vSAN.

Powering off a VM, moving it to the alternate site and powering it back on doesn't actually test anything since we run active/active DCs. We already know the network segments work and VMs can run anywhere.

And since it's a mission critical environment with risk to life, we can't just drop an entire DC to test the witness/failover functionality. Outage windows are strictly controlled, and we need to test on an app by app basis.

Anyways, if anyone has any ideas here I'm really interested since I'm coming up with blanks as to how else we can effectively "prove" the DR actually works.

I know this may be not possible due to budget constraints, but if you have a mission critical environment with risk to life there really should be a (smaller typically) dev environment configured similar enough to test things on. This goes beyond failure conditions of vSAN stretched cluster, as you should be testing updates, changes to apps, etc all in a dev environment.

Testing vSAN stretch clustering is fairly easy if you have an environment that you're actually willing to test on:
https://storagehub.vmware.com/t/vsan-6-7-u-3-proof-of-concept-guide-1/stretched-cluster-with-wts-failover-scenarios-1/


GrandMaster
Aug 15, 2004
laidback

Yeah it's a very good point, obviously failover stuff was all tested during commissioning, and we have multiple non-prod environments for all the critical apps, but it's all on the same hardware. The higher-ups wanted ONE BIG CLUSTER and completely disregarded advice to split non prod environments onto physically separate hardware like we used to have pre-hci.

On another note, has anyone had a positive experience with the vxrail VCF upgrades? It was sold to us as "one click will upgrade the entire environment" but in reality it's "one support case will be required for every single component, which inevitably fails during the upgrade process". From memory it's been about 15 support cases and 3 months required for our management cluster upgrade and it's not even finished yet.

God I hate vxrail haha
