Potato Salad
Oct 23, 2014

nobody cares


BeastOfExmoor posted:

I'm using an old desktop (AMD FX-8350) as my "server". My plan was to install Windows Server and run some VMs in Hyper-V, but Hyper-V apparently doesn't have some of the features I wanted (USB passthrough, etc.) so I installed VMware Workstation, which apparently has an issue with Windows Server 2016's Credential Guard feature. Before I go through the rigmarole of turning that off I figured I'd pop in here and see if there's some other path I should be taking? Should I be running another VM product on the bare metal and then VMs on top of that? I'd like to run a couple Windows VMs, a Linux VM, and perhaps an OSX VM if I can manage to get that going on an AMD processor.

Don't use vmware workstation, use esxi


chutwig
May 28, 2001

BURLAP SATCHEL OF CRACKERJACKS

chutwig posted:

Does anyone have experience with doing nested virtualization in Linux guests in VMware Workstation?

In the event anyone cares, the nested virt stability issues were conclusively solved by moving to 14. Spent all day building stuff in the same VM in Player 14 with no hiccups.

Number19
May 14, 2003

HOCKEY OWNS
FUCK YEAH


VMware's patches for this Intel CPU poo poo are out now: https://www.vmware.com/security/advisories/VMSA-2018-0002.html

Moey
Oct 22, 2010

I LIKE TO MOVE IT
So what's the word on that performance hit?

Edit: good read linked in the storage thread.

https://lonesysadmin.net/2018/01/02/intel-cpu-design-flaw-performance-degradation-security-updates/

Moey fucked around with this message at 06:26 on Jan 4, 2018

bull3964
Nov 18, 2000

DO YOU HEAR THAT? THAT'S THE SOUND OF ME PATTING MYSELF ON THE BACK.


Does the hypervisor patch in any way protect the guests or is it purely to patch the hypervisor itself from the exploit?

bobfather
Sep 20, 2001

I will analyze your nervous system for beer money
I just did a command line update of ESXi 6.5 as per this guide and it went fine. Took about 3 minutes.

I always hate seeing my ESXi uptime go back down to 0, but another day, another massive exploit right?
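For anyone who hasn't done it before, the whole thing boils down to roughly this from the ESXi shell (the image profile name below is just an example, swap in whatever the current patch profile is):
code:
# let the host reach VMware's online depot, then pull the updated image profile
esxcli network firewall ruleset set -e true -r httpClient
esxcli software profile update -p ESXi-6.5.0-20180104001-standard -d https://hostupdate.vmware.com/software/VUM/PRODUCTION/main/vmw-depot-index.xml
# close the firewall rule back up and reboot into the patched image
esxcli network firewall ruleset set -e false -r httpClient
reboot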

Pile Of Garbage
May 28, 2007



bull3964 posted:

Does the hypervisor patch in any way protect the guests or is it purely to patch the hypervisor itself from the exploit?

The latter I believe.

bull3964
Nov 18, 2000

DO YOU HEAR THAT? THAT'S THE SOUND OF ME PATTING MYSELF ON THE BACK.


That's what I figured. No shortcuts for this.

Alfajor
Jun 10, 2005

The delicious snack cake.
That makes sense, but would the performance hit be compounded or stacked? (or what's the best word to describe this??)
5-30% from the hypervisor
5-30% on the actual guest OS

so total performance hit could be anywhere between 10% and 60%??

Alfajor fucked around with this message at 17:11 on Jan 5, 2018

bull3964
Nov 18, 2000

DO YOU HEAR THAT? THAT'S THE SOUND OF ME PATTING MYSELF ON THE BACK.


I don't necessarily know that the hypervisor performance hit would be passed on to the guests. If anything, I think the likely situation is that the hypervisor has more overhead on the host as an aggregate rather than a stacked performance hit.

It all depends on workload.
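And even if the two hits did stack, percentages don't add like that anyway, they compound:
code:
best case : 1 - (0.95 * 0.95) = 0.0975  ->  ~10% total
worst case: 1 - (0.70 * 0.70) = 0.51    ->  ~51% total, not 60%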

Potato Salad
Oct 23, 2014

nobody cares


Alfajor posted:

That makes sense, but would the performance hit be compounded or stacked? (or what's the best word to describe this??)
5-30% from the hypervisor
5-30% on the actual guest OS

so total performance hit could be anywhere between 10% and 60%??

No.

Performance is lost when the real CPU needs to switch between kernel and user mode and now cannot take advantage of memory features designed to diminish the corresponding performance cost.

System calls in the guest are still made to physical user mode processes that are trapped by the hypervisor and emulated or directed to direct-mapped hardware (think DirectPath I/O network devices, which have additional bits tracking which virtual device on which VM is allowed use of the physical hardware, permitting guest-mode interaction with compatible hardware). On even remotely modern hardware, there's no significant multiplication of the number of kernel context switches just because a guest makes supposed "system calls" in the guest machine's physical user processes that themselves are fulfilled by the host kernel.

Potato Salad fucked around with this message at 17:58 on Jan 5, 2018

evol262
Nov 30, 2010
#!/usr/bin/perl

Potato Salad posted:

No.

Performance is lost when the real CPU needs to switch between kernel and user mode and now cannot take advantage of memory features designed to diminish the corresponding performance cost.

System calls in the guest are still made to physical user mode processes that are trapped by the hypervisor and emulated or directed to direct-mapped hardware. On even remotely modern hardware, there's no significant multiplication of the number of kernel context switches just because a guest makes supposed "system calls" in the guest machine's physical user processes that themselves are fulfilled by the host kernel.

This is misleading.

There are two separate CVEs for Spectre, which apply to different scenarios. Your guests must also be patched. The update for ESXi blocks leaking information from the hypervisor (and potential branch injection), but mitigating attacks from user space processes on the guest against the guest kernel also requires the guest to be updated.

Without a real architectural piece from VMware indicating whether a change in vmkernel happened and exactly how they map guest page tables into vmkernel's page table (I would guess that there's not actually a lot of interaction here, but either way), both nested page tables and/or shadow page tables must be flushed to protect the guest kernel from a breakout as well, which actually does mean that there's potentially a multiplicative effect.

The hit in performance for syscalls here is due to page table isolation. There's no microcode fix for this at this point, and one may not be possible, so you're looking at disconnecting vmkernel space from guest space, then again at disconnecting guest kernelspace from guest userspace.

E: this really needs a lot of testing, and probably some tweaks from all vendors, because right now everyone is in full-on panic mode to get it fixed, and nobody's seriously looked at mitigating the impact yet
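E2: if you want to sanity-check whether a given guest actually has its own mitigation active, a quick way (module name and dmesg string are from memory, so verify against your vendor's docs):
code:
# Windows guest: MS ships a PowerShell module for exactly this check
Install-Module SpeculationControl
Get-SpeculationControlSettings

# Linux guest: on a kernel carrying the KPTI patches you should see something like
dmesg | grep -i 'page tables isolation'
# "Kernel/User page tables isolation: enabled"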

evol262 fucked around with this message at 18:15 on Jan 5, 2018

underlig
Sep 13, 2007

Moey posted:

So what's the word on that performance hit?

Edit: good read linked in the storage thread.

https://lonesysadmin.net/2018/01/02/intel-cpu-design-flaw-performance-degradation-security-updates/

When talking about lowered performance, how is this measured? Are there any benchmark tools that I can run before and after the patch, to have some numbers to show the boss?
The boss has had the week off so this is something I'm going to have to answer on Monday. I patched one of my Hyper-V hosts today but I can't tell what the impact is because we don't have any monitoring at all. :bravo:

anthonypants
May 6, 2007

by Nyc_Tattoo
Dinosaur Gum

underlig posted:

When talking about lowered performance, how is this measured? Are there any benchmark tools that I can run before and after the patch, to have some numbers to show the boss?
The boss has had the week off so this is something I'm going to have to answer on Monday. I patched one of my Hyper-V hosts today but I can't tell what the impact is because we don't have any monitoring at all. :bravo:
If you don't have any monitoring why do you care about performance?

mewse
May 2, 2006

anthonypants posted:

If you don't have any monitoring why do you care about performance?

You can legit argue that performance has been completely unaffected as per your records:

Pre-patch: unknown
Post-patch: unknown

Moey
Oct 22, 2010

I LIKE TO MOVE IT
I had 8 ESXi VDI hosts that I got patched up (was a little behind) and have noticed pretty much no difference looking at CPU utilization across the clusters.

Kazinsal
Dec 13, 2011



mewse posted:

You can legit argue that performance has been completely unaffected as per your records:

Pre-patch: unknown
Post-patch: unknown

98% of "meets expectations" is still "meets expectations"! :eng101:

evol262
Nov 30, 2010
#!/usr/bin/perl
Just FYI, for all of you running Windows in an enterprise environment, it's possible that the update is not actually active. It's also possible it has no performance impact, but check for the dword given here just in case.

I guess some enterprise AV vendors do dumb stuff which BSODs Windows, so MS is requiring that they include this dword as part of their update before actually turning it on :eng99:
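If you just want to check where a given box stands, this is the key in question (value name is from memory, so cross-check it against the linked guidance before relying on it):
code:
# the AV-compatibility flag Windows Update looks for before enabling the January updates
reg query "HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\QualityCompat"
# expected: a REG_DWORD named cadca5fe-87d3-4b96-b7fb-a231484277cc with data 0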

Pile Of Garbage
May 28, 2007



Moey posted:

I had 8 ESXi VDI hosts that I got patched up (was a little behind) and have noticed pretty much no difference looking at CPU utilization across the clusters.

Isn't CPU utilisation dependent on workload? From what I understand the patches add additional checks for certain operations, and these additional checks result in increased execution time for those operations. Wouldn't the performance impact on the hypervisor, if any, therefore manifest as increased CPU ready time, since instructions take longer to execute and contention increases?

Alternatively I'm grossly oversimplifying things and missing the point.
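If anyone does want to eyeball it, ready time is easy enough to pull, e.g. (PowerCLI stat name is from memory, and the host name is obviously a placeholder):
code:
# esxtop on the host: press 'c' and watch the %RDY column per world/VM
# or via PowerCLI:
Get-VMHost esx01.example.local | Get-Stat -Stat cpu.ready.summation -Realtime -MaxSamples 12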

Moey
Oct 22, 2010

I LIKE TO MOVE IT

cheese-cube posted:

Isn't CPU utilisation dependent on workload?

That is my understanding; so far my boring VDI stuff has had no noticeable increase.

On a side note, it looks like the Intel X710 NICs still PSOD ESXi..... anyone using a QLogic FastLinQ 41164 with ESXi?

Looking at grabbing a stack of R640 and would prefer some 4x10gbe SFP+ daughterboard NICs.

BangersInMyKnickers
Nov 3, 2004

I have a thing for courageous dongles

Moey posted:

That is my understanding; so far my boring VDI stuff has had no noticeable increase.

On a side note, it looks like the Intel X710 NICs still PSOD ESXi..... anyone using a QLogic FastLinQ 41164 with ESXi?

Looking at grabbing a stack of R640 and would prefer some 4x10gbe SFP+ daughterboard NICs.

I spent close to a year fighting with the x710's before VMware basically threw in the towel and admitted that they could never get them to behave correctly in a bunch of configurations. Pressured the Dell rep and they gave us a free swap to the QLogic, which worked like a charm once I got the ring buffer size increased.

Moey
Oct 22, 2010

I LIKE TO MOVE IT

BangersInMyKnickers posted:

I spent close to a year fighting with the x710's before VMware basically threw in the towel and admitted that they could never get them to behave correctly in a bunch of configurations. Pressured the Dell rep and they gave us a free swap to the QLogic, which worked like a charm once I got the ring buffer size increased.

Wow, I believe I remember you posting about that hell. I can't believe they couldn't iron that out. Thanks for the quick response.

I have been pretty happy with the Emulex 4x10gbe rNDCs in our R630 servers. Halfway tempted to just order more R630s with those cards in em.

BangersInMyKnickers
Nov 3, 2004

I have a thing for courageous dongles

We moved off them back when VMware had just released 1.4.3 native driver, 1.5.6 along with its corresponding firmware update is out so maybe that fixes it? I wouldn't recommend using the vmklinux legacy drivers, those had all the same problems but just threw in an abstraction layer so you were getting less log data on the link flapping and LACP PDU failures. If you have the hardware kicking around it'd be worth patching it up and seeing if it will behave for you. To Intel's credit, the x710's are stupidly fast (and power efficient) to the point of being able to do 10gb full duplex on 64-byte frames. Qlogic struggled to get above 5gb in one direction on 1500mtu (will do it with jumbo-frames fine) but the heavy throughput was all happening on the storage fabric with 9k mtu so it fit my use-case.

https://www.vmware.com/resources/co...r&sortOrder=Asc

movax
Aug 30, 2008

I'm trying to do PCIe pass-through of a Samsung 950 SSD to a Linux VM (giving it a scratch drive directly for Usenet processing / decompression / par2 + some scratch database storage), and it seems to be causing the VM to either fail to boot, or hang at some point after it has managed to boot (If I reboot the host, the Linux VM will boot the first time, but not subsequent times). Dug into the logs and saw:

code:
2017-12-25T20:55:20.668Z| vcpu-0| E105: PANIC: PCIPassthruChangeIntrSettings: 0000:08:00.0 failed to register interrupt (error code 195887105)
(snip)
2017-12-25T20:55:23.771Z| vcpu-0| I125: Msg_Post: Error
2017-12-25T20:55:23.771Z| vcpu-0| I125: [msg.log.error.unrecoverable] VMware ESX unrecoverable error: (vcpu-0)
2017-12-25T20:55:23.771Z| vcpu-0| I125+ PCIPassthruChangeIntrSettings: 0000:08:00.0 failed to register interrupt (error code 195887105)
2017-12-25T20:55:23.771Z| vcpu-0| I125: [msg.panic.haveLog] A log file is available in "/vmfs/volumes/5a3ed80c-9c1d818c-2731-0cc47a868c62/hoth/vmware.log".
2017-12-25T20:55:23.771Z| vcpu-0| I125: [msg.panic.requestSupport.withoutLog] You can request support.
2017-12-25T20:55:23.771Z| vcpu-0| I125: [msg.panic.requestSupport.vmSupport.vmx86]
2017-12-25T20:55:23.771Z| vcpu-0| I125+ To collect data to submit to VMware technical support, run "vm-support".
2017-12-25T20:55:23.771Z| vcpu-0| I125: [msg.panic.response] We will respond on the basis of your support entitlement.
The LSI SAS3008 controller on my motherboard has been passed through to the FreeNAS VM with nary a complaint. The VM I'm trying to feed the SSD to is a Fedora 27 VM. Have another Fedora 27 VM running (sans pass-through) that has been perfectly stable. If I turn off MSIs (dumb), then the VM boots no problem, but of course, it can't actually use the NVMe device at that point.
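(For reference, "turning off MSIs" just means setting this in the .vmx, assuming I've got the option name right and the SSD is the first passthrough device:)
code:
pciPassthru0.msiEnabled = "FALSE"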

ESXi 6.5 (SMP Release build-4887370) Free Version on a Supermicro X11SSL mobo.

e: Learned vmkernel.log is a thing...
code:
2018-01-10T06:18:16.413Z cpu2:216978)WARNING: PCI: 780: Reporting transaction pending on device 0000:08:00.0
2018-01-10T06:18:17.518Z cpu2:216978)WARNING: PCI: 813: Detected transaction pending on device 0000:08:00.0 after reset
2018-01-10T06:18:17.518Z cpu2:216978)IOMMU: 2176: Device 0000:08:00.0 placed in new domain 0x430393e73aa0.
2018-01-10T06:18:17.539Z cpu4:216979)VMMVMKCall: 218: Received INIT from world 216979
2018-01-10T06:18:17.540Z cpu4:216979)CpuSched: 692: user latency of 216989 PVSCSI-216979:0 0 changed by 216979 vmm0:hoth -6
2018-01-10T06:18:17.540Z cpu4:216979)PVSCSI: 3411: scsi0: wdt=1 intrCoalescingMode=2 flags=0x1f
2018-01-10T06:18:17.540Z cpu7:216982)VMMVMKCall: 218: Received INIT from world 216982
2018-01-10T06:18:17.540Z cpu0:216983)VMMVMKCall: 218: Received INIT from world 216983
2018-01-10T06:18:17.540Z cpu3:216984)VMMVMKCall: 218: Received INIT from world 216984
2018-01-10T06:18:17.542Z cpu4:216988)WARNING: NetDVS: 681: portAlias is NULL
2018-01-10T06:18:17.542Z cpu4:216988)Net: 2524: connected hoth eth0 to VM Network, portID 0x200000a
2018-01-10T06:18:17.544Z cpu1:67521)Config: 706: "SIOControlFlag2" = 0, Old Value: 1, (Status: 0x0)
2018-01-10T06:18:17.802Z cpu7:216988)WARNING: PCI: 780: Reporting transaction pending on device 0000:08:00.0
2018-01-10T06:18:18.904Z cpu7:216988)WARNING: PCI: 813: Detected transaction pending on device 0000:08:00.0 after reset
2018-01-10T06:18:18.905Z cpu7:216988)WARNING: PCI: 780: Reporting transaction pending on device 0000:08:00.0
2018-01-10T06:18:20.007Z cpu7:216988)WARNING: PCI: 813: Detected transaction pending on device 0000:08:00.0 after reset
2018-01-10T06:18:20.007Z cpu7:216988)IOMMU: 2176: Device 0000:08:00.0 placed in new domain 0x430393e73aa0.
**************** 2018-01-10T06:18:20.338Z cpu1:216988)WARNING: MSI: 361: MSI already enabled for device 0000:08:00.0 ****************
2018-01-10T06:18:20.338Z cpu1:216988)IntrCookie: 1439: Unable to allocate cookies: Failure
2018-01-10T06:18:20.338Z cpu1:216988)VMKPCIPassthru: 1890: failed to allocate MSI interrupt
I wonder if something else snagged it, or if that was "left over" from the first time I booted that VM? I came back from vacation to that VM being hung at 20% CPU usage.

Unrelated, who the gently caress at VMware thought pushing everything to their abortion of a Web UI was a good idea?

movax fucked around with this message at 08:18 on Jan 11, 2018

Potato Salad
Oct 23, 2014

nobody cares


Dude, don't bother doing nvme passthrough

The esx nvme driver is great. Discover the drive from the host, put vmfs on it.

E: at the very least, do a raw device mapping (RDM). PCIe passthrough isn't going to save you any performance
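Rough sketch of the RDM route if you go that way (the device identifier below is a placeholder, grab the real one with ls /vmfs/devices/disks/):
code:
# create a physical-compatibility RDM pointer file for the NVMe device on an existing VMFS datastore
vmkfstools -z /vmfs/devices/disks/t10.NVMe____Samsung_SSD_950_PRO_512GB_<serial> /vmfs/volumes/datastore1/hoth/950pro-rdm.vmdk
# then attach 950pro-rdm.vmdk to the Fedora VM as an existing disk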

Potato Salad fucked around with this message at 21:35 on Jan 12, 2018

Happiness Commando
Feb 1, 2002
$$ joy at gunpoint $$

Who knows about VMWare vGPU stuff for doing AutoCAD over RDS?

Most of my posts belong in the small shop thread, and I'm definitely out of my depth on this one. Potential client currently has a beefy rear end server with 2012 R2 being an RDS host for terminal service users running AutoCAD and an LOB app that relies on DirectX for GPU calculations. They are complaining that performance is bad. I called up the vendor and they said that one of their customers figured out a VMWare solution that passes vGPUs to the desktop sessions - all other solutions won't work because of how RDS video drivers are software only (or something) - but they have no architecture whitepapers and it's not a supported solution.

Apparently, the solution is buying a beefy rear end video card like an Nvidia Tesla series and then implementing VMWare Horizon. Supposedly, Server 2016 also does something similar with RemoteFX, but maybe not. This is a brand new technology for me, and I don't know who to talk to besides maybe a Dell rep.

Can anyone point me to some links or give me a quick mini effortpost about GPU - VDI solutions? My only other alternative is to tell the client to buy a bunch of beefy rear end workstations. It might end up being the better cost option, but something seems wrong about recommending that. Almost all of my experience is with Hyper-V small shop terminal servers for remote Office Suite work...

BangersInMyKnickers
Nov 3, 2004

I have a thing for courageous dongles

Hyper-V/vGPU/RemoteFX was definitely the cost effective route when I looked into that a few years ago. About a quarter of the price of an equivalent VMware rollout. Big trick is that Hyper-V presents an abstracted, virtualized GPU instead of a "native certified" one (cough Autodesk horseshit cough) so it will run in software rendering unless you set a system variable that forces it to target the DX level you specify. Once you do that, it works fine. I ran 4 concurrent CAD sessions on some piece of crap W4100. Didn't get far enough to test on a S7150x2 before funding dried up.

Potato Salad
Oct 23, 2014

nobody cares


Stick an S7150 or S7150x2 in an RDSH host. Buy a few user CALs. Turn on RemoteFX with click-through menus.


Done.


It's really goddamn easy.
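If you'd rather script it than click through, it's only a couple of lines (cmdlet names are from the stock Hyper-V module, the VM name is a placeholder, and the GPU still has to be allowed for RemoteFX in the host's Hyper-V settings first):
code:
Add-VMRemoteFx3dVideoAdapter -VMName rdsh01
Set-VMRemoteFx3dVideoAdapter -VMName rdsh01 -MonitorCount 2 -MaximumResolution 1920x1200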

Potato Salad
Oct 23, 2014

nobody cares


S7150s are supported on, among other platforms, RD730s. Ten grand gets you a p good 8 user rdsh host

Thanks Ants
May 21, 2004

#essereFerrari


You can test it all out on a G-series AWS or an NV Azure instance as well.

Happiness Commando
Feb 1, 2002
$$ joy at gunpoint $$

You people are all so loving awesome

bsaber
Jul 27, 2007
Not sure if this is the right place to ask this. I want to get a Dell R710 for learning at home and I'll be running Proxmox. Currently these are the specs of the one I'm looking to get:

Dual Xeon E5645
64GB RAM
PERC6i RAID controller
2x 1TB 7.2k SATA drives
iDRAC6 Enterprise

Now I might increase the number of drives but my question is should I go with the 570W PSU or 870W? How do I determine which PSU I would need?

BangersInMyKnickers
Nov 3, 2004

I have a thing for courageous dongles

You can use this Dell tool to calculate consumption on a build: http://www.dell.com/calc

They didn't have the R710 in there, but the R720 with an equivalent TDP processor came close enough, reporting 200W nominal with a max around 320W, so you're likely well within spec. For your workload it will probably idle sub-100W unless you turn off power management. The motherboard firmware does track TDP of components and will hobble your CPU down to limited clocks if you exceed it, but that's only something you see with the X series Xeons.
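Back-of-envelope for that build lands in the same ballpark (all numbers rough, and assuming 8 DIMMs for the 64GB):
code:
2 x Xeon E5645 @ 80W TDP        = 160W
8 x DIMMs @ ~4W                 =  32W
2 x 7.2k SATA drives @ ~10W     =  20W
board, fans, PERC, iDRAC, misc  = ~60W
                                  ----
                          total = ~270W peak, so the 570W PSU has plenty of headroom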

bsaber
Jul 27, 2007
Oh cool, thanks! Didn’t know they had a calculator.

Happiness Commando
Feb 1, 2002
$$ joy at gunpoint $$

Potato Salad posted:

Stick an S7150 or S7150x2 in an RDSH host. Buy a few user CALs. Turn on RemoteFX with click-through menus.


Done.


It's really goddamn easy.

The plan is a 2016 Hyper-V host with 2016 RDSH (among others) inside of it.

All the MS documentation I can find says RemoteFX only works for one concurrent login. We are going to use Gen 1 VMs because Datto backup units can't export images of gen 2 VMs.

MS indicates that gen 1 RDSH servers are unsupported.

Am I missing something here? It does look like I could use DDA to pass the whole video card to the RDSH guest, though, so that would work, I think?

(I have no hardware to test this on)
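For my own notes, the DDA side on a 2016 host would be roughly this (cmdlet names hedged from memory; the location path comes out of Device Manager or Get-PnpDevice, and the VM name is a placeholder):
code:
# on the Hyper-V host: detach the GPU from the host, then hand it to the guest
$gpuPath = "<PCIROOT location path of the card>"
Dismount-VMHostAssignableDevice -LocationPath $gpuPath -Force
Add-VMAssignableDevice -LocationPath $gpuPath -VMName rdsh01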

jre
Sep 2, 2011

To the cloud ?



bsaber posted:

Not sure if this is the right place to ask this. I want to get a Dell R710 for learning at home and I'll be running Proxmox. Currently these are the specs of the one I'm looking to get:

Dual Xeon E5645
64GB RAM
PERC6i RAID controller
2x 1TB 7.2k SATA drives
iDRAC6 Enterprise

Now I might increase the number of drives but my question is should I go with the 570W PSU or 870W? How do I determine which PSU I would need?

Are you going to be running this in your house? Are you prepared for how noisy the fans will be?

bsaber
Jul 27, 2007

jre posted:

Are you going to be running this in your house? Are you prepared for how noisy the fans will be?

Yeah, it'll be in a closet under the stairs. So... shouldn't be much of a problem I think?

Thanks Ants
May 21, 2004

#essereFerrari


It absolutely will be. The pitch is as bad as the volume.

bsaber
Jul 27, 2007
Huh... well I could put it in the garage then. I’ll just have to run Ethernet.


jre
Sep 2, 2011

To the cloud ?



bsaber posted:

Huh... well I could put it in the garage then. I’ll just have to run Ethernet.

Curious why you went for a rack mount server for a home lab? What technologies are you wanting to learn?
