chutwig
May 28, 2001

BURLAP SATCHEL OF CRACKERJACKS

Docjowles posted:

My impression has been that it's on the client side, but I could be way off base. I just know that it seems like when the connection to Rabbit is interrupted for 1 microsecond, all 5,000 OpenStack services start spamming stack traces into their logs and never recover until everything is restarted.

I'm with you on the "Rabbit sucks" bandwagon, though, for reasons totally outside of OpenStack.

I'd be a lot more likely to blame Kombu/OpenStack than Rabbit for this, to be honest. I do backend engineering for OpenStack stuff now at my current employer and ran RabbitMQ clusters at prior employers, and Rabbit has always been pretty low on my list of concerns.

My OpenStack contribution: we currently run about 8 Havana clusters on Ubuntu 12.04, and my first major project after changing jobs was to migrate the codebase to Ubuntu 14.04/Kilo, assigned to me about 12 hours after Kilo dropped. Having come from the magical VMware universe where things Just Work but also Cost A Lot Of Money, I've found OpenStack gives me a lot of interesting experiences and a lot to think about. I admit I am skeptical that the project will survive without imploding under its own weight; I think there will always be a need for a private cloud offering like OpenStack, and for continuity's sake I hope it is OpenStack, but spend a few hours inside it and you come out feeling like it's 20 different projects with 20 different visions and very little gluing them together.

chutwig
May 28, 2001

BURLAP SATCHEL OF CRACKERJACKS

Docjowles posted:

Yup, that is what I meant by "client side". If a client (in this case, various OpenStack services) can't deal with the connection to Rabbit dropping for 1 second without completely blowing the gently caress up, that's not really the server's problem.

I missed the word "client". :v:

chutwig
May 28, 2001

BURLAP SATCHEL OF CRACKERJACKS

PCjr sidecar posted:

In my experience, Emulex/QLogic Ethernet 'storage adapter / offload' cards are hilariously poo poo. If your 24-core 2.5 GHz server can't spare the minuscule CPU overhead involved, those cards are going to fall over at that rate anyway.

I got stuck trying to sort out the FCoE plans of somebody who had left my last company. He had armed all the servers with Intel X520-DA2s. VMware and bare-metal Linux both used software FCoE initiators with these cards and were completely unable to keep up with even modest traffic, so VMs would constantly kernel panic when FCoE on the hypervisor poo poo the bed and all their backing datastores dropped off. I ordered a couple of Emulex OneConnect cards that obviated the need for the software FCoE crap and they worked really well, right up until one of the cards crashed due to faulty firmware. At that point I'd had enough of trying to salvage the FCoE poo poo, sent everything back and bought a pair of MDSes and some QLogic FC HBAs, and never thought about my SAN crashing again.

Moral of the story: it did make a difference for me, but don't do FCoE unless you have a Cisco engineer on-site and a company-funded expense account with the nearest boozeria.

chutwig
May 28, 2001

BURLAP SATCHEL OF CRACKERJACKS

Vulture Culture posted:

We had the problem you're talking about in Kilo, and it went away once we identified and worked around the MySQL issue where you can't actually do multi-master in Nova with Galera.

For exactly that reason, we front our MySQL cluster with HAProxy and run the MySQL backends in backup mode, so only one instance ever receives traffic.

chutwig
May 28, 2001

BURLAP SATCHEL OF CRACKERJACKS

Have any of you OpenStackers successfully migrated from Nova networking to Neutron? We are still using Nova networking at present but want to begin laying the groundwork for a switch to Neutron, because our tenants are increasingly asking for features that are only in Neutron, we expect to eventually scale our clusters to the point where we will need something like Calico to maintain a sane network topology, and Nova networking just doesn't get any love any more (and why would it?).

I'm pretty new to the OpenStack world, but I've already heard all sorts of horror stories about Neutron. A lot of them are pretty old and I imagine things have improved since then, but I'd like to hear from any ops that are currently using it, particularly if you had a prior nova-network-based setup.

chutwig
May 28, 2001

BURLAP SATCHEL OF CRACKERJACKS

ILikeVoltron posted:

Do you happen to know how this is configured? Like, how did you tell the other members of the MySQL cluster to be in backup mode?

It's configured in HAProxy only; MySQL has no awareness of it. "Backup mode" in the context of HAProxy means that HAProxy will only send traffic to a single backend instead of spreading it around the different backends: https://github.com/bloomberg/chef-bcpc/blob/master/cookbooks/bcpc/templates/default/haproxy-head.cfg.erb#L34-L44

Our setup leans heavily on the notion of a master virtual IP that floats around the cluster, i.e., there's one control plane node (head node) that is more equal than all the rest, and that's whichever one is holding the VIP. That node serves MySQL and HAProxy exclusively and also runs cronjobs that should only run on one system; everything else that goes through the VIP is distributed by HAProxy across all the control plane nodes. You'll need something like that for our jankity MySQL+HAProxy workaround to work for you.
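
For the curious, the mechanism is just HAProxy's backup keyword: only the first server gets traffic, and the others are used only if it goes down. A minimal sketch along the lines of that template (placeholder addresses, not a copy of our actual config):

code:
# only head1 receives MySQL traffic; head2/head3 take over only if it fails its check
listen mysql
  bind x.x.x.x:3306
  mode tcp
  option tcpka
  option httpchk
  server head1 x.x.x.x:3306 check port 9200 inter 2s
  server head2 x.x.x.x:3306 check port 9200 inter 2s backup
  server head3 x.x.x.x:3306 check port 9200 inter 2s backup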

chutwig
May 28, 2001

BURLAP SATCHEL OF CRACKERJACKS

ILikeVoltron posted:

OK, so the HAProxy config out of Red Hat's installer uses "stick on dst" and "timeout server 90m", like so:

code:
listen galera
  bind x.x.x.x:3306
  mode  tcp
  option  tcplog
  option  httpchk
  option  tcpka
  stick  on dst
  stick-table  type ip size 2
  timeout  client 90m
  timeout  server 90m
  server host1 x.x.x.x:3306  check inter 1s port 9200 on-marked-down shutdown-sessions
  server host2 x.x.x.x:3306  check inter 1s port 9200 on-marked-down shutdown-sessions
  server host3 x.x.x.x:3306  check inter 1s port 9200 on-marked-down shutdown-sessions
I'm more interested in how you're doing replication between the MySQL hosts, because after some disk tests it seems to me that this matters:

code:
mysql -e "SHOW STATUS LIKE 'wsrep_local_recv_queue_avg';"
+----------------------------+----------+
| Variable_name              | Value    |
+----------------------------+----------+
| wsrep_local_recv_queue_avg | 0.545683 |
+----------------------------+----------+
The server running the Mongo primary for Ceilometer seems to spike up to 1.5 or so, and then we start to see some failures. This could be nothing, but one of my colleagues thinks we could be seeing I/O stalls causing replication to stall, which makes things wait and eventually fail.

We don't run Ceilometer any more because it kicked the crap out of our control plane nodes. Vulture Culture's suggestion of sar is a good one because it'll let you correlate the stalls with any load average or iowait problems on the node at the time.
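
For reference, the kind of sar invocations I mean, assuming stock sysstat with its usual collection cron job in place:

code:
# these read the daily sysstat data files; add an interval/count to sample live
sar -u               # CPU breakdown, including %iowait
sar -q               # run-queue length and load averages
sar -d -p 5 12       # per-disk utilization, 12 samples 5 seconds apart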

What do you mean by how we're doing replication between the MySQL hosts? They all talk to each other directly; it's just whatever Galera does normally.
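
If it's the cluster membership you're after, it's just the standard wsrep settings pointing the nodes at one another. A rough sketch with placeholder addresses (the provider path and SST method vary by distro/packaging), not our exact config:

code:
# rough Galera sketch, not our exact config
[mysqld]
binlog_format=ROW
default_storage_engine=InnoDB
innodb_autoinc_lock_mode=2
wsrep_provider=/usr/lib/galera/libgalera_smm.so
wsrep_cluster_name=openstack
wsrep_cluster_address=gcomm://x.x.x.x,x.x.x.x,x.x.x.x
wsrep_sst_method=xtrabackup-v2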

chutwig
May 28, 2001

BURLAP SATCHEL OF CRACKERJACKS

Vulture Culture posted:

I'm considering an upgrade to El Capitan over my holiday break. Does VirtualBox still have any major issues outstanding on this OS X version?

VirtualBox 5.x has been fine for me since I upgraded.

chutwig
May 28, 2001

BURLAP SATCHEL OF CRACKERJACKS

Does anyone have experience with doing nested virtualization in Linux guests in VMware Workstation?

I have a Xenial guest in Workstation 12 that I use for multi-VM Vagrant environments, using both VirtualBox and KVM through libvirt. The nested VMs are very unstable; if they have multiple cores allocated it's a 100% guarantee that processes in them will segfault constantly. With only one core allocated, stability improves but is still not great. I have various Ansible playbooks and build scripts for building these environments, and it sucks when they constantly fall over and explode.

By way of comparison, the same multi-VM environments are totally stable when they're not nested. I've tested with VirtualBox on a Mac and with KVM on a regular Linux server, and everything is fine there. I would like to set up another nested virtualization test using VMware Fusion on a Mac but haven't had time to do so yet. My main incentive for getting nested virt working is that the desktop workstation I have is much more powerful than my MBP and has a lot more memory, so I can build much bigger and more elaborate test environments (and I don't want to tie up a $20k server for my virt experiments when I can get them all done on a much cheaper workstation). However, the workstation has to run Windows, so I can't put Linux directly on the machine.
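
For anyone who wants to poke at the same setup: nested virt in Workstation is just the "Virtualize Intel VT-x/EPT or AMD-V/RVI" checkbox (vhv.enable in the .vmx), plus confirming inside the guest that KVM can see the extensions. Roughly:

code:
# on the Workstation side, equivalent to ticking the VT-x/EPT checkbox, in the guest's .vmx:
#   vhv.enable = "TRUE"

# inside the Xenial guest:
egrep -c '(vmx|svm)' /proc/cpuinfo   # non-zero means the VT extensions are exposed
kvm-ok                               # from the cpu-checker package
lsmod | grep kvm                     # kvm and kvm_intel should be loaded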

chutwig
May 28, 2001

BURLAP SATCHEL OF CRACKERJACKS

chutwig posted:

Does anyone have experience with doing nested virtualization in Linux guests in VMware Workstation?

In the event anyone cares, the nested virt stability issues were conclusively solved by moving to version 14. I spent all day building stuff in the same VM in Player 14 with no hiccups.
