Space Whale
Nov 6, 2014

qhat posted:

I don't know why but there's a team in my company using some dynamodbs because they think a million records is apparently too much data to query on without nosql. This is the same team whose webpage load time I brought down from 15 seconds to half a second by adding a single index to one of the SQL tables they do have.

The funniest thing is they're not even sharding their dynamodbs rofl.

Jesus Christ why do I have imposter syndrome ever then
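
for the record, that 15-seconds-to-half-a-second story is exactly what a missing index looks like. here's a toy sqlite sketch (table and column names invented, nothing to do with their actual schema) showing the planner flip from a full table scan to an index search once the index exists:

```python
import sqlite3

# invented stand-in table; the point is the query plan, not the data
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")
con.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    ((i % 10_000, i * 0.01) for i in range(100_000)),
)

def plan(sql):
    # EXPLAIN QUERY PLAN rows carry the human-readable step in column 3
    return " ".join(row[3] for row in con.execute("EXPLAIN QUERY PLAN " + sql))

q = "SELECT total FROM orders WHERE customer_id = 42"
before = plan(q)   # without an index: a scan over every row
con.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
after = plan(q)    # with it: a direct index search
```

same query, same data; only the access path changes, which is why one `CREATE INDEX` can take a page from unusable to instant.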


hobbesmaster
Jan 28, 2008

Space Whale posted:

Jesus Christ why do I have imposter syndrome ever then

because you don't know how to wrap something simple in a bunch of crazy poo poo

see: :cloud: lightbulbs

PIZZA.BAT
Nov 12, 2016


:cheers:


Space Whale posted:

5000 tho and a tb of data?

oh lol nvm. yeah we’re not brought in until the client is dealing with petabytes of data that has to be queried in milliseconds.

nosql still has its advantages but at that level it’s definitely not that

MononcQc
May 29, 2007

Sapozhnik posted:

when you've got a working set that can't fit into a single instance's ram then you start thinking about application level sharding

when you're google or facebook or amazon and your business data is a gigantic planet-spanning graph, that's the point at which you bust out the super-heavy weaponry.
afaict whatsapp has over a billion users and most of their work is done by regular DBs (in fact bad ones like in Erlang that can't store more than what fits in RAM -- unless they have patches not mentioned in their talks) and very clever sharding adapted to the replication and query schemes they need.

They probably don't have that much data in terms of absolute storage (I'd bet all poo poo like videos and images gets thrown in a separate store and they just keep a ref/URI in the main DB), but the thing is you can get really loving far with just sharding.

Space Whale
Nov 6, 2014

MononcQc posted:

afaict whatsapp has over a billion users and most of their work is done by regular DBs (in fact bad ones like in Erlang that can't store more than what fits in RAM -- unless they have patches not mentioned in their talks) and very clever sharding adapted to the replication and query schemes they need.

They probably don't have that much data in terms of absolute storage (I'd bet all poo poo like videos and images gets thrown in a separate store and they just keep a ref/URI in the main DB), but the thing is you can get really loving far with just sharding.

So how does sharding actually work? Is there a load balancer or do the SQL servers just talk to each other?

Janitor Prime
Jan 22, 2004

PC LOAD LETTER

What da fuck does that mean

Fun Shoe

Space Whale posted:

So how does sharding actually work? Is there a load balancer or do the SQL servers just talk to each other?

Distributed Hashing and black magic or some poo poo

Space Whale
Nov 6, 2014

Janitor Prime posted:

Distributed Hashing and black magic or some poo poo

I thought that was red and green, since you're sharing the hash

Arcsech
Aug 5, 2008

Space Whale posted:

So how does sharding actually work? Is there a load balancer or do the SQL servers just talk to each other?

i mean there's a few different ways to do it

if your data has very clear logical divisions - say, every user's data is 100% constrained to a geographic region - you can manually shard on that and just have a different server for each region/shard and have basically separate copies of your app for each region. see also: blizzard, a bunch of other online games where you have to specify whether you're logging into "North America" or "Korea" or whatever

some databases have it built in - you choose a sharding key and your queries all get routed to the correct server using the shard key specified in the query no matter which node they go to in the first place. some databases may provide options for cross-shard queries, but this varies

really "sharding" on its own doesn't imply a lot of detail, it just means "put your data into different buckets somehow so that each bucket is mostly independent", which is how all of the big-data systems like dynamo and cassandra work anyway. what matters is how much you do manually with knowledge of your application and query patterns vs. how much you rely on your database to do for you, which may or may not be well-suited to your application's actual workload

it probably won't be well-suited to your actual workload because designing your data model so that it does work well with dynamo/cassandra/bigtable/etc requires actual thought *sharts*
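
the "built in" flavor usually boils down to something like this (all names invented; note real systems tend to use consistent hashing rather than plain modulo, because modulo reshuffles nearly every key when you add a shard -- this is the toy version):

```python
import hashlib

# four hypothetical database servers; the shard key here is the user id
SHARDS = ["db-0", "db-1", "db-2", "db-3"]

def shard_for(user_id: str) -> str:
    # a stable hash means every app server / router node independently
    # computes the same placement, with no coordination at query time
    h = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    return SHARDS[h % len(SHARDS)]

# point query with the shard key in it: goes to exactly one server
target = shard_for("user:1234")

# cross-shard query: the expensive case, fan out to all and merge
fanout = [f"{s}: SELECT ..." for s in SHARDS]
```

the whole game is making sure your common queries carry the shard key so they stay in the one-server case.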

Arcsech fucked around with this message at 22:46 on Apr 27, 2018

hobbesmaster
Jan 28, 2008

please tell me the database term was not taken from ultima online

Arcsech
Aug 5, 2008

hobbesmaster posted:

please tell me the database term was not taken from ultima online

it seems as though this is a distinct possibility, given that the lore came from the need to create parallel independent instances of the game https://www.raphkoster.com/2009/01/08/database-sharding-came-from-uo/

per wikipedia there was a system for replicated data that had SHARD as an acronym for "System for Highly Available Replicated Data" that existed before ultima online but, well, given that nobody's ever heard of SHARD-the-database and database engineers are turbonerds, what do you think is more likely

kitten emergency
Jan 13, 2008

get meow this wack-ass crystal prison
most of the time people complain about poor db performance, they really just need better schemas and queries

Gazpacho
Jun 18, 2004

by Fluffdaddy
Slippery Tilde

hobbesmaster posted:

please tell me the database term was not taken from ultima online
smh @ people who don't associate it with the dark crystal :corsair:

qhat
Jul 6, 2015


uncurable mlady posted:

most of the time people complain about poor db performance, they really just need better schemas and queries

Unless you're Google, this is always the case.

Pythagoras a trois
Feb 19, 2004

I have a lot of points to make and I will make them later.
Do web developers or devops do any work on the side? I'm about to start selling some of my personal hours to the highest bidder, but elance makes me think my experience doing anything well is a liability because I'm not scrambling to do it fast / shittily / in bulk.

If good self employment strategies for extra cash belong in another thread I can post there instead.

kitten emergency
Jan 13, 2008

get meow this wack-ass crystal prison

qhat posted:

Unless you're Google, this is always the case.

eh, i wouldn't be quite that reductive

Space Whale
Nov 6, 2014

Arcsech posted:

i mean there's a few different ways to do it

if your data has very clear logical divisions - say, every user's data is 100% constrained to a geographic region - you can manually shard on that and just have a different server for each region/shard and have basically separate copies of your app for each region. see also: blizzard, a bunch of other online games where you have to specify whether you're logging into "North America" or "Korea" or whatever

some databases have it built in - you choose a sharding key and your queries all get routed to the correct server using the shard key specified in the query no matter which node they go to in the first place. some databases may provide options for cross-shard queries, but this varies

really "sharding" on its own doesn't imply a lot of detail, it just means "put your data into different buckets somehow so that each bucket is mostly independent", which is how all of the big-data systems like dynamo and cassandra work anyway. what matters is how much you do manually with knowledge of your application and query patterns vs. how much you rely on your database to do for you, which may or may not be well-suited to your application's actual workload

it probably won't be well-suited to your actual workload because designing your data model so that it does work well with dynamo/cassandra/bigtable/etc requires actual thought *sharts*

So if you have a lot of, say, MLS data and real estate poo poo you could... split by those divisions? :q:

Ellie Crabcakes
Feb 1, 2008

Stop emailing my boyfriend Gay Crungus

Cheekio posted:

Do web developers or devops do any work on the side?
On the weekends I reclaim my dignity by working as a lot lizard.

ADINSX
Sep 9, 2003

Wanna run with my crew huh? Rule cyberspace and crunch numbers like I do?

Relational databases definitely still have a place in the world, but work best as a view of the data, not the source of truth.

Sure, you can shard things out almost indefinitely, but when you want to run analytics on your entire dataset, or subsets of it, you're stuck stitching together partial results from thousands of databases.

The better approach (depending on your use case of course) is to try and have a single event stream that can put data in multiple places, the database being one of them. A snappy UI can be driven off a SQL database while larger analytics can come from BigQuery/Redshift/Spark Jobs/Beam Jobs etc etc. Important changes made in the UI can be broadcast to that event stream (as well as to the local database it's using, in some sort of "unofficial" change, to make the change seem immediate).

Using a single (or multiple) relational databases might be ok for a while, but it doesn't take "Google" levels of data for this to be a problem. I work at a medium-sized company trying to transition to this model, away from thousands of postgres instances across about 100 physical servers (with really beefy specs). The Postgres servers are fine (well, ok, not really) at delivering reports over relatively short time periods, but we constantly get requests from customers for reports across DB boundaries, or for long periods of time, and that really interferes with our transactional load.

Googling for WhatsApp's architecture diagram reveals they do this: data gets put in a relational database but also in riak, and probably in a bunch of other places they don't list on the public documents.
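
the shape of that, in miniature (everything here is invented for illustration -- in real life the bus is kafka/kinesis and the sinks are postgres, BigQuery, etc., not dicts):

```python
from collections import defaultdict

class EventBus:
    """stand-in for a durable stream: an append-only log plus fan-out."""
    def __init__(self):
        self.log = []        # the source of truth; replayable from zero
        self.sinks = []

    def publish(self, event):
        self.log.append(event)
        for sink in self.sinks:
            sink(event)      # each sink builds its own view of the data

bus = EventBus()
sql_view = {}                # stands in for the OLTP db the snappy UI reads
rollup = defaultdict(float)  # stands in for the BigQuery/Redshift side

def to_sql(e):
    sql_view[e["id"]] = e

def to_analytics(e):
    rollup[e["customer"]] += e["amount"]

bus.sinks += [to_sql, to_analytics]
bus.publish({"id": 1, "customer": "acme", "amount": 9.95})
bus.publish({"id": 2, "customer": "acme", "amount": 5.00})
# both views are derived from the same log; replaying bus.log rebuilds either
```

the database stops being the source of truth and becomes just another consumer, which is the whole point.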

HoboMan
Nov 4, 2010

sounds like you didn't shard correctly, op

qhat
Jul 6, 2015


Hitting up a buddy who works at a big-name company in Vancouver, seeing if he can get me a job there. It would be a huge pay increase if I were to take it, I reckon. Long live nepotism.

Space Whale
Nov 6, 2014
Vancouver Washington or B.C. tho

Blinkz0rz
May 27, 2001

MY CONTEMPT FOR MY OWN EMPLOYEES IS ONLY MATCHED BY MY LOVE FOR TOM BRADY'S SWEATY MAGA BALLS

ADINSX posted:

Relational databases definitely still have a place in the world, but work best as a view of the data, not the source of truth.

Sure, you can shard things out almost indefinitely, but when you want to run analytics on your entire dataset, or subsets of it, you're stuck stitching together partial results from thousands of databases.

The better approach (depending on your use case of course) is to try and have a single event stream that can put data in multiple places, the database being one of them. A snappy UI can be driven off a SQL database while larger analytics can come from BigQuery/Redshift/Spark Jobs/Beam Jobs etc etc. Important changes made in the UI can be broadcast to that event stream (as well as to the local database it's using, in some sort of "unofficial" change, to make the change seem immediate).

Using a single (or multiple) relational databases might be ok for a while, but it doesn't take "Google" levels of data for this to be a problem. I work at a medium-sized company trying to transition to this model, away from thousands of postgres instances across about 100 physical servers (with really beefy specs). The Postgres servers are fine (well, ok, not really) at delivering reports over relatively short time periods, but we constantly get requests from customers for reports across DB boundaries, or for long periods of time, and that really interferes with our transactional load.

Googling for WhatsApp's architecture diagram reveals they do this: data gets put in a relational database but also in riak, and probably in a bunch of other places they don't list on the public documents.

we do the same except cassandra and a lot of beefy mysql (😱) rds instances

tk
Dec 10, 2003

Nap Ghost

ADINSX posted:

Relational databases definitely still have a place in the world, but work best as a view of the data, not the source of truth.

Sure, you can shard things out almost indefinitely, but when you want to run analytics on your entire dataset, or subsets of it, you're stuck stitching together partial results from thousands of databases.

The better approach (depending on your use case of course) is to try and have a single event stream that can put data in multiple places, the database being one of them. A snappy UI can be driven off a SQL database while larger analytics can come from BigQuery/Redshift/Spark Jobs/Beam Jobs etc etc. Important changes made in the UI can be broadcast to that event stream (as well as to the local database it's using, in some sort of "unofficial" change, to make the change seem immediate).

Using a single (or multiple) relational databases might be ok for a while, but it doesn't take "Google" levels of data for this to be a problem. I work at a medium-sized company trying to transition to this model, away from thousands of postgres instances across about 100 physical servers (with really beefy specs). The Postgres servers are fine (well, ok, not really) at delivering reports over relatively short time periods, but we constantly get requests from customers for reports across DB boundaries, or for long periods of time, and that really interferes with our transactional load.

Googling for WhatsApp's architecture diagram reveals they do this: data gets put in a relational database but also in riak, and probably in a bunch of other places they don't list on the public documents.

Do this; it makes GDPR compliance real fun.

qhat
Jul 6, 2015


Space Whale posted:

Vancouver Washington or B.C. tho

BC.

ADINSX
Sep 9, 2003

Wanna run with my crew huh? Rule cyberspace and crunch numbers like I do?

tk posted:

Do this; it makes GDPR compliance real fun.

Yeah, this is the wrench that gets thrown into the model. It's fun to talk about everything as an event and the ability to recreate the exact same state by re-streaming all events from the beginning of time. But what happens when a customer leaves? Or they have a retention policy?

We're working on cleanup jobs that just go through all the views and remove the relevant data; basically you have to violate the "immutable" part of the events. In an ideal world there would be "delete" events that would remove records from the views... but you still need to remove them from any copy of the event stream itself, so what can you do.
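
one trick that comes up for exactly this problem is "crypto-shredding": encrypt each customer's payloads with a per-customer key kept outside the stream, and honor a deletion request by destroying the key, so the log itself stays immutable. not claiming that's what anyone's shop actually does -- just a sketch, with toy XOR standing in for real encryption:

```python
import os

keys = {}   # per-customer keys, stored OUTSIDE the event stream

def _xor(data: bytes, key: bytes) -> bytes:
    # placeholder "cipher"; a real system would use AES-GCM or similar
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

def append_event(log, customer, payload: bytes):
    key = keys.setdefault(customer, os.urandom(16))
    log.append((customer, _xor(payload, key)))

def read_event(entry):
    customer, blob = entry
    key = keys.get(customer)
    return _xor(blob, key) if key else None  # key gone => unreadable

log = []
append_event(log, "acme", b"order 42")
assert read_event(log[0]) == b"order 42"
del keys["acme"]                  # the "forget me" request: shred the key
assert read_event(log[0]) is None # log untouched, data effectively gone
```

the immutability is preserved at the cost of a key-management problem, which may or may not be a good trade depending on the retention rules you're under.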

I guess the main point of the post was that saying "only google cares about this" is not true; it's not hard to end up with more data than a cluster of databases can handle well. I don't even think it's wrong for startups to design things from the onset this way. Maybe it's a little resume-driven development, but in their eyes they either grow to that scale or die anyway, so might as well plan to succeed.

I'm gonna be interviewing with a satellite imaging company next week. They produce about 5-10TB of imagery a day and deal with even more serious "big data" problems, so that's pretty exciting.

Notorious b.s.d.
Jan 25, 2003

by Reene

ADINSX posted:

Using a single (or multiple) relational databases might be ok for awhile, but it doesn't take "Google" levels of data for this to be a problem. I work at a medium sized company trying to transition to this model, away from thousands of postgres instances across about 100 physical servers (with really beefy specs). The Postgres servers are fine (well, ok, not really) at delivering reports over relatively short time periods, but we constantly get requests from customers for reports across DB boundaries, or for long periods of time, and that really interferes with our transactional load.

with very few exceptions, sql databases are built and sold for oltp workloads. always have been.

reporting and analysis are weak, at best. you are not going to do olap on a database designed for oltp, and vice versa

Notorious b.s.d.
Jan 25, 2003

by Reene
i am also curious what you consider "beefy." it's 2018: you can order an off-the-shelf x86 server with 384 cores and 48 tb of ram

if you have deep pockets, petabytes of ram and thousands of cores are an option

Notorious b.s.d.
Jan 25, 2003

by Reene

Cheekio posted:

Do web developers or devops do any work on the side? I'm about to start selling some of my personal hours to the highest bidder, but elance makes me think my experience doing anything well is a liability because I'm not scrambling to do it fast / shittily / in bulk.

If good self employment strategies for extra cash belong in another thread I can post there instead.

elance and friends are a complete waste of time

the buyers are idiots and the only successful sellers are in low-CoL countries so they will work for $5/hr

don't even consider it unless you live in rural india and you have a very high tolerance for bullshit

ADINSX
Sep 9, 2003

Wanna run with my crew huh? Rule cyberspace and crunch numbers like I do?

I think we're saying the same thing? Their main strength is OLTP, but of course you can get away with running analytics on them as well, especially when the data is small. When our company was building this out, it was the mid aughts and it was a team of people who thought the database could solve everything (I've only been here for 2 years, so this is secondhand). So we have OLTP, OLAP, and even a nightmarish web of triggers to create business objects as events are inserted into the database. It's a real triple threat and now machines are falling over.

As for how beefy the servers are, not that beefy by those standards. Ram on the order of dozens of gigabytes, no idea about the number of cores. As I understand it most of them were acquired several years ago and are getting on in age, leaving the company with a decision: Buy a new round of hardware, or move most of the data to cloud services.

Fiedler
Jun 29, 2002

I, for one, welcome our new mouse overlords.

ADINSX posted:

As for how beefy the servers are, not that beefy by those standards. Ram on the order of dozens of gigabytes, no idea about the number of cores.

So not beefy at all, then.

ADINSX
Sep 9, 2003

Wanna run with my crew huh? Rule cyberspace and crunch numbers like I do?

please don't server shame

Notorious b.s.d.
Jan 25, 2003

by Reene

ADINSX posted:

I think we're saying the same thing? Their main strength is OLTP, but of course you can get away with running analytics on them as well, especially when the data is small. When our company was building this out, it was the mid aughts and it was a team of people who thought the database could solve everything (I've only been here for 2 years, so this is secondhand). So we have OLTP, OLAP, and even a nightmarish web of triggers to create business objects as events are inserted into the database. It's a real triple threat and now machines are falling over.

yeah this is not a fundamental problem in sql databases

this is a problem at your company. y'all should... stop doing that.

ADINSX posted:

As for how beefy the servers are, not that beefy by those standards. Ram on the order of dozens of gigabytes, no idea about the number of cores. As I understand it most of them were acquired several years ago and are getting on in age, leaving the company with a decision: Buy a new round of hardware, or move most of the data to cloud services.

i think my smallest server at work has 768 gb. they are 1U boxes that we use for dumb cloud-type poo poo

you should definitely buy a new round of hardware, first, to buy you time for the multi-year transition to cloud services. that ain't exactly like flipping a switch or something

Sapozhnik
Jan 2, 2005

Nap Ghost

Notorious b.s.d. posted:

i am also curious what you consider "beefy." it's 2018: you can order an off-the-shelf x86 server with 384 cores and 48 tb of ram

if you have deep pockets, petabytes of ram and thousands of cores are an option

spending a quarter of a million dollars or more on a beast like that just seems like all sorts of bad idea.

Notorious b.s.d.
Jan 25, 2003

by Reene

Sapozhnik posted:

spending a quarter of a million dollars or more on a beast like that just seems like all sorts of bad idea.

a quarter of a million dollars would be a rounding error on my group's technology spend. not my company's spend. just my group.

commodity hardware is really, really cheap

for the cost of 1 fte, you could have four beefy database servers in two redundant pairs (bearing in mind they will be amortized out across three or four years on a lease)

Notorious b.s.d. fucked around with this message at 19:45 on Apr 28, 2018

Sapozhnik
Jan 2, 2005

Nap Ghost
vast acres of ram are always nice to have but how can you make effective use of that many cores tied to a single system bus?

your budget is enviable but also not indicative of anything in one direction or the other. any fool can shovel money into a furnace.

Notorious b.s.d.
Jan 25, 2003

by Reene

Sapozhnik posted:

vast acres of ram are always nice to have but how can you make effective use of that many cores tied to a single system bus?

obviously reaching across sockets on a NUMA system is a lot slower than local access

but it is much, much, much, much, much faster than reaching out across a network
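
for a sense of scale, the usual rule-of-thumb latency ballparks (orders of magnitude, not measurements from any particular box):

```python
# rough "latency numbers every programmer should know" figures, in ns
ns = {
    "local DRAM access":          100,      # ~100 ns
    "remote NUMA socket access":  300,      # a few x local, roughly
    "same-DC network round trip": 500_000,  # ~0.5 ms
}

numa_penalty = ns["remote NUMA socket access"] / ns["local DRAM access"]
net_penalty = ns["same-DC network round trip"] / ns["local DRAM access"]
# crossing sockets costs a small constant factor; crossing the network
# costs three-plus orders of magnitude
```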

ADINSX
Sep 9, 2003

Wanna run with my crew huh? Rule cyberspace and crunch numbers like I do?

Notorious b.s.d. posted:

yeah this is not a fundamental problem in sql databases

this is a problem at your company. y'all should... stop doing that.

Yes... I know... we are in the process of stopping.

It's not a fundamental problem with relational databases, but it is a fundamental problem when you say "relational databases can be used for anything so long as you shard them and use indexes", which is how we started talking about this.

Notorious b.s.d. posted:

i think my smallest server at work has 768 gb. they are 1U boxes that we use for dumb cloud-type poo poo

you should definitely buy a new round of hardware, first, to buy you time for the multi-year transition to cloud services. that ain't exactly like flipping a switch or something

We're about a year into the transition, and even if we weren't, it's not up to me; I'm only on the software side of things.

Notorious b.s.d.
Jan 25, 2003

by Reene

Sapozhnik posted:

your budget is enviable but also not indicative of anything in one direction or the other. any fool can shovel money into a furnace.

it is a reminder that not all the world is a startup struggling to justify a larger ec2 instance type

technology in general is insanely expensive, and commodity hardware is very, very cheap compared to either the cost of labor or the business value generated

spending a quarter million on a beefy database server isn't just a good idea, it's often a no-brainer -- why invest any time or effort into designing a replacement for an oltp database when you can just upgrade the servers

jony neuemonic
Nov 13, 2009

Notorious b.s.d. posted:

you should definitely buy a new round of hardware, first, to buy you time for the multi-year transition to cloud services. that ain't exactly like flipping a switch or something

current job refuses to accept this idea or the reality of their situation. they’re absolutely certain they can just dump two decades of legacy crap into aws if they just try harder.


ADINSX
Sep 9, 2003

Wanna run with my crew huh? Rule cyberspace and crunch numbers like I do?

jony neuemonic posted:

current job refuses to accept this idea or the reality of their situation. they’re absolutely certain they can just dump two decades of legacy crap into aws if they just try harder.

Because they realize once the immediate fire is put out the organization will go right back to creating a bunch of small new fires
