PIZZA.BAT
Nov 12, 2016


:cheers:


I'm kind of surprised we don't have one of these. This thread can be the general catch-all for NoSQL tech discussion and questions.

There seems to be a bit of confusion around what NoSQL actually is so this OP will be a crash course.

What IS NoSQL?
In a nutshell, NoSQL is any database tech that stores information in any manner besides tables formed of rows and columns linked together by joins. This is a very broad category because, needless to say, there are a lot of ways to store data besides in tables!

Such as what?
There are four primary flavors of NoSQL that make up most of what's out there:
  • Key-Value : Basically a really sophisticated hashmap. Objects are stored in the DB and retrieved by key. That's basically it. Key players include Dynamo, Memcached, Aerospike, and Redis.
  • Column or Columnar : Basically the same as your standard relational DB, only it's column-oriented rather than row-oriented. The basic point is to fundamentally change how data is clustered and stored on disk in order to make certain queries/writes much more performant. The only column DB I know off the top of my head is Cassandra.
  • Document : This is what most people think of when they hear NoSQL. These DBs store data as documents, usually XML or JSON. Instead of representing an entity via a series of rows joined across multiple tables, you hold all the data in a single document. The primary advantage here is that your queries become dramatically simpler and you don't have to spend tons of up-front time modeling data. Big players are Mongo and Couch. If you have $$$ and need ACID you probably want MarkLogic.
  • Graph : These things are insane but they're pretty fun. Data is stored as a 'web' of nodes all linked together by predicates. So you can have a 'web' of people as the subject nodes while your predicates are their relationships to each other. Then you can use SPARQL to ask questions like 'Give me all people who were friends with Elvis Presley's mother-in-law who were also two or fewer jumps away from Buddy Holly' or something like that. The primary player here is Neo4j.
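To make the key-value and document flavors concrete, here's a toy sketch in plain Python (just dicts, no real client libraries) showing the same record in both styles:

```python
# Toy illustration: the same 'person' record in two NoSQL flavors.
# No real database clients here -- just plain Python structures.

# Key-value: the DB only understands get/put by key; the value is opaque.
kv_store = {}
kv_store["person:42"] = '{"name": "Elvis", "city": "Memphis"}'  # value is just a blob

# Document: the DB understands the structure, so nested data stays together
# instead of being joined across tables.
doc_store = {
    "person:42": {
        "name": "Elvis",
        "city": "Memphis",
        "albums": [  # would be a separate table plus a join in a relational DB
            {"title": "Elvis Presley", "year": 1956},
        ],
    }
}

# A document DB can query inside the value; a pure KV store cannot.
memphis_folks = [d for d in doc_store.values() if d["city"] == "Memphis"]
print([d["name"] for d in memphis_folks])  # ['Elvis']
```

That last line is the whole pitch for document stores: the query runs against the document's structure, no joins required.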

Additional Reading: Check out NoSQL For Dummies. You can also find digital versions of this for free if you Google around a bit.

PIZZA.BAT fucked around with this message at 11:06 on Aug 27, 2019


rarbatrol
Apr 17, 2011

Hurt//maim//kill.
Part of my job a couple years ago involved evaluating a bunch of NoSQL technologies for integration with our current stack, for the better part of a year. One of the candidates on my personal short list was FoundationDB. It's a high-performance key-value store with ACID transactions, and a plugin system they call "layers" that lets you map more complex models down to the KV store. Frustratingly enough, the company was bought by Apple halfway through our project and the product was taken off the market. Well, apparently they open-sourced it earlier this year: https://github.com/apple/foundationdb, and they released a document store layer in the last few days.

We ended up using Elasticsearch (also open source: https://github.com/elastic/elasticsearch), which is more of a search engine than a document database, but it kind of straddles the line and ended up working okay. The searching capabilities are fantastic, but the default configurations are only hobbyist-friendly, so you either need to know what you're doing or pay them for support.

rarbatrol fucked around with this message at 00:59 on Dec 5, 2018

PIZZA.BAT
Nov 12, 2016


:cheers:


rarbatrol posted:

Part of my job a couple years ago involved evaluating a bunch of NoSQL technologies for integration with our current stack, for the better part of a year. One of the candidates on my personal short list was FoundationDB. It's a high-performance key-value store with ACID transactions, and a plugin system they call "layers" that allow you to map more complex models down to the KV store. Frustratingly enough, the company was bought by Apple halfway through our project and taken off the market. Well apparently they've open-sourced it earlier this year: https://github.com/apple/foundationdb, and have just released a document store layer in the last few days.

Ha, I used to be able to brag about how the only other ACID NoSQL DB on the market was gobbled up and blackholed by Apple. Good to know they're kind of back, though. This will be interesting to play with.

Arcsech
Aug 5, 2008

rarbatrol posted:

We ended up using Elasticsearch (also open source: https://github.com/elastic/elasticsearch), which is more of a search engine than a document database, but it kind of straddles the line and ended up working okay. The searching capabilities are fantastic, but the default configurations are hobbyist friendly, which you either need to know what you're doing or pay them for support.

I was going to mention Elasticsearch - Elastic has historically been pretty particular about calling it a "search engine" rather than a "database", but it's basically a database, especially with recent improvements to reliability. It's also kind of going in a time-series direction for logs and metrics and whatnot.

Another database that's NoSQL (although ACID and sorta-relational) and pretty interesting is Datomic by Cognitect, the folks behind Clojure, although it's not open source. It's basically Datalog as a database, with built-in event sourcing, so you can do point-in-time queries and treat the database at a particular point in time as an immutable value without holding a transaction open. It's pretty weird; there's nothing else really like it. Even though I linked to it, don't bother looking at the website - it's utterly useless. Just watch this video instead.
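To give a flavor of the point-in-time idea, here's a made-up sketch (nothing like Datomic's actual API): store immutable facts tagged with a transaction number, and answer "as of tx N" queries by replaying the log up to that point:

```python
# Minimal sketch of an append-only fact store with point-in-time queries,
# in the spirit of Datomic. Names and shapes are invented, not the real API.
facts = []  # each fact: (entity, attribute, value, tx, added?)

def transact(tx, assertions, retractions=()):
    for e, a, v in assertions:
        facts.append((e, a, v, tx, True))
    for e, a, v in retractions:
        facts.append((e, a, v, tx, False))

def value_as_of(entity, attribute, tx):
    """Replay the log up to tx; later facts win, retractions clear a value."""
    current = None
    for e, a, v, t, added in facts:
        if t <= tx and e == entity and a == attribute:
            if added:
                current = v
            elif current == v:
                current = None
    return current

transact(1, [("user/1", "email", "old@example.com")])
transact(2, [("user/1", "email", "new@example.com")],
            retractions=[("user/1", "email", "old@example.com")])

print(value_as_of("user/1", "email", 1))  # old@example.com
print(value_as_of("user/1", "email", 2))  # new@example.com
```

Because nothing is ever mutated, "the database as of tx 1" is just a filter over the log - no open transaction needed.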

luchadornado
Oct 7, 2004

A boombox is not a toy!

I'm glad this thread is here, even though C* and Mongo have given me plenty of headaches over the years. Mongo was horrible with split brain (maybe better after the Jepsen-driven tweaks of 3.4+), and tombstoning in C* is the current bane of my existence. https://jepsen.io/analyses is some good reading on the pitfalls of databases that give you the P in CAP - many of which are NoSQL.

Don't forget God-tier RocksDB (https://rocksdb.org/) in the KV section. Arguments abound that Kafka is actually a DB as well.
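For anyone unfamiliar with the tombstone pain mentioned above: LSM-style stores like Cassandra don't delete in place; they write a delete marker that sticks around until compaction, and reads have to wade through them. A toy sketch of the read path (invented structure, not real Cassandra internals):

```python
# Toy LSM-style read path: deletes are tombstone records, not removals.
# Illustrative only -- real Cassandra is far more involved.
TOMBSTONE = object()

sstables = [  # oldest to newest immutable segments on disk
    {"a": 1, "b": 2, "c": 3},
    {"b": TOMBSTONE},          # 'b' was deleted: a marker is written, old data remains
    {"c": 30},
]

def read(key):
    """Scan newest-to-oldest; a tombstone shadows any older value."""
    for segment in reversed(sstables):
        if key in segment:
            value = segment[key]
            return None if value is TOMBSTONE else value
    return None

print(read("a"), read("b"), read("c"))  # 1 None 30
```

The cost shows up when a partition accumulates thousands of tombstones: every read scans past them until compaction finally drops both marker and shadowed data.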

Munkeymon
Aug 14, 2003

Motherfucker's got an
armor-piercing crowbar! Rigoddamndicu𝜆ous.



rarbatrol posted:

Part of my job a couple years ago involved evaluating a bunch of NoSQL technologies for integration with our current stack, for the better part of a year. One of the candidates on my personal short list was FoundationDB. It's a high-performance key-value store with ACID transactions, and a plugin system they call "layers" that allow you to map more complex models down to the KV store. Frustratingly enough, the company was bought by Apple halfway through our project and taken off the market. Well apparently they've open-sourced it earlier this year: https://github.com/apple/foundationdb, and have just released a document store layer in the last few days.

We ended up using Elasticsearch (also open source: https://github.com/elastic/elasticsearch), which is more of a search engine than a document database, but it kind of straddles the line and ended up working okay. The searching capabilities are fantastic, but the default configurations are hobbyist friendly, which you either need to know what you're doing or pay them for support.

The URL parser included the comma after the foundationdb link - might want to fix it if you feel like it. Looks cool, though.

rarbatrol
Apr 17, 2011

Hurt//maim//kill.

Munkeymon posted:

The URL parser included the comma after the foundationdb link - might want to fix it if you feel like it. Looks cool, though.

Thanks, I've corrected it.

While I'm here, one thing I've noticed during my research is that very few databases can actually handle stupid-large data. Elasticsearch can store and index a 10MB document, but it's going to be very, very grumpy about it. Microsoft's SQL Server can do something like 2GB in a single cell, but again, you're going to have a bad time. Maybe this is only a real problem in certain industries?
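One common workaround when a store chokes on huge values is to chunk them yourself: split the blob into pieces under the store's comfortable size and keep a small manifest. A hand-rolled sketch with a plain dict standing in for whatever database you're on:

```python
# Sketch: store an oversized blob as fixed-size chunks plus a manifest,
# so no single value exceeds what the backing store handles comfortably.
CHUNK_SIZE = 1 << 20  # 1 MiB; tune to the store's sweet spot

def put_large(store, key, blob):
    chunks = [blob[i:i + CHUNK_SIZE] for i in range(0, len(blob), CHUNK_SIZE)]
    for i, chunk in enumerate(chunks):
        store[f"{key}:chunk:{i}"] = chunk
    store[key] = {"chunks": len(chunks), "size": len(blob)}  # the manifest

def get_large(store, key):
    manifest = store[key]
    return b"".join(store[f"{key}:chunk:{i}"] for i in range(manifest["chunks"]))

store = {}
data = b"x" * (3 * CHUNK_SIZE + 123)  # ~3 MiB: one giant cell becomes 4 chunks
put_large(store, "doc:1", data)
assert get_large(store, "doc:1") == data
```

The trade-off is that you lose atomicity across chunks unless the store gives you transactions, which is part of why so few databases just solve this for you.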

PIZZA.BAT
Nov 12, 2016


:cheers:


My experience with MarkLogic is that it handles documents of any size pretty well, all the way up to its limit of half a gig.

Also full disclosure for this thread: I work for MarkLogic so I'm gonna have a bit of a bias itt

Scaramouche
Mar 26, 2001

SPACE FACE! SPACE FACE!

Closest I've gotten to NoSQL is whacking together Lucene/Solr search implementations, but I won't lie and say I had a deep understanding of the underlying tech.

I'm curious about the Graph stuff; is that related to the Facebook GraphQL stuff that Shopify also uses? I've basically only made request/response harnesses for them but never really set it up on the server side.

PIZZA.BAT
Nov 12, 2016


:cheers:


Scaramouche posted:

Closest I've gotten to NoSQL is wacking together Lucene/SOLR search implementations but I won't lie and say I had a deep understanding of the underlying.

I'm curious about the Graph stuff; is that related to the Facebook GraphQL stuff that Shopify also uses? I've basically only made request/response harnesses for them but never really set it up on the server side.

From my quick googling this seems to be solely an API tool rather than a storage solution.

The wiki article on graph DBs is actually pretty thorough and does a good job describing them: https://en.m.wikipedia.org/wiki/Graph_database

Ape Fist
Feb 23, 2007

Nowadays, you can do anything that you want; anal, oral, fisting, but you need to be wearing gloves, condoms, protection.
CosmosDB if you want to pay Microsoft a billion dollars for a hard-fork of Mongo.

luchadornado
Oct 7, 2004

A boombox is not a toy!

Scaramouche posted:

I'm curious about the Graph stuff; is that related to the Facebook GraphQL stuff that Shopify also uses?

Not at all - GraphQL is a query language for APIs, not unlike REST, except the consumer specifies the data and the shape returned. The backend does all the heavy lifting of determining how to fulfill and transform the response.

Graph databases are all about the relationships between nodes. If you wanted to store the business relationships between companies and query against them, that's where you might want a graph database. Imagine how you would model and query that with an RDBMS and your head will start hurting.
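A rough sketch of what a graph store is doing under the hood: edges are first-class, so "everyone within two hops" is a walk, not a pile of self-joins. Plain Python here, not Neo4j's API, and the company names are invented:

```python
# Toy graph: (subject, predicate, object) edges between company nodes.
edges = [
    ("acme", "supplies", "globex"),
    ("globex", "supplies", "initech"),
    ("acme", "competes_with", "initech"),
]

def neighbors(node, predicate):
    return [o for s, p, o in edges if s == node and p == predicate]

def within_hops(start, predicate, max_hops):
    """Everything reachable from start via `predicate` in <= max_hops steps."""
    frontier, seen = {start}, set()
    for _ in range(max_hops):
        frontier = {n for f in frontier for n in neighbors(f, predicate)} - seen
        seen |= frontier
    return seen

print(sorted(within_hops("acme", "supplies", 2)))  # ['globex', 'initech']
```

In an RDBMS each extra hop is another self-join on the relationships table; in a graph DB the traversal depth is just a query parameter.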

Naar
Aug 19, 2003

The Time of the Eye is now
Fun Shoe
When I worked at a certain well-known broadcasting corporation, we gave MarkLogic a shitload of money. Apparently they did buy some of the team very expensive tequila, though? Kudos Rex-Goliath, it actually works pretty well (though please stop making those cringeworthy videos).

Star War Sex Parrot
Oct 2, 2003

rarbatrol posted:

Part of my job a couple years ago involved evaluating a bunch of NoSQL technologies for integration with our current stack, for the better part of a year. One of the candidates on my personal short list was FoundationDB. It's a high-performance key-value store with ACID transactions, and a plugin system they call "layers" that allow you to map more complex models down to the KV store. Frustratingly enough, the company was bought by Apple halfway through our project and taken off the market. Well apparently they've open-sourced it earlier this year: https://github.com/apple/foundationdb, and have just released a document store layer in the last few days.
Yeah FoundationDB is awesome. There are several impressive products out there built on top of it (for example, Snowflake stores their metadata in Foundation).

PIZZA.BAT
Nov 12, 2016


:cheers:


Naar posted:

When I worked at a certain well-known broadcasting corporation, we gave MarkLogic a shitload of money. Apparently they did buy some of the team very expensive tequila, though? Kudos Rex-Goliath, it actually works pretty well (though please stop making those cringeworthy videos).

lol yeah our marketing is all around bad and everyone here knows it. it’s bad enough that every client has told me it sucks straight to my face. ¯\_(ツ)_/¯

xpander
Sep 2, 2004
Does anyone here have experience using RavenDB? The place I'm working at now uses it, and while I wasn't a fan at first, the new version (4.0) seems pretty decent.

PIZZA.BAT
Nov 12, 2016


:cheers:


Amazon just fired a broadside into MongoDB’s hull: https://aws.amazon.com/blogs/aws/new-amazon-documentdb-with-mongodb-compatibility-fast-scalable-and-highly-available/

Not only did they make a full document database that largely mimics MongoDB - they straight up cloned the API. IMO this is really going to hurt Mongo's ability to move forward, as their customer base is going to cement on the APIs that allow them to switch back and forth with Amazon's service. But maybe not.

tadashi
Feb 20, 2006

This is exciting stuff, especially if you want to wow people with the term "Big Data".

Just thought I'd share:
If people are considering or have already decided to use AWS, one thing that an AWS specialist brought up at a boot camp I went to was that, if you have a lot of data, it may be cheaper to dump all your data into an S3 bucket (in whatever format you're using) before ingesting it into your actual database. Just load what you need into the database service.

Excuse me for not understanding every situation where that will apply, but the main point was that database services are expensive compared to S3.

Arcsech
Aug 5, 2008

Rex-Goliath posted:

Amazon just fired a broadside into MongoDB’s hull: https://aws.amazon.com/blogs/aws/new-amazon-documentdb-with-mongodb-compatibility-fast-scalable-and-highly-available/

Not only did they make a full document database that largely mimics MongoDB- they straight up cloned the API. IMO this is really going to hurt Mongo’s ability to move forward as their customer base is going to cement on the APIs that allow them to switch back and forth with Amazon’s service. But maybe not.

From what I've read elsewhere online, it's 1) only kinda-compatible with MongoDB (a lot of things don't work, or don't work the same as on the real MongoDB), even compared to the last OSS Mongo version, and 2) probably based on Postgres' JSONB support using Aurora on the backend, which will have significantly different performance characteristics, similar to this old, semi-abandoned project.

I would avoid using this - I suspect its primary purpose is for political games, rather than actually being a good product.

Full disclosure: I work for a company that competes with (a portion of) AWS, but these are my honest thoughts and I don't think I'd think any different were I in a different position.

Arcsech fucked around with this message at 18:29 on Jan 14, 2019

abelwingnut
Dec 23, 2002


oh hell yes. been hoping this thread would pop up.

i have some bigger questions for later, but right now a rather small one. what’s the best front-end option for a local, off-the-network mongodb? have used studio3t, but i’m not sure if there’s better out there. free would be nice too.

Razzled
Feb 3, 2011

MY HARLEY IS COOL
Anyone have any good reading or class suggestions for data modeling knowledge with Cassandra? It's an area where I have so little experience that almost all of my suggestions or work in that area amounts to trial and error. I understand the basics but when it comes to best practices etc I'm just totally in the dark

PIZZA.BAT
Nov 12, 2016


:cheers:


Abel Wingnut posted:

oh hell yes. been hoping this thread would pop up.

i have some bigger questions for later, but right now a rather small one. what’s the best front-end option for a local, off-the-network mongodb? have used studio3t, but i’m not sure if there’s better out there. free would be nice too.

Razzled posted:

Anyone have any good reading or class suggestions for data modeling knowledge with Cassandra? It's an area where I have so little experience that almost all of my suggestions or work in that area amounts to trial and error. I understand the basics but when it comes to best practices etc I'm just totally in the dark

Unfortunately I think this thread is gonna be pretty quiet for a while until more people find us. If you guys find answers to your stuff though please come back to fill us in!

Mr Shiny Pants
Nov 12, 2012
Is this also the place to ask questions about Hadoop?

PIZZA.BAT
Nov 12, 2016


:cheers:


Mr Shiny Pants posted:

Is this also the place to ask questions about Hadoop?

Sure!

BabyFur Denny
Mar 18, 2003

Mr Shiny Pants posted:

Is this also the place to ask questions about Hadoop?

go ahead, I know a lot about hadoop and kafka thingies.

abelwingnut
Dec 23, 2002


alright, so let me know what you guys think of this

i need to monitor millions of hits/impressions per day--hopefully way more in the not too distant future. at the same time, i want to back the data up and run queries on it. pretty normal request for the data world, but i'm new to nosql, so let me know if this makes sense.

given the high frequency of reads and writes we expect to perform on this initial, capturing database, i was thinking of installing a scylla instance. i've heard it's extremely quick, and that's really what i'm after with this database. no querying, and it will never be huge, so space isn't a big concern. the primary purpose is to capture data from thousands of sources, then move it to the next database. so it seems like scylla fits?

so the data scylla captures would then be offloaded hourly/quarter-daily/half-daily/daily (need to feel this out) to another scylladb, or possibly mongodb or cassandradb, that would serve as the archive. this would really be about storage, doubling as a backup.

for querying, then i'd run nightly scripts to aggregate data from the archive to probably a sql instance. i don't need to get too granular with the queries, it really is more about the like-groups of data.

thoughts? too many moving parts? is there a way to consolidate some of these things? basically need speed up front, storage in the middle, and querying power on the side. figured these were the best solutions, and hope to test them out this week. let me know if you have any other potential db systems in mind. like i said, i'm new to this world. i'm really a sql guy through-and-through.

e: in an ideal world, all of these db would be sql-based. everyone on the team knows the language, and it's tried and true. problem is i don't think it can handle the i/o required for the first capturing db, and i know it can't handle the size the archive will eventually get to. happy to be proven wrong if you have examples, though!

abelwingnut fucked around with this message at 15:21 on Feb 5, 2019

PIZZA.BAT
Nov 12, 2016


:cheers:


Abel Wingnut posted:

alright, so let me know what you guys think of this

It sounds like you're trying to build a data hub

Essentially, you split your database into two databases, which it seems like you're already doing. Your first database is the 'staging' database, which basically acts as your classical data lake that you dump your data into. You then have scheduled jobs that iterate over that dumped data, harmonize it into your clean product, and place it into your second database, 'final'.

Something you want to keep in mind is that with a SQL db you're going to have a lot of hurdles to clear any time you want to ingest new data sources or an existing source changes. It's much easier to ingest the data as-is as quickly as possible and then figure it out in your harmonization stage. Likewise, making your final db relational also places a hurdle in front of you any time you decide you want different information out of your harmonization - which is going to happen a lot as new data sources change your understanding of what you want out of your final product. It's much easier to keep your staging and final dbs non-relational and then possibly export that final data under a relational lens.
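The staging/harmonize/final split can be sketched in a few lines. Everything here is invented for illustration (field names, sources, the dict standing in for each database):

```python
# Sketch of the data-hub pattern: dump raw records as-is into 'staging',
# then a scheduled job harmonizes them into 'final'. All names invented.
staging = [  # raw dumps from two sources; the shapes differ, and that's fine
    {"src": "crm", "Name": "ACME Corp", "rev": "1,000"},
    {"src": "erp", "company_name": "acme corp", "revenue": 1000},
]
final = {}

def harmonize(record):
    """Map whatever shape arrived into the one clean shape we query."""
    name = record.get("Name") or record.get("company_name")
    rev = record.get("revenue") or int(str(record.get("rev")).replace(",", ""))
    return {"name": name.title(), "revenue": rev}

def run_batch():
    for raw in staging:
        doc = harmonize(raw)
        final[doc["name"]] = doc  # last write wins; a real job would merge

run_batch()
print(final["Acme Corp"])  # {'name': 'Acme Corp', 'revenue': 1000}
```

The point of the pattern: when a new source shows up, only `harmonize` changes - ingestion into staging never blocks on schema work.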

BabyFur Denny
Mar 18, 2003
I would have suggested Apache Kafka as the transport layer; it's commonly used to deliver log messages to any downstream consumers and is a very solid piece of software. From there you can basically persist to as many different systems as you like.
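The property that makes Kafka fit this job is that it's an append-only log where each consumer tracks its own offset, so many downstream systems can read the same stream at their own pace. A toy in-memory version of that idea (nothing like the real Kafka client API):

```python
# Toy append-only log with per-consumer offsets, Kafka-style.
# Shows why one stream can feed many downstream systems independently.
class TopicLog:
    def __init__(self):
        self.records = []
        self.offsets = {}  # consumer name -> next index to read

    def produce(self, record):
        self.records.append(record)  # records are never mutated or removed

    def poll(self, consumer):
        """Hand this consumer everything since its last committed offset."""
        start = self.offsets.get(consumer, 0)
        batch = self.records[start:]
        self.offsets[consumer] = len(self.records)
        return batch

log = TopicLog()
log.produce({"event": "impression", "page": "/home"})
log.produce({"event": "click", "page": "/buy"})

print(len(log.poll("s3-archiver")))  # 2 -- the archiver sees both records
log.produce({"event": "impression", "page": "/home"})
print(len(log.poll("s3-archiver")))  # 1 -- only the record it hasn't seen
print(len(log.poll("analytics")))    # 3 -- a late consumer replays from the start
```

That last line is the selling point: adding a new downstream system later doesn't disturb the existing ones, it just starts reading from offset zero.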

Votlook
Aug 20, 2005

Abel Wingnut posted:

alright, so let me know what you guys think of this

I've worked on a system like that using Kafka, AWS Athena, and S3.

We shoved all the events into Kafka, set up some servers to read the events from Kafka, and stored them in S3 in some common format like JSON, using the date in the key. With minimal effort (setting up some partitioning) you can query all the events in S3 using SQL with Athena. We also hooked up a streaming platform like Flink/Kafka Streams/Storm to run some analytics as events arrived, which is nice for queries like "how many impressions did we get in the last hour" where you don't want to wait on a nightly batch job.

It was not the cheapest solution, but it worked pretty well and was easy to set up and operate by just a few people.
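The date-in-the-key trick is what makes the Athena part cheap-ish: if the S3 key embeds Hive-style year/month/day partitions, a query filtered on date only scans those prefixes. A sketch of the key layout (the prefix and field names are invented):

```python
# Sketch: build Athena/Hive-friendly S3 keys with date partitions baked in,
# so queries filtered on date only scan matching prefixes. Names invented.
from datetime import datetime, timezone

def s3_key(event, prefix="events"):
    ts = datetime.fromtimestamp(event["ts"], tz=timezone.utc)
    # year=/month=/day= is the partition convention Athena understands
    return (f"{prefix}/year={ts.year}/month={ts.month:02d}/day={ts.day:02d}/"
            f"{event['id']}.json")

event = {"id": "abc123", "ts": 1549324800, "event": "impression"}  # 2019-02-05 UTC
print(s3_key(event))  # events/year=2019/month=02/day=05/abc123.json
```

With that layout, `WHERE year='2019' AND month='02'` in Athena prunes every other prefix instead of scanning the whole bucket.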

Using relational dbs all the way is probably not a great fit; I'd use a NoSQL db to store the raw events and something like Spark to fill relational databases with weekly/hourly/etc. aggregates.

Votlook fucked around with this message at 17:06 on Feb 5, 2019

Razzled
Feb 3, 2011

MY HARLEY IS COOL
I will just point out that Athena can get very costly if you are doing queries over large files.

If you need to do real-time-ish processing, I'd look into handling all ETL through a Storm topology and then having your NoSQL db be the receiver of that stream of data. I want to say that any of the myriad analytics layers that can sit on top of Hadoop would be adequate as an archive for data at rest.

Votlook
Aug 20, 2005

Razzled posted:

I will just point out that Athena can get very costly if you are doing queries over large files.

If you need to do real-time esque processing I'd look into handling all ETL through a storm topology and then having your NOSQL db be the receiver of that stream of data. I want to say that any of the myriad of analytics layers that can sit on top of Hadoop would be adequate as an archive for data at rest.

Using Athena was pretty costly yes, but very convenient for a small team.

I really want to check out databases like ClickHouse or Druid sometime, they seem to fit this usecase well. Does anyone here have experience with those?

abelwingnut
Dec 23, 2002


thanks for all the recommendations above.

i'm actually putting it together right now. we're going with kafka to scylla--seemed like a better option than what we were thinking.

problem is i'm having the damnedest time trying to configure kafka, though. for whatever reason zookeeper will just not start as a service.



i believe it must have to do with my user rights. here's an example.



i'm able to run the script directly when i move to opt/zookeeper, but i am unable to run the script specifying the script's full location, opt/zookeeper/bin/zkServer.sh when my pwd is the root. that suggests to me something's up with my rights. i don't know, i'm entirely new to linux so this is really proving a pain in the rear end.

hmm, maybe not. when i try bin/zkServer.sh stop from opt/zookeeper i get nearly the same error as seen in my first screenshot:



pulling my hair out on this

abelwingnut fucked around with this message at 04:52 on Feb 20, 2019

BabyFur Denny
Mar 18, 2003
What do the logs say?

abelwingnut
Dec 23, 2002


well, i couldn't access the logs in the previous try. i ended up starting a new droplet with some better stats on digital ocean, then followed this: https://www.digitalocean.com/community/tutorials/how-to-install-apache-kafka-on-ubuntu-18-04

now it works despite not installing zookeeper. i have no idea how, but the tests are fine. it seems like there are some zookeeper services, but i truly have zero idea how they got there.

:confused:

Razzled
Feb 3, 2011

MY HARLEY IS COOL

Abel Wingnut posted:

well, i couldn't access the logs in the previous try. i ended up starting a new droplet with some better stats on digital ocean, then followed this: https://www.digitalocean.com/community/tutorials/how-to-install-apache-kafka-on-ubuntu-18-04

now it works despite not installing zookeeper. i have no idea how, but the tests are fine. it seems like there are some zookeeper services, but i truly have zero idea how they got there.

:confused:

One thing to keep in mind is that while Kafka REQUIRES ZooKeeper to run, ZooKeeper does not require Kafka. It's used by a multitude of platforms and can be set up as an independent cluster. For portability purposes, I suspect, Kafka bundles the zk server binaries so that you don't need a separate cluster to get a bare-minimum Kafka up and running. Most likely your original zk installation was misconfigured or installed in non-default locations.

Digiocean docs are good though

BabyFur Denny
Mar 18, 2003
Kafka is also trying to be less and less reliant on ZooKeeper. Tbh I don't know exactly what it's still needed for in the later 2.1 versions.

Gangsta Lean
Dec 3, 2001

Calm, relaxed...what could be more fulfilling?
Pretty sure Kafka either installed zookeeper as a dependency or came bundled with it when I installed it on Ubuntu 18.04 last spring/summer. I also remember having to use zookeeper host:port for some Kafka commands, and Kafka host:port for others.

Mr Shiny Pants
Nov 12, 2012
I am learning about Hadoop, Kafka and Spark to see if they would work for our company, does anyone have general do's and don'ts? They are pretty complex software packages. Some general experiences running them would be awesome.

I have just ordered a Raspberry Pi cluster to play around with Kubernetes and the aforementioned software. Inspired by: https://blog.sicara.com/build-own-cloud-kubernetes-raspberry-pi-9e5a98741b49

BabyFur Denny
Mar 18, 2003

Mr Shiny Pants posted:

I am learning about Hadoop, Kafka and Spark to see if they would work for our company, does anyone have general do's and don'ts? They are pretty complex software packages. Some general experiences running them would be awesome.

I have just ordered a Raspberry Pi cluster to play around with Kubernetes and the aforementioned software. Inspired by: https://blog.sicara.com/build-own-cloud-kubernetes-raspberry-pi-9e5a98741b49

Learn MapReduce, since the concepts are still everywhere in all modern frameworks even if the MapReduce framework itself is outdated. Learn when you need to redistribute your data by key (i.e. when data is shuffled around) and what can be done locally vs. distributed. Read Martin Kleppmann's Designing Data Intensive Applications. Figure out what happens if any individual component in your architecture fails. Avoid Spark.
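The map/shuffle/reduce shape in miniature, single-process, so you can see exactly where the key-based redistribution happens before reaching for a cluster:

```python
# MapReduce in miniature: map, shuffle (group by key), reduce.
# The shuffle step is the part that costs you on a real cluster --
# it's where data gets redistributed across the network by key.
from collections import defaultdict

docs = ["the quick brown fox", "the lazy dog", "the end"]

# map: each input record -> (key, value) pairs
mapped = [(word, 1) for doc in docs for word in doc.split()]

# shuffle: group all values by key (on a cluster, this moves data between nodes)
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# reduce: collapse each key's values, which can now happen locally per key
counts = {key: sum(values) for key, values in groups.items()}
print(counts["the"])  # 3
```

The map and reduce steps parallelize trivially; it's the shuffle in the middle that you want to minimize, which is the "redistribute by key" instinct mentioned above.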

Also feel free to ask here. I have plenty of experience with running those three frameworks.


DELETE CASCADE
Oct 25, 2017

i haven't washed my penis since i jerked it to a phtotograph of george w. bush in 2003
for your sanity's sake, make sure there's no possible way your data can fit in memory before you start farting around with mapping and reducing

  • Reply