PIZZA.BAT
Nov 12, 2016


:cheers:


I'm kind of surprised we don't have one of these. This thread can be the general catch-all for NoSQL tech discussion and questions.

There seems to be a bit of confusion around what NoSQL actually is, so this OP will be a crash course.

What IS NoSQL?
In a nutshell, NoSQL is any database tech that stores information in some manner other than tables of rows and columns linked together by joins. This is a very broad category because, needless to say, there are a lot of ways to store data besides in tables!

Such as what?
There are four primary flavors of NoSQL that make up most of what's out there:
  • Key-Value : Basically a really sophisticated hashmap. Objects are stored in the DB and retrieved by key. That's basically it. Key players include Dynamo, Memcached, Aerospike, and Redis.
  • Column or Columnar : Basically the same as your standard relational DB, only it's column-oriented rather than row-oriented. The point is to fundamentally change how data is clustered and stored on disk in order to make certain queries/writes much more performant. The only column DB I know off the top of my head is Cassandra.
  • Document : This is what most people think of when they hear NoSQL. These DBs store data as documents, usually XML or JSON. Instead of representing an entity as a series of rows joined across multiple tables, you hold all the data in a single document. The primary advantage is that your queries become dramatically simpler and you don't have to spend tons of up-front time modeling data. Big players are Mongo and Couch. If you have $$$ and need ACID you probably want MarkLogic.
  • Graph : These things are insane but they're pretty fun. Data is stored as a 'web' of nodes all linked together by predicates. So you can have a 'web' of people as the subject nodes while your predicates are their relationships to each other. Then you use a graph query language (SPARQL for RDF triple stores, Cypher for Neo4j) to ask questions like 'Give me all people who were friends with Elvis Presley's mother-in-law and who were also two or fewer hops away from Buddy Holly'. The primary player here is Neo4j.
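To make the key-value flavor concrete, here's a toy in-memory sketch in Python- not any real product's API, just the core contract these stores all share:

```python
# Toy in-memory key-value store: the entire "data model" is put/get by key.
# Real products (Redis, Memcached, DynamoDB) add persistence, eviction,
# replication, etc., but the core contract looks like this.
class ToyKVStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

store = ToyKVStore()
store.put("user:61872", {"name": "Alice", "city": "Springfield"})
print(store.get("user:61872")["name"])  # the value is an opaque blob to the store
```

The point is that the store itself knows nothing about what's inside the value- all structure, querying, and indexing beyond key lookup is your problem.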

Additional Reading: Check out NoSQL For Dummies. You can also find digital versions of this for free if you Google around a bit.

PIZZA.BAT fucked around with this message at 11:06 on Aug 27, 2019


PIZZA.BAT
Nov 12, 2016


:cheers:


rarbatrol posted:

Part of my job a couple years ago involved evaluating a bunch of NoSQL technologies for integration with our current stack, for the better part of a year. One of the candidates on my personal short list was FoundationDB. It's a high-performance key-value store with ACID transactions, and a plugin system they call "layers" that allow you to map more complex models down to the KV store. Frustratingly enough, the company was bought by Apple halfway through our project and taken off the market. Well apparently they've open-sourced it earlier this year: https://github.com/apple/foundationdb, and have just released a document store layer in the last few days.

Ha, I used to be able to brag that the only other ACID NoSQL DB on the market was gobbled up and blackholed by Apple. Good to know they're kind of back, though- this will be interesting to play with.

PIZZA.BAT
Nov 12, 2016


:cheers:


My experience with MarkLogic is that it handles documents well at pretty much any size, up to its limit of half a gig per document.

Also full disclosure for this thread: I work for MarkLogic so I'm gonna have a bit of a bias itt

PIZZA.BAT
Nov 12, 2016


:cheers:


Scaramouche posted:

Closest I've gotten to NoSQL is whacking together Lucene/SOLR search implementations, but I won't lie and say I had a deep understanding of the underlying tech.

I'm curious about the Graph stuff; is that related to the Facebook GraphQL stuff that Shopify also uses? I've basically only made request/response harnesses for them but never really set it up on the server side.

From my quick googling, GraphQL seems to be solely an API query tool rather than a storage solution- despite the name, it's not related to graph databases.

The wiki article on graph DBs is actually pretty thorough and does a good job describing them: https://en.m.wikipedia.org/wiki/Graph_database

PIZZA.BAT
Nov 12, 2016


:cheers:


Naar posted:

When I worked at a certain well-known broadcasting corporation, we gave MarkLogic a shitload of money. Apparently they did buy some of the team very expensive tequila, though? Kudos Rex-Goliath, it actually works pretty well (though please stop making those cringeworthy videos).

lol yeah our marketing is all around bad and everyone here knows it. it’s bad enough that every client has told me it sucks straight to my face. ¯\_(ツ)_/¯

PIZZA.BAT
Nov 12, 2016


:cheers:


Amazon just fired a broadside into MongoDB’s hull: https://aws.amazon.com/blogs/aws/new-amazon-documentdb-with-mongodb-compatibility-fast-scalable-and-highly-available/

Not only did they make a full document database that largely mimics MongoDB- they straight up cloned the API. IMO this is really going to hurt Mongo's ability to move forward, as their customer base will cement on the subset of APIs that lets them switch back and forth with Amazon's service. But maybe not.

PIZZA.BAT
Nov 12, 2016


:cheers:


Abel Wingnut posted:

oh hell yes. been hoping this thread would pop up.

i have some bigger questions for later, but right now a rather small one. what’s the best front-end option for a local, off-the-network mongodb? have used studio3t, but i’m not sure if there’s better out there. free would be nice too.

Razzled posted:

Anyone have any good reading or class suggestions for data modeling knowledge with Cassandra? It's an area where I have so little experience that almost all of my suggestions or work in that area amounts to trial and error. I understand the basics but when it comes to best practices etc I'm just totally in the dark

Unfortunately I think this thread is gonna be pretty quiet for a while until more people find us. If you guys find answers to your stuff though please come back to fill us in!

PIZZA.BAT
Nov 12, 2016


:cheers:


Mr Shiny Pants posted:

Is this also the place to ask questions about Hadoop?

Sure!

PIZZA.BAT
Nov 12, 2016


:cheers:


Abel Wingnut posted:

alright, so let me know what you guys think of this

It sounds like you're trying to build a data hub.

Essentially- you split your database into two databases, which it seems like you're already doing. Your first database is the 'staging' database, which basically acts as your classical data lake that you dump your data into. You then have scheduled jobs which iterate over that dumped data, harmonize it into your clean product, and place it into your second database, 'final'.

Something to keep in mind is that with a SQL db you're going to have a lot of hurdles to clear any time you want to ingest new data sources or an existing source changes. It's much easier to ingest the data as-is as quickly as possible and then figure it out in your harmonization stage. Likewise, making your final db relational also puts a hurdle in front of you any time you decide you want different information out of your harmonization- which is going to happen a lot, as new data sources change your understanding of what you want out of your final product. It's much easier to keep your staging and final dbs non-relational and then, if needed, export the final data under a relational lens.
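A toy sketch of that staging → harmonize → final flow in Python (the source names and field mappings are made up for illustration):

```python
# Staging: dump source records as-is, no schema enforcement on the way in.
staging = [
    {"src": "crm",     "FirstName": "Ada", "Surname": "Lovelace"},
    {"src": "billing", "fname": "Ada",     "lname": "Lovelace", "balance": 12.50},
]

def harmonize(record):
    """Map each source's quirks onto one clean shape. New or changed
    sources only require touching this step, not the ingest."""
    if record["src"] == "crm":
        return {"first": record["FirstName"], "last": record["Surname"]}
    if record["src"] == "billing":
        return {"first": record["fname"], "last": record["lname"],
                "balance": record["balance"]}
    raise ValueError(f"unknown source: {record['src']}")

# Final: the scheduled job iterates over staging and writes the clean product.
final = [harmonize(r) for r in staging]
```

Note that the ugly per-source logic lives entirely in the harmonization step- the ingest stays dumb and fast, which is the whole point.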

PIZZA.BAT
Nov 12, 2016


:cheers:


Abel Wingnut posted:

with couchbase, i could see making each hash a document, thereby making hash the key. then from there, i could create the necessary sub-objects (not really sure on the correct term--i need to research this more. arrays maybe?). in any case, then i could go wild with indices to satisfy the queries. indices in couchbase, from what i understand, are stored in memory, and are superfast. but this doesn't strike me as the best strategy, as i'll just be doubling the data in memory and being wasteful.

I don't have experience with Couchbase, but I DO have experience with document-based DBs, so I can maybe help with your data modeling a bit. Generally, if you're using a document store, you want your use case to drive your model. What are the business requirements driving this? If the queries going into the other database produce a single 'entity' to be consumed, then that's generally what you'll want your document to look like.

Secondly, on assigning keys: I'm assuming this is similar to a range index in MarkLogic, where the keys are stored in memory for fast retrieval. My general rule of thumb is to store primary keys in memory if you can afford it, due to the dramatic performance gain you get from it. Typically your keys are going to be a few dozen bytes tops, and multiplied across a few million documents that gives you what... a few hundred megs? That's assuming a naive storage method too, not something compact like a prefix tree. A very small price to pay considering the increased performance you get.
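Quick back-of-the-envelope version of that math (the 40-byte key size and 5M document count are just assumed round figures):

```python
# Rough cost of keeping every primary key in memory, assuming a naive
# flat layout rather than something compact like a prefix tree.
key_size_bytes = 40      # assumed: a few dozen bytes per key
doc_count = 5_000_000    # a few million documents
total_mb = key_size_bytes * doc_count / 1024 / 1024
print(round(total_mb))   # ~191 MB: "a few hundred megs", as claimed
```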

I'd recommend benchmarking the database and enabling the index just to see how much memory it takes. It can't hurt to try. Resources are meant to be spent!

PIZZA.BAT
Nov 12, 2016


:cheers:


Again- I don't know the specific technologies you're using, but most document stores don't naively store documents as-is. They should have some sort of compression occurring, so if you have a lot of redundant data you only really need to worry about it when it's in memory. Also, when I see clients worrying about 'lots of redundant data', I usually find when I drill down into what they're doing that it's because they're still using old SQL habits. For example- say you have a primary key that's used as an identifier across dozens or hundreds of different tables. A lot of times I'll see clients storing that identifier all throughout the document because they feel they need to keep it in every location it showed up in the original tables. Why? You only need to store it once at the root and that's it.

Also keep in mind that if you have repeating elements, you don't have to dedicate a field to describing what each element is. That's necessary in the SQL world, but there's a more elegant way of storing it in a document. Take addresses, for example. Your SQL table could look something like this:
 
USER_PK ADDRESS_TYPE ADDRESS
61872   HOME         '123 FAKE STREET'
61872   WORK         '456 EXAMPLE LN'


The naive approach to turning this into a document would look something like this:

{
  'Person': {
    'ID': '61872',
    'ADDRESSES':[
    {
      'USER_PK':'61872',
      'ADDRESS_TYPE':'HOME',
      'ADDRESS':'123 FAKE STREET'
    },
    {
      'USER_PK':'61872',
      'ADDRESS_TYPE':'WORK',
      'ADDRESS':'456 EXAMPLE LN'
    }
    ]
  }
}


However, all you're really doing here is storing a relational table in a big document. Don't forget that your element names are allowed to be descriptive here!

{
  'Person': {
    'ID': '61872',
    'HOME_ADDRESS':'123 FAKE STREET',
    'WORK_ADDRESS':'456 EXAMPLE LN'
  }
}


It's a very simple example, so hopefully I'm getting my point across. You'll find you can make your documents much more information-dense if you take some time to sit down and think about how to best represent the data as a document, rather than copying over a bunch of relational tables. As a rule of thumb- just ask yourself how you would want the data to look if it were printed on an actual piece of paper for a human to read and use. That's usually the best direction.
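Here's that flattening step as a small Python sketch (the function name is my own, not any product's API):

```python
def flatten_addresses(person_doc):
    """Turn the row-shaped ADDRESSES array into descriptive fields,
    dropping the USER_PK that's already stored once at the root."""
    person = person_doc["Person"]
    flat = {"ID": person["ID"]}
    for addr in person.get("ADDRESSES", []):
        # The ADDRESS_TYPE value becomes part of the element name itself.
        flat[f"{addr['ADDRESS_TYPE']}_ADDRESS"] = addr["ADDRESS"]
    return {"Person": flat}

naive = {
    "Person": {
        "ID": "61872",
        "ADDRESSES": [
            {"USER_PK": "61872", "ADDRESS_TYPE": "HOME", "ADDRESS": "123 FAKE STREET"},
            {"USER_PK": "61872", "ADDRESS_TYPE": "WORK", "ADDRESS": "456 EXAMPLE LN"},
        ],
    }
}
print(flatten_addresses(naive))
```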

PIZZA.BAT
Nov 12, 2016


:cheers:


Skim Milk posted:

oh jeeze im so sorry. i thought this was YOSPOS. i was phone posting. my mistake

no silly i linked it from yospos

PIZZA.BAT
Nov 12, 2016


:cheers:


Can you give more info on what you mean by ‘jumble of nonsense’?

PIZZA.BAT
Nov 12, 2016


:cheers:


Kudaros posted:

I'm a data scientist coming from an academic background. Good on the machine learning, stats, etc., but not so great with databases. I can query relational databases well enough but I'm thinking now about how to organize enterprise data. This isn't my role, but apparently it is nobody's role at my company, at this point.

There are schemas strewn about in PL/SQL, Postgres, and MSSQL Server- for every line of business, for every variation and merger of the company over the past 30 years, and oftentimes for various clients. It is an unbearable (and undocumented!) mess. I'm not even really sure how to reverse engineer all of it.

Allegedly someone is working on a data lake, but I've no idea what that's realistically going to look like. Is there a general workflow for mashing this all together and curating portions of it for streamlined analytics?

This is what I specialize in and do for a living. The first thing you need to do is determine whether you want a data federation, a data lake, or a data hub. This post goes into what those are and their pros and cons:

https://www.marklogic.com/blog/data-lakes-data-hubs-federation-one-best/

Answering the question of which of those three you need first will help guide which tech and processes you'll need to adopt.

PIZZA.BAT
Nov 12, 2016


:cheers:


Pollyanna posted:

Turns out we didn't need that index at all. The _new_ problem is that somewhere in our massive collection of crap is a string with a null character in it- and I have no idea where. All my attempts at dumping the collection and grepping for \0 have failed. What's the easiest way to search through a collection and find any fields that have a string with a null character in them?

Depends on the tech you’re using. Sounds like you want to generate an index over whatever element/field you want to search and then search it for null values once it’s done

PIZZA.BAT
Nov 12, 2016


:cheers:


so it could be any element anywhere that’s null and that’s what you want to find? don’t want to sound like an evangelist but any time you’re looking for a needle in a haystack that’s when you want marklogic. you could run a wildcard on the element name for a value of null and see what pops out

PIZZA.BAT
Nov 12, 2016


:cheers:


yeah we’re not cheap. if that type of problem is something you frequently run into though you should consider us

PIZZA.BAT
Nov 12, 2016


:cheers:


New kid on the block: https://www.fusiondb.com/

PIZZA.BAT
Nov 12, 2016


:cheers:


abelwingnut posted:

what does this offer exactly?

I have no clue!

PIZZA.BAT
Nov 12, 2016


:cheers:


XQuery is good on MarkLogic, but it also uses its own proprietary compiler, so ¯\_(ツ)_/¯

PIZZA.BAT
Nov 12, 2016


:cheers:


Pollyanna posted:

Okay, so. We have several collections of documents with up to 69 (nice) fields each. A large subset of these fields are either null or hold an array of acceptable values for that given field (e.g. age_of_car has [0, 1, 2] representing the age of the car in years). In our program flow we have a set of data for a single instance of our data type (a potential insurance buyer) that has attributes for all these 69 fields.

We want to find all documents in one of these collections that match to this data type instance, where “match” means that for each document field, the data type either has a value in the field’s array, or the document’s field is null. We accomplish this by constructing a large query for each one of those 69 fields:

code:
{
  "operation"=>"find",
  "database"=>"my_database",
  "collection"=>"my.collection",
  "filter"=>{
    "someField"=>{"$in"=>["?"]},
    "someField"=>{"$in"=>["?"]},
    "someField"=>{"$in"=>["?"]},
    ... one such clause for each of the 69 fields ...
    "someField"=>{"$in"=>["?"]}
  }
}
This seems to work well enough so far in Mongo, but we’re running into performance issues with DocumentDB. That got me thinking about whether this is really the best way to accomplish this. Is there a better solution for this kind of lookup/operation?

Are there any particular issues you’re running into with Mongo that drove you guys to investigate DocumentDB?
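Also- you don't need to hand-write 69 near-identical clauses. A filter in that shape can be generated from the instance's attributes; here's a sketch (the field names and the pymongo-style dict shape are illustrative). Including None in each $in list also covers the 'document field is null' half of your match rule, since $in matches both array membership and literal equality:

```python
# Hypothetical insurance-buyer instance: field names invented for illustration.
buyer = {"age_of_car": 2, "num_drivers": 1, "zip_code": "62704"}

def build_filter(instance):
    # $in matches when the document's array contains the value, or when the
    # field equals one of the listed literals -- including None for null fields.
    return {field: {"$in": [value, None]} for field, value in instance.items()}

query_filter = build_filter(buyer)
print(query_filter["age_of_car"])  # {'$in': [2, None]}
```

With a live pymongo connection this would be passed as the filter argument, e.g. `db["my.collection"].find(query_filter)`.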

PIZZA.BAT
Nov 12, 2016


:cheers:


Yeah this really does sound like your architect is making it more complex for the sake of it.

PIZZA.BAT
Nov 12, 2016


:cheers:


couchbase should be able to handle that just fine

PIZZA.BAT
Nov 12, 2016


:cheers:


eschaton posted:

write a tiny bit of code to import the repetitive, implicit-schema JSON bullshit into a real database, even just SQLite, and do your queries against that

it’ll take you barely any time at all to pull the data in, then you can create some indexes and go to town

Please don’t be that guy itt, thanks

PIZZA.BAT
Nov 12, 2016


:cheers:


Dumb Lowtax posted:

Would I be able to do those queries online (not locally)? How hard would this be compared to just hosting the JSON somewhere, if I am completely unfamiliar with the types of software used for SQL, and don't remember anything at all from my databases class?

Even with those limitations I've otherwise managed to host small websites using mlab (free mongodb host) and storing nothing but json strings in it. But a 5gig file I'm not so sure that works for. At least mlab won't go that big.

Yeah, 5 GB is pushing it if you’re looking for free hosting. The only thing I can think of is DynamoDB, which has a 25 GB free tier, but I also have zero experience with that.

Honestly, given that you’re *just* above the free tiers in demand, you may want to bite the bullet and pay for something. I doubt it will cost you more than a few bucks.

PIZZA.BAT
Nov 12, 2016


:cheers:


I know this shouldn’t be anything new to anyone itt but here’s a pretty thorough takedown of Mongo if anyone needs to talk their management out of sticking their dick in that mousetrap: http://jepsen.io/analyses/mongodb-4.2.6


PIZZA.BAT
Nov 12, 2016


:cheers:


This would be pretty straightforward with NoSQL DBs that have triples. You should be able to get what you're looking for with the $lookup feature, though. I don't know how it works under the hood or its performance implications, but it'll let you write the query as a left outer join.
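For reference, a $lookup stage in a pymongo-style aggregation pipeline looks something like this (collection and field names are invented for illustration):

```python
# MongoDB's $lookup stage performs a left outer join from one collection to
# another. Here "orders" is joined to "customers"; unmatched orders still come
# through, just with an empty "customer_docs" array.
pipeline = [
    {
        "$lookup": {
            "from": "customers",          # the other collection
            "localField": "customer_id",  # field in the "orders" documents
            "foreignField": "_id",        # field in the "customers" documents
            "as": "customer_docs",        # joined docs land in this array
        }
    }
]
# With a live connection you'd run: db.orders.aggregate(pipeline)
```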

PIZZA.BAT fucked around with this message at 17:42 on Dec 30, 2020
