|
I'm kind of surprised we don't have one of these. This thread can be the general catch-all for NoSQL tech discussion and questions. There seems to be a bit of confusion around what NoSQL actually is, so this OP will be a crash course. What IS NoSQL? In a nutshell, NoSQL is any database tech that stores information in any manner besides tables formed of rows and columns linked together by joins. This is a very broad category because, needless to say, there are a lot of ways to store data besides in tables! Such as what? There are four primary flavors of NoSQL that make up most of what's out there: key-value stores, document stores, column-family (wide-column) stores, and graph databases.
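A quick way to feel the difference between the first two flavors: the same record modeled as flat key-value pairs vs. a self-contained document. Plain Python dicts standing in for real stores here — purely illustrative, not any particular product's API:

```python
# Key-value flavor: opaque values behind flat, composite keys.
kv_store = {
    "user:42:name": "Ada",
    "user:42:email": "ada@example.com",
}

# Document flavor: one nested, self-contained record per key.
doc_store = {
    "user:42": {"name": "Ada", "email": "ada@example.com",
                "tags": ["admin", "beta"]},
}

# KV lookups are single-key; documents let you grab the whole aggregate.
assert kv_store["user:42:name"] == "Ada"
assert doc_store["user:42"]["tags"] == ["admin", "beta"]
```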
Additional Reading: Check out NoSQL For Dummies. You can also find digital versions of this for free if you Google around a bit. PIZZA.BAT fucked around with this message at 11:06 on Aug 27, 2019 |
# ? Dec 3, 2018 05:01 |
|
Part of my job a couple years ago involved evaluating a bunch of NoSQL technologies for integration with our current stack, for the better part of a year. One of the candidates on my personal short list was FoundationDB. It's a high-performance key-value store with ACID transactions, and a plugin system they call "layers" that allow you to map more complex models down to the KV store. Frustratingly enough, the company was bought by Apple halfway through our project and taken off the market. Well apparently they've open-sourced it earlier this year: https://github.com/apple/foundationdb, and have just released a document store layer in the last few days. We ended up using Elasticsearch (also open source: https://github.com/elastic/elasticsearch), which is more of a search engine than a document database, but it kind of straddles the line and ended up working okay. The searching capabilities are fantastic, but the default configurations are only hobbyist-friendly, so for production you either need to know what you're doing or pay them for support. rarbatrol fucked around with this message at 00:59 on Dec 5, 2018 |
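The "layers" idea — richer data models mapped down onto a plain key-value store — is easy to sketch. This is a toy document layer in Python over a dict, not FoundationDB's actual API (real layers use an ordered tuple key codec and transactions):

```python
# Toy "document layer" over a key-value store (a plain dict here).
# Keys take the hypothetical form "collection/doc_id/field".
kv = {}  # stand-in for the underlying KV store

def doc_insert(collection, doc_id, doc):
    # Flatten each document field into its own KV pair.
    for field, value in doc.items():
        kv[f"{collection}/{doc_id}/{field}"] = value

def doc_get(collection, doc_id):
    # Reassemble the document with a prefix scan over the keyspace.
    prefix = f"{collection}/{doc_id}/"
    return {k[len(prefix):]: v for k, v in kv.items() if k.startswith(prefix)}

doc_insert("users", "42", {"name": "Ada", "email": "ada@example.com"})
assert doc_get("users", "42") == {"name": "Ada", "email": "ada@example.com"}
```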
# ? Dec 4, 2018 03:22 |
|
rarbatrol posted:Part of my job a couple years ago involved evaluating a bunch of NoSQL technologies for integration with our current stack, for the better part of a year. One of the candidates on my personal short list was FoundationDB. It's a high-performance key-value store with ACID transactions, and a plugin system they call "layers" that allow you to map more complex models down to the KV store. Frustratingly enough, the company was bought by Apple halfway through our project and taken off the market. Well apparently they've open-sourced it earlier this year: https://github.com/apple/foundationdb, and have just released a document store layer in the last few days. Ha, I used to be able to brag about how the only other ACID NoSQL DB on the market was gobbled up and blackholed by Apple. Good to know they're kind of back, though - this will be interesting to play with
|
# ? Dec 4, 2018 03:52 |
|
rarbatrol posted:We ended up using Elasticsearch (also open source: https://github.com/elastic/elasticsearch), which is more of a search engine than a document database, but it kind of straddles the line and ended up working okay. The searching capabilities are fantastic, but the default configurations are only hobbyist-friendly, so for production you either need to know what you're doing or pay them for support. I was going to mention Elasticsearch - Elastic has historically been pretty particular about calling it a "search engine", not a "database", but it's basically a database, especially with recent improvements to reliability. It's also kind of going in a time-series direction for logs and metrics and whatnot. Another interesting NoSQL database (although ACID and sorta-relational) is Datomic by Cognitect, the folks behind Clojure, although it's not open source. It's datalog as a database, with built-in event sourcing so you can do point-in-time queries and treat the database at a particular point in time as an immutable value without holding a transaction open. It's pretty weird; there's nothing else really like it. Even though I linked to it, don't bother looking at the website, it's utterly useless. Just watch this video instead.
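The point-in-time idea is easier to see in miniature. Here's a toy append-only fact store you can query "as of" a transaction — nothing like Datomic's real API, just the shape of the model:

```python
# Toy append-only fact store: tuples of (entity, attribute, value, tx, added?).
# Querying "as of" a tx just filters the log - no locks, no open transaction.
facts = []

def assert_fact(e, a, v, tx):
    facts.append((e, a, v, tx, True))

def retract_fact(e, a, v, tx):
    facts.append((e, a, v, tx, False))

def value_as_of(e, a, tx):
    # Replay history up to and including tx; last write wins.
    current = None
    for fe, fa, fv, ftx, added in facts:
        if fe == e and fa == a and ftx <= tx:
            current = fv if added else None
    return current

assert_fact("user/1", "email", "old@example.com", tx=1)
assert_fact("user/1", "email", "new@example.com", tx=2)
assert value_as_of("user/1", "email", tx=1) == "old@example.com"
assert value_as_of("user/1", "email", tx=2) == "new@example.com"
```

Since the log is immutable, "the database at tx=1" is just a value you can hand around and re-query forever.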
|
# ? Dec 4, 2018 04:24 |
|
I'm glad this thread is here, even though C* and Mongo have given me plenty of headaches over the years. Mongo was horrible with split brain (maybe better after the Jepsen-driven tweaks in 3.4+), and tombstoning in C* is the current bane of my existence. https://jepsen.io/analyses is some good reading on the pitfalls of distributed databases - the ones that have to trade off C against A in CAP - many of which are NoSQL. Don't forget God-tier RocksDB (https://rocksdb.org/) in the KV section. Arguments abound that Kafka is actually a DB as well.
|
# ? Dec 4, 2018 17:33 |
|
rarbatrol posted:Part of my job a couple years ago involved evaluating a bunch of NoSQL technologies for integration with our current stack, for the better part of a year. One of the candidates on my personal short list was FoundationDB. It's a high-performance key-value store with ACID transactions, and a plugin system they call "layers" that allow you to map more complex models down to the KV store. Frustratingly enough, the company was bought by Apple halfway through our project and taken off the market. Well apparently they've open-sourced it earlier this year: https://github.com/apple/foundationdb, and have just released a document store layer in the last few days. The URL parser included the comma after the foundationdb link - might want to fix it if you feel like it. Looks cool, though.
|
# ? Dec 4, 2018 17:41 |
|
Munkeymon posted:The URL parser included the comma after the foundationdb link - might want to fix it if you feel like it. Looks cool, though. Thanks, I've corrected it. While I'm here, one thing I've noticed during my research is that very few databases can actually handle stupid-large data. Elasticsearch can store and index a 10MB document, but it's going to be very, very grumpy about it. Microsoft SQL Server can do something like 2GB in a single cell, but again, you're going to have a bad time. Maybe this is only a real problem in certain industries?
|
# ? Dec 5, 2018 01:29 |
|
My experience with MarkLogic is that it handles documents of pretty much any size well, up to its limit of half a gig. Also, full disclosure for this thread: I work for MarkLogic, so I'm gonna have a bit of a bias itt
|
# ? Dec 5, 2018 01:42 |
|
Closest I've gotten to NoSQL is whacking together Lucene/SOLR search implementations, but I won't lie and say I had a deep understanding of the underlying tech. I'm curious about the Graph stuff; is that related to the Facebook GraphQL stuff that Shopify also uses? I've basically only made request/response harnesses for them but never really set it up on the server side.
|
# ? Dec 5, 2018 20:11 |
|
Scaramouche posted:I'm curious about the Graph stuff; is that related to the Facebook GraphQL stuff that Shopify also uses? From my quick googling, GraphQL seems to be solely an API tool rather than a storage solution. The wiki article on graph DBs is actually pretty thorough and does a good job describing them: https://en.m.wikipedia.org/wiki/Graph_database
|
# ? Dec 5, 2018 20:26 |
|
CosmosDB if you want to pay Microsoft a billion dollars for a hard-fork of Mongo.
|
# ? Dec 5, 2018 21:49 |
|
Scaramouche posted:I'm curious about the Graph stuff; is that related to the Facebook GraphQL stuff that Shopify also uses? Not at all - GraphQL is a query language for APIs, an alternative to REST, except the consumer specifies the data and the shape returned. The backend does all the heavy lifting of determining how to fulfill and transform the response. Graph databases are all about the relationships between nodes. If you wanted to store the business relationships between companies and query against them, that's where you might want a graph database. Imagine how you would model and query against that with an RDBMS and your head will start hurting.
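To make the contrast concrete, here's a toy traversal of company relationships in plain Python - the multi-hop query that graph databases make cheap, and that turns into painful recursive self-joins in an RDBMS. Company names and the edge list are made up:

```python
# Toy edge list: who supplies whom. A graph DB stores these relationships
# first-class and walks them directly.
supplies = {
    "AcmeSteel": ["BoltCo"],
    "BoltCo": ["CarCorp", "BikeCorp"],
    "CarCorp": [],
    "BikeCorp": [],
}

def downstream(company):
    """All companies reachable through the supply chain (breadth-first)."""
    seen, frontier = set(), [company]
    while frontier:
        nxt = []
        for c in frontier:
            for customer in supplies.get(c, []):
                if customer not in seen:
                    seen.add(customer)
                    nxt.append(customer)
        frontier = nxt
    return seen

assert downstream("AcmeSteel") == {"BoltCo", "CarCorp", "BikeCorp"}
```

In a real graph DB this whole function is roughly one query (e.g. a variable-length path match); in SQL it's a recursive CTE you have to hand-tune.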
|
# ? Dec 6, 2018 03:50 |
|
When I worked at a certain well-known broadcasting corporation, we gave MarkLogic a shitload of money. Apparently they did buy some of the team very expensive tequila, though? Kudos Rex-Goliath, it actually works pretty well (though please stop making those cringeworthy videos).
|
# ? Dec 6, 2018 17:13 |
|
Naar posted:When I worked at a certain well-known broadcasting corporation, we gave MarkLogic a shitload of money. Apparently they did buy some of the team very expensive tequila, though? Kudos Rex-Goliath, it actually works pretty well (though please stop making those cringeworthy videos). lol yeah our marketing is all around bad and everyone here knows it. it’s bad enough that every client has told me it sucks straight to my face. ¯\_(ツ)_/¯
|
# ? Dec 6, 2018 19:40 |
|
Does anyone here have experience using RavenDB? The place I'm working at now uses it, and while I wasn't a fan at first, the new version (4.0) seems pretty decent.
|
# ? Dec 6, 2018 19:48 |
|
Amazon just fired a broadside into MongoDB's hull: https://aws.amazon.com/blogs/aws/new-amazon-documentdb-with-mongodb-compatibility-fast-scalable-and-highly-available/ Not only did they make a full document database that largely mimics MongoDB - they straight-up cloned the API. IMO this is really going to hurt Mongo's ability to move forward, as their customer base is going to cement on the APIs that allow them to switch back and forth with Amazon's service. But maybe not.
|
# ? Jan 14, 2019 14:48 |
|
This is exciting stuff, especially if you want to wow people with the term "Big Data". Just thought I'd share: if people are considering or have already decided to use AWS, one thing that an AWS specialist brought up at a boot camp I went to was that, if you have a lot of data, it may be cheaper to dump all your data into an S3 bucket before ingesting it into your actual database (in whatever format you're using). Just use what you need in the database service. Excuse me for not understanding every situation where that will apply, but the main point was that database services are expensive compared to S3.
|
# ? Jan 14, 2019 16:51 |
|
Rex-Goliath posted:Amazon just fired a broadside into MongoDB’s hull: https://aws.amazon.com/blogs/aws/new-amazon-documentdb-with-mongodb-compatibility-fast-scalable-and-highly-available/ From what I've read elsewhere online, it's 1) only kinda-compatible with MongoDB (a lot of things don't work, or don't work the same as on the real MongoDB), even compared to the last OSS Mongo version, and 2) probably based on Postgres' JSONB support and using Aurora on the backend, which will have significantly different performance characteristics, similar to this old, semi-abandoned project. I would avoid using this - I suspect its primary purpose is political games rather than actually being a good product. Full disclosure: I work for a company that competes with (a portion of) AWS, but these are my honest thoughts and I don't think I'd think any differently were I in a different position. Arcsech fucked around with this message at 18:29 on Jan 14, 2019 |
# ? Jan 14, 2019 18:25 |
|
oh hell yes. been hoping this thread would pop up. i have some bigger questions for later, but right now a rather small one. what’s the best front-end option for a local, off-the-network mongodb? have used studio3t, but i’m not sure if there’s better out there. free would be nice too.
|
# ? Jan 30, 2019 17:27 |
|
Anyone have any good reading or class suggestions for data modeling knowledge with Cassandra? It's an area where I have so little experience that almost all of my suggestions or work in that area amounts to trial and error. I understand the basics but when it comes to best practices etc I'm just totally in the dark
|
# ? Jan 30, 2019 18:06 |
|
Abel Wingnut posted:oh hell yes. been hoping this thread would pop up. Razzled posted:Anyone have any good reading or class suggestions for data modeling knowledge with Cassandra? It's an area where I have so little experience that almost all of my suggestions or work in that area amounts to trial and error. I understand the basics but when it comes to best practices etc I'm just totally in the dark Unfortunately I think this thread is gonna be pretty quiet for a while until more people find us. If you guys find answers to your stuff though please come back to fill us in!
|
# ? Feb 3, 2019 00:49 |
|
Is this also the place to ask questions about Hadoop?
|
# ? Feb 3, 2019 09:34 |
|
Mr Shiny Pants posted:Is this also the place to ask questions about Hadoop? Sure!
|
# ? Feb 3, 2019 15:51 |
|
Mr Shiny Pants posted:Is this also the place to ask questions about Hadoop? go ahead, I know a lot about hadoop and kafka thingies.
|
# ? Feb 5, 2019 10:22 |
|
alright, so let me know what you guys think of this. i need to monitor millions of hits/impressions per day--hopefully way more in the not too distant future. at the same time, i want to back the data up and run queries on it. pretty normal request for the data world, but i'm new to nosql, so let me know if this makes sense. given the high frequency of reads and writes we expect to perform on this initial, capturing database, i was thinking of installing a scylla instance. i've heard it's extremely quick, and that's really what i'm after with this database. no querying, and it will never be huge, so space isn't a big concern. the primary purpose is to capture data from thousands of sources, then move it to the next database. so it seems like scylla fits? so the data scylla captures would then be offloaded hourly/quarter-daily/half-daily/daily (need to feel this out) to another scylla, or possibly mongo or cassandra, that would serve as the archive. this would really be about storage, doubling as a backup. for querying, i'd then run nightly scripts to aggregate data from the archive into probably a sql instance. i don't need to get too granular with the queries, it really is more about the like-groups of data. thoughts? too many moving parts? is there a way to consolidate some of these things? basically need speed up front, storage in the middle, and querying power on the side. figured these were the best solutions, and hope to test them out this week. let me know if you have any other potential db systems in mind. like i said, i'm new to this world. i'm really a sql guy through-and-through. e: in an ideal world, all of these dbs would be sql-based. everyone on the team knows the language, and it's tried and true. problem is i don't think it can handle the i/o required for the first capturing db, and i know it can't handle the size the archive will eventually get to. happy to be proven wrong if you have examples, though!
abelwingnut fucked around with this message at 15:21 on Feb 5, 2019 |
# ? Feb 5, 2019 15:08 |
|
Abel Wingnut posted:alright, so let me know what you guys think of this It sounds like you're trying to build a data hub. Essentially, you split your database into two databases, which it seems like you're already doing. Your first database is the 'staging' database, which basically acts as your classical data lake that you dump your data into. You then have scheduled jobs which iterate over that dumped data and harmonize it into your clean product, placing it into your second database, 'final'. Something you want to keep in mind is that with a sql db you're going to have a lot of hurdles to clear any time you want to ingest new data sources or an existing source changes. It's much easier to ingest the data as-is as quickly as possible and then figure it out in your harmonization stage. Likewise, making your final db relational also places a hurdle in front of you any time you decide you want different information out of your harmonization - which is going to happen a lot as new data sources change your understanding of what you want out of your final product. It's much easier to keep your staging and final dbs non-relational and then possibly export that final data under a relational lens.
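A minimal sketch of that staging-to-final split, with in-memory stand-ins and made-up field names: records land in staging in whatever shape the source sends, and a scheduled harmonization pass produces the canonical records in final.

```python
# Toy data hub: 'staging' takes whatever shape each source sends;
# a harmonization pass maps everything into one canonical shape.
staging = [
    {"src": "a", "payload": {"Name": "Ada", "EMAIL": "ada@example.com"}},
    {"src": "b", "payload": {"full_name": "Bob", "email": "bob@example.com"}},
]
final = {}

def harmonize(record):
    p = record["payload"]
    # Each source's quirks are absorbed here, not at ingest time.
    name = p.get("Name") or p.get("full_name")
    email = (p.get("EMAIL") or p.get("email", "")).lower()
    return {"name": name, "email": email}

for rec in staging:
    clean = harmonize(rec)
    final[clean["email"]] = clean

assert final["ada@example.com"]["name"] == "Ada"
assert final["bob@example.com"]["name"] == "Bob"
```

The payoff is that adding a third source is one more branch in `harmonize`, not a schema migration.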
|
# ? Feb 5, 2019 16:34 |
|
I would have suggested Apache Kafka as the transport layer, it's commonly used to deliver log messages to any downstream consumers and a very solid piece of software. From there you can basically persist it to as many different systems as you like.
|
# ? Feb 5, 2019 16:44 |
|
Abel Wingnut posted:alright, so let me know what you guys think of this I've worked on a system like that using Kafka, AWS Athena & S3. We shoved all the events into Kafka, set up some servers to read the events from Kafka and store them in S3 in some common format like JSON, using the date in the key. With minimal effort (setting up some partitioning) you can query all the events in S3 using SQL with Athena. We also hooked up a streaming platform like Flink/Kafka Streams/Storm to run some analytics as events arrived; this is nice for queries like "how many impressions did we get in the last hour" where you don't want to wait on a nightly batch job. It was not the cheapest solution, but it worked pretty well and was easy to set up and operate by just a few people. Using relational db's all the way is probably not a great fit, i'd use a NoSQL db to store the raw events and something like Spark to fill relational databases with weekly/hourly/etc aggregates. Votlook fucked around with this message at 17:06 on Feb 5, 2019 |
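The date-in-the-key trick is just consistent key naming. A sketch (prefix and event ID are made up) of building Hive-style partitioned S3 keys, which lets Athena prune partitions instead of scanning every object:

```python
from datetime import datetime, timezone

def event_key(event_id, ts):
    """Build a Hive-style partitioned S3 key (dt=YYYY-MM-DD) for an event."""
    d = ts.strftime("%Y-%m-%d")
    return f"events/dt={d}/{event_id}.json"

ts = datetime(2019, 1, 14, 16, 51, tzinfo=timezone.utc)
assert event_key("abc123", ts) == "events/dt=2019-01-14/abc123.json"
```

With keys laid out like this, a query filtered on `dt` only touches the matching prefixes.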
# ? Feb 5, 2019 17:00 |
|
I will just point out that Athena can get very costly if you are doing queries over large files. If you need to do real-time-esque processing I'd look into handling all ETL through a Storm topology and then having your NoSQL db be the receiver of that stream of data. I want to say that any of the myriad of analytics layers that can sit on top of Hadoop would be adequate as an archive for data at rest.
|
# ? Feb 6, 2019 08:54 |
|
Razzled posted:I will just point out that Athena can get very costly if you are doing queries over large files. Using Athena was pretty costly, yes, but very convenient for a small team. I really want to check out databases like ClickHouse or Druid sometime, they seem to fit this use case well. Does anyone here have experience with those?
|
# ? Feb 6, 2019 15:49 |
|
thanks for all the recommendations above. i'm actually putting it together right now. we're going with kafka to scylla--seemed like a better option than what we were thinking. problem is i'm having the damnedest time trying to configure kafka, though. for whatever reason zookeeper will just not start as a service. i believe it must have to do with my user rights. here's an example: i'm able to run the script directly when i move to /opt/zookeeper, but i am unable to run the script specifying the script's full location, /opt/zookeeper/bin/zkServer.sh, when my pwd is the root. that suggests to me something's up with my rights. i don't know, i'm entirely new to linux so this is really proving a pain in the rear end. hmm, maybe not. when i try bin/zkServer.sh stop from /opt/zookeeper i get nearly the same error as seen in my first screenshot. pulling my hair out on this abelwingnut fucked around with this message at 04:52 on Feb 20, 2019 |
# ? Feb 20, 2019 04:47 |
|
What do the logs say?
|
# ? Feb 20, 2019 08:13 |
|
well, i couldn't access the logs in the previous try. i ended up starting a new droplet with some better stats on digital ocean, then followed this: https://www.digitalocean.com/community/tutorials/how-to-install-apache-kafka-on-ubuntu-18-04 now it works despite not installing zookeeper. i have no idea how, but the tests are fine. it seems like there are some zookeeper services, but i truly have zero idea how they got there.
|
# ? Feb 21, 2019 12:31 |
|
Abel Wingnut posted:well, i couldn't access the logs in the previous try. i ended up starting a new droplet with some better stats on digital ocean, then followed this: https://www.digitalocean.com/community/tutorials/how-to-install-apache-kafka-on-ubuntu-18-04 One thing to keep in mind is that while Kafka REQUIRES ZooKeeper to run, ZooKeeper does not require Kafka. It is used for a multitude of platforms and can be set up as an independent cluster. For portability purposes, I suspect, Kafka includes bins of the zk server so that you don't need to do this to get the bare minimum Kafka up and running. Most likely your original zk installation was misconfigured or installed in non-default locations. The DigitalOcean docs are good though
|
# ? Feb 21, 2019 19:09 |
|
Kafka also tries to be less and less reliant on Zookeeper. Tbh I don't know exactly what they're still needed for in the later 2.1 versions.
|
# ? Feb 21, 2019 19:29 |
|
Pretty sure Kafka either installed zookeeper as a dependency or came bundled with it when I installed it on Ubuntu 18.04 last spring/summer. I also remember having to use zookeeper host:port for some Kafka commands, and Kafka host:port for others.
|
# ? Feb 21, 2019 23:48 |
|
I am learning about Hadoop, Kafka and Spark to see if they would work for our company, does anyone have general do's and don'ts? They are pretty complex software packages. Some general experiences running them would be awesome. I have just ordered a Raspberry Pi cluster to play around with Kubernetes and the aforementioned software. Inspired by: https://blog.sicara.com/build-own-cloud-kubernetes-raspberry-pi-9e5a98741b49
|
# ? Mar 8, 2019 09:43 |
|
Mr Shiny Pants posted:I am learning about Hadoop, Kafka and Spark to see if they would work for our company, does anyone have general do's and don'ts? They are pretty complex software packages. Some general experiences running them would be awesome. Learn MapReduce, since the concepts are still everywhere in all modern frameworks even if the MapReduce framework itself is outdated. Learn when you need to redistribute your data by key (i.e. data is shuffled around) and what can be done locally vs. distributed. Read Martin Kleppmann's Designing Data-Intensive Applications. Figure out what happens if any individual component in your architecture fails. Avoid Spark. Also feel free to ask here. I have plenty of experience with running those three frameworks.
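The core MapReduce concepts fit in a few lines. A toy word count showing the map, shuffle-by-key, and reduce phases that modern frameworks still reuse under the hood:

```python
from collections import defaultdict

docs = ["kafka spark hadoop", "spark spark kafka"]

# Map: emit (key, value) pairs locally on each worker.
mapped = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle: redistribute so all values for a key land together -
# this is the expensive, network-bound step to minimize.
shuffled = defaultdict(list)
for word, count in mapped:
    shuffled[word].append(count)

# Reduce: combine each key's values independently (embarrassingly parallel).
counts = {word: sum(vals) for word, vals in shuffled.items()}

assert counts == {"kafka": 2, "spark": 3, "hadoop": 1}
```

Knowing which of your operations are map-side (local) vs. shuffle-inducing (by key) is most of the battle in any of these frameworks.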
|
# ? Mar 8, 2019 15:02 |
|
for your sanity's sake, make sure there's no possible way your data can fit in memory before you start farting around with mapping and reducing
|
# ? Mar 8, 2019 18:26 |