PIZZA.BAT
Nov 12, 2016


:cheers:


Pollyanna posted:

Turns out we didn’t need that index at all. The _new_ problem is that somewhere in our massive collection of crap is a string with a null character in it - and I have no idea where. All my attempts at dumping the collection and grepping for \0 have failed. What’s the easiest way to search through a collection and find any fields that have a string with a null character in them?

Depends on the tech you’re using. Sounds like you want to generate an index over whatever element/field you want to search and then search it for null values once it’s done
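
If it’s vanilla Mongo you could also just brute-force it from the shell - an untested sketch, with the collection name as a placeholder:

code:
// walk every document recursively and print the _id of anything
// holding a string with a NUL ("\u0000") in it
function hasNul(v) {
  if (typeof v === "string") return v.indexOf("\u0000") !== -1;
  if (v !== null && typeof v === "object") {
    for (var k in v) { if (hasNul(v[k])) return true; }
  }
  return false;
}
db.getCollection("somecollection").find().forEach(function (doc) {
  if (hasNul(doc)) print(doc._id);
});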

Pollyanna
Mar 5, 2005

Milk's on them.


I have no idea what field it is, I just know there’s a NUL in a string somewhere, and AWS DMS completely dies and refuses to continue processing when it comes across that. If there’s a way to search through an entire document, that’d be great.

PIZZA.BAT
Nov 12, 2016


:cheers:


so it could be any element anywhere that’s null and that’s what you want to find? don’t want to sound like an evangelical but any time you’re looking for a needle in a haystack that’s when you want marklogic. you could run a wildcard in the element name for a value of null and see what pops out

Pollyanna
Mar 5, 2005

Milk's on them.


Oof, we don’t have time or capacity to bring on an enterprise solution. :( I’ll prolly have to bug AWS for help on this or something. If only their error messages were better.

PIZZA.BAT
Nov 12, 2016


:cheers:


yeah we’re not cheap. if that type of problem is something you frequently run into though you should consider us

Pollyanna
Mar 5, 2005

Milk's on them.


Our problems are way more than just technical in nature :suicide:

Pollyanna
Mar 5, 2005

Milk's on them.


So I have a mongodump bson file and I can replicate the “insertion error” by trying to restore with it, but I have no idea how to get mongorestore to tell me what it failed on. It just says “Failed insertion error” and nothing else. Is there a way to get the logging to, for example, output the ID of what it failed to insert?

Pollyanna fucked around with this message at 21:55 on Jul 11, 2019

Pollyanna
Mar 5, 2005

Milk's on them.


Think I found it. Protip: NUL is specified as %00 in URLs. Honestly gently caress it, I don’t wanna try and sort through those, the stakeholder can deal with a backup.
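
In case anyone else hits this: part of why the earlier greps failed is probably that JSON output escapes NUL as \u0000 rather than a literal null byte, so grepping a dump for \0 finds nothing. A sketch of what might have worked (bsondump ships with the mongo tools; filename assumed):

code:
# emit the BSON as extended JSON and grep for the escaped NUL
bsondump somecollection.bson | grep -n '\\u0000'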

Pollyanna
Mar 5, 2005

Milk's on them.


Is there any reason why a lookup on a field would fail to use an index when it exists? Reason I ask is that we have a Mongo database and an AWS DocumentDB database in parity, with the same indexes, and a query on one field in Mongo will use the index, but the same query in DocumentDB will not.

code:
mongo:

> db.getCollection('somecollection').find({
  time_start: {$lt: ISODate("2019-09-19 00:00:00.000Z")}
}).explain()

{
    "queryPlanner" : {
        "plannerVersion" : 1,
        "namespace" : "somedatabase.somecollection",
        "indexFilterSet" : false,
        "parsedQuery" : {
            "time_start" : {
                "$lt" : ISODate("2019-09-19T00:00:00.000Z")
            }
        },
        "winningPlan" : {
            "stage" : "FETCH",
            "inputStage" : {
                "stage" : "IXSCAN",
                "keyPattern" : {
                    "time_start" : 1
                },
                "indexName" : "time_start_1",
                "isMultiKey" : false,
                "multiKeyPaths" : {
                    "time_start" : []
                },
                "isUnique" : false,
                "isSparse" : false,
                "isPartial" : false,
                "indexVersion" : 1,
                "direction" : "forward",
                "indexBounds" : {
                    "time_start" : [ 
                        "(true, new Date(1568851200000))"
                    ]
                }
            }
        },
        "rejectedPlans" : []
    },
    "serverInfo" : {
        "host" : "someip",
        "port" : 27017,
        "version" : "3.4.14",
        "gitVersion" : "fd954412dfc10e4d1e3e2dd4fac040f8b476b268"
    },
    "ok" : 1.0
}

---

docdb:

> db.getCollection('somecollection').find({
  time_start: {$lt: ISODate("2019-09-19 00:00:00.000Z")}
}).explain()

{
    "queryPlanner" : {
        "plannerVersion" : 1,
        "namespace" : "somedatabase.somecollection",
        "winningPlan" : {
            "stage" : "COLLSCAN"
        }
    },
    "serverInfo" : {
        "host" : "someip",
        "port" : 27017,
        "version" : "3.6.0"
    },
    "ok" : 1.0
}
Why would this happen? The indexes are identical, other than being v: 1 in Mongo and v: 2 in DocumentDB. The query is also the same.

Arcsech
Aug 5, 2008

Pollyanna posted:

Is there any reason why a lookup on a field would fail to use an index when it exists? Reason I ask is that we have a Mongo database and an AWS DocumentDB database in parity, with the same indexes, and a query on one field in Mongo will use the index, but the same query in DocumentDB will not.

Why would this happen? The indexes are identical, other than being v: 1 in Mongo and v: 2 in DocumentDB. The query is also the same.

DocumentDB is backed by a different storage engine than MongoDB, and likely has different limitations around indices. Because DocumentDB is proprietary, you'll have to ask Amazon to know for sure.

Pollyanna
Mar 5, 2005

Milk's on them.


Arcsech posted:

DocumentDB is backed by a different storage engine than MongoDB, and likely has different limitations around indices. Because DocumentDB is proprietary, you'll have to ask Amazon to know for sure.

:suicide: gently caress. Alright, I’ll have to contact them. I’ll see if I can’t figure out a workaround in the meantime.

EDIT: As far as I can tell, it simply won't use an index if it has to do anything other than a straight-up equality comparison. So, we can't do less-than queries in a performant manner. Eugh.
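
One thing I still want to try (sketch - no idea whether DocumentDB even honors hints): forcing the index by name and seeing what explain() says.

code:
// if the engine ignores or rejects the hint, the limitation is in
// DocumentDB itself rather than in the query planner's choice
db.getCollection('somecollection').find({
  time_start: {$lt: ISODate("2019-09-19T00:00:00.000Z")}
}).hint("time_start_1").explain()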

Pollyanna fucked around with this message at 01:32 on Sep 20, 2019

Plorkyeran
Mar 22, 2007

To Escape The Shackles Of The Old Forums, We Must Reject The Tribal Negativity He Endorsed

Arcsech posted:

DocumentDB is backed by a different storage engine than MongoDB, and likely has different limitations around indices. Because DocumentDB is proprietary, you'll have to ask Amazon to know for sure.

“Backed by a different storage engine” is a significant understatement. It’s an entirely different thing that just implements the same API.

Star War Sex Parrot
Oct 2, 2003

Pollyanna posted:

EDIT: As far as I can tell, it simply won't use an index if it has to do anything other than a straight-up equality comparison. So, we can't do less-than queries in a performant manner. Eugh.

Can you define index types? It sounds like it’s using a hash index instead of a range index like a B+ tree, but I’ve never used DocumentDB.
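
For reference, in stock Mongo the distinction is explicit when you create the index (no idea what DocumentDB does under the hood - treat this as a sketch):

code:
// B-tree index: supports range predicates like $lt/$gt as well as equality
db.getCollection("somecollection").createIndex({ time_start: 1 })
// hashed index: equality lookups only, no range scans
db.getCollection("somecollection").createIndex({ time_start: "hashed" })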

Pollyanna
Mar 5, 2005

Milk's on them.


Star War Sex Parrot posted:

Can you define index types? It sounds like it’s using a hash index instead of a range index like a B+ tree, but I’ve never used DocumentDB.

No - DocumentDB only supports Single Field, Compound, and Multikey indexes, and the latter two are implemented through a workaround of some sort - see the Indexes section of https://docs.aws.amazon.com/en_pv/documentdb/latest/developerguide/mongo-apis.html.

Now that I’ve slept on it, I’m less pissed, but this is still frustrating and confusing. We’re going to deprioritize this query weirdness and make sure all the other queries are still okay before tackling this one.

Pollyanna fucked around with this message at 15:39 on Sep 20, 2019

Pollyanna
Mar 5, 2005

Milk's on them.


Alright, I think I’m just missing some understanding on how this poo poo works. Lemme start over.

I have a collection of documents, about 27 million large. Each document has a time_start field and a time_end field, both dates.

We want to query for the following:

1. time_start is less than a given datetime, AND

2a. time_end is greater than another given datetime, OR

2b. time_end is not present in the given document

How should I define the index on these documents given that I want to make this query? As I understand it, I would need a compound index on time_start and time_end, since I’m searching for them at the same time. Basically the following index:

code:
{
  "time_start": 1,
  "time_end": 1
}
However, using this index on a 600k-document subset of that collection still takes about 10 minutes to return a cursor. This clearly isn’t scaling well, and it isn’t particularly fast to begin with.

Am I using the wrong index? Are there tweaks I need to make? Or does this query genuinely just take that long?
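
For concreteness, the query looks roughly like this (dates are placeholders):

code:
db.getCollection("somecollection").find({
  time_start: { $lt: ISODate("2019-09-19T00:00:00.000Z") },
  $or: [
    { time_end: { $gt: ISODate("2019-09-01T00:00:00.000Z") } },
    { time_end: { $exists: false } }
  ]
}).explain("executionStats")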

Edit: some more background info. The other queries we do on this collection involve matches against some other fields (also indexed), sorting by time_start descending, sorting by time_end both ascending and descending, and in at least one case sorting by one of the other fields descending. We also do range queries on time_start being between two different dates.

Pollyanna fucked around with this message at 18:08 on Sep 23, 2019

PIZZA.BAT
Nov 12, 2016


:cheers:


New kid on the block: https://www.fusiondb.com/

Pollyanna
Mar 5, 2005

Milk's on them.


Oh, that was a fun game. Needed more polishing though.

Star War Sex Parrot
Oct 2, 2003

There's almost nothing there, but this dude seems to be a frequent contributor to eXist-db.

duck monster
Dec 15, 2004


Somebody shoot their web guy. A paragraph of meaningless marketing babble, aaaand six paragraphs about the lovely license. Nobody gives a gently caress about the dumb restrictive license, and the fact that it can query json is meaningless. *ALL* currently existing databases I'm aware of can do that. The page doesn't give me a single reason to want to use it, and a few reasons why I wouldn't.

duck monster fucked around with this message at 02:29 on Oct 8, 2019

abelwingnut
Dec 23, 2002



what does this offer exactly?

PIZZA.BAT
Nov 12, 2016


:cheers:


abelwingnut posted:

what does this offer exactly?

I have no clue!

ConanThe3rd
Mar 27, 2009
So it's a 15th competing standard then?

Arcsech
Aug 5, 2008

duck monster posted:

Somebody shoot their web guy. A paragraph of meaningless marketing babble, aaaand six paragraphs about the lovely license. Nobody gives a gently caress about the dumb restrictive license, and the fact that it can query json is meaningless. *ALL* currently existing databases I'm aware of can do that. The page doesn't give me a single reason to want to use it, and a few reasons why I wouldnt.

But it's "100% eXist-db API Compatible"!

Which might mean something if anyone had a clue what "eXist-db" was.

Edit:

Wikipedia posted:

eXist-db provides XQuery and XSLT as its query and application programming languages

:barf:

Arcsech fucked around with this message at 17:00 on Oct 8, 2019

PIZZA.BAT
Nov 12, 2016


:cheers:


XQuery is good on MarkLogic but it also uses its own proprietary compiler so ¯\_(ツ)_/¯

Pollyanna
Mar 5, 2005

Milk's on them.


Putting in a 👎 and a 🖕 for AWS DocumentDB. Performance has been way worse than our own Mongo solution, and it can barely handle any query more complex than matching on a single field. Its performance is also wildly inconsistent: we’ve had performance tests/comparisons range from somewhat worse to 4x worse. Do not use.

I’ll post again later with details on our use case, to avoid the X-Y problem.

Pollyanna
Mar 5, 2005

Milk's on them.


Okay, so. We have several collections of documents with up to 69 (nice) fields each. A large subset of these fields are either null or hold an array of acceptable values for that given field (e.g. age_of_car has [0, 1, 2] representing the age of the car in years). In our program flow we have a set of data for a single instance of our data type (a potential insurance buyer) that has attributes for all these 69 fields.

We want to find all documents in one of these collections that match to this data type instance, where “match” means that for each document field, the data type either has a value in the field’s array, or the document’s field is null. We accomplish this by constructing one large query with a clause for each of those 69 fields:

code:
{
  "operation"=>"find",
  "database"=>"my_database",
  "collection"=>"my.collection",
  "filter"=>{
    "someField"=>{"$in"=>["?"]},
    "someField"=>{"$in"=>["?"]},
    "someField"=>{"$in"=>["?"]},
    "someField"=>{"$in"=>["?"]},
    "someField"=>{"$in"=>["?"]},
    "someField"=>{"$in"=>["?"]},
    "someField"=>{"$in"=>["?"]},
    "someField"=>{"$in"=>["?"]},
    "someField"=>{"$in"=>["?"]},
    "someField"=>{"$in"=>["?"]},
    "someField"=>{"$in"=>["?"]},
    "someField"=>{"$in"=>["?"]},
    "someField"=>{"$in"=>["?"]},
    "someField"=>{"$in"=>["?"]},
    "someField"=>{"$in"=>["?"]},
    "someField"=>{"$in"=>["?"]},
    "someField"=>{"$in"=>["?"]},
    "someField"=>{"$in"=>["?"]},
    "someField"=>{"$in"=>["?"]},
    "someField"=>{"$in"=>["?"]},
    "someField"=>{"$in"=>["?"]},
    "someField"=>{"$in"=>["?"]},
    "someField"=>{"$in"=>["?"]},
    "someField"=>{"$in"=>["?"]},
    "someField"=>{"$in"=>["?"]},
    "someField"=>{"$in"=>["?"]},
    "someField"=>{"$in"=>["?"]},
    "someField"=>{"$in"=>["?"]},
    "someField"=>{"$in"=>["?"]},
    "someField"=>{"$in"=>["?"]},
    "someField"=>{"$in"=>["?"]},
    "someField"=>{"$in"=>["?"]},
    "someField"=>{"$in"=>["?"]},
    "someField"=>{"$in"=>["?"]},
    "someField"=>{"$in"=>["?"]},
    "someField"=>{"$in"=>["?"]},
    "someField"=>{"$in"=>["?"]},
    "someField"=>{"$in"=>["?"]},
    "someField"=>{"$in"=>["?"]},
    "someField"=>{"$in"=>["?"]},
    "someField"=>{"$in"=>["?"]},
    "someField"=>{"$in"=>["?"]},
    "someField"=>{"$in"=>["?"]},
    "someField"=>{"$in"=>["?"]},
    "someField"=>{"$in"=>["?"]},
    "someField"=>{"$in"=>["?"]},
    "someField"=>{"$in"=>["?"]},
    "someField"=>{"$in"=>["?"]},
    "someField"=>{"$in"=>["?"]},
    "someField"=>{"$in"=>["?"]},
    "someField"=>{"$in"=>["?"]},
    "someField"=>{"$in"=>["?"]},
    "someField"=>{"$in"=>["?"]},
    "someField"=>{"$in"=>["?"]},
    "someField"=>{"$in"=>["?"]},
    "someField"=>{"$in"=>["?"]},
    "someField"=>{"$in"=>["?"]},
    "someField"=>{"$in"=>["?"]},
    "someField"=>{"$in"=>["?"]},
    "someField"=>{"$in"=>["?"]},
    "someField"=>{"$in"=>["?"]},
    "someField"=>{"$in"=>["?"]},
    "someField"=>{"$in"=>["?"]},
    "someField"=>{"$in"=>["?"]},
    "someField"=>{"$in"=>["?"]},
    "someField"=>{"$in"=>["?"]},
    "someField"=>{"$in"=>["?"]},
    "someField"=>{"$in"=>["?"]},
    "someField"=>{"$in"=>["?"]}
  }
}
This seems to work well enough so far in Mongo, but we’re running into performance issues with DocumentDB. That got me thinking about whether this is really the best way to accomplish this. Is there a better solution for this kind of lookup/operation?
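
To be clear about the placeholders: each clause presumably expands to something like the following, since $in with a null in the list covers both “value is in the doc’s array” and “field is null”:

code:
# hypothetical field with a real value; nil rides along to match null fields
"age_of_car"=>{"$in"=>[2, nil]}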

PIZZA.BAT
Nov 12, 2016


:cheers:


Pollyanna posted:

Okay, so. We have several collections of documents with up to 69 (nice) fields each. A large subset of these fields are either null or hold an array of acceptable values for that given field (e.g. age_of_car has [0, 1, 2] representing the age of the car in years). In our program flow we have a set of data for a single instance of our data type (a potential insurance buyer) that has attributes for all these 69 fields.

We want to find all documents in one of these collections that match to this data type instance, where “match” means that for each document field, the data type either has a value in the field’s array, or the document’s field is null. We accomplish this by constructing one large query with a clause for each of those 69 fields.

This seems to work well enough so far in Mongo, but we’re running into performance issues with DocumentDB. That got me thinking about whether this is really the best way to accomplish this. Is there a better solution for this kind of lookup/operation?

Are there any particular issues you’re running into with Mongo that drove you guys to investigate DocumentDB?

Pollyanna
Mar 5, 2005

Milk's on them.


PIZZA.BAT posted:

Are there any particular issues you’re running into with Mongo that drove you guys to investigate DocumentDB?

We have to manage it ourselves, which has led to (nonspecific problems in the past that I wasn’t around for but that currently don’t seem to be a problem). Plus, nobody but us uses Mongo, the only Mongo expert we had left last year, and DocumentDB has people devoted to it, right? :shepface:

Other than that, nothing that DocumentDB would solve. It’s all problems with our overall architecture/solution (latency issues, lack of mongo knowledge, the two databases going out of sync - we mirror our postgres DB in Mongo because ?????????).

I personally think our time is better spent removing a second database from the picture entirely, but I’m not the manager here :shrug:

tl;dr: we don’t wanna manage our own boxes but Atlas is too expensive.

Progressive JPEG
Feb 19, 2003

TBH it doesn't feel like switching from one thing that nobody at the company knows about to a proprietary thing that nobody at the company knows about is going to fix the issue.

There are valid reasons to mirror a DB if the data needs to be represented or queried in a way that one of the DBs isn't capable of. That might have been the original reason for copying into MongoDB. But it's much more likely to have been done for resume padding reasons, and if they left it sounds like it worked!

From what it sounds like, I'd try to consolidate into Postgres unless there's something very specific that Postgres can't accomplish (even with a plugin or something), since I imagine that's easier to hire for if nothing else. Sorry nosql thread.

PIZZA.BAT
Nov 12, 2016


:cheers:


Yeah this really does sound like your architect is making it more complex for the sake of it.

Pollyanna
Mar 5, 2005

Milk's on them.


What architect :v:

Arcsech
Aug 5, 2008

Pollyanna posted:

Putting in a 👎 and a 🖕 for AWS DocumentDB. Performance has been way worse than our own Mongo solution, and it can barely handle any query more complex than matching on a single field. Its performance is also wildly inconsistent: we’ve had performance tests/comparisons range from somewhat worse to 4x worse. Do not use.

I’ll post again later with details on our use case, to avoid the X-Y problem.

I am shocked, shocked I say, that a thing Amazon built in a rush solely as a "gently caress you" in response to a company changing their license so AWS couldn't rip off their poo poo for free is bad.

Just shocked.

Star War Sex Parrot
Oct 2, 2003

Cross-posting my response to Pollyanna's situation because lol

Star War Sex Parrot posted:

so you're duplicating the data from postgres so you can do filters on a nosql system built on top of postgres? nice

Pollyanna
Mar 5, 2005

Milk's on them.


Pollyanna posted:

i hate it and it needs to die

Jaded Burnout
Jul 10, 2004


Hello folks. I'm putting together a small app whose data is very much a graph, a "simple directed graph" to be precise. You mentioned Neo4j might be the current forerunner upthread?

I don't particularly care about, well, anything, really, but I'd like to optimise for developer expediency and clarity.

The frontend is React, and I would've thought Facebook would have some good support for working with graph data, but apparently there's nothing obvious; even GraphQL isn't really all that graphy. Backend is Rails. Current main DB is postgresql. Hosting is Heroku.

Pollyanna
Mar 5, 2005

Milk's on them.


Out of curiosity, what kind of data is this? What does it represent? I was always unsure of what non-relational data is or looked like.

Jaded Burnout
Jul 10, 2004


Pollyanna posted:

Out of curiosity, what kind of data is this? What does it represent? I was always unsure of what non-relational data is or looked like.

A dependency graph among a set of tasks, with tasks depending on one another arbitrarily. The only rule being no loops. So I suppose in that case it's a DAG, but not just a tree or set of trees, since there can be more than one task that depends on a single other task.

After some more poking around I'm going to try neo4j with neo4jrb.

Happy Thread
Jul 10, 2005

by Fluffdaddy
Plaster Town Cop

What's an easy free way to host a 5 gig JSON file for one person at a time to query? It's just the Yelp Academic Dataset for a personal demo for a school project.

I'm looking at some free mongodb hosts like MongoDB and Heroku but their specs just list RAM sizes (all less than a gig at the free tier) and don't seem to come with, like, a hard drive to pull a big file from.

redleader
Aug 18, 2005

Engage according to operational parameters

Dumb Lowtax posted:

What's an easy free way to host a 5 gig JSON file for one person at a time to query? It's just the Yelp Academic Dataset for a personal demo for a school project.

I'm looking at some free mongodb hosts like MongoDB and Heroku but their specs just list RAM sizes (all less than a gig at the free tier) and don't seem to come with, like, a hard drive to pull a big file from.

does it need to be json? how big would it be if you shredded it into a more useful/queryable form?
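
And if it stays JSON: a local mongod plus mongoimport costs nothing and handles newline-delimited files like the Yelp dumps - a sketch, with the filename assumed:

code:
mongoimport --db yelp --collection businesses --file yelp_academic_dataset_business.json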

Munkeymon
Aug 14, 2003

Motherfucker's got an
armor-piercing crowbar! Rigoddamndicu𝜆ous.



Both AWS and Azure will give you $200 worth of services for free per year on their educational plans, which I'm assuming just means to every .edu email address. $200 should buy you plenty of S3/blob storage for a class to bang on.

e: oh, duh, you'd probably want to make that queryable. IDK how much NoSQL that'd buy you on either platform but I'm sure It Depends™

Munkeymon fucked around with this message at 15:32 on Feb 12, 2020
