npe
Oct 15, 2004

leecming posted:


Also, does MySQL support REGEXP groupings? I can't seem to find a way to do so. I want to do an ORDER BY (Title REGEXP blah) where the first word is dropped if it is "A", "An" or "The".

If you can get your regex working, you should be able to fake this with an inline view: take the query you have above, add a column to your SELECT that uses functions to strip out the leading word, and then wrap the whole thing in another SELECT * FROM ( ...nasty regex query here... ) ORDER BY fixed_title.
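
Something like this might work (totally untested, and I'm making up a titles table; it cheats with LIKE/SUBSTRING instead of regex, since MySQL's plain string functions will get you there):

code:
SELECT *
  FROM (SELECT t.*,
               -- strip a leading "A ", "An " or "The " for sorting purposes
               CASE
                 WHEN Title LIKE 'The %' THEN SUBSTRING(Title, 5)
                 WHEN Title LIKE 'An %'  THEN SUBSTRING(Title, 4)
                 WHEN Title LIKE 'A %'   THEN SUBSTRING(Title, 3)
                 ELSE Title
               END AS fixed_title
          FROM titles t) x
 ORDER BY fixed_title;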

I should point out that all of this is horribly expensive for the database; it will be doing massive table scans to get what you want. I hope there aren't a lot of records (or a lot of requests).


npe
Oct 15, 2004

Dakha posted:

Is there some simple way we can rework things so that the extraction can be done in a reasonable amount of time? Some of the more complicated queries are taking FAR too long to run

We have a similar situation, and we've been experimenting to find the best option for performance. We're an Oracle shop, so I don't know how much this will help in your case, but we have to search for millions of records and then order the output very specifically. Selecting with IN lists was sucking. What we do now is bulk load the list of values into a temporary table, create an index, and then simply query with a join against this table.

Since we are loading into a temporary scratch table (that we create on the fly, immediately) we are able to use Oracle sqlldr with direct-path loading, which is ridiculously fast (basically it's a bulk write straight onto disk). This means that the short amount of time we lose doing the table/index create is more than made up for by the time saved in being able to query out arbitrarily large sets of values.
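
Roughly, the shape of it looks like this (names are made up, and the sqlldr step obviously happens outside SQL*Plus):

code:
-- scratch table for the list of values we want to search on
CREATE TABLE scratch_ids (id NUMBER);

-- bulk load the ids with direct path, e.g.
--   sqlldr userid=user/pass control=scratch_ids.ctl direct=true

CREATE INDEX scratch_ids_ix ON scratch_ids (id);

-- now just join instead of using a giant IN list
SELECT b.*
  FROM big_table b
  JOIN scratch_ids s ON s.id = b.id
 ORDER BY b.some_column;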

npe
Oct 15, 2004

Victor posted:

Maybe there is no analog on Oracle, but in SQL Server, one would just create a one-column table with the clustered index on that column; no additional index would be needed. (Clustered indexes are simply indexes that define how the rows are physically ordered.)

Well, I'm simplifying things a bit here. We generally have several columns, with multiple indexes due to the nature of our work. My only point was that even with the cost of building the indexes, we still came out ahead in terms of overall execution time with this approach.

npe
Oct 15, 2004
Use double quotes; just beware that when you do this, Oracle suddenly becomes case-sensitive:

code:
SQL> create table dumb (
  2   "ABC/123" number(5),
  3   "xyz/789" number(5)
  4  );

Table created.

SQL> select "ABC/123" from dumb;

no rows selected

SQL> select "XYZ/789" from dumb;
select "XYZ/789" from dumb
       *
ERROR at line 1:
ORA-00904: "XYZ/789": invalid identifier


SQL> select "xyz/789" from dumb;

no rows selected

npe
Oct 15, 2004
Commas are syntactically equivalent to inner join syntax, so don't get hung up on it. Have you tried running EXPLAIN on your query to see where the cost is?

Complicated-looking queries are not necessarily problematic; you'll want to learn to trust EXPLAIN.

npe
Oct 15, 2004
The cheap way around this is to just use max() or min() to get the one you want. Like:

code:
SELECT r.object, max(r.parent), r.version
  FROM relationships r
  WHERE r.area = 1
  GROUP BY r.object, r.version;
This will work as long as you only have the single ungrouped column you want data from; if you have multiple, you're going to need an inline view and a join back to the ungrouped table, roughly as sketched below.
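
If you do need more than one ungrouped column, the inline-view version would look roughly like this (the extra column name is made up, and you may still get duplicates if several rows tie on the max):

code:
SELECT r.object, r.parent, r.version, r.other_column
  FROM relationships r
  JOIN (SELECT object, version, MAX(parent) AS parent
          FROM relationships
         WHERE area = 1
         GROUP BY object, version) m
    ON m.object  = r.object
   AND m.version = r.version
   AND m.parent  = r.parent
 WHERE r.area = 1;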

npe
Oct 15, 2004
You mean like

code:
SELECT r.object, r.parent, r.version
  FROM relations r
  JOIN infopool i
    ON r.parent = i.id
 WHERE i.type = 201
?

Sounds like you just need a primer on basic SQL joins, there has to be a million of them out there.

Edit: please don't paypal me anything

npe
Oct 15, 2004

Land Mime posted:

This seems...wrong somehow, but creating an index on `timestamp` will make your query return the correct result. At least on MySQL 5.

edit: I mean I get why it works, it just seems to me that an index shouldn't affect the result of a query.

The reason it seems wrong is that MySQL is silently letting you get away with doing something you shouldn't be able to do, and that is wrong. When you refer to a non-grouped column anywhere in your query, what are you really asking for? For example...

code:
 SELECT a,b  FROM foo GROUP BY a
...this will give you rows with identical values of `a` grouped together, but what about b? There could be multiple rows, so which one is it supposed to give you? This is what aggregate functions like max() and min() are for: they let you specify which value you want.
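
If all you really want is one b per a (say, the biggest one), make the choice explicit with an aggregate:

code:
 SELECT a, MAX(b) AS b FROM foo GROUP BY a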

Other databases, like Oracle, will actually throw an error if you try to do this; you are required to have every column either in your GROUP BY or wrapped in an aggregate function. MySQL just gives you "whichever", and I don't know if it's even well-defined. The fact that adding an index altered your result should be a good indication not to trust it long term.

npe
Oct 15, 2004
You really need to ask yourself if the EAV model is necessary. I work professionally on a gigantic EAV-modeled database, and I'll tell you that while it is sometimes necessary due to application requirements, it's never easy to deal with. The headache you're dealing with now is only the beginning of your pain.

npe
Oct 15, 2004

Triple Tech posted:

1) When should business logic be written at the database level versus the programming language level? Or is it something like "always in the DB unless you can't"?

This is a major pissing match everywhere you go. There are two camps; I like to call them "innies" and "outies". People who love ORM layers (in particular the RoR people) are big-time outies, and go so far as to insist you don't need things like foreign key constraints.

Personally I think it depends on the app. The database we have at work is an application unto itself, so embedding the relationships of the data via constraints and triggers makes sense, because there is no guarantee that any particular application code will be involved when changes are made.

quote:

2) My coworker just told me of this technique I've never heard before, so I wanted to know how commonplace it is. When importing a set of data, like many rows, you create a temporary table that's the same as the table you want to insert the stuff into, then you use some crazy SQL command to import the stuff en masse, and then delete the temporary table. True? What crazy command is this? And wouldn't you just loop over the stuff and INSERT each row/item?

INSERT ... SELECT ... is the normal way to do this. In general, though, it's slower, because you are doubling the number of inserts overall (and doubling the number of transactions that have to be recorded in the log). The reason you would want to do this is if you want to transform the data in bulk before you do the final insert. Oracle provides "global temporary tables" for exactly this, which are sort of like session-scoped table data structures.
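
For reference, an Oracle global temporary table looks roughly like this (column definitions made up):

code:
-- rows are private to the session and disappear on commit
CREATE GLOBAL TEMPORARY TABLE staging_rows (
  foo  NUMBER(5),
  bar  VARCHAR2(16)
) ON COMMIT DELETE ROWS;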

npe
Oct 15, 2004
This is Oracle, so your database of choice may differ... but:

code:
SQL> describe temp_table;
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 FOO                                                NUMBER(5)
 BAR                                                VARCHAR2(16)

SQL> describe real_table;
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 FOO                                                NUMBER(5)
 BAR                                                VARCHAR2(16)

SQL> select * from temp_table;

       FOO BAR
---------- ----------------
         1 row one
         2 row two

SQL> insert into real_table (foo, bar) select foo, bar from temp_table;

2 rows created.

npe
Oct 15, 2004

Victor posted:

This technique can be extremely performant (yes, that's becoming a word, deal with it grammar Nazis). I can bulk insert tens of thousands of rows per second into a single-socket, quad-core SQL Server. If we had a gigabit network, I wouldn't be surprised if I could get it up to hundreds of thousands of rows per second. In contrast, inserting rows one by one is much slower.

Well sure, once you already have the data in the temp table this is true. The most common case, however, is that you are trying to insert new data, in which case you are probably not getting a benefit, because you still have to get the data into the temp table somehow, either through row-by-row inserts or some bulk loading process (which could be done against the actual table anyway). Overall, inserting the data directly will be faster if you don't have to transform it first.

The exception I know of offhand is something Oracle has called an external table. This is where you take a text file and drop it on the local disk of the database server, and then create a table using special syntax so that the file is read directly as a table. Then you can INSERT ... SELECT from this table directly, which essentially avoids the original insert operations (since it's doing a direct read from the file). However, this approach has its own problems, not the least of which is figuring out a good method for dropping off files securely on your database server.
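
A rough sketch of the external table approach, with made-up directory/file names:

code:
CREATE DIRECTORY load_dir AS '/data/incoming';

CREATE TABLE ext_rows (
  foo  NUMBER(5),
  bar  VARCHAR2(16)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY load_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    FIELDS TERMINATED BY ','
  )
  LOCATION ('rows.csv')
);

-- the file is now queryable like any other table
INSERT INTO real_table (foo, bar) SELECT foo, bar FROM ext_rows;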

npe
Oct 15, 2004

Victor posted:

If you don't need to do any keying (transforming a string such as "Mammography" to the ID of the row in some "dictionary" table that contains that code), then yes, you can insert directly into the permanent table. However, if you need to do transforms, and the permanent table has indexes on columns needing transforms, it will most definitely be faster to use a temporary table. Sometimes one must use a temporary table: what if the permanent table has a column for the ID for the code but not the code itself? (E.g. procedure_fk, which points to a Procedure record (id: 1, description: "Mammography"). Yes, I've been looking at a lot of hospital data lately.)

I think we're saying the exact same thing... it's not faster if you just want to load the data as-is, but if you need to transform the data first it can help.

npe
Oct 15, 2004
It's true that constraints of any sort make deletes more costly, but deletes are costly in the first place and should be avoided as a matter of design if possible. I would say start with FKs where they should go, and consider dropping them only in cases where you absolutely need to in order to maintain performance.

npe
Oct 15, 2004

Triple Tech posted:


Do you template your SQL?

Every time I see SQL intertwined with code (Perl in this case), I can't help but think... God. That is so ugly. Something is wrong... But I'm not sure what.

Then it dawned on me on the train... The concern can be separated out with a templating engine. SQL, like HTML, is hardly ever useful in a 100% static form. Requests and desires are generated on the fly. HTML is easily handled by something like Mason (templating for Perl). So, shouldn't SQL be easily handled as well in a Mason context?

The idea seems far too radical, at least for me. In all four of my SQL-aware jobs, not one of them separated the SQL out into different files. I feel like I'm really breaking ground with this one.*

What do you guys do, do you separate out your SQL like you separate out your HTML? Or does it just get jammed right into the code?

* Not really, I'm just being hyperbolic.

This is exactly what we do, but it's a home-grown system. The idea is that I can write a query with bind variables, and put it in a sql template file. It looks something like

code:

 [lookupSomeValue]

  SELECT blah 
    FROM table 
   WHERE column = :val

Then I can call $db->lookupSomeValue(-val => $whatever) directly, and it manages preparing and caching the handle and everything. This works great because we have full-time Oracle developers working here who like to be able to hunt through our queries and tweak their plans, and this way they can do so by looking at pretty much straight SQL.

For things that can't be done with bind vars (like IN lists and subqueries), we also have more typical templating... such as

code:

 [dynamicQuery]

  SELECT blah
    FROM table
   WHERE column IN ({list})

Then I call $db->dynamicQuery(-list => join ',', @values) or whatever. The newest version even lets you nest query templates inside other templates, so you can nest one query as a subquery in another template, and then "push" multiple instances together when you call the method.

Overall it's been extremely successful for us. The result is we don't end up with raw SQL mixed into any perl code, ever. It's pretty nice.

npe
Oct 15, 2004
You can't do that with parameterized queries on any RDBMS I've ever used. You will have to prepare a new query with the new table name if you want to change it. Now, whether the abstraction you are using has this built in or not, I don't know, but under the hood it would certainly be preparing a new query for you, as opposed to just binding in parameters.
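
To illustrate (this isn't tied to any particular driver):

code:
-- fine: values can be bound at execute time
SELECT * FROM jobs WHERE jobid = :id

-- not possible: identifiers (table/column names) can't be bind variables,
-- so changing the table means preparing a new statement
SELECT * FROM :table_name WHERE jobid = :id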

npe
Oct 15, 2004

fletcher posted:

I understand the idea of separating logic in a model-view-control system, but I don't understand the concept of separating the SQL with a templating system. I'm working in PHP, where can I find more info about how and why to do it?

I wouldn't sweat SQL templating; it's not a very standard approach, and if you don't see the benefit then don't worry about it. For web frameworks, ORM classes seem to be more popular anyway.

npe
Oct 15, 2004
Ok, let me clarify why we do it; maybe it will help.

In our particular case, we have an application backed by an extremely complex schema with quite a lot of rows (tables in the billions of records). We also have a team of developers who focus on writing Oracle PL/SQL. In the same way that you might hire an HTML/CSS guy to do design and layout, we've hired developers who worry about the performance of the myriad queries we rely on.

Because of this we can't use an ORM approach; query tuning on this scale is extremely finicky. For applications where all you do is CRUD operations, ORMs are fine, but that's not us.

For insert/update operations we naturally use stored procedures, so those aren't a problem, and for the more common complex queries they maintain views for us to use. But overall, there are probably hundreds of types of queries in use by various facets of our application, and encoding all of them into procedures or views would pose a different set of problems.

This originally meant that for the static queries we pretty much just hardcoded them into the application source directly, and for dynamic queries we'd end up doing custom query building with string concatenation.

However, this meant working with the Oracle guys was a huge pain, because our queries were embedded in our application code (which was mostly Perl). For the static queries it was more of a nuisance, but for dynamically generated queries it was a disaster. What would happen is an Oracle guy would see an inefficient query running and send us a revised query that could replace it. But the revised query would involve completely restructuring the query-building function, and usually we'd avoid doing that and just put up with crappy query performance when we could.

With templated SQL, it's very much like templated HTML. We can send the .sql files to the Oracle guys, and they can run the static queries through their query analyzers, and with a little extra effort they can pretty clearly do the same for the dynamic queries. They can now do this without having to worry at all about the application code itself.

Again, it's sort of specific to a situation where separation of application logic and database queries is important.

npe fucked around with this message at 02:21 on Jan 30, 2008

npe
Oct 15, 2004
I do know that in Oracle, the OVER clause on any analytic function serves to specify the ordering (and optionally the partitioning) of the rows the function is performed over. So yes, you are asking for the row_number according to the ordering in your OVER() clause, which can be different from the ordering of your actual query. This is useful for situations where you need to determine the ranking of results according to one criterion but are actually interested in returning them ordered by another.
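
For example (using the jobs table from my earlier post; the created_date column is made up), you can rank by one ordering and return the rows in another:

code:
SELECT jobid,
       ROW_NUMBER() OVER (ORDER BY created_date) AS arrival_rank
  FROM jobs
 ORDER BY jobid;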

I don't know MSSQL and it might be different, but I suspect this much is the same.

npe
Oct 15, 2004

chocojosh posted:

Is ON DELETE CASCADE evil?

This question is an easy way to start a pissing match between database developers, in my experience. In my opinion they are in the category of "things we can do if we really need to, but let's try to avoid it". This is because you can easily end up with a byzantine labyrinth of triggers and cascades that is very difficult to follow. Trying to play "guess how this record is getting deleted" is not fun.

Of course, deleting records in general is considered something to avoid for us, so we naturally view automatic deletion with even more suspicion.

npe
Oct 15, 2004

Victor posted:

So it's LAST_INSERT_ID() for MySQL 5 (what about 4?), SCOPE_IDENTITY() for SQL Server 2000+, what about Oracle, Postgre?

Oracle uses RETURNING:

code:

 INSERT INTO my_table (foo, bar) VALUES ('foo', 'bar') RETURNING id INTO :id

You would then retrieve the bind variable as normal.
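
In straight PL/SQL it looks roughly like this (a sketch, assuming my_table has a numeric id column):

code:
DECLARE
  new_id  my_table.id%TYPE;
BEGIN
  INSERT INTO my_table (foo, bar)
  VALUES ('foo', 'bar')
  RETURNING id INTO new_id;
END;
/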

npe
Oct 15, 2004
You should be able to use either IN() or an inline view. Like...

code:
SELECT * FROM blah WHERE foo IN
(SELECT foo,COUNT(*) FROM foobar
 WHERE fred IS NOT NULL 
 GROUP BY foo HAVING COUNT(*) = 1)

-- alternately, with an inline view

SELECT * FROM 
 blah JOIN
 (SELECT foo,COUNT(*) FROM foobar
  WHERE fred IS NOT NULL 
  GROUP BY foo HAVING COUNT(*) = 1) s ON s.foo = blah.foo

npe
Oct 15, 2004

Senso posted:

This does not work; I get an "ORA-00913: too many values" error. I'll try to make the second option work and let you know how that goes.

That's because I'm stupid. Take out the COUNT(*) from the select clause:

code:
SELECT * FROM blah WHERE foo IN
(SELECT foo FROM foobar
 WHERE fred IS NOT NULL 
 GROUP BY foo HAVING COUNT(*) = 1)
Sorry! :downs:

npe
Oct 15, 2004
Dealing with empty date ranges is actually a huge pain in the rear end with vanilla ANSI SQL. In fact, I don't know if there is a way to do it without creating a reference table for your dates. There are often platform-specific ways of doing it, though; I could help you if you ran Oracle. :v:
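
For the curious, the Oracle trick I had in mind is generating the calendar on the fly and outer joining against it, roughly like this (table and column names made up, and it needs a reasonably recent Oracle):

code:
-- generate one row per day, then left join your data onto it
SELECT d.day, COUNT(s.id) AS sales
  FROM (SELECT DATE '2008-01-01' + LEVEL - 1 AS day
          FROM dual
        CONNECT BY LEVEL <= 31) d
  LEFT JOIN sales s ON TRUNC(s.sale_date) = d.day
 GROUP BY d.day
 ORDER BY d.day;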

In your case you need someone who knows MS SQL. Paging Victor to thread 340343284.

npe
Oct 15, 2004

Lancer383 posted:

Any truth to what she is suggesting? It doesn't seem right to me, but I can't put my finger on why. I've found that with many of that office's queries, speed issues tend to stem from not using the table indexes in joins and an obscene use of subqueries and DISTINCTs.

Maybe it depends on something wonky you have going on. On an indexed column it is actually slower for me:

code:
SQL> select jobid from jobs where jobid in (31759, 31758, 31757);

     JOBID
----------
     31757
     31758
     31759


Execution Plan
----------------------------------------------------------

------------------------------------------------------------------
| Id  | Operation         | Name    | Rows  | Bytes | Cost (%CPU)|
------------------------------------------------------------------
|   0 | SELECT STATEMENT  |         |     3 |    15 |     2   (0)|
|   1 |  INLIST ITERATOR  |         |       |       |            |
|*  2 |   INDEX RANGE SCAN| JOBS_PK |     3 |    15 |     2   (0)|
------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("JOBID"=31757 OR "JOBID"=31758 OR "JOBID"=31759)

Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
          7  consistent gets
          0  physical reads
          0  redo size
        356  bytes sent via SQL*Net to client
        332  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          3  rows processed
code:
SQL> select jobid from jobs where decode(jobid,31759, 1, 31758, 1, 31757, 1, 0) = 1;

     JOBID
----------
     31757
     31758
     31759


Execution Plan
----------------------------------------------------------

---------------------------------------------------------------------
| Id  | Operation            | Name    | Rows  | Bytes | Cost (%CPU)|
---------------------------------------------------------------------
|   0 | SELECT STATEMENT     |         |   184 |   920 |    10   (0)|
|*  1 |  INDEX FAST FULL SCAN| JOBS_PK |   184 |   920 |    10   (0)|
---------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(DECODE("JOBID",31759,1,31758,1,31757,1,0)=1)

Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
         48  consistent gets
          0  physical reads
          0  redo size
        356  bytes sent via SQL*Net to client
        332  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          3  rows processed

npe
Oct 15, 2004
Unindexed, I get identical plans:

code:
---------------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)|
---------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |   184 |  1656 |    56   (2)|
|*  1 |  TABLE ACCESS FULL| JOBS |   184 |  1656 |    56   (2)|
---------------------------------------------------------------
The Oracle optimizer is pretty good most of the time, and crappy developers always try to second-guess it. I can't imagine a scenario where the CBO is fooled into doing a worse scan using IN() than it would for a DECODE().

npe
Oct 15, 2004

Lancer383 posted:

Yeap, totally agree. I wish I could, but I'm not able to do an EXPLAIN PLAN the way the DBAs have our privileges set up.

If you can create tables in your user schema, you can create your own plan table and use EXPLAIN PLAN INTO.

See http://www.adp-gmbh.ch/ora/explainplan.html
For the table details, see http://www.adp-gmbh.ch/ora/exp_plan/plan_table.html

Here's an example:

code:
SQL> create table my_plans (
        statement_id       varchar2(30),
        plan_id            number,
        timestamp          date,
        remarks            varchar2(4000),
        operation          varchar2(30),
        options            varchar2(255),
        object_node        varchar2(128),
        object_owner       varchar2(30),
        object_name        varchar2(30),
        object_alias       varchar2(65),
        object_instance    numeric,
        object_type        varchar2(30),
        optimizer          varchar2(255),
        search_columns     number,
        id                 numeric,
        parent_id          numeric,
        depth              numeric,        
        position           numeric,
        cost               numeric,
        cardinality        numeric,
        bytes              numeric,
        other_tag          varchar2(255),
        partition_start    varchar2(255),
        partition_stop     varchar2(255),
        partition_id       numeric,
        other              long,
        other_xml          clob,
        distribution       varchar2(30),
        cpu_cost           numeric,
        io_cost            numeric,
        temp_space         numeric,
        access_predicates  varchar2(4000),
        filter_predicates  varchar2(4000),
        projection         varchar2(4000),
        time               numeric,
        qblock_name        varchar2(30)
 38  );

Table created.

SQL> explain plan into my_plans for 
  2    select 1 from dual;

Explained.

SQL> select * from table(dbms_xplan.display('my_plans', null, 'typical'));

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
Plan hash value: 1388734953

-------------------------------------------------------------------------
| Id  | Operation        | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------
|   0 | SELECT STATEMENT |      |     1 |       |     2   (0)| 00:00:01 |
|   1 |  FAST DUAL       |      |     1 |       |     2   (0)| 00:00:01 |
-------------------------------------------------------------------------

8 rows selected.

npe fucked around with this message at 17:27 on Mar 6, 2008

npe
Oct 15, 2004

Lancer383 posted:

Yeah, I don't know. Looked up that message on Google and it looks like something that needs to be enabled in the DB's settings, and I'm afraid of venturing into the configuration of the DB, as I can imagine the DBA's wouldn't be too crazy about that.

Ignore CPU costing for now; it's not going to help.

Those plans do indicate that the decode is faster, but I don't know why. Well, I do know why -- you're getting a bad plan when using the IN list. I don't know why that is, though. What version of Oracle is this?

Edit: gently caress it

npe fucked around with this message at 21:56 on Mar 6, 2008

npe
Oct 15, 2004

Lancer383 posted:

I believe it's 8/8i. We're forced to execute SQL statements through this old program called Infomaker that I believe is really intended to be a report-building program, but it's all they give us.

The thing is, you're getting full table scans out of both. This is going to suck no matter how you slice it. While the plans you posted (and some of the plans I get when trying similar syntax) seem to indicate a higher row cost, I don't think that's necessarily true (those numbers tend to be just guesses from the CBO). I don't believe they actually indicate much of a difference; full scans are expensive no matter what.

I'm going to keep poking at it, though. Generally speaking, you want to be concerned with the access path more than anything. In this case, they are both full scans, so I really don't think there's an appreciable difference.

This applies to me too; I made a mistake earlier in calling hers "slower" based on the estimated row/byte cost. A coworker had to remind me not to trust those, and that the two are probably equivalent. Hey, I'm a developer and not a DBA. :)

npe fucked around with this message at 22:06 on Mar 6, 2008

npe
Oct 15, 2004
Sounds like you are on the right track. Always get in good with your DBAs; they will help you find poorly performing queries in production, and if you're lucky they will go as far as to rewrite things for you if they know a better way.

Also, I completely forgot to note that this is on Oracle 8, so yeah... that sucks. :(

npe
Oct 15, 2004
I've never heard of anything, but I'm not really paying attention on that end of things. Personally, I'd just stick with the homebrew app; it will be less lovely than whatever "enterprise" solution gets sold to you, and will be customizable in the future as well. Just my opinion though.

npe
Oct 15, 2004

fletcher posted:

In MySQL is there a way to do a SELECT but have it omit rows where the value in a column is within a certain range of the previous row it returned? Like if I have a table with a timestamp column and I want to return rows that are at least 10,000 seconds apart or something. Does that make sense? I'm terrible at trying to explain things.

What you're looking for is something called an analytic function. Oracle (and I think now certain versions of MS SQL Server) has these, but to my knowledge MySQL does not. Specifically, Oracle provides SELECT foo, LAG(foo, 1) OVER (ORDER BY blah) FROM ... which lets you access data from a "window" of rows. These are sometimes called "windowing" queries.

However, the last time this came up I did find this page which might have a passable way to fake it.
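
For reference, the Oracle version would look roughly like this (a sketch with a made-up events table and a numeric ts column; note that LAG compares against the previous row in the ordering, which isn't quite the same as "the previous row it returned", but it's close):

code:
SELECT ts
  FROM (SELECT ts,
               LAG(ts, 1) OVER (ORDER BY ts) AS prev_ts
          FROM events)
 WHERE prev_ts IS NULL
    OR ts - prev_ts >= 10000;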

npe
Oct 15, 2004

functional posted:

I've been playing with this... When we GROUP BY on the two fields, we don't actually get what we're trying to do. It goes in the other direction: now the same user can vote twice if he comes from a different IP. And the same IP can vote twice if he comes from a different username (definitely not what we want).

Er, maybe I'm misunderstanding, but this doesn't make a lot of sense. If you have

code:
 User | IP
----------------
 Joe  | 1.2.3.4
 Sam  | 1.2.3.4
...you're saying you don't want these to count as two votes? In that case, where does Sam's vote go?

It sounds like you're trying to apply a constraint to your data model after the data already exists, the constraint being that users and IPs have a 1:1 ratio for votes. Is that correct?

Also, that blog is terrible, because he advocates further extending MySQL's already dangerously broken GROUP BY syntax to be even more broken.

npe
Oct 15, 2004
Well the problem I guess is really more like the following:

code:
 User | IP
----------------
 Joe  | 1.2.3.4
 Sam  | 1.2.3.4
 Sam  | 5.6.7.8
 Al   | 5.6.7.8
Which of these rows should be discarded?

Edit: Where I'm going here is that the problem is being misrepresented as a GROUP BY/DISTINCT problem and really it's a modeling problem. It sounds like you'd like to be able to GROUP BY user OR ip, which is impossible.

This query might be possible but I think it's going to be exceedingly ugly.

npe fucked around with this message at 19:10 on Mar 26, 2008

npe
Oct 15, 2004

functional posted:

Yes, I'm going to give up doing this in MySQL. It's simply not equipped for the job.

You'd have a hard time pulling this off in any database, because the modeling is bad. You have a sort of entity defined by one or the other attribute, which would more appropriately be represented by a related table.

npe
Oct 15, 2004
Regarding the check constraint, the MySQL manual says:

quote:

The CHECK clause is parsed but ignored by all storage engines.

For the NOT NULL, I believe this is relevant:

quote:

Before MySQL 5.0.2, MySQL is forgiving of illegal or improper data values and coerces them to legal values for data entry. In MySQL 5.0.2 and up, that remains the default behavior, but you can change the server SQL mode to select more traditional treatment of bad values such that the server rejects them and aborts the statement in which they occur.
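
So if you want the stricter behavior on 5.0.2+, something like this should do it (I haven't tested it against every 5.0.x):

code:
-- reject bad values instead of silently coercing them
SET SESSION sql_mode = 'TRADITIONAL';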

npe
Oct 15, 2004
You should think about whether or not you want the actual PDFs in your database. The alternative is to not store the actual files, but to load them onto a more typical file server running a webserver, and then store the path to the file in the database. This is a bit of a religious war where I work, but the long and short of it is that these two approaches have (as always) advantages and disadvantages:

In the database:
Advantages
- data integrity (your pdf's won't get deleted by a wayward shell script)
- ease of retrieval (just use the database connection, select the BLOB, presto!)
Disadvantages
- bloats the size of your table, can't be easily divided into multiple storage nodes without dealing with some form of database partitioning/clustering or using some form of clustered filesystem/san
- forces all requests for the file to be served by the database, turning your database into a file server, which it's not really built for

Outside the database:
Advantages
- more easily managed storage of files, can be done across multiple storage servers with no trouble
- keeps your tables normal sized
- requests for the file can be served over something like apache, which is built for exactly that
Disadvantages
- no data integrity enforced by the database (rm -rf *, oh poo poo)
- can't just SELECT your file out, have to SELECT out the location and then retrieve over http, which is sometimes a little more complicated depending on what you're doing

npe
Oct 15, 2004

Xae posted:

I prefer to keep them in the DB, because I am a DB whore. The ease of access of a database can be improved by using stored procedures to create a hybrid system. You can have one procedure that, when passed the name (PK) of a file and a location, will write out the file to that location for easy pickup. A complementary procedure can be used to replace the blob in the database with the one at the specified location.

Well, that solves some cases but not all. In our case this wouldn't solve the problem, because you are still left with the critical disadvantage of having all of your file data served at some point via the database (even with your solution, which just adds another layer; it doesn't really take out the bottleneck), which means pushing a high volume of data through what is commonly an already overstressed component. It also means we'd be storing terabytes of file data as part of the database, and it would make managing table data exponentially more difficult. By storing the files on external SAN nodes we can trivially add more storage without database downtime.

I have no idea what kind of needs LightI3ulb has, though. For something small it's probably no problem, but I wanted to point out that this is frequently a hotly contested design point. At my office we refer to people as "innies" vs. "outies".

npe
Oct 15, 2004

Xae posted:

You should be able to partition the table vertically if the volume of the files is too much for your database. Split the tablespace and have the blob table on a separate set of disks.

Partitioning brings its own set of problems (we already partition extensively). Storing 350 TB of files inside the tables is not really a serious option even to our most rabid advocates of that approach.


npe
Oct 15, 2004
I've been on the defensive end of a few EAV fights, but I'll always say that this is not the case to use it. If you can model your relationships as tables, then do it. There's no reason to fear having a lot of tables; that's exactly what the database is there for.

EAV gains traction in cases where the application has to be designed to allow dynamic run-time modification of the relationships by the users on a constant basis. I don't want to get dragged into this argument again but that's where at least some people will justify using it.

If you can model your relationships statically, it's always the way to go.

Edited to add: the reason to hate on this so much is that if you're just going with a "dynamic" model to avoid having to manage all of these relationships, it's just self-delusion. These relationships need to be defined and managed somewhere, and if you don't do it via your schema design, you'll end up doing it either in the table data or in your application code. It will end up being more work, more prone to problems, harder to version control, and it will piss off everyone who has to work on it later.

npe fucked around with this message at 21:04 on Apr 9, 2008
