SQL Problem: Merge record history tables

A colleague asked me if I could help solve a problem in SQL, and I was grateful because I love solving problems in SQL! So we sat down, built a test suite, and worked together to solve it.

I will first present the problem and invite you to have a go at solving it for yourself; tomorrow I will post the solution we came up with. I think our solution is reasonably simple and elegant, but I’m certain there are other solutions as well, some of which may be more elegant or efficient.

I have two “record history” tables, from different data sources, that record the history of changes to two sets of columns describing the same entity. Here are the example tables:

T1 (deptid, from_date, to_date, data1)
T2 (deptid, from_date, to_date, data2)

So T1 records the changes in the column “data1” over time for each deptid; and T2 records the changes in the column “data2” over time for each deptid.

Assumptions/Constraints on T1 and T2:

  • All columns are not Null.
  • All dates are just dates (no time values).
  • All date ranges are valid (i.e. from_date <- to_date).
  • Within each table, all date ranges for a particular deptid are gap-free and have no overlaps.

We want to produce a query that returns the following structure:

T3 (deptid, from_date, to_date, data1, data2)

Where the dates for a particular deptid overlap between T1 and T2, we want to see the corresponding data1 and data2 values. Where the dates do not overlap (e.g. where we have a value for data1 in T1 but not in T2), we want to see the corresponding data1 but NULL for data2; and vice versa.

Below is the setup and test cases you can use if you wish. Alternatively, you can use the SQL Fiddle here.

create table t1
  (deptid number not null
  ,from_date date not null
  ,to_date date not null
  ,data1 varchar2(30) not null
  ,primary key (deptid, from_date)
  );

create table t2
  (deptid number not null
  ,from_date date not null
  ,to_date date not null
  ,data2 varchar2(30) not null
  ,primary key (deptid, from_date)
  );

--Test Case 1 (4001)
--T1  |-------------||-------------|
--T2         |-------------||-------------||-------------|

insert into t1 values (4001, date'2015-01-01', date'2015-01-31', 'jan data 1');
insert into t1 values (4001, date'2015-02-01', date'2015-02-28', 'feb data 1');
insert into t2 values (4001, date'2015-01-15', date'2015-02-14', 'jan-feb data 2');
insert into t2 values (4001, date'2015-02-15', date'2015-03-14', 'feb-mar data 2');
insert into t2 values (4001, date'2015-03-15', date'2015-04-14', 'mar-apr data 2');

--Expected Output:
--DEPTID  FROM_DATE    TO_DATE      DATA1           DATA2
--======  ===========  ===========  ==============  ==============
--4001    01-JAN-2015  14-JAN-2015  jan data 1
--4001    15-JAN-2015  31-JAN-2015  jan data 1      jan-feb data 2
--4001    01-FEB-2015  14-FEB-2015  feb data 1      jan-feb data 2
--4001    15-FEB-2015  28-FEB-2015  feb data 1      feb-mar data 2
--4001    01-MAR-2015  14-MAR-2015                  feb-mar data 2
--4001    15-MAR-2015  14-APR-2015                  mar-apr data 2

--Test Case 2 (4002)
--T1         |-------------||-------------||-------------|
--T2  |-------------||-------------|

insert into t1 values (4002, date'2015-01-15', date'2015-02-14', 'jan-feb data 1');
insert into t1 values (4002, date'2015-02-15', date'2015-03-14', 'feb-mar data 1');
insert into t1 values (4002, date'2015-03-15', date'2015-04-14', 'mar-apr data 1');
insert into t2 values (4002, date'2015-01-01', date'2015-01-31', 'jan data 2');
insert into t2 values (4002, date'2015-02-01', date'2015-02-28', 'feb data 2');

--Expected Output:
--DEPTID  FROM_DATE    TO_DATE      DATA1           DATA2
--======  ===========  ===========  ==============  ==============
--4002    01-JAN-2015  14-JAN-2015                  jan data 2
--4002    15-JAN-2015  31-JAN-2015  jan-feb data 1  jan data 2
--4002    01-FEB-2015  14-FEB-2015  jan-feb data 1  feb data 2
--4002    15-FEB-2015  28-FEB-2015  feb-mar data 1  feb data 2
--4002    01-MAR-2015  14-MAR-2015  feb-mar data 1
--4002    15-MAR-2015  14-APR-2015  mar-apr data 1

--Test Case 3 (4003)
--T1  |-------------|
--T2  |-------------|

insert into t1 values (4003, date'2015-01-01', date'2015-01-31', 'jan data 1');
insert into t2 values (4003, date'2015-01-01', date'2015-01-31', 'jan data 2');

--Expected Output:
--DEPTID  FROM_DATE    TO_DATE      DATA1           DATA2
--======  ===========  ===========  ==============  ==============
--4003    01-JAN-2015  31-JAN-2015  jan data 1      jan data 2

--Test Case 4 (4004)
--T1  |-------------|
--T2      |-----|

insert into t1 values (4004, date'2015-01-01', date'2015-01-31', 'jan data 1');
insert into t2 values (4004, date'2015-01-10', date'2015-01-20', 'jan data int 2');

--Expected Output:
--DEPTID  FROM_DATE    TO_DATE      DATA1           DATA2
--======  ===========  ===========  ==============  ==============
--4004    01-JAN-2015  09-JAN-2015  jan data 1
--4004    10-JAN-2015  20-JAN-2015  jan data 1      jan data int 2
--4004    21-JAN-2015  31-JAN-2015  jan data 1

--Test Case 5 (4005)
--T1      |-----|
--T2  |-------------|

insert into t1 values (4005, date'2015-01-10', date'2015-01-20', 'jan data int 1');
insert into t2 values (4005, date'2015-01-01', date'2015-01-31', 'jan data 2');

--Expected Output:
--DEPTID  FROM_DATE    TO_DATE      DATA1           DATA2
--======  ===========  ===========  ==============  ==============
--4005    01-JAN-2015  09-JAN-2015                  jan data 2
--4005    10-JAN-2015  20-JAN-2015  jan data int 1  jan data 2
--4005    21-JAN-2015  31-JAN-2015                  jan data 2

--Test Case 6 (4006)
--T1  ||------------|
--T2  |-------------|

insert into t1 values (4006, date'2015-01-01', date'2015-01-01', 'jan data a 1');
insert into t1 values (4006, date'2015-01-02', date'2015-01-31', 'jan data b 1');
insert into t2 values (4006, date'2015-01-01', date'2015-01-31', 'jan data 2');

--Expected Output:
--DEPTID  FROM_DATE    TO_DATE      DATA1           DATA2
--======  ===========  ===========  ==============  ==============
--4006    01-JAN-2015  01-JAN-2015  jan data a 1    jan data 2
--4006    02-JAN-2015  31-JAN-2015  jan data b 1    jan data 2

--Test Case 7 (4007)
--T1  |-------------|
--T2  ||------------|

insert into t1 values (4007, date'2015-01-01', date'2015-01-31', 'jan data 1');
insert into t2 values (4007, date'2015-01-01', date'2015-01-01', 'jan data a 2');
insert into t2 values (4007, date'2015-01-02', date'2015-01-31', 'jan data b 2');

--Expected Output:
--DEPTID  FROM_DATE    TO_DATE      DATA1           DATA2
--======  ===========  ===========  ==============  ==============
--4007    01-JAN-2015  01-JAN-2015  jan data 1      jan data a 2
--4007    02-JAN-2015  31-JAN-2015  jan data 1      jan data b 2

--Test Case 8 (4008)
--T1  |-------------|
--T2    |---||---||---|
 
insert into t1 values (4008, date'2015-01-01', date'2015-01-31', 'jan data 1');
insert into t2 values (4008, date'2015-01-09', date'2015-01-17', 'jan data a 2');
insert into t2 values (4008, date'2015-01-18', date'2015-01-26', 'jan data b 2');
insert into t2 values (4008, date'2015-01-27', date'2015-02-04', 'jan-feb data 2');
 
--Expected Output:
--DEPTID  FROM_DATE    TO_DATE      DATA1           DATA2
--======  ===========  ===========  ==============  ==============
--4008    01-JAN-2015  08-JAN-2015  jan data 1
--4008    09-JAN-2015  17-JAN-2015  jan data 1      jan data a 2
--4008    18-JAN-2015  26-JAN-2015  jan data 1      jan data b 2
--4008    27-JAN-2015  31-JAN-2015  jan data 1      jan-feb data 2
--4008    01-FEB-2015  04-FEB-2015                  jan-feb data 2

--Test Case 9 (4009)
--T1    |---||---||---|
--T2  |-------------|
 
insert into t1 values (4009, date'2015-01-09', date'2015-01-17', 'jan data a 1');
insert into t1 values (4009, date'2015-01-18', date'2015-01-26', 'jan data b 1');
insert into t1 values (4009, date'2015-01-27', date'2015-02-04', 'jan-feb data 1');
insert into t2 values (4009, date'2015-01-01', date'2015-01-31', 'jan data 2');
 
--Expected Output:
--DEPTID  FROM_DATE    TO_DATE      DATA1           DATA2
--======  ===========  ===========  ==============  ==============
--4009    01-JAN-2015  08-JAN-2015                  jan data 2
--4009    09-JAN-2015  17-JAN-2015  jan data a 1    jan data 2
--4009    18-JAN-2015  26-JAN-2015  jan data b 1    jan data 2
--4009    27-JAN-2015  31-JAN-2015  jan-feb data 1  jan data 2
--4009    01-FEB-2015  04-FEB-2015  jan-feb data 1

EDIT 15/10/2015: added Kim Berg Hansen’s test cases

UPDATE 15/10/2015: solutions (spoiler alert!)


Trivia question: Why does the 2nd Drop Work?

I have an Oracle database (10g or 11g, doesn’t matter) with an empty schema. I am the only user on the system and I only have one session open.

I create a single table (FOO) and run a few simple SQL commands. Then, I run this, twice in a row without any intermediate steps:

DROP TABLE foo;

The first time, some errors are raised and the table is not dropped.

The second time, it succeeds – the table is dropped without error.

Q. What did I do to cause this behaviour (which is 100% reproducible) – and what errors were reported?

A. Congratulations to Kirill Leontiev, Matthias Rogel, Charles Hooper and Mette who all found the answer!

Spoiler Alert: Don’t read the comments if you don’t want to know the answer straight away.


DATE is not a date!

Consider:

  • DATE is not a date.
  • DATE '2012-06-22' is not a date.
  • CURRENT_DATE is not a date.
  • SYSDATE is not a date.
  • TRUNC(SYSDATE) is not a date.
  • TO_DATE('22/06/2012','DD/MM/YYYY') is not a date.

Oracle SQL does not have a native date datatype.

Explanation: the datatype called “DATE” actually represents a Date+Time. It always has a time portion. For example, the literal expression DATE '2012-06-22'has a time component that means midnight, to the nearest second. It is only by convention that a DATE with a time of 00:00:00 is used to represent the whole day, rather than exactly midnight; but this convention only works if the developer carefully writes his code to ensure this is true.

A fragment of the Antikythera mechanism.

It is theorized that the ancients solved “once and for all” the problem with dates vs. times.

To an experienced Oracle developer, this is not a problem, it feels perfectly natural to say “DATE” but internally to be thinking in terms of time values. It’s easy to forget that we’re even doing this subconsciously. To an outsider (e.g. business analysts, testers, Java developers), it can be confusing – and it is often not immediately obvious that the confusion exists. It’s not until you start explaining why SYSDATE <= DATE '2012-06-22' evaluates to FALSE, even if today is the 22nd day of the month of June in the year of our Lord 2012 that you realise they have been labouring under a false assumption: that a DATE is a date, that it represents the full 24-hour period that a normal human would call “22 June 2012”.

If I was invited to change the name of just one thing in Oracle (and everyone was willing to make all the changes necessary to their code to accommodate my whim) it would be to change “DATE” to “DATETIME”.

I ask you: is there anything else in Oracle that confuses outsiders more often than this misnomer?

P.S. Now for a freebie: here is a summary of a number of transformations that may be done to remove the TRUNC function call around a date column (a and b are all of type DATE):

 TRUNC(a) =  TRUNC(b)  =>  (a BETWEEN TRUNC(b) AND TRUNC(b)+0.99999)
 TRUNC(a) <  TRUNC(b)  =>  a < TRUNC(b)
 TRUNC(a) <= TRUNC(b)  =>  a < TRUNC(b)+1
 TRUNC(a) >  TRUNC(b)  =>  a >= TRUNC(b)+1
 TRUNC(a) >= TRUNC(b)  =>  a >= TRUNC(b)
 TRUNC(a) =  b         =>  (a BETWEEN b AND b+0.99999
                            AND b = TRUNC(b))
 TRUNC(a) <  b         =>  (a < TRUNC(b)+1
                            AND NOT (a=TRUNC(a) AND b=TRUNC(b)))
 TRUNC(a) <= b         =>  a < TRUNC(b)+1
 TRUNC(a) >  b         =>  a >= TRUNC(b)+1
 TRUNC(a) >= b         =>  (a >= TRUNC(b)
                            AND NOT (b=TRUNC(b)))

Data dictionary quiz question

This is a totally unfair quiz question (for anyone who isn’t intimately acquainted with the Oracle data dictionary views). There, I’ve warned you. I would have got this one wrong, myself.

Which of the following queries (if any) will run without error?

SELECT * FROM dba_tab_privs WHERE owner = 'SCOTT';
SELECT * FROM all_tab_privs WHERE owner = 'SCOTT';
SELECT * FROM user_tab_privs WHERE owner = 'SCOTT';

Now, don’t go and try running these in your database until after you’ve written down what you think the answers are. That would be cheating 🙂

.

.

.

EDIT

Ok, now to break the suspense for all the other readers (both of you), who couldn’t be bothered testing it out for yourself.

Statement 1. The query on DBA_TAB_PRIVS will succeed – assuming you have the necessary privileges on the DBA* views. This view does include the column OWNER.

Statement 3. The query on USER_TAB_PRIVS, unlike what I and several others might assume,  will succeed.  Many data dictionary views, such as USER_TABLES, omit the OWNER column – which makes sense, since it is expected that it will be simply the currently-logged-in user. But for user_tab_privs, this column is provided, for good reason – because the table you have a privilege on may very well be owned by another schema.

Statement 2. The query on ALL_TAB_PRIVS, in order to be consistent with DBA_TAB_PRIVS and USER_TAB_PRIVS, should have the OWNER column, by rights. But, just to make things interesting, the column is called TABLE_SCHEMA instead. So, my query would fail.


Difference Between Two Timestamps

Easy, right? Perhaps not as straightforward as you’d think.

Method #1: use EXTRACT

extract(day from (x-y))*24*60*60
+ extract(hour from (x-y))*60*60
+ extract(minute from (x-y))*60
+ extract(second from (x-y))

Method #2: use CAST

( CAST( x AS DATE ) - CAST( y AS DATE ) ) * 86400

The difference? Method #2 is faster (my tests indicate faster by a factor of 3), but does not include fractional seconds. Method #1 is a bit slower, but includes fractions of a second. See the SO link for details.


INSERT-VALUES vs. INSERT-SELECT-FROM-DUAL

There’s no difference between the effects of the following two statements, are there:

INSERT INTO mytable (col1, col2) VALUES ('hello','world');
INSERT INTO mytable (col1, col2) SELECT 'hello', 'world' FROM DUAL;

Well, as it turns out, it is possible for the first statement to succeed where the second statement would fail – in the presence of a suitably crafted Before Insert trigger, the second will raise “ORA-04091 table is mutating, trigger/function may not see it”:

http://oraclequirks.blogspot.com/2010/09/ora-04091-table-stringstring-is.html


A fascinating SQL problem

Can you solve this problem in SQL – i.e. without resorting to a procedural solution?

SQL combine multiple identifiers to create a group id for duplicate records

“I have a data extract with three different identifiers: A, B, C
Each identifier may appear in more than one row, and each row may have one or more of these three identifiers (i.e the column is populated or null).
I want to be able to group all records that have any combination of either A, B or C in common and assign them the same group id.
Extract table showing what the eventual groups should be:

A    | B    | C    | Group
====   ====   ====   =====
p      NULL   NULL   1
p      r      NULL   1
q      NULL   NULL   2
NULL   r      NULL   1
NULL   NULL   s      2
q      NULL   s      2

So, the input data is a table with three columns (A, B, and C), some of which are NULL. The output is a third column, “Group”, which will be assigned a number which classifies the row into a “group”. Each group will be distinct in that none of its members will have a value in A, B or C that appears in any row in any other group.

This question is fascinating because it cannot be solved, I believe, without some form of iteration. If I get the row (p) along with (q), they are in two different groups; but, if I add the row (p,q), all of a sudden my original rows are now in the same group along with the new row.

The solution will probably have to examine each row in consideration with the entire record set – an operation of O(n^2), if my understanding of CS theory is correct. I suspect a solution using at least a CTE and/or the MODEL clause will be required.

UPDATE:
An elegant solution, using a hierarchical query and Oracle’s CONNECT_BY_ROOT function, has been posted by Vincent Malgrat.


One insert or multiple inserts?


A colleague asked me a trick* question today which I failed 🙂

* whether it was a “trick” question is probably in the eye of the beholder, though…

“What are the differences, if any, between the following two approaches to inserting multiple rows (assume v1 and v2 have different values):

INSERT INTO mytable (a)
SELECT :v1 FROM DUAL;
INSERT INTO mytable (a)
SELECT :v2 FROM DUAL;

OR:

INSERT INTO mytable (a)
SELECT :v1 FROM DUAL
UNION
SELECT :v2 FROM DUAL;

I quickly answered:

  1. The first approach requires one extra parse;
  2. The second approach requires a Sort Distinct (albeit of only two rows)
  3. A UNION ALL would be better, which would not require a Sort, nor would require the extra parse.

My colleague responded, there’s one very important thing I missed: Triggers! The first approach would execute the BEFORE STATEMENT and AFTER STATEMENT triggers once for each row. The second approach would only execute these triggers once.

What’s sad is that the application we’re working on has row-level logic in the BEFORE/AFTER statement triggers. If we try to optimise the code to insert more than one row in one INSERT statement, the application only runs the row-level logic for the first row inserted. Bad code! Very very bad!


Weird SQL Results

Chris Taylor raised a good question.

Here is the interesting part (to me) – Sometimes it returns 1 row, sometimes it returns more than 1 row, and sometimes it returns no rows. To my mind, “trunc(dbms_random.value(1,500)” should return a random value between 1 & 500 every time and just one value.  Is this weird, or is it just me???

Basically he’s wondering why querying a table on a randomly-chosen unique identifier does not always return exactly 1 row.

At first glance it seems nonsensical and looks like a bug, but really, it’s not – in fact, Oracle is doing exactly the right thing (anecdotally, a colleague has told me that SQL Server gets this particular behaviour wrong).

A simple test case: the inner query generates exactly 100 rows, with the numbers 1 to 100. The outer query then queries this result set, with the intention of picking one of those rows only:

select * from (
  select rownum r from dual connect by level <= 100
)
where r = trunc(dbms_random.value(1,100));

However, what happens here is that Oracle knows that dbms_random.value is not deterministic; and the predicates (WHERE clause) must be evaluated once for each row returned from the FROM clause. Since dbms_random.value is not deterministic, Oracle knows it must re-evaluate it for each row, separately – which means it is comparing a different number for each row returned.

So, for example, Oracle looks at row 1, with the value 1. It generates a random number, e.g. 12, and so the predicate evaluates to FALSE, and the row is not returned. Oracle then looks at row 2, which has the value 2. It generates a random number, e.g. 2, and so the predicate evaluates to TRUE, and the row is returned. It does this for each row until the 100th is evaluated and then the query stops.

If none of the 100 random numbers happen to coincide with the values returned from the table, then the query will return no rows.

Quiz question: what is the theoretical maximum number of rows the above query could ever return?

To fix it so that Oracle only evaluates the selection criteria once, we just need to wrap it in another subquery:

select * from (
  select rownum r from dual connect by level <= 100
)
where r = (select trunc(dbms_random.value(1,100)) from dual);

A good entry-level interview question

I think this would make a reasonable question for would-be database developers:

What are the differences between the following statements?

UPDATE mytable SET mycolumn = NULL;
ALTER TABLE mytable MODIFY (mycolumn NULL);

If they can’t answer this correctly, they don’t get to stage two.