SQL Problem: Merge record history tables

A colleague asked me if I could help solve a problem in SQL, and I was grateful because I love solving problems in SQL! So we sat down, built a test suite, and worked together to solve it.

I will first present the problem and invite you to have a go at solving it for yourself; tomorrow I will post the solution we came up with. I think our solution is reasonably simple and elegant, but I’m certain there are other solutions as well, some of which may be more elegant or efficient.

I have two “record history” tables, from different data sources, that record the history of changes to two sets of columns describing the same entity. Here are the example tables:

T1 (deptid, from_date, to_date, data1)
T2 (deptid, from_date, to_date, data2)

So T1 records the changes in the column “data1” over time for each deptid; and T2 records the changes in the column “data2” over time for each deptid.

Assumptions/Constraints on T1 and T2:

  • All columns are not Null.
  • All dates are just dates (no time values).
  • All date ranges are valid (i.e. from_date <- to_date).
  • Within each table, all date ranges for a particular deptid are gap-free and have no overlaps.

We want to produce a query that returns the following structure:

T3 (deptid, from_date, to_date, data1, data2)

Where the dates for a particular deptid overlap between T1 and T2, we want to see the corresponding data1 and data2 values. Where the dates do not overlap (e.g. where we have a value for data1 in T1 but not in T2), we want to see the corresponding data1 but NULL for data2; and vice versa.

Below is the setup and test cases you can use if you wish. Alternatively, you can use the SQL Fiddle here.

create table t1
  (deptid number not null
  ,from_date date not null
  ,to_date date not null
  ,data1 varchar2(30) not null
  ,primary key (deptid, from_date)
  );

create table t2
  (deptid number not null
  ,from_date date not null
  ,to_date date not null
  ,data2 varchar2(30) not null
  ,primary key (deptid, from_date)
  );

--Test Case 1 (4001)
--T1  |-------------||-------------|
--T2         |-------------||-------------||-------------|

insert into t1 values (4001, date'2015-01-01', date'2015-01-31', 'jan data 1');
insert into t1 values (4001, date'2015-02-01', date'2015-02-28', 'feb data 1');
insert into t2 values (4001, date'2015-01-15', date'2015-02-14', 'jan-feb data 2');
insert into t2 values (4001, date'2015-02-15', date'2015-03-14', 'feb-mar data 2');
insert into t2 values (4001, date'2015-03-15', date'2015-04-14', 'mar-apr data 2');

--Expected Output:
--DEPTID  FROM_DATE    TO_DATE      DATA1           DATA2
--======  ===========  ===========  ==============  ==============
--4001    01-JAN-2015  14-JAN-2015  jan data 1
--4001    15-JAN-2015  31-JAN-2015  jan data 1      jan-feb data 2
--4001    01-FEB-2015  14-FEB-2015  feb data 1      jan-feb data 2
--4001    15-FEB-2015  28-FEB-2015  feb data 1      feb-mar data 2
--4001    01-MAR-2015  14-MAR-2015                  feb-mar data 2
--4001    15-MAR-2015  14-APR-2015                  mar-apr data 2

--Test Case 2 (4002)
--T1         |-------------||-------------||-------------|
--T2  |-------------||-------------|

insert into t1 values (4002, date'2015-01-15', date'2015-02-14', 'jan-feb data 1');
insert into t1 values (4002, date'2015-02-15', date'2015-03-14', 'feb-mar data 1');
insert into t1 values (4002, date'2015-03-15', date'2015-04-14', 'mar-apr data 1');
insert into t2 values (4002, date'2015-01-01', date'2015-01-31', 'jan data 2');
insert into t2 values (4002, date'2015-02-01', date'2015-02-28', 'feb data 2');

--Expected Output:
--DEPTID  FROM_DATE    TO_DATE      DATA1           DATA2
--======  ===========  ===========  ==============  ==============
--4002    01-JAN-2015  14-JAN-2015                  jan data 2
--4002    15-JAN-2015  31-JAN-2015  jan-feb data 1  jan data 2
--4002    01-FEB-2015  14-FEB-2015  jan-feb data 1  feb data 2
--4002    15-FEB-2015  28-FEB-2015  feb-mar data 1  feb data 2
--4002    01-MAR-2015  14-MAR-2015  feb-mar data 1
--4002    15-MAR-2015  14-APR-2015  mar-apr data 1

--Test Case 3 (4003)
--T1  |-------------|
--T2  |-------------|

insert into t1 values (4003, date'2015-01-01', date'2015-01-31', 'jan data 1');
insert into t2 values (4003, date'2015-01-01', date'2015-01-31', 'jan data 2');

--Expected Output:
--DEPTID  FROM_DATE    TO_DATE      DATA1           DATA2
--======  ===========  ===========  ==============  ==============
--4003    01-JAN-2015  31-JAN-2015  jan data 1      jan data 2

--Test Case 4 (4004)
--T1  |-------------|
--T2      |-----|

insert into t1 values (4004, date'2015-01-01', date'2015-01-31', 'jan data 1');
insert into t2 values (4004, date'2015-01-10', date'2015-01-20', 'jan data int 2');

--Expected Output:
--DEPTID  FROM_DATE    TO_DATE      DATA1           DATA2
--======  ===========  ===========  ==============  ==============
--4004    01-JAN-2015  09-JAN-2015  jan data 1
--4004    10-JAN-2015  20-JAN-2015  jan data 1      jan data int 2
--4004    21-JAN-2015  31-JAN-2015  jan data 1

--Test Case 5 (4005)
--T1      |-----|
--T2  |-------------|

insert into t1 values (4005, date'2015-01-10', date'2015-01-20', 'jan data int 1');
insert into t2 values (4005, date'2015-01-01', date'2015-01-31', 'jan data 2');

--Expected Output:
--DEPTID  FROM_DATE    TO_DATE      DATA1           DATA2
--======  ===========  ===========  ==============  ==============
--4005    01-JAN-2015  09-JAN-2015                  jan data 2
--4005    10-JAN-2015  20-JAN-2015  jan data int 1  jan data 2
--4005    21-JAN-2015  31-JAN-2015                  jan data 2

--Test Case 6 (4006)
--T1  ||------------|
--T2  |-------------|

insert into t1 values (4006, date'2015-01-01', date'2015-01-01', 'jan data a 1');
insert into t1 values (4006, date'2015-01-02', date'2015-01-31', 'jan data b 1');
insert into t2 values (4006, date'2015-01-01', date'2015-01-31', 'jan data 2');

--Expected Output:
--DEPTID  FROM_DATE    TO_DATE      DATA1           DATA2
--======  ===========  ===========  ==============  ==============
--4006    01-JAN-2015  01-JAN-2015  jan data a 1    jan data 2
--4006    02-JAN-2015  31-JAN-2015  jan data b 1    jan data 2

--Test Case 7 (4007)
--T1  |-------------|
--T2  ||------------|

insert into t1 values (4007, date'2015-01-01', date'2015-01-31', 'jan data 1');
insert into t2 values (4007, date'2015-01-01', date'2015-01-01', 'jan data a 2');
insert into t2 values (4007, date'2015-01-02', date'2015-01-31', 'jan data b 2');

--Expected Output:
--DEPTID  FROM_DATE    TO_DATE      DATA1           DATA2
--======  ===========  ===========  ==============  ==============
--4007    01-JAN-2015  01-JAN-2015  jan data 1      jan data a 2
--4007    02-JAN-2015  31-JAN-2015  jan data 1      jan data b 2

--Test Case 8 (4008)
--T1  |-------------|
--T2    |---||---||---|
 
insert into t1 values (4008, date'2015-01-01', date'2015-01-31', 'jan data 1');
insert into t2 values (4008, date'2015-01-09', date'2015-01-17', 'jan data a 2');
insert into t2 values (4008, date'2015-01-18', date'2015-01-26', 'jan data b 2');
insert into t2 values (4008, date'2015-01-27', date'2015-02-04', 'jan-feb data 2');
 
--Expected Output:
--DEPTID  FROM_DATE    TO_DATE      DATA1           DATA2
--======  ===========  ===========  ==============  ==============
--4008    01-JAN-2015  08-JAN-2015  jan data 1
--4008    09-JAN-2015  17-JAN-2015  jan data 1      jan data a 2
--4008    18-JAN-2015  26-JAN-2015  jan data 1      jan data b 2
--4008    27-JAN-2015  31-JAN-2015  jan data 1      jan-feb data 2
--4008    01-FEB-2015  04-FEB-2015                  jan-feb data 2

--Test Case 9 (4009)
--T1    |---||---||---|
--T2  |-------------|
 
insert into t1 values (4009, date'2015-01-09', date'2015-01-17', 'jan data a 1');
insert into t1 values (4009, date'2015-01-18', date'2015-01-26', 'jan data b 1');
insert into t1 values (4009, date'2015-01-27', date'2015-02-04', 'jan-feb data 1');
insert into t2 values (4009, date'2015-01-01', date'2015-01-31', 'jan data 2');
 
--Expected Output:
--DEPTID  FROM_DATE    TO_DATE      DATA1           DATA2
--======  ===========  ===========  ==============  ==============
--4009    01-JAN-2015  08-JAN-2015                  jan data 2
--4009    09-JAN-2015  17-JAN-2015  jan data a 1    jan data 2
--4009    18-JAN-2015  26-JAN-2015  jan data b 1    jan data 2
--4009    27-JAN-2015  31-JAN-2015  jan-feb data 1  jan data 2
--4009    01-FEB-2015  04-FEB-2015  jan-feb data 1

EDIT 15/10/2015: added Kim Berg Hansen’s test cases

UPDATE 15/10/2015: solutions (spoiler alert!)

Auto-convert field to uppercase
SQL Problem Solved: Merge History Tables

Comments

  1. with dates as (
      select distinct deptid, dte from (
        select deptid, from_date, to_date+1 to_date from t1
        union all
        select deptid, from_date, to_date+1 to_date from t2
      )
      unpivot(dte for col in(from_date, to_date))
    )
    select * from (
      select distinct a.deptid, dte from_date,
      lead(dte) over(partition by a.deptid order by dte) - 1 to_date,
      data1, data2 from dates a
      left join t1 b on a.deptid = b.deptid and a.dte between b.from_date and b.to_date
      left join t2 c on a.deptid = c.deptid and a.dte between c.from_date and c.to_date
    )
    where data1 is not null or data2 is not null
    order by 1,2;
    

    Note – SQL:2011 says TO_DATE should be exclusive, not inclusive. Oracle 12c agrees.

    An addition to the requirement would be to merge contiguous date ranges with the same date. The “start of group” method or 12c MATCH_RECOGNIZE could help there.

    Best regards,
    Stew Ashton
    stewashton.wordpress.com

    • T. Bhgoeswara Prasad
      14 February 2016 - 10:21 pm

      stewashton,

      I am new to oracle development . I need your help get the below result . Please help me.

      We have two tables A, B . In A we have two columns one is FROM_DATE and TO_ DATE.

      that is

      FROM_DATE TO_ DATE

      10-NOV-2015 31-DEC-9999

      In B table

      id Valid_from Valid_to

      1 10-NOV-2015 31-DEC-2015
      2 11-NOV-2016 31-DEC-2016

      The dates should be in B table as between A table dates . That exists already. Here the requirement is we have missed some dates between the A table from date and to date. In B table we have from 10-NOV_2015 to 31-DEC-2015 after that we have missed 1-1-2016 to 10-nov-2016. Then we have dates in b table from 11-NOV-2016 to 31-DEC-2016 . Again we have missed the dates after 31-DEC-2016 . that is from 1-1-2017 to 31-DEC-9999. The final expected out should be like below. Please help me how to fetch this.

      id Valid_from Valid_to

      1 10-NOV-2015 31-DEC-2015
      2 01-01-2016 10-NOV-2016 ——> Please how to fetch this date.
      3 11-NOV-2016 31-DEC-2016
      4 01-01-2017 31-DEC-9999 ——–> please how to fetch this date.

      • Hi T. Bhgoeswara Prasad,

        Well, I’m not stewashton 🙂 but this might help – I had to fiddle with the dates somewhat because of the “31/12/9999” date:

        with dates as ( 
          select distinct dte from ( 
            select from_date, to_date+case when to_date&lt;date'9999-12-31' then 1 else 0 end to_date from tableA 
            union all 
            select valid_from as from_date, valid_to+case when valid_to&lt;date'9999-12-31' then 1 else 0 end as to_date from tableB 
          ) 
          unpivot(dte for col in(from_date, to_date)) 
        ) 
        select row_number() over (order by from_date) as id, from_date, to_date-case when to_date&lt;date'9999-12-31' then 1 else 0 end as to_date from ( 
          select distinct dte from_date, 
          lead(dte) over(order by dte) to_date 
          from dates a 
          left join tableA b on a.dte between b.from_date and b.to_date 
          left join (select valid_from as from_date, valid_to as to_date from tableB) c on a.dte between c.from_date and c.to_date 
        ) 
        where from_date is not null and to_date is not null 
        order by 1,2
        

        http://livesql.oracle.com/apex/livesql/s/cvkntrdbeh68d5kl0jphm2trq

  2. Kim Berg Hansen
    14 October 2015 - 8:12 pm

    It’s not elegant, on the contrary rather ugly, but it’ll work with a single scan of both tables 😉

    with v1 as (
       select t.deptid
            , case r.l
                 when 1 then t.from_date
                 when 2 then date '0001-01-01'
                 when 3 then t.max_to + 1
              end from_date
            , case r.l
                 when 1 then t.to_date
                 when 2 then t.min_from -1
                 when 3 then date '9999-01-01'
              end to_date
            , case r.l
                 when 1 then t.data1
              end data1
         from (
          select deptid
               , from_date
               , to_date
               , data1
               , row_number() over (partition by deptid order by from_date) rn
               , min(from_date) over (partition by deptid) min_from
               , max(to_date) over (partition by deptid) max_to
            from t1
         ) t, lateral( 
             select level l
               from dual
             connect by level <= case t.rn when 1 then 3 else 1 end
         ) r
    ), v2 as (
       select t.deptid
            , case r.l
                 when 1 then t.from_date
                 when 2 then date '0001-01-01'
                 when 3 then t.max_to + 1
              end from_date
            , case r.l
                 when 1 then t.to_date
                 when 2 then t.min_from -1
                 when 3 then date '9999-01-01'
              end to_date
            , case r.l
                 when 1 then t.data2
              end data2
         from (
          select deptid
               , from_date
               , to_date
               , data2
               , row_number() over (partition by deptid order by from_date) rn
               , min(from_date) over (partition by deptid) min_from
               , max(to_date) over (partition by deptid) max_to
            from t2
         ) t, lateral( 
             select level l
               from dual
             connect by level <= case t.rn when 1 then 3 else 1 end
         ) r
    )
    select v1.deptid
         , greatest(v1.from_date, v2.from_date) from_date
         , least(v1.to_date, v2.to_date) to_date
         , v1.data1
         , v2.data2
      from v1
      join v2
          on v2.deptid = v1.deptid
         and v2.to_date >= v1.from_date
         and v2.from_date <= v1.to_date
     where v2.data2 is not null or v1.data1 is not null
     order by v1.deptid
            , greatest(v1.from_date, v2.from_date)
    /
    

    I added an extra test case because some of the things I tried I was uncertain whether it would handle multiple overlaps:

    --Test Case 8 (4008)
    --T1  |-------------|
    --T2    |---||---||---|
    
    insert into t1 values (4008, date'2015-01-01', date'2015-01-31', 'jan data 1');
    insert into t2 values (4008, date'2015-01-09', date'2015-01-17', 'jan data a 2');
    insert into t2 values (4008, date'2015-01-18', date'2015-01-26', 'jan data b 2');
    insert into t2 values (4008, date'2015-01-27', date'2015-02-04', 'jan-feb data 2');
    
    --Expected Output:
    --DEPTID  FROM_DATE    TO_DATE      DATA1           DATA2
    --======  ===========  ===========  ==============  ==============
    --4008    01-JAN-2015  08-JAN-2015  jan data 1
    --4008    09-JAN-2015  17-JAN-2015  jan data 1      jan data a 2
    --4008    18-JAN-2015  26-JAN-2015  jan data 1      jan data b 2
    --4008    27-JAN-2015  31-JAN-2015  jan data 1      jan-feb data 2
    --4008    01-FEB-2015  04-FEB-2015                  jan-feb data 2
    

    And I get this output:

        DEPTID FROM_DATE   TO_DATE     DATA1           DATA2          
    ---------- ----------- ----------- --------------- ---------------
          4001 01-JAN-2015 14-JAN-2015 jan data 1                     
          4001 15-JAN-2015 31-JAN-2015 jan data 1      jan-feb data 2 
          4001 01-FEB-2015 14-FEB-2015 feb data 1      jan-feb data 2 
          4001 15-FEB-2015 28-FEB-2015 feb data 1      feb-mar data 2 
          4001 01-MAR-2015 14-MAR-2015                 feb-mar data 2 
          4001 15-MAR-2015 14-APR-2015                 mar-apr data 2 
          4002 01-JAN-2015 14-JAN-2015                 jan data 2     
          4002 15-JAN-2015 31-JAN-2015 jan-feb data 1  jan data 2     
          4002 01-FEB-2015 14-FEB-2015 jan-feb data 1  feb data 2     
          4002 15-FEB-2015 28-FEB-2015 feb-mar data 1  feb data 2     
          4002 01-MAR-2015 14-MAR-2015 feb-mar data 1                 
          4002 15-MAR-2015 14-APR-2015 mar-apr data 1                 
          4003 01-JAN-2015 31-JAN-2015 jan data 1      jan data 2     
          4004 01-JAN-2015 09-JAN-2015 jan data 1                     
          4004 10-JAN-2015 20-JAN-2015 jan data 1      jan data int 2 
          4004 21-JAN-2015 31-JAN-2015 jan data 1                     
          4005 01-JAN-2015 09-JAN-2015                 jan data 2     
          4005 10-JAN-2015 20-JAN-2015 jan data int 1  jan data 2     
          4005 21-JAN-2015 31-JAN-2015                 jan data 2     
          4006 01-JAN-2015 01-JAN-2015 jan data a 1    jan data 2     
          4006 02-JAN-2015 31-JAN-2015 jan data b 1    jan data 2     
          4007 01-JAN-2015 01-JAN-2015 jan data 1      jan data a 2   
          4007 02-JAN-2015 31-JAN-2015 jan data 1      jan data b 2   
          4008 01-JAN-2015 08-JAN-2015 jan data 1                     
          4008 09-JAN-2015 17-JAN-2015 jan data 1      jan data a 2   
          4008 18-JAN-2015 26-JAN-2015 jan data 1      jan data b 2   
          4008 27-JAN-2015 31-JAN-2015 jan data 1      jan-feb data 2 
          4008 01-FEB-2015 04-FEB-2015                 jan-feb data 2 
    
    28 rows selected.
    

    As far as I can tell it gives the expected output in all cases.

    • Thanks for the additional test case, it revealed an issue with my version!

    • I’ve modified it a bit to work in 11g, and it works; see what you think: http://sqlfiddle.com/#!4/58678/2

      • Kim Berg Hansen
        15 October 2015 - 3:22 pm

        Yup, looks like good backport 😉

        Only 12c specific issue in my version is, that I use LATERAL to correlate the CONNECT BY row generator and make it only generate as many rows as needed (1 or 3). Your method of filtering the rowgen in join works fine too, and it’ll work from version 10.

Leave a Reply

Your email address will not be published / Required fields are marked *