A simple data ETL method – nothin’ but SQL

My client has decided to design and build a completely new replacement system for an aging system running on Oracle Forms 6i on Oracle 8. The new system will have a web frontend, backed by Hibernate (don’t get me started) on top of an Oracle 11gR1 database. Crucially, due to changes to business practices and legislation, the new system has been designed “from scratch”, including a new data model.

My task is to write the ETL scripts which will take the data from the legacy database (an Oracle 8i schema), transform it to meet the requirements of the new model, and load it. If you’re looking at building scripts to transform data from one system to another, the method I used might be helpful for you too.

Making it more complicated is their desire that the data move be executed in two stages – (1) before the switch-over, transform and load all “historical” data; (2) at go-live, transform and load all “current” data, as well as any modifications to “historical” data.

Since the fundamental business being supported by this system hasn’t changed very much, the old and new data models have a lot in common – the differences between them are not very complex. In addition, the data volume is not that great (coming from someone who’s worked with terabyte-scale schemas) – the biggest table only had 2 million rows. For these reasons, the purchase of any specialised ETL tools was not considered. Instead, they asked me to write the ETL as scripts that can just be run on the database.

These scripts must be re-runnable: they should be able to be run without modification to pick up any changes in the legacy data, and automatically work out how to merge the changes into the new schema.

The first step for me was to analyse both data models, and work out a mapping for the data between them. The team for the project had a very good idea of what the tables in the new model meant (since they had designed it), but there was no-one available to me to explain how the old data model worked. It was down to me to learn how the legacy data model worked – by exploring it at the database level, examining the source for the forms and reports, and in some cases by talking to the users.

The outcome of this analysis was two spreadsheets: one was a list of every table and column in the legacy database, and the other was a list of every table and column in the new database. For each table in the legacy database, I recorded which table (or tables) the data would be migrated to in the new schema, or an explanation if the data could be safely disregarded. For each table in the new schema, I recorded which table (or tables) in the legacy database would feed into it. In the end, eleven of the tables in the new schema would be loaded.

Then, for each table in the legacy and new schemas, I worked through each column, identifying what it meant, and how it would be mapped from the old to the new. In some cases, the mapping was very 1:1 – perhaps some column names were different, or code values different, but relatively simple. In other cases, the mapping would require a more complex transformation, prehaps based on multiple tables. For example, both systems had a table named “ADDRESS” which stored street or postal addresses; in the old system, this table was a child table to the “PARTY” table; so PARTY was 1:M to ADDRESS. In the new model, however, there was a master “ADDRESS” table which was intended to store any particular address once and only once; the relationship of PARTY to ADDRESS is M:M. De-duplication of addresses hasn’t come up yet but it’s going to be fun when it does 🙂

Thankfully, in no cases was the mapping so complicated that I couldn’t envisage how it could be done using relatively simple SQL.

Once the spreadsheets were filled, I was finally able to start coding!

In order to meet the requirements, my scripts must:

  1. INSERT rows in the new tables based on any data in the source that hasn’t already been created in the destination
  2. UPDATE rows in the new tables based on any data in the source that has already been inserted in the destination
  3. DELETE rows in the new tables where the source data has been deleted

Now, instead of writing a whole lot of INSERT, UPDATE and DELETE statements, I thought “surely MERGE would be both faster and better” – and in fact, that has turned out to be the case. By writing all the transformations as MERGE statements, I’ve satisfied all the criteria, while also making my code very easily modified, updated, fixed and rerun. If I discover a bug or a change in requirements, I simply change the way the column is transformed in the MERGE statement, and re-run the statement. It then takes care of working out whether to insert, update or delete each row.

My next step was to design the architecture for my custom ETL solution. I went to the dba with the following design, which was approved and created for me:

  1. create two new schemas on the new 11g database: LEGACY and MIGRATE
  2. take a snapshot of all data in the legacy database, and load it as tables in the LEGACY schema
  3. grant read-only on all tables in LEGACY to MIGRATE
  4. grant CRUD on all tables in the target schema to MIGRATE.

All my scripts will run as the MIGRATE user. They will read the data from the LEGACY schema (without modifying) and load it into intermediary tables in the MIGRATE schema. Each intermediary table takes the structure of a target table, but adds additional columns based on the legacy data. This means that I can always map from legacy data to new data, and vice versa.

For example, in the legacy database we have a table:

LEGACY.BMS_PARTIES(
 par_id             NUMBER        PRIMARY KEY,
 par_domain         VARCHAR2(10)  NOT NULL,
 par_first_name     VARCHAR2(100) ,
 par_last_name      VARCHAR2(100),
 par_dob            DATE,
 par_business_name  VARCHAR2(250),
 created_by         VARCHAR2(30)  NOT NULL,
 creation_date      DATE          NOT NULL,
 last_updated_by    VARCHAR2(30),
 last_update_date   DATE)

In the new model, we have a new table that represents the same kind of information:

NEW.TBMS_PARTY(
 party_id           NUMBER(9)     PRIMARY KEY,
 party_type_code    VARCHAR2(10)  NOT NULL,
 first_name         VARCHAR2(50),
 surname            VARCHAR2(100),
 date_of_birth      DATE,
 business_name      VARCHAR2(300),
 db_created_by      VARCHAR2(50)  NOT NULL,
 db_created_on      DATE          DEFAULT SYSDATE NOT NULL,
 db_modified_by     VARCHAR2(50),
 db_modified_on     DATE,
 version_id         NUMBER(12)    DEFAULT 1 NOT NULL)

This was the simplest transformation you could possibly think of – the mapping from one to the other is 1:1, and the columns almost mean the same thing.

The solution scripts start by creating an intermediary table:

MIGRATE.TBMS_PARTY(
 old_par_id         NUMBER        PRIMARY KEY,
 party_id           NUMBER(9)     NOT NULL,
 party_type_code    VARCHAR2(10)  NOT NULL,
 first_name         VARCHAR2(50),
 surname            VARCHAR2(100),
 date_of_birth      DATE,
 business_name      VARCHAR2(300),
 db_created_by      VARCHAR2(50),
 db_created_on      DATE,
 db_modified_by     VARCHAR2(50),
 db_modified_on     DATE,
 deleted            CHAR(1))

You’ll notice that the intermediary table has the same columns of the new table (except for VERSION_ID, which will just be 1), along with the minimum necessary to link each row back to the source data – the primary key from the source table, PAR_ID.

You might also notice that there is no unique constraint on PARTY_ID – this is because we needed to do some merging and de-duplication on the party info. I won’t go into that here, but the outcome is that for a single PARTY_ID might be mapped from more than one OLD_PAR_ID.

The second step is the E and T parts of “ETL”: I query the legacy table, transform the data right there in the query, and insert it into the intermediary table. However, since I want to be able to re-run this script as often as I want, I wrote this as a MERGE statement:

MERGE INTO MIGRATE.TBMS_PARTY dest
USING (
  SELECT par_id            AS old_par_id,
         par_id            AS party_id,
         CASE par_domain
           WHEN 'P' THEN 'PE' /*Person*/
           WHEN 'O' THEN 'BU' /*Business*/
         END               AS party_type_code,
         par_first_name    AS first_name,
         par_last_name     AS surname,
         par_dob           AS date_of_birth,
         par_business_name AS business_name,
         created_by        AS db_created_by,
         creation_date     AS db_created_on,
         last_updated_by   AS db_modified_by,
         last_update_date  AS db_modified_on
  FROM   LEGACY.BMS_PARTIES s
  WHERE NOT EXISTS (
    SELECT null
    FROM   MIGRATE.TBMS_PARTY d
    WHERE  d.old_par_id = s.par_id
    AND    (d.db_modified_on = s.last_update_date
            OR (d.db_modified_on IS NULL
               AND s.last_update_date IS NULL))
    )
  ) src
ON (src.OLD_PAR_ID = dest.OLD_PAR_ID)
WHEN MATCHED THEN UPDATE SET
  party_id        = src.party_id        ,
  party_type_code = src.party_type_code ,
  first_name      = src.first_name      ,
  surname         = src.surname         ,
  date_of_birth   = src.date_of_birth   ,
  business_name   = src.business_name   ,
  db_created_by   = src.db_created_by   ,
  db_created_on   = src.db_created_on   ,
  db_modified_by  = src.db_modified_by  ,
  db_modified_on  = src.db_modified_on
WHEN NOT MATCHED THEN INSERT VALUES  (
  src.old_par_id      ,
  src.party_id        ,
  src.party_type_code ,
  src.first_name      ,
  src.surname         ,
  src.date_of_birth   ,
  src.business_name   ,
  src.db_created_by   ,
  src.db_created_on   ,
  src.db_modified_by  ,
  src.db_modified_on  ,
  NULL                );

You’ll notice that all the transformation logic happens right there in a single SELECT statement. This is an important part of how this system works – every transformation is defined in one place and one place only. If I need to change the logic for any column, all I have to do is update it in one place, and re-run the MERGE.

This is a simple example; for some of the tables, the SELECT statement is quite complex.

(Warning: you’ll note that I’ve omitted the column list from the INSERT clause; this can be dangerous if you’re not in complete control of the column order like I am for this particular table)

There is a follow-up UPDATE statement that for a couple of thousand records, changes the PARTY_ID to a different value; in effect, this performs the de-duplication.

Next, we look for any rows that have been deleted:

UPDATE MIGRATE.TBMS_PARTY dest
SET    deleted = 'Y'
WHERE  deleted IS NULL
AND    NOT EXISTS   (
  SELECT null
  FROM   LEGACY.BMS_PARTIES src
  WHERE  src.par_id = dest.old_par_id);

The idea is that the data in the MIGRATE table is *exactly* what we will insert, unmodified, into the target schema. In a year’s time, we could go back to this MIGRATE schema and see what we actually inserted when the system went live. In addition, we’ll be able to go back to the LEGACY schema and see exactly how the data looked in the old system; and we’ll be able to use tables like MIGRATE.TBMS_PARTY to map back-and-forth between the old and new systems.

The final stage of the process is the “L” of “ETL”. This, again, uses a MERGE statement:

MERGE INTO NEW.TBMS_PARTY dest
USING (
  SELECT *
  FROM   MIGRATE.TBMS_PARTY s
  WHERE  s.party_id = s.old_par_id /*i.e. not a duplicate*/
  AND    (s.deleted IS NOT NULL
          OR NOT EXISTS (
            SELECT null
            FROM   NEW.TBMS_PARTY d
            WHERE  d.party_id = s.party_id
            AND    (d.db_modified_on = s.db_modified_on
                    OR (d.db_modified_on IS NULL
                        AND s.db_modified_on IS NULL))
           ) )  
  ) src
ON (src.party_id = dest.party_id)
WHEN MATCHED THEN UPDATE SET
  party_type_code = src.party_type_code ,
  first_name      = src.first_name      ,
  surname         = src.surname         ,
  date_of_birth   = src.date_of_birth   ,
  business_name   = src.business_name   ,
  db_created_by   = src.db_created_by   ,
  db_created_on   = src.db_created_on   ,
  db_modified_by  = src.db_modified_by  ,
  db_modified_on  = src.db_modified_on
DELETE WHERE (src.deleted IS NOT NULL)
WHEN NOT MATCHED THEN INSERT  (
  party_id        ,
  party_type_code ,
  first_name      ,
  surname         ,
  date_of_birth   ,
  business_name   ,
  db_created_by   ,
  db_created_on   ,
  db_modified_by  ,
  db_modified_on  )
VALUES  (
  src.party_type_code ,
  src.first_name      ,
  src.surname         ,
  src.date_of_birth   ,
  src.business_name   ,
  src.db_created_by   ,
  src.db_created_on   ,
  src.db_modified_by  ,
  src.db_modified_on  )
LOG ERRORS
INTO MIGRATE.ERR$_TBMS_PARTY
REJECT LIMIT UNLIMITED;

A few things to note here:

  • The SELECT clause brings back each row from the intermediary table that has not been merged to a new record (by the way, those records are needed because they are used when transforming PAR_ID values in child tables) or that has not been modified since it was previously loaded.
  • The MERGE inserts any new rows, updates all columns for modified rows, and deletes rows that have been marked for deletion.
  • NO transformation of data happens here.
  • If any data fails any validation, the MERGE logs the error and continues, using a table created using this:
BEGIN
DBMS_ERRLOG.create_error_log('NEW.TBMS_PARTY',
  err_log_table_name  => 'ERR$_TBMS_PARTY',
  err_log_table_owner => 'MIGRATE');
END;

I can then query this error table to see if there were any problems, e.g.:

SELECT ORA_ERR_OPTYP$, ORA_ERR_MESG$, COUNT(*)
FROM MIGRATE.ERR$_TBMS_PARTY
GROUP BY ORA_ERR_OPTYP$, ORA_ERR_MESG$;

A common issue is a failed check constraint, e.g. where the old system failed to validate something correctly. We’d then go back and either change the transformation to work around the problem, or send the data back to the business and ask them to fix it in the source.

Each stage of this ETL solution can be restarted and re-run. In fact, that’s what we will be doing; a few weeks prior to go-live, we’ll get a preliminary extract of the old system into the LEGACY schema, and run all my scripts. Then, at go-live, when the old system is taken down, we’ll wipe the LEGACY schema and refresh it from Prod. We will then re-run the scripts to take changes through.

All the scripts for each table had the same structure: one script to create the intermediary table; one script to do the merge into the intermediary table; and one script to merge into the final destination. With the exception of the SELECT statement in the first merge script, which differed greatly for each table, these scripts were very similar, so I started by generating them all. For this I used queries on the data dictionary to generate all the SELECT lists and x = y lists, and after a bit of work I had a complete set of ETL scripts which just needed me to go in and make up the SELECT statement for the transformation.

For this case, a relatively simple data migration problem, this method seems to have worked well. It, or a variation on it, might very well work for you too.


“It works, just not for the reason you think”

A somewhat mistaken attempt to format a number left-padded with up to six zeroes, made me cringe a little:

SELECT LPAD(number_column, 6, 000000) FROM ...

Yes, it does work (the domain of number_column is limited to positive integers), but not for the reason the writer thought, I think.

(Hint: what is the data type of the third parameter to LPAD?)


Count All Rows in a Schema

I wrote this simple SQL*Plus script to loop through all the user’s tables and report how many rows were in them at this point in time. Normally I might estimate this information based on the table statistics, but in this case I had to know if any rows had been inserted since the last stats were gathered. Also, I preferred not to just run the stats gathering job. Obviously, it doesn’t pick up any uncommitted data.

It doesn’t list the tables which have zero (0) rows, but it report how many tables it counted, and the total number of rows across the entire schema.

If run on an 11g database, it will skip the count on tables which have had no segment created yet.

WARNING: this script is suitable for development environments only. Use at your own risk.

SET SERVEROUT ON FEEDBACK OFF
DECLARE
  c  INTEGER;
  tc INTEGER := 0;
  tr INTEGER := 0;
BEGIN
$IF DBMS_DB_VERSION.VER_LE_10 $THEN
  DBMS_OUTPUT.enable(1000000);
$ELSE
  DBMS_OUTPUT.put_line('Only tables with segment created are counted.');
$END
  FOR r IN (
    SELECT table_name
    FROM   user_tables
$IF DBMS_DB_VERSION.VER_LE_10 $THEN
$ELSE
    WHERE  segment_created = 'YES'
$END
    ORDER BY table_name
    )
    LOOP
    tc := tc + 1;
    BEGIN
      EXECUTE IMMEDIATE REPLACE(
        'SELECT COUNT(*) FROM "#TABLE#"'
        ,'#TABLE#',r.table_name)
        INTO c;
      tr := tr + c;
      IF c > 0 THEN
        DBMS_OUTPUT.put_line(
          RPAD(r.table_name,31,' ')
          || TO_CHAR(c,'9G999G999G990'));
      END IF;
    EXCEPTION
      WHEN OTHERS THEN
        DBMS_OUTPUT.put_line(r.table_name);
        DBMS_OUTPUT.put_line(DBMS_UTILITY.format_error_stack);
    END;
  END LOOP;
  DBMS_OUTPUT.put_line('counted '
    || tc || ' table(s)');
  DBMS_OUTPUT.put_line('total rows '
    || TO_CHAR(tr,'9G999G999G990'));
END;
/
SET SERVEROUT OFF FEEDBACK ON

Example output:

SQL> @count_all_tables
Only tables with segment created are counted.
counted 1 table(s)
total rows              0

This tells me that there are no rows yet, and of all the tables, only one has had a segment created for it (e.g. perhaps an INSERT was attempted but rolled back, or the row was deleted). I’m sure you’ll be able to adapt the script to suit your obsessive record-counting needs.


Difference Between Two Timestamps

Easy, right? Perhaps not as straightforward as you’d think.

Method #1: use EXTRACT

extract(day from (x-y))*24*60*60
+ extract(hour from (x-y))*60*60
+ extract(minute from (x-y))*60
+ extract(second from (x-y))

Method #2: use CAST

( CAST( x AS DATE ) - CAST( y AS DATE ) ) * 86400

The difference? Method #2 is faster (my tests indicate faster by a factor of 3), but does not include fractional seconds. Method #1 is a bit slower, but includes fractions of a second. See the SO link for details.


INSERT-VALUES vs. INSERT-SELECT-FROM-DUAL

There’s no difference between the effects of the following two statements, are there:

INSERT INTO mytable (col1, col2) VALUES ('hello','world');
INSERT INTO mytable (col1, col2) SELECT 'hello', 'world' FROM DUAL;

Well, as it turns out, it is possible for the first statement to succeed where the second statement would fail – in the presence of a suitably crafted Before Insert trigger, the second will raise “ORA-04091 table is mutating, trigger/function may not see it”:

http://oraclequirks.blogspot.com/2010/09/ora-04091-table-stringstring-is.html


Generate DML/DDL/QUERY from SQL – the easy way

This is a comment on jagatheesh.ramakrishnan@oracle.com’s article “Generate DML/DDL/QUERY from SQL” – comments are not enabled on their blog so here’s my addendum.

The following query is offered as a means of generating a script to ONLINE all datafiles in a database:

select 'ALTER DATABASE DATAFILE '''|| name || ''' online ;'
from v$datafile;

I suggest an alternative method, which is both easier to write and easier to maintain:

select REPLACE(q'[
ALTER DATABASE DATAFILE '#NAME#' online ;
]','#NAME#',name) from v$datafile;

This way, the syntax within the DDL is unmuddied by the syntax required by the query to generate it. It’s immediately obvious that only single quotes will surround the name of the datafile in the generated DDL.

If  you’re on a pre-10g database, you can still use this method, but you’ll need to revert to the old quote-escape:

select REPLACE('
ALTER DATABASE DATAFILE ''#NAME#'' online ;
','#NAME#',name) from v$datafile;

Never satisfied

So I followed the great advice here to use the new COLLECT function in Oracle 10g to solve the common string-aggregation-in-SQL problem. I chose this solution over the others because it sorts the results as well, which was a user requirement. This is because the COLLECT function optionally takes an ORDER BY clause – even though the 10g documentation forgot to mention this. The 11g documentation was updated to include the additional syntax option, as well as the fact that COLLECT supports the DISTINCT (or UNIQUE) keyword as well to remove duplicates – testing indicates that this works in 10g as well.

This means that if I define a suitable type varchar2_ntt and a suitable function ntt_to_string, I can write a query like this:

SELECT dept,
       ntt_to_string(
          CAST(
            COLLECT(ename ORDER BY ename)
          AS varchar2_ntt)
       ) AS ename_list
FROM emp
GROUP BY dept;

That works fine. But now I want to combine this with some analytic functions, like this:

SELECT DISTINCT
       dept,
       ntt_to_string(
          CAST(
            COLLECT(ename ORDER BY ename)
          AS varchar2_ntt)
       ) AS ename_list,
       FIRST_VALUE(ename)
          OVER (PARTITION BY dept
                ORDER BY sal DESC) AS topdog
FROM emp;

This doesn’t work because (a) the COLLECT requires a GROUP BY; and (b) the analytic function cannot be used along with a GROUP BY. What I’d expect to be able to do is use an analytic version of the COLLECT:

SELECT DISTINCT
       dept,
       ntt_to_string(
          CAST(
            COLLECT(ename
                    PARTITION BY dept
                    ORDER BY ename)
          AS varchar2_ntt)
       ) AS ename_list,
       FIRST_VALUE(ename)
          OVER (PARTITION BY dept
                ORDER BY sal DESC) AS topdogFROM emp;

Us SQL developers are never satisfied, are we?

At this stage my best solution for this is to use a CTE:

WITH q AS (SELECT dept, ename, sal FROM emp)
SELECT DISTINCT
       dept,
       (SELECT ntt_to_string(
                 CAST(
                   COLLECT(q2.ename ORDER BY q2.ename)
                 AS varchar2_ntt)
               )
        FROM q q2
        WHERE q2.dept = q.dept
       ) AS ename_list,
       FIRST_VALUE(ename)
          OVER (PARTITION BY dept
                ORDER BY sal DESC) AS topdog
FROM q;

Purge all Recyclebins without SYSDBA

Want to purge all the recyclebins for all users in your database?

Can’t (or don’t want to) log in as each user individually to run this, one by one:

PURGE RECYCLEBIN

You could log in as SYSDBA and run:

PURGE DBA_RECYCLEBIN

But, what if you don’t have SYSDBA access? (Yes, this has happened to me, with a client-supplied VM – they told me the SYSTEM password but not the SYS password)

You can use the PURGE TABLESPACE x USER y option – run this to generate the commands to run, then run the output as a script:

SELECT DISTINCT
       REPLACE(REPLACE('PURGE TABLESPACE "#TS#" USER "#OWNER#";'
       ,'#TS#',ts_name)
       ,'#OWNER#',owner)
FROM dba_recyclebin
WHERE ts_name IS NOT NULL;
PURGE TABLESPACE "USER_DATA" USER "SCOTT";
PURGE TABLESPACE "USER_DATA" USER "LARRY";

WARNING: this command will result in irreversible loss of data for all users on your database.

Tested on Oracle 10gR2.


A fascinating SQL problem

Can you solve this problem in SQL – i.e. without resorting to a procedural solution?

SQL combine multiple identifiers to create a group id for duplicate records

“I have a data extract with three different identifiers: A, B, C
Each identifier may appear in more than one row, and each row may have one or more of these three identifiers (i.e the column is populated or null).
I want to be able to group all records that have any combination of either A, B or C in common and assign them the same group id.
Extract table showing what the eventual groups should be:

A    | B    | C    | Group
====   ====   ====   =====
p      NULL   NULL   1
p      r      NULL   1
q      NULL   NULL   2
NULL   r      NULL   1
NULL   NULL   s      2
q      NULL   s      2

So, the input data is a table with three columns (A, B, and C), some of which are NULL. The output is a third column, “Group”, which will be assigned a number which classifies the row into a “group”. Each group will be distinct in that none of its members will have a value in A, B or C that appears in any row in any other group.

This question is fascinating because it cannot be solved, I believe, without some form of iteration. If I get the row (p) along with (q), they are in two different groups; but, if I add the row (p,q), all of a sudden my original rows are now in the same group along with the new row.

The solution will probably have to examine each row in consideration with the entire record set – an operation of O(n^2), if my understanding of CS theory is correct. I suspect a solution using at least a CTE and/or the MODEL clause will be required.

UPDATE:
An elegant solution, using a hierarchical query and Oracle’s CONNECT_BY_ROOT function, has been posted by Vincent Malgrat.


Negative, Captain

Seen in the wild:

... WHERE substr(amount,0, 1) != '-'

If you wanted to query a table of monetary transactions for any refunds (i.e. where the transaction amount is negative), how would you do it? Perhaps you’d think about avoiding problems that might occur if the default number format were to change, hm?

(before you say it: no, there is no index on amount, so it wasn’t a misguided attempt to avoid an index access path…)