Non-Overlapping Dates Constraint

If you have a table that represents time-varying information, e.g. with From and To date/time columns, you have a few options with regard to the problem of overlapping dates:

1. Check for overlapping dates in the application layer.

2. Use an off-the-shelf product to generate the appropriate triggers, e.g. Oracle CDM*RuleFrame or Toon Koppelaars’ RuleGen.

3. Roll your own, in the database.

4. Use a different data model that can use a unique constraint.

5. Forget about it.

One reason it’s difficult is that this is an example of a cross-row constraint, i.e. one that cannot be checked merely against the current row by itself. Oracle natively supports only a few kinds of cross-row constraints – namely Primary Key, Unique and Foreign Key constraints – but it doesn’t support arbitrary assertions, which would allow us to simply declare this sort of constraint.

The real challenge comes from the fact that Oracle is a multi-user system and concurrent sessions cannot see the uncommitted data from other sessions; so some form of serialization will be required to ensure that when one session wishes to insert/update the data for a particular entity, no other session is allowed to start working on the same entity until the first session commits (or issues a rollback).

The problem is not new; it’s been around for a long time, and has tripped up many a new (and old) programmer.

There are two problems with option #1 (code in the application layer): firstly, you have to repeat the code for each different type of client (e.g. you might have a Java UI on one side, as well as some batch script somewhere else); secondly, usually the programmer involved will not understand the concurrency problem mentioned above and will not take it into account.

Option #2 is probably the best, most of the time. The solution is implemented at the database level, and is more likely to work correctly and efficiently.

Option #4 (change the data model) involves not storing the From and To dates, but instead dividing up all time ranges into discrete chunks, and each row represents a single chunk of time. This solution is valid if the desired booking ranges are at predictable discrete ranges, e.g. daily. You can then use an ordinary unique constraint to ensure that each chunk of time is only booked by one entity at any one time. This is the solution described here.
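For example, a minimal sketch of such a model for daily bookings might look like this (table and column names are illustrative only):

CREATE TABLE room_booking_day
  (room_no    NUMBER NOT NULL
  ,booked_day DATE   NOT NULL
  ,booked_by  VARCHAR2(30) NOT NULL
  ,CONSTRAINT room_booking_day_pk
     PRIMARY KEY (room_no, booked_day)
  ,CONSTRAINT room_booking_day_time_ck
     CHECK (booked_day = TRUNC(booked_day))
  );

-- a 3-night booking is simply three rows; any other booking
-- that covers one of those days violates the primary key
INSERT INTO room_booking_day VALUES (101, DATE '2000-01-01', 'ALICE');
INSERT INTO room_booking_day VALUES (101, DATE '2000-01-02', 'ALICE');
INSERT INTO room_booking_day VALUES (101, DATE '2000-01-03', 'ALICE');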

Option #5 (forget about it) is also a viable option, in my opinion. Basically it entails designing the rest of the application around the fact that overlapping date ranges might exist in the table – e.g. a report might simply merge the date ranges together prior to output.

Option #3, where you implement the triggers yourself in the database, has the same advantage as Option #2: it doesn’t matter which application the data comes from, the constraint will hold true. However, you have to be really careful, because due to concurrency it’s much easier to get it wrong than to get it right.

I hear you scoffing, “Triggers?!?”. I won’t comment, except to refer you to this opinion, which says it better than I could myself: The fourth use-case for Triggers.

There is also a no-trigger variation of Option #3 that uses a materialized view instead; I’ll describe this alternative at the end of this post.

So, here is a small example showing how a non-overlapping dates constraint may be implemented. It is intentionally simple to illustrate the approach: it assumes that the From and To dates cannot be NULL, and its rule is that the date ranges must not overlap at all, to the nearest second.

  1. Create the tables
    CREATE TABLE room
      (room_no NUMBER NOT NULL
      ,CONSTRAINT room_pk PRIMARY KEY (room_no)
      );
    
    CREATE TABLE room_booking
      (room_no NUMBER NOT NULL
      ,booked_from DATE NOT NULL
      ,booked_to DATE NOT NULL
      ,CONSTRAINT room_booking_pk
         PRIMARY KEY (room_no, booked_from)
      ,CONSTRAINT room_booking_fk
         FOREIGN KEY (room_no) REFERENCES room (room_no)
      );
    

  2. Create the validation trigger (note – I’ve used an Oracle 11g compound trigger here, but it could easily be rewritten for earlier versions by using two triggers + a database package):
    CREATE OR REPLACE TRIGGER room_booking_trg
      FOR INSERT OR UPDATE OF room_no, booked_from, booked_to
        ON room_booking
      COMPOUND TRIGGER
    
      TYPE room_no_array IS TABLE OF CHAR(1)
        INDEX BY BINARY_INTEGER;
    
      room_nos room_no_array;
    
    PROCEDURE lock_room (room_no IN room.room_no%TYPE) IS
      dummy room.room_no%TYPE;
    BEGIN
      SELECT r.room_no
      INTO dummy
      FROM room r
      WHERE r.room_no = lock_room.room_no
      FOR UPDATE;
    END lock_room;
    
    PROCEDURE validate_room (room_no IN room.room_no%TYPE) IS
      overlapping_booking EXCEPTION;
      dummy CHAR(1);
    BEGIN
      -- check for overlapping date/time ranges
      BEGIN
        SELECT 'X' INTO dummy
        FROM room_booking rb1
            ,room_booking rb2
        WHERE rb1.room_no = validate_room.room_no
        AND rb2.room_no = validate_room.room_no
        AND rb1.booked_from != rb2.booked_from
        AND (
             rb1.booked_from BETWEEN rb2.booked_from
                                 AND rb2.booked_to
             OR
             rb1.booked_to BETWEEN rb2.booked_from
                               AND rb2.booked_to
            )
        AND ROWNUM = 1;
        RAISE overlapping_booking;
      EXCEPTION
        WHEN NO_DATA_FOUND THEN
          -- good, no constraint violations
          NULL;
      END;
    EXCEPTION
      WHEN overlapping_booking THEN
        RAISE_APPLICATION_ERROR(-20000,
          'Overlapping booking for room #' || room_no);
    END validate_room;
    
    PROCEDURE validate_rooms IS
      room_no room.room_no%TYPE;
    BEGIN
      room_no := room_nos.FIRST;
      LOOP
        EXIT WHEN room_no IS NULL;
        validate_room (room_no);
        room_no := room_nos.NEXT(room_no);
      END LOOP;
      room_nos.DELETE;
    EXCEPTION
      WHEN OTHERS THEN
        room_nos.DELETE;
        RAISE;
    END validate_rooms;
    
    BEFORE EACH ROW IS
    BEGIN
      -- lock the header record (so other sessions
      -- can't modify the bookings for this room
      -- at the same time)
      lock_room(:NEW.room_no);
      -- remember the room_no to validate later
      room_nos(:NEW.room_no) := 'Y';
    END BEFORE EACH ROW;
    
    AFTER STATEMENT IS
    BEGIN
      validate_rooms;
    END AFTER STATEMENT;
    
    END room_booking_trg;
    /
    

That’s all you need. The trigger locks the header record for the room, so only one session can modify the bookings for a particular room at any one time. If you don’t have a table like “room” in your database that you can use for this purpose, you could use DBMS_LOCK instead (similarly to that proposed in the OTN forum discussion here).
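For example, lock_room could be rewritten along these lines (a sketch only – the lock id offset is arbitrary, and you would need to ensure it cannot collide with any other use of DBMS_LOCK in the database):

PROCEDURE lock_room (room_no IN room_booking.room_no%TYPE) IS
  lock_result INTEGER;
BEGIN
  -- request an exclusive user lock keyed on the room number;
  -- released automatically on COMMIT or ROLLBACK
  lock_result := DBMS_LOCK.REQUEST
    (id                => 1000000000 + room_no
    ,lockmode          => DBMS_LOCK.X_MODE
    ,timeout           => DBMS_LOCK.MAXWAIT
    ,release_on_commit => TRUE);
  IF lock_result NOT IN (0, 4) THEN -- 0 = success, 4 = already held
    RAISE_APPLICATION_ERROR(-20001,
      'Unable to lock room #' || room_no
      || ' (DBMS_LOCK result=' || lock_result || ')');
  END IF;
END lock_room;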

It would not be difficult to adapt this example for alternative requirements, e.g. where the From and To dates may be NULL, or where the overlapping criteria should allow date/time ranges that coincide at their endpoints (e.g. so that the date ranges (1-Feb-2000 to 2-Feb-2000) and (2-Feb-2000 to 3-Feb-2000) would not be considered to overlap). You’d just need to modify the comparison in the query in validate_room to take these requirements into account.
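For example, to allow ranges that merely touch at their endpoints, the overlap check in validate_room could use the standard interval-overlap test instead (a sketch, untested):

SELECT 'X' INTO dummy
FROM room_booking rb1
    ,room_booking rb2
WHERE rb1.room_no = validate_room.room_no
AND rb2.room_no = validate_room.room_no
AND rb1.booked_from != rb2.booked_from
AND rb1.booked_from < rb2.booked_to
AND rb1.booked_to   > rb2.booked_from
AND ROWNUM = 1;

With this version, (1-Feb-2000 to 2-Feb-2000) and (2-Feb-2000 to 3-Feb-2000) are not considered to overlap, because neither range extends strictly past the other’s endpoint.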

Test case #1

INSERT INTO room (room_no) VALUES (101);
INSERT INTO room (room_no) VALUES (201);
INSERT INTO room (room_no) VALUES (301);

INSERT INTO room_booking (room_no, booked_from, booked_to)
VALUES (101, DATE '2000-01-01', DATE '2000-01-02' - 0.00001);

INSERT INTO room_booking (room_no, booked_from, booked_to)
VALUES (101, DATE '2000-01-02', DATE '2000-01-03' - 0.00001);

INSERT INTO room_booking (room_no, booked_from, booked_to)
VALUES (201, DATE '2000-02-01', DATE '2000-02-05' - 0.00001);

Expected: no errors

Test case #2

INSERT INTO room_booking (room_no, booked_from, booked_to)
VALUES (201, DATE '2000-02-01', DATE '2000-02-02' - 0.00001);

INSERT INTO room_booking (room_no, booked_from, booked_to)
VALUES (201, DATE '2000-02-02', DATE '2000-02-04' - 0.00001);

INSERT INTO room_booking (room_no, booked_from, booked_to)
VALUES (201, DATE '2000-02-03', DATE '2000-02-05' - 0.00001);

INSERT INTO room_booking (room_no, booked_from, booked_to)
VALUES (201, DATE '2000-02-03', DATE '2000-02-06' - 0.00001);

INSERT INTO room_booking (room_no, booked_from, booked_to)
VALUES (201, DATE '2000-01-31', DATE '2000-02-01');

INSERT INTO room_booking (room_no, booked_from, booked_to)
VALUES (201, DATE '2000-01-31', DATE '2000-02-02' - 0.00001);

INSERT INTO room_booking (room_no, booked_from, booked_to)
VALUES (201, DATE '2000-01-31', DATE '2000-02-06' - 0.00001);

INSERT INTO room_booking (room_no, booked_from, booked_to)
VALUES (201, DATE '2000-02-05' - 0.00001, DATE '2000-02-06' - 0.00001);

UPDATE room_booking SET booked_to = DATE '2000-01-02' - 0.00001
WHERE room_no = 101 AND booked_from = DATE '2000-01-02';

Expected: constraint violation on each statement

Test case #3

in session #1:

INSERT INTO room_booking (room_no, booked_from, booked_to)
VALUES (301, DATE '2000-01-01', DATE '2000-02-01' - 0.00001);

in session #2:

INSERT INTO room_booking (room_no, booked_from, booked_to)
VALUES (301, DATE '2000-01-15', DATE '2000-01-16' - 0.00001);

Expected: session #2 will wait until session #1 issues a COMMIT or ROLLBACK. If session #1 COMMITs, session #2 will then report a constraint violation. If session #1 rolls back, session #2 will complete without error.

The No-Trigger option #3: Materialized View

This is similar to a solution proposed by Rob van Wijk. It uses a check constraint on a materialized view to prevent overlapping date ranges.

So, instead of the trigger, you would do something like this:

CREATE MATERIALIZED VIEW LOG ON room_booking WITH ROWID;

CREATE MATERIALIZED VIEW room_booking_date_ranges
  REFRESH FORCE ON COMMIT
  AS SELECT 'X' AS dummy
     FROM room_booking rb1
         ,room_booking rb2
     WHERE rb1.room_no = rb2.room_no
     AND rb1.booked_from != rb2.booked_from
     AND (
          rb1.booked_from BETWEEN rb2.booked_from
                              AND rb2.booked_to
          OR
          rb1.booked_to BETWEEN rb2.booked_from
                            AND rb2.booked_to
         );

ALTER TABLE room_booking_date_ranges
  ADD CONSTRAINT no_overlapping_dates_ck
  CHECK ( dummy = 'Z' );

The nice thing about this solution is that it is simpler to code, and seems more “declarative” in nature. Also, you don’t have to worry about concurrency at all.

The constraint is checked at COMMIT-time when the materialized view is refreshed; so it behaves like a deferred constraint, which may be an advantage for some scenarios.
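For example, with the room 201 booking from test case #1 already committed, an overlapping insert succeeds at first, but the violation surfaces at COMMIT time with something like this (the materialized view can only ever be empty, since any row it contains would have dummy = 'X', violating the check constraint):

INSERT INTO room_booking (room_no, booked_from, booked_to)
VALUES (201, DATE '2000-02-01', DATE '2000-02-02' - 0.00001);
-- 1 row created (no error yet)

COMMIT;
-- ORA-12008: error in materialized view refresh path
-- ORA-02290: check constraint (SCOTT.NO_OVERLAPPING_DATES_CK) violated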

I believe it may perform better than the trigger-based option when large volumes of data are inserted or updated; however it may perform worse than the trigger-based option when you have lots of small transactions. This is because, unfortunately, the query here cannot be a “REFRESH FAST ON COMMIT” (if you know how this could be changed into a REFRESH FAST MV, please let me know!).

 

What do you think? If you see any potential issues with the above solutions please feel free to comment.

EDIT 30/8: added some more test cases


Fabian Pascal is back and Debunking again

[Image of a book. Caption: This book takes pride of place on my bookshelf. Highly recommended reading for anyone in the database industry.]

If you haven’t seen Fabian Pascal’s blog before, it’s because he’s only just started it – but he’ll be publishing new material, as well as articles previously published at Database Debunkings, infamous for its fundamental, no-holds-barred, uncompromising take on what the Relational Model is, what it isn’t, and what that means for all professionals who design databases.

It was with sadness that I saw the site go relatively static over the past years, and to see it being revived is a fresh blast of cool air in a world that continues to be inundated by fads and misconceptions. Of particular note was the “VOCIFEROUS IGNORANCE HALL OF SHAME”… I’m looking forward to seeing the old vigorous debates that will no doubt be revived or rehashed.

The pure view of the Relational model of data is, perhaps, too idealistic for some – impractical for day-to-day use in a SQL-dominated world. Personally, I’ve found (although I cannot pretend to be an expert, in any sense, on this topic) that starting from a fundamentally pure model, unconstrained by physical limitations, conceived at an almost ideal, Platonic level, allows me to discover the simplest, most provably “correct” solution for any data modelling problem. At some stage I have to then “downgrade” it to a form that is convenient and pragmatic for implementation in a SQL database like Oracle; in spite of this, having that logical design in the back of my head helps to highlight potential inconsistencies or data integrity problems that must then be handled by the application.

That this situation is, in fact, not the best of all possible worlds, is something that we can all learn and learn again. Have a look, and see what you think: dbdebunk.blogspot.com.au.


DATE is not a date!

Consider:

  • DATE is not a date.
  • DATE '2012-06-22' is not a date.
  • CURRENT_DATE is not a date.
  • SYSDATE is not a date.
  • TRUNC(SYSDATE) is not a date.
  • TO_DATE('22/06/2012','DD/MM/YYYY') is not a date.

Oracle SQL does not have a native date datatype.

Explanation: the datatype called “DATE” actually represents a date+time. It always has a time portion. For example, the literal expression DATE '2012-06-22' has a time component that means midnight, to the nearest second. It is only by convention that a DATE with a time of 00:00:00 is used to represent the whole day, rather than exactly midnight; but this convention only works if the developer carefully writes his code to ensure this is true.
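A quick demonstration (the output is illustrative, assuming it is run at 2:35pm on 22 June 2012):

SELECT TO_CHAR(DATE '2012-06-22', 'DD/MM/YYYY HH24:MI:SS') AS d1
      ,TO_CHAR(SYSDATE,           'DD/MM/YYYY HH24:MI:SS') AS d2
FROM dual;

D1                   D2
-------------------  -------------------
22/06/2012 00:00:00  22/06/2012 14:35:07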

[Image: a fragment of the Antikythera mechanism. It is theorized that the ancients solved “once and for all” the problem with dates vs. times.]

To an experienced Oracle developer, this is not a problem: it feels perfectly natural to say “DATE” while internally thinking in terms of time values. It’s easy to forget that we’re even doing this subconsciously. To an outsider (e.g. business analysts, testers, Java developers), it can be confusing – and it is often not immediately obvious that the confusion exists. It’s not until you start explaining why SYSDATE <= DATE '2012-06-22' evaluates to FALSE, even if today is the 22nd day of the month of June in the year of our Lord 2012, that you realise they have been labouring under a false assumption: that a DATE is a date, that it represents the full 24-hour period that a normal human would call “22 June 2012”.
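If what is actually meant is “any time on or before 22 June 2012”, the predicate needs to take the time portion into account, e.g.:

-- FALSE for most of 22 June, because it compares against midnight:
SYSDATE <= DATE '2012-06-22'

-- what was probably intended: any time up to the end of 22 June
SYSDATE < DATE '2012-06-22' + 1

-- equivalent (TRUNC on SYSDATE is harmless; no indexed column involved):
TRUNC(SYSDATE) <= DATE '2012-06-22'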

If I were invited to change the name of just one thing in Oracle (and everyone was willing to make all the changes necessary to their code to accommodate my whim), it would be to change “DATE” to “DATETIME”.

I ask you: is there anything else in Oracle that confuses outsiders more often than this misnomer?

P.S. Now for a freebie: here is a summary of a number of transformations that may be done to remove a TRUNC function call around a date column (a and b are both of type DATE):

 TRUNC(a) =  TRUNC(b)  =>  (a BETWEEN TRUNC(b) AND TRUNC(b)+0.99999)
 TRUNC(a) <  TRUNC(b)  =>  a < TRUNC(b)
 TRUNC(a) <= TRUNC(b)  =>  a < TRUNC(b)+1
 TRUNC(a) >  TRUNC(b)  =>  a >= TRUNC(b)+1
 TRUNC(a) >= TRUNC(b)  =>  a >= TRUNC(b)
 TRUNC(a) =  b         =>  (a BETWEEN b AND b+0.99999
                            AND b = TRUNC(b))
 TRUNC(a) <  b         =>  (a < TRUNC(b)+1
                            AND NOT (b=TRUNC(b) AND a>=TRUNC(b)))
 TRUNC(a) <= b         =>  a < TRUNC(b)+1
 TRUNC(a) >  b         =>  a >= TRUNC(b)+1
 TRUNC(a) >= b         =>  (a >= TRUNC(b)
                            AND (b=TRUNC(b) OR a>=TRUNC(b)+1))
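For example, applying the first transformation to a filter on a date column (orders and created_date are hypothetical; assume an ordinary index on created_date):

-- the TRUNC call prevents a plain index on created_date being used:
SELECT * FROM orders
WHERE TRUNC(created_date) = DATE '2012-06-22';

-- transformed: returns the same rows, but the index is now usable:
SELECT * FROM orders
WHERE created_date BETWEEN DATE '2012-06-22'
                       AND DATE '2012-06-22' + 0.99999;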

Constrain a table to only 1 row

I needed a table that could only ever have one row – if anyone tried to insert a second row they’d get an error.

CREATE UNIQUE INDEX only_one_row_allowed ON mytable (1);

Testing it:

INSERT INTO mytable VALUES ('x');
1 row created.

INSERT INTO mytable VALUES ('x');
ORA-00001: unique constraint (SCOTT.ONLY_ONE_ROW_ALLOWED) violated


DEFAULT NULL NULL

Found myself running the following command today:

ALTER TABLE mytable
MODIFY (mycolumn DEFAULT NULL NULL);

(The first NULL sets the column’s default; the second makes the column nullable.) I know – I’m easily entertained 🙂


3 Reasons to Hate Hibernate

Warning: this is a rant.

This is just a collection of observations of Hibernate, from the perspective of an Oracle developer/”DBA”. I’m aware of some of the benefits of using Hibernate to shield Java developers from having to know anything about the database or the SQL language, but sometimes it seems to me that we might generally be better off if they were required to learn a little about what’s going on “underneath the hood”. (Then I remind myself that it’s my job to help them get the most out of the database the client spent so much money getting.)

So, here are my gripes about Hibernate – just getting them off my chest so I can put them to bed.

Disclaimer: I know every Hibernate aficionado will jump in with “but it’s easy to fix that, all you have to do is…” but these are generalizations only.

Exhibit A: Generic Query Generators

As soon as I’d loaded all the converted data into the dev and test instances, we started hitting silly performance issues. A simple search on a unique identifier would take 20-30 seconds to return at first, then settle down to 4-8 seconds a pop. Quite rightly, everyone expected these searches to be virtually instant.

The culprit was usually a query like this:

select count(*) as y0_
from XYZ.SOME_TABLE this_
inner join XYZ.SOME_CHILD_TABLE child1_
on this_.PARENT_ID=child1_.PARENT_ID
where lower(this_.UNIQUE_IDENTIFIER) like :1
order by child1_.COLH asc, child1_.COLB asc, this_.ANOTHER_COL desc

What’s wrong with this query, you might ask?

Issue 1: Case-insensitive searches by default

Firstly, it is calling LOWER() on the unique identifier – which will never contain any alphabetic characters, so a case-insensitive search will never be required – and as a result the query cannot use the unique index on that column. Instead of forcing the developers to think about whether a case-insensitive search is required for each column, Hibernate allows them to simply blanket the whole system with them. Quite often no-one will notice until the system goes into UAT (or even Prod), when someone actually tests a search on that particular column and decides that waiting half a minute is unacceptable. In some cases even this won’t happen, and these poorly performing queries (along with their associated load on the database server) will be run all the time, while people complain about the general poor performance of the database.

Issue 2: Count first, then re-query for the data

Secondly, it is doing a COUNT(*) on a query which will immediately after be re-issued in order to get the actual data.  I’d much prefer that the developers were writing the SQL by hand. That way, it’d be a trivial matter to ask them to get rid of the needless COUNT(*) query; and if they simply must show a total record count on the page, add a COUNT(*) OVER () to the main query – thus killing two birds with one efficient stone.
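For example, something along these lines (using the names from the generated query above, and dropping the needless LOWER() as per Issue 1) returns the data and the total record count in a single query:

select this_.*, count(*) over () as total_count
from XYZ.SOME_TABLE this_
inner join XYZ.SOME_CHILD_TABLE child1_
on this_.PARENT_ID=child1_.PARENT_ID
where this_.UNIQUE_IDENTIFIER like :1
order by child1_.COLH asc, child1_.COLB asc, this_.ANOTHER_COL desc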

Exhibit B: Magical Class Generators (tables only)

Issue 3: No views, no procedures, no functions

When someone buys Hibernate, they might very well ask: is it possible to call an Oracle procedure or function with this product? And the answer is, of course, “yes”. Sure, you can do anything you want!

The day the Java developers peel off the shrinkwrap, the first thing they try is creating a Java class based on a single table. With glee they see it automagically create all the member attributes and getter/setter methods, and with no manual intervention required they can start coding the creation, modification and deletion of records using this class, which takes care of all the dirty SQL for them.

Then, the crusty old Oracle developer/”DBA” comes along and says: “It’d be better if you could use this API I’ve lovingly crafted in a PL/SQL package – everything you need is in there, and you’ll be shielded from any complicated stuff we might need to put in the database now or later. All you have to do is call these simple procedures and functions.” And the Java developer goes “sure, no problem” – until they discover that Hibernate cannot automatically create the same kind of class they’ve already gotten accustomed to.

“What, we actually need to read the function/procedure definition and hand-code all the calls to them? No sir, not happening.” After all, they bought Hibernate to save them all that kind of work, and who’s going to blame them?

So, you say, “Ok, no problem, we’ll wrap the API calls with some simple views, backed by instead-of triggers.” But then they hit another wall – Hibernate can’t tell from a view definition how that view relates to other views or tables.
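For the record, the sort of thing I had in mind was (all names hypothetical):

CREATE OR REPLACE VIEW customers_v AS
SELECT customer_id, customer_name
FROM customers;

CREATE OR REPLACE TRIGGER customers_v_ioi
  INSTEAD OF INSERT ON customers_v
  FOR EACH ROW
BEGIN
  -- delegate to the hand-crafted PL/SQL API
  customers_api.ins
    (p_customer_id   => :NEW.customer_id
    ,p_customer_name => :NEW.customer_name);
END;
/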

The end result is that all the Java code does is access tables directly. And you get the kind of queries (and worse) that you saw in Exhibit “A” above.

There. I feel so much better already.

/rant


MERGE to table with Virtual Columns + Error Log Table = ORA-03113

This is just in case you come across this. It appears to be an Oracle bug. I don’t know if it’s been fixed in a later release. I haven’t isolated a simple repeatable test case yet, so I don’t know if the scenario here includes all the relevant details or not.

Version: Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 – 64bit Production

Scenario:

  • MERGE into a table that has one or more virtual columns.
  • The MERGE statement has a “LOG ERRORS” clause.
  • The error log table is a standard one created using DBMS_ERRLOG.create_error_log.

Other factors that may or may not be involved are:

  • the target table is in a different schema
  • the merge statement queries another table with the same name in the current schema
  • the merge statement includes WHEN MATCHED THEN UPDATE, DELETE WHERE, and WHEN NOT MATCHED THEN INSERT clauses
  • the merge statement has a WHERE clause on the WHEN NOT MATCHED clause

Result:

ORA-03113: end-of-file on communication channel and/or ORA-03114: not connected to ORACLE

This happens whether the MERGE is executed or just explain-plan’ed – so it shouldn’t be related to the actual data in either the source or destination tables.

Workaround:

After creating the log table, but before the merge, drop the columns from the error log that are virtual columns in the target table.
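For example (assuming the default ERR$_ naming, and a virtual column called, say, derived_total on the target table):

BEGIN
  DBMS_ERRLOG.create_error_log(dml_table_name => 'TARGET_TABLE');
END;
/

-- drop the error-log column that corresponds to the virtual column:
ALTER TABLE err$_target_table DROP COLUMN derived_total;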


Quick and simple frequency analysis

I use this simple query quite often when exploring the data in a table in any Oracle database (from Oracle v8 onwards):

select q.*, 100 * ratio_to_report(c) over () rtr
from (select distinct v, count(*) over (partition by v) c
      from (select MYCOLUMN v from MYTABLE)
     ) q
order by c desc;

Just substitute the table name for “MYTABLE” and the column you’re interested in for “MYCOLUMN”. This gives a frequency analysis of values, e.g.:

V         C       RTR
========  ======  =============
INACTIVE  401001  92.9254049544
ACTIVE    30529   7.0745950455

V is the value from the column. C is the count of how many times that value appeared. RTR is the % ratio to the total. The first row indicates the most popular value.

If it’s a very large table and you want quicker results, you can run the analysis over a smaller sample easily, just by adding the SAMPLE keyword:

...
select MYCOLUMN v from MYTABLE SAMPLE(1)
...

SPOD (Single Point Of Definition) for a query

I have two queries that need to be executed by a PL/SQL program. Both of them are quite complex, and both of them have a large section which is identical – because they are different views of the same underlying source data.

One option is to expand out both queries in full, e.g.:

Query 1:

 SELECT <complicated expressions>
 FROM (
       <large complicated query>
      ), <other tables>
 WHERE <complicated predicates>;

Query 2:

 SELECT <different complicated expressions>
 FROM (
       <large complicated query>
      ), <other different tables>
 WHERE <different complicated predicates>;

I don’t like the fact that my <large complicated query> is repeated in full in both cursor definitions. I’d rather have one place where that subquery is defined, because it should remain the same for both queries, since they are supposed to be different views of the same underlying data.

Another option is to create a view on the <large complicated query>, and refer to that in both queries. This is a perfectly acceptable option, and one which I often use. The only downside is if there are any parameters that need to be “fed” to the view. One way is for the view to expose the parameter as a column in the view, and for the calling query to simply query it on that column. This is not always the most efficient method, however, depending on the complexity of the view and how well Oracle can “push down” the predicate into the view at execution time. Another solution to the parameter problem is to use a user-defined context as described here.
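For example, a sketch of the user-defined context approach (all names hypothetical):

CREATE OR REPLACE CONTEXT report_ctx USING report_ctx_pkg;

CREATE OR REPLACE PACKAGE report_ctx_pkg IS
  PROCEDURE set_start_date (p_date IN DATE);
END report_ctx_pkg;
/

CREATE OR REPLACE PACKAGE BODY report_ctx_pkg IS
  PROCEDURE set_start_date (p_date IN DATE) IS
  BEGIN
    DBMS_SESSION.SET_CONTEXT('REPORT_CTX', 'START_DATE',
      TO_CHAR(p_date, 'YYYY-MM-DD'));
  END set_start_date;
END report_ctx_pkg;
/

CREATE OR REPLACE VIEW big_subquery_v AS
SELECT t.*
FROM some_table t
WHERE t.txn_date >= TO_DATE(
  SYS_CONTEXT('REPORT_CTX', 'START_DATE'), 'YYYY-MM-DD');

The calling code sets the parameter with report_ctx_pkg.set_start_date before querying the view.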

The other downside which I don’t like for this case is that the view moves the query away from the package – I’d prefer to have the definitions close together and maintained in one location.

The solution which I used in this case is a pipelined function. For example:

 FUNCTION large_complicated_query
   RETURN source_data_table_type
   PIPELINED IS
   rc source_data_type;
 BEGIN
   FOR r IN (<large complicated query>) LOOP
     rc.col1 := r.col1;
     rc.col2 := r.col2;
     -- etc.
     PIPE ROW (rc);
   END LOOP;
   RETURN;
 END;

Now, the two queries in my package can re-use it like this:

 SELECT <complicated expressions>
 FROM TABLE(my_package.large_complicated_query)
      ,<other tables>
 WHERE <complicated predicates>;

In the package spec I have:

 -- *** dev note: for internal use only ***
 TYPE source_data_type IS
   RECORD (col1 col1_data_type, etc....);
 TYPE source_data_table_type IS TABLE OF source_data_type;
 FUNCTION large_complicated_query
   RETURN source_data_table_type PIPELINED;
 -- *** ******************************* ***

Because the pipelined function is going to be called by SQL (in fact, two queries defined in the same package), its declaration must also be added to the package spec.

In the package body, I use private global variable(s) to hold the parameter for the large complicated query.

When the queries are run, the global variable(s) must first be set to the required parameter. The queries are run, then the global variables are cleared.
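In outline, the calling code in the package body looks something like this (g_param is the private global read by large_complicated_query; the angle-bracket placeholders are as before):

 PROCEDURE run_queries (p_param IN NUMBER) IS
 BEGIN
   g_param := p_param;
   FOR r IN (SELECT <complicated expressions>
             FROM TABLE(my_package.large_complicated_query)
                  ,<other tables>
             WHERE <complicated predicates>) LOOP
     <process row>
   END LOOP;
   -- (the second query is run in the same fashion)
   g_param := NULL;
 EXCEPTION
   WHEN OTHERS THEN
     g_param := NULL;
     RAISE;
 END run_queries;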

The pipelined function is deliberately not useful to other processes – if a developer tried to call it, they’d get no results because they can’t set the parameters (since they are declared as private globals).

A downside to this approach is that the optimizer will not be able to optimize the entire queries “as a whole” – it will execute the entire query in the pipelined function (at least, until the calling queries decide to stop fetching from it). For my case, however, this is not a problem. The entire process runs in less than a second – and this is 10 times faster than it needs to be. In other words, in this case maintainability is more important than performance.

There may be other ways to do this (in fact, I’m quite sure there are), but this way worked for me.


Psychology and Large Tables

Today a project manager asked me about a change to a query being implemented for a search function on our 11gR2 database. His concern was that the new query had a plan that involved several full table scans; whereas the old version used primarily index range scans.

The query used to look something like this:

SELECT score(1) AS score, a.*
FROM (SELECT *
      FROM entity_details ed
      JOIN associations a
      ON ed.entity_id = a.entity_id
      LEFT JOIN addresses ad
      ON ed.entity_id = ad.entity_id
     ) a
WHERE CONTAINS(a.entity_name,
  :criterion, 1) > 0
ORDER BY a.entity_name;

Its plan looked like this:

SELECT STATEMENT ALL_ROWS cost=52 card=13
- SORT ORDER BY cost=52 card=13
  - NESTED LOOPS OUTER cost=51 card=13
    - NESTED LOOPS cost=31 card=12
      - TABLE ACCESS BY INDEX ROWID entity_details cost=7 card=12
        - DOMAIN INDEX entity_name_ic cost=4
      - TABLE ACCESS BY INDEX ROWID associations cost=2 card=1
        - INDEX RANGE SCAN asso_entity_i cost=1 card=1
    - TABLE ACCESS BY INDEX ROWID addresses cost=2 card=1
      - INDEX RANGE SCAN address_entity_fk_i cost=1 card=1

The new query had an additional predicate:

SELECT score(1) AS score, a.*
FROM (SELECT *
      FROM entity_details ed
      JOIN associations a ON ed.entity_id = a.entity_id
      LEFT JOIN addresses ad ON ed.entity_id = ad.entity_id
     ) a
WHERE CONTAINS(a.entity_name, :criterion, 1) > 0
OR a.entity_name LIKE '%' || UPPER(:criterion) || '%'
ORDER BY a.entity_name;

The query plan for the original query involved index range scans on all the tables; whereas the query plan for the new query involved full table scans on associations and addresses.

SELECT STATEMENT ALL_ROWS cost=348 card=1269
- SORT ORDER BY cost=348 card=1269
  - HASH JOIN OUTER cost=347 card=1269
    - HASH JOIN cost=244 card=1187
      - TABLE ACCESS BY INDEX ROWID entity_details cost=107 card=1187
        - BITMAP CONVERSION TO ROWIDS
          - BITMAP OR
            - BITMAP CONVERSION FROM ROWIDS
              - SORT ORDER BY
                - INDEX RANGE SCAN entity_name_i cost=4
            - BITMAP CONVERSION FROM ROWIDS
              - SORT ORDER BY
                - DOMAIN INDEX entity_name_ic cost=4
      - TABLE ACCESS FULL association cost=136 card=2351
    - TABLE ACCESS FULL addresses cost=102 card=18560

Initial testing in dev revealed no noticeable performance difference between the two queries, so he was just concerned about the impact of the full table scans on the system.

As you can see, the new plan was still using the domain index, as well as using the ordinary index on entity_name; concatenating the two sets of ROWIDs (BITMAP OR) and then accessing the table as before. Previously, the cardinality estimate for just the CONTAINS predicate was 12 (side note: I’m curious as to how predicates using context indexes are costed – anyone know any details or references?); now, the total cardinality estimate for entity_details is 1187. The cardinality estimate if we just did the LIKE predicate is 1176 (which is simply 5% of the number of rows in the table). The higher cardinality has pushed the rest of the query away from index accesses for association and addresses, towards hash joins and full table scans on those tables.

If I override the cardinality estimate with a lower figure, e.g.

SELECT /*+CARDINALITY(a.ed 50)*/ score(1) AS score, a.* ...

the query plan changes into a typical nested-loops-with-index-access one:

SELECT STATEMENT ALL_ROWS cost=295 card=53
- SORT ORDER BY cost=295 card=53
  - NESTED LOOPS OUTER cost=294 card=53
    - NESTED LOOPS cost=207 card=50
      - TABLE ACCESS BY INDEX ROWID entity_details cost=107 card=50
        - BITMAP CONVERSION TO ROWIDS
          - BITMAP OR
            - BITMAP CONVERSION FROM ROWIDS
              - SORT ORDER BY
                - INDEX RANGE SCAN entity_name_i cost=4
            - BITMAP CONVERSION FROM ROWIDS
              - SORT ORDER BY
                - DOMAIN INDEX entity_name_ic cost=4
      - TABLE ACCESS BY INDEX ROWID association cost=2 card=1
        - INDEX RANGE SCAN asso_entity_i cost=1 card=1
    - TABLE ACCESS BY INDEX ROWID addresses cost=2 card=1
      - INDEX RANGE SCAN address_entity_fk_i cost=1 card=1

Substituting various cardinalities reveals the “tipping point” for this particular instance (with its particular set of statistics and optimizer parameters) to be around 50.

The cardinality estimates for the full table scans should be the major giveaway: they’re all less than 20,000. In fact, the total number of records in each of these tables does not exceed 25,000, and our dev instance has a full set of data.

I advised this manager to not worry about the new plan. These tables are unlikely to grow beyond 100,000 (they only hold records for about 25,000 associations, entered over the past 5 years) for the expected life of the product, and the entire tables fit in under 500 blocks. With a db_file_multiblock_read_count of 128, it’s likely that the first query of the day will load all the blocks of the tables into the buffer cache with just a dozen or so reads.

If this query were to use index range scans plus table accesses for each search, the performance would only become marginally better (and probably imperceptibly so), at the cost of slower queries on the occasions when users enter poor search criteria. Whereas, with full table scans, even with the worst search criteria, they will typically get their results in less than 5 seconds anyway.

We’re so accustomed to tables like “entities” and “addresses” having millions or tens of millions of rows that we instinctively recoil from full table scans on them; but, in this instance at least, to Oracle these tables are tiny – and for tiny tables, full table scans are often better.