Category: SQL

Quick and simple frequency analysis

I use this simple query quite often when exploring the data in a table in any Oracle database (from Oracle v8 onwards):

select q.*, 100 * ratio_to_report(c) over () rtr
from (select distinct v, count(*) over (partition by v) c from (
select MYCOLUMN v from MYTABLE
)) q order by c desc;

Just substitute the table name for “MYTABLE” and the column you’re interested in for “MYCOLUMN”. This gives a frequency analysis of values, e.g.:

V         C       RTR
========  ======  =============
INACTIVE  401001  92.9254049544
ACTIVE    30529   7.0745950455

V is the value from the column. C is the count of how many times that value appeared. RTR is the % ratio to the total. The first row indicates the most popular value.

If it’s a very large table and you want quicker results, you can run the analysis over a smaller sample easily, just by adding the SAMPLE keyword:

...
select MYCOLUMN v from MYTABLE SAMPLE(1)
...

Add business days

It starts out as a fairly simple, innocent business requirement. Create a report to list records meeting some criteria, one of which is:

“List only records where today’s date is more than 35 business days after the due date from the record.”

When you delve deeper you find that querying the table with “DUE_DATE + 35 < SYSDATE” is not going to cut it – “business days” do not include weekends. You might start with something similar to this. But even that’s not good enough, because business days should not include public holidays. How do you code that?

So, here’s my solution.

1. We need to know what days are public holidays for the region. In our case this application is only applicable for a single region, so we use a simple table:

CREATE TABLE holidays (holiday_date DATE PRIMARY KEY);

We create a simple form for users to enter new holidays every year, and give someone the job of making sure it’s up-to-date every year when the public holidays are announced.

2. Create a view that lists all non-business days – i.e. list all weekends and public holidays. To retain reasonable performance, we limit our solution to dates in the years 2000 to 2050.

CREATE VIEW non_business_days AS
SELECT TO_DATE('01012000','DDMMYYYY') + ROWNUM * 7
       AS day -- Saturdays 2000 to 2050
FROM DUAL CONNECT BY LEVEL <= 2661
UNION ALL
SELECT to_date('02012000','DDMMYYYY') + ROWNUM * 7
       AS day -- Sundays 2000 to 2050
FROM DUAL CONNECT BY LEVEL <= 2661
UNION ALL
SELECT holiday_date FROM holidays;

3. Now, when we need to take a date and add x business days to it, we query this table to find all the non-business-days that are applicable, e.g.:

SELECT day
      ,COUNT(*) OVER (ORDER BY day
                      ROWS BETWEEN UNBOUNDED PRECEDING
                      AND CURRENT ROW)
       AS count_so_far
      ,(day - p_date) AS base_days
FROM   NON_BUSINESS_DAYS
WHERE  day > p_date;

If you run this query and examine each row in order of day, if you take base_days and subtract count_so_far, when the result is less than x, then base_days – count_so_far is the number of extra days we need to add to the holiday’s date to give us the answer. You’ll find this logic in the function below.

In our final solution, we’ll also need to UNION in the date parameter as well, for the case where there are no holidays between the starting date and the number of business days requested.

Here’s our function to take any date (at least, any date between 2000 and 2050) and add x business days (positive or negative):

FUNCTION add_working_days (p_date IN DATE, p_working_days IN NUMBER)
RETURN DATE IS
  l_date DATE;
BEGIN

  IF p_date IS NULL OR p_working_days IS NULL THEN
    RETURN NULL;
  END IF;

  IF p_working_days != TRUNC(p_working_days) THEN
    RAISE_APPLICATION_ERROR(-20000,
      'add_working_days: cannot handle fractional p_working_days ('
      || p_working_days || ')');
  END IF;

  IF p_working_days > 0 THEN

    SELECT MAX(day + p_working_days - (base_days - count_so_far))
    INTO l_date
    FROM (SELECT day
                ,COUNT(*) OVER (ORDER BY day
                                ROWS BETWEEN UNBOUNDED PRECEDING
                                AND CURRENT ROW)
                 AS count_so_far
                ,(day - p_date) AS base_days
          FROM NON_BUSINESS_DAYS
          WHERE day > p_date
          UNION
          SELECT p_date, 0, 0 FROM DUAL
         )
    WHERE base_days - count_so_far < p_working_days;

  ELSIF p_working_days < 0 THEN

    SELECT MIN(day - (ABS(p_working_days) - (base_days - count_so_far)))
    INTO l_date
    FROM (SELECT day
                ,COUNT(*) OVER (ORDER BY day DESC
                                ROWS BETWEEN UNBOUNDED PRECEDING
                                AND CURRENT ROW)
                 AS count_so_far
                ,(p_date - day) AS base_days
          FROM NON_BUSINESS_DAYS
          WHERE day < p_date
          UNION
          SELECT p_date, 0, 0 FROM DUAL
         )
    WHERE base_days - count_so_far < ABS(p_working_days);

  ELSE

    l_date := p_date;

  END IF;

  RETURN l_date;
END add_working_days;

Test cases (these are some public holidays in Western Australia):

insert into holidays values (to_date('27/12/2010','DD/MM/YYYY');
insert into holidays values (to_date('28/12/2010','DD/MM/YYYY');
insert into holidays values (to_date('03/01/2011','DD/MM/YYYY');
insert into holidays values (to_date('26/01/2011','DD/MM/YYYY');

— Expected: 06/01/2011

select cls_util.add_working_days(to_date('13/12/2010','DD/MM/YYYY')
                                ,15) from dual;

— Expected: 31/01/2011

select cls_util.add_working_days(to_date('25/01/2011','DD/MM/YYYY')
                                ,3) from dual;

— Expected: 13/12/2010

select cls_util.add_working_days(to_date('06/01/2011','DD/MM/YYYY')
                                ,-15) from dual;

— Expected: 25/01/2011

select cls_util.add_working_days(to_date('31/01/2011','DD/MM/YYYY')
                                ,-3) from dual;

SPOD for a query

I have two queries that need to be executed by a PL/SQL program. Both of them are quite complex, and both of them have a large section which is identical – because they are different views of the same underlying source data.

One option is to expand out both queries in full, e.g.:

Query 1:

 SELECT <complicated expressions>
 FROM (
       <large complicated query>
      ), <other tables>
 WHERE <complicated predicates>;

Query 2:

 SELECT <different complicated expressions>
 FROM (
       <large complicated query>
      ), <other different tables>
 WHERE <different complicated predicates>;

I don’t like the fact that my <large complicated query> is repeated in full in both cursor definitions. I’d rather have one place where that subquery is defined, because it should remain the same for both queries, since they are supposed to be different views of the same underlying data.

Another option is to create a view on the <large complicated query>, and refer to that in both queries. This is a perfectly acceptable option, and one which I often use. The only downside is if there are any parameters that need to be “fed” to the view. One way is for the view to expose the parameter as a column in the view, and for the calling query to simply query it on that column. This is not always the most efficient method, however, depending on the complexity of the view and how well Oracle can “push down” the predicate into the view at execution time. Another solution to the parameter problem is to use a user-defined context as described here.

The other downside which I don’t like for this case is that the view moves the query away from the package – I’d prefer to have the definitions close together and maintained in one location.

The solution which I used in this case is a pipelined function. For example:

 FUNCTION large_complicated_query
   RETURN source_data_table_type
   PIPELINED IS
   rc source_data_type;
 BEGIN
   FOR r IN (<large complicated query>) LOOP
     rc.col1 := r.col1;
     rc.col2 := r.col2;
     -- etc.
     PIPE ROW (rc);
   END LOOP;
   RETURN;
 END;

Now, the two queries in my package can re-use it like this:

 SELECT <complicated expressions>
 FROM TABLE(my_package.large_complicated_query)
      ,<other tables>
 WHERE <complicated predicates>;

In the package spec I have:

 -- *** dev note: for internal use only ***
 TYPE source_data_type IS
   RECORD (col1 col1_data_type, etc....);
 TYPE source_data_type_table IS TABLE OF source_data_type;
 FUNCTION large_complicated_query
   RETURN source_data_table_type PIPELINED;
 -- *** ******************************* ***

Because the pipelined function is going to be called by SQL (in fact, two queries defined in the same package), its declaration must also be added to the package spec.

In the package body, I use private global variable(s) to hold the parameter for the large complicated query.

When the queries are run, the global variable(s) must first be set to the required parameter. The queries are run, then the global variables are cleared.

The pipelined function is deliberately not useful to other processes – if a developer tried to call it, they’d get no results because they can’t set the parameters (since they are declared as private globals).

A downside to this approach is that the optimizer will not be able to optimize the entire queries “as a whole” – it will execute the entire query in the pipelined function (at least, until the calling queries decide to stop fetching from it). For my case, however, this is not a problem. The entire process runs in less than a second – and this is 10 times faster than it needs to be. In other words, in this case maintainability is more important than performance.

There may be other ways to do this (in fact, I’m quite sure there are), but this way worked for me.

Psychology and Large Tables

Today a project manager asked me about a change to a query being implemented for a search function on our 11gR2 database. His concern was that the new query had a plan that involved several full table scans; whereas the old version used primarily index range scans.

The query used to look something like this:

SELECT score(1) AS score, a.*
FROM (SELECT *
      FROM entity_details ed
      JOIN associations a
      ON ed.entity_id = a.entity_id
      LEFT JOIN addresses ad
      ON ed.entity_id = ad.entity_id
     ) a
WHERE CONTAINS(a.entity_name,
  :criterion, 1) > 0
ORDER BY a.entity_name;

Its plan looked like this:

SELECT STATEMENT ALL_ROWS cost=52 card=13
- SORT ORDER BY cost=52 card=13
  - NESTED LOOPS OUTER cost=51 card=13
    - NESTED LOOPS cost=31 card=12
      - TABLE ACCESS BY INDEX ROWID entity_details cost=7 card=12
        - DOMAIN INDEX entity_name_ic cost=4
      - TABLE ACCESS BY INDEX ROWID associations cost=2 card=1
        - INDEX RANGE SCAN asso_entity_i cost=1 card=1
    - TABLE ACCESS BY INDEX ROWID addresses cost=2 card=1
      - INDEX RANGE SCAN address_entity_fk_i cost=1 card=1

The new query had an additional predicate:

SELECT score(1) AS score, a.*
FROM (SELECT *
      FROM entity_details ed
      JOIN associations a ON ed.entity_id = a.entity_id
      LEFT JOIN addresses ad ON ed.entity_id = ad.entity_id
     ) a
WHERE CONTAINS(a.entity_name, :criterion, 1) > 0
OR a.entity_name LIKE '%' || UPPER(:criterion) || '%'
ORDER BY a.entity_name;

The query plan for the original query involved index range scans on all the tables; whereas the query plan for the new query involved full table scans on associations and addresses.

SELECT STATEMENT ALL_ROWS cost=348 card=1269
- SORT ORDER BY cost=348 card=1269
  - HASH JOIN OUTER cost=347 card=1269
    - HASH JOIN cost=244 card=1187
      - TABLE ACCESS BY INDEX ROWID entity_details cost=107 card=1187
        - BITMAP CONVERSION TO ROWIDS
          - BITMAP OR
            - BITMAP CONVERSION FROM ROWIDS
              - SORT ORDER BY
                - INDEX RANGE SCAN entity_name_i cost=4
            - BITMAP CONVERSION FROM ROWIDS
              - SORT ORDER BY
                - DOMAIN INDEX entity_name_ic cost=4
      - TABLE ACCESS FULL association cost=136 card=2351
    - TABLE ACCESS FULL addresses cost=102 card=18560

Initial testing in dev revealed no noticeable performance difference between the two queries, so he was just concerned about the impact of the full table scans on the system.

As you can see, the new plan was still using the domain index, as well as using the ordinary index on entity_name; concatenating the two sets of ROWIDs (BITMAP OR) and then accessing the table as before. Previously, the cardinality estimate for just the CONTAINS predicate was 12 (side note: I’m curious as to how predicates using context indexes are costed – anyone know any details or references?); now, the total cardinality estimate for entity_details is 1187. The cardinality estimate if we just did the LIKE predicate is 1176 (which is simply 5% of the number of rows in the table). The higher cardinality has pushed the rest of the query away from index accesses for association and addresses, towards hash joins and full table scans on those tables.

If I override the cardinality estimate with a lower figure, e.g.

SELECT /*+CARDINALITY(a.ed 50)*/ score(1) AS score, a.* ...

the query plan changes into a typical nested-loops-with-index-access one:

SELECT STATEMENT ALL_ROWS cost=295 card=53
- SORT ORDER BY cost=295 card=53
  - NESTED LOOPS OUTER cost=294 card=53
    - NESTED LOOPS cost=207 card=50
      - TABLE ACCESS BY INDEX ROWID entity_details cost=107 card=50
        - BITMAP CONVERSION TO ROWIDS
          - BITMAP OR
            - BITMAP CONVERSION FROM ROWIDS
              - SORT ORDER BY
                - INDEX RANGE SCAN entity_name_i cost=4
            - BITMAP CONVERSION FROM ROWIDS
              - SORT ORDER BY
                - DOMAIN INDEX entity_name_ic cost=4
      - TABLE ACCESS BY INDEX ROWID association cost=2 card=1
        - INDEX RANGE SCAN asso_entity_i cost=1 card=1
    - TABLE ACCESS BY INDEX ROWID addresses cost=2 card=1
      - INDEX RANGE SCAN address_entity_fk_i cost=1 card=1

Substituting various cardinalities reveals the “tipping point” for this particular instance (with its particular set of statistics and optimizer parameters) to be around 50.

The cardinality estimates for the full table scans should be the major giveaway: they’re all less than 20,000. In fact, the total number of records in each of these tables does not exceed 25,000, and our dev instance has a full set of data.

I advised this manager to not worry about the new plan. These tables are unlikely to grow beyond 100,000 (they only hold records for about 25,000 associations, entered over the past 5 years) for the expected life of the product, and the entire tables fit in under 500 blocks. With a db_file_multiblock_read_count of 128, it’s likely that the first query of the day will load all the blocks of the tables into the buffer cache with just a dozen or so reads.

If this query were to use index range scans plus table accesses for each search, the performance would only become marginally better (and probably imperceptibly so), at the cost of slower queries on the occasions when users enter poor search criteria. Whereas, with full table scans, even with the worst search criteria, they will typically get their results in less than 5 seconds anyway.

We’re so accustomed to tables like “entities” and “addresses” having millions or tens of millions of rows, so instinctively recoil from full table scans on them; but, in this instance at least, to Oracle, these tables are tiny – for which full table scans are often better.

Infinite Query

This is the query that never ends,
It just goes on and on, my friends.
Some people started fetching not knowing what it was,
And now they can’t stop fetching forever just because…

This is the query that never ends,

CREATE TYPE number_table_type IS TABLE OF NUMBER;

CREATE FUNCTION row_generator
RETURN number_table_type
PIPELINED IS
BEGIN
  LOOP
    FOR i IN 1..100 LOOP
      PIPE ROW (i);
    END LOOP;
  END LOOP;
  RETURN;
END;

SELECT * FROM TABLE(row_generator);

…inspired by…