WITH With an IN; or, A Reason to Refactor

A work colleague needed to make a change to a report and came up against a brick wall. He knew what he wanted to express in SQL, but Oracle wouldn’t accept his syntax.

The original query used a WITH clause like this (note, I’ve removed a great deal of irrelevant detail here, the actual query had a lot of other stuff going on, but this will do to illustrate this point):

WITH q AS
(SELECT expensive_function1(:id) idpart1
,expensive_function2(:id) idpart2
FROM DUAL)
SELECT t.*
FROM q, some_table t
WHERE q.idpart1 = t.idpart1
AND q.idpart2 = t.idpart2;

This works well because the two expensive functions are run only once in the query, and the results are used to probe some_table on its compound key.

The requirements had changed, however, and my colleague needed to change it so that instead of calling the expensive functions, it now got multiple keys from another table. His first cut looked like this:

SELECT t.*
FROM some_table t
WHERE (t.idpart1, t.idpart2) IN
(SELECT idpart1, idpart2
FROM driving_table
WHERE id = :id);

This worked fine, but my colleague felt that the WITH clause should be retained, e.g.:

WITH q AS
(SELECT idpart1, idpart2
FROM driving_table
WHERE id = :id)
SELECT t.*
FROM some_table t
WHERE (t.idpart1, t.idpart2) IN q;

However, Oracle doesn’t like this and raises ORA-00920: invalid relational operator when the statement is parsed.

To my mind this alternative formulation was not required, but I was interested to know what this error message meant, and why can’t we use a WITH subquery as the target for an IN statement?

As it is, however, the Oracle documentation explains why this formulation is not allowed – look up the “SELECT” command in the SQL Reference 10g Release 2 (10.2), search for “WITH”, and under “subquery_factoring_clause” we read:

“Restrictions on Subquery Factoring: This clause is subject to the following restrictions:

In a compound query with set operators, you cannot use the query_name for any of the component queries, but you can use the query_name in the FROM clause of any of the component queries.”

In other words, we can refer to a WITH subquery in a FROM clause but not as part of a set operator like IN.

In the case of my colleague, I recommended he use this much simpler formulation, which works just fine for this report:

SELECT t.*
FROM some_table t, driving_table d
WHERE t.idpart1 = q.idpart1
AND t.idpart2 = q.idpart2;

This shows how, when maintaining existing code, sometimes you have to go back to basics and consider whether the change to the requirements (in this case, a seemingly minor change) means that the code can be refactored to give an appreciable benefit. Caveat: this is not always the case: making large changes to existing code always carries the risk of introducing more defects.


Learn more about Oracle by examining obscure SQL

How many features can you use in one SQL query to solve a problem? (that is, how many necessary features do you need without deliberately obfuscating your code)

In this rather educational example we see (1) XML, (2) UTL_RAW.reverse and UTL_RAW.cast_to_varchar2, and (3) the MODEL clause: “How to convert a number from any base to any base in SQL” (Frank Zhou).

Unfortunately the query crashes with ORA-03113 end-of-file on communication channel on my database (10.2.0.3), but I won’t hold that against him 🙂

Personally, I’d implement this using a PL/SQL function, but that’s just me…

Earlier Frank showed a narrower example (Octal/Hex to Binary) using a similar technique, except using a hierarchical query and sys_connect_by_path instead of the MODEL clause.


LIKE with wildcard (%) at start CAN use an index

A tester for our client raised a problem where searches like ‘ABC%’ would perform satisfactorily, but searches like ‘%ABC’ would time out. Of course, the reason is that the index on the column cannot be used if a wildcard is at the start of the string, so a full table scan is always performed.

I think we’re going to accept the slow response for now, but a possible solution that we might consider in future is to use a function-based index on the REVERSE() function. Its drawback is that it requires modifying the code.

First, some test data:

CREATE TABLE jka AS
SELECT ROWNUM id, dbms_random.string('x',10) v
FROM dual CONNECT BY LEVEL <= 10000;

Now, create an ordinary index:

CREATE INDEX jka_normal ON jka (v);

BEGIN
  dbms_stats.gather_table_stats
    (USER
    ,'JKA'
    ,estimate_percent=>100
    ,cascade=>TRUE);
END;

This query can use a range scan on jka_normal:

SELECT * FROM jka WHERE v LIKE 'ABC%';

But this query will use a full table scan (can’t use the index):

SELECT * FROM jka WHERE v LIKE '%ABC';

Now, create a function-based index (not to be confused with a REVERSE INDEX):

CREATE INDEX jka_reverse ON jka(REVERSE(v));

BEGIN
  dbms_stats.gather_table_stats
    (USER
    ,'JKA'
    ,estimate_percent=>100
    ,cascade=>TRUE);
END;

This query can use a range scan on jka_reverse:

SELECT * FROM jka WHERE REVERSE(v) LIKE REVERSE('%ABC');

Edit: looks like Richard Foote beat me to it, in a discussion of reverse indexes (which unfortunately don’t contribute anything useful to this particular problem) – I did a quick search on Google without results, but Foote’s article must not have been indexed yet or too far down the list.


Will Oracle use my regexp function-based index?

It was mentioned at the AUSOUG conference by one of the speakers that he couldn’t get the database to use a function-based index based on the regexp functions. I thought this was a little strange so decided to try for myself.


SQL> select * from v$version;
Oracle Database 10g Express Edition Release 10.2.0.1.0

Create a test table and gather stats on it:

SQL> create table testtable as
select rownum rn,
dbms_random.string(‘a’,10) string10,
rpad(‘x’,2000,’x’) padding
from all_objects
where rownum <= 5000;

SQL> exec dbms_stats.gather_table_stats(user,
‘TESTTABLE’, cascade=>TRUE);

This is our test query using regexp_instr:

SQL> explain plan for
select string10, regexp_instr(string10, ‘XE’)
from testtable
where regexp_instr(string10, ‘XE’) > 0;

Id | Operation | Name | Rows | Bytes | Cost
0 | SELECT STATEMENT | | 243 | 475K| 465
* 1 | TABLE ACCESS FULL| TESTTABLE | 243 | 475K| 465

1 – filter( REGEXP_INSTR (“STRING10”,’XE’)>0)
– dynamic sampling used for this statement

How many rows are actually returned by this query?

SQL> select string10, regexp_instr(string10, ‘XE’)
from testtable
where regexp_instr(string10, ‘XE’) > 0;

STRING10 REGEXP_INSTR(STRING10,’XE’)
———- —————————
fwXEKwoDhG 3
rSmdOXEkeu 6
WXEbrrjXcW 2
qWEThNXEBO 7
XEQtOwjOCW 1
MRXEoAicUQ 3
IpECtZbjXE 9
LdmXjyePXE 9
gXEaiGrjSX 2
PoqtvdGcXE 9
ZunAgePXXE 9
evXEFhBpzX 3
ZxkXstXEwJ 7
yVpjHzXECY 7
FkaPMpXEgR 7
RuXUnXEQSO 6
OooRCjXXEK 8
XzceiWPXEr 8
XEZlpdNMhG 1
jJjzfXEdDw 6
CowyXEuHDm 5

21 rows selected

Hmmm… there should be a better way.
Let’s try a simple function-based index to speed things up:

SQL> create index idx_regexp on testtable
(regexp_instr(string10, ‘XE’));

Will this query use the index?

SQL> explain plan for
select string10, regexp_instr(string10, ‘XE’)
from testtable
where regexp_instr(string10, ‘XE’) > 0;

Id | Operation | Name | Rows | Bytes | Cost
0 | SELECT STATEMENT | | 4444 | 48884 | 465
* 1 | TABLE ACCESS FULL| TESTTABLE | 4444 | 48884 | 465

1 – filter( REGEXP_INSTR (“STRING10”,’XE’)>0)

Why not? The calculated cost with the index was higher than with a full table scan.

SQL> explain plan for
select /*+ index(testtable idx_regexp) */ string10, regexp_instr(string10, ‘XE’)
from testtable
where regexp_instr(string10, ‘XE’) > 0;

Id | Operation | Name | Rows | Bytes | Cost
0 | SELECT STATEMENT | | 4444 | 48884 | 1512
1 | TABLE ACCESS BY INDEX ROWID| TESTTABLE | 4444 | 48884 | 1512
* 2 | INDEX RANGE SCAN | IDX_REGEXP | 4444 | | 10

2 – access( REGEXP_INSTR (“STRING10”,’XE’)>0)

Why was the cost higher? Let’s look at what would be in that index…

SQL> select regexp_instr(string10, ‘XE’) from testtable;

REGEXP_INSTR(STRING10,’XE’)
—————————
0
0
0
0
0
1
0
0
0
0
… (lots of zeroes and the occasional positive integer) …
0
0
0

5000 rows selected

The index has a row for each block in the table. No histogram on it, so it doesn’t know that most of the index is zeroes.
What if we just want stuff from the index?

SQL> explain plan for
select regexp_instr(string10, ‘XE’)
from testtable
where regexp_instr(string10, ‘XE’) > 0;

Id | Operation | Name | Rows | Bytes | Cost
0 | SELECT STATEMENT | | 4444 | 48884 | 4
* 1 | INDEX FAST FULL SCAN| IDX_REGEXP | 4444 | 48884 | 4

1 – filter( REGEXP_INSTR (“STRING10”,’XE’)>0)

Yes, that uses the index. So how do we get the rest of the data from the table? Let’s try something else.

SQL> drop index idx_regexp;

We can take advantage of the fact that NULLS are not stored in an index by converting any zeroes (i.e. the regular expression didn’t match) to NULL:

SQL> create index idx_regexp_better on testtable
(CASE WHEN regexp_instr(string10, ‘XE’) > 0
THEN regexp_instr(string10, ‘XE’)
ELSE NULL END);

SQL> exec dbms_stats.gather_table_stats(user, ‘TESTTABLE’, cascade=>TRUE);

Will our query use the new improved index?

SQL> explain plan for
select string10, regexp_instr(string10, ‘XE’)
from testtable
where regexp_instr(string10, ‘XE’) > 0;

Id | Operation | Name | Rows | Bytes | Cost
0 | SELECT STATEMENT | | 250 | 2750 | 465
* 1 | TABLE ACCESS FULL| TESTTABLE | 250 | 2750 | 465

1 – filter( REGEXP_INSTR (“STRING10”,’XE’)>0)

No. Why? Because it can’t use the index, even if we try to force it with a hint. The optimiser doesn’t know it can modify my expression into the one that was used to build the index.

SQL> explain plan for
select /*+ index(testtable idx_regexp_better)*/
string10, regexp_instr(string10, ‘XE’)
from testtable
where regexp_instr(string10, ‘XE’) > 0;

Id | Operation | Name | Rows | Bytes | Cost
0 | SELECT STATEMENT | | 250 | 2750 | 465
* 1 | TABLE ACCESS FULL| TESTTABLE | 250 | 2750 | 465

1 – filter( REGEXP_INSTR (“STRING10”,’XE’)>0)

What if we copy the expression from the index into the query? Yes, that’s better.

SQL> explain plan for
select string10, regexp_instr(string10, ‘XE’)
from testtable
where CASE WHEN regexp_instr(string10, ‘XE’) > 0
THEN regexp_instr(string10, ‘XE’)
ELSE NULL END > 0;

Id | Operation | Name | Rows | Bytes | Cost
0 | SELECT STATEMENT | | 21 | 231 | 22
1 | TABLE ACCESS BY INDEX ROWID| TESTTABLE | 21 | 231 | 22
* 2 | INDEX RANGE SCAN | IDX_REGEXP_BETTER | 21 | | 1

2 – access(CASE WHEN REGEXP_INSTR (“STRING10”,’XE’)>0 THEN REGEXP_INSTR(“STRING10”,’XE’) ELSE NULL END >0)

Brilliant!


Easy way to crash a Oracle 10.2 session

This query crashes the session when it’s run or parsed.
I haven’t checked if this appears in other versions or not.


SQL> conn scott/tiger@orcl
Connected.

SQL> select * from v$version;
Oracle Database 10g Enterprise Edition Release 10.2.0.3.0 – 64bi
PL/SQL Release 10.2.0.3.0 – Production
CORE 10.2.0.3.0 Production
TNS for Solaris: Version 10.2.0.3.0 – Production
NLSRTL Version 10.2.0.3.0 – Production

SQL> explain plan for with q as
2 (select ‘x’ d from dual
3 union all
4 select ‘y’ from dual)
5 select d from q
6 union all
7 select d from q
8 order by dbms_random.value;
explain plan for with q as (select ‘x’ d from dual
*
ERROR at line 1:
ORA-03113: end-of-file on communication channel

The crashes happened when I added the “order by dbms_random.value”. The problem seems to be in the optimiser since I get the same results whether I run the query or just explain it. Possibly something to do with the optimiser wrongly assuming the random function is deterministic?

There’s a simple workaround, so I’m not too worried. I just change it to this and it works fine:


with q as
(select ‘x’ d from dual
union all
select ‘y’ from dual)
select d, dbms_random.value o from q
union all
select d, dbms_random.value o from q
order by o;


Do you know the difference between USER_TAB_COLS & USER_TAB_COLUMNS?

I just noticed the difference – I was wondering why I couldn’t see the virtual columns (i.e. columns that support function-based indexes) on a table, and it was because I was querying USER_TAB_COLUMNS. Once I used USER_TAB_COLS instead there they were.

The difference is noted in the documentation (Oracle 10g Release 2 (10.2)) for USER_TAB_COLS and ALL_TAB_COLS but not for USER_TAB_COLUMNS or ALL_TAB_COLUMNS, unfortunately.


Generic Audit

Looking at the column comments on this table, I can sympathise with the poor soul who after painstakingly describing 100 columns just decided that it just wasn’t worth it…

table audit_history (column comments in quotes):

id number(18) “Unique identifier”
aud_table_name varchar2(30) “audit record table”
audit_action varchar2(50) “audit action e.g. update”
aud_timestamp date “timestamp of the audit”
aud_user_id varchar2(100) “user performing the action”
aud_col_1 varchar2(4000) “audit column 1”
aud_col_2 varchar2(4000) “audit column 2”
aud_col_3 varchar2(4000) “audit column 3”
… (etc.) …
aud_col_99 varchar2(4000) “audit column 99”
aud_col_100 varchar2(4000) “audit column 100”
aud_col_101 varchar2(4000)
aud_col_102 varchar2(4000)
… (etc.) …
aud_col_139 varchar2(4000)
aud_col_140 varchar2(4000)

Tip: Don’t let anyone even think about using this kind of design for their change history auditing requirements. It might look elegant to them, but it is not good. Just, don’t.


Generating test data that matches existing data

I’ve had to create test data a number of times, including data for tables that had mandatory foreign keys to existing tables. It was not feasible to just create new master rows for my test data; I wanted to refer to a random sample of existing data; but the code that generates the test data had to perform reasonably well, even though it had to pick out some random values from a very large table.

Solution? A combination of the new 10g SAMPLE operator, and DBMS_RANDOM. To illustrate:


(create a “very large table”)
SQL> create table t as
2 select rownum n, dbms_random.string(‘a’,30) v
3 from all_objects;

Table created.

SQL> select count(*) from t;

COUNT(*)
———-
40981

(get a random sample from the table)
SQL> select n, substr(v,1,30) from t sample(0.01)
2 order by dbms_random.value;

N SUBSTR(V,1,30)
———- ——————————
11852 xSsdmFtGqkymbKCFoZwUzNxpJAPwaV
8973 RGyNjqMfVayKdiKFGvLYuAFYUpIbCw
25295 eJJtoieSWtzUTIZXCbOLzmdmWHHPOy
297 hiTxUPYKzWKAjFRYTTfJSSCuOwGGmG
1924 yZucJWgkFviAIeXiSCuNeUuDjClvxt
40646 wMTumPxfBMoAcNtVMptoPchILHTXJa

6 rows selected.

SQL> set serveroutput on

(Get a single value chosen at random)
SQL> declare
2 cursor cur_t is
3 select n from t sample(0.01)
4 order by dbms_random.value;
5 l number;
6 begin
7 open cur_t;
8 fetch cur_t into l;
9 dbms_output.put_line(l);
10 close cur_t;
11* end;
SQL> /
21098

PL/SQL procedure successfully completed.


My test code would open the cursor, fetch as many values as it needed, and then close it. If the cursor ran out of values (e.g. the sample was too small for the desired amount of test data, which varied), my code just re-opened the cursor to fetch another set of random values from the large table.

The reason I sort the sample by dbms_random.value is so that if I only want one value, it is not weighted towards rows found nearer the start of the table.

Note: If I didn’t really care about the sample being picked at random from throughout the table, I could have just selected from the table “where rownum < n".



Quick Quiz – Sequence NEXTVAL and CURRVAL

Answer this in your head before trying it out or looking it up. Assume the following script is run in a single session on an Oracle database:

CREATE SEQUENCE seq START WITH 1 INCREMENT BY 1;

CREATE TABLE t (a INTEGER, b INTEGER);

-- Statement #1
INSERT INTO t VALUES (seq.NEXTVAL, seq.NEXTVAL);

-- Statement #2
INSERT INTO t VALUES (seq.CURRVAL, seq.NEXTVAL);

-- Statement #3
INSERT INTO t VALUES (seq.CURRVAL, seq.CURRVAL);

Which of the following is/are true?

  1. The inserted rows will be {1,2}, {2,3} and {3,3}.
  2. The inserted rows will be {1,1}, {2,2} and {2,2}.
  3. The inserted rows will be {1,1}, {1,2} and {2,2}.
  4. Statements #2 and #3 will each raise “ORA-08002: sequence SEQ.CURRVAL is not yet defined in this session”.