DATE is not a date!

Consider:

  • DATE is not a date.
  • DATE '2012-06-22' is not a date.
  • CURRENT_DATE is not a date.
  • SYSDATE is not a date.
  • TRUNC(SYSDATE) is not a date.
  • TO_DATE('22/06/2012','DD/MM/YYYY') is not a date.

Oracle SQL does not have a native date datatype.

Explanation: the datatype called “DATE” actually represents a Date+Time. It always has a time portion. For example, the literal expression DATE '2012-06-22'has a time component that means midnight, to the nearest second. It is only by convention that a DATE with a time of 00:00:00 is used to represent the whole day, rather than exactly midnight; but this convention only works if the developer carefully writes his code to ensure this is true.

A fragment of the Antikythera mechanism.

It is theorized that the ancients solved “once and for all” the problem with dates vs. times.

To an experienced Oracle developer, this is not a problem, it feels perfectly natural to say “DATE” but internally to be thinking in terms of time values. It’s easy to forget that we’re even doing this subconsciously. To an outsider (e.g. business analysts, testers, Java developers), it can be confusing – and it is often not immediately obvious that the confusion exists. It’s not until you start explaining why SYSDATE <= DATE '2012-06-22' evaluates to FALSE, even if today is the 22nd day of the month of June in the year of our Lord 2012 that you realise they have been labouring under a false assumption: that a DATE is a date, that it represents the full 24-hour period that a normal human would call “22 June 2012”.

If I was invited to change the name of just one thing in Oracle (and everyone was willing to make all the changes necessary to their code to accommodate my whim) it would be to change “DATE” to “DATETIME”.

I ask you: is there anything else in Oracle that confuses outsiders more often than this misnomer?

P.S. Now for a freebie: here is a summary of a number of transformations that may be done to remove the TRUNC function call around a date column (a and b are all of type DATE):

 TRUNC(a) =  TRUNC(b)  =>  (a BETWEEN TRUNC(b) AND TRUNC(b)+0.99999)
 TRUNC(a) <  TRUNC(b)  =>  a < TRUNC(b)
 TRUNC(a) <= TRUNC(b)  =>  a < TRUNC(b)+1
 TRUNC(a) >  TRUNC(b)  =>  a >= TRUNC(b)+1
 TRUNC(a) >= TRUNC(b)  =>  a >= TRUNC(b)
 TRUNC(a) =  b         =>  (a BETWEEN b AND b+0.99999
                            AND b = TRUNC(b))
 TRUNC(a) <  b         =>  (a < TRUNC(b)+1
                            AND NOT (a=TRUNC(a) AND b=TRUNC(b)))
 TRUNC(a) <= b         =>  a < TRUNC(b)+1
 TRUNC(a) >  b         =>  a >= TRUNC(b)+1
 TRUNC(a) >= b         =>  (a >= TRUNC(b)
                            AND NOT (b=TRUNC(b)))

Code I Regret: Refactoring as Penance

Recently I refactored some PL/SQL for sending emails – code that I wrote way back in 2004. The number of “WTF“‘s per minute has not been too high; however, I’ve cringed more times than I’d like…

1. Overly-generic parameter types

When you send an email, it will have at least one recipient, and it may have many recipients. However, no email will have more than one sender. Yet, I wrote the package procedure like this:

TYPE address_type IS RECORD
 (name          VARCHAR2(100)
 ,email_address VARCHAR2(200)
 );
TYPE address_list_type IS TABLE OF address_type
 INDEX BY BINARY_INTEGER;
PROCEDURE send
 (i_sender     IN address_list_type
 ,i_recipients IN address_list_type
 ,i_subject    IN VARCHAR2
 ,i_message    IN VARCHAR2
 );

Why I didn’t have i_sender be a simple address_type, I can’t remember. Internally, the procedure only looks at i_sender(1) – if a caller were to pass in a table of more than one sender, it raises an exception.

2. Functional programming to avoid local variables

Simple is best, and there’s nothing wrong with using local variables. I wish I’d realised these facts when I wrote functions like this:

FUNCTION address
 (i_name          IN VARCHAR2
 ,i_email_address IN VARCHAR2
 ) RETURN address_list_type;
FUNCTION address
 (i_address       IN address_list_type
 ,i_name          IN VARCHAR2
 ,i_email_address IN VARCHAR2
 ) RETURN address_list_type;

All that so that callers can avoid *one local variable*:

EMAIL_PKG.send
 (i_sender     => EMAIL_PKG.address('joe','joe@company.com')
 ,i_recipients => EMAIL_PKG.address(
                  EMAIL_PKG.address(
                  'jill', 'jill@company.com')
                 ,'bob', 'bob@company.com')
 ,i_subject    => 'hello'
 ,i_message    => 'world'
 );

See what I did there with the recipients? Populating an array on the fly with just function calls. Smart eh? But rather useless, as it turns out; when we need to send multiple recipients, it’s usually populated within a loop of unknown sized, so this method doesn’t work anyway.

Go ahead – face your past and dig up some code you wrote 5 years ago or more. I think, if you don’t go “WTF!” every now and then, you probably haven’t learned anything or improved yourself in the intervening years. Just saying 🙂


Alexandria: May Cause Addiction

Ever since I downloaded the Alexandria PL/SQL library, I haven’t been able to put it down. Just recently I decided I wanted to serve up a whole lot of media files directly from Amazon’s S3 simple storage service, instead of serving them from within my EC2 (elastic compute) instance. They were just wasting my linux server’s time responding to http requests.

So, I quickly wrote the following code to transfer them:

DECLARE
  l_blob BLOB;
BEGIN
  /* initialise my AWS session */
  ALEX.amazon_aws_auth_pkg.init
    ( 'yyy-my-aws-id-yyy'
    , 'xxx-not-telling-xxx'
    , p_gmt_offset => -8);
  FOR rec IN (
    SELECT id, filename, mime_type, location
    FROM myfiles
    WHERE location = 'http://myserver/media/'
  ) LOOP
    /* read the file from its current location */
    l_blob := ALEX.http_util_pkg.get_blob_from_url
      (rec.location || rec.filename);
    IF DBMS_LOB.getLength(l_blob) > 0 THEN
      /* upload the file to Amazon S3 */
      ALEX.amazon_aws_s3_pkg.new_object
        ( 'mybucket'
        , rec.filename
        , l_blob
        , rec.mime_type
        , ALEX.amazon_aws_s3_pkg.g_acl_public_read);
      UPDATE myfiles
      SET location = 'http://mybucket.s3-ap-southeast-1.amazonaws.com/'
      WHERE id = rec.id;
      COMMIT;
    END IF;
  END LOOP;
END;

After a short while, all the files had been copied across to my bucket on S3, and my table updated so that my web site now points people’s browsers to the new location for those files.

Of course, I could have used UTL_FILE to read the files from disk, but then I’d have to first create a directory, and write a loop to read the file in chunks into the BLOB. Why bother with all that when I can just call http_util_pkg.get_blog_from_url and get it all in one go?

That’s the trouble with powerful utilities like Alexandria: they’re too easy to use, make tasks like this trivial, and you start finding all sorts of uses for them. All of a sudden, Alexandria is your hammer, and the world is full of nails.

See also: this quick intro to using Alexandria’s API for Amazon S3.


Generating unique identifiers with “SELECT MAX(id) + 1”

Normally, when you see code like this in a production system, you should duck your head and run:

SELECT NVL( MAX( id ), 0 ) + 1
INTO   :new_id
FROM   mytable;

What’s wrong with this code?

I hope the first answer that rolls off your tongue has something to do with concurrency – i.e. two sessions that run this around the same time will not see uncommitted rows from each other, and so are likely to try to insert rows with conflicting identifiers.

I hope the second answer that you might mention has to do with performance – even considering there’s a unique index on the column, this code will still need to read at least one index block to get the latest ID (assuming the query optimiser chooses to do a MIN/MAX index scan so that it doesn’t have to scan the entire index before returning a result). In a high load system this cost might be unacceptable.

Of course, the first problem (concurrency) could be solved by serializing access to the “get the next ID” function, e.g. with a DBMS_LOCK. We all know, however, that there’s no sane reason to serialize this when Oracle already provides a perfectly good mechanism for generating unique IDs, with virtually no serialization – sequences.

CREATE SEQUENCE my_id_seq;
SELECT my_id_seq.NEXTVAL INTO :new_id FROM DUAL;

Sequences have the benefits of guaranteeing uniqueness, and if their cache setting is set appropriately, will add a negligible amount of overhead for serialization.

Problem solved. Easy, right? I bet you’re wondering why I added the word “Normally” to my first sentence in this post….

Question: When might using “SELECT MAX(id) + 1” ever be an acceptable source of unique identifiers?

Answer: Global Temporary tables.

If I’ve inserted any rows into a global temporary table, by definition no other session can see my data, so the first consideration, concurrency, is not an issue.

Also, if I’m not expecting to ever insert many rows into my global temporary table, I can be reasonably confident that performance will not be an issue either. Plus, if I put an index on the ID column, that query will be quite inexpensive.

Conclusion: if you are using global temporary tables, you don’t have to use sequences to generate unique identifiers for them. I’m not saying you shouldn’t, of course – a sequence may be faster, and may even lead to simpler code in some cases – but in other cases you might decide to forego a sequence – one less object, with perhaps its role grants and synonyms, to deploy.

Now, of course, you have to ask yourself, why query the table at all? Why not store that latest ID in a private global variable in a package? In fact, we can create a simple package to replace the sequence, e.g.:

CREATE OR REPLACE PACKAGE my_table_pkg IS
FUNCTION next_id RETURN my_table.id%TYPE;
END my_table_pkg;
CREATE OR REPLACE PACKAGE BODY my_table_pkg IS
  g_latest_id my_table.id%TYPE;
FUNCTION next_id RETURN my_table.id%TYPE IS
  BEGIN
    g_latest_id := NVL(g_latest_id, 0) + 1;
    RETURN g_latest_id;
  END next_id;
END my_table_pkg;

Well, now you know what to do. Whenever you need to generate a unique set of identifiers for a global temporary table, you’ve got a choice of options: sequence, package variable, or a “max(id)+1” query.


Constrain a table to only 1 row

I needed a table that could only ever have one row – if anyone tried to insert a second row they’d get an error.

CREATE UNIQUE INDEX only_one_row_allowed ON mytable (1);

Testing it:

INSERT INTO mytable VALUES ('x');
ORA-00001: unique constraint (SCOTT.ONLY_ONE_ROW_ALLOWED) violated


DEFAULT NULL NULL

Found myself running the following command today:

ALTER TABLE mytable
MODIFY (mycolumn DEFAULT NULL NULL);

I know – I’m easily entertained 🙂


Designing a PL/SQL API – BOOLEAN or CHAR?

A simple question: you’re designing an API to be implemented as a PL/SQL package, and you don’t (yet) know the full extent to which your API may be used, so you want to cover a reasonable variety of possible usage cases.

You have a function that will return a BOOLEAN – i.e. TRUE or FALSE (or perhaps NULL). Should you implement it this way, or should you return some other kind of value – e.g. a CHAR – e.g. ‘Y’ for TRUE or ‘N’ for FALSE; or how about a NUMBER – e.g. 1 for TRUE or 0 for FALSE?

This debate has raged since 2002, and probably earlier – e.g. http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:6263249199595

Well, if I use a BOOLEAN, it makes the code simple and easy to understand – and callers can call my function in IF and WHILE statements without having to compare the return value to anything. However, I can’t call the function from a SQL statement, which can be annoyingly restrictive.

If I use a CHAR or NUMBER, I can now call the function from SQL, and store it in a table – but it makes the code just a little more complicated – now, the caller has to trust that I will ONLY return the values agreed on. Also, there is no way to formally restrict the values as agreed – I’d have to just document them in the package comments. I can help by adding some suitable constants in the package spec, but note that Oracle Forms cannot refer to these constants directly. Mind you, if the value is being stored in a table, a suitable CHECK constraint would be a good idea.

Perhaps a combination? Have a function that returns BOOLEAN, and add wrapper functions that converts a BOOLEAN into a ‘Y’ or ‘N’ as appropriate? That might be suitable.

Personally, though, I hate the NUMBER (1 or 0) idea for PL/SQL. That’s so C-from-the-1970’s. Who codes like that anymore?


3 Reasons to Hate Hibernate

Warning: this is a rant.

This is just a collection of observations of Hibernate, from the perspective of an Oracle developer/”DBA”. I’m aware of some of the benefits of using Hibernate to shield Java developers from having to know anything about the database or the SQL language, but sometimes it seems to me that we might generally be better off if they were required to learn a little about what’s going on “underneath the hood”. (Then I remind myself that it’s my job to help them get the most out of the database the client spent so much money getting.)

So, here are my gripes about Hibernate – just getting them off my chest so I can put them to bed.

Disclaimer: I know every Hibernate aficionado will jump in with “but it’s easy to fix that, all you have to do is…” but these are generalizations only.

Exhibit A: Generic Query Generators

As soon as I’d loaded all the converted data into the dev and test instances, we started hitting silly performance issues. A simple search on a unique identifier would take 20-30 seconds to return at first, then settle down to 4-8 seconds a pop. Quite rightly, everyone expected these searches to be virtually instant.

The culprit was usually a query like this:

select count(*) as y0_
from XYZ.SOME_TABLE this_
inner join XYZ.SOME_CHILD_TABLE child1_
on this_.PARENT_ID=child1_.PARENT_ID
where lower(this_.UNIQUE_IDENTIFIER) like :1
order by child1_.COLH asc, child1_.COLB asc, this_.ANOTHER_COL desc

What’s wrong with this query, you might ask?

Issue 1: Case-insensitive searches by default

Firstly, it is calling LOWER() on the unique identifier, which will never contain any alphabetic characters, so case-insensitive searches will never be required – and so it will not use the unique index on that column. Instead of forcing the developers to think about whether case-insensitive searches are required or not for each column, it allows them to simply blanket the whole system with these – and quite often no-one will notice until the system goes into UAT or even Prod and someone actually decides to test searching on that particular column, and decides that waiting for half a minute is unacceptable. It’s quite likely that for some cases even this won’t occur, and these poorly performing queries (along with their associated load on the database server) will be used all the time, and people will complain about the general poor performance of the database.

Issue 2: Count first, then re-query for the data

Secondly, it is doing a COUNT(*) on a query which will immediately after be re-issued in order to get the actual data.  I’d much prefer that the developers were writing the SQL by hand. That way, it’d be a trivial matter to ask them to get rid of the needless COUNT(*) query; and if they simply must show a total record count on the page, add a COUNT(*) OVER () to the main query – thus killing two birds with one efficient stone.

Exhibit B: Magical Class Generators (tables only)

Issue 3: No views, no procedures, no functions

When someone buys Hibernate, they might very well ask: is it possible to call an Oracle procedure or function with this product? And the answer is, of course, “yes”. Sure, you can do anything you want!

The day the Java developers peel off the shrinkwrap, the first thing they try is creating a Java class based on a single table. With glee they see it automagically create all the member attributes and getter/setter methods, and with no manual intervention required they can start coding the creation, modification and deletion of records using this class, which takes care of all the dirty SQL for them.

Then, the crusty old Oracle developer/”DBA” comes along and says: “It’d be better if you could use this API I’ve lovingly crafted in a PL/SQL package – everything you need is in there, and you’ll be shielded from any complicated stuff we might need to put in the database now or later. All you have to do is call these simple procedures and functions.” And the Java developer goes “sure, no problem” – until they discover that Hibernate cannot automatically create the same kind of class they’ve already gotten accustomed to.

“What, we actually need to read the function/procedure definition and hand-code all the calls to them? No sir, not happening.” After all, they bought Hibernate to save them all that kind of work, and who’s going to blame them?

So, you say, “Ok, no problem, we’ll wrap the API calls with some simple views, backed by instead-of triggers.” But then they hit another wall – Hibernate can’t tell from a view definition how that view relates to other views or tables.

The end result is that all the Java code does is access tables directly. And you get the kind of queries (and worse) that you saw in Exhibit “A” above.

There. I feel so much better already.

/rant


Write REF CURSOR to file

This is just a very simple procedure that makes writing a lot of files using UTL_FILE much simpler. It can be used to easily create simple CSVs or fixed-length files.

I’ve shown it here in a package body; it’s up to you to create a package spec.

To use it, all you need to do is create a query that concatenates all the data into a single string up to 4000 characters long. If you’re writing a CSV you need to concatenate commas and quotes as appropriate.

CREATE OR REPLACE
PACKAGE BODY packagename AS
-- If no records are found in the cursor, no file is created.
PROCEDURE write_cursor_to_file
  (outputdir  IN VARCHAR2
  ,filename   IN VARCHAR2
  ,headerline IN VARCHAR2
  ,refcursor  IN SYS_REFCURSOR
  ) IS
  -- increase to make faster but use more memory;
  -- decrease to use less memory but run slower
  BATCHSIZE CONSTANT INTEGER := 100;
  TYPE varr_type IS TABLE OF VARCHAR2(4000) INDEX BY BINARY_INTEGER;
  varr varr_type;
  outf UTL_FILE.FILE_TYPE;
BEGIN
  -- note: don't open the file unless we actually get some
  -- records back from the cursor
  LOOP
    FETCH write_cursor_to_file.refcursor
    BULK COLLECT INTO varr
    LIMIT BATCHSIZE;
    EXIT WHEN varr.COUNT = 0;
    -- We have some records to write. Have we opened the file yet?
    IF NOT UTL_FILE.IS_OPEN (outf) THEN
      outf := UTL_FILE.fopen
        (file_location => write_cursor_to_file.outputdir
        ,file_name     => write_cursor_to_file.filename
        ,open_mode     => 'A'
        ,max_linesize  => 4000);
      UTL_FILE.put_line (outf, write_cursor_to_file.headerline);
    END IF;
    -- Write the batch of records to the file
    FOR i IN 1..varr.COUNT LOOP
      UTL_FILE.put_line (outf, varr(i));
    END LOOP;
  END LOOP;
  CLOSE write_cursor_to_file.refcursor;
  IF UTL_FILE.IS_OPEN (outf) THEN
    UTL_FILE.fclose (outf);
  END IF;
END write_cursor_to_file;
PROCEDURE sample IS
  OUTPUTDIR  CONSTANT VARCHAR2(100) := 'MY_DIR_NAME';
  FILENAME   CONSTANT VARCHAR2(100) := 'my_file_name.csv';
  HEADERLINE CONSTANT VARCHAR2(4000) := 'Name,Address,Date of Birth';
  refcursor SYS_REFCURSOR;
BEGIN
  OPEN refcursor FOR
    select '"' || name || '"'
        || ',"' || address || '"'
        || ',' || TO_CHAR(dob,'DD-Mon-YYYY')
    from persons;
  write_cursor_to_file
    (outputdir  => OUTPUTDIR
    ,filename   => FILENAME
    ,headerline => HEADERLINE
    ,refcursor  => refcursor
    );
END sample;
END packagename;
/

It appends to the file if it finds it. This makes it easy to write the result of several queries to the same file.

Note: if you’re on Oracle 8i or earlier, you’ll need to add a replacement for SYS_REFCURSOR, e.g. TYPE my_sys_refcursor IS REF CURSOR; either at the top of the package, or if you want to make the write_cursor_to_file procedure public, put the type definition in your package spec.


Googlebot, APEX Session IDs, and Cookies

Recently I had a bit of a saga with a public-facing website running on Oracle APEX (www.foothillschurch.org.au, if you’re curious) getting hammered by Googlebot. We went live with a new version of the site, but I’d forgotten to make sure that all the links set the Session ID to 0 (zero).

What is this session ID?

Every visit to an APEX application needs to have a session. Each unique session is recorded in the database, and keeps track of the state of all the variables for the various pages you visit. Normally, the session is identified by a Session ID which is embedded in the “p” parameter of the URL.

For example, if you visit the page:

http://www.foothillschurch.org.au/apex/f?p=102:1

You’ll notice that the “p” parameter only specifies the app ID (102) and the page ID (1). What apex does is responds with a 302 temporary redirect, that tells the client to redirect to a new URL containing a newly generated session ID, e.g.:

http://www.foothillschurch.org.au/apex/f?p=102:1:45164531548964:::::

Behind the scenes, it’s not just changing the URL – it’s also sending a cookie to the client to be used for subsequent calls. I’ll get back to this later.

Whenever you navigate around a normal apex site, the session ID gets copied into each link so that the user’s session is preserved. If you manually change or remove the session ID from the URL, apex will redirect you to a newly created session ID.

In case you’re wondering, there’s no significant security risk behind the exposed session ID – no-one can “hijack” your session, even if they copy your session ID directly. That’s because there’s a cookie behind the scenes with a secret value that must match up with the session ID, and only someone with sufficient access to the database server could get enough data to do that.

If you store the URL containing a session ID (e.g. in your bookmarks) and re-use it much later, your session will have expired – in which case APEX will create a new session, and 302 temporary redirect you to a new URL with the new session ID. Therefore, users can safely bookmark any page in apex.

But now we come, finally, to Googlebot, that little rascal. Now, we would like our public-facing site to be indexed by Google, so we need all the pages of the site that have relevant info to be crawlable.

The way Googlebot works, normally, is that it starts from a link to your page (e.g. on another website, or in a sitemap you submit to Google), e.g.

http://www.foothillschurch.org.au/apex/f?p=102:1

It checks that the URL is not forbidden by your robots.txt, and sends a request to your server for it. If the response is 200 OK and has a body, Googlebot indexes the page contents, extracts any links from it, and crawls them. Actually, it doesn’t crawl them straight away – it just adds them onto the end of a queue somewhere to be crawled later.

If the response is a 4xx (permanent error) or 5xx (temporary error), Googlebot seems to put the URL back on the queue for a few more goes before it gives up.

If the response is a 3xx redirect, and this is the kicker, Googlebot does not always perform the redirect straight away. It may take the new URL and just add it onto the end of the queue to be crawled later. It seems to me (based on what I’ve seen in my apache logs) that if the URL is different from the first, Googlebot will queue it up for later; but if the URL is identical, it will usually try it straight away.

You may see the problem here: Googlebot visits:

http://www.foothillschurch.org.au/apex/f?p=102:1

Our site creates a session, and responds with a 302 temporary redirect to:

http://www.foothillschurch.org.au/apex/f?p=102:1:48327482923832:::::

Googlebot dutifully notes this new URL and queues it up to crawl later. Meanwhile, our server is waiting patiently for it to get back, but it never does – so the session automatically expires. Much later, Googlebot visits:

http://www.foothillschurch.org.au/apex/f?p=102:1:48327482923832:::::

Our site sees the expired session, creates a new one, and responds with another 302 temporary redirect to:

http://www.foothillschurch.org.au/apex/f?p=102:1:9783829383342:::::

Googlebot dutifully notes this new URL and queues it up to crawl later, etc. etc. etc. Also, it’s not even as benign as that: each URL is not tried just once, but many many times (depending on what speed setting you’ve got the crawler on) – and every single time, our server responds with a brand-new, unique session ID. I hope you can now see why our little site crashed under the load – it quickly filled up the apache logs, it quickly filled up the debug logs in apex, and it quickly overflowed the poorly-configured archive log.

The way to solve this problem is to stop exposing these session IDs in the URL – and in Apex you can do that by setting the session ID to zero, e.g.:

http://www.foothillschurch.org.au/apex/f?p=CHURCH:1:0

Behind the scenes, Apex still creates a session, but whenever it generates a URL with #SESSION# it substitutes zero instead of the internal session ID. This method is great for people who wish to bookmark a page in an application that doesn’t require authentication. It also seems to work for the Googlebot crawler.

The above URL will still cause a 302 temporary redirect, however; apex will redirect it to:

http://www.foothillschurch.org.au/apex/f?p=CHURCH:1:0:::::

You might think that this final URL would stop the redirects, wouldn’t you? Well, it doesn’t. You can see what happens if you open this URL in Google Chrome in incognito mode. First, open an incognito window, then choose the Tools menu, Developer Tools. Select the Network tab. Then, paste the URL into the address bar and press Enter.

You will find that the first call to apex/f receives a 302 (temporary redirect). If you click this entry and choose Headers, you’d see something like this:

Request URL:http://www.foothillschurch.org.au/apex/f?p=CHURCH:1:0:::::
Request Method:GET
Status Code:302 Found
Request Headers:
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Charset:ISO-8859-1,utf-8;q=0.7,*;q=0.3
Accept-Encoding:gzip,deflate,sdch
Accept-Language:en-US,en;q=0.8
Connection:keep-alive
Host:www.foothillschurch.org.au
User-Agent:Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.835.187 Safari/535.1
Response Headers:
Cache-Control:max-age=0
Connection:Keep-Alive
Content-Length:0
Content-Type:text/html; charset=UTF-8
Date:Thu, 06 Oct 2011 22:54:10 GMT
Expires:Thu, 06 Oct 2011 22:54:10 GMT
Keep-Alive:timeout=10, max=100
Location:f?p=CHURCH:1:0:::::
Server:Oracle XML DB/Oracle Database
Set-Cookie:WWV_CUSTOM-F_5238514445419534_102=D6E147387BD4C9DA
WWV_PUBLIC_SESSION_102=2140144576372238
X-DB-Content-length:0

Notice that the request sent no cookies, and the response was a 302 including some cookies (WWV_CUSTOM-F_blablabla and WWV_PUBLIC_SESSION_102).

If you click on the next line (the second call to apex/f) and look at the Headers view, you’ll see this interaction instead:

Request URL:http://www.foothillschurch.org.au/apex/f?p=CHURCH:1:0:::::
Request Method:GET
Status Code:200 OK
Request Headers:
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Charset:ISO-8859-1,utf-8;q=0.7,*;q=0.3
Accept-Encoding:gzip,deflate,sdch
Accept-Language:en-US,en;q=0.8
Connection:keep-alive
Cookie:WWV_CUSTOM-F_5238514445419534_102=D6E147387BD4C9DA; WWV_PUBLIC_SESSION_102=2140144576372238
Host:www.foothillschurch.org.au
User-Agent:Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.835.187 Safari/535.1
Response Headers:
Cache-Control:max-age=0
Connection:Keep-Alive
Content-Type:text/html; charset=UTF-8
Date:Thu, 06 Oct 2011 22:54:10 GMT
Expires:Thu, 06 Oct 2011 22:54:10 GMT
Keep-Alive:timeout=10, max=99
Server:Oracle XML DB/Oracle Database
Transfer-Encoding:chunked
X-DB-Content-length:11291

This time, the request included the cookies, and the apex engine matched them against a valid and current session; so it responds with the desired 200 (OK) response and the body of the web page.

Gradually, as I was working all this out, I fixed individual problems one by one – turning off debug mode in the application, setting the crawler to a slower speed, and fixing the archive logging. I also added a rel=canonical link in the header of every page. However, the root cause was these URLs being tried by Googlebot, which were quickly escalating as time went by. I didn’t want to use robots.txt to stop it completely, although that might be a valid solution for some cases, because that would remove our Google listing.

I raised a question on the Google webmaster forum to see if there was some way to remove all these URLs from the Googlebot queue that had session IDs. The only way to remove URLs is to (a) add them to your robots.txt, and (b) submit individual URLs to be removed via Google webmaster tools.

In the end, with some help some contributors to the forum, I worked out how to set up Apache to stop these session IDs in their tracks, by doing a 301 (permanent redirect) to a URL with a 0 (zero) session ID. The magic words, added to my apache conf file, were:

RewriteCond %{QUERY_STRING} ^p=102:([a-z0-9]*):([0-9]*):(.*)$ [NC,OR]
RewriteCond %{QUERY_STRING} ^p=CHURCH:([a-z0-9]*):([1-9][0-9]*):(.*)$ [NC]
RewriteRule /apex/f /apex/f?p=CHURCH:%1:0:%3 [R=permanent,L]

What these do is look for any URL with a parameter p where the app ID is 102 or “CHURCH”, and redirects to a new URL that is identical except that it uses the app alias (CHURCH) and sets the session ID to 0.

Breaking it down:

RewriteCond introduces a condition for the following RewriteRule. It’s like an “IF” test.

%{QUERY_STRING} means we want to test the query string (i.e. the part of the URL following the ?).

^p=102:([a-z0-9]*):([0-9]*):(.*)$ this is the regular expression that is used to test for a match. In this case, it is looking for a query string that starts with “p=102:“, followed by a string of letters or digits (the page ID or alias), followed by “:“, followed by a string of digits (the session ID), followed by “:“, and ending with any string of characters. The parts in parentheses – i.e. ([a-z0-9]*) and ([0-9]*) and (.*) will be available for later reuse as substitution variables %1, %2 and %3.

^p=CHURCH:([a-z0-9]*):([1-9][0-9]*):(.*)$ is a similar regular expression, with a slightly different rule for the middle part (the session ID) – it only matches where the session ID is not zero – ([1-9][0-9]*) only matches a string of digits that starts with 1-9. This is because we don’t want our rewrite rule triggering if the app alias is already “CHURCH” and the session ID is already zero.

[NC] “not case sensitive” flag – so that “church” and “Church” will match as well.

[OR] “OR” flag – the condition should be “OR”-ed with the following condition. If either of the RewriteCond directives match, then we want to use the same RewriteRule.

RewriteRule directs the engine to modify the request URI.

/apex/f identifies the part of the request to be rewritten.

/apex/f?p=CHURCH:%1:0:%3 is the rewritten URL. %1 and %3 extract the page ID/alias, and the part of the query string following the session ID, respectively.

[R=permanent] “Redirect” flag – by default this would do a 302 (temporary) redirect, but we have specified “permanent” so that it will do a 301 permanent redirect.

[L] “Last” flag – no further rewrite rules (if any) should be applied.

It works! Everything is now back to sane levels, and I can now enjoy life.

UPDATE Oct 2020
Introduced in APEX 20.1 is the new Friendly URLs option which changes things a lot for this topic – Friendly URLs on a public page no longer require a session ID.
If the page is public and the URL has no session parameter, APEX immediately returns a 200 OK response with the page – which is perfect for Googlebot
.