Googlebot, APEX Session IDs, and Cookies

Recently I had a bit of a saga with a public-facing website running on Oracle APEX (www.foothillschurch.org.au, if you’re curious) getting hammered by Googlebot. We went live with a new version of the site, but I’d forgotten to make sure that all the links set the Session ID to 0 (zero).

What is this session ID?

Every visit to an APEX application needs to have a session. Each unique session is recorded in the database, and keeps track of the state of all the variables for the various pages you visit. Normally, the session is identified by a Session ID which is embedded in the “p” parameter of the URL.

For example, if you visit the page:

http://www.foothillschurch.org.au/apex/f?p=102:1

You’ll notice that the “p” parameter only specifies the app ID (102) and the page ID (1). APEX responds with a 302 (temporary) redirect, which tells the client to go to a new URL containing a newly generated session ID, e.g.:

http://www.foothillschurch.org.au/apex/f?p=102:1:45164531548964:::::

Behind the scenes, it’s not just changing the URL – it’s also sending a cookie to the client to be used for subsequent calls. I’ll get back to this later.

Whenever you navigate around a normal APEX site, the session ID gets copied into each link so that the user’s session is preserved. If you manually change or remove the session ID from the URL, APEX will redirect you to a URL with a newly created session ID.
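To make the URL structure concrete, here is a minimal Python sketch that splits the “p” parameter into its colon-separated fields. The function name parse_apex_p is made up for illustration, and only the first three fields (app, page, session) are named; anything after the session ID is returned untouched.

```python
from urllib.parse import urlsplit, parse_qs

def parse_apex_p(url):
    """Split the "p" query parameter of an APEX URL into its fields.

    Hypothetical helper for illustration only; it names just the
    first three colon-separated fields described in this post.
    """
    query = urlsplit(url).query
    p = parse_qs(query, keep_blank_values=True)["p"][0]
    fields = p.split(":")
    # Pad so that app, page and session are always present
    fields += [""] * (3 - len(fields))
    return {
        "app": fields[0],
        "page": fields[1],
        "session": fields[2],   # empty string when the URL carries no session ID
        "rest": fields[3:],
    }

print(parse_apex_p("http://www.foothillschurch.org.au/apex/f?p=102:1"))
print(parse_apex_p("http://www.foothillschurch.org.au/apex/f?p=102:1:45164531548964:::::"))
```

The first URL yields an empty session field, which is the case where APEX has to create a session and redirect.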

In case you’re wondering, there’s no significant security risk behind the exposed session ID – no-one can “hijack” your session, even if they copy your session ID directly. That’s because there’s a cookie behind the scenes with a secret value that must match up with the session ID, and only someone with sufficient access to the database server could get enough data to do that.

If you store the URL containing a session ID (e.g. in your bookmarks) and re-use it much later, your session will have expired – in which case APEX will create a new session, and send you a 302 (temporary) redirect to a new URL with the new session ID. Therefore, users can safely bookmark any page in APEX.

But now we come, finally, to Googlebot, that little rascal. Now, we would like our public-facing site to be indexed by Google, so we need all the pages of the site that have relevant info to be crawlable.

The way Googlebot normally works is that it starts from a link to your page (e.g. on another website, or in a sitemap you submit to Google), such as:

http://www.foothillschurch.org.au/apex/f?p=102:1

It checks that the URL is not forbidden by your robots.txt, and sends a request to your server for it. If the response is 200 OK and has a body, Googlebot indexes the page contents, extracts any links from it, and crawls them. Actually, it doesn’t crawl them straight away – it just adds them onto the end of a queue somewhere to be crawled later.

If the response is a 4xx (client error) or 5xx (server error), Googlebot seems to put the URL back on the queue for a few more goes before it gives up.

If the response is a 3xx redirect, and this is the kicker, Googlebot does not always perform the redirect straight away. It may take the new URL and just add it onto the end of the queue to be crawled later. It seems to me (based on what I’ve seen in my Apache logs) that if the redirect URL is different from the original, Googlebot will queue it up for later; but if the URL is identical, it will usually try it straight away.

You may see the problem here: Googlebot visits:

http://www.foothillschurch.org.au/apex/f?p=102:1

Our site creates a session, and responds with a 302 temporary redirect to:

http://www.foothillschurch.org.au/apex/f?p=102:1:48327482923832:::::

Googlebot dutifully notes this new URL and queues it up to crawl later. Meanwhile, our server is waiting patiently for it to get back, but it never does – so the session automatically expires. Much later, Googlebot visits:

http://www.foothillschurch.org.au/apex/f?p=102:1:48327482923832:::::

Our site sees the expired session, creates a new one, and responds with another 302 temporary redirect to:

http://www.foothillschurch.org.au/apex/f?p=102:1:9783829383342:::::

Googlebot dutifully notes this new URL and queues it up to crawl later, and so on, indefinitely. It’s not even as benign as that: each URL is not tried just once, but many, many times (depending on what speed setting you’ve got the crawler on) – and every single time, our server responds with a brand-new, unique session ID. I hope you can now see why our little site crashed under the load – it quickly filled up the Apache logs, it quickly filled up the debug logs in APEX, and it quickly overflowed the poorly-configured archive log.
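The loop can be sketched as a toy model in Python. This is only an illustration of the pattern described above, not how Googlebot or APEX actually work internally: it assumes every queued URL is revisited only after its session has expired, and the function name and the starting ID of 1000 are invented.

```python
from collections import deque
from itertools import count

def simulate_crawl(visits):
    """Toy model of the redirect loop: returns (sessions created, URLs still queued)."""
    next_id = count(1000)              # stand-in for the server's session ID generator
    sessions_created = 0
    queue = deque(["f?p=102:1"])       # the crawler starts from one clean URL
    for _ in range(visits):
        if not queue:
            break
        queue.popleft()                # crawler requests the URL, long after it was queued
        # Assumption: by now any session ID in that URL has expired,
        # so the server creates a new session...
        session_id = next(next_id)
        sessions_created += 1
        # ...and answers 302; the crawler queues the target instead of following it now.
        queue.append(f"f?p=102:1:{session_id}:::::")
    return sessions_created, len(queue)

print(simulate_crawl(50))
```

However many visits you allow, every one of them creates a session and the queue never empties. The real situation was worse, because each URL was retried several times.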

The way to solve this problem is to stop exposing these session IDs in the URL – and in APEX you can do that by setting the session ID to zero, e.g.:

http://www.foothillschurch.org.au/apex/f?p=CHURCH:1:0

Behind the scenes, APEX still creates a session, but whenever it generates a URL with #SESSION# it substitutes zero instead of the internal session ID. This method is great for people who wish to bookmark a page in an application that doesn’t require authentication. It also seems to work for the Googlebot crawler.

The above URL will still cause a 302 temporary redirect, however; APEX will redirect it to:

http://www.foothillschurch.org.au/apex/f?p=CHURCH:1:0:::::

You might think that this final URL would stop the redirects, wouldn’t you? Well, it doesn’t. You can see what happens if you open this URL in Google Chrome in incognito mode. First, open an incognito window, then choose the Tools menu, Developer Tools. Select the Network tab. Then, paste the URL into the address bar and press Enter.

You will find that the first call to apex/f receives a 302 (temporary redirect). If you click this entry and choose Headers, you’ll see something like this:

Request URL:http://www.foothillschurch.org.au/apex/f?p=CHURCH:1:0:::::
Request Method:GET
Status Code:302 Found
Request Headers:
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Charset:ISO-8859-1,utf-8;q=0.7,*;q=0.3
Accept-Encoding:gzip,deflate,sdch
Accept-Language:en-US,en;q=0.8
Connection:keep-alive
Host:www.foothillschurch.org.au
User-Agent:Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.835.187 Safari/535.1
Response Headers:
Cache-Control:max-age=0
Connection:Keep-Alive
Content-Length:0
Content-Type:text/html; charset=UTF-8
Date:Thu, 06 Oct 2011 22:54:10 GMT
Expires:Thu, 06 Oct 2011 22:54:10 GMT
Keep-Alive:timeout=10, max=100
Location:f?p=CHURCH:1:0:::::
Server:Oracle XML DB/Oracle Database
Set-Cookie:WWV_CUSTOM-F_5238514445419534_102=D6E147387BD4C9DA
WWV_PUBLIC_SESSION_102=2140144576372238
X-DB-Content-length:0

Notice that the request sent no cookies, and the response was a 302 including some cookies (WWV_CUSTOM-F_blablabla and WWV_PUBLIC_SESSION_102).

If you click on the next line (the second call to apex/f) and look at the Headers view, you’ll see this interaction instead:

Request URL:http://www.foothillschurch.org.au/apex/f?p=CHURCH:1:0:::::
Request Method:GET
Status Code:200 OK
Request Headers:
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Charset:ISO-8859-1,utf-8;q=0.7,*;q=0.3
Accept-Encoding:gzip,deflate,sdch
Accept-Language:en-US,en;q=0.8
Connection:keep-alive
Cookie:WWV_CUSTOM-F_5238514445419534_102=D6E147387BD4C9DA; WWV_PUBLIC_SESSION_102=2140144576372238
Host:www.foothillschurch.org.au
User-Agent:Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.835.187 Safari/535.1
Response Headers:
Cache-Control:max-age=0
Connection:Keep-Alive
Content-Type:text/html; charset=UTF-8
Date:Thu, 06 Oct 2011 22:54:10 GMT
Expires:Thu, 06 Oct 2011 22:54:10 GMT
Keep-Alive:timeout=10, max=99
Server:Oracle XML DB/Oracle Database
Transfer-Encoding:chunked
X-DB-Content-length:11291

This time, the request included the cookies, and the APEX engine matched them against a valid and current session; so it responded with the desired 200 (OK) response and the body of the web page.
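The two exchanges above boil down to something like the following Python sketch. It is a toy stand-in that reproduces only what the headers show (no cookie gives a 302 to the same URL plus a cookie; a valid cookie gives a 200). It is not APEX’s real logic, and the class, function and cookie handling are all invented for illustration.

```python
import secrets

class ToySessionServer:
    """Toy stand-in for the observed behaviour; not APEX's real implementation."""

    def __init__(self):
        self.valid_cookies = set()

    def handle(self, url, cookie=None):
        if cookie in self.valid_cookies:
            return 200, {}                        # cookie matches a live session: serve the page
        # No cookie (or an unknown one): create a session, set a cookie,
        # and redirect the client back to the same URL.
        new_cookie = secrets.token_hex(8)
        self.valid_cookies.add(new_cookie)
        return 302, {"Location": url, "Set-Cookie": new_cookie}

def fetch(server, url, max_hops=5):
    """Browser-like client: follows redirects and sends back any cookie it was given."""
    cookie, statuses = None, []
    for _ in range(max_hops):
        status, headers = server.handle(url, cookie)
        statuses.append(status)
        if status != 302:
            break
        url = headers["Location"]
        cookie = headers["Set-Cookie"]
    return statuses

print(fetch(ToySessionServer(), "f?p=CHURCH:1:0:::::"))
```

A client that keeps the cookie sees exactly the 302-then-200 sequence from the Network tab.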

Gradually, as I was working all this out, I fixed individual problems one by one – turning off debug mode in the application, setting the crawler to a slower speed, and fixing the archive logging. I also added a rel=canonical link in the header of every page. However, the root cause remained: Googlebot kept trying these session ID URLs, and their number was growing quickly as time went by. I didn’t want to use robots.txt to stop it completely, although that might be a valid solution for some cases, because that would remove our Google listing.

I raised a question on the Google webmaster forum to see if there was some way to remove all the URLs with session IDs from the Googlebot queue. It turned out that the only way to remove URLs is to (a) add them to your robots.txt, and (b) submit individual URLs to be removed via Google Webmaster Tools.

In the end, with some help from contributors to the forum, I worked out how to set up Apache to stop these session IDs in their tracks, by doing a 301 (permanent redirect) to a URL with a 0 (zero) session ID. The magic words, added to my Apache conf file, were:

RewriteCond %{QUERY_STRING} ^p=102:([a-z0-9]*):([0-9]*):(.*)$ [NC,OR]
RewriteCond %{QUERY_STRING} ^p=CHURCH:([a-z0-9]*):([1-9][0-9]*):(.*)$ [NC]
RewriteRule /apex/f /apex/f?p=CHURCH:%1:0:%3 [R=permanent,L]

These directives look for any URL with a parameter p where the app ID is 102 or “CHURCH”, and redirect to a new URL that is identical except that it uses the app alias (CHURCH) and sets the session ID to 0.

Breaking it down:

RewriteCond introduces a condition for the following RewriteRule. It’s like an “IF” test.

%{QUERY_STRING} means we want to test the query string (i.e. the part of the URL following the ?).

^p=102:([a-z0-9]*):([0-9]*):(.*)$ is the regular expression used to test for a match. In this case, it looks for a query string that starts with “p=102:”, followed by a string of letters or digits (the page ID or alias), followed by “:”, followed by a string of digits (the session ID), followed by “:”, and ending with any string of characters. The parts in parentheses, i.e. ([a-z0-9]*), ([0-9]*) and (.*), are available for later reuse as the backreferences %1, %2 and %3.

^p=CHURCH:([a-z0-9]*):([1-9][0-9]*):(.*)$ is a similar regular expression, with a slightly different rule for the middle part (the session ID) – it only matches where the session ID is not zero – ([1-9][0-9]*) only matches a string of digits that starts with 1-9. This is because we don’t want our rewrite rule triggering if the app alias is already “CHURCH” and the session ID is already zero.

[NC] “not case sensitive” flag – so that “church” and “Church” will match as well.

[OR] “OR” flag – the condition should be “OR”-ed with the following condition. If either of the RewriteCond directives match, then we want to use the same RewriteRule.

RewriteRule directs the engine to modify the request URI.

/apex/f is the pattern that is matched against the URL path of the request; the rule only applies to requests whose path contains it.

/apex/f?p=CHURCH:%1:0:%3 is the rewritten URL. %1 and %3 insert the page ID/alias, and the part of the query string following the session ID, respectively, as captured by whichever RewriteCond matched.

[R=permanent] “Redirect” flag – by default this would do a 302 (temporary) redirect, but we have specified “permanent” so that it will do a 301 permanent redirect.

[L] “Last” flag – no further rewrite rules (if any) should be applied.
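If you want to check what the rule does to a given URL without touching a live server, the same logic can be approximated in Python. This is a sketch of my reading of the three directives, not something Apache provides: the function name rewrite is invented, [NC] is expressed as re.IGNORECASE, and [OR] as a plain Python “or”.

```python
import re

# The two RewriteCond patterns; [NC] becomes re.IGNORECASE
COND_APP_ID = re.compile(r"^p=102:([a-z0-9]*):([0-9]*):(.*)$", re.IGNORECASE)
COND_ALIAS = re.compile(r"^p=CHURCH:([a-z0-9]*):([1-9][0-9]*):(.*)$", re.IGNORECASE)

def rewrite(path, query_string):
    """Return the 301 redirect target, or None if the rule does not fire."""
    # RewriteRule pattern: an unanchored match against the URL path
    if not re.search(r"/apex/f", path):
        return None
    # [OR]: either condition is enough; the groups come from the one that matched
    match = COND_APP_ID.match(query_string) or COND_ALIAS.match(query_string)
    if match is None:
        return None
    page, rest = match.group(1), match.group(3)    # %1 and %3
    return f"/apex/f?p=CHURCH:{page}:0:{rest}"

print(rewrite("/apex/f", "p=102:1:48327482923832:::::"))  # /apex/f?p=CHURCH:1:0:::::
print(rewrite("/apex/f", "p=CHURCH:1:0:::::"))            # None
```

The second call returns None because the alias is already CHURCH and the session ID is already zero, which is what stops the 301 from redirecting to itself.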

It works! Everything is now back to sane levels, and I can now enjoy life.

UPDATE Oct 2020
Introduced in APEX 20.1 is the new Friendly URLs option which changes things a lot for this topic – Friendly URLs on a public page no longer require a session ID.
If the page is public and the URL has no session parameter, APEX immediately returns a 200 OK response with the page – which is perfect for Googlebot.


Comments

  1. Hi Jeff,

    Be careful here:
    “In case you’re wondering, there’s no significant security risk behind the exposed session ID – no-one can “hijack” your session, even if they copy your session ID directly. That’s because there’s a cookie behind the scenes with a secret value that must match up with the session ID, and only someone with sufficient access to the database server could get enough data to do that.”

    There IS a security risk on public pages and using collections. If you go to a link with a hard coded session ID it will not change it, it will use the same session ID. Collections are based on session ID so if one person entered data into the collection, the next user to browse with the same session ID will now be able to see the data.

    • Hi Quinny,

      Do you have any references or test cases that demonstrate that behaviour? If so, I’d think that would indeed be quite a big security risk which would get a lot of attention from Oracle. However, as far as I can see Oracle APEX always responds with a redirect if the client doesn’t send the correct cookie that matches the given session ID.

      Jeff

  2. I haven’t set up any test cases specifically to show this.
    We actually had this happen because of a hard coded session ID in a link for a public page where users could enter their details. Once they submitted page 1 the collection was saved. If they clicked the back button or went back to edit their details on the first page, the collection was re-queried and they could sometimes pull up another user’s details.
    I guess because the session ID was already in the link, the cookie was created at that time with the hard-coded session ID.

    • It sounds like your application was not using v('SESSION') to determine the session ID to save to the database table, but instead relying on the URL in some way? I may be wrong, but I’d first suspect a problem with the design of the application.

  3. Hi Jeff,

    There sure are some problems with the Apex session ID in the URL. I’m yet to see an advantage…
    Have you ever heard a good reason for Apex to even show the ID in the URL?

    I mean, it’s the cookie that counts. Right?

    Or have I missed something…

    Kind regards

    Tomas
    Stockholm, Sweden

    • Good question. Maybe it allows me to have two separate Apex sessions open in separate windows on the same computer?

  4. Hmmm…
    I think IE and FF both use the same cookie collection for all their tabs/windows.
    Even when starting a new instance of the browser, both will share cookies.
    The only way I’ve found to run two different sessions on one pc against the same Apex server is running one IE and one FF.

    Tomas

  5. Hi Jeffrey!
    Could you post the full VirtualHost section of your httpd.conf?

    • Hi Denis. I host several sites, so I have one VirtualHost section for each, more or less like this:

      <VirtualHost *:80>
          ServerName mydomain.com
          ServerAlias mydomain.com *.mydomain.com
          ServerAdmin webmaster@mydomain.com
          DocumentRoot /var/www/html
          ErrorLog logs/error_log
          CustomLog logs/access_log combined
      </VirtualHost>

  6. Is there a way to disable Session 0 in a corporate intranet environment?

    • Jeffrey Kemp
      17 May 2014 - 6:41 pm

      Hi Derek, not sure what you mean – if you have any links that go to session 0, you can change them to use the #SESSION# substitution variable instead. If users have bookmarks with session 0, they will be redirected to your login page if the page requires authentication. If users go to a public page, I’m not sure what the problem is with using session 0. Jeff
