Googlebot, APEX Session IDs, and Cookies

Recently I had a bit of a saga with a public-facing website running on Oracle APEX (www.foothillschurch.org.au, if you’re curious) getting hammered by Googlebot. We went live with a new version of the site, but I’d forgotten to make sure that all the links set the Session ID to 0 (zero).

What is this session ID?

Every visit to an APEX application needs to have a session. Each unique session is recorded in the database, and keeps track of the state of all the variables for the various pages you visit. Normally, the session is identified by a Session ID which is embedded in the “p” parameter of the URL.

For example, if you visit the page:

http://www.foothillschurch.org.au/apex/f?p=102:1

You’ll notice that the “p” parameter only specifies the app ID (102) and the page ID (1). What apex does is responds with a 302 temporary redirect, that tells the client to redirect to a new URL containing a newly generated session ID, e.g.:

http://www.foothillschurch.org.au/apex/f?p=102:1:45164531548964:::::

Behind the scenes, it’s not just changing the URL – it’s also sending a cookie to the client to be used for subsequent calls. I’ll get back to this later.

Whenever you navigate around a normal apex site, the session ID gets copied into each link so that the user’s session is preserved. If you manually change or remove the session ID from the URL, apex will redirect you to a newly created session ID.

In case you’re wondering, there’s no significant security risk behind the exposed session ID – no-one can “hijack” your session, even if they copy your session ID directly. That’s because there’s a cookie behind the scenes with a secret value that must match up with the session ID, and only someone with sufficient access to the database server could get enough data to do that.

If you store the URL containing a session ID (e.g. in your bookmarks) and re-use it much later, your session will have expired – in which case APEX will create a new session, and 302 temporary redirect you to a new URL with the new session ID. Therefore, users can safely bookmark any page in apex.

But now we come, finally, to Googlebot, that little rascal. Now, we would like our public-facing site to be indexed by Google, so we need all the pages of the site that have relevant info to be crawlable.

The way Googlebot works, normally, is that it starts from a link to your page (e.g. on another website, or in a sitemap you submit to Google), e.g.

http://www.foothillschurch.org.au/apex/f?p=102:1

It checks that the URL is not forbidden by your robots.txt, and sends a request to your server for it. If the response is 200 OK and has a body, Googlebot indexes the page contents, extracts any links from it, and crawls them. Actually, it doesn’t crawl them straight away – it just adds them onto the end of a queue somewhere to be crawled later.

If the response is a 4xx (permanent error) or 5xx (temporary error), Googlebot seems to put the URL back on the queue for a few more goes before it gives up.

If the response is a 3xx redirect, and this is the kicker, Googlebot does not always perform the redirect straight away. It may take the new URL and just add it onto the end of the queue to be crawled later. It seems to me (based on what I’ve seen in my apache logs) that if the URL is different from the first, Googlebot will queue it up for later; but if the URL is identical, it will usually try it straight away.

You may see the problem here: Googlebot visits:

http://www.foothillschurch.org.au/apex/f?p=102:1

Our site creates a session, and responds with a 302 temporary redirect to:

http://www.foothillschurch.org.au/apex/f?p=102:1:48327482923832:::::

Googlebot dutifully notes this new URL and queues it up to crawl later. Meanwhile, our server is waiting patiently for it to get back, but it never does – so the session automatically expires. Much later, Googlebot visits:

http://www.foothillschurch.org.au/apex/f?p=102:1:48327482923832:::::

Our site sees the expired session, creates a new one, and responds with another 302 temporary redirect to:

http://www.foothillschurch.org.au/apex/f?p=102:1:9783829383342:::::

Googlebot dutifully notes this new URL and queues it up to crawl later, etc. etc. etc. Also, it’s not even as benign as that: each URL is not tried just once, but many many times (depending on what speed setting you’ve got the crawler on) – and every single time, our server responds with a brand-new, unique session ID. I hope you can now see why our little site crashed under the load – it quickly filled up the apache logs, it quickly filled up the debug logs in apex, and it quickly overflowed the poorly-configured archive log.

The way to solve this problem is to stop exposing these session IDs in the URL – and in Apex you can do that by setting the session ID to zero, e.g.:

http://www.foothillschurch.org.au/apex/f?p=CHURCH:1:0

Behind the scenes, Apex still creates a session, but whenever it generates a URL with #SESSION# it substitutes zero instead of the internal session ID. This method is great for people who wish to bookmark a page in an application that doesn’t require authentication. It also seems to work for the Googlebot crawler.

The above URL will still cause a 302 temporary redirect, however; apex will redirect it to:

http://www.foothillschurch.org.au/apex/f?p=CHURCH:1:0:::::

You might think that this final URL would stop the redirects, wouldn’t you? Well, it doesn’t. You can see what happens if you open this URL in Google Chrome in incognito mode. First, open an incognito window, then choose the Tools menu, Developer Tools. Select the Network tab. Then, paste the URL into the address bar and press Enter.

You will find that the first call to apex/f receives a 302 (temporary redirect). If you click this entry and choose Headers, you’d see something like this:

Request URL:http://www.foothillschurch.org.au/apex/f?p=CHURCH:1:0:::::
Request Method:GET
Status Code:302 Found
Request Headers:
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Charset:ISO-8859-1,utf-8;q=0.7,*;q=0.3
Accept-Encoding:gzip,deflate,sdch
Accept-Language:en-US,en;q=0.8
Connection:keep-alive
Host:www.foothillschurch.org.au
User-Agent:Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.835.187 Safari/535.1
Response Headers:
Cache-Control:max-age=0
Connection:Keep-Alive
Content-Length:0
Content-Type:text/html; charset=UTF-8
Date:Thu, 06 Oct 2011 22:54:10 GMT
Expires:Thu, 06 Oct 2011 22:54:10 GMT
Keep-Alive:timeout=10, max=100
Location:f?p=CHURCH:1:0:::::
Server:Oracle XML DB/Oracle Database
Set-Cookie:WWV_CUSTOM-F_5238514445419534_102=D6E147387BD4C9DA
WWV_PUBLIC_SESSION_102=2140144576372238
X-DB-Content-length:0

Notice that the request sent no cookies, and the response was a 302 including some cookies (WWV_CUSTOM-F_blablabla and WWV_PUBLIC_SESSION_102).

If you click on the next line (the second call to apex/f) and look at the Headers view, you’ll see this interaction instead:

Request URL:http://www.foothillschurch.org.au/apex/f?p=CHURCH:1:0:::::
Request Method:GET
Status Code:200 OK
Request Headers:
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Charset:ISO-8859-1,utf-8;q=0.7,*;q=0.3
Accept-Encoding:gzip,deflate,sdch
Accept-Language:en-US,en;q=0.8
Connection:keep-alive
Cookie:WWV_CUSTOM-F_5238514445419534_102=D6E147387BD4C9DA; WWV_PUBLIC_SESSION_102=2140144576372238
Host:www.foothillschurch.org.au
User-Agent:Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.835.187 Safari/535.1
Response Headers:
Cache-Control:max-age=0
Connection:Keep-Alive
Content-Type:text/html; charset=UTF-8
Date:Thu, 06 Oct 2011 22:54:10 GMT
Expires:Thu, 06 Oct 2011 22:54:10 GMT
Keep-Alive:timeout=10, max=99
Server:Oracle XML DB/Oracle Database
Transfer-Encoding:chunked
X-DB-Content-length:11291

This time, the request included the cookies, and the apex engine matched them against a valid and current session; so it responds with the desired 200 (OK) response and the body of the web page.

Gradually, as I was working all this out, I fixed individual problems one by one – turning off debug mode in the application, setting the crawler to a slower speed, and fixing the archive logging. I also added a rel=canonical link in the header of every page. However, the root cause was these URLs being tried by Googlebot, which were quickly escalating as time went by. I didn’t want to use robots.txt to stop it completely, although that might be a valid solution for some cases, because that would remove our Google listing.

I raised a question on the Google webmaster forum to see if there was some way to remove all these URLs from the Googlebot queue that had session IDs. The only way to remove URLs is to (a) add them to your robots.txt, and (b) submit individual URLs to be removed via Google webmaster tools.

In the end, with some help some contributors to the forum, I worked out how to set up Apache to stop these session IDs in their tracks, by doing a 301 (permanent redirect) to a URL with a 0 (zero) session ID. The magic words, added to my apache conf file, were:

RewriteCond %{QUERY_STRING} ^p=102:([a-z0-9]*):([0-9]*):(.*)$ [NC,OR]
RewriteCond %{QUERY_STRING} ^p=CHURCH:([a-z0-9]*):([1-9][0-9]*):(.*)$ [NC]
RewriteRule /apex/f /apex/f?p=CHURCH:%1:0:%3 [R=permanent,L]

What these do is look for any URL with a parameter p where the app ID is 102 or “CHURCH”, and redirects to a new URL that is identical except that it uses the app alias (CHURCH) and sets the session ID to 0.

Breaking it down:

RewriteCond introduces a condition for the following RewriteRule. It’s like an “IF” test.

%{QUERY_STRING} means we want to test the query string (i.e. the part of the URL following the ?).

^p=102:([a-z0-9]*):([0-9]*):(.*)$ this is the regular expression that is used to test for a match. In this case, it is looking for a query string that starts with “p=102:“, followed by a string of letters or digits (the page ID or alias), followed by “:“, followed by a string of digits (the session ID), followed by “:“, and ending with any string of characters. The parts in parentheses – i.e. ([a-z0-9]*) and ([0-9]*) and (.*) will be available for later reuse as substitution variables %1, %2 and %3.

^p=CHURCH:([a-z0-9]*):([1-9][0-9]*):(.*)$ is a similar regular expression, with a slightly different rule for the middle part (the session ID) – it only matches where the session ID is not zero – ([1-9][0-9]*) only matches a string of digits that starts with 1-9. This is because we don’t want our rewrite rule triggering if the app alias is already “CHURCH” and the session ID is already zero.

[NC] “not case sensitive” flag – so that “church” and “Church” will match as well.

[OR] “OR” flag – the condition should be “OR”-ed with the following condition. If either of the RewriteCond directives match, then we want to use the same RewriteRule.

RewriteRule directs the engine to modify the request URI.

/apex/f identifies the part of the request to be rewritten.

/apex/f?p=CHURCH:%1:0:%3 is the rewritten URL. %1 and %3 extract the page ID/alias, and the part of the query string following the session ID, respectively.

[R=permanent] “Redirect” flag – by default this would do a 302 (temporary) redirect, but we have specified “permanent” so that it will do a 301 permanent redirect.

[L] “Last” flag – no further rewrite rules (if any) should be applied.

It works! Everything is now back to sane levels, and I can now enjoy life.


Personal Blacklist for Google Search Results

Using Chrome? Sick of certain sites-I-won’t-name-here cluttering the search results when you google for technical issues?

http://chrome.google.com/webstore/detail/nolijncfnkgaikbjbdaogikpmpbdcdef

Installed. Already blocked a few sites. Beautiful.

Plus, they assure us that the aggregated results blacklisted domains may well be used to improve Google’s page ranking algorithm in the future…

btw – Read JoerT’s comment on the extension 🙂

EDIT: just noticed it doesn’t work in conjunction with AdBlock – if I enable AdBlock, the blacklist stops working.


Google indexes duplicate pages from my APEX site – problem solved

Problem: when Google indexes my APEX web site, it considers the following URLs to be different pages:

http://www.site.com/apex/f?p=100:1:1234567890::::
http://www.site.com/apex/f?p=100:1:0::::
http://www.site.com/apex/f?p=MYAPP:1:46346346346::::
http://www.site.com/apex/f?p=MYAPP:1:34634634636::::
http://www.site.com/apex/f?p=MYAPP:HOME:46346346346::::
http://www.site.com/apex/f?p=MYAPP:HOME:0::::

Notice how my application with ID 100 has an alias of MYAPP, and page 1 has an alias of HOME; also, more duplicates happen due to the session ID; all these URLs point to pretty much the same content, but Google indexes them all as separate pages.

Google provides two features that help webmasters solve the duplicate page problem.

Solution #1: Parameter Handling – not very useful (for us)

This solution involves telling Google which parameters to ignore when indexing URLs. This doesn’t help us with APEX, because apex only uses one parameter – “p”; if we were to tell Google to ignore the “p” parameter it would consider ALL pages in our site to be identical, which is not correct.

Solution #2: Specify Your Canonical – very useful!

Example:

<head>
<link rel="canonical" href="http://www.example.com/product.php?item=swedish-fish" />
</head>

This works nicely for us – for any page that we want we can tell Google what URL should be the “canonical” or “official” URL for that page. We can use this in our APEX applications in a number of ways. Each has advantages and disadvantages and YMMV, and it depends on how many different kinds of pages you have and whether you want the same canonical form for all pages, or if you want it customised for individual pages.

A. Custom canonical URL for each page.

This option will probably be the most generally useful, since some pages (e.g. multi-row paged results) won’t work so well with a canonical URL, so you’ll want to specify a canonical URL for just some key pages on your site.

To do this, go to the Page editor and edit the Page Attributes, edit the HTML Header and add the following:

<link rel="canonical" href="/apex/f?p=&APP_ID.:&APP_PAGE_ID.:0"/>

  • You can add the full URL instead of a relative one if you want, but note if you do that it must be on the same domain (e.g. if your site is www.mysite.com, you can’t have a canonical URL pointing to myothersite.com). Anyway, Google don’t mind if you use relative URLs here, so that’s what I do.
  • You don’t have to use the &APP_xxx. substitution variables if you don’t want to – e.g. you could specify another application or page entirely if that makes sense for your app.
  • If your application has an alias, you could use that as the canonical URL:

    <link rel="canonical" href="/apex/f?p=&APP_ALIAS.:&APP_PAGE_ID.:0"/>
    Unfortunately, if APEX has a substitution variable for the Page Alias, I don’t know what it is. UPDATE 2017: APEX now provides the substitution variable APP_PAGE_ALIAS as of APEX 5.0.

B. Global canonical URL for all pages in an application.

This option works well if you want all the pages to have the same form of canonical URL. Because we’ll use the &APP_PAGE_ID. substitution variable, it will still correctly give the correct URL for each page in the application.

To do this, go to the Shared Components, and open Themes. Open the theme in use by your application, then find the Page themes. Next to each Page Theme is a number that indicates how many pages use that Page Theme; those are the only ones you need to edit (although there’s nothing stopping you from editing all of them if you wish).

Click the Page Theme name to edit it. In the Header definition, add the canonical link – it must be inserted after the <head> tag, and prior to the </head> tag. For example:

<html lang="&BROWSER_LANGUAGE." xmlns:htmldb="http://htmldb.oracle.com">
<head>
<title>#TITLE#</title>
<link rel="stylesheet" href="#IMAGE_PREFIX#themes/theme_16/theme_V3.css" type="text/css" />
<!--[if IE]><link rel="stylesheet" href="#IMAGE_PREFIX#themes/theme_16/ie.css" type="text/css" /><![endif]-->
<link rel="canonical" href="/apex/f?p=&APP_ALIAS.:&APP_PAGE_ID.:0"/>
#HEAD#
</head>
<body #ONLOAD#>#FORM_OPEN#

Now, it’s important to test your changes thoroughly because many syntax errors you enter will not manifest in any obvious problems when browsing the site. Open your pages and View Source – check that the header section of the HTML includes the correct <link rel="canonical" ...> tag, and ensure that the URL resolves to the same page by copying it out and pasting it into your brower’s address bar.

Once that’s done that’s it! When Google next indexes your site it should honour your canonical URLs and remove duplicate pages from its indexes.