Tag: google

Googlebot, Apex Session IDs, and Cookies

Recently I had a bit of a saga with a public-facing website running on Oracle Apex (www.foothillschurch.org.au, if you’re curious) getting hammered by Googlebot. We went live with a new version of the site, but I’d forgotten to make sure that all the links set the Session ID to 0 (zero).

What is this session ID?

Every visit to an Apex application needs to have a session. Each unique session is recorded in the database, and keeps track of the state of all the variables for the various pages you visit. Normally, the session is identified by a Session ID which is embedded in the “p” parameter of the URL.

For example, if you visit the page:


You’ll notice that the “p” parameter only specifies the app ID (102) and the page ID (1). What apex does is responds with a 302 temporary redirect, that tells the client to redirect to a new URL containing a newly generated session ID, e.g.:


Behind the scenes, it’s not just changing the URL – it’s also sending a cookie to the client to be used for subsequent calls. I’ll get back to this later.

Whenever you navigate around a normal apex site, the session ID gets copied into each link so that the user’s session is preserved. If you manually change or remove the session ID from the URL, apex will redirect you to a newly created session ID.

In case you’re wondering, there’s no significant security risk behind the exposed session ID – no-one can “hijack” your session, even if they copy your session ID directly. That’s because there’s a cookie behind the scenes with a secret value that must match up with the session ID, and only someone with sufficient access to the database server could get enough data to do that.

If you store the URL containing a session ID (e.g. in your bookmarks) and re-use it much later, your session will have expired – in which case apex will create a new session, and 302 temporary redirect you to a new URL with the new session ID. Therefore, users can safely bookmark any page in apex.

But now we come, finally, to Googlebot, that little rascal. Now, we would like our public-facing site to be indexed by Google, so we need all the pages of the site that have relevant info to be crawlable.

The way Googlebot works, normally, is that it starts from a link to your page (e.g. on another website, or in a sitemap you submit to Google), e.g.


It checks that the URL is not forbidden by your robots.txt, and sends a request to your server for it. If the response is 200 OK and has a body, Googlebot indexes the page contents, extracts any links from it, and crawls them. Actually, it doesn’t crawl them straight away – it just adds them onto the end of a queue somewhere to be crawled later.

If the response is a 4xx (permanent error) or 5xx (temporary error), Googlebot seems to put the URL back on the queue for a few more goes before it gives up.

If the response is a 3xx redirect, and this is the kicker, Googlebot does not always perform the redirect straight away. It may take the new URL and just add it onto the end of the queue to be crawled later. It seems to me (based on what I’ve seen in my apache logs) that if the URL is different from the first, Googlebot will queue it up for later; but if the URL is identical, it will usually try it straight away.

Continue reading “Googlebot, Apex Session IDs, and Cookies”

Personal Blacklist for Google Search Results

Using Chrome? Sick of certain sites-I-won’t-name-here cluttering the search results when you google for technical issues?


Installed. Already blocked a few sites. Beautiful.

Plus, they assure us that the aggregated results blacklisted domains may well be used to improve Google’s page ranking algorithm in the future…

btw – Read JoerT’s comment on the extension :)

EDIT: just noticed it doesn’t work in conjunction with AdBlock – if I enable AdBlock, the blacklist stops working.

Google indexes duplicate pages from my Apex site – problem solved

Problem: when Google indexes my Apex web site, it considers the following URLs to be different pages:


Notice how my application with ID 100 has an alias of MYAPP, and page 1 has an alias of HOME; also, more duplicates happen due to the session ID; all these URLs point to pretty much the same content, but Google indexes them all as separate pages.

Google provides two features that help webmasters solve the duplicate page problem.

Solution #1: Parameter Handling – not very useful (for us)

This solution involves telling Google which parameters to ignore when indexing URLs. This doesn’t help us with Apex, because apex only uses one parameter – “p”; if we were to tell Google to ignore the “p” parameter it would consider ALL pages in our site to be identical, which is not correct.

Solution #2: Specify Your Canonical – very useful!


<link rel="canonical" href="http://www.example.com/product.php?item=swedish-fish" />

This works nicely for us – for any page that we want we can tell Google what URL should be the “canonical” or “official” URL for that page. We can use this in our Apex applications in a number of ways. Each has advantages and disadvantages and YMMV, and it depends on how many different kinds of pages you have and whether you want the same canonical form for all pages, or if you want it customised for individual pages.

A. Custom canonical URL for each page.

This option will probably be the most generally useful, since some pages (e.g. multi-row paged results) won’t work so well with a canonical URL, so you’ll want to specify a canonical URL for just some key pages on your site.

To do this, go to the Page editor and edit the Page Attributes, edit the HTML Header and add the following:

<link rel="canonical" href="/apex/f?p=&APP_ID.:&APP_PAGE_ID.:0"/>

  • You can add the full URL instead of a relative one if you want, but note if you do that it must be on the same domain (e.g. if your site is http://www.mysite.com, you can’t have a canonical URL pointing to myothersite.com). Anyway, Google don’t mind if you use relative URLs here, so that’s what I do.
  • You don’t have to use the &APP_xxx. substitution variables if you don’t want to – e.g. you could specify another application or page entirely if that makes sense for your app.
  • If your application has an alias, you could use that as the canonical URL:

    <link rel="canonical" href="/apex/f?p=&APP_ALIAS.:&APP_PAGE_ID.:0"/>

    Unfortunately, if Apex has a substitution variable for the Page Alias, I don’t know what it is.’

B. Global canonical URL for all pages in an application.

This option works well if you want all the pages to have the same form of canonical URL. Because we’ll use the &APP_PAGE_ID. substitution variable, it will still correctly give the correct URL for each page in the application.

To do this, go to the Shared Components, and open Themes. Open the theme in use by your application, then find the Page themes. Next to each Page Theme is a number that indicates how many pages use that Page Theme; those are the only ones you need to edit (although there’s nothing stopping you from editing all of them if you wish).

Click the Page Theme name to edit it. In the Header definition, add the canonical link – it must be inserted after the <head> tag, and prior to the </head> tag. For example:

<html lang="&BROWSER_LANGUAGE." xmlns:htmldb="http://htmldb.oracle.com">
<link rel="stylesheet" href="#IMAGE_PREFIX#themes/theme_16/theme_V3.css" type="text/css" />
<!--[if IE]><link rel="stylesheet" href="#IMAGE_PREFIX#themes/theme_16/ie.css" type="text/css" /><![endif]-->
<link rel="canonical" href="/apex/f?p=&APP_ALIAS.:&APP_PAGE_ID.:0"/>

Now, it’s important to test your changes thoroughly because many syntax errors you enter will not manifest in any obvious problems when browsing the site. Open your pages and View Source – check that the header section of the HTML includes the correct <link rel="canonical" ...> tag, and ensure that the URL resolves to the same page by copying it out and pasting it into your brower’s address bar.

Once that’s done that’s it! When Google next indexes your site it should honour your canonical URLs and remove duplicate pages from its indexes.