Truncated Mean in Oracle

A colleague needed to get the average from a set of data, but disregarding the top 25% and the bottom 25%. We didn’t know of any built-in Oracle function that would do this, but a review of the Wikipedia page for Average yielded the technical term: truncated (or trimmed) mean. So we searched the Oracle docs and Google for ways to implement this function, and didn’t come up with anything very useful. There were some SQL Server scripts, but they would have required two or three passes over the dataset.

After browsing through the analytic functions documentation, I hit upon the NTILE function, which I’ve used before, and realised it was the answer. NTILE takes a set of ordered data and divides it evenly (or as evenly as possible) among a number of buckets. In our case, we wanted to discard the top 25% and bottom 25%, so we simply divide the set into 4 buckets, discard the 1st and 4th buckets, then take the ordinary average of the remainder:

SELECT AVG(mystat)
FROM (
      SELECT mystat,
             NTILE(4) OVER (ORDER BY mystat) n
      FROM mytable
     )
WHERE n IN (2,3);

The benefit of this query is that it makes only one pass over the data. It was also easily modified to partition the data set into groups; a count was added so that, for any group with fewer than 4 items, the average is taken over the entire group.

To get the truncated mean per group (reporting the plain average over the entire group for any group with fewer than 4 items):

SELECT mycat, AVG(mystat)
FROM (
      SELECT mycat, mystat,
             NTILE(4) OVER (PARTITION BY mycat
                            ORDER BY mystat) n,
             COUNT(1) OVER (PARTITION BY mycat) c
      FROM mytable
     )
WHERE n IN (2,3) OR c < 4
GROUP BY mycat
ORDER BY mycat;
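
By the way, if you’re wondering how NTILE handles a set that doesn’t divide evenly into 4: the lower-numbered buckets get the extra rows. A quick sanity check against a generated set of 10 rows:

SELECT level AS val,
       NTILE(4) OVER (ORDER BY level) AS bucket
FROM dual
CONNECT BY level <= 10;

This assigns 3 rows each to buckets 1 and 2, and 2 rows each to buckets 3 and 4.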

Demise of the Perth AUSOUG Conference?

Got an email this morning from the Australian Oracle User Group, announcing that this year the AUSOUG/OAUG “Oracle with 20:20 Foresight” conference will be co-located with the InSync conference, to be held in Melbourne on 16-17 August.

Initially I thought, “what, no conference for Perth?” Thankfully, though, they will also hold Oracle with 20:20 Foresight in Perth, on 22-23 November. Phew!



A good entry-level interview question

I think this would make a reasonable question for would-be database developers:

What are the differences between the following statements?

UPDATE mytable SET mycolumn = NULL;
ALTER TABLE mytable MODIFY (mycolumn NULL);

If they can’t answer this correctly, they don’t get to stage two.
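
If you want to check your answer, here are the two statements again, annotated with the key differences:

-- DML: sets mycolumn to NULL in every existing row of mytable (note: no WHERE
-- clause); generates undo and redo, fires any update triggers, can be rolled
-- back, and needs a COMMIT.
UPDATE mytable SET mycolumn = NULL;

-- DDL: drops the NOT NULL constraint so that the column definition now allows
-- NULLs; touches no row data, commits implicitly, and cannot be rolled back.
ALTER TABLE mytable MODIFY (mycolumn NULL);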


Lesson learned today

Comment out all “DROP TABLE” commands in my scripts.

(I accidentally hit F5 when the focus was in the wrong window – which happened to contain a “DROP TABLE / CREATE TABLE” script – my Toad session goes and happily drops the table that I’d been gradually accumulating statistics into for the past 3 days – and no, there’s no flashback table in 9i)

At least I kept all my scripts – rerunning them all now…
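
The new-look script header, sketched here with a hypothetical table: the DROP is still there when I need it, but can’t run by accident.

-- uncomment the next line only when you really mean to rebuild the table:
-- DROP TABLE mystats_accum;
CREATE TABLE mystats_accum (
  snap_date  DATE,
  stat_name  VARCHAR2(30),
  stat_value NUMBER
);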


Please, hard-code your literals

As Steven Feuerstein says,


We all know that hard-coding is a bad thing in software. But most developers think of hard-coding simply as typing a literal value into your program whenever you need it.

So what’s wrong with doing this? Nothing – as long as the value is never going to change. But what’s the chance of that happening? In fact, what’s the chance of anything staying the same (never changing) in our application requirements and resulting code?

Almost nil.

I agree with most of what Steven says in this post, except for the “Almost nil” part. In my experience, developers who have slavishly converted all literal values in their applications to constants have made life much more difficult for subsequent maintenance and performance tuning.

I would contend that literal values should be hard-coded throughout the code when those values will not change, because they cannot change. The main example of this kind of value is the hidden magic ID number, which crops up especially often in database designs featuring any kind of EAV pattern. These designs usually feature some kind of metadata table, e.g.:

PROPERTYTYPES (
  propertyid,
  code,
  description,
  datatype, length, etc.)

and sprinkled throughout the codebase, like sand in your toddler’s nappy after a trip to the beach, are statements like this:

SELECT value INTO custname
FROM propertyvalues
WHERE entityid = p1
AND propertyid = cNAME;

(where cNAME is a constant that happens to be set to 30456 or something like that)

It gets worse when they need more than one property about an entity:

SELECT a.value, b.value, c.value
INTO custname, custphone, custaddress
FROM propertyvalues a, propertyvalues b, propertyvalues c
WHERE a.entityid = p1 AND b.entityid = p1 and c.entityid = p1
AND a.propertyid = cNAME
AND b.propertyid = cPHONE
AND c.propertyid = cADDRESS;

Someone might say, “that’s good, isn’t it? The hardcoded literal values have been stored once and once only in the constant declarations, and they can be used everywhere. If we want to change a property ID we can change it in one place and everything still works.”

Wrong – for two reasons.

Firstly, it’s not just one place – if you change the property ID, you also have to change the data in all the tables that point to that property. You also have to change code in that frontend UI that can’t read those constants, or in that external process that was inexplicably written with its own logic around those particular ID values (and search-and-replace won’t work because they’ve written some braindead code like this: if propertyid > 30453 & propertyid < 30458 { ... }). The larger, more complex and widespread the codebase, the more certain it is that the ID is never going to change.

Secondly, WHY? Why would you ever want to change these IDs? It’s much preferable to hard-code those ID values everywhere. If you stick to using constants only where they make sense (like the “maximum salary” in Steven’s excellent example), then your hard-coded literals will tell future developers one important and life-preserving message:

“Do not change this code.”
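
To make this concrete, here’s how I’d rather see the earlier lookup written (using the same illustrative ID as above):

SELECT value INTO custname
FROM propertyvalues
WHERE entityid = p1
AND propertyid = 30456; -- the NAME property: this ID is fixed and cannot change

The literal plus a comment communicates everything the constant did, without pretending the value is configurable.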


Why EAV is bad, mostly; and why it is used so often

Tom Kyte offers some astute observations regarding EAV data models in general.

It’s interesting to see the justification for entity-attribute-value data models from their use in the design of Oracle Application Express. I’d have to agree – the site I’m currently at is chockas with name/value pairs, which work fine for querying individual bits of data for the UI, but miserably for reporting and batch jobs.
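
For example (a sketch reusing the hypothetical property IDs from the previous post), even a trivial report over name/value pairs needs a pivot, or a pile of self-joins, just to get one row per entity:

SELECT entityid,
       MAX(DECODE(propertyid, 30456, value)) AS custname,
       MAX(DECODE(propertyid, 30457, value)) AS custphone,
       MAX(DECODE(propertyid, 30458, value)) AS custaddress
FROM propertyvalues
GROUP BY entityid;

And that’s just three attributes; a real report might need dozens.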


Googlebot boosts APEX performance

Every day it seems, Googlebot scans my site. You wouldn’t think that would affect the performance of the site – or if it did, it would only make it worse, right?

A few days ago I used APEX’s Page View Analysis, Weighted Page Performance (under Home > Administration > Monitor Activity) to see which pages I should focus on to improve performance. The top page was my home page, which was taking 0.9 seconds to load on average. There were other pages that were taking longer to load, but the home page was the most common entry point so it had a higher “weighting”.

So I loaded up the page while logged in as a developer and clicked the “Debug” button at the bottom of the page. It showed that all the regions on the page were rendering in about 0.01 seconds each (sometimes a bit more), except for one region, which pulled a small bit of XML from another website (“Verse of the Day” via BibleGateway.com) and displayed it – it was consistently taking 0.5 to 0.8 seconds to load.

The “Verse of the Day”, obviously, only changes once a day, but for some reason, in spite of the Cache settings on the region, it was still going back to the source every page view (side question: have you noticed that in Debug mode, it seems no regions are found in the cache? – perhaps this is by design?).

Oh well – I was thinking that I’d like to store the text that is being shown in that region anyway, so I created a simple table to store the text, then rewrote the PL/SQL to get the text from the table instead; and if not found, only then does it go to BibleGateway.com to get the VotD.
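
A minimal sketch of the idea – the table and function names are made up, the URL is illustrative, and the XML parsing is omitted for brevity:

CREATE TABLE votd_cache (
  verse_date DATE PRIMARY KEY,
  verse_text VARCHAR2(4000)
);

CREATE OR REPLACE FUNCTION get_votd RETURN VARCHAR2 IS
  v_text votd_cache.verse_text%TYPE;
BEGIN
  -- serve today's verse from the local cache if we already have it
  SELECT verse_text INTO v_text
  FROM votd_cache
  WHERE verse_date = TRUNC(SYSDATE);
  RETURN v_text;
EXCEPTION
  WHEN NO_DATA_FOUND THEN
    -- cache miss: go to the source, then store the result for the rest of the day
    v_text := UTL_HTTP.REQUEST('http://www.biblegateway.com/votd/'); -- illustrative URL
    INSERT INTO votd_cache (verse_date, verse_text)
    VALUES (TRUNC(SYSDATE), v_text);
    RETURN v_text;
END get_votd;
/

The region’s PL/SQL source then reduces to something like htp.p(get_votd);.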

After the first pageview loads the verse into the cache, subsequent pageviews are taking less than 0.1 seconds total, so that’s solved – except for the very first visitor to the site of the day, who will have to wait a whole 0.9 seconds.

However, as I discovered this morning, looking at the logs, it appears Googlebot is the one who is suffering – sometime after midnight most nights it scans the site, seeds the cache for me, then everyone who visits that day gets the benefit of the cache 🙂 🙂 🙂


Ten Years on Oracle

Almost exactly ten years ago, as all the Y2K bug projects were quietly winding down, I’d been getting ready for a new decade, which for me was full of uncertainty and promise. It was in 2000 that I got married, and started my first job in the I.T. industry (but not in that order). Many people thought the world was going to end, but for me it felt like the dawn of a new personal era.

After eight years of tertiary study, I’d finally gotten my degree and was fortunate enough to have landed a job as an Oracle analyst programmer at the Valuer General’s Office in Perth. Working there turned out to be a perfect introduction to the industry – I learned a lot about Oracle, about running I.T. projects, and about working productively.

There were some great people there, who mentored me and gently guided me in how to think about database programming. Specifically, it was there I started throwing off the shackles of procedural thinking, to be replaced with a relational approach, thinking about sets of data instead of individual rows.

These mentors never said, “just do it this way because I said so.” They’d allow me to make the mistakes first, then show me the better way, and explain why. For the most part, however, they just let me learn by doing, which (for me at least) is the best way to learn.

Over the last ten years I’ve gained valuable experience on a number of quite different projects, in different types of environments, with different team structures and different kinds of people, all working with Oracle. I’m open to trying new technologies, and always seek opportunities to expand my experience with different languages and software; but if I get to the end of my career having never worked on any system without the Oracle database, I won’t be at all disappointed. It’s certainly not a perfect system, but there’s so much to like, so much to learn and explore.

Only God knows what my future will hold, but if the next ten years are anything like the last, they’ll involve a lot of learning, learning, and more learning. As I get older it might take a greater effort, but I always want to be the kind of person who never takes anything for granted, never assumes something just because someone said it, and who mentors others with patience and grace.

Happy new decade, everyone.


How to Change the ORDER BY When You Can’t Modify the SQL

Is it possible to change the ORDER BY on a query, if the SQL is defined in a wrapped PL/SQL package? A colleague asked me this interesting question today – and the answer was quite simply, Yes and No.

If the SQL already has an ORDER BY, then the answer is No. Fortunately, in our case, the SQL didn’t have any ORDER BY clause:

SELECT a.addressid
FROM address a, addresstype t
WHERE a.locationid = :locationid
AND a.typeid = t.typeid
AND t.typecode = 'STREET';

The plan currently being picked by the CBO (Oracle 9.2.0.8) is:

NESTED LOOPS
   TABLE ACCESS BY ROWID addresstype
      INDEX RANGE SCAN pkx_addresstype
   INDEX RANGE SCAN ix_address

For background, here are the relevant schema details:

TABLE address
(addressid NUMBER PRIMARY KEY
,locationid NUMBER NOT NULL
,typeid NUMBER NOT NULL --FOREIGN KEY to addresstype
);

INDEX ix_address (locationid, typeid, addressid);

TABLE addresstype
(typeid NUMBER PRIMARY KEY
,typecode VARCHAR2(30) NOT NULL
);

typecode is actually unique, but no constraint has been declared to the database.

Because the first step in the plan (the TABLE ACCESS on addresstype) only ever gets one row (in practice), the index range scan on ix_address causes this query to return the rows in ascending order of addressid.

The process that runs this query gets rows one at a time, does some work to “check” it against some other data, and once it finds a “match”, it stops getting more rows.

This design probably seemed like a good idea at the time, but now the business would like this process changed so that, in the (rare) case where more than one row would “match”, the process favours a more recent address over an old one. For our purposes, “recent” may be approximated by assuming that bigger addressids are more recent than smaller addressids.

We can’t change the code without a lengthy and costly call to the application vendor; it would be better if we could change it ourselves. We cannot change the PL/SQL package in which this query is defined; so what can we do? Can we change the order in which rows are returned from this query?

The answer is yes – firstly, the query has no ORDER BY, so the order of rows is affected only by the query plan. If we change the query plan, which we can do with a stored outline, the order of rows can be changed.

So in this case, all we have to do is create a stored outline to add an INDEX_DESC hint:

SELECT /*+INDEX_DESC(a)*/ a.addressid
FROM address a, addresstype t
WHERE a.locationid = :locationid
AND a.typeid = t.typeid
AND t.typecode = 'STREET';

This causes the following plan to be used:

NESTED LOOPS
   TABLE ACCESS BY ROWID addresstype
      INDEX RANGE SCAN pkx_addresstype
   INDEX RANGE SCAN DESCENDING ix_address

With the stored outline, the query now returns the rows in descending order of addressid.
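
For reference, the mechanics of the classic hint-swap trick (the outline and category names here are illustrative, and this pokes at the OUTLN schema directly, so test it carefully first): create one outline on the original text and one on the hinted version, swap the hints between them, then activate the category:

CREATE OR REPLACE OUTLINE addr_orig FOR CATEGORY myapp ON
SELECT a.addressid
FROM address a, addresstype t
WHERE a.locationid = :locationid
AND a.typeid = t.typeid
AND t.typecode = 'STREET';

CREATE OR REPLACE OUTLINE addr_hinted FOR CATEGORY scratch ON
SELECT /*+INDEX_DESC(a)*/ a.addressid
FROM address a, addresstype t
WHERE a.locationid = :locationid
AND a.typeid = t.typeid
AND t.typecode = 'STREET';

-- swap the stored hints between the two outlines (as a suitably privileged user);
-- if the outlines have different numbers of hints, OL$.HINTCOUNT must be kept
-- in sync as well
UPDATE outln.ol$hints
SET ol_name = DECODE(ol_name, 'ADDR_HINTED', 'ADDR_ORIG',
                              'ADDR_ORIG',   'ADDR_HINTED')
WHERE ol_name IN ('ADDR_ORIG', 'ADDR_HINTED');
COMMIT;

ALTER SESSION SET use_stored_outlines = myapp;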

NOTE:
If you want to use this approach, it is highly dependent on being able to find a query plan that produces the ordering that you want. This may mean creating a suitable index designed strategically for this purpose, if one doesn’t exist already. Even creating indexes won’t guarantee that you’ll be able to find a plan that gets the order that you want; generally I’d say this approach will only be feasible for relatively simple queries.