Fabian Pascal is back and Debunking again

One of Fabian Pascal’s books takes pride of place on my bookshelf. Highly recommended reading for anyone in the database industry.

If you haven’t seen Fabian Pascal’s blog before, that’s because he’s only just started it. He’ll be publishing new material there, as well as articles previously published at Database Debunkings, infamous for his fundamental, no-holds-barred, uncompromising take on what the Relational Model is, what it isn’t, and what that means for all professionals who design databases.

It was with sadness that I saw the site go relatively static over the past few years, and to see it being revived is a fresh blast of cool air in a world that continues to be inundated by fads and misconceptions. Of particular note was “THE VOCIFEROUS IGNORANCE HALL OF SHAME”… I’m looking forward to seeing the old vigorous debates that will no doubt be revived or rehashed.

The pure view of the Relational model of data is, perhaps, too idealistic for some – impractical for day-to-day use in a SQL-dominated world. Personally, I’ve found (although I cannot pretend to be an expert, in any sense, on this topic) that starting from a fundamentally pure model, unconstrained by physical limitations, conceived at an almost ideal, Platonic level, allows me to discover the simplest, most provably “correct” solution for any data modelling problem. At some stage I have to then “downgrade” it to a form that is convenient and pragmatic for implementation in a SQL database like Oracle; in spite of this, having that logical design in the back of my head helps to highlight potential inconsistencies or data integrity problems that must then be handled by the application.

That this situation is not, in fact, the best of all possible worlds is a lesson we can all learn, and learn again. Have a look, and see what you think: dbdebunk.blogspot.com.au.


3 Reasons to Hate Hibernate

Warning: this is a rant.

This is just a collection of observations of Hibernate, from the perspective of an Oracle developer/”DBA”. I’m aware of some of the benefits of using Hibernate to shield Java developers from having to know anything about the database or the SQL language, but sometimes it seems to me that we might generally be better off if they were required to learn a little about what’s going on “underneath the hood”. (Then I remind myself that it’s my job to help them get the most out of the database the client spent so much money getting.)

So, here are my gripes about Hibernate – just getting them off my chest so I can put them to bed.

Disclaimer: I know every Hibernate aficionado will jump in with “but it’s easy to fix that, all you have to do is…” but these are generalizations only.

Exhibit A: Generic Query Generators

As soon as I’d loaded all the converted data into the dev and test instances, we started hitting silly performance issues. A simple search on a unique identifier would take 20-30 seconds to return at first, then settle down to 4-8 seconds a pop. Quite rightly, everyone expected these searches to be virtually instant.

The culprit was usually a query like this:

select count(*) as y0_
from XYZ.SOME_TABLE this_
inner join XYZ.SOME_CHILD_TABLE child1_
on this_.PARENT_ID=child1_.PARENT_ID
where lower(this_.UNIQUE_IDENTIFIER) like :1
order by child1_.COLH asc, child1_.COLB asc, this_.ANOTHER_COL desc

What’s wrong with this query, you might ask?

Issue 1: Case-insensitive searches by default

Firstly, it is calling LOWER() on the unique identifier – a column which will never contain any alphabetic characters, so a case-insensitive search will never be required – and as a result the query cannot use the unique index on that column. Instead of forcing the developers to think about whether a case-insensitive search is required for each column, Hibernate lets them simply blanket the whole system with these. Quite often no-one notices until the system goes into UAT, or even Prod, when someone actually tests a search on that particular column and decides that waiting half a minute is unacceptable. In some cases even that won’t happen: the poorly performing queries (along with their associated load on the database server) will simply be used all the time, and people will complain about the general poor performance of the database.
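(If a case-insensitive search genuinely were required on some column, one option at the database end is a function-based index, so the query can at least still use an index. A minimal sketch – the index name is purely illustrative:)

-- hypothetical index name; only worthwhile if LOWER() searches are genuinely needed
CREATE INDEX some_table_lower_uid_ix
   ON XYZ.SOME_TABLE (LOWER(UNIQUE_IDENTIFIER));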

Issue 2: Count first, then re-query for the data

Secondly, it is doing a COUNT(*) on a query that will immediately be re-issued to get the actual data. I’d much prefer that the developers were writing the SQL by hand. That way, it’d be a trivial matter to ask them to get rid of the needless COUNT(*) query; and if they simply must show a total record count on the page, add a COUNT(*) OVER () to the main query – thus killing two birds with one efficient stone.
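As a rough sketch of what I mean (column names borrowed from the exhibit above, the rest illustrative), the analytic version returns the rows and the total count in a single pass:

SELECT this_.UNIQUE_IDENTIFIER,
       this_.ANOTHER_COL,
       COUNT(*) OVER () AS total_records  -- total result-set count, no second query
  FROM XYZ.SOME_TABLE this_
 WHERE this_.UNIQUE_IDENTIFIER = :1;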

Exhibit B: Magical Class Generators (tables only)

Issue 3: No views, no procedures, no functions

When someone buys Hibernate, they might very well ask: is it possible to call an Oracle procedure or function with this product? And the answer is, of course, “yes”. Sure, you can do anything you want!

The day the Java developers peel off the shrinkwrap, the first thing they try is creating a Java class based on a single table. With glee they see it automagically create all the member attributes and getter/setter methods, and with no manual intervention required they can start coding the creation, modification and deletion of records using this class, which takes care of all the dirty SQL for them.

Then, the crusty old Oracle developer/”DBA” comes along and says: “It’d be better if you could use this API I’ve lovingly crafted in a PL/SQL package – everything you need is in there, and you’ll be shielded from any complicated stuff we might need to put in the database now or later. All you have to do is call these simple procedures and functions.” And the Java developer goes “sure, no problem” – until they discover that Hibernate cannot automatically create the same kind of class they’ve already gotten accustomed to.

“What, we actually need to read the function/procedure definition and hand-code all the calls to them? No sir, not happening.” After all, they bought Hibernate to save them all that kind of work, and who’s going to blame them?

So, you say, “Ok, no problem, we’ll wrap the API calls with some simple views, backed by instead-of triggers.” But then they hit another wall – Hibernate can’t tell from a view definition how that view relates to other views or tables.
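For reference, the kind of wrapper I’m talking about is nothing exotic – just a view over the table plus an INSTEAD OF trigger that delegates the DML to the PL/SQL API (all the names below are hypothetical):

CREATE VIEW customers_v AS
SELECT customer_id, customer_name, status
  FROM customers;

CREATE TRIGGER customers_v_ioi
   INSTEAD OF INSERT ON customers_v
   FOR EACH ROW
BEGIN
   -- hand the actual insert over to the hand-crafted API
   customer_api.ins
      (customer_id_in   => :NEW.customer_id,
       customer_name_in => :NEW.customer_name,
       status_in        => :NEW.status);
END;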

The end result is that all the Java code does is access tables directly. And you get the kind of queries (and worse) that you saw in Exhibit “A” above.

There. I feel so much better already.

/rant


Priority #1: Keep it simple

Every place has a different way of assigning priority and/or severity to defect reports – some bigger places have many different ways (unfortunately). I’ve not been subjected to Prince2 training so here’s my take on this subject.

I reckon, the simpler the scheme, the more likely it will be used consistently. Every defect should have just a single priority/severity (call it what you will): Critical, High, Medium or Low.

  • Critical – problem significantly affects ability to test the system; “showstopper” – all other work to be delayed until the issue is resolved – an example might be “unable to log in”; “screen x opens with error every time”; “function y causes my computer to explode”
  • High – problem affects critical functionality; should be fixed as a matter of priority over other issues – an example would be “error x always/often occurs at process point y”
  • Medium – functionality not working as per specified requirements; must eventually be fixed (at least before Go Live) – an example would be “default value not being set correctly”; “navigation does not work correctly”
  • Low – cosmetic issue; “nice to have” function; or error/warning occurs very infrequently but doesn’t significantly affect correct processing; ok to Go Live if not fixed

That’s it. Notice how each category is unambiguous in what it means to the developers, testers and others. I’d expect a system to normally have mostly Medium issues, several Highs, hopefully no Criticals, and maybe some Lows. I’d expect some issues to be reclassified up or down as they are assessed, as developers negotiate with the testers and business reps.

I’m certain that there’s all sorts of great reasons why someone needs more levels, or needs to separate the “priority” concept from the “severity” or “impact” concepts, but to my mind there’s not a lot gained from forcing all your testers, developers, and change managers to learn a complicated system, and classify and update their records. When you need a 2D or 3D matrix of priority vs severity vs whatever printed and posted on your cubicle wall, it’s time to ask, “is all this really necessary?”.

Keep it simple, and not only will everyone use it – everyone else will understand it too.

P.S. Did you notice that APEX’s built-in “feedback” feature only has one level? It’s either a bug report, or it’s not (e.g. an enhancement request or comment). I love that.


If at first you don’t succeed… it’s impossible.

How many times have you tried something, got either an error or unexpected results, and decided what you were trying to do was not possible? Have you later on discovered someone quietly doing the impossible?

I think this phenomenon is a form of the “correlation-implies-causation” fallacy.

Unfortunately, this seems to happen far too often, if the kind of questions I regularly see are any guide. A recent example is: “Why cannot I select from more than one table in Oracle?”. Here, the author seems to have reasoned as follows:

  1. “SELECT * FROM table1” returns some rows.
  2. “SELECT * FROM table1, table2” returns no rows.
  3. Therefore, you can’t query more than one table in one SQL statement in Oracle.

In this case, the writer had not realised that table2 had no rows in it; what complicated things somewhat was that in one session the second query was returning rows – because he’d inserted some rows into table2 in that session but hadn’t issued a COMMIT, so those rows were not visible to other sessions.
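A minimal illustration of what was going on (the table structures are illustrative – assume table1 has a committed row and table2 has a single numeric column):

-- session 1:
INSERT INTO table2 VALUES (1);
SELECT * FROM table1, table2;   -- returns rows: the uncommitted insert is visible here

-- session 2, before session 1 commits:
SELECT * FROM table1, table2;   -- returns no rows: table2 still looks empty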

For a person inexperienced in SQL or Oracle, this sort of mistake is forgivable; but I suspect we all make this sort of mistake quite often. I know I have!

When trying something new, it takes diligent research and testing to determine whether one’s approach is simply wrong, or whether unrelated factors (e.g. getting the syntax wrong, or the environment not being set up correctly) are causing the failure. This gets more tiresome and frustrating (a “gumption trap”, in Pirsig‘s parlance) when one was halfway through solving some other problem, and this unexpected problem gets in the way.

Sometimes you just have to go to bed and see if it becomes clearer the next day. If the problem persists, ask a question on StackOverflow!

P.S. if a Google search reveals “doing X is impossible”, ask “Why?”


Single-Point-of-Definition by Example

Steven Feuerstein lists seven excellent “Golden Rules” in his presentation (via Eddie Awad), saying: “Don’t repeat anything. Aim for a Single Point of Definition for every aspect of your application – formulas, business rules, magic values, SQL statements.” He gives the following code as exhibit A:

(source)

I’m guessing in his presentation he spoke about various things that could be done to improve this code, but they’re not in the PDF; so I’d like to give it a go myself and see how much we can improve the maintainability of this code by reducing hard-coding.

1. Type Declarations

Instead of declaring parameters and variables as NUMBER, VARCHAR2 etc., these should use the %TYPE attribute so that they are automatically synchronized with the datatype of the table columns they represent:

PROCEDURE process_employee
(department_id_in IN employees.department_id%TYPE)
IS
   l_id     employees.employee_id%TYPE;
   l_salary employees.salary%TYPE;
...

l_name, however, is not based on any table column we know of at this point; so there is no %TYPE we can use for it. But bear with me, we’ll fix this later.

2. Magic Values

This one’s a no-brainer: that “10000000” is obviously a magic value that some bean-counter decided was the correct threshold for the CEO’s salary. Whatever.

You might define this as a constant in a global package specification, e.g.

CREATE PACKAGE employee_constant AS
ceo_salary_threshold CONSTANT employees.salary%TYPE := 10000000;
END employee_constant;

Personally, I’d suspect that the business will review and revise this number from time to time, to keep up with inflation; so we might end up needing a database table to store the current threshold, plus a date range for which the threshold applies. I’d then add an interface on top of this table so that queries and procedures don’t need to know how to get the current threshold. We can retrofit this later by changing ceo_salary_threshold into a function instead of a constant. That’s a bit beyond the scope of this exercise, however.
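(If it ever came to that, the retrofit might look something like the sketch below – the table name and date-range columns are purely hypothetical. Existing callers of employee_constant.ceo_salary_threshold would not need to change, since a parameterless function can be referenced without parentheses.)

CREATE TABLE salary_threshold (
   threshold      NUMBER NOT NULL,
   effective_from DATE   NOT NULL,
   effective_to   DATE
);

CREATE OR REPLACE PACKAGE employee_constant AS
   FUNCTION ceo_salary_threshold RETURN employees.salary%TYPE;
END employee_constant;

CREATE OR REPLACE PACKAGE BODY employee_constant AS
   FUNCTION ceo_salary_threshold RETURN employees.salary%TYPE IS
      l_threshold employees.salary%TYPE;
   BEGIN
      -- pick up whichever threshold applies right now
      SELECT threshold
        INTO l_threshold
        FROM salary_threshold
       WHERE SYSDATE >= effective_from
         AND (effective_to IS NULL OR SYSDATE < effective_to);
      RETURN l_threshold;
   END ceo_salary_threshold;
END employee_constant;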

3. Formatting Rules

The rule about formatting an employee name as “LAST,FIRST” is duplicated in a comment and in the SELECT statement; and chances are it will be required elsewhere in the application as well. My preferred method for creating a SPOD for this sort of business rule used to be to move the implementation into a view, e.g.:

CREATE VIEW formatted_employees AS
SELECT employees.*,
       employees.last_name || ',' || employees.first_name
       AS full_name
FROM   employees;
COMMENT ON COLUMN formatted_employees.full_name
IS 'Full name: LAST COMMA FIRST (ReqDoc 123.A.47)';

This view is what I like to call a “formatting” view: it is only allowed to query one table, it contains no WHERE, GROUP BY or HAVING clauses, and it selects all the columns from the table. The view can be used almost anywhere the table may be used. It adds additional columns that format the data in various ways. If need be, we can even add INSTEAD OF triggers to handle inserts/updates on the derived columns – if the business rules make the conversion from derived-to-underlying-column well defined.

So, now we can redefine the cursor as:

CURSOR emps_in_dept_cur
IS
   SELECT employee_id, salary, full_name AS lname
     FROM formatted_employees
    WHERE department_id = department_id_in;

Notice that I don’t call the column “last_comma_first” or anything like that – that would again be hard-coding the business rule, which would then be replicated throughout the application. In Oracle 11g, however, I think it might be better to create virtual columns on the table instead:

ALTER TABLE employees ADD (
full_name VARCHAR2(100)
GENERATED ALWAYS AS (last_name || ',' || first_name) VIRTUAL
);
COMMENT ON COLUMN employees.full_name
IS 'Full name: LAST COMMA FIRST (ReqDoc 123.A.47)';
CURSOR emps_in_dept_cur
IS
   SELECT employee_id, salary, full_name AS lname
     FROM employees
    WHERE department_id = department_id_in;

The virtual column can have its own stats, and even an index if needed for querying.
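For example (the index name is illustrative):

CREATE INDEX employees_full_name_ix ON employees (full_name);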

Another option would be to create a function that does this formatting:

CREATE FUNCTION employee_full_name
   (last_name  IN employees.last_name%TYPE,
    first_name IN employees.first_name%TYPE)
RETURN VARCHAR2 DETERMINISTIC IS
--Full name: LAST COMMA FIRST (ReqDoc 123.A.47)
BEGIN
   RETURN last_name || ',' || first_name;
END employee_full_name;

We could call this function from the procedure or the view, but if we’re on 11g there’s no reason we can’t create a virtual column on it:

ALTER TABLE employees ADD (
full_name VARCHAR2(100)
GENERATED ALWAYS
AS (employee_full_name(last_name,first_name)) VIRTUAL
);

Another advantage to using the view or a virtual column is that we can now remove the “VARCHAR2 (100)” from the variable declaration, e.g.:

l_name   employees.full_name%TYPE;

4. Cursor Parameter

The cursor refers directly to the parameter to the procedure, which is a no-no – this couples the cursor too much with the procedure, i.e. we can’t re-use it elsewhere unless we always define a variable “department_id_in”. Instead, we should use a cursor parameter:

CURSOR emps_in_dept_cur
   (department_id_in IN employees.department_id%TYPE)
IS
   SELECT employee_id, salary, full_name AS lname
     FROM employees
    WHERE department_id = emps_in_dept_cur.department_id_in;

The addition of the context “emps_in_dept_cur.” is not strictly necessary, but it is good practice to qualify the scope of all variables so that unrelated changes (e.g. the addition of a table column called “department_id_in”) don’t silently change the meaning of the code.

5. Cursor Row Type

What if we need to add 10 more columns to the cursor? At the moment we’re adding one more variable for each column of the cursor, and specifying each one three times (variable declaration, cursor SELECT clause, and the FETCH INTO). We can reduce this to just once by declaring a single record variable based on the cursor’s %ROWTYPE instead:

PROCEDURE process_employee
   (department_id_in IN employees.department_id%TYPE)
IS
   CURSOR emps_in_dept_cur
      (department_id_in IN employees.department_id%TYPE)
   IS
      SELECT employee_id, salary, full_name lname
        FROM employees
       WHERE department_id = emps_in_dept_cur.department_id_in;
   emp emps_in_dept_cur%ROWTYPE;
BEGIN
   OPEN emps_in_dept_cur (department_id_in);
   LOOP
      FETCH emps_in_dept_cur
      INTO emp;
...

6. Don’t COMMIT

Procedures should rarely COMMIT (there are very few exceptions to this rule, e.g. procedures declared as autonomous transactions). Transactional control should be left to the calling process – this process might need to be done along with a number of other changes elsewhere, and we would want to either COMMIT or ROLLBACK all the changes together as one transaction. What if the next procedure raised an error and we had to rollback? Our system would be left in an inconsistent state.
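In other words, the commit belongs in the caller – something along these lines (the second procedure here is hypothetical):

BEGIN
   process_employee (department_id_in => 10);
   update_department_budget (department_id_in => 10);  -- hypothetical related change
   COMMIT;        -- the caller decides when the unit of work is complete
EXCEPTION
   WHEN OTHERS THEN
      ROLLBACK;   -- undo both changes together, leaving the data consistent
      RAISE;
END;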

7. Error Package

That RAISE_APPLICATION_ERROR hard-codes an error code and an error message. What if we type the error number wrong somewhere? If the calling process handles ORA-20907 in some fashion, but we mistype it as -20908 in one procedure, the calling process will not handle it.
We could declare an exception instead, e.g. in a global package specification:

CREATE PACKAGE employee_exception AS
invalid_dept_id EXCEPTION;
PRAGMA EXCEPTION_INIT (invalid_dept_id, -20907);
END employee_exception;

Now, our exception handler can raise just the one exception:

EXCEPTION WHEN NO_DATA_FOUND THEN
   RAISE employee_exception.invalid_dept_id;

However, we’ve now lost the error message. It would be better to create an error-handling package instead:

CREATE PACKAGE employee_error AS
   invalid_error_no           CONSTANT NUMBER := -20000;
   invalid_error_no_exception EXCEPTION;
   PRAGMA EXCEPTION_INIT (invalid_error_no_exception, -20000);
   invalid_dept_id            CONSTANT NUMBER := -20907;
   invalid_dept_id_exception  EXCEPTION;
   PRAGMA EXCEPTION_INIT (invalid_dept_id_exception, -20907);
   PROCEDURE raise_exception (error_no IN NUMBER);
END employee_error;

CREATE PACKAGE BODY employee_error AS
   PROCEDURE raise_exception (error_no IN NUMBER) IS
   BEGIN
      CASE error_no
      WHEN invalid_dept_id THEN
         RAISE_APPLICATION_ERROR(invalid_dept_id, 'Invalid department ID');
      ELSE
         RAISE_APPLICATION_ERROR(invalid_error_no, 'Bug: invalid error number');
      END CASE;
   END raise_exception;
END employee_error;

EDIT: PRAGMA EXCEPTION_INIT only accepts literal numbers for its second parameter (or else you get PLS-00702 at compile time) – fixed accordingly

Now, our exception handler is nicely modular:

EXCEPTION WHEN NO_DATA_FOUND THEN
employee_error.raise_exception(employee_error.invalid_dept_id);

So now, our code looks like this:

PROCEDURE process_employee (department_id_in IN employees.department_id%TYPE)
IS
   CURSOR emps_in_dept_cur (department_id_in IN employees.department_id%TYPE)
   IS
      SELECT employee_id, salary, full_name lname
        FROM employees
       WHERE department_id = emps_in_dept_cur.department_id_in;
   emp emps_in_dept_cur%ROWTYPE;
BEGIN
   OPEN emps_in_dept_cur (department_id_in);
   LOOP
      FETCH emps_in_dept_cur INTO emp;
      EXIT WHEN emps_in_dept_cur%NOTFOUND;

      IF emp.salary > employee_constant.ceo_salary_threshold THEN
         adjust_comp_for_ceo (emp.salary);
      ELSE
         analyze_compensation
            (emp.employee_id, emp.salary, employee_constant.ceo_salary_threshold);
      END IF;
   END LOOP;
   CLOSE emps_in_dept_cur;
EXCEPTION WHEN NO_DATA_FOUND THEN
   employee_error.raise_exception(employee_error.invalid_dept_id);
END process_employee;

One final change that one might make here is to move the SQL query right out of the procedure and use a ref cursor instead, supplied by a central “employee_cursor” package.
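A sketch of what that central package might look like (the function name is my own invention):

CREATE PACKAGE employee_cursor AS
   FUNCTION emps_in_dept
      (department_id_in IN employees.department_id%TYPE)
      RETURN SYS_REFCURSOR;
END employee_cursor;

CREATE PACKAGE BODY employee_cursor AS
   FUNCTION emps_in_dept
      (department_id_in IN employees.department_id%TYPE)
      RETURN SYS_REFCURSOR
   IS
      rc SYS_REFCURSOR;
   BEGIN
      OPEN rc FOR
         SELECT employee_id, salary, full_name AS lname
           FROM employees
          WHERE department_id = department_id_in;
      RETURN rc;
   END emps_in_dept;
END employee_cursor;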

There are probably plenty of other changes we could make to improve the maintainability of this code further.
What do you think?


“Not possible” is rarely correct

A good example of how not to answer an “Is it possible to…” question:

Is it possible to write a query which returns a date for every day between two specified days?

It’s ok to admit you don’t know how to do something.

It’s ok to say you don’t think a simple or feasible solution exists for a problem.

It may even be ok to say that something is impossible – if you constrain your answer to current technology. To say this you need to really know the technology, you need to have read about the problem widely enough, and you need to have enough personal experience to be able to say confidently, “no, what you are asking is impossible”. Even then, you might still be wrong, or become wrong sooner or later.

An answer saying outright, “No, that is impossible”, is just inviting a sharp rebuttal. Especially when in the very next sentence you admit that you aren’t an “Oracle specialist” 🙂
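(For the record, the query in question is certainly possible – one common approach in Oracle generates rows with CONNECT BY LEVEL against DUAL; the dates below are just placeholders.)

SELECT DATE '2013-01-01' + LEVEL - 1 AS the_date
  FROM dual
CONNECT BY LEVEL <= DATE '2013-01-31' - DATE '2013-01-01' + 1;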