Un-riching Rich Text Format

Let’s just call it Legacy because I’m not going to say what the source is. It is a single-user desktop application that after a little investigation (i.e. searching the online forum for the app) was found to be storing its data in tables readable by MS Access. I wanted to get at this data, analyze it, maybe do some smart things to it, and then present it via Apex.

Step 1: Get the data into Oracle.

Simple matter of exporting from MS Access via ODBC. At least, it was simple once I replaced the Oracle ODBC drivers with the latest download from OTN. Before that I was getting a number of annoying TNS errors.

Step 2: Transform the data.

Most of the tables are easy-to-understand normalized relational tables. One of them, however, has a column that came through as a CLOB containing strange values like this:

{\fonttbl{\f0\fnil\fcharset0 Arial;}
{\f1\fnil\fcharset0 Verdana;}}{\colortbl\red0\green0\blue0;\red255\green0\blue0;\red0\green128\blue0;\red0\green0\blue255;\red255\green255\blue0;\red255\green0\blue255;\red128\green0\blue128;\red128\green0\blue0;\red0\green255\blue0;\red0\green255\blue255;\red0\green128\blue128;\red0\green0\blue128;\red255\green255\blue255;\red192\green192\blue192;\red128\green128\blue128;\red255\green255\blue255;}

Now based on my knowledge of the application I knew that this column was used to store small pieces of text (typically 8 to 30 short lines), with some amount of formatting (e.g. fonts, alignment, etc.). Again the online forum came in useful in that a side comment from one of the developers (regarding a small bug undocumented anti-feature) revealed that they stored the formatted text as RTF – Rich Text Format. Should have known from the opening 6 bytes in the data.

Somewhere in these oceans of rtf codes were swimming the plain text I craved. So Googled RTF, skimmed this old RTF specification, and ended up with this admittedly poorly-performing PL/SQL, which for the 651 rows in this table, each with an RTF of average 3KB, works just well enough for my purposes. As it turned out the only RTF codes I was interested in were \fcharset and \*, both of which I used to ignore bits of text I didn’t want in my output. Oh and \par, which denotes the end of a paragraph. I can run this script once a month on the freshly exported data and apply the full weight of Oracle’s analytic capabilities on it.

The code below exemplifies the use of a pipelined function. This is not a good idea, by the way, if you want to use it in regular queries, e.g. a view. In my case, however, I only wanted to call this from within PL/SQL, and then only once a month. Because of the way parameters work with functions like this, I had to call it with dynamic SQL (execute immediate).

That was kind of fun, but I’d rather not have to deal with RTF ever again, thank you.

create or replace package myutil_rtf is
  type t_v4000_table is table of varchar2(4000);
  function extract_text (p_recid in number)
  return t_v4000_table pipelined deterministic;

create or replace package body myutil_rtf is
  function extract_text (p_recid in number)
    return t_v4000_table pipelined deterministic is
    l_ch varchar2(1);
    l_ctrl varchar2(4000);
    l_line varchar2(4000);
    l_rtf clob;
    --don't output any text between
    --\fcharset and closing }
    l_fcharset boolean := false;
    --increments for each enclosed pair of { }
    --within a discard section
    l_discard number;
    select rtf_clob into l_rtf
    from rtf_table where recid = p_recid;
    for i in 1..dbms_lob.getlength(l_rtf) loop
      l_ch := substr(l_rtf,i,1);
      if l_ch = '}' then
        if l_fcharset then
          --closing } found; re-enable output
          l_fcharset := false;
          l_line := null;
        end if;
        if l_discard > 0 then
          l_discard := l_discard - 1;
          if l_discard = 0 then
            l_discard := null;
          end if;
        end if;
      elsif l_ch = '{' then
        if l_discard is not null then
          l_discard := l_discard + 1;
        end if;
      elsif l_ch = '\' then
        --controls start with a backslash
        l_ctrl := '\';
      elsif l_ctrl is not null then
        --controls are always ended by some
        --non-alphanumeric character
        if instr('abcdefghijklmnopqrstuvwxyz'
        || '0123456789',lower(l_ch)) > 0 then
          l_ctrl := l_ctrl || lower(l_ch);
          if l_ctrl = '\par' then
            pipe row (l_line);
            l_line := null;
          elsif substr(l_ctrl,1,9) = '\fcharset' then
            l_fcharset := true;
          elsif l_ctrl || l_ch = '\*' then
            --{\* ... } means you can ignore
            --anything between the { }
            if l_discard is null then
              l_discard := 1;
            end if;
          end if;
          l_ctrl := null;
        end if;
      elsif l_ch not in (chr(10), chr(13), '{')
        and not l_fcharset and l_discard is null then
        l_line := l_line || l_ch;
      end if;
    end loop;
    if l_line is not null and not l_fcharset then
      pipe row (l_line);
    end if;
  end extract_text;
end myutil_rtf;

To extract the text from the table with recid=1:

select column_value line_of_text
      ,rownum line_number
from table(myutil_rtf.extract_text(1));

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s