Un-riching Rich Text Format

Let’s just call it Legacy because I’m not going to say what the source is. It is a single-user desktop application that after a little investigation (i.e. searching the online forum for the app) was found to be storing its data in tables readable by MS Access. I wanted to get at this data, analyze it, maybe do some smart things to it, and then present it via Apex.

Step 1: Get the data into Oracle.

Simple matter of exporting from MS Access via ODBC. At least, it was simple once I replaced the Oracle ODBC drivers with the latest download from OTN. Before that I was getting a number of annoying TNS errors.

Step 2: Transform the data.

Most of the tables are easy-to-understand normalized relational tables. One of them, however, has a column that came through as a CLOB containing strange values like this:

{\rtf1\ansi\deff0\deftab254
{\fonttbl{\f0\fnil\fcharset0 Arial;}
{\f1\fnil\fcharset0 Verdana;}}{\colortbl\red0\green0\blue0;\red255\green0\blue0;\red0\green128\blue0;\red0\green0\blue255;\red255\green255\blue0;\red255\green0\blue255;\red128\green0\blue128;\red128\green0\blue0;\red0\green255\blue0;\red0\green255\blue255;\red0\green128\blue128;\red0\green0\blue128;\red255\green255\blue255;\red192\green192\blue192;\red128\green128\blue128;\red255\green255\blue255;}
\paperw12240\paperh15840\margl1880\margr1880\margt1440\margb1440
{\*\pnseclvl1\pnucrm\pnstart1\pnhang\pnindent720
{\pntxtb}{\pntxta{.}}}
...

Now based on my knowledge of the application I knew that this column was used to store small pieces of text (typically 8 to 30 short lines), with some amount of formatting (e.g. fonts, alignment, etc.). Again the online forum came in useful in that a side comment from one of the developers (regarding a small bug undocumented anti-feature) revealed that they stored the formatted text as RTF – Rich Text Format. Should have known from the opening 6 bytes in the data.

Somewhere in these oceans of rtf codes were swimming the plain text I craved. So Googled RTF, skimmed this old RTF specification, and ended up with this admittedly poorly-performing PL/SQL, which for the 651 rows in this table, each with an RTF of average 3KB, works just well enough for my purposes. As it turned out the only RTF codes I was interested in were \fcharset and \*, both of which I used to ignore bits of text I didn’t want in my output. Oh and \par, which denotes the end of a paragraph. I can run this script once a month on the freshly exported data and apply the full weight of Oracle’s analytic capabilities on it.

The code below exemplifies the use of a pipelined function. This is not a good idea, by the way, if you want to use it in regular queries, e.g. a view. In my case, however, I only wanted to call this from within PL/SQL, and then only once a month. Because of the way parameters work with functions like this, I had to call it with dynamic SQL (execute immediate).

That was kind of fun, but I’d rather not have to deal with RTF ever again, thank you.

create or replace package myutil_rtf is
  type t_v4000_table is table of varchar2(4000);
  function extract_text (p_recid in number)
  return t_v4000_table pipelined deterministic;
end;
/

create or replace package body myutil_rtf is
  function extract_text (p_recid in number)
    return t_v4000_table pipelined deterministic is

    l_ch varchar2(1);
    l_ctrl varchar2(4000);
    l_line varchar2(4000);
    l_rtf clob;

    --don't output any text between
    --\fcharset and closing }
    l_fcharset boolean := false;

    --increments for each enclosed pair of { }
    --within a discard section
    l_discard number;

  begin

    select rtf_clob into l_rtf
    from rtf_table where recid = p_recid;

    for i in 1..dbms_lob.getlength(l_rtf) loop

      l_ch := substr(l_rtf,i,1);

      if l_ch = '}' then

        if l_fcharset then
          --closing } found; re-enable output
          l_fcharset := false;
          l_line := null;
        end if;

        if l_discard > 0 then
          l_discard := l_discard - 1;
          if l_discard = 0 then
            l_discard := null;
          end if;
        end if;

      elsif l_ch = '{' then

        if l_discard is not null then
          l_discard := l_discard + 1;
        end if;

      elsif l_ch = '\' then

        --controls start with a backslash
        l_ctrl := '\';

      elsif l_ctrl is not null then

        --controls are always ended by some
        --non-alphanumeric character
        if instr('abcdefghijklmnopqrstuvwxyz'
        || '0123456789',lower(l_ch)) > 0 then
          l_ctrl := l_ctrl || lower(l_ch);
        else
          if l_ctrl = '\par' then
            pipe row (l_line);
            l_line := null;
          elsif substr(l_ctrl,1,9) = '\fcharset' then
            l_fcharset := true;
          elsif l_ctrl || l_ch = '\*' then
            --{\* ... } means you can ignore
            --anything between the { }
            if l_discard is null then
              l_discard := 1;
            end if;
          end if;
          l_ctrl := null;
        end if;

      elsif l_ch not in (chr(10), chr(13), '{')
        and not l_fcharset and l_discard is null then

        l_line := l_line || l_ch;

      end if;

    end loop;

    if l_line is not null and not l_fcharset then
      pipe row (l_line);
    end if;

    return;
  end extract_text;
end myutil_rtf;
/

To extract the text from the table with recid=1:

select column_value line_of_text
      ,rownum line_number
from table(myutil_rtf.extract_text(1));
Do you know the difference between USER_TAB_COLS & USER_TAB_COLUMNS?
Which APEX am I logging into?

Comments

  1. This helped me so much, thanks a ton Jeff! I had to make two slight adjustments though – the greater than symbols in package myutil_rtf
    are displaying as HTML syntax (&gt) on this page. Replacing them with correct > characters did the trick. Maybe a bug with how SyntaxHighlighter is displaying the SQL code? Regardless, thanks again for sharing!

  2. I am getting an error when executing select statement, do i need some setting to avoid this error
    ORA-06502: PL/SQL: numeric or value error: character string buffer too small

    • Hi Kamran,

      I’m not sure but I suspect it will be due to some issues with this sample code.

      One possibility is that your file has one or more lines that exceed the 4000 character limit per line. Another possibility is that your file has multi-byte characters which might overflow a variable being used by this code.

      You could try increasing the size of the VARCHAR2 variables used by the code to see if it resolves the issue.

      Jeff

Leave a Reply

Your email address will not be published / Required fields are marked *