Un-riching Rich Text Format
Let’s just call it Legacy because I’m not going to say what the source is. It is a single-user desktop application that after a little investigation (i.e. searching the online forum for the app) was found to be storing its data in tables readable by MS Access. I wanted to get at this data, analyze it, maybe do some smart things to it, and then present it via Apex.
Step 1: Get the data into Oracle.
Simple matter of exporting from MS Access via ODBC. At least, it was simple once I replaced the Oracle ODBC drivers with the latest download from OTN. Before that I was getting a number of annoying TNS errors.
Step 2: Transform the data.
Most of the tables are easy-to-understand normalized relational tables. One of them, however, has a column that came through as a CLOB containing strange values like this:
{\rtf1\ansi\deff0\deftab254
{\fonttbl{\f0\fnil\fcharset0 Arial;}
{\f1\fnil\fcharset0 Verdana;}}{\colortbl\red0\green0\blue0;\red255\green0\blue0;\red0\green128\blue0;\red0\green0\blue255;\red255\green255\blue0;\red255\green0\blue255;\red128\green0\blue128;\red128\green0\blue0;\red0\green255\blue0;\red0\green255\blue255;\red0\green128\blue128;\red0\green0\blue128;\red255\green255\blue255;\red192\green192\blue192;\red128\green128\blue128;\red255\green255\blue255;}
\paperw12240\paperh15840\margl1880\margr1880\margt1440\margb1440
{\*\pnseclvl1\pnucrm\pnstart1\pnhang\pnindent720
{\pntxtb}{\pntxta{.}}}
...
Now based on my knowledge of the application I knew that this column was used to store small pieces of text (typically 8 to 30 short lines), with some amount of formatting (e.g. fonts, alignment, etc.). Again the online forum came in useful in that a side comment from one of the developers (regarding a small bug undocumented anti-feature) revealed that they stored the formatted text as RTF – Rich Text Format. Should have known from the opening 6 bytes in the data.
Somewhere in these oceans of rtf codes were swimming the plain text I craved. So Googled RTF, skimmed this old RTF specification, and ended up with this admittedly poorly-performing PL/SQL, which for the 651 rows in this table, each with an RTF of average 3KB, works just well enough for my purposes. As it turned out the only RTF codes I was interested in were \fcharset
and \*
, both of which I used to ignore bits of text I didn’t want in my output. Oh and \par
, which denotes the end of a paragraph. I can run this script once a month on the freshly exported data and apply the full weight of Oracle’s analytic capabilities on it.
The code below exemplifies the use of a pipelined function. This is not a good idea, by the way, if you want to use it in regular queries, e.g. a view. In my case, however, I only wanted to call this from within PL/SQL, and then only once a month. Because of the way parameters work with functions like this, I had to call it with dynamic SQL (execute immediate).
That was kind of fun, but I’d rather not have to deal with RTF ever again, thank you.
create or replace package myutil_rtf is
type t_v4000_table is table of varchar2(4000);
function extract_text (p_recid in number)
return t_v4000_table pipelined deterministic;
end;
/
create or replace package body myutil_rtf is
function extract_text (p_recid in number)
return t_v4000_table pipelined deterministic is
l_ch varchar2(1);
l_ctrl varchar2(4000);
l_line varchar2(4000);
l_rtf clob;
--don't output any text between
--\fcharset and closing }
l_fcharset boolean := false;
--increments for each enclosed pair of { }
--within a discard section
l_discard number;
begin
select rtf_clob into l_rtf
from rtf_table where recid = p_recid;
for i in 1..dbms_lob.getlength(l_rtf) loop
l_ch := substr(l_rtf,i,1);
if l_ch = '}' then
if l_fcharset then
--closing } found; re-enable output
l_fcharset := false;
l_line := null;
end if;
if l_discard > 0 then
l_discard := l_discard - 1;
if l_discard = 0 then
l_discard := null;
end if;
end if;
elsif l_ch = '{' then
if l_discard is not null then
l_discard := l_discard + 1;
end if;
elsif l_ch = '\' then
--controls start with a backslash
l_ctrl := '\';
elsif l_ctrl is not null then
--controls are always ended by some
--non-alphanumeric character
if instr('abcdefghijklmnopqrstuvwxyz'
|| '0123456789',lower(l_ch)) > 0 then
l_ctrl := l_ctrl || lower(l_ch);
else
if l_ctrl = '\par' then
pipe row (l_line);
l_line := null;
elsif substr(l_ctrl,1,9) = '\fcharset' then
l_fcharset := true;
elsif l_ctrl || l_ch = '\*' then
--{\* ... } means you can ignore
--anything between the { }
if l_discard is null then
l_discard := 1;
end if;
end if;
l_ctrl := null;
end if;
elsif l_ch not in (chr(10), chr(13), '{')
and not l_fcharset and l_discard is null then
l_line := l_line || l_ch;
end if;
end loop;
if l_line is not null and not l_fcharset then
pipe row (l_line);
end if;
return;
end extract_text;
end myutil_rtf;
/
To extract the text from the table with recid=1:
select column_value line_of_text
,rownum line_number
from table(myutil_rtf.extract_text(1));
TW
14 October 2020 - 2:05 pm
This helped me so much, thanks a ton Jeff! I had to make two slight adjustments though – the greater than symbols in package myutil_rtf
are displaying as HTML syntax (>) on this page. Replacing them with correct > characters did the trick. Maybe a bug with how SyntaxHighlighter is displaying the SQL code? Regardless, thanks again for sharing!
Jeffrey Kemp
14 October 2020 - 2:16 pm
Thanks TW, fixed.
Kamran
15 February 2021 - 5:10 pm
I am getting an error when executing select statement, do i need some setting to avoid this error
ORA-06502: PL/SQL: numeric or value error: character string buffer too small
Jeffrey Kemp
15 February 2021 - 7:14 pm
Hi Kamran,
I’m not sure but I suspect it will be due to some issues with this sample code.
One possibility is that your file has one or more lines that exceed the 4000 character limit per line. Another possibility is that your file has multi-byte characters which might overflow a variable being used by this code.
You could try increasing the size of the VARCHAR2 variables used by the code to see if it resolves the issue.
Jeff