Skip to content

Can you solve this problem in SQL – i.e. without resorting to a procedural solution?

SQL combine multiple identifiers to create a group id for duplicate records

“I have a data extract with three different identifiers: A, B, C
Each identifier may appear in more than one row, and each row may have one or more of these three identifiers (i.e the column is populated or null).
I want to be able to group all records that have any combination of either A, B or C in common and assign them the same group id.
Extract table showing what the eventual groups should be:

```A    | B    | C    | Group
====   ====   ====   =====
p      NULL   NULL   1
p      r      NULL   1
q      NULL   NULL   2
NULL   r      NULL   1
NULL   NULL   s      2
q      NULL   s      2```

So, the input data is a table with three columns (A, B, and C), some of which are NULL. The output is a third column, “Group”, which will be assigned a number which classifies the row into a “group”. Each group will be distinct in that none of its members will have a value in A, B or C that appears in any row in any other group.

This question is fascinating because it cannot be solved, I believe, without some form of iteration. If I get the row (p) along with (q), they are in two different groups; but, if I add the row (p,q), all of a sudden my original rows are now in the same group along with the new row.

The solution will probably have to examine each row in consideration with the entire record set – an operation of O(n^2), if my understanding of CS theory is correct. I suspect a solution using at least a CTE and/or the MODEL clause will be required.

UPDATE:
An elegant solution, using a hierarchical query and Oracle’s CONNECT_BY_ROOT function, has been posted by Vincent Malgrat.

About these ads

From → SQL

Leave a Comment