So I followed the great advice here to use the new COLLECT function in Oracle 10g to solve the common string-aggregation-in-SQL problem. I chose this solution over the others because it also sorts the results, which was a user requirement: the COLLECT function optionally takes an ORDER BY clause, even though the 10g documentation neglects to mention this. The 11g documentation was updated to include the additional syntax, and to note that COLLECT also supports the DISTINCT (or UNIQUE) keyword to remove duplicates – testing indicates that this, too, works in 10g.
This means that if I define a suitable type varchar2_ntt and a suitable function ntt_to_string, I can write a query like this:
SELECT dept,
ntt_to_string(
CAST(
COLLECT(ename ORDER BY ename)
AS varchar2_ntt)
) AS ename_list
FROM emp
GROUP BY dept;
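For reference, I haven’t shown the type and function definitions here; a minimal sketch of what they might look like (assuming a simple comma-separated concatenation – the parameter names and the implementation are just a guess) is:
CREATE OR REPLACE TYPE varchar2_ntt AS TABLE OF VARCHAR2(4000);
/
CREATE OR REPLACE FUNCTION ntt_to_string (
  p_list IN varchar2_ntt,
  p_sep  IN VARCHAR2 DEFAULT ','
) RETURN VARCHAR2 IS
  v_result VARCHAR2(4000);
BEGIN
  -- concatenate the collection elements in their existing order
  FOR i IN 1 .. p_list.COUNT LOOP
    IF i > 1 THEN
      v_result := v_result || p_sep;
    END IF;
    v_result := v_result || p_list(i);
  END LOOP;
  RETURN v_result;
END ntt_to_string;
/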
That query works fine. But now I want to combine this with some analytic functions, like this:
SELECT DISTINCT
dept,
ntt_to_string(
CAST(
COLLECT(ename ORDER BY ename)
AS varchar2_ntt)
) AS ename_list,
FIRST_VALUE(ename)
OVER (PARTITION BY dept
ORDER BY sal DESC) AS topdog
FROM emp;
This doesn’t work because (a) the COLLECT requires a GROUP BY; and (b) the analytic function cannot be used along with a GROUP BY. What I’d expect to be able to do is use an analytic version of the COLLECT:
SELECT DISTINCT
dept,
ntt_to_string(
CAST(
COLLECT(ename
PARTITION BY dept
ORDER BY ename)
AS varchar2_ntt)
) AS ename_list,
FIRST_VALUE(ename)
OVER (PARTITION BY dept
ORDER BY sal DESC) AS topdog
FROM emp;
We SQL developers are never satisfied, are we?
At this stage my best solution is to use a CTE:
WITH q AS (SELECT dept, ename, sal FROM emp)
SELECT DISTINCT
dept,
(SELECT ntt_to_string(
CAST(
COLLECT(q2.ename ORDER BY q2.ename)
AS varchar2_ntt)
)
FROM q q2
WHERE q2.dept = q.dept
) AS ename_list,
FIRST_VALUE(ename)
OVER (PARTITION BY dept
ORDER BY sal DESC) AS topdog
FROM q;
A deceptively simple question:
How to select the first continuous group of rows using Oracle SQL (Stack Overflow)
The solution given by Malgrat works nicely. He generates a “gap” column which detects changes in the data, then uses a running total to restrict the results to the first “group”. I haven’t come up with a more elegant solution that doesn’t involve multiple table scans.
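Roughly, the technique looks something like this (my own sketch rather than Malgrat’s exact answer), assuming a hypothetical table mydata(seq, status) in which we want the first unbroken run of status values when ordered by seq:
SELECT seq, status
FROM (
  SELECT seq, status,
         -- running total of the gap flags = group number
         SUM(gap) OVER (ORDER BY seq) grp
  FROM (
    SELECT seq, status,
           -- flag a "gap" wherever the value changes from the previous row
           CASE WHEN status = LAG(status) OVER (ORDER BY seq)
                THEN 0 ELSE 1 END gap
    FROM   mydata
  )
)
WHERE grp = 1
ORDER BY seq;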
A colleague needed to get the average of a set of data, but disregarding the top 25% and the bottom 25%. We didn’t know of any built-in Oracle function that would do this, but a review of the Wikipedia page for Average yielded the technical term: truncated (or trimmed) mean. So we searched the Oracle docs and Google for ways to implement this and didn’t come up with anything very useful. There were some SQL Server scripts, but they would have required two or three passes over the dataset.
After browsing through the analytic functions documentation, I hit upon the NTILE function, which I’ve used before, and realised that it was the answer. The NTILE function takes a set of ordered data and divides it evenly (or as evenly as possible) into a number of buckets. In our case, we wanted to discard the top 25% and bottom 25%, so we simply divide the set into 4 buckets, discard the 1st and the 4th buckets, and take the standard average of the remainder:
SELECT AVG(mystat)
FROM (
SELECT mystat,
NTILE(4) OVER (ORDER BY mystat) n
FROM (SELECT mystat
FROM mytable)
)
WHERE n IN (2,3);
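As a quick sanity check with some made-up numbers of my own: given the eight values 10, 20, 30, 40, 50, 60, 70 and 1000, NTILE(4) puts 10 and 20 in bucket 1 and 70 and 1000 in bucket 4, so the trimmed mean comes out as AVG(30, 40, 50, 60) = 45, while the ordinary mean would be dragged up to 160 by the outlier:
WITH sample_data AS (
  SELECT COLUMN_VALUE AS mystat
  FROM   TABLE(sys.odcinumberlist(10,20,30,40,50,60,70,1000))
)
SELECT AVG(mystat) AS trimmed_mean
FROM (
  SELECT mystat,
         NTILE(4) OVER (ORDER BY mystat) n
  FROM   sample_data
)
WHERE n IN (2,3);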
The benefit of the NTILE approach is that it only does one pass over the data. It was also easily modified to partition the data set into groups; a count was added so that the average could be taken over the entire data set for any group with fewer than 4 items.
To get the truncated mean in groups, except for groups with <4 items (for which we’ll report the average over the entire group):
SELECT mycat, AVG(mystat)
FROM (
SELECT mycat, mystat,
NTILE(4) OVER (PARTITION BY mycat
ORDER BY mystat) n,
COUNT(1) OVER (PARTITION BY mycat) c
FROM (SELECT mycat, mystat
FROM mytable)
)
WHERE n IN (2,3) OR c < 4
GROUP BY mycat
ORDER BY mycat;