• [Oracle 10g & 11g] Collect Function and String Aggregation Utility


    Oracle 10g 增加了一个非常有用的 group 函数 ----- COLLECT, 并且在11g中得到了进一步加强。 

    这个函数可以用来实现“String Aggregation” 作用, 即把同一个group中的多行数据转成一行(以collection形式出现)。记得曾在园子中回答过一个SQL问题,见这里, 当时用到的是sys_connect_by_path,但是这种方法性能很不好,在下文中会提到。如果用COLLECT来做,性能会得到很大提升!

     

    首先来看下Oracle 10g中的COLLECT函数,

     

    1. The Collect Function in 10g. (参见这里)

     

    (1) 首先来看看一个例子。

    [默认schema scott是被锁住的,需要首先解锁!]

     
    SQL> alter user scott account unlock

    SQL> conn scott/scott
    Connected.

    SQL> select deptno,
      2         COLLECT(ename) as emps
      3  from emp
      4  group by
      5        deptno;

        DEPTNO EMPS
    ---------- -------------------------------------------------------------------------------------
            10 SYSTP1Nm4rC2OTHGbil8G7f3Adg==('CLARK', 'KING', 'MILLER')
            20 SYSTP1Nm4rC2OTHGbil8G7f3Adg==('SMITH', 'FORD', 'ADAMS', 'SCOTT', 'JONES')
            30 SYSTP1Nm4rC2OTHGbil8G7f3Adg==('ALLEN', 'BLAKE', 'MARTIN', 'TURNER', 'JAMES', 'WARD')

    SQL>

     

     

    注意到返回的结果透漏出Oracle创建了一个类型 “SYSTP1Nm4rC2OTHGbil8G7f3Adg==”.

    在10g R1, 和 10g R2中这个类型存在的schema是不同的,如下所示,

     

    -- Oracle 10.1

    SQL> SELECT owner
      2  ,      typecode
      3  FROM   all_types
      4  WHERE  type_name = 'SYSTP1Nm4rC2OTHGbil8G7f3Adg==';
    
    OWNER                          TYPECODE
    ------------------------------ ------------------------------
    SYS                            COLLECTION
    
    1 row selected.
    

    -- Oracle 10.2

    SQL> select owner,
      2         typecode
      3  from all_types
      4  where type_name = 'SYSTP1Nm4rC2OTHGbil8G7f3Adg==';

    OWNER                TYPECODE
    -------------------- ------------------------------
    SCOTT                COLLECTION

     

     

    正是因为10.1中这个类型是在SYS下面的,因此我们是没法直接用这个类型的,但是在10.2中,我们是可以直接用系统产生的这个类型的,如下,

     

    SQL> select *
      2  from
      3  TABLE("SYSTP1Nm4rC2OTHGbil8G7f3Adg=="('A', 'B', 'C'));

    COLUMN_VAL
    ----------
    A
    B
    C

    SQL>

     

     

    (2) 可否用我们自己定义的类型呢?

    答案是肯定的!

     

    SQL> create or replace type varchar2_ntt as table of varchar2(4000);
      2  /

    Type created.

     

    SQL> select deptno,
      2    CAST(COLLECT(ename) AS varchar2_ntt) as emps
      3  from emp
      4  group by deptno;

        DEPTNO EMPS
    ---------- -----------------------------------------------------------------------------
            10 VARCHAR2_NTT('CLARK', 'KING', 'MILLER')
            20 VARCHAR2_NTT('SMITH', 'FORD', 'ADAMS', 'SCOTT', 'JONES')
            30 VARCHAR2_NTT('ALLEN', 'BLAKE', 'MARTIN', 'TURNER', 'JAMES', 'WARD')

     

     

    得到的结果里显示的是我们自己定义的类型—VARCHAR2_NTT (nested table type)

     

    但是要注意的是,如果需要处理的数据类型是NUMBER的,要求自定义的类型和待处理的数据类型的精度必须一样。 如下所示:

    SQL> create or replace type number_ntt as table of number;
      2  /

    Type created.

    SQL> select deptno,
      2         CAST(COLLECT(sal) as number_ntt) as sals
      3  from emp
      4  group by deptno;
           CAST(COLLECT(sal) as number_ntt) as sals
                        *


           CAST(COLLECT(sal) as number_ntt) as sals
                *
    ERROR at line 2:
    ORA-22814: attribute or element value is larger than specified in type

    SQL> desc emp;
    Name
                                                   Null?    Type
    --------------------------------------------------------------------------------

    EMPNO
                                                   NOT NULL NUMBER(4)
    ENAME
                                                            VARCHAR2(10)
    JOB
                                                            VARCHAR2(9)
    MGR
                                                            NUMBER(4)
    HIREDATE
                                                            DATE
    SAL
                                                            NUMBER(7,2)
    COMM
                                                            NUMBER(7,2)
    DEPTNO
                                                            NUMBER(2)

     

    可以看到SAL的类型是NUMBER(7, 2) 而不是NUMBER, 所以出现问题。解决这个问题有两种方法,要么重新定义类型,要么把SAL转成NUMBER类型。

     

     SQL> select deptno,
      2         CAST(COLLECT(CAST(sal as NUMBER)) as number_ntt) as sals
      3  from   emp
      4  group by
      5         deptno;

         DEPTNO SALS
    ---------- ----------------------------------------------------------------------
            10 NUMBER_NTT(2450, 5000, 1300)
            20 NUMBER_NTT(800, 3000, 1100, 3000, 2975)
            30 NUMBER_NTT(1600, 2850, 1250, 1500, 950, 1250)

    SQL>


     SQL> create type number_7_2_ntt as table of number(7, 2);
      2  /

    Type created.

    SQL> select deptno,
      2         CAST(COLLECT(sal) as number_7_2_ntt) as sals
      3  from emp
      4  group by deptno;

        DEPTNO SALS
    ---------- -------------------------------------------------------------
            10 NUMBER_7_2_NTT(2450, 5000, 1300)
            20 NUMBER_7_2_NTT(800, 3000, 1100, 3000, 2975)
            30 NUMBER_7_2_NTT(1600, 2850, 1250, 1500, 950, 1250)

    SQL>

     

     

    (3) String Aggregation using COLLECT

    开头提到,可以用COLLECT来实现string aggregation, 那么如何做呢? 既然可以通过COLLECT函数将同一个GROUP的数据放在了一个自定义的集合中,剩下的只是我们对该集合做操作,把其中的数据取出来拼成字符串即可。

    首先,需要定义一个FUNCTION – TO_STRING() 来处理自定义集合类型中的数据,

    SQL>

      1  create or replace function to_string(nt_in in varchar2_ntt,
      2                                                           delimiter_in in varchar2 default ',')
      3                                            return varchar2 is
      4     v_idx pls_integer;
      5     v_str varchar2(32767);
      6     v_dlm varchar2(10);
      7  begin
      8     v_idx := nt_in.FIRST;
      9     while v_idx is not null
    10     loop
    11        v_str := v_str || v_dlm || nt_in(v_idx);
    12        v_dlm := delimiter_in;
    13        v_idx := nt_in.NEXT(v_idx);
    14     end loop;
    15     return v_str;
    16* end to_string;
    SQL> /

    Function created.

     

    接下来就可以在SQL语句中直接用了,

     

    SQL> select deptno,
      2         to_string(CAST(collect(ename) as varchar2_ntt)) as emps
      3  from emp
      4  group by deptno;

        DEPTNO EMPS
    ---------- ---------------------------------------------------------------
            10 CLARK,KING,MILLER
            20 SMITH,FORD,ADAMS,SCOTT,JONES
            30 ALLEN,BLAKE,MARTIN,TURNER,JAMES,WARD

     

     

     

    (4) 几种String Aggregation方法性能比较

    除了刚才提到的用COLLECT来做String Aggregation,还有个比较有名的方法就是Tom的STRAGG函数,如下, 这个其实就是个自定义的聚集函数,实现了Oracle Data Cartridge Interface中的几个 routines, 具体做法可以参见另一篇blog,

    Tom Kyte’s STRAGG

    scott@ORA9I.WORLD> create or replace type string_agg_type as object
      2  (
      3     total varchar2(4000),
      4  
      5     static function
      6          ODCIAggregateInitialize(sctx IN OUT string_agg_type )
      7          return number,
      8  
      9     member function
     10          ODCIAggregateIterate(self IN OUT string_agg_type ,
     11                               value IN varchar2 )
     12          return number,
     13  
     14     member function
     15          ODCIAggregateTerminate(self IN string_agg_type,
     16                                 returnValue OUT  varchar2,
     17                                 flags IN number)
     18          return number,
     19  
     20     member function
     21          ODCIAggregateMerge(self IN OUT string_agg_type,
     22                             ctx2 IN string_agg_type)
     23          return number
     24  );
     25  /
     
    Type created.
     
    scott@ORA9I.WORLD> 
    scott@ORA9I.WORLD> create or replace type body string_agg_type
      2  is
      3  
      4  static function ODCIAggregateInitialize(sctx IN OUT string_agg_type)
      5  return number
      6  is
      7  begin
      8      sctx := string_agg_type( null );
      9      return ODCIConst.Success;
     10  end;
     11  
     12  member function ODCIAggregateIterate(self IN OUT string_agg_type,
     13                                       value IN varchar2 )
     14  return number
     15  is
     16  begin
     17      self.total := self.total || ',' || value;
     18      return ODCIConst.Success;
     19  end;
     20  
     21  member function ODCIAggregateTerminate(self IN string_agg_type,
     22                                         returnValue OUT varchar2,
     23                                         flags IN number)
     24  return number
     25  is
     26  begin
     27      returnValue := ltrim(self.total,',');
     28      return ODCIConst.Success;
     29  end;
     30  
     31  member function ODCIAggregateMerge(self IN OUT string_agg_type,
     32                                     ctx2 IN string_agg_type)
     33  return number
     34  is
     35  begin
     36      self.total := self.total || ctx2.total;
     37      return ODCIConst.Success;
     38  end;
     39  
     40  
     41  end;
     42  /
     
    Type body created.
     
    scott@ORA9I.WORLD> 
    scott@ORA9I.WORLD> CREATE or replace
      2  FUNCTION stragg(input varchar2 )
      3  RETURN varchar2
      4  PARALLEL_ENABLE AGGREGATE USING string_agg_type;
      5  /
     
    Function created.

    另外还有一种方法就是用oracle 10g增加的sys_connect_by_path来做,但是性能不好。

    首先来准备测试数据,创建一个数据量是3倍的DBA_OBJECTS的表T。

    SQL> drop table t;

    Table dropped.


    SQL> create table t
      2  as select MOD(rownum, 100) as id,
      3            CAST('A' as VARCHAR2(1)) as val
      4     from dba_objects,
      5             TABLE(varchar2_ntt('A','B', 'C'));

    Table created.


    SQL> select count(*) from t;

      COUNT(*)
    ----------
        188148



    SQL> exec dbms_stats.gather_table_stats(user, 'T');

    PL/SQL procedure successfully completed.
    SQL>

    设置autotrace来获取性能信息,

    SQL> set autotrace traceonly statistics

    SQL> set timing on

     

     

    -- 1. STRAGG
    SQL> select id,
      2         STRAGG(val) as vals
      3  from t
      4  group by id;

    100 rows selected.

    Elapsed: 00:00:05.10

    Statistics
    ----------------------------------------------------------
              1  recursive calls
              0  db block gets
            393  consistent gets
              0  physical reads
              0  redo size
          13379  bytes sent via SQL*Net to client
            462  bytes received via SQL*Net from client
              8  SQL*Net roundtrips to/from client
              1  sorts (memory)
              0  sorts (disk)
            100  rows processed

    SQL>

    时间大约是5秒钟

    -- 2. COLLECT

    SQL> select id,
      2      to_string(cast(collect(val) as varchar2_ntt)) as vals
      3  from t
      4  group by id;

    100 rows selected.

    Elapsed: 00:00:01.31

    Statistics
    ----------------------------------------------------------
            436  recursive calls
            110  db block gets
            521  consistent gets
              1  physical reads
          20392  redo size
          13379  bytes sent via SQL*Net to client
            462  bytes received via SQL*Net from client
              8  SQL*Net roundtrips to/from client
             17  sorts (memory)
              0  sorts (disk)
            100  rows processed

    时间大约是1秒半,比STRAGG要好很多(虽然consistent gets, recursive calls多很多)。

    之所以COLLECT比STRAGG性能好很多,主要在于前者的context switch的次数很好,只是在调用to_string的时候才发生(100次), 而STRAGG对每条记录都要发生一次context switch (188148次)。

    -- 3. SYS_CONNECT_BY_PATH

    SQL> select id,
      2         max(vals) as vals
      3  from
      4  (
      5    select id,
      6           ltrim(sys_connect_by_path(val, ','), ',') as vals
      7    from
      8      ( select id,
      9               val,
    10               row_number() over (partition by id order by val) rn
    11         from t
    12      )
    13    start with rn = 1
    14    connect by prior id = id and prior rn + 1 = rn
    15  )
    16  group by id
    17  ;

    100 rows selected.

    Elapsed: 00:24:08.54

    Statistics
    ----------------------------------------------------------
              1  recursive calls
              0  db block gets
         547953  consistent gets
              0  physical reads
              0  redo size
          62507  bytes sent via SQL*Net to client
            462  bytes received via SQL*Net from client
              8  SQL*Net roundtrips to/from client
           3767  sorts (memory)
              0  sorts (disk)
            100  rows processed

    SQL>

    时间超过24分钟!!! 而且consistent gets 非常之多。

    所以,还是用COLLECT来做String Aggregation性能最好!

    2. COLLECT enhancements in 11g R2. (参见这里)

     

    (1)  Order collection elemetns

    虽然SYS_CONNECT_BY_PATH性能很差,但是有一点还是好的嘛~ 那就是可以进行排序,最后拼出来的字符串可以是经过排序的。那么COLLECT支持排序吗?

    答案是肯定的! 注意这里面讲的是Oracle 11.2对COLLECT的增强,但是其实关于这个排序10g也是支持的,但是其他的比如Distinct等只有在Oracle 11.2之后才有。

    SQL> select * from v$version where rownum = 1;

    BANNER
    ----------------------------------------------------------------
    Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - Prod

    Elapsed: 00:00:00.00

    SQL> conn scott/scott
    Connected.


    SQL> select deptno,
      2               CAST(COLLECT(ename ORDER BY ename)
      3                        AS varchar2_ntt) as ordered_emps
      4  from emp
      5  group by deptno;

     

        DEPTNO ORDERED_EMPS
    ---------- -----------------------------------------------------------------------------
            10 VARCHAR2_NTT('CLARK', 'KING', 'MILLER')
            20 VARCHAR2_NTT('ADAMS', 'FORD', 'JONES', 'SCOTT', 'SMITH')
            30 VARCHAR2_NTT('ALLEN', 'BLAKE', 'JAMES', 'MARTIN', 'TURNER', 'WARD')

    Elapsed: 00:00:00.01
    SQL>

    当然也可以根据其他标准来排序,

    SQL> select deptno,
      2               CAST(COLLECT(ename ORDER BY hiredate)
      3                         AS varchar2_ntt) as ordered_emps
      4  from emp
      5  group by deptno;

        DEPTNO ORDERED_EMPS
    ---------- -----------------------------------------------------------------------
            10 VARCHAR2_NTT('CLARK', 'KING', 'MILLER')
            20 VARCHAR2_NTT('SMITH', 'JONES', 'FORD', 'SCOTT', 'ADAMS')
            30 VARCHAR2_NTT('ALLEN', 'WARD', 'BLAKE', 'TURNER', 'MARTIN', 'JAMES')

    Elapsed: 00:00:00.02

     

    (2) Distinct Collection elements

     

    注意DISTICNT/UNIQUE只在Oracle11.2之后才起作用,

     

    SQL> select * from v$version where rownum = 1;

    BANNER
    -----------------------------------------------------------------------------
    Oracle Database 11g Enterprise Edition Release 11.1.0.6.0 - Production

     

    SQL> select deptno,
      2         CAST(COLLECT(distinct job)
      3                  as varchar2_ntt) as distinct_jobs
      4  from emp
      5  group by
      6          deptno;

     

        DEPTNO DISTINCT_JOBS
    ---------- --------------------------------------------------------------------------------
            10 VARCHAR2_NTT('MANAGER', 'PRESIDENT', 'CLERK')

            20 VARCHAR2_NTT('CLERK', 'ANALYST', 'CLERK', 'ANALYST', 'MANAGER')

            30 VARCHAR2_NTT('SALESMAN', 'MANAGER', 'SALESMAN', 'SALESMAN', 'CLERK', 'SALESMAN')

    Elapsed: 00:00:00.03
    SQL>

    SQL> select deptno,
      2         CAST(COLLECT(UNIQUE job) as varchar2_ntt) as distinct_jobs
      3  from emp
      4  group by deptno;

        DEPTNO DISTINCT_JOBS
    ---------- --------------------------------------------------------------------------------
            10 VARCHAR2_NTT('MANAGER', 'PRESIDENT', 'CLERK')

            20 VARCHAR2_NTT('CLERK', 'ANALYST', 'CLERK', 'ANALYST', 'MANAGER')

            30 VARCHAR2_NTT('SALESMAN', 'MANAGER', 'SALESMAN', 'SALESMAN', 'CLERK', 'SALESMAN')

    Elapsed: 00:00:00.03

    可以看出虽然用DISTINCT/UNIQUE,oracle并没有报错,但是结果显示并没有显示出distinct的结果, 因为当前的Oracle版本不是11.2。

    那么对于11.2之前的版本,有没有什么workaround呢? 答案是肯定的! 可以用MULTISET函数和操作符。注意,这只对nested tables type (不是VARRAY) 起作用。

    SQL> SELECT deptno,
      2         SET(
      3            CAST(
      4                COLLECT(job)
      5                   AS varchar2_ntt)) as distinct_jobs
      6  from emp
      7  GROUP BY deptno;

         DEPTNO DISTINCT_JOBS
    ---------- -------------------------------------------------------------------
            10 VARCHAR2_NTT('MANAGER', 'PRESIDENT', 'CLERK')
            20 VARCHAR2_NTT('CLERK', 'ANALYST', 'MANAGER')
            30 VARCHAR2_NTT('SALESMAN', 'MANAGER', 'CLERK')

    SQL>

    这里面用了SET函数,它用来祛除集合中的重复数据。注意SET必须放在CAST外面,因为SET不能作用于系统创建的类型。

  • 相关阅读:
    find . name "*.py" print | xargs.exe grep py
    opensource license 的区别图解
    python urllib2 httplib HTTPConnection
    【转帖】使用python爬虫抓站的一些技巧总结:进阶篇
    ssh keyboard interactive
    using Net::SSH2 shell method
    【转帖】用python爬虫抓站的一些技巧总结
    PySide QtWebKit 读取网页
    qt single instance solution for PySide qt 4.7
    vim 首字符注释自动取消缩进问题
  • 原文地址:https://www.cnblogs.com/fangwenyu/p/1636902.html
Copyright © 2020-2023  润新知