• 求一批字符串里出现最多次数的子串


    在前作 https://www.cnblogs.com/xiandedanteng/p/12713057.html 里,探讨了相同字符串前缀出现次数最多的问题,其实这是求多字符串子串问题的特例。

    本篇准备直接解决求多字符串共同子串的问题,不管子串是出现在开头抑或中间还是结尾:

    这批字符串存储在表tb_string里,具体就是将其中出现最多的子串按出现次数降序排列。

    tb_string表结构:

    create table tb_string(
        id number(4,0) primary key,
        name nvarchar2(20) not null)

    测试数据:

    insert into tb_string(id,name) values('1','asd_pdf_df3');
    insert into tb_string(id,name) values('2','star_asd');
    insert into tb_string(id,name) values('3','starasd');
    insert into tb_string(id,name) values('4','pdfdf3asd');
    insert into tb_string(id,name) values('5','startpdf');

     目测一下,貌似df出现5次,pdf出现3次,asd出现四次...这是出现频率较高的子串.

    解决方法:

    将取到的字符串用substr函数从头到尾截取子串,比如字符串abc,截取的子串将是a,b,c,ab,bc,abc,一共六个;再比如asdf,截取的子串将是a,s,d,f,as,sd,df,asd,sdf,asdf,一共十个.

    然后,将子串统计排序就好。

    为了截取子串,需要下表辅助:

    create table tb_numsery(
        id number(4,0) primary key,
        startIdx integer not null,
        span integer not null)

    其中值:

    insert into tb_numsery(id,startIdx,span) values('1','1','1');
    insert into tb_numsery(id,startIdx,span) values('2','2','1');
    insert into tb_numsery(id,startIdx,span) values('3','3','1');
    insert into tb_numsery(id,startIdx,span) values('4','4','1');
    insert into tb_numsery(id,startIdx,span) values('5','5','1');
    insert into tb_numsery(id,startIdx,span) values('6','6','1');
    insert into tb_numsery(id,startIdx,span) values('7','7','1');
    insert into tb_numsery(id,startIdx,span) values('8','8','1');
    insert into tb_numsery(id,startIdx,span) values('9','9','1');
    insert into tb_numsery(id,startIdx,span) values('10','10','1');
    insert into tb_numsery(id,startIdx,span) values('11','1','2');
    insert into tb_numsery(id,startIdx,span) values('12','2','2');
    insert into tb_numsery(id,startIdx,span) values('13','3','2');
    insert into tb_numsery(id,startIdx,span) values('14','4','2');
    insert into tb_numsery(id,startIdx,span) values('15','5','2');
    insert into tb_numsery(id,startIdx,span) values('16','6','2');
    insert into tb_numsery(id,startIdx,span) values('17','7','2');
    insert into tb_numsery(id,startIdx,span) values('18','8','2');
    insert into tb_numsery(id,startIdx,span) values('19','9','2');
    insert into tb_numsery(id,startIdx,span) values('20','1','3');
    insert into tb_numsery(id,startIdx,span) values('21','2','3');
    insert into tb_numsery(id,startIdx,span) values('22','3','3');
    insert into tb_numsery(id,startIdx,span) values('23','4','3');
    insert into tb_numsery(id,startIdx,span) values('24','5','3');
    insert into tb_numsery(id,startIdx,span) values('25','6','3');
    insert into tb_numsery(id,startIdx,span) values('26','7','3');
    insert into tb_numsery(id,startIdx,span) values('27','8','3');
    insert into tb_numsery(id,startIdx,span) values('28','1','4');
    insert into tb_numsery(id,startIdx,span) values('29','2','4');
    insert into tb_numsery(id,startIdx,span) values('30','3','4');
    insert into tb_numsery(id,startIdx,span) values('31','4','4');
    insert into tb_numsery(id,startIdx,span) values('32','5','4');
    insert into tb_numsery(id,startIdx,span) values('33','6','4');
    insert into tb_numsery(id,startIdx,span) values('34','7','4');
    insert into tb_numsery(id,startIdx,span) values('35','1','5');
    insert into tb_numsery(id,startIdx,span) values('36','2','5');
    insert into tb_numsery(id,startIdx,span) values('37','3','5');
    insert into tb_numsery(id,startIdx,span) values('38','4','5');
    insert into tb_numsery(id,startIdx,span) values('39','5','5');
    insert into tb_numsery(id,startIdx,span) values('40','6','5');
    insert into tb_numsery(id,startIdx,span) values('41','1','6');
    insert into tb_numsery(id,startIdx,span) values('42','2','6');
    insert into tb_numsery(id,startIdx,span) values('43','3','6');
    insert into tb_numsery(id,startIdx,span) values('44','4','6');
    insert into tb_numsery(id,startIdx,span) values('45','5','6');
    insert into tb_numsery(id,startIdx,span) values('46','1','7');
    insert into tb_numsery(id,startIdx,span) values('47','2','7');
    insert into tb_numsery(id,startIdx,span) values('48','3','7');
    insert into tb_numsery(id,startIdx,span) values('49','4','7');
    insert into tb_numsery(id,startIdx,span) values('50','1','8');
    insert into tb_numsery(id,startIdx,span) values('51','2','8');
    insert into tb_numsery(id,startIdx,span) values('52','3','8');
    insert into tb_numsery(id,startIdx,span) values('53','1','9');
    insert into tb_numsery(id,startIdx,span) values('54','2','9');
    insert into tb_numsery(id,startIdx,span) values('55','1','10');

    这些值里,第一列用来做id,第二列用来指定substr的第二参数起始点,第三列用来指定截取的宽度。用这个表辅助从名称里截取子串方便多了,不像原方案一层层剥去有点费劲.

    仔细观察这些值,你会发现它与上面从abc,abcd截取子串的过程暗合,其中道理请自行揣摩。

    这些值可以由下面的程序得到:

    package sery;
    
    public class SeryMaker {
        public static void main(String[] args) {
            int n=10;
            int level=1;
            int index=1;
            for(int i=n;i>0;i--) {
                for(int j=1;j<=i;j++) {
                    System.out.println(getInsert(index,j,level));
                    index++;
                }
                
                level++;
            }
        }
        
        private static String getInsert(int index,int j,int level) {
            return String.format("insert into tb_numsery(id,startIdx,span) values('%d','%d','%d');",index,j,level);
        }
    }

    有了tb_numsery表的帮助,最终的SQL语句就出来了:

    select d.* from
    (
        select c.child,count(*) as cnt from
        (
            select substr(a.name,b.startIdx,b.span) as child from
            tb_string a,tb_numsery b
            where b.startIdx<=length(a.name)
            and b.span<=length(a.name)
            and b.startIdx+b.span<=length(a.name)+1
        ) c
        group by c.child
    ) d
    where d.cnt>1 and length(d.child)>1
    order by d.cnt desc,d.child

    解读:最内层让两表做笛卡儿积,目的是利用tb_numsery表里的数据截取tb_string表的子串;中层是按子串进行分组,得到出现次数,最外层则是过滤和排序。

    最外层where条件对出现次数等于1和子串长度等于1的数据进行了过滤,你也可以根据实际情况进行调整。

    结果:

    SQL> select d.* from
      2  (
      3  select c.child,count(*) as cnt from
      4  (
      5  select substr(a.name,b.startIdx,b.span) as child from
      6  tb_string a,tb_numsery b
      7  where b.startIdx<=length(a.name)
      8  and b.span<=length(a.name)
      9  and b.startIdx+b.span<=length(a.name)+1
     10  ) c
     11  group by c.child
     12  ) d
     13  where d.cnt>1 and length(d.child)>1
     14  order by d.cnt desc,d.child;
    
    CHILD                                           CNT
    ---------------------------------------- ----------
    df                                                5
    as                                                4
    asd                                               4
    sd                                                4
    ar                                                3
    pd                                                3
    pdf                                               3
    st                                                3
    sta                                               3
    star                                              3
    ta                                                3
    
    CHILD                                           CNT
    ---------------------------------------- ----------
    tar                                               3
    
    已选择12行。

    目测一下df和as出现次数,和实际情况是符合的。

    在我的T440p机器上运行一下,得到结果不到0.01秒,借助oracle的强大性能,可能比一些Java算法要快了。

    --2020年4月20日--

  • 相关阅读:
    PHP发送邮件(php100视频后笔记)
    简单页面访问统计
    怎样制作RSS源
    r给自己网站/博客制作安卓应用程序
    反射简单概念
    笔记本
    使用IHttpModule实现简单的页面重映射Url
    工厂方法模式与抽象工厂模式的区别
    安装TFS2010实际体验
    限定登录失败次数,超过指定次数就限制登录一段时间
  • 原文地址:https://www.cnblogs.com/heyang78/p/12740440.html
Copyright © 2020-2023  润新知