在前作 https://www.cnblogs.com/xiandedanteng/p/12713057.html 里,探讨了相同字符串前缀出现次数最多的问题,其实这是求多字符串子串问题的特例。
本篇准备直接解决求多字符串共同子串的问题,不管子串是出现在开头抑或中间还是结尾:
这批字符串存储在表tb_string里,具体就是将其中出现最多的子串按出现次数降序排列。
tb_string表结构:
create table tb_string( id number(4,0) primary key, name nvarchar2(20) not null)
测试数据:
insert into tb_string(id,name) values('1','asd_pdf_df3'); insert into tb_string(id,name) values('2','star_asd'); insert into tb_string(id,name) values('3','starasd'); insert into tb_string(id,name) values('4','pdfdf3asd'); insert into tb_string(id,name) values('5','startpdf');
目测一下,貌似df出现5次,pdf出现3次,asd出现四次...这是出现频率较高的子串.
解决方法:
将取到的字符串用substr函数从头到尾截取子串,比如字符串abc,截取的子串将是a,b,c,ab,bc,abc,一共六个;再比如asdf,截取的子串将是a,s,d,f,as,sd,df,asd,sdf,asdf,一共十个.
然后,将子串统计排序就好。
为了截取子串,需要下表辅助:
create table tb_numsery( id number(4,0) primary key, startIdx integer not null, span integer not null)
其中值:
insert into tb_numsery(id,startIdx,span) values('1','1','1'); insert into tb_numsery(id,startIdx,span) values('2','2','1'); insert into tb_numsery(id,startIdx,span) values('3','3','1'); insert into tb_numsery(id,startIdx,span) values('4','4','1'); insert into tb_numsery(id,startIdx,span) values('5','5','1'); insert into tb_numsery(id,startIdx,span) values('6','6','1'); insert into tb_numsery(id,startIdx,span) values('7','7','1'); insert into tb_numsery(id,startIdx,span) values('8','8','1'); insert into tb_numsery(id,startIdx,span) values('9','9','1'); insert into tb_numsery(id,startIdx,span) values('10','10','1'); insert into tb_numsery(id,startIdx,span) values('11','1','2'); insert into tb_numsery(id,startIdx,span) values('12','2','2'); insert into tb_numsery(id,startIdx,span) values('13','3','2'); insert into tb_numsery(id,startIdx,span) values('14','4','2'); insert into tb_numsery(id,startIdx,span) values('15','5','2'); insert into tb_numsery(id,startIdx,span) values('16','6','2'); insert into tb_numsery(id,startIdx,span) values('17','7','2'); insert into tb_numsery(id,startIdx,span) values('18','8','2'); insert into tb_numsery(id,startIdx,span) values('19','9','2'); insert into tb_numsery(id,startIdx,span) values('20','1','3'); insert into tb_numsery(id,startIdx,span) values('21','2','3'); insert into tb_numsery(id,startIdx,span) values('22','3','3'); insert into tb_numsery(id,startIdx,span) values('23','4','3'); insert into tb_numsery(id,startIdx,span) values('24','5','3'); insert into tb_numsery(id,startIdx,span) values('25','6','3'); insert into tb_numsery(id,startIdx,span) values('26','7','3'); insert into tb_numsery(id,startIdx,span) values('27','8','3'); insert into tb_numsery(id,startIdx,span) values('28','1','4'); insert into tb_numsery(id,startIdx,span) values('29','2','4'); insert into tb_numsery(id,startIdx,span) values('30','3','4'); insert into tb_numsery(id,startIdx,span) values('31','4','4'); insert into tb_numsery(id,startIdx,span) values('32','5','4'); insert into tb_numsery(id,startIdx,span) values('33','6','4'); insert into tb_numsery(id,startIdx,span) values('34','7','4'); insert into tb_numsery(id,startIdx,span) values('35','1','5'); insert into tb_numsery(id,startIdx,span) values('36','2','5'); insert into tb_numsery(id,startIdx,span) values('37','3','5'); insert into tb_numsery(id,startIdx,span) values('38','4','5'); insert into tb_numsery(id,startIdx,span) values('39','5','5'); insert into tb_numsery(id,startIdx,span) values('40','6','5'); insert into tb_numsery(id,startIdx,span) values('41','1','6'); insert into tb_numsery(id,startIdx,span) values('42','2','6'); insert into tb_numsery(id,startIdx,span) values('43','3','6'); insert into tb_numsery(id,startIdx,span) values('44','4','6'); insert into tb_numsery(id,startIdx,span) values('45','5','6'); insert into tb_numsery(id,startIdx,span) values('46','1','7'); insert into tb_numsery(id,startIdx,span) values('47','2','7'); insert into tb_numsery(id,startIdx,span) values('48','3','7'); insert into tb_numsery(id,startIdx,span) values('49','4','7'); insert into tb_numsery(id,startIdx,span) values('50','1','8'); insert into tb_numsery(id,startIdx,span) values('51','2','8'); insert into tb_numsery(id,startIdx,span) values('52','3','8'); insert into tb_numsery(id,startIdx,span) values('53','1','9'); insert into tb_numsery(id,startIdx,span) values('54','2','9'); insert into tb_numsery(id,startIdx,span) values('55','1','10');
这些值里,第一列用来做id,第二列用来指定substr的第二参数起始点,第三列用来指定截取的宽度。用这个表辅助从名称里截取子串方便多了,不像原方案一层层剥去有点费劲.
仔细观察这些值,你会发现它与上面从abc,abcd截取子串的过程暗合,其中道理请自行揣摩。
这些值可以由下面的程序得到:
package sery; public class SeryMaker { public static void main(String[] args) { int n=10; int level=1; int index=1; for(int i=n;i>0;i--) { for(int j=1;j<=i;j++) { System.out.println(getInsert(index,j,level)); index++; } level++; } } private static String getInsert(int index,int j,int level) { return String.format("insert into tb_numsery(id,startIdx,span) values('%d','%d','%d');",index,j,level); } }
有了tb_numsery表的帮助,最终的SQL语句就出来了:
select d.* from ( select c.child,count(*) as cnt from ( select substr(a.name,b.startIdx,b.span) as child from tb_string a,tb_numsery b where b.startIdx<=length(a.name) and b.span<=length(a.name) and b.startIdx+b.span<=length(a.name)+1 ) c group by c.child ) d where d.cnt>1 and length(d.child)>1 order by d.cnt desc,d.child
解读:最内层让两表做笛卡儿积,目的是利用tb_numsery表里的数据截取tb_string表的子串;中层是按子串进行分组,得到出现次数,最外层则是过滤和排序。
最外层where条件对出现次数等于1和子串长度等于1的数据进行了过滤,你也可以根据实际情况进行调整。
结果:
SQL> select d.* from 2 ( 3 select c.child,count(*) as cnt from 4 ( 5 select substr(a.name,b.startIdx,b.span) as child from 6 tb_string a,tb_numsery b 7 where b.startIdx<=length(a.name) 8 and b.span<=length(a.name) 9 and b.startIdx+b.span<=length(a.name)+1 10 ) c 11 group by c.child 12 ) d 13 where d.cnt>1 and length(d.child)>1 14 order by d.cnt desc,d.child; CHILD CNT ---------------------------------------- ---------- df 5 as 4 asd 4 sd 4 ar 3 pd 3 pdf 3 st 3 sta 3 star 3 ta 3 CHILD CNT ---------------------------------------- ---------- tar 3 已选择12行。
目测一下df和as出现次数,和实际情况是符合的。
在我的T440p机器上运行一下,得到结果不到0.01秒,借助oracle的强大性能,可能比一些Java算法要快了。
--2020年4月20日--