• 构造与使用分离:命中内容高亮及合并的展示问题解决实现


    在入侵检测业务中,针对文件内容型的告警详情,需要匹配命中内容的上下若干行。

    第一种实现

    先来看一种实现:

    
    /**
     * Created by zhangli on 19-12-18.
     * 高亮文本工具类
     */
    public class HighLightUtils {
        private static final Integer LINE_NUM = 10;
        private static final int MAX_REGEX_NUM = 10;
     
        /**
         * @param content 文本内容
         * @param keywords 关键字列表
         * @return 高亮内容段落集
         */
        public static List<MatchedContent> highlight(String content, List<String> keywords) {
            if (StringUtils.isEmpty(content) || CollectionUtils.isEmpty(keywords)) {
                return Collections.emptyList();
            }
     
            List<MatchedContent> partContentList = Lists.newArrayList();
            for (String keyword : keywords) {
                if (!content.contains(keyword)) {
                    continue;
                }
                partContentList.addAll(highlight(content, escapeRegexSpecialWord(keyword)));
            }
            return partContentList;
        }
     
        /**
         * @param content 文本内容
         * @param regex 正则表达式
         * @return 高亮内容段落集
         */
        public static List<MatchedContent> highlight(String content, String regex) {
            return highlight(content, regex, MAX_REGEX_NUM, LINE_NUM);
        }
     
        public static List<MatchedContent> highlight(String content, String regex, int maxMatchNum, int lineNum) {
            if (StringUtils.isEmpty(content) || StringUtils.isEmpty(regex)) {
                return Collections.emptyList();
            }
     
            content = content.replaceAll("\\r\\n", "\n");
     
            Pattern pattern = Pattern.compile(regex);
            Matcher m = pattern.matcher(content);
     
            List<MatchedContent> partContentList = Lists.newArrayList();
            int maxNum = maxMatchNum;
            while (m.find()) {
                RegexMatchPoint regexMatchPoint = new RegexMatchPoint(m.start(), m.end());
                partContentList.add(getPartContentMap(content, regexMatchPoint, lineNum));
                if (--maxNum == 0) {
                    break;
                }
            }
            return partContentList;
        }
     
        /**
         * 根据正则匹配获取高亮内容及起始行
         */
        private static MatchedContent getPartContentMap(String content, RegexMatchPoint m, int lineNum) {
            // 获取匹配内容在文件中的行数
            int startMatchLine = content.substring(0, m.getStart()).split("\\n").length;
            int endMatchLine = content.substring(0, m.getEnd()).split("\\n").length;
            // 高亮文件匹配内容
            String highlightContent = highlightOneRegexContent(content, m);
            // 截取匹配内容前后共20行(若匹配内容跨行,且大于10行,则从匹配的地方开始截取)
            String partContent = getPartContent(highlightContent, startMatchLine, endMatchLine);
            // 获取截取内容首行的行号
            int startLine = endMatchLine - lineNum + 1;
            //如果匹配的内容大于10行,则从最初匹配行开始,而不是固定的10行
            if (startMatchLine < startLine) {
                startLine = startMatchLine;
            }
     
            return MatchedContent.builder()
                    .startLine(startLine < 1 ? 1 : startLine)
                    .partContent(partContent)
                    .build();
        }
     
        /**
         * 获取高亮行前后部分内容
         */
        private static String getPartContent(String content, Integer startMatchLine, Integer endMatchLine) {
            int start = StringUtils.ordinalIndexOf(content, "\n", endMatchLine - LINE_NUM);
            if (endMatchLine - startMatchLine > LINE_NUM) {
                start = StringUtils.ordinalIndexOf(content, "\n", startMatchLine - 1);
            }
            start = start < 0 ? 0 : start + 1;
            int end = StringUtils.ordinalIndexOf(content, "\n", endMatchLine + LINE_NUM);
            end = end < 0 ? content.length() : end;
     
            return content.substring(start, end);
        }
     
        /**
         * 高亮单个匹配的内容
         */
        private static String highlightOneRegexContent(String content, RegexMatchPoint point) {
            int start = 0;
     
            StringBuffer highlightContentSb = new StringBuffer();
            highlightContentSb.append(content.substring(start, point.getStart())).append(CommonValues.HIGH_LIGHT_START)
                    .append(content.substring(point.getStart(), point.getEnd())).append(CommonValues.HIGH_LIGHT_END)
                    .append(content.substring(point.getEnd()));
     
            return highlightContentSb.toString();
        }
     
        private static String escapeRegexSpecialWord(String keyword) {
            if (keyword != "") {
                String[] fbsArr = { "\\", "$", "(", ")", "*", "+", ".", "[", "]", "?", "^", "{", "}", "|" };
                for (String key : fbsArr) {
                    if (keyword.contains(key)) {
                        keyword = keyword.replace(key, "\\" + key);
                    }
                }
            }
            return keyword;
        }
     
        @Setter
        @Getter
        @ToString
        public static class RegexMatchPoint implements Comparable<RegexMatchPoint> {
            private Integer start;
            private Integer end;
     
            public RegexMatchPoint(Integer start, Integer end) {
                this.start = start;
                this.end = end;
            }
     
            //按开始位置排序
            @Override
            public int compareTo(RegexMatchPoint o) {
                if (start.compareTo(o.getStart()) == 0) {
                    return end.compareTo(o.getEnd());
                } else {
                    return start.compareTo(o.getStart());
                }
            }
     
            public RegexMatchPoint copy() {
                return new RegexMatchPoint(start, end);
            }
     
        }
    }
    
    

    这个实现还是不错的,至少给人很好的启发,是一个很好的改进基础。

    那么,它的问题在哪里呢?

    • 按单个匹配高亮,如果是行内有多个匹配的话,就很难合并;
    • 缺乏行号的记录,所有跟行有关的东西,都是通过 split("\n") 和 substring(start, end)来实现的;
    • 合并命中内容比较困难。

    构造与使用分离

    何为构造与使用分离呢? 是指构造的时候,就提取出足够的必要信息; 而在使用时则运用这些信息去处理,而不是“边构建边使用”。就像编译器做代码编译和自动生成一样,应该不会是边编译边生成代码。

    边构建边使用的实现,会将构建与处理耦合在一起,一旦有需要改动,就会比较困难。

    很显然,如果要构造与使用分离,那么我们需要首先拿到什么内容? (命中内容的行号、起始位置、结束位置;所有文件行及行号) 应该先把这些必要信息提取出来。一旦我们确定了求解问题需要的必要信息,想出一个清晰的算法就比较自然了。

    高亮内容合并展示的算法简述

    步骤一:获取所有行及行号【行号,行内容】;

    步骤二: 先找到所有匹配正则的字符串的行号及起始结束点 regexMatchPoints =(lineNo, start, end);

    步骤三:将 regexMatchPoints 按行分组;因为行内的多个匹配合并很麻烦;

    步骤四:所有匹配行号,按匹配行号排序,方便最终按行号序展示;

    步骤五:按行生成高亮内容展示[行号,高亮行内容];

    步骤六:按匹配行号计算起始行和结束行行号,如果已经在这个区间的行号则可以过滤(合并实现);

    步骤七:根据所有起始行号和结束行号获取对应的行内容。


    实现代码

    
    /**
     * 高亮文本展示工具类
     * Created by qinshu on 2021/12/31
     */
    public class HighLightUtil {
     
        private static final Logger LOG = LogUtils.getLogger(HighLightUtil.class);
     
        /** 高亮展示前后的行数 */
        private static final Integer HIGHLIGHT_LINE_NUM = 5;
     
        /** 最大匹配多少次 */
        private static final int MAX_REGEX_NUM = 10;
     
        /**
         * @param content 文本内容
         * @param regex 正则表达式
         * @return 高亮内容段落集
         */
        public static List<MatchedFileContent> highlight(String content, String regex) {
            return highlight(content, regex, MAX_REGEX_NUM, HIGHLIGHT_LINE_NUM);
        }
     
        /**
         * @param base64Content 文本内容(base64编码后的文本)
         * @param regex 正则表达式
         * @return 高亮内容段落集
         */
        public static List<MatchedFileContent> highlightBase64(String base64Content, String regex) {
            if (StringUtils.isEmpty(base64Content)) {
                return Collections.emptyList();
            }
            return highlight(Base64Utils.decodeContent(base64Content), regex);
        }
     
        public static List<MatchedFileContent> highlight(String content, String regex, int maxMatchNum, int highlightLineNum) {
            if (StringUtils.isEmpty(content) || StringUtils.isEmpty(regex)) {
                return Collections.emptyList();
            }
     
            content = content.replaceAll("\\r\\n", "\n");
     
            List<String> allLines = Arrays.asList(content.split("\n"));
            Pattern pattern = Pattern.compile(regex);
            List<RegexMatchPoint> regexMatchPoints = findAllRegexMatches(allLines, pattern);
     
            // 按行号分组,匹配高亮展示,因为单行多个匹配的高亮需要单行展示,分开后合并比较麻烦
            Map<Integer, List<RegexMatchPoint>> regexMatchPointMap = regexMatchPoints.stream().collect(Collectors.groupingBy(RegexMatchPoint::getLineNo));
     
            // highLightLineMap: [行号,高亮行]
            Map<Integer, String> highLightLineMap = new HashMap<>();
            regexMatchPointMap.forEach((lineNo, matchPointsOfLine) -> {
                highLightLineMap.put(lineNo, highlightOneLineContent(allLines.get(lineNo), matchPointsOfLine));
            } );
     
            List<MatchedFileContent> partContentList = merge(highLightLineMap, allLines, highlightLineNum);
            return partContentList.subList(0, Math.min(partContentList.size(), maxMatchNum));
        }
     
        private static List<MatchedFileContent> merge(Map<Integer, String> highLightLineMap, List<String> allLines, int highlightLineNum) {
     
            // 按行号排序
            List<Integer> highLightLineNos = Lists.newArrayList(highLightLineMap.keySet());
            Collections.sort(highLightLineNos);
     
            // 计算需要展示的行号
            List<MatchedFileLine> matchedFileLines = Lists.newArrayList();
            for (Integer highLineNo: highLightLineNos) {
                if (!exist(matchedFileLines, highLineNo)) {
                    int startLine = highLineNo - highlightLineNum;
                    int endLine = 0;
                    if (startLine < 0) {
                        startLine = 0;
                        endLine = highLineNo + highlightLineNum;
                    }
                    else {
                        startLine = highLineNo - highlightLineNum + 1;
                        endLine = highLineNo + highlightLineNum;
                    }
                    matchedFileLines.add(new MatchedFileLine(startLine, endLine));
                }
            }
     
            return matchedFileLines.stream()
                    .map(fileLine -> getMatchedFileContent(highLightLineMap, allLines, fileLine)).collect(Collectors.toList());
        }
     
        /**
         * 获取指定行号的行内容
         */
        private static String getLine(Map<Integer, String> highLightLineMap, List<String> allLines, Integer lineNo) {
            String highLightLine =  highLightLineMap.get(lineNo);
            return highLightLine != null ? highLightLine : allLines.get(lineNo);
        }
     
        private static boolean exist(List<MatchedFileLine> matchedFileLines, Integer lineNo) {
            return matchedFileLines.stream().anyMatch(fileLine -> exist(fileLine, lineNo));
        }
     
        private static boolean exist(MatchedFileLine matchedFileLine, Integer lineNo) {
            return lineNo >= matchedFileLine.getStartLine() && lineNo < matchedFileLine.getEndLine();
        }
     
        /**
         * 根据起始行号获取
         * @param highLightLineMap 高亮行
         * @param allLines 文件所有行
         * @param fileLine 匹配内容上下文行号
         * @return 匹配内容上下文及起始行号
         */
        private static MatchedFileContent getMatchedFileContent(Map<Integer, String> highLightLineMap, List<String> allLines, MatchedFileLine fileLine) {
            StringBuilder partContentBuilder = new StringBuilder();
            for (int i = fileLine.getStartLine(); i < fileLine.getEndLine() && i < allLines.size(); i++) {
                partContentBuilder.append(getLine(highLightLineMap, allLines, i) + "\n");
            }
            return new MatchedFileContent(fileLine.getStartLine() + 1, partContentBuilder.toString());
        }
     
        /**
         * 获取所有正则匹配点
         * @param allLines 文件内容的所有行
         * @param pattern 正则匹配编译表达式
         * @return 所有匹配正则表达式的字符串的位置
         */
        private static List<RegexMatchPoint> findAllRegexMatches(List<String> allLines, Pattern pattern) {
            // 先拿到所有的正则匹配点,行号从 0 开始
            List<RegexMatchPoint> regexMatchPoints = Lists.newArrayList();
            for (int i=0; i < allLines.size(); i++) {
                String line = allLines.get(i);
                Matcher m = pattern.matcher(line);
                while (m.find()) {
                    RegexMatchPoint regexMatchPoint = new RegexMatchPoint(i, m.start(), m.end());
                    regexMatchPoints.add(regexMatchPoint);
                }
            }
            return regexMatchPoints;
        }
     
     
        /**
         * 高亮文本内容
         */
        public static String highlightContent(String content, List<String> match) {
            if (CollectionUtils.isEmpty(match)) {
                return content;
            }
     
            try {
                for (String matchContent : match) {
                    String highlightContent = String.format("%s%s%s", CommonValues.HIGH_LIGHT_START, matchContent, CommonValues.HIGH_LIGHT_END);
                    content = content.replaceAll(ExprUtils.escapeExprSpecialWord(matchContent), highlightContent);
                }
            } catch (Exception e) {
                LOG.error("highlight content error, content:{}, match:{}", content, match);
            }
            return content;
        }
     
        /**
         * 高亮一行的展示
         */
        public static String highlightOneLineContent(String content, List<RegexMatchPoint> points) {
            int start = 0;
            int lastMatchEnd = 0;
            StringBuilder sb = new StringBuilder();
            for (RegexMatchPoint point: points) {
                sb.append(content, start, point.getStart()).append(CommonValues.HIGH_LIGHT_START)
                        .append(content, point.getStart(), point.getEnd()).append(CommonValues.HIGH_LIGHT_END);
                start = point.getEnd();
                lastMatchEnd = point.getEnd();
            }
            sb.append(content.substring(lastMatchEnd));
            return sb.toString();
        }
     
        @Setter
        @Getter
        @ToString
        public static class RegexMatchPoint implements Comparable<RegexMatchPoint> {
     
            private Integer lineNo;
            private Integer start;
            private Integer end;
     
            public RegexMatchPoint(Integer lineNo, Integer start, Integer end) {
                this.lineNo = lineNo;
                this.start = start;
                this.end = end;
            }
     
            public RegexMatchPoint copy() {
                return new RegexMatchPoint(lineNo, start, end);
            }
        }
     
        @Setter
        @Getter
        public static class MatchedFileLine {
            private Integer startLine;
            private Integer endLine;
            public MatchedFileLine(Integer startLine, Integer endLine) {
                this.startLine = startLine;
                this.endLine = endLine;
            }
        }
    }
    
    

    自测

    
    /**
     * 高亮展示
     * Created by qinshu on 2021/12/31
     */
    public class HighlightUtilTest {
     
        String content = "dependencies {\n" +
                "    testCompile group: 'junit', name: 'junit'\n" +
                "\n" +
                "    compile project(\":detect-lib\")\n" +
                "    compile project(\":connect-cli\")\n" +
                "    compile project(\":wisteria-client\")\n" +
                "    compile project(\":upload-cli\")\n" +
                "    compile project(\":scan-client\")\n" +
                "    compile(\"com.qt.qt-common:config-loader\")\n" +
                "    compile project(\":switches-lib\")\n" +
                "    compile project(\":bizevent-lib\")\n" +
                "    compile project(\":user-client\")\n" +
                "    compile project(\":notif-client\")\n" +
                "    compile project(\":detect-client\")\n" +
                "    compile project(\":job-cli\")\n" +
                "    compile('com.qt.qt-common:redis-lib')\n" +
                "    compile('com.qt.qt-common:rabbitmq-lib')\n" +
                "    compile('com.qt.qt-common:encrypt-property-lib')\n" +
                "    compile project(\":leader-latch-lib\")\n" +
                "    compile(\"com.qt.qt-common:eventflow-lib:1.0.0-SNAPSHOT\")\n" +
                "    compile(\"com.qt.qt-common:intrusion-detect-lib:1.0.1\")\n" +
                "    compile('com.qt.qt-common:mysql-lib')\n" +
                "    compile('com.qt.qt-common:rule-crypto')\n" +
                "    compile project(\":rule-lib\")\n" +
                "    compile project(\":api-auth-lib\")\n" +
                "\n" +
                "    // Spring Cloud\n" +
                "    //  配置中心\n" +
                "    compile ('org.springframework.cloud:spring-cloud-starter-zookeeper-config')\n" +
                "    //  服务发现\n" +
                "    compile ('org.springframework.cloud:spring-cloud-starter-zookeeper-discovery')\n" +
                "    compile ('com.netflix.hystrix:hystrix-javanica')\n" +
                "\n" +
                "    // Spring Boot\n" +
                "    compile('org.springframework.boot:spring-boot-starter-web')\n" +
                "    compile('org.springframework.boot:spring-boot-starter-aop')\n" +
                "    compile('org.springframework.boot:spring-boot-starter-data-redis')\n" +
                "\n" +
                "    // Spring\n" +
                "    compile('org.springframework:spring-orm')\n" +
                "    compile('org.springframework:spring-jdbc')\n" +
                "    compile('org.springframework:spring-aop')\n" +
                "\n" +
                "    // mongodb\n" +
                "    compile('org.springframework.data:spring-data-mongodb:1.10.23.RELEASE')\n" +
                "\n" +
                "    // Mysql\n" +
                "    runtime('mysql:mysql-connector-java')\n" +
                "    compile('com.zaxxer:HikariCP')\n" +
                "    compile('org.mybatis.spring.boot:mybatis-spring-boot-starter')\n" +
                "    compile('com.github.pagehelper:pagehelper-spring-boot-starter')\n" +
                "\n" +
                "    //redisson\n" +
                "    compile('io.projectreactor:reactor-core:3.2.8.RELEASE')\n" +
                "\n" +
                "    // Jackson\n" +
                "    compile('com.fasterxml.jackson.core:jackson-core')\n" +
                "    compile('com.fasterxml.jackson.core:jackson-annotations')\n" +
                "    compile('com.fasterxml.jackson.core:jackson-databind')\n" +
                "    compile('org.codehaus.jackson:jackson-core-asl')\n" +
                "\n" +
                "    compile('joda-time:joda-time')\n" +
                "    compile('commons-io:commons-io:2.5')\n" +
                "    compile('org.apache.commons:commons-lang3:3.5')\n" +
                "    compile('org.apache.commons:commons-collections4:4.1')\n" +
                "    compile('cglib:cglib:3.2.5')\n" +
                "    compile('net.java.dev.jna:jna:5.8.0')\n" +
                "    compile('org.apache.calcite:calcite-core:1.26.0')\n" +
                "\n" +
                "    // Test\n" +
                "    testCompile('org.mockito:mockito-core:2.13.0')\n" +
                "    testCompile('org.springframework:spring-test')\n" +
                "    testCompile('org.springframework.boot:spring-boot-starter-test')\n" +
                "\n" +
                "    // string-similarity\n" +
                "    compile('info.debatty:java-string-similarity:0.24')\n" +
                "\n" +
                "    compile('com.jayway.jsonpath:json-path')\n" +
                "\n" +
                "    compile('com.qt.qt-common:cron-lib:1.0.0')\n" +
                "\n" +
                "}";
     
        @Test
        public void tsetHighlight() {
            String regex = "org\\.apache";
            List<MatchedFileContent> matched = HighLightUtil.highlight(content, regex);
            Assert.assertTrue(matched.size() > 0);
        }
     
        @Test
        public void testHighlightBase64() {
            String content = "MG1laW5hMiAxbWVpbmEyCjBtZWluYTIgMW1laW5hMgo=";
            String regex = "meina2";
            List<MatchedFileContent> matchedFileContents = HighLightUtil.highlightBase64(content, regex);
            Assert.assertEquals(1, matchedFileContents.size());
            Assert.assertEquals("[MatchedFileContent(startLine=1, partContent=0<qthighlight--meina2--qthighlight> 1<qthighlight--meina2--qthighlight>\n" +
                    "0<qthighlight--meina2--qthighlight> 1<qthighlight--meina2--qthighlight>\n" +
                    ")]", matchedFileContents.toString());
        }
     
        @Test
        public void testHighLight2() {
            String content = "customdir2  1\n" +
                    "customdir2  2\n" +
                    "customdir2  3\n" +
                    "customdir2  4\n" +
                    "customdir2  5\n" +
                    "customdir2  6\n" +
                    "customdir2  7 customdir2  7 customdir2  7 customdir2  7\n" +
                    "customdir2  8 customdir2  8 customdir2  8 customdir2  8";
            String regex = "customdir2";
            List<MatchedFileContent> matchedFileContents = HighLightUtil.highlight(content, regex);
            Assert.assertEquals(2, matchedFileContents.size());
        }
     
        @Test
        public void testHighLight3() {
            String content = "customdir2  1\n" +
                    "customdir2  2\n" +
                    "customdir3  3\n" +
                    "customdir5  4\n" +
                    "customdir6  5\n" +
                    "customdir9  6\n" +
                    "customdir10  7 customdir8  7 customdird  7 customdiro  7\n" +
                    "customdir2  8 customdir2  8 customdir2  8 customdir2  8";
            String regex = "customdir2";
            List<MatchedFileContent> matchedFileContents = HighLightUtil.highlight(content, regex);
            Assert.assertEquals(2, matchedFileContents.size());
        }
    }
    
    

    小结

    本文讲解了如何运用“构造与使用分离”的思想,来重构和改进高亮展示命中文本内容的算法实现。 构造与使用分离,即是在构造的时候抽取所需的必要信息,而在使用的时候去构建所需要功能,而不是边构建边使用,将构建与使用耦合在一起,后续如果有需求变更,改动就会比较麻烦。

  • 相关阅读:
    专职DBA-MySQL主从延时复制
    专职DBA-MySQL主从异步复制
    专职DBA-MySQL连接启动与多实例数据库
    Java日期时间2
    Java日期时间
    763. Partition Labels
    377. Combination Sum IV
    459. Repeated Substring Pattern
    784. Letter Case Permutation
    526. Beautiful Arrangement
  • 原文地址:https://www.cnblogs.com/lovesqcc/p/15851440.html
Copyright © 2020-2023  润新知