Sensitive (Sentiment) Word Filtering: Development Summary

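Our company recently had the following requirement: when a user publishes a post or comment that matches one of the sentiment words configured in the admin backend, the item is "sunk" (marked with a status flag). At query time, a sunk post or comment is visible to its author but hidden from everyone else.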
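
As a rough sketch of the visibility rule just described, the query-side check could look like the following. The class, field, and method names here are made up for illustration and are not taken from the project; the real implementation presumably filters in the database or search query instead.

```java
import java.util.List;
import java.util.stream.Collectors;

final class SinkVisibility {

    /** Minimal stand-in for a post or comment; all field names are hypothetical. */
    static final class Item {
        final long id;
        final long authorId;
        final boolean sunk;

        Item(long id, long authorId, boolean sunk) {
            this.id = id;
            this.authorId = authorId;
            this.sunk = sunk;
        }
    }

    /** A sunk item is visible only to its author; any other item is visible to everyone. */
    static boolean isVisibleTo(Item item, long viewerId) {
        return !item.sunk || item.authorId == viewerId;
    }

    /** Keeps only the items the given viewer is allowed to see. */
    static List<Item> visibleItems(List<Item> items, long viewerId) {
        return items.stream()
                .filter(item -> isVisibleTo(item, viewerId))
                .collect(Collectors.toList());
    }
}
```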

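This requirement breaks down into a few pieces: the backend needs a management feature for sentiment words, and whenever a user publishes a post or comment we need to determine whether it contains any of them. That raises the question of which matching algorithm to use and how fast it is.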
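
To give a feel for the algorithm side, here is a minimal trie-based matcher. It is only an illustration of multi-keyword matching and is not how ToolGood.Words is implemented; the class name SimpleKeywordTrie and its methods are invented for this sketch. The point is that the keyword set is compiled once into a tree, after which each text is scanned without looping over the keywords one by one.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SimpleKeywordTrie {

    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        String keyword; // non-null when a keyword ends at this node
    }

    private final Node root = new Node();

    /** Builds the trie once; this is the expensive step that should not run on every check. */
    SimpleKeywordTrie(List<String> keywords) {
        for (String keyword : keywords) {
            if (keyword == null || keyword.isEmpty()) {
                continue;
            }
            Node node = root;
            for (int i = 0; i < keyword.length(); i++) {
                node = node.children.computeIfAbsent(keyword.charAt(i), c -> new Node());
            }
            node.keyword = keyword;
        }
    }

    /** Returns true as soon as any keyword is found in the text. */
    boolean containsAny(String text) {
        return !find(text, true).isEmpty();
    }

    /** Returns every keyword occurrence, in order of start position. */
    List<String> findAll(String text) {
        return find(text, false);
    }

    private List<String> find(String text, boolean stopAtFirst) {
        List<String> hits = new ArrayList<>();
        if (text == null) {
            return hits;
        }
        // Try to walk the trie from every start position in the text.
        for (int start = 0; start < text.length(); start++) {
            Node node = root;
            for (int i = start; i < text.length(); i++) {
                node = node.children.get(text.charAt(i));
                if (node == null) {
                    break;
                }
                if (node.keyword != null) {
                    hits.add(node.keyword);
                    if (stopAtFirst) {
                        return hits;
                    }
                }
            }
        }
        return hits;
    }
}
```

This simple version restarts from the root at every position, so its cost grows with text length times the longest keyword; production libraries typically avoid that rescanning.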

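Third-party services usually cover this kind of feature; providers such as Tencent and Baidu offer APIs for it. However, given the latency of calling a third party over RPC, I decided to look for a component that could run locally. A search on GitHub turned one up.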

GitHub repository: https://github.com/toolgood/ToolGood.Words

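I mainly work in Java, so I created a new Maven project for the Java code from that repository, copied it into my own project, and used it for filtering.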

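After finishing the backend CRUD feature I ran into a problem: whenever a sentiment word is added or modified, the word list has to be reloaded into memory before filtering. But if the list is reloaded on every check, memory usage keeps growing and eventually leads to an OOM.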

So I added a warm-up step at application startup. The code is as follows:

package com.gwm.lafeng.consts;

import com.gwm.lafeng.WordsSearchEx2;
import com.gwm.lafeng.dao.community.BbsSentimentDao;
import org.apache.commons.collections.CollectionUtils;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Service;

import javax.annotation.PostConstruct;
import javax.annotation.Resource;
import java.util.List;

/**
 * @Author fanht
 * @Description
 * @Date 2022/1/21 2:44 PM
 * @Version 1.0
 */
@Service
public class WarmUpConstants {

    private Logger logger = LoggerFactory.getLogger(this.getClass());

    // Shared matcher; volatile so an instance assigned on one thread is visible to the others
    public static volatile WordsSearchEx2 sensitiveWords;

    @Resource
    private BbsSentimentDao bbsSentimentDao;

    // Runs once after dependency injection, i.e. at application startup
    @PostConstruct
    public void init() {
        initSentiment();
    }


    private void initSentiment() {
        logger.info("===================== initializing sentiment words =========================");
        try {
            List<String> sentimentList = bbsSentimentDao.queryAllSentimentWords();
            if (CollectionUtils.isNotEmpty(sentimentList)) {
                // Build the matcher completely before publishing it through the static field
                WordsSearchEx2 search = new WordsSearchEx2();
                search.SetKeywords(sentimentList);
                sensitiveWords = search;
            }
        } catch (Exception e) {
            logger.error("Failed to initialize sentiment words",e);
        }
        logger.info("=============== initializing sentiment words: end ============");
    }
}

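Then, every time a word is added, modified, or deleted, the data is queried from the database again and loaded into memory. Because doing this synchronously could be slow, I made it asynchronous.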

 Enable async support on the application class:

  @EnableAsync

 

    /**
     * Refresh the in-memory sentiment word list after every successful add, update, or import
     */
    private void resetWarmUp(){
        try {
            asyncWarmUpUtils.resetWarmUp();
        } catch (Exception e) {
            logger.error("Failed to reset the sentiment word list",e);
        }
    }




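import com.gwm.lafeng.WordsSearchEx2;
import com.gwm.lafeng.consts.WarmUpConstants;
import com.gwm.lafeng.dao.community.BbsSentimentDao;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.scheduling.annotation.Async;
import org.springframework.stereotype.Component;

import java.util.List;

/**
 * @Author fanht
 * @Description
 * @Date 2022/1/24 5:58 PM
 * @Version 1.0
 */
@Component
public class AsyncWarmUpUtils {

    private Logger logger = LoggerFactory.getLogger(this.getClass());

    @Autowired
    private BbsSentimentDao bbsSentimentDao;

    /**
     * Refresh the in-memory sentiment word list after every successful add, update, or import.
     * Reloading is fairly slow, so it runs asynchronously.
     */
    @Async
    public void resetWarmUp(){
        try {
            List<String> sentimentList = bbsSentimentDao.queryAllSentimentWords();
            if (org.apache.commons.collections.CollectionUtils.isNotEmpty(sentimentList)) {
                WordsSearchEx2 search = WarmUpConstants.sensitiveWords;
                if (search == null) {
                    // Nothing was loaded at startup (the table was empty), so create the shared instance now
                    search = new WordsSearchEx2();
                    search.SetKeywords(sentimentList);
                    WarmUpConstants.sensitiveWords = search;
                } else {
                    search.SetKeywords(sentimentList);
                }
            }
        } catch (Exception e) {
            logger.error("Failed to reset the sentiment word list",e);
        }
    }
}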
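
The refresh above updates the shared matcher in place while request threads may be reading it. One alternative, sketched below with hypothetical names (ReloadableHolder is not part of the project or of ToolGood.Words), is to build a fresh instance off to the side and then publish it by swapping a single reference, so readers always see either the old or the new word list. Each reload allocates a new instance, so this only makes sense when reloads are infrequent, such as after a CRUD operation; I have not measured it against the in-place approach.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Function;

final class ReloadableHolder<T> {

    private final AtomicReference<T> current = new AtomicReference<>();
    private final Function<List<String>, T> builder;

    /** The builder turns a keyword list into a ready-to-use matcher of type T. */
    ReloadableHolder(Function<List<String>, T> builder) {
        this.builder = builder;
    }

    /** Builds a new instance first, then publishes it in one atomic step. */
    void reload(List<String> keywords) {
        if (keywords == null || keywords.isEmpty()) {
            return; // keep the previous instance when there is nothing to load
        }
        T fresh = builder.apply(keywords);
        current.set(fresh);
    }

    /** Returns the latest published instance, or null if nothing has been loaded yet. */
    T get() {
        return current.get();
    }
}
```

A caller would read the instance once per request (for example `Matcher m = holder.get();`, where Matcher stands for whatever type is being held) and use that local reference for the whole check.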

The slowness comes mainly from loading the word list into ToolGood.Words: if that is done on every check, it is extremely slow.

With the word list already in memory, each check only needs to run the match:

    /**
     * Checks whether a comment being published contains sentiment words.
     * @param commentDTO the comment being published
     * @return the same DTO, with its sink and sentiment-word fields filled in
     */
    private  PublishCommentDTO hasSentimentWords(PublishCommentDTO commentDTO){

        try {
            // If the comment has already been sunk (or has no content), skip the check; this mainly affects the comment count computed via MQ
            if(BbsConstant.IS_SINK_1.equals(commentDTO.getIsSink()) || StringUtils.isEmpty(commentDTO.getCommentContent())){
                return commentDTO;
            }
            // Read the shared field once so the null check and the calls below use the same instance
            WordsSearchEx2 sensitiveWords = WarmUpConstants.sensitiveWords;
            if(sensitiveWords == null){
                return  commentDTO;
            }
            // todo: this step used to be very slow; this is the optimized version that reuses the preloaded matcher
            if(sensitiveWords.ContainsAny(commentDTO.getCommentContent())){
                LOG.info("Comment contains sentiment words");
                List<WordsSearchResult> senWords = sensitiveWords.FindAll(commentDTO.getCommentContent());
                StringBuilder matched = new StringBuilder();
                senWords.forEach(f -> matched.append(f.Keyword).append(","));
                String joined = matched.toString();
                commentDTO.setIsSink(BbsConstant.IS_SINK_1);
                commentDTO.setHasSentimentWords(BbsConstant.HAS_SENTIMENT);
                // The stored word list is capped at 200 characters
                commentDTO.setSentimentWords(joined.length() > 200 ? joined.substring(0,200) : joined);
            }else {
                commentDTO.setIsSink(BbsConstant.IS_SINK_0);
                commentDTO.setHasSentimentWords(BbsConstant.NO_SENTIMENT);
                commentDTO.setSentimentWords(null);
            }
            return commentDTO;
        } catch (Exception e) {
            LOG.error("Failed to check the comment for sentiment words",e);
        }
        return commentDTO;
    }
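
The method above joins the matched words with commas and cuts the result at 200 characters. As a small variation on that step (not the project's code; SentimentWordJoiner and joinCapped are names invented here), the helper below also removes repeated hits and avoids the trailing comma before applying the cap:

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class SentimentWordJoiner {

    /**
     * Joins the matched keywords with commas, dropping duplicates while keeping
     * first-seen order, and truncates the result to at most maxLength characters.
     */
    static String joinCapped(List<String> keywords, int maxLength) {
        Set<String> unique = new LinkedHashSet<>(keywords);
        String joined = String.join(",", unique);
        return joined.length() > maxLength ? joined.substring(0, maxLength) : joined;
    }
}
```

Removing duplicates first means a word that appears many times in a long comment does not crowd other matched words out of the 200-character column.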

For scheduled (periodic) checking, the code is as follows:

package com.gwm.lafeng.timingtask;

import com.alibaba.excel.util.CollectionUtils;
import com.alibaba.fastjson.JSONObject;
import com.gwm.lafeng.WordsSearchEx2;
import com.gwm.lafeng.WordsSearchResult;
import com.gwm.lafeng.common.config.BbsConstant;
import com.gwm.lafeng.common.util.DateUtils;
import com.gwm.lafeng.consts.WarmUpConstants;
import com.gwm.lafeng.dao.community.BbsPostsDao;
import com.gwm.lafeng.dao.community.BbsSentimentDao;
import com.gwm.lafeng.document.CommunityStreamDocument;
import com.gwm.lafeng.entity.community.BbsPosts;
import com.gwm.lafeng.service.community.EsCommunityStreamService;
import org.apache.commons.lang3.StringUtils;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.scheduling.annotation.EnableScheduling;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

import javax.annotation.Resource;
import java.util.Date;
import java.util.List;

/**
 * @Author fanht
 * @Description Sinks posts that trigger sentiment words
 * @Date 2022/1/15 9:53 AM
 * @Version 1.0
 */
@Component
@EnableScheduling
public class TimingBbsSentimentManager {

    private Logger logger = LoggerFactory.getLogger(this.getClass());

    @Resource
    private EsCommunityStreamService esCommunityStreamService;

    @Resource
    private BbsPostsDao bbsPostsDao;


    /**
     * Runs once a minute, at second 10 of every minute
     */
    @Scheduled(cron = "10 * * * * ?")
    public void job() {
        logger.info("======= auto-sink on sentiment words: start ============");
        // Query posts that passed review; the cutoff passed in is 24*60 minutes (24 hours) before now
        List<BbsPosts> postsList = bbsPostsDao.queryCheckPass(DateUtils.jumpMinute(new Date(), -24*60));
        // The shared matcher stays null until at least one sentiment word has been loaded
        WordsSearchEx2 sensitiveWords = WarmUpConstants.sensitiveWords;
        if (!CollectionUtils.isEmpty(postsList) && sensitiveWords != null) {

            postsList.forEach(t -> {
                BbsPosts bbsPosts = new BbsPosts();
                bbsPosts.setId(t.getId());
                StringBuilder matched = new StringBuilder();

                boolean tas = !StringUtils.isEmpty(t.getTitle()) && sensitiveWords.ContainsAny(t.getTitle());
                boolean cas = !StringUtils.isEmpty(t.getContent()) && sensitiveWords.ContainsAny(t.getContent());

                // Collect the matched words from the title first, then from the content
                if(tas){
                    List<WordsSearchResult> senWords = sensitiveWords.FindAll(t.getTitle());
                    senWords.forEach(s -> matched.append(s.Keyword).append(","));
                }
                if(cas){
                    List<WordsSearchResult> senWords = sensitiveWords.FindAll(t.getContent());
                    senWords.forEach(s -> matched.append(s.Keyword).append(","));
                }

                if(tas || cas){
                    String joined = matched.toString();
                    bbsPosts.setHasSentimentWords(BbsConstant.HAS_SENTIMENT);
                    // The stored word list is capped at 200 characters
                    bbsPosts.setSentimentWords(joined.length() > 200 ? joined.substring(0,200) : joined);
                    logger.info("Post title or content contains sentiment words, triggering sink: "+ JSONObject.toJSON(bbsPosts));
                    bbsPostsDao.updateSentimentSink(bbsPosts);
                    // Set the sink flag on the corresponding community stream document as well
                    CommunityStreamDocument communityStreamDocument = new CommunityStreamDocument();
                    communityStreamDocument.setId(bbsPosts.getId().toString());
                    communityStreamDocument.setIsSink(BbsConstant.IS_SINK_1);
                    try {
                        esCommunityStreamService.saveOrUpdateDocument(communityStreamDocument);
                    } catch (Exception e) {
                       logger.error("Scheduled sentiment-word sink update failed",e);
                    }
                }
            });
        }
    }
}

Problems encountered:

1. First, the OOM problem

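Right after development was finished and the scheduled job was in place, the server in the dev environment kept reporting OOM errors. The cause was the job I had written: it ran once a minute, and on every run it loaded the word list into ToolGood.Words' in-memory structures. I am not sure why that memory was not reclaimed, but usage kept growing until the process ran out of memory. The fix was to keep the matcher in a static field and load it during warm-up. For comparison, this is how I had written it before: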

// todo: this step is very time-consuming
WordsSearchEx2 sensitiveWords = new WordsSearchEx2();
sensitiveWords.SetKeywords(sentimentList);


Written this way, it is extremely costly in both time and memory.

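2. Making CRUD changes take effect promptly
   To make changes take effect in real time, I query the database once after the application initializes and put the result into memory; after that, every update, delete, or insert refreshes the in-memory copy. Changes now take effect in real time, the speed is generally within 100 ms, and the OOM problem has not come back.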

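3. Asynchronous processing
   Done synchronously, loading the latest sensitive words into memory after a CRUD operation is fairly slow, so I used the @Async annotation here. An asynchronous thread pool would also work; with a thread pool, it can be handled like this: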

    ThreadFactory factory = new ThreadFactoryBuilder().setNameFormat("async-sentiment-words-%d").build();
    public final ExecutorService executorService = new ThreadPoolExecutor(5,20,10L, TimeUnit.MILLISECONDS,
            new LinkedBlockingQueue<>(20),factory,new ThreadPoolExecutor.AbortPolicy());



        executorService.submit(new Callable<Object>() {
            @Override
            public Object call() throws Exception {
                // asynchronous work goes here
              
                return null;
            }
        });
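
For reference, here is a self-contained variant of the thread-pool approach that uses only the JDK (the listing above relies on Guava's ThreadFactoryBuilder). The class SentimentRefreshExecutor, its method names, and the pool sizes are assumptions chosen for this sketch rather than values from the project. It logs failures inside the task, because an exception thrown in a submitted task is otherwise only visible through the returned Future.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class SentimentRefreshExecutor {

    private final AtomicInteger threadCounter = new AtomicInteger();

    // Names the worker threads so they are easy to spot in thread dumps and logs.
    private final ThreadFactory factory = runnable -> {
        Thread thread = new Thread(runnable, "async-sentiment-words-" + threadCounter.getAndIncrement());
        thread.setDaemon(true);
        return thread;
    };

    // Bounded queue plus AbortPolicy: extra submissions are rejected instead of piling up.
    private final ExecutorService executor = new ThreadPoolExecutor(
            1, 4, 10L, TimeUnit.SECONDS,
            new LinkedBlockingQueue<>(20), factory, new ThreadPoolExecutor.AbortPolicy());

    /** Submits the refresh and returns immediately; failures are logged, not rethrown. */
    Future<?> submitRefresh(Runnable refresh) {
        return executor.submit(() -> {
            try {
                refresh.run();
            } catch (RuntimeException e) {
                System.err.println("sentiment word refresh failed: " + e);
            }
        });
    }

    /** Stops accepting work and waits for queued refreshes; returns true if all finished in time. */
    boolean shutdownAndWait(long timeoutSeconds) {
        executor.shutdown();
        try {
            return executor.awaitTermination(timeoutSeconds, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

In the CRUD service, the call would then be something like `refreshExecutor.submitRefresh(() -> reloadWordsFromDatabase());`, where reloadWordsFromDatabase is a placeholder for the project's own reload logic.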

Original post: https://www.cnblogs.com/thinkingandworkinghard/p/15874624.html