Implementing a sensitive-word filter in PHP

I spent some spare time over the weekend writing a sensitive-word filter; here is a record of how it works.

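Sensitive words are, on the one hand, the kind you already know about; on the other hand, we may also want to filter personal attacks, advertising and similar content ourselves. Word lists are easy to find with a Google search; there are plenty of them.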

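Filtering with a plain loop of str_replace calls is inefficient: every word added to the list costs one more pass over the text, so the running time grows with the size of the list, and plain replacement cannot catch words that do not match exactly. The usual answer is to build a trie (prefix tree) first. A plain trie takes a fair amount of memory; a Double-Array Trie or a Ternary Search Tree keeps the performance while saving some space. A sensitive-word list is rarely large, though, and a few thousand or even tens of thousands of words is no real burden, so this implementation simply builds a plain trie and then matches character by character.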
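
To make the data structure concrete, here is a minimal sketch of a trie stored as nested PHP arrays, with an 'end' key marking complete words as the class further down does. The function names `buildTrie` and `containsWord` and the sample words are assumptions made for illustration; they are not part of the original class, and this version only does exact, contiguous matching.

```php
<?php
// Hypothetical helper: build a nested-array trie from a list of words.
function buildTrie(array $words): array
{
    $trie = array();
    foreach ($words as $word) {
        $node = &$trie;
        // Split into UTF-8 characters (uses the bundled PCRE extension).
        foreach (preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY) as $char) {
            if (!isset($node[$char])) {
                $node[$char] = array();
            }
            $node = &$node[$char];
        }
        $node['end'] = true; // marks the end of a complete word
        unset($node);        // drop the reference before the next word
    }
    return $trie;
}

// Hypothetical helper: does $text contain any word stored in the trie?
function containsWord(array $trie, string $text): bool
{
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    $count = count($chars);
    for ($i = 0; $i < $count; $i++) {
        $node = $trie;
        // Follow the trie for as long as the next character is a child node.
        for ($j = $i; $j < $count && isset($node[$chars[$j]]); $j++) {
            $node = $node[$chars[$j]];
            if (isset($node['end'])) {
                return true;
            }
        }
    }
    return false;
}

$trie = buildTrie(array('ab', 'abc', 'xy'));
var_dump(containsWord($trie, '..abx..'));
var_dump(containsWord($trie, '..ax..'));
```

Each starting position walks the trie at most as deep as the longest word, so the work per character of text does not depend on how many words the list holds, which is the advantage over looping str_replace.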

There is not much code, so I will just paste it here.

<?php

class SensitiveWordFilter
{
    private $dict;
    private $dictPath;

    public function __construct($dictPath)
    {
        $this->dict = array();
        $this->dictPath = $dictPath;
        $this->initDict();
    }

    private function initDict()
    {
        $handle = fopen($this->dictPath, 'r');
        if (!$handle) {
            throw new RuntimeException('open dictionary file error.');
        }

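        while (!feof($handle)) {
            // One word per line; fgets() reads at most 127 bytes of it.
            $word = trim(fgets($handle, 128));

            // Compare against '' explicitly: empty() would also skip the word "0".
            if ($word === '') {
                continue;
            }

            $uWord = $this->unicodeSplit($word);

            // Walk down from the root, creating one nested array per character.
            $pdict = &$this->dict;

            $count = count($uWord);
            for ($i = 0; $i < $count; $i++) {
                if (!isset($pdict[$uWord[$i]])) {
                    $pdict[$uWord[$i]] = array();
                }
                $pdict = &$pdict[$uWord[$i]];
            }

            // Mark the node where a complete word ends.
            $pdict['end'] = true;
        }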

        fclose($handle);
    }

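    /**
     * Replace every matched character of a sensitive word in $str with '*'.
     * $maxDistance is how many following characters are examined for the next
     * character of a word, so 1 means the characters must be adjacent.
     * The result is lowercased, because unicodeSplit() lowercases its input.
     */
    public function filter($str, $maxDistance = 5)
    {
        if ($maxDistance < 1) {
            $maxDistance = 1;
        }

        $uStr = $this->unicodeSplit($str);

        $count = count($uStr);

        for ($i = 0; $i < $count; $i++) {
            if (isset($this->dict[$uStr[$i]])) {
                $pdict = &$this->dict[$uStr[$i]];

                $matchIndexes = array();
                // Number of entries in $matchIndexes at the last complete word; -1 means none seen yet.
                $endCount = isset($pdict['end']) ? 0 : -1;

                for ($j = $i + 1, $d = 0; $d < $maxDistance && $j < $count; $j++, $d++) {
                    if (isset($pdict[$uStr[$j]])) {
                        $matchIndexes[] = $j;
                        $pdict = &$pdict[$uStr[$j]];
                        $d = -1; // reset the gap counter after each matched character
                        if (isset($pdict['end'])) {
                            $endCount = count($matchIndexes);
                        }
                    }
                }

                // Mask only up to the last complete word, so that a longer partial match
                // (e.g. "abc" against the words "ab" and "abcd") does not hide the shorter word.
                if ($endCount >= 0) {
                    $uStr[$i] = '*';
                    foreach (array_slice($matchIndexes, 0, $endCount) as $k) {
                        if ($k - $i == 1) {
                            $i = $k;
                        }
                        $uStr[$k] = '*';
                    }
                }
            }
        }

        return implode($uStr);
    }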

    public function unicodeSplit($str)
    {
        $str = strtolower($str);
        $ret = array();
        $len = strlen($str);
        for ($i = 0; $i < $len; $i++) {
            $c = ord($str[$i]);

            if ($c & 0x80) {
                if (($c & 0xf8) == 0xf0 && $len - $i >= 4) {
                    if ((ord($str[$i + 1]) & 0xc0) == 0x80 && (ord($str[$i + 2]) & 0xc0) == 0x80 && (ord($str[$i + 3]) & 0xc0) == 0x80) {
                        $uc = substr($str, $i, 4);
                        $ret[] = $uc;
                        $i += 3;
                    }
                } else if (($c & 0xf0) == 0xe0 && $len - $i >= 3) {
                    if ((ord($str[$i + 1]) & 0xc0) == 0x80 && (ord($str[$i + 2]) & 0xc0) == 0x80) {
                        $uc = substr($str, $i, 3);
                        $ret[] = $uc;
                        $i += 2;
                    }
                } else if (($c & 0xe0) == 0xc0 && $len - $i >= 2) {
                    if ((ord($str[$i + 1]) & 0xc0) == 0x80) {
                        $uc = substr($str, $i, 2);
                        $ret[] = $uc;
                        $i += 1;
                    }
                }
            } else {
                $ret[] = $str[$i];
            }
        }
        return $ret;
    }
}

Usage

<?php
require 'SensitiveWordFilter.php';

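/*
Pass the path of the word-list file to the constructor.
The file holds one word per line, for example:
敏感1
敏感2

Only UTF-8 encoding is supported for now.
*/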
$filter = new SensitiveWordFilter(__DIR__ . '/sensitive_words.txt');

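/*
The first argument is the string to filter; the second is the maximum
distance allowed between two consecutive characters of a word.
For example, if '枪支' is a sensitive word and you also want to catch
'枪||||支', the distance must be large enough to span the separators
(four separators need a distance of at least 5). Tune it to your needs;
a gap beyond the distance is not filtered.
Every matched character of a sensitive word is replaced with '*'.
*/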
$filter->filter('这是一个敏感词', 10);
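
The distance rule is easier to see in isolation. The sketch below checks a single word against a text using the same gap counting as filter(); the function `matchesWithGaps` is a hypothetical stand-alone helper written for this illustration, not a method of the class.

```php
<?php
// Hypothetical helper: does $text contain $word when up to $maxDistance
// following characters may be examined for each next character of the word?
function matchesWithGaps(string $word, string $text, int $maxDistance): bool
{
    $w = preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY);
    $t = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    $wordLen = count($w);
    $textLen = count($t);
    if ($wordLen === 0) {
        return false;
    }

    for ($i = 0; $i < $textLen; $i++) {
        if ($t[$i] !== $w[0]) {
            continue;
        }
        $next = 1; // index of the next word character to look for
        for ($j = $i + 1, $d = 0; $next < $wordLen && $d < $maxDistance && $j < $textLen; $j++, $d++) {
            if ($t[$j] === $w[$next]) {
                $next++;
                $d = -1; // reset the gap counter, as filter() does
            }
        }
        if ($next === $wordLen) {
            return true;
        }
    }
    return false;
}

var_dump(matchesWithGaps('枪支', '枪||||支', 5)); // four separators, distance 5
var_dump(matchesWithGaps('枪支', '枪||||支', 4)); // four separators, distance 4
var_dump(matchesWithGaps('枪支', '枪支', 1));     // adjacent characters
```

With this counting, a distance of 1 accepts only adjacent characters, and n separators need a distance of n + 1.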

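I have not benchmarked it in detail, but it is enough for ordinary use. The cost is mainly CPU. To avoid rebuilding the trie on every request, the built dictionary can be JSON-encoded and stored in Redis or Memcached, then fetched and decoded on the next use.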
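
A minimal sketch of that caching idea follows. It assumes a local file as a stand-in for Redis or Memcached so that it runs without extra extensions, and the function `loadTrieCached` is a hypothetical helper; the class as written has no way to accept a prebuilt dictionary, so wiring this in would need a small change to its constructor.

```php
<?php
// Hypothetical helper: return the trie from a JSON cache, rebuilding it on a miss.
// $build is any callable that returns the trie as a nested array.
function loadTrieCached(string $cacheFile, callable $build): array
{
    if (is_file($cacheFile)) {
        // Decode to associative arrays so the nested-array shape is restored.
        $trie = json_decode(file_get_contents($cacheFile), true);
        if (is_array($trie)) {
            return $trie; // cache hit
        }
    }

    $trie = $build();
    // JSON_UNESCAPED_UNICODE keeps multi-byte characters readable and compact.
    file_put_contents($cacheFile, json_encode($trie, JSON_UNESCAPED_UNICODE), LOCK_EX);
    return $trie;
}

// Demo with a tiny hand-written trie holding the single word "敏感".
$cacheFile = sys_get_temp_dir() . '/sensitive_trie_demo.json';
if (is_file($cacheFile)) {
    unlink($cacheFile);
}
$build = function () {
    return array('敏' => array('感' => array('end' => true)));
};

$first = loadTrieCached($cacheFile, $build);  // builds the trie and writes the cache
$second = loadTrieCached($cacheFile, $build); // reads the trie back from the cache
var_dump($first === $second);
unlink($cacheFile);
```

With Redis or Memcached the file read and write would become a get and a set on the cache client, with the same encode and decode steps around them.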

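When PHP serves web requests it does not run as a daemon, so the data structure cannot easily stay resident in memory between requests. In that respect C, C++, Java and similar languages may be a better fit; if performance requirements are strict, the filter can be written as a service in another language. If it has to be PHP, the service can be wrapped with Swoole.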

Original article: https://www.cnblogs.com/zenghansen/p/5688995.html