• 判断中日韩文的正则表达式


    PHP中GBK和UTF8编码处理

    一、编码范围

    1. GBK (GB2312/GB18030)
    \x00-\xff  GBK双字节编码范围
    \x20-\x7f  ASCII
    \xa1-\xff  中文
    \x80-\xff  中文

    2. UTF-8 (Unicode)
    \u4e00-\u9fa5 (中文)
    \x3130-\x318F (韩文
    \xAC00-\xD7A3 (韩文)
    \u0800-\u4e00 (日文)
    ps: 韩文是大于[\u9fa5]的字符


    正则例子:
    preg_replace("/([\x80-\xff])/","",$str);
    preg_replace("/([u4e00-u9fa5])/","",$str);

    二、代码例子

    //判断内容里有没有中文-GBK (PHP)
    function check_is_chinese($s){
        return preg_match('/[\x80-\xff]./', $s);
    }

    //获取字符串长度-GBK (PHP)
    function gb_strlen($str){
        $count = 0;
        for($i=0; $i<strlen($str); $i++){
            $s = substr($str, $i, 1);
            if (preg_match("/[\x80-\xff]/", $s)) ++$i;
            ++$count;
        }
        return $count;
    }

    //截取字符串字串-GBK (PHP)
    function gb_substr($str, $len){
        $count = 0;
        for($i=0; $i<strlen($str); $i++){
            if($count == $len) break;
            if(preg_match("/[\x80-\xff]/", substr($str, $i, 1))) ++$i;
            ++$count;       
        }
        return substr($str, 0, $i);
    }

    //统计字符串长度-UTF8 (PHP)
    function utf8_strlen($str) {
        $count = 0;
        for($i = 0; $i < strlen($str); $i++){
            $value = ord($str[$i]);
            if($value > 127) {
                $count++;
                if($value >= 192 && $value <= 223) $i++;
                elseif($value >= 224 && $value <= 239) $i = $i + 2;
                elseif($value >= 240 && $value <= 247) $i = $i + 3;
                else die('Not a UTF-8 compatible string');
            }
            $count++;
        }
        return $count;
    }


    //截取字符串-UTF8(PHP)
    function utf8_substr($str,$position,$length){
        $start_position = strlen($str);
        $start_byte = 0;
        $end_position = strlen($str);
        $count = 0;
        for($i = 0; $i < strlen($str); $i++){
            if($count >= $position && $start_position > $i){
                $start_position = $i;
                $start_byte = $count;
            }
            if(($count-$start_byte)>=$length) {
                $end_position = $i;
                break;
             
            $value = ord($str[$i]);
            if($value > 127){
                $count++;
                if($value >= 192 && $value <= 223) $i++;
                elseif($value >= 224 && $value <= 239) $i = $i + 2;
                elseif($value >= 240 && $value <= 247) $i = $i + 3;
                else die('Not a UTF-8 compatible string');
            }
            $count++;

        }
        return(substr($str,$start_position,$end_position-$start_position));
    }


    //字符串长度统计-UTF8 [中文3个字节,俄文、韩文占2个字节,字母占1个字节] (Ruby)
    def utf8_string_length(str)
        temp = CGI::unescape(str)
        i = 0;
        j = 0;
        temp.length.times{|t|
            if temp[t] < 127
                i += 1
            elseif temp[t] >= 127 and temp[t] < 224
                j += 1
                if 0 == (j % 2)
                    i += 2
                    j = 0
                end
            else
                j += 1
                if 0 == (j % 3)
                    i +=2
                    j = 0
                end
            end
        }
        return i
    }

    //判断是否是有韩文-UTF-8 (JavaScript)
    function checkKoreaChar(str) {
        for(i=0; i<str.length; i++) {
            if(((str.charCodeAt(i) > 0x3130 && str.charCodeAt(i) < 0x318F) || (str.charCodeAt(i) >= 0xAC00 && str.charCodeAt(i) <= 0xD7A3))) {
                return true;
            }
        }
        return false;
    }
    //判断是否有中文字符-GBK (JavaScript)
    function check_chinese_char(s){
        return (s.length != s.replace(/[^\x00-\xff]/g,"**").length);

    三、参考文档


    另:


    public function csubstr($str, $start=0, $length, $charset="utf-8", $suffix=true)
    {
    if(function_exists("mb_substr"))
    return mb_substr($str, $start, $length, $charset);
    $re['utf-8'] = "/[\x01-\x7f]|[\xc2-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xff][\x80-\xbf]{3}/";
    $re['gb2312'] = "/[\x01-\x7f]|[\xb0-\xf7][\xa0-\xfe]/";
    $re['gbk'] = "/[\x01-\x7f]|[\x81-\xfe][\x40-\xfe]/";
    $re['big5'] = "/[\x01-\x7f]|[\x81-\xfe]([\x40-\x7e]|\xa1-\xfe])/";
    preg_match_all($re[$charset], $str, $match);
    $slice = join("",array_slice($match[0], $start, $length));
    if($suffix) return $slice."…";
    return $slice;
    }

  • 相关阅读:
    或许因为缺少默认route配置而导致的的ping超慢,甚至timeout
    zabbix没有frontends目录
    jenkins自动部署到tomcat报错:ERROR: Publisher hudson.plugins.deploy.DeployPublisher aborted due to exception
    tomcat访问manager报404;server.xml中配置了Context path
    配置使用;yum安装slatstack的master,minion<at>centos6_x86_64
    jenkins报错;自定义工作目录;
    深入剖析Java中的装箱和拆箱
    探秘Java中的String、StringBuilder以及StringBuffer
    Java异常处理和设计
    JVM的内存区域划分
  • 原文地址:https://www.cnblogs.com/kuyuecs/p/1688954.html
Copyright © 2020-2023  润新知