• PHP Multibyte String Functions


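    References

    Multibyte character encoding schemes and the issues around them are quite complex and beyond the scope of this documentation. For more information on these topics, refer to the following URLs and other resources.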

    Last updated: Fri, 12 Jul 2013

    User Contributed Notes: Multibyte String Functions - [29 notes]
    up
    2
    marc at ermshaus dot org
    4 years ago
    A small correction to patrick at hexane dot org's mb_str_replace function. The original function does not work as intended in case $replacement contains $needle.

    <?php
    function mb_str_replace($needle, $replacement, $haystack)
    {
        $needle_len = mb_strlen($needle);
        $replacement_len = mb_strlen($replacement);
        $pos = mb_strpos($haystack, $needle);
        while ($pos !== false)
        {
            $haystack = mb_substr($haystack, 0, $pos) . $replacement
                    . mb_substr($haystack, $pos + $needle_len);
            $pos = mb_strpos($haystack, $needle, $pos + $replacement_len);
        }
        return $haystack;
    }
    ?>
    up
    1
    efesar
    2 years ago
    This small mb_trim function works for me.

    <?php
    function mb_trim( $string )
    {
        // \s matches whitespace; the "u" modifier treats pattern and subject as UTF-8.
        $string = preg_replace( "/(^\s+)|(\s+$)/us", "", $string );

        return $string;
    }
    ?>
    up
    1
    johannesponader at dontspamme dot googlemail dot co
    2 years ago
    Please note that when migrating code to handle UTF-8 encoding, not only the functions mentioned here are useful, but also the function htmlentities() has to be changed to htmlentities($var, ENT_COMPAT, "UTF-8") or similar. I didn't scan the manual for it, but there could be some more functions that need adjustments like this.
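As a small illustration of the point above, here is a sketch that passes the charset to htmlentities() explicitly. The sample string is made up for this example, and its non-ASCII bytes are written as escapes so the result does not depend on how the script file itself is saved.

```php
<?php
// "Café <b>über</b>" written as UTF-8 bytes (sample data, not from the note above).
$input = "Caf\xC3\xA9 <b>\xC3\xBCber</b>";

// Naming the charset explicitly avoids relying on the default_charset setting.
$escaped = htmlentities($input, ENT_COMPAT, 'UTF-8');

echo $escaped, "\n"; // Caf&eacute; &lt;b&gt;&uuml;ber&lt;/b&gt;
```

If the charset named here does not match the real encoding of the input, the multibyte characters may be converted wrongly or the call may return an empty string, so it is worth checking where the data comes from.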
    up
    1
    chris at maedata dot com
    6 years ago
    The opposite of what Eugene Murai wrote in a previous comment is true when importing/uploading a file. For instance, if you export an Excel spreadsheet using the Save As Unicode Text option, you can use the following to convert it to UTF-8 after uploading:

    //Convert file to UTF-8 in case Windows mucked it up
    $file = explode( "\n", mb_convert_encoding( trim( file_get_contents( $_FILES['file']['tmp_name'] ) ), 'UTF-8', 'UTF-16' ) );
    up
    1
    mdoocy at u dot washington dot edu
    6 years ago
    Note that some of the multi-byte functions run in O(n) time, rather than constant time as is the case for their single-byte equivalents. This includes any functionality requiring access at a specific index, since random access is not possible in a string whose number of bytes will not necessarily match the number of characters. Affected functions include: mb_substr(), mb_strstr(), mb_strcut(), mb_strpos(), etc.
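One way to live with this, sketched below under the assumption that the input is UTF-8: when you need to visit every character, split the string once and index the resulting array, rather than calling mb_substr() for each position. The helper name example_utf8_chars() is hypothetical and exists only for this illustration.

```php
<?php
/**
 * Hypothetical helper: split a UTF-8 string into an array of characters
 * in a single pass, so later index lookups are plain array accesses.
 */
function example_utf8_chars(string $text): array
{
    // An empty pattern with the "u" modifier splits between UTF-8 characters.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8.');
    }
    return $chars;
}

$chars = example_utf8_chars("na\xC3\xAFve"); // "naïve" as UTF-8 bytes

echo count($chars), "\n"; // 5 characters (the string is 6 bytes long)
echo $chars[2], "\n";     // the third character, without rescanning the string
```

The trade-off is memory: the array holds one small string per character, which may matter for very large inputs.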
    up
    1
    deceze at gmail dot com
    10 months ago
    Please note that all the discussion about mb_str_replace in the comments is pretty pointless. str_replace works just fine with multibyte strings:

    <?php

    $string  = '漢字はユニコード';
    $needle  = 'は';
    $replace = 'Foo';

    echo str_replace($needle, $replace, $string);
    // outputs: 漢字Fooユニコード

    ?>

    The usual problem is that the string is evaluated as binary string, meaning PHP is not aware of encodings at all. Problems arise if you are getting a value "from outside" somewhere (database, POST request) and the encoding of the needle and the haystack is not the same. That typically means the source code is not saved in the same encoding as you are receiving "from outside". Therefore the binary representations don't match and nothing happens.
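To make the mismatch concrete, here is a minimal sketch. The sample values are invented: the haystack is UTF-8, while the needle arrives in ISO-8859-1, as it might from a legacy form or database. Bytes are written as escapes so the script's own file encoding does not matter.

```php
<?php
$haystack = "Stra\xC3\x9Fe"; // "Straße" encoded as UTF-8
$needle   = "\xDF";          // "ß" encoded as ISO-8859-1

// The byte sequences differ, so str_replace() finds nothing to replace.
$unchanged = str_replace($needle, 'ss', $haystack);

// Convert the needle to the haystack's encoding first, then replace.
$needleUtf8 = mb_convert_encoding($needle, 'UTF-8', 'ISO-8859-1');
$replaced   = str_replace($needleUtf8, 'ss', $haystack);

var_dump($unchanged === $haystack); // bool(true)
echo $replaced, "\n";               // Strasse
```

This only works if you actually know the source encoding; guessing it is a separate problem.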
    up
    -1
    phpnet at rcpt dot at
    2 years ago
    <?php
    /**
    * Multibyte safe version of trim()
    * Always strips whitespace characters (those equal to \s)
    *
    * @author Peter Johnson
    * @email phpnet@rcpt.at
    * @param $string The string to trim
    * @param $chars Optional list of chars to remove from the string ( as per trim() )
    * @param $chars_array Optional array of preg_quote'd chars to be removed
    * @return string
    */
    // Declared as a class method: place it inside a class, or drop "public static" to use it as a plain function.
    public static function mb_trim( $string, $chars = "", $chars_array = array() )
    {
        for( $x=0; $x<iconv_strlen( $chars ); $x++ ) $chars_array[] = preg_quote( iconv_substr( $chars, $x, 1 ) );

        $encoded_char_list = implode( "|", array_merge( array( "\s","\t","\n","\r", "\0", "\x0B" ), $chars_array ) );

        $string = mb_ereg_replace( "^($encoded_char_list)*", "", $string );
        $string = mb_ereg_replace( "($encoded_char_list)*$", "", $string );
        return $string;
    }
    ?>
    up
    -1
    mt at mediamedics dot nl
    3 years ago
    A multibyte one-to-one alternative for the str_split function (http://php.net/manual/en/function.str-split.php):

    <?php
    function mb_str_split($string, $split_length = 1){

        mb_internal_encoding('UTF-8');
        mb_regex_encoding('UTF-8');

        $split_length = ($split_length <= 0) ? 1 : $split_length;

        $mb_strlen = mb_strlen($string, 'utf-8');

        $array = array();

        // Advance by $split_length on each pass.
        for($i = 0; $i < $mb_strlen; $i += $split_length){
            $array[] = mb_substr($string, $i, $split_length);
        }

        return $array;
    }
    ?>
    up
    0
    rawsrc at gmail dot com
    1 year ago
    Hi,

    For those who are looking for mb_str_replace, here's a simple function:
    <?php
    function mb_str_replace($needle, $replacement, $haystack) {
        // Note: mb_split() treats $needle as a regular expression pattern.
        return implode($replacement, mb_split($needle, $haystack));
    }
    ?>
    I haven't found a simpler way to do it :-)
    up
    0
    peter AT(no spam) dezzignz dot com
    3 years ago
    The function trim() has not failed me so far in my multibyte applications, but in case one needs a truly multibyte function, here it is. The nice thing is that the character to remove can be whitespace or any other specified character, even a multibyte character.

    <?php

    // multibyte string split

    function mbStringToArray ($str) {
        if (empty($str)) return false;
        $len = mb_strlen($str);
        $array = array();
        for ($i = 0; $i < $len; $i++) {
            $array[] = mb_substr($str, $i, 1);
        }
        return $array;
    }

    // removes $rem at both ends

    function mb_trim ($str, $rem = ' ') {
        if (empty($str)) return false;
        // convert to array
        $arr = mbStringToArray($str);
        $len = count($arr);
        // left side
        for ($i = 0; $i < $len; $i++) {
            if ($arr[$i] === $rem) $arr[$i] = '';
            else break;
        }
        // right side
        for ($i = $len-1; $i >= 0; $i--) {
            if ($arr[$i] === $rem) $arr[$i] = '';
            else break;
        }
        // convert to string
        return implode ('', $arr);
    }

    ?>
    up
    0
    roydukkey at roydukkey dot com
    3 years ago
    This would be one way to create a multibyte substr_replace function

    <?php
    function mb_substr_replace($output, $replace, $posOpen, $posClose) {
        return mb_substr($output, 0, $posOpen).$replace.mb_substr($output, $posClose+1);
    }
    ?>
    up
    0
    sakai at d4k dot net
    4 years ago
    I hope this mb_str_replace will work for arrays.  Please use mb_internal_encoding() beforehand, if you need to change the encoding.

    Thanks to marc at ermshaus dot org for the original.

    <?php

    if(!function_exists('mb_str_replace')) {

        function mb_str_replace($search, $replace, $subject) {

            if(is_array($subject)) {
                $ret = array();
                foreach($subject as $key => $val) {
                    $ret[$key] = mb_str_replace($search, $replace, $val);
                }
                return $ret;
            }

            foreach((array) $search as $key => $s) {
                if($s == '') {
                    continue;
                }
                $r = !is_array($replace) ? $replace : (array_key_exists($key, $replace) ? $replace[$key] : '');
                $pos = mb_strpos($subject, $s);
                while($pos !== false) {
                    $subject = mb_substr($subject, 0, $pos) . $r . mb_substr($subject, $pos + mb_strlen($s));
                    $pos = mb_strpos($subject, $s, $pos + mb_strlen($r));
                }
            }

            return $subject;

        }

    }

    ?>
    up
    0
    mitgath at gmail dot com
    4 years ago
    According to http://bugs.php.net/bug.php?id=21317, here is the missing function:

    <?php
    function mb_str_pad ($input, $pad_length, $pad_string, $pad_style, $encoding="UTF-8") {
        return str_pad($input,
            strlen($input)-mb_strlen($input,$encoding)+$pad_length, $pad_string, $pad_style);
    }
    ?>
    up
    0
    Ben XO
    4 years ago
    PHP5 has no mb_trim(), so here's one I made. It works just like trim(), but with the added bonus of PCRE character classes (including, of course, all the useful Unicode ones such as \pZ).

    Unlike other approaches that I've seen to this problem, I wanted to emulate the full functionality of trim() - in particular, the ability to customise the character list.

    <?php
    /**
     * Trim characters from either (or both) ends of a string in a way that is
     * multibyte-friendly.
     *
     * Mostly, this behaves exactly like trim() would: for example supplying 'abc' as
     * the charlist will trim all 'a', 'b' and 'c' chars from the string, with, of
     * course, the added bonus that you can put unicode characters in the charlist.
     *
     * We are using a PCRE character-class to do the trimming in a unicode-aware
     * way, so we must escape ^, \, - and ] which have special meanings here.
     * As you would expect, a single \ in the charlist is interpreted as
     * "trim backslashes" (and duly escaped into a double-\ ). Under most circumstances
     * you can ignore this detail.
     *
     * As a bonus, however, we also allow PCRE special character-classes (such as '\s')
     * because they can be extremely useful when dealing with UCS. '\pZ', for example,
     * matches every 'separator' character defined in Unicode, including non-breaking
     * and zero-width spaces.
     *
     * It doesn't make sense to have two or more of the same character in a character
     * class, therefore we interpret a double \ in the character list to mean a
     * single \ in the regex, allowing you to safely mix normal characters with PCRE
     * special classes.
     *
     * *Be careful* when using this bonus feature, as PHP also interprets backslashes
     * as escape characters before they are even seen by the regex. Therefore, to
     * specify '\\s' in the regex (which will be converted to the special character
     * class '\s' for trimming), you will usually have to put *4* backslashes in the
     * PHP code - as you can see from the default value of $charlist.
     *
     * @param string
     * @param charlist list of characters to remove from the ends of this string.
     * @param boolean trim the left?
     * @param boolean trim the right?
     * @return String
     */
    function mb_trim($string, $charlist='\\\\s', $ltrim=true, $rtrim=true)
    {
        $both_ends = $ltrim && $rtrim;

        // First escape ^, -, ] and \ for use inside a character class,
        // then collapse each run of four backslashes back into one.
        $char_class_inner = preg_replace(
            array( '/[\^\-\]\\\\]/S', '/\\\\{4}/S' ),
            array( '\\\\\\0', '\\' ),
            $charlist
        );

        $work_horse = '[' . $char_class_inner . ']+';
        $ltrim && $left_pattern = '^' . $work_horse;
        $rtrim && $right_pattern = $work_horse . '$';

        if($both_ends)
        {
            $pattern_middle = $left_pattern . '|' . $right_pattern;
        }
        elseif($ltrim)
        {
            $pattern_middle = $left_pattern;
        }
        else
        {
            $pattern_middle = $right_pattern;
        }

        return preg_replace("/$pattern_middle/usSD", '', $string);
    }
    ?>
    up
    0
    patrick at hexane dot org
    5 years ago
    I wonder why there isn't a mb_str_replace().  Here's one for now:

    function mb_str_replace( $needle, $replacement, $haystack ) {
      $needle_len = mb_strlen($needle);
      $pos = mb_strpos( $haystack, $needle);
      while (!($pos ===false)) {
        $front = mb_substr( $haystack, 0, $pos );
        $back  = mb_substr( $haystack, $pos + $needle_len);
        $haystack = $front.$replacement.$back;
        $pos = mb_strpos( $haystack, $needle);
      }
      return $haystack;
    }
    up
    0
    motin at demomusic dot nu
    6 years ago
    As peter dot albertsson at spray dot se already pointed out, overloading strlen may break code that handles binary data and relies upon strlen for bytelengths.

    The problem occurs when a file is filled with a string using fwrite in the following manner:

    $len = strlen($data);
    fwrite($fp, $data, $len);

    fwrite takes the number of bytes as its third parameter, but with strlen overloaded by mb_strlen the call returns the number of characters in the string. Since a multibyte character can occupy more than one byte, the last characters of $data never get written to the file.
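A minimal sketch of one common workaround, assuming the mbstring extension is available: asking mb_strlen() for the length in the '8bit' encoding counts bytes, and an explicit mb_ call is not subject to the overloading of strlen(). The sample string and the in-memory stream are only for illustration. (Function overloading itself was deprecated in PHP 7.2 and removed in PHP 8.0, so this mainly concerns older installations.)

```php
<?php
$data = "tu\xC3\xB6\xC3\xA9\xC3\xA4"; // "tuöéä" as UTF-8 bytes

$charCount = mb_strlen($data, 'UTF-8'); // characters: 5
$byteCount = mb_strlen($data, '8bit');  // bytes: 8

// Write to an in-memory stream using the byte count, then read it back.
$fp = fopen('php://memory', 'w+b');
$written = fwrite($fp, $data, $byteCount);
rewind($fp);
$roundTrip = stream_get_contents($fp);
fclose($fp);

echo $charCount, ' characters, ', $byteCount, " bytes\n";
var_dump($roundTrip === $data); // bool(true): nothing was cut off
```

Passing the character count (5) as the third argument instead would have written only the first five bytes, which is the truncation described above.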

    After hours of investigating why PEAR::Cache_Lite didn't work - the above is what I found.

    I made an attempt at using single byte functions, but it doesn't work. Posting here anyway in case it helps someone else:

    /**
    * PHP single-byte functions simulation (unsuccessful)
    *
    * Usage: sb_string(functionname, arg1, arg2, etc);
    * Example: sb_string("strlen", "tuöéä"); returns 8 (should...)
    */
    function sb_string() {

      $arguments = func_get_args();

      $func_overloading = ini_get("mbstring.func_overload");

      ini_set("mbstring.func_overload", 0);

      $ret = call_user_func_array(array_shift($arguments), $arguments);

      ini_set("mbstring.func_overload", $func_overloading);

      return $ret;

    }
    up
    0
    pdezwart .at. snocap
    6 years ago
    If you are trying to emulate the UnicodeEncoding.Unicode.GetBytes() function in .NET, the encoding you want to use is: UCS-2LE
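A quick way to inspect what that conversion produces is to dump the bytes as hex. This is only a sketch with a made-up two-letter input, and it assumes the source string is UTF-8; it shows two bytes per character, low byte first, with no byte order mark added.

```php
<?php
$text  = 'AB';
$bytes = mb_convert_encoding($text, 'UCS-2LE', 'UTF-8');

echo bin2hex($bytes), "\n"; // 41004200
```

For characters outside the Basic Multilingual Plane, UCS-2 and UTF-16 differ, so 'UTF-16LE' may be the safer choice if such input is possible.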
    up
    0
    hayk at mail dot ru
    6 years ago
    Since PHP 5.1.0 and PHP 4.4.2, an Armenian ArmSCII-8 encoding (ArmSCII-8, ArmSCII8, ARMSCII-8, ARMSCII8) is available.
    up
    0
    daniel at softel dot jp
    6 years ago
    Note that although "multi-byte" hints at total internationalization, the mb_ API was designed by a Japanese person to support the Japanese language.

    Some of the functions, for example mb_convert_kana(), make absolutely no sense outside of a Japanese language environment.

    It should perhaps be considered "lucky" if the functions work with non-Japanese multi-byte languages.

    I don't mean any disrespect to the mb_ API, because I use it every day and appreciate its usefulness, but maybe a better name would be the jp_ API.
    up
    0
    Aardvark
    7 years ago
    Since not all hosted services currently support the multi-byte function set, it may still be necessary to process Unicode strings using standard single byte functions.  The function at the following link - http://www.kanolife.com/escape/2006/03/php-unicode-processing.html - shows by example how to do this.  While this only covers UTF-8, the standard PHP function "iconv" allows conversion into and out of UTF-8 if strings need to be input or output in other encodings.
    up
    0
    peter kehl
    7 years ago
    UTF-16LE solution for CSV for Excel by Eugene Murai works well:
    $unicode_str_for_Excel = chr(255).chr(254).mb_convert_encoding( $utf8_str, 'UTF-16LE', 'UTF-8');

    However, Excel on Mac OS X then doesn't identify columns properly and puts each whole row in its own cell. To fix that, use the TAB "\t" character as the CSV delimiter rather than a comma or colon.

    You may also want to use HTTP encoding header, such as
    header( "Content-type: application/vnd.ms-excel; charset=UTF-16LE" );
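Putting the pieces of this note together, here is a rough sketch: rows of UTF-8 strings are joined with tabs, converted to UTF-16LE, and prefixed with the FF FE byte order mark. The helper name example_excel_tsv() and the sample rows are made up for this illustration, and the sketch does not escape tabs or line breaks inside cell values.

```php
<?php
/**
 * Hypothetical helper: build tab-separated UTF-16LE text with a BOM
 * from rows of UTF-8 strings.
 */
function example_excel_tsv(array $rows): string
{
    $lines = array();
    foreach ($rows as $row) {
        $lines[] = implode("\t", $row); // TAB as the column delimiter
    }
    $utf8 = implode("\r\n", $lines) . "\r\n";

    // BOM (FF FE) followed by the little-endian UTF-16 text.
    return "\xFF\xFE" . mb_convert_encoding($utf8, 'UTF-16LE', 'UTF-8');
}

$out = example_excel_tsv(array(
    array('id', 'name'),
    array('1', "Caf\xC3\xA9"), // "Café" as UTF-8 bytes
));

echo strlen($out), " bytes\n";
```

The result can then be sent after the header() call shown above, or written to a file.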
    up
    0
    Anonymous
    7 years ago
    Get the octet size of a string when mbstring.func_overload is set to 2:

    <?php
    function str_sizeof($string) {
        return count(preg_split("`.`", $string)) - 1 ;
    }
    ?>

    In answer to peter albertsson: once you have the octet size of your data, you can access each octet with something like
    $string[0] ... $string[$size-1], since the [ operator is not multibyte-aware and indexes bytes.
    up
    0
    peter dot albertsson at spray dot se
    8 years ago
    Setting mbstring.func_overload = 2 may break your applications that deal with binary data.

    After having set mbstring.func_overload = 2 and  mbstring.internal_encoding = UTF-8 I can't even read a binary file and print/echo it to output without corrupting it.
    up
    0
    nzkiwi at NOSPAMmte dot biglobe dot ne dot jp
    8 years ago
    A friend has pointed out that the entry
    "mbstring.http_input PHP_INI_ALL" in Table 1 on the mbstring page appears to be wrong: above Example 4 it says that "There is no way to control HTTP input character conversion from PHP script. To disable HTTP input character conversion, it has to be done in php.ini".
    Also the table shows the old-PHP-version defaults:
    ;; Disable HTTP Input conversion
    mbstring.http_input = pass  *BUT* (for PHP 4.3.0 or higher)
    ;; Disable HTTP Input conversion
    mbstring.encoding_translation = Off
    up
    0
    Eugene Murai
    8 years ago
    PHP can input and output Unicode, but in a slightly different way from what Microsoft means: when Microsoft says "Unicode", it implicitly means little-endian UTF-16 with a BOM (FF FE = chr(255).chr(254)), whereas PHP's "UTF-16" means big-endian with BOM. For this reason, PHP does not seem to be able to output a Unicode CSV file for Microsoft Excel. Solving this problem is quite simple: just put a BOM in front of a UTF-16LE string.

    Example:

    $unicode_str_for_Excel = chr(255).chr(254).mb_convert_encoding( $utf8_str, 'UTF-16LE', 'UTF-8');
    up
    -1
    Lee Byron
    1 year ago
    Looks like mb_str_replace is the most requested missing function from the multibyte string library.

    I wanted a version of mb_str_replace with as similar a code signature and behavior to str_replace as possible while conforming to the code signature patterns of the mb library and avoiding performance pitfalls like unnecessary concatenations and regular expressions.

    <?php
    /**
     * Multibyte safe version of str_replace.
     * See http://php.net/manual/en/function.str-replace.php
     */
    function mb_str_replace(
      $search,
      $replace,
      $subject,
      string $encoding = null,
      int &$count = null) {

      if (is_array($subject)) {
        $result = array();
        foreach ($subject as $item) {
          $result[] = mb_str_replace($search, $replace, $item, $encoding, $count);
        }
        return $result;
      }

      if (!is_array($search)) {
        return _mb_str_replace($search, $replace, $subject, $encoding, $count);
      }

      $replace_is_array = is_array($replace);
      foreach ($search as $key => $value) {
        $subject = _mb_str_replace(
          $value,
          $replace_is_array ? $replace[$key] : $replace,
          $subject,
          $encoding,
          $count
        );
      }
      return $subject;
    }

    /**
     * Implementation of mb_str_replace. Do not call directly. Enforces string parameters.
     */
    function _mb_str_replace(
      string $search,
      string $replace,
      string $subject,
      string $encoding = null,
      int &$count = null) {

      $search_length = mb_strlen($search, $encoding);
      $subject_length = mb_strlen($subject, $encoding);
      $offset = 0;
      $result = '';

      while ($offset < $subject_length) {
        $match = mb_strpos($subject, $search, $offset, $encoding);
        if ($match === false) {
          if ($offset === 0) {
            // No match was ever found, just return the subject.
            return $subject;
          }
          // Append the final portion of the subject to the replaced.
          $result .=
            mb_substr($subject, $offset, $subject_length - $offset, $encoding);
          break;
        }
        if ($count !== null) {
          $count++;
        }
        $result .= mb_substr($subject, $offset, $match - $offset, $encoding);
        $result .= $replace;
        $offset = $match + $search_length;
      }

      return $result;
    }
    ?>
    up
    -1
    Smelly
    6 years ago
    Below is some code to output a UTF-8 encoded CSV in a way understandable by Excel. It requires iconv instead of mbstring.

    header("Content-type: application/octet-stream");
    header("Content-Transfer-Encoding: binary");
    header("Content-Disposition: attachment; filename=report.xls");
       
    // assume $tmpString contains UTF-8 encoded CSV:
    $tmpString =  iconv ( 'UTF-8', 'UTF-16LE//IGNORE', $tmpString );

    print chr(255).chr(254).$tmpString;
    up
    -1
    motin at demomusic dot nu
    6 years ago
    Follow up on last note from 2007-jan-20: http://se2.php.net/manual/en/function.mb-strlen.php#72979

    That note describes the correct way of simulating single-byte strlen, as well as some pitfalls to watch out for when developing in an environment with mbstring function overloading enabled.
    up
    -1
    Geoffrey
    8 years ago
    For Windows users, php_mbstring can be added as follows:

    If you have downloaded the "short" version of PHP
    (php-4.3.10-installer.exe), download the full version
    (php-4.3.10-Win32.zip).

    Unzip it, find php_mbstring.dll in
    f:\php-4.3.10-Win32\extensions, and copy it across to your
    php\extensions directory.

    Use Notepad to open your PHP.INI.

    Change the extension_dir line to read
    extension_dir = "e:\php\extensions"  (or whatever your
    directory is called).

    Remove the semicolon on the line
     ; extension=php_mbstring.dll

    Save PHP.INI and restart PHP.
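After restarting, a short script can confirm whether the extension was actually picked up. This is just a sketch using the standard extension_loaded() check; the messages are illustrative.

```php
<?php
// Reports whether the mbstring extension is available to this PHP process.
$loaded = extension_loaded('mbstring');

echo $loaded
    ? "mbstring is loaded\n"
    : "mbstring is NOT loaded; check extension_dir and the extension= line in PHP.INI\n";
```

Run it through the same PHP setup (web server or command line) that you edited, since each may read a different PHP.INI.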
  • Original article: https://www.cnblogs.com/rockchip/p/3202719.html