Lua: taking a substring by character, and counting characters (not the byte length of the string)

Requirement

Take a substring by the number of visible characters, not bytes

function(string, start position, number of characters)

utf8sub("你好1世界哈哈",2,5)	=	好1世界哈
utf8sub("1你好1世界哈哈",2,5)	=	你好1世界
utf8sub("你好世界1哈哈",1,5)	=	你好世界1
utf8sub("12345678",3,5)		=	34567
utf8sub("øpø你好pix",2,5)	=	pø你好p

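Wrong approaches

I looked at several algorithms online and none of them was quite right: they either produce garbled output, or they hard-code one fixed byte width for Chinese characters (4 bytes in the snippets below), which is not general enough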

string.sub(s,1,截取长度*4)

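Many posts simply use string.sub(s,1,length*4). No fixed multiplier can be correct for a string that mixes Chinese and English, because the characters have different byte widths. In UTF-8, the characters of 你好1世界 occupy 3, 3, 1, 3 and 3 bytes (13 bytes in total). Asking for 4 characters with a multiplier of 4 requests 16 bytes and returns all five characters. With a multiplier of 3 it requests 12 = 3+3+1+3+2 bytes, so only the first 2 bytes of 界 are kept and the output is garbled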
if byte>128 then index = index + 4
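
To make the failure concrete, here is a small sketch of the fixed-multiplier approach. The helper name naive_sub is hypothetical and only for illustration, and the example assumes the source file is saved as UTF-8.

```lua
-- Hypothetical helper that mimics the flawed approach:
-- it assumes every character occupies bytesPerChar bytes.
function naive_sub(s, numChars, bytesPerChar)
    return string.sub(s, 1, numChars * bytesPerChar)
end

local s = "你好1世界"
print(#s)                      -- byte length, not character count

-- A multiplier of 3 cuts the last character in the middle.
local cut3 = naive_sub(s, 4, 3)
print(#cut3, cut3 == "你好1世" .. string.sub("界", 1, 2))

-- A multiplier of 4 asks for more bytes than exist and returns everything.
local cut4 = naive_sub(s, 4, 4)
print(#cut4, cut4 == s)
```

Neither multiplier returns the intended four characters, which is why the byte width has to be read from each character instead of assumed.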

Key points

UTF-8 characters are variable-length
The length of each character follows a rule

UTF-8 byte patterns

The first byte of each UTF-8 character encodes how many bytes that character occupies

0xxxxxxx - 1 byte
110yxxxx - 192, 2 byte
1110yyyy - 224, 3 byte
11110zzz - 240, 4 byte
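
The rule above can be checked against real characters. The sketch below uses a hypothetical helper, lead_byte_length, which classifies a byte with plain comparisons so it should also run on Lua 5.1. It assumes the source file is saved as UTF-8 and that the input is valid UTF-8.

```lua
-- Hypothetical helper: classify one byte by its high bits.
-- Returns 0 for a continuation byte (10xxxxxx), which never starts a character.
function lead_byte_length(b)
    if b < 128 then
        return 1        -- 0xxxxxxx
    elseif b < 192 then
        return 0        -- 10xxxxxx, continuation byte
    elseif b < 224 then
        return 2        -- 110xxxxx
    elseif b < 240 then
        return 3        -- 1110xxxx
    else
        return 4        -- 11110xxx (bytes from 248 up are invalid in UTF-8, not checked here)
    end
end

-- Print the character, its first byte, the length derived from that byte,
-- and the actual byte length reported by the # operator.
for _, ch in ipairs({"1", "ø", "你", "😀"}) do
    local b = string.byte(ch, 1)
    print(ch, b, lead_byte_length(b), #ch)
end
```

For each sample the length derived from the first byte matches the real byte length, including the 4-byte emoji whose first byte is exactly 240.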

Correct algorithms

-- Return the byte length of a UTF-8 character, given its first byte
-- 0xxxxxxx - 1 byte
-- 110yxxxx - 192, 2 byte
-- 1110yyyy - 224, 3 byte
-- 11110zzz - 240, 4 byte
local function chsize(char)
    if not char then
        print("not char")
        return 0
    elseif char >= 240 then
        -- >= rather than >, so that first byte 240 (e.g. emoji) counts as 4 bytes
        return 4
    elseif char >= 224 then
        return 3
    elseif char >= 192 then
        return 2
    else
        return 1
    end
end

-- Count the characters in a UTF-8 string; every character counts as one, whatever its byte width
-- e.g. utf8len("1你好") => 3
function utf8len(str)
    local len = 0
    local currentIndex = 1
    while currentIndex <= #str do
        local char = string.byte(str, currentIndex)
        currentIndex = currentIndex + chsize(char)
        len = len +1
    end
    return len
end
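
As a cross-check for the loop above, the character count can also be obtained by counting every byte that is not a continuation byte (10xxxxxx, i.e. 128 to 191). This is a sketch of a different technique, not the method used in this post. The function name utf8len_by_lead_bytes is hypothetical, and the result is only meaningful for valid UTF-8 input.

```lua
-- Hypothetical alternative: each character has exactly one byte outside
-- the continuation range 128..191, so counting those bytes counts characters.
function utf8len_by_lead_bytes(str)
    -- gsub returns the new string and the number of matches; only the count is used.
    local _, count = string.gsub(str, "[^\128-\191]", "")
    return count
end

print(utf8len_by_lead_bytes("你好1世界哈哈"))
print(utf8len_by_lead_bytes("øpø你好pix"))
```

Because it never has to decide how long a character is, this variant is not affected by off-by-one mistakes in the first-byte thresholds, which makes it a convenient second opinion in tests.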

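-- Take a substring of a UTF-8 string
-- str:            the string to cut
-- startChar:    index of the first character, starting from 1
-- numChars:    number of characters to take
function utf8sub(str, startChar, numChars)
    local startIndex = 1
    -- Skip startChar - 1 characters; stop early if the string ends first
    while startChar > 1 and startIndex <= #str do
        local char = string.byte(str, startIndex)
        startIndex = startIndex + chsize(char)
        startChar = startChar - 1
    end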

    local currentIndex = startIndex

    while numChars > 0 and currentIndex <= #str do
        local char = string.byte(str, currentIndex)
        currentIndex = currentIndex + chsize(char)
        numChars = numChars -1
    end
    return str:sub(startIndex, currentIndex - 1)
end
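
On Lua 5.3 or later, the standard utf8 library can do the byte-offset bookkeeping. The sketch below is an alternative under that assumption, not the approach of this post: utf8sub_offset is a hypothetical name, it relies on utf8.len and utf8.offset, and it will not run on Lua 5.1 or LuaJIT, where the utf8 library is absent.

```lua
-- Hypothetical alternative for Lua 5.3+ using the built-in utf8 library.
-- Same parameters as utf8sub: 1-based start character and a character count.
function utf8sub_offset(str, startChar, numChars)
    local total = utf8.len(str)          -- nil if str is not valid UTF-8
    if not total or startChar < 1 or startChar > total or numChars <= 0 then
        return ""
    end
    -- Byte position where the first wanted character starts.
    local startIndex = utf8.offset(str, startChar)
    -- Clamp the last wanted character to the end of the string.
    local lastChar = math.min(startChar + numChars - 1, total)
    -- The character after lastChar starts one byte past the end of the slice;
    -- for lastChar == total this is #str + 1.
    local endIndex = utf8.offset(str, lastChar + 1) - 1
    return string.sub(str, startIndex, endIndex)
end

print(utf8sub_offset("你好1世界哈哈", 2, 5))
print(utf8sub_offset("øpø你好pix", 2, 5))
```

Unlike the hand-written loop, utf8.len rejects invalid input, so this version returns an empty string for malformed UTF-8 instead of slicing it.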

-- Self-test
function test()
    -- test utf8len
    assert(utf8len("你好1世界哈哈") == 7)
    assert(utf8len("你好世界1哈哈 ") == 8)
    assert(utf8len(" 你好世 界1哈哈") == 9)
    assert(utf8len("12345678") == 8)
    assert(utf8len("øpø你好pix") == 8)

    -- test utf8sub
    assert(utf8sub("你好1世界哈哈",2,5) == "好1世界哈")
    assert(utf8sub("1你好1世界哈哈",2,5) == "你好1世界")
    assert(utf8sub(" 你好1世界 哈哈",2,6) == "你好1世界 ")
    assert(utf8sub("你好世界1哈哈",1,5) == "你好世界1")
    assert(utf8sub("12345678",3,5) == "34567")
    assert(utf8sub("øpø你好pix",2,5) == "pø你好p")

    print("all test succ")
end

test()

https://my.oschina.net/u/930967/blog/758653

Original article: https://www.cnblogs.com/7qin/p/13511309.html