• String similarity in R: the stringdist package


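    String similarity can be computed with the adist function in the utils package, the stringDist function in the MKmisc package, or distance functions such as jarowinkler in the RecordLinkage package. This article covers the stringdist and stringdistmatrix functions in the stringdist package.
    The author of the stringdist package is Mark van der Loo.
    stringdist computes the distance between the strings in a and b element by element (the smaller the distance, the more similar the strings); when one object has fewer elements than the other, the shorter one is recycled. stringdistmatrix returns a distance matrix whose rows correspond to the elements of a and whose columns correspond to the elements of b.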

    stringdist(a, b, method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex"), useBytes = FALSE, weight = c(d = 1, i = 1, s = 1, t = 1), maxDist = Inf, q = 1, p = 0, nthread = getOption("sd_num_thread"))

    stringdistmatrix(a, b, method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex"), useBytes = FALSE, weight = c(d = 1, i = 1, s = 1, t = 1), maxDist = Inf, q = 1, p = 0, useNames = c("none", "strings", "names"), ncores = 1, cluster = NULL, nthread = getOption("sd_num_thread"))
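    Parameters:
    a, b: the target objects, of character type
    method: the distance method; the default is "osa", and other values include "jaccard", "hamming" and "jw" (Jaro-Winkler)
    useBytes: compare byte by byte rather than character by character
    weight: penalties for deletion, insertion, substitution and transposition, in that order; each weight must be positive and must not exceed 1
    maxDist: upper limit on the distance
    q: the q-gram size, used only with method = "qgram", "jaccard" or "cosine"; must be nonnegative
    p: penalty factor for the Jaro-Winkler distance; the default is 0, and valid values lie between 0 and 0.25
    nthread: maximum number of threads
    useNames: label the rows and columns of the output with the input strings ("strings") or with the names of the input vectors ("names")
    ncores: number of cores
    cluster: a custom cluster object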
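
    As a small illustration of the method and useNames parameters, the sketch below builds a labelled distance matrix. It assumes the stringdist package is installed and that the installed version accepts useNames = "strings", as in the signature shown above; the variable names are only for illustration.

    ```r
    library(stringdist)

    a <- c("foo", "bar", "boo")
    b <- c("baz", "buz")

    # Levenshtein distances, with rows and columns labelled by the input strings
    m <- stringdistmatrix(a, b, method = "lv", useNames = "strings")
    print(m)

    # Rows follow a and columns follow b, so this is the distance from "bar" to "baz"
    m["bar", "baz"]
    ```

    Labelling the matrix this way makes it possible to look up a pair by the strings themselves rather than by position.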

    Examples:

    > stringdistmatrix(c("foo","bar","boo"),c("baz","buz"))
    [,1] [,2]
    [1,] 3 3
    [2,] 1 2
    [3,] 2 2

    > # string distance matching is case sensitive:
    > stringdist("ABC","abc")
    [1] 3
    >
    > # so you may want to normalize a bit:
    > stringdist(tolower("ABC"),"abc")
    [1] 0
    >
    > # stringdist recycles the shortest argument:
    > stringdist(c('a','b','c'),c('a','c'))
    Warning message: longer object length is not a multiple of shorter object length
    [1] 0 1 1
    >
    > # different edit operations may be weighted; e.g. weighted transposition (the 4th weight):
    > stringdist('ab','ba',weight=c(1,1,1,0.5))
    [1] 0.5
    >
    > # Non-unit weights for insertion and deletion make the distance metric asymmetric
    > stringdist('ca','abc')
    [1] 3
    > stringdist('abc','ca')
    [1] 3
    > stringdist('ca','abc',weight=c(0.5,1,1,1))
    [1] 2
    > stringdist('abc','ca',weight=c(0.5,1,1,1))
    [1] 2.5

    > # q-grams are based on the difference between occurrences of q consecutive characters
    > # in string a and string b.
    > # Since each of the characters a, b and c occurs once in both 'abc' and 'cba', the q=1 distance equals 0:
    > stringdist('abc','cba',method='qgram',q=1)
    [1] 0
    >
    > # since the first string consists of 'ab','bc' and the second
    > # of 'cb' and 'ba', the q=2 distance equals 4 (they have no q=2 grams in common):
    > stringdist('abc','cba',method='qgram',q=2)
    [1] 4

    > stringdist('MARTHA','MATHRA',method='jw')
    [1] 0.08333333
    > # Note that stringdist gives a _distance_ where wikipedia gives the corresponding
    > # _similarity measure_. To get the wikipedia result:
    > 1 - stringdist('MARTHA','MATHRA',method='jw')
    [1] 0.9166667
    >
    > # The corresponding Jaro-Winkler distance can be computed by setting p=0.1
    > stringdist('MARTHA','MATHRA',method='jw',p=0.1)
    [1] 0.06666667
    > # or, as a similarity measure
    > 1 - stringdist('MARTHA','MATHRA',method='jw',p=0.1)
    [1] 0.9333333
    >
    > # This gives distance 1 since Euler and Gauss translate to different soundex codes.
    > stringdist('Euler','Gauss',method='soundex')
    [1] 1
    > # Euler and Ellery translate to the same code and have distance 0
    > stringdist('Euler','Ellery',method='soundex')
    [1] 0
    >
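
    To make the q-gram examples above concrete, here is a minimal base R sketch of the q-gram distance: count the q-grams of each string and sum the absolute differences of the counts. The helper names qgrams and qgram_dist are hypothetical and are not part of stringdist. The sketch treats a string shorter than q as having no q-grams, which may differ from what the package returns in that case.

    ```r
    # Split a string into its q-grams (runs of q consecutive characters).
    qgrams <- function(s, q) {
      n <- nchar(s)
      if (n < q) return(character(0))  # assumption: too-short strings have no q-grams
      starts <- seq_len(n - q + 1)
      substring(s, starts, starts + q - 1)
    }

    # q-gram distance: sum of absolute differences between the two count profiles.
    qgram_dist <- function(a, b, q = 1) {
      ga <- qgrams(a, q)
      gb <- qgrams(b, q)
      grams <- union(ga, gb)
      # Count how often each q-gram of the union occurs in each string
      ca <- tabulate(match(ga, grams), nbins = length(grams))
      cb <- tabulate(match(gb, grams), nbins = length(grams))
      sum(abs(ca - cb))
    }

    qgram_dist("abc", "cba", q = 1)  # 0: both strings contain a, b and c once
    qgram_dist("abc", "cba", q = 2)  # 4: 'ab','bc' versus 'cb','ba' share nothing
    ```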

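    The adist function computes the Levenshtein edit distance. This can be converted into a similarity measure, for example 1 - (Levenshtein edit distance / length of the longer string).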

    The levenshteinSim function in the RecordLinkage package also does this directly, and might be faster than adist.

    library(RecordLinkage)
    > levenshteinSim("apple", "apple")
    [1] 1
    > levenshteinSim("apple", "aaple")
    [1] 0.8
    > levenshteinSim("apple", "appled")
    [1] 0.8333333
    > levenshteinSim("appl", "apple")
    [1] 0.8
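
    The same conversion can be sketched with base R only, using adist from the utils package. The helper name lev_sim is hypothetical and is not part of any package. Note that two empty strings give 0/0, so the result is NaN in that case.

    ```r
    # Hypothetical helper: Levenshtein similarity built on utils::adist.
    # Works element by element; mapply recycles the shorter argument.
    lev_sim <- function(a, b) {
      # adist returns a matrix, so take the single cell for each pair
      d <- mapply(function(x, y) adist(x, y)[1, 1], a, b, USE.NAMES = FALSE)
      1 - d / pmax(nchar(a), nchar(b))
    }

    lev_sim("apple", "apple")
    lev_sim("apple", "aaple")
    lev_sim("apple", "appled")
    lev_sim("appl", "apple")
    ```

    These calls mirror the levenshteinSim examples above and avoid the extra package dependency.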
    

    ETA: Interestingly, while levenshteinDist in the RecordLinkage package appears to be slightly faster than adist, levenshteinSim is considerably slower than either. Using the rbenchmark package:

    > benchmark(levenshteinDist("applesauce", "aaplesauce"), replications=100000)
                                             test replications elapsed relative
    1 levenshteinDist("applesauce", "aaplesauce")       100000   4.012        1
      user.self sys.self user.child sys.child
    1     3.583    0.452          0         0
    > benchmark(adist("applesauce", "aaplesauce"), replications=100000)
                                   test replications elapsed relative user.self
    1 adist("applesauce", "aaplesauce")       100000   4.277        1     3.707
      sys.self user.child sys.child
    1    0.461          0         0
    > benchmark(levenshteinSim("applesauce", "aaplesauce"), replications=100000)
                                            test replications elapsed relative
    1 levenshteinSim("applesauce", "aaplesauce")       100000   7.206        1
      user.self sys.self user.child sys.child
    1      6.49    0.743          0         0
    

    This overhead comes entirely from the code of levenshteinSim, which is just a wrapper around levenshteinDist:

    > levenshteinSim
    function (str1, str2) 
    {
        return(1 - (levenshteinDist(str1, str2)/pmax(nchar(str1), 
            nchar(str2))))
    }
    

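    FYI: if you always compare two single strings rather than vectors, you can create a new version that uses max instead of pmax, which cuts the run time by roughly a fifth (5.608 s in the benchmark below versus 7.206 s for levenshteinSim above):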

    mylevsim = function (str1, str2) 
    {
        return(1 - (levenshteinDist(str1, str2)/max(nchar(str1), 
            nchar(str2))))
    }
    > benchmark(mylevsim("applesauce", "aaplesauce"), replications=100000)
                                      test replications elapsed relative user.self
    1 mylevsim("applesauce", "aaplesauce")       100000   5.608        1     4.987
      sys.self user.child sys.child
    1    0.627          0         0
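
    One caveat on the max-based variant: it is only correct for single strings. The base R sketch below shows why, using adist for the distances so that it runs without RecordLinkage; the example strings and variable names are illustrative only. With vectors, pmax picks the longer length for each pair, whereas max collapses everything to one global maximum.

    ```r
    a <- c("apple", "kitten")
    b <- c("aaple", "sitting")

    # Element-wise Levenshtein distances: 1 for the first pair, 3 for the second
    d <- mapply(function(x, y) adist(x, y)[1, 1], a, b, USE.NAMES = FALSE)

    # pmax: longer length per pair (5 and 7)
    sim_pmax <- 1 - d / pmax(nchar(a), nchar(b))

    # max: one global maximum (7) applied to every pair
    sim_max <- 1 - d / max(nchar(a), nchar(b))

    sim_pmax  # first pair is 1 - 1/5
    sim_max   # first pair becomes 1 - 1/7, which overstates the similarity
    ```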
    

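    Long story short, there is little difference in performance between adist and levenshteinDist, although the former is preferable if you do not want to add a package dependency. How you convert the distance into a similarity measure does have some effect on performance.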

  • Original article: https://www.cnblogs.com/purple5252/p/15824715.html