FuzzyWuzzy: A Simple, Easy-to-Use Fuzzy String Matching Library

Contents

Introduction to FuzzyWuzzy

Installation

Usage

Known Ports

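Introduction to FuzzyWuzzy

FuzzyWuzzy is a simple, easy-to-use toolkit for fuzzy string matching. It uses Levenshtein distance to measure the difference between two sequences.

Levenshtein distance, also known as edit distance, is the minimum number of edit operations needed to turn one string into another. The permitted operations are substituting one character for another, inserting a character, and deleting a character. In general, the smaller the edit distance, the more similar the two strings are.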
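To make the definition concrete, here is a minimal pure-Python sketch of the classic dynamic-programming computation of edit distance. The function name `levenshtein` is chosen for illustration; this is not FuzzyWuzzy's own code, only a small instance of the algorithm described above.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits that turn a into b."""
    # previous[j] is the distance between the already-processed prefix of a
    # and the first j characters of b. Initially the prefix of a is empty.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))                  # 3
print(levenshtein("this is a test", "this is a test!"))  # 1
```

Only two rows of the table are kept at a time, so memory use grows with the length of `b` rather than with the product of both lengths.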

Project page: https://github.com/seatgeek/fuzzywuzzy

Requirements
- Python 2.7 or higher
- difflib
- python-Levenshtein (optional; provides a 4-10x speedup in string matching, though it may produce different results in certain cases)

For testing
- pycodestyle
- hypothesis
- pytest

Installation

Using pip via PyPI
```
    pip install fuzzywuzzy
```
Or, to install python-Levenshtein as well:
```
    pip install fuzzywuzzy[speedup]
```
Using pip via GitHub
```
    pip install git+https://github.com/seatgeek/fuzzywuzzy.git@0.17.0#egg=fuzzywuzzy
```
Or add the following line to your requirements.txt file (then run pip install -r requirements.txt):
```
    git+ssh://git@github.com/seatgeek/fuzzywuzzy.git@0.17.0#egg=fuzzywuzzy
```
Manual installation via Git
```
    git clone https://github.com/seatgeek/fuzzywuzzy.git fuzzywuzzy
    cd fuzzywuzzy
    python setup.py install
```
Usage

Full Match (Simple Ratio)

fuzz.ratio() compares the two strings in full and is sensitive to character position:
```
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

print(fuzz.ratio("this is a test", "this is a test!"))
```
Output:
```
C:\Pychamanaconda\lib\site-packages\fuzzywuzzy\fuzz.py:11: UserWarning: Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning
  warnings.warn('Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning')
97
```
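1. The message is a warning, not an error: the score (97) is still computed, but FuzzyWuzzy recommends installing the python-Levenshtein library.
2. When I tried to install python-Levenshtein, I got an error: error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools"

3. The message asks for Microsoft Visual C++ Build Tools. The first option is to install them, but installing a compiler toolchain just to get one library seemed excessive. The second option is to go to https://www.lfd.uci.edu/~gohlke/pythonlibs/ and download the prebuilt python-Levenshtein wheel that matches your environment. In the file name, the number after cp is the Python version and the number after amd is the machine's bitness.

4. Install the downloaded wheel file with pip.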
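The warning in the output above names the fallback that is used when python-Levenshtein is missing: Python's pure-Python SequenceMatcher from the standard difflib module. As a rough sketch of how a 0-100 score can be derived from it, the snippet below scales SequenceMatcher's ratio and rounds it. The helper name `simple_ratio` is hypothetical, and this approximates the idea behind fuzz.ratio() rather than reproducing the library's exact code.

```python
from difflib import SequenceMatcher


def simple_ratio(s1: str, s2: str) -> int:
    """Similarity on a 0-100 scale: 2 * matched_chars / total_chars, rounded."""
    matcher = SequenceMatcher(None, s1, s2)
    return int(round(100 * matcher.ratio()))


# 14 matched characters out of 14 + 15 in total: 2 * 14 / 29 = 0.9655...
print(simple_ratio("this is a test", "this is a test!"))  # 97
```

This needs nothing beyond the standard library, which makes it handy for checking a score by hand.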

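Partial Match (Partial Ratio)

fuzz.partial_ratio() scores the best-matching substring: the shorter string is compared against parts of the longer one, so the result still depends on character order within the match: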
```
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

print(fuzz.partial_ratio("this is a test", "this is a test!"))
```
Output:
```
100
```
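One way to picture partial matching is as a sliding window: take the shorter string, compare it with every equally long slice of the longer string, and keep the best score. The sketch below does exactly that with difflib. It is a simplified illustration under that assumption, not the library's implementation (which picks candidate slices more selectively), and the names `simple_ratio` and `partial_ratio_sketch` are hypothetical.

```python
from difflib import SequenceMatcher


def simple_ratio(s1: str, s2: str) -> int:
    """Whole-string similarity on a 0-100 scale."""
    return int(round(100 * SequenceMatcher(None, s1, s2).ratio()))


def partial_ratio_sketch(s1: str, s2: str) -> int:
    """Best score of the shorter string against any same-length slice of the longer."""
    shorter, longer = (s1, s2) if len(s1) <= len(s2) else (s2, s1)
    if not shorter:
        return 0  # nothing to match against
    window = len(shorter)
    best = 0
    for start in range(len(longer) - window + 1):
        best = max(best, simple_ratio(shorter, longer[start:start + window]))
    return best


# The slice "this is a test" of the longer string matches exactly.
print(partial_ratio_sketch("this is a test", "this is a test!"))  # 100
```

Checking every window is slow for long inputs, but it keeps the idea easy to follow.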
Order-Insensitive Match (Token Sort Ratio)
```
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

print(fuzz.ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear"))
print(fuzz.token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear"))
```
Output:
```
91
100
```
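fuzz._process_and_sort(s, force_ascii, full_process=True)
Sorts the tokens of the string s. force_ascii is True or False; True converts the string to ASCII. If full_process is True, s is lowercased and characters other than letters and digits are removed (in my tests the - character was not removed); the remaining text is split on whitespace and the tokens are sorted. If full_process is False, the tokens of s are sorted without that preprocessing.
fuzz._token_sort(s1, s2, partial=True, force_ascii=True, full_process=True)
Returns the similarity of the strings s1 and s2. Both are first passed through fuzz._process_and_sort(). When partial is True, the results are then compared with fuzz.partial_ratio(); when partial is False, they are compared with fuzz.ratio().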

So, for the call
fuzz._token_sort(s1, s2, partial=True, force_ascii=True, full_process=True)
when partial is True it is equivalent to:
fuzz.partial_token_sort_ratio(s1, s2, force_ascii=True, full_process=True)
and when partial is False it is equivalent to:
fuzz.token_sort_ratio(s1, s2, force_ascii=True, full_process=True)
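The preprocess-sort-compare pipeline described above can be sketched with the standard library alone. The helper names below are hypothetical. The cleaning step is also an assumption: it treats every character outside ASCII letters and digits as a separator, which is simpler than the library's own preprocessing and may differ from it for characters such as - or non-ASCII letters.

```python
import re
from difflib import SequenceMatcher


def simple_ratio(s1: str, s2: str) -> int:
    """Whole-string similarity on a 0-100 scale."""
    return int(round(100 * SequenceMatcher(None, s1, s2).ratio()))


def process_and_sort(s: str) -> str:
    """Lowercase, turn non-alphanumeric runs into spaces, then sort the tokens."""
    cleaned = re.sub(r"[^0-9a-z]+", " ", s.lower())
    return " ".join(sorted(cleaned.split()))


def token_sort_ratio_sketch(s1: str, s2: str) -> int:
    """Compare the two strings after sorting their tokens alphabetically."""
    return simple_ratio(process_and_sort(s1), process_and_sort(s2))


print(process_and_sort("Wuzzy, FUZZY was a bear!"))  # a bear fuzzy was wuzzy
# Both inputs sort to the same token sequence, so word order no longer matters.
print(token_sort_ratio_sketch("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear"))  # 100
```

Swapping `simple_ratio` for a partial scorer in the last function would give the partial variant of the same idea.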
Duplicate-Insensitive Match (Token Set Ratio)
```
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

print(fuzz.token_sort_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear"))
print(fuzz.token_set_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear"))
```
Output:
```
84
100
```
So, for the call
fuzz._token_set(s1, s2, partial=True, force_ascii=True, full_process=True)
when partial is False it is equivalent to fuzz.token_set_ratio():
fuzz.token_set_ratio(s1, s2, force_ascii=True, full_process=True)
and when partial is True it is equivalent to fuzz.partial_token_set_ratio():
fuzz.partial_token_set_ratio(s1, s2, force_ascii=True, full_process=True)
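Why the duplicated "fuzzy" stops mattering becomes clearer with a small sketch of the set-based comparison, as I understand the approach: split each string into a set of tokens, separate the shared tokens from the ones unique to each side, and take the best of three pairwise comparisons. The function names are hypothetical, the preprocessing is reduced to lowercasing and whitespace splitting, and the code should be read as an approximation rather than the library's source.

```python
from difflib import SequenceMatcher


def simple_ratio(s1: str, s2: str) -> int:
    """Whole-string similarity on a 0-100 scale."""
    return int(round(100 * SequenceMatcher(None, s1, s2).ratio()))


def token_set_ratio_sketch(s1: str, s2: str) -> int:
    """Compare token sets, so repeated tokens and word order are ignored."""
    tokens1 = set(s1.lower().split())
    tokens2 = set(s2.lower().split())
    common = " ".join(sorted(tokens1 & tokens2))  # tokens in both strings
    only1 = " ".join(sorted(tokens1 - tokens2))   # tokens only in s1
    only2 = " ".join(sorted(tokens2 - tokens1))   # tokens only in s2
    combined1 = (common + " " + only1).strip()
    combined2 = (common + " " + only2).strip()
    return max(
        simple_ratio(common, combined1),
        simple_ratio(common, combined2),
        simple_ratio(combined1, combined2),
    )


# Both strings reduce to the token set {a, bear, fuzzy, was}.
print(token_set_ratio_sketch("fuzzy was a bear", "fuzzy fuzzy was a bear"))  # 100
```

Because the shared tokens are compared against each combined string, one string whose tokens are a subset of the other's also scores 100.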
Process

The process module returns the best fuzzy matches from a list of choices, together with their similarity scores.

```
>>> choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
>>> process.extract("new york jets", choices, limit=2)
[('New York Jets', 100), ('New York Giants', 78)]
>>> process.extractOne("cowboys", choices)
("Dallas Cowboys", 90)
```

You can pass additional arguments to extractOne, such as scorer, to choose a specific matching mode. A typical use is matching file paths.
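To illustrate what a pluggable scorer means, here is a small standard-library sketch of an extractOne-style helper that accepts the scoring function as a parameter. Everything in it (`extract_one`, `simple_ratio`, `partial_ratio_sketch`) is a hypothetical stand-in. FuzzyWuzzy's own extractOne uses a different default scorer, which is why the library reports 90 for "cowboys" while the scorers below give other values.

```python
from difflib import SequenceMatcher


def simple_ratio(s1: str, s2: str) -> int:
    """Case-insensitive whole-string similarity on a 0-100 scale."""
    return int(round(100 * SequenceMatcher(None, s1.lower(), s2.lower()).ratio()))


def partial_ratio_sketch(s1: str, s2: str) -> int:
    """Best score of the shorter string against same-length slices of the longer."""
    shorter, longer = (s1, s2) if len(s1) <= len(s2) else (s2, s1)
    if not shorter:
        return 0
    window = len(shorter)
    return max(
        simple_ratio(shorter, longer[start:start + window])
        for start in range(len(longer) - window + 1)
    )


def extract_one(query, choices, scorer=simple_ratio):
    """Return (best_choice, score) for a non-empty list of choices."""
    scored = ((choice, scorer(query, choice)) for choice in choices)
    return max(scored, key=lambda pair: pair[1])


choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
print(extract_one("cowboys", choices))
print(extract_one("cowboys", choices, scorer=partial_ratio_sketch))  # ('Dallas Cowboys', 100)
```

With the real library the equivalent call would pass one of the fuzz functions, for example scorer=fuzz.token_sort_ratio, to process.extractOne.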
Known Ports

FuzzyWuzzy has been ported to other languages. The ports we know of are:

Java: xpresso's fuzzywuzzy implementation

Java: fuzzywuzzy (java port)

Rust: fuzzyrusty (Rust port)

JavaScript: fuzzball.js (JavaScript port)

C++: Tmplt/fuzzywuzzy

C#: fuzzysharp (.Net port)

Go: go-fuzzywuzzy (Go port)
References

https://www.jianshu.com/p/ed22a82b45d1

https://blog.csdn.net/sunyao_123/article/details/76942809
Original article: https://www.cnblogs.com/-wenli/p/11079173.html
