AC自动机是多模式匹配的一个经典数据结构,原理是和KMP一样的构造fail指针,不过AC自动机是在Trie树上构造的,但原理是一样的。官方github:
https://github.com/WojciechMula/pyahocorasick/
类似的匹配工具,还有:
亲测,好像pyahocorasick更快~
1 安装
This module is written in C. You need a C compiler installed to compile native CPython extensions. To install:
pip install pyahocorasick
Then create an Automaton.
当然笔者window机器会报错:
error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": http://landinghub.visualstudio.com/visual-cpp-build-tools
如果用conda
会报没有在列表中的错误,所以可以(参考:https://github.com/conda-forge/pyahocorasick-feedstock/):
conda config --add channels conda-forge
conda install pyahocorasick
所有类似的,当时尝试成功的库有ahocorapy
:
pip install ahocorapy
2 pyahocorasick 使用
官方案例:
>>> import ahocorasick
>>> A = ahocorasick.Automaton()
# 载入内容
>>> for idx, key in enumerate('he her hers she'.split()):
... A.add_word(key, (idx, key))
# 判定关键词是否在A之中
>>> 'he' in A
True
>>> 'HER' in A
False
# 类似dict的方式去查找:
>>> A.get('he')
(0, 'he')
>>> A.get('she')
(3, 'she')
>>> A.get('cat', 'not exists')
'not exists'
>>> A.get('dog')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
KeyError
转化为Aho-Corasick automaton:
A.make_automaton()
相关键值删减:
# 删除
A.pop('he')
>>> (0, 'he')
内容的保存与加载:
>>> import cPickle
>>> pickled = cPickle.dumps(A)
>>> B = cPickle.loads(pickled)
>>>A.get('cat', 'not exists') # 如果不存在,则不会直接拨错,会报:not exists
>>> B.get('he')
(0, 'he')
句子查询:
可以通过.iter
来定位到关键词所在的位置在哪,
list(A.iter('he is girl', ignore_white_space=False))
>>> [(1, (0, 'he'))]
3 pyahocorasick 案例
部分来源于:huanyong的QASystemOnMedicalKG项目
sentences = ['降压药的选择不知道波依定和洛活喜两种药各有什么特点,适合哪种人服 ',
'很痒怎么回事?是什么原因呢?长了一些像疹子一样的东西',
'请问吃紧急避孕药月经推迟多久才是正常的?我因为吃了那药,月经已经? ',
'怎样快点长高我160想长高5厘米怎么可以长高 ',
'用了祛痘洗面奶反而脸上长痘痘了怎么回事啊? ']
region_wds = ['降压药','避孕药','洗面奶','月经']
def build_actree( wordlist):
actree = ahocorasick.Automaton() # 初始化trie树
for index, word in enumerate(wordlist):
actree.add_word(word, (index, word)) # 向trie树中添加单词
actree.make_automaton() # 将trie树转化为Aho-Corasick自动机
return actree
actree = build_actree( keyword)
然后就是进行查找:
def ac_detect(text):
region_wds = []
for w1 in actree.iter(text):
if len(w1) > 0:
region_wds.append(w1[1][1])
return region_wds
text = sentences[0]
for text in sentences:
print('\n句子为:',text)
print('匹配到关键词:',ac_detect(text))
最后的结果为:
句子为: 降压药的选择不知道波依定和洛活喜两种药各有什么特点,适合哪种人服
匹配到关键词: ['降压药']
句子为: 下体很痒怎么回事?是什么原因呢?长了一些像疹子一样的东西,正好我
匹配到关键词: []
句子为: 请问吃紧急避孕药月经推迟多久才是正常的?我因为吃了那药,月经已经?
匹配到关键词: ['避孕药', '月经', '月经']
句子为: 怎样快点长高我160想长高5厘米怎么可以长高
匹配到关键词: []
句子为: 用了祛痘洗面奶反而脸上长痘痘了怎么回事啊?
匹配到关键词: ['洗面奶']
4 ahocorapy 的使用
github:https://github.com/abusix/ahocorapy
Creation of the Search Tree
from ahocorapy.keywordtree import KeywordTree
kwtree = KeywordTree(case_insensitive=True)
kwtree.add('malaga')
kwtree.add('lacrosse')
kwtree.add('mallorca')
kwtree.add('mallorca bella')
kwtree.add('orca')
kwtree.finalize()
Searching
result = kwtree.search('My favorite islands are malaga and sylt.')
print(result)
The search_all method returns a generator for all keywords found, or None if there is none.
results = kwtree.search_all('malheur on mallorca bellacrosse')
for result in results:
print(result)
还可以看逻辑图:
from ahocorapy_visualizer.visualizer import Visualizer
visualizer = Visualizer()
visualizer.draw('readme_example.png', kwtree)
当然可能报错:
ModuleNotFoundError: No module named 'pygraphviz'
这个时候又会回到之前pyahocorasick遇到的问题。
5 关键词匹配与flashtext进行对比
import ahocorasick
def build_actree(wordlist):
'''
AC自动机进行关键词匹配
构造AC trie
'''
actree = ahocorasick.Automaton() # 初始化trie树
for index, word in enumerate(wordlist):
actree.add_word(word, (index, word)) # 向trie树中添加单词
actree.make_automaton() # 将trie树转化为Aho-Corasick自动机
#self.actree = actree
return actree
def ac_detect(actree,text):
'''
AC自动机进行关键词匹配
文本匹配
'''
region_wds = []
for w1 in actree.iter(text):
if len(w1) > 0:
region_wds.append(w1[1][1])
return region_wds
wordlist = ['健康','减肥']
text = '今天你减肥了吗,今天你健康了吗,减肥 = 健康!'
actree = build_actree(wordlist)
%time ac_detect(actree,text)
>>> CPU times: user 10 µs, sys: 3 µs, total: 13 µs
>>> Wall time: 17.4 µs
>>> ['减肥', '健康', '减肥', '健康']
与flashtext进行对比:
from flashtext import KeywordProcessor
def build_actree(wordlist):
'''
AC自动机进行关键词匹配
构造AC trie
'''
actree = KeywordProcessor()
for index, word in enumerate(wordlist):
actree.add_keyword(word) # 向trie树中添加单词
#self.actree = actree
return actree
def ac_detect(actree,text,span_info = True):
'''
AC自动机进行关键词匹配
文本匹配
'''
region_wds = []
for w1 in actree.extract_keywords(text,span_info = span_info):
if len(w1) > 0:
region_wds.append(w1[0])
return region_wds
wordlist = ['健康','减肥']
text = '今天你减肥了吗,今天你健康了吗,减肥 = 健康!'
actree = build_actree(wordlist)
%time ac_detect(actree,text)
>>> CPU times: user 41 µs, sys: 0 ns, total: 41 µs
>>> Wall time: 47.2 µs
>>> ['减肥', '健康', '减肥', '健康']
转载:https://blog.csdn.net/sinat_26917383/article/details/101701508