小言_互联网的博客

python | 关键词快速匹配检索小工具 pyahocorasick / ahocorapy

1019人阅读  评论(0)

AC自动机是多模式匹配的一个经典数据结构,原理是和KMP一样的构造fail指针,不过AC自动机是在Trie树上构造的,但原理是一样的。官方github:
https://github.com/WojciechMula/pyahocorasick/

类似的匹配工具,还有:

python︱flashtext高效关键词查找与替换

亲测,好像pyahocorasick更快~



1 安装

This module is written in C. You need a C compiler installed to compile native CPython extensions. To install:

pip install pyahocorasick

Then create an Automaton.
当然笔者window机器会报错:

 error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": http://landinghub.visualstudio.com/visual-cpp-build-tools

如果用conda会报没有在列表中的错误,所以可以(参考:https://github.com/conda-forge/pyahocorasick-feedstock/):

conda config --add channels conda-forge
conda install pyahocorasick

所有类似的,当时尝试成功的库有ahocorapy

pip install ahocorapy

2 pyahocorasick 使用

官方案例:

>>> import ahocorasick
>>> A = ahocorasick.Automaton()

# 载入内容
>>> for idx, key in enumerate('he her hers she'.split()):
...   A.add_word(key, (idx, key))

# 判定关键词是否在A之中
>>> 'he' in A
True
>>> 'HER' in A
False

# 类似dict的方式去查找:
>>> A.get('he')
(0, 'he')
>>> A.get('she')
(3, 'she')
>>> A.get('cat', 'not exists')
'not exists'
>>> A.get('dog')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError

转化为Aho-Corasick automaton:

A.make_automaton()

相关键值删减:

# 删除
A.pop('he')
>>> (0, 'he')

内容的保存与加载:

>>> import cPickle
>>> pickled = cPickle.dumps(A)
>>> B = cPickle.loads(pickled)
>>>A.get('cat', 'not exists')  # 如果不存在,则不会直接拨错,会报:not exists
>>> B.get('he')
(0, 'he')

句子查询:
可以通过.iter来定位到关键词所在的位置在哪,

list(A.iter('he is girl', ignore_white_space=False))
>>> [(1, (0, 'he'))]

3 pyahocorasick 案例

部分来源于:huanyong的QASystemOnMedicalKG项目

sentences = ['降压药的选择不知道波依定和洛活喜两种药各有什么特点,适合哪种人服 ',
 '很痒怎么回事?是什么原因呢?长了一些像疹子一样的东西',
 '请问吃紧急避孕药月经推迟多久才是正常的?我因为吃了那药,月经已经? ',
 '怎样快点长高我160想长高5厘米怎么可以长高 ',
 '用了祛痘洗面奶反而脸上长痘痘了怎么回事啊? ']

region_wds = ['降压药','避孕药','洗面奶','月经']

def build_actree( wordlist):
    actree = ahocorasick.Automaton()         # 初始化trie树
    for index, word in enumerate(wordlist):
        actree.add_word(word, (index, word))     # 向trie树中添加单词
    actree.make_automaton()    # 将trie树转化为Aho-Corasick自动机
    return actree


actree = build_actree( keyword)

然后就是进行查找:


def ac_detect(text):
    region_wds = []
    for w1 in actree.iter(text):
        if len(w1) > 0:
            region_wds.append(w1[1][1])
    return region_wds

text = sentences[0]
for text in sentences:
    print('\n句子为:',text)
    print('匹配到关键词:',ac_detect(text))

最后的结果为:

句子为: 降压药的选择不知道波依定和洛活喜两种药各有什么特点,适合哪种人服 
匹配到关键词: ['降压药']

句子为: 下体很痒怎么回事?是什么原因呢?长了一些像疹子一样的东西,正好我 
匹配到关键词: []

句子为: 请问吃紧急避孕药月经推迟多久才是正常的?我因为吃了那药,月经已经? 
匹配到关键词: ['避孕药', '月经', '月经']

句子为: 怎样快点长高我160想长高5厘米怎么可以长高 
匹配到关键词: []

句子为: 用了祛痘洗面奶反而脸上长痘痘了怎么回事啊? 
匹配到关键词: ['洗面奶']

4 ahocorapy 的使用

github:https://github.com/abusix/ahocorapy

Creation of the Search Tree

from ahocorapy.keywordtree import KeywordTree
kwtree = KeywordTree(case_insensitive=True)
kwtree.add('malaga')
kwtree.add('lacrosse')
kwtree.add('mallorca')
kwtree.add('mallorca bella')
kwtree.add('orca')
kwtree.finalize()

Searching

result = kwtree.search('My favorite islands are malaga and sylt.')
print(result)

The search_all method returns a generator for all keywords found, or None if there is none.

results = kwtree.search_all('malheur on mallorca bellacrosse')
for result in results:
    print(result)

还可以看逻辑图:

from ahocorapy_visualizer.visualizer import Visualizer
visualizer = Visualizer()
visualizer.draw('readme_example.png', kwtree)

当然可能报错:

ModuleNotFoundError: No module named 'pygraphviz'

这个时候又会回到之前pyahocorasick遇到的问题。


5 关键词匹配与flashtext进行对比

python︱flashtext高效关键词查找与替换

import ahocorasick
def build_actree(wordlist):
    '''
        AC自动机进行关键词匹配
        构造AC trie
    '''
    actree = ahocorasick.Automaton()         # 初始化trie树
    for index, word in enumerate(wordlist):
        actree.add_word(word, (index, word))     # 向trie树中添加单词
    actree.make_automaton()    # 将trie树转化为Aho-Corasick自动机
    #self.actree = actree
    return actree


def ac_detect(actree,text):
    '''
        AC自动机进行关键词匹配
        文本匹配
    '''
    region_wds = []
    for w1 in actree.iter(text):
        if len(w1) > 0:
            region_wds.append(w1[1][1])
    return region_wds

wordlist = ['健康','减肥']
text = '今天你减肥了吗,今天你健康了吗,减肥 = 健康!'
actree = build_actree(wordlist)
%time ac_detect(actree,text)

>>> CPU times: user 10 µs, sys: 3 µs, total: 13 µs
>>> Wall time: 17.4 µs
>>> ['减肥', '健康', '减肥', '健康']

与flashtext进行对比:

from flashtext import KeywordProcessor

def build_actree(wordlist):
    '''
        AC自动机进行关键词匹配
        构造AC trie
    '''
    actree = KeywordProcessor()
    for index, word in enumerate(wordlist):
        actree.add_keyword(word)     # 向trie树中添加单词
    #self.actree = actree
    return actree

def ac_detect(actree,text,span_info = True):
    '''
        AC自动机进行关键词匹配
        文本匹配
    '''
    region_wds = []
    for w1 in actree.extract_keywords(text,span_info = span_info):
        if len(w1) > 0:
            region_wds.append(w1[0])
    return region_wds

wordlist = ['健康','减肥']
text = '今天你减肥了吗,今天你健康了吗,减肥 = 健康!'
actree = build_actree(wordlist)
%time ac_detect(actree,text)

>>> CPU times: user 41 µs, sys: 0 ns, total: 41 µs
>>> Wall time: 47.2 µs
>>> ['减肥', '健康', '减肥', '健康']

转载:https://blog.csdn.net/sinat_26917383/article/details/101701508
查看评论
* 以上用户言论只代表其个人观点,不代表本网站的观点或立场