python字符串匹配工具性能比较-爱开源

做敏感词过滤的时候要用到字符串匹配，从一个文件中读入需要匹配的敏感词，和一段文本去匹配，用string的find方法是不太合适了，搜了一下，发现AC自动机的方式更好。AC自动机是一个可以用来快速进行多关键字匹配的数据结构，具体信息可以参见这篇经典的论文：Efficient string matching: an aid to bibliographic search

再找了一下，python有几个扩展是做这个的，可以直接拿来用。考虑到性能问题，于是把他们放在一起对比一下，看看哪一个更快：

Ahocorasick：使用Aho-Corasick自动机的方式，根据一组关键词进行匹配，返回关键词出现的位置。用C实现，python包装
Acora：多关键字搜索引擎，使用Aho-Corasick以及NFA-to-DFA自动机的方式
Esmre：也是使用的AhoCorasick自动机的方式，做了一些细微的修改。也是用C实现，python包装

对比的方式：敏感词列表中有684个中文敏感词，每个一行，用open的方式读入，用for循环进行构建，然后用敏感词和网上找的一片文章拼接出一段文本作为待匹配文本，大小为16K。分别使用上面三个工具进行匹配，记录使用的时间。测试代码如下：

#encoding:utf-8
#!/usr/bin/python

import time
import ahocorasick
import codecs
import esm
from acora import AcoraBuilder

dictFile = "dict.txt"
testFile = "content.txt"

class Ahocorasick(object):
    def __init__(self,dic):
        self.__tree = ahocorasick.KeywordTree()
        fp = open(dic)
        for line in fp:
            self.__tree.add(line.rstrip("n"))
        fp.close()
        self.__tree.make()

    def findall(self,content):
        hitList = []
        for start, end in self.__tree.findall(content):
            hitList.append(content[start:end])
        return hitList

class Acora(object):

    def __init__(self,dic):
        self.__builder = AcoraBuilder()
        fp = open(dic)
        for line in fp:
            self.__builder.add(line.rstrip("n").decode("utf-8"))
        fp.close()
        self.__tree = self.__builder.build()

    def findall(self,content):
        hitList = []
        for hitWord, pos in self.__tree.finditer(content):
            hitList.append(hitWord)
        return hitList

class Esmre(object):

    def __init__(self,dic):
        self.__index = esm.Index()
        fp = open(dic)
        for line in fp:
            self.__index.enter(line.rstrip("n"))
        fp.close()
        self.__index.fix()

    def findall(self,content):
        hitList = []
        for pos, hitWord in self.__index.query(content):
            hitList.append(hitWord)
        return hitList

if __name__ == "__main__":
    fp = open(testFile)
    content = fp.read()
    fp.close()

    t0 = time.time()
    ahocorasick_obj = Ahocorasick(dictFile)
    t1 = time.time()
    acora_obj = Acora(dictFile)
    t2 = time.time()
    esmre_obj = Esmre(dictFile)
    t3 = time.time()

    ahocorasick_obj.findall(content)
    t4 = time.time()
    acora_obj.findall(content.decode("utf-8"))
    t5 = time.time()
    esmre_obj.findall(content)
    t6 = time.time()

    Ahocorasick_init = t1 - t0
    Acora_init = t2 - t1
    Esmre_init = t3 - t2
    Ahocorasick_process = t4 - t3
    Acora_process = t5 - t4
    Esmre_process = t6 - t5
    print "Ahocorasick init: %f" % Ahocorasick_init
    print "Acora init: %f" % Acora_init
    print "Esm_init: %f" % Esmre_init
    print "Ahocorasick process: %s" % Ahocorasick_process
    print "Acora process: %s" % Acora_process
    print "Esm_process: %f" % Esmre_process

测试结果如下：

构建自动机所需时间：

Ahocorasick：	0.017880	0.030629	0.018124
Acora：	60.202289	60.188462	59.801024
Esmre：	0.006770	0.006437	0.006538

匹配关键字所用时间：

Ahocorasick：	0.00136995315552	0.00102996826172	0.00102400779724
Acora：	0.00235295295715	0.00202894210815	0.00201892852783
Esmre：	0.005928	0.005371	0.005410

从结果来看，Ahocorasick构建自动机第二，匹配关键字最快，Acora构建自动机特别慢（也可能是我写的有问题，只有这个需要将文本转换成unicode的形式），匹配的速度拍第二，esm构建自动机最快，匹配的最慢。综合来看，Ahocorasick性能最佳。做敏感词过滤的时候，可以考虑使用这个工具。

转载请注明：爱开源 » python字符串匹配工具性能比较

python字符串匹配工具性能比较

与本文相关的文章

您必须登录才能发表评论！

与本文相关的文章

您必须 登录 才能发表评论！

您必须登录才能发表评论！