Word Segmentation Tools

Jieba: https://github.com/fxsjy/jieba
SnowNLP: https://github.com/isnowfy/snownlp
LTP: http://www.ltp-cloud.com/
HanLP: https://github.com/hankcs/HanLP/
THULAC: https://github.com/thunlp/THULAC-Python
NLPIR: https://github.com/NLPIR-team/NLPIR
…

Segmentation Methods

Forward maximum matching (forward-max matching)

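Approach: starting at the left end of the text, take a window of at most max_len characters. If the window is a dictionary word, emit it and move past it; otherwise shrink the window by one character from the right and try again. If not even a single character matches, treat that character as out-of-dictionary and move on.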

Dictionary: word.txt

南京
南京市
长江
长江大桥
大桥
市长
北京
北京市长
烤鸭
南京烤鸭

Python implementation:

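class FMM(object):
    def __init__(self, dic_path):
        # Dictionary words, stored in a set for fast membership tests
        self.dictionary = set()
        # Length of the longest dictionary word
        self.maximum = 0

        # Load the dictionary file, one word per line
        with open(dic_path, 'r', encoding='utf8') as f:
            for line in f:
                line = line.strip()
                if line:
                    self.dictionary.add(line)
                    self.maximum = max(self.maximum, len(line))

    def cut(self, text, max_len):
        result = []  # segmented words
        index = max_len  # current window length
        no_word = ''  # consecutive out-of-dictionary characters; useful for spotting new words

        while len(text) > 0:
            if index == 0:  # no prefix matched: move one character into the buffer
                no_word += text[0]
                text = text[1:]
                index = max_len
                continue

            if text[:index] in self.dictionary:
                # Flush buffered out-of-dictionary characters first; they are
                # only emitted once the next dictionary word has been found
                if no_word != '':
                    result.append(no_word)
                    no_word = ''

                # Store the matched dictionary word and move past it
                result.append(text[:index])
                text = text[index:]
                index = max_len
            else:
                index -= 1

        # Flush out-of-dictionary characters left at the end of the text
        if no_word != '':
            result.append(no_word)
        return result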


if __name__ == '__main__':
    text = '北京市长喜欢南京烤鸭和你南京市长江大桥'
    max_len = 5
    tokenizer = FMM('word.txt')
    print(tokenizer.cut(text, max_len))

# Output:
# ['北京市长', '喜欢', '南京烤鸭', '和你', '南京市', '长江大桥']
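
Forward maximum matching is greedy, so its result depends heavily on which words the dictionary contains. The two word.txt listings in this post differ only in the entry 南京市长, and the following minimal sketch shows what that single entry does to a forward pass. The function name forward_max_match and the in-memory set are illustrative assumptions rather than part of the original code; the sketch uses the same buffering of unmatched characters as the FMM class.

```python
def forward_max_match(text, dictionary, max_len):
    """Greedy left-to-right segmentation (hypothetical helper for illustration)."""
    result = []
    unknown = ''  # buffer for characters that match no dictionary word
    pos = 0
    while pos < len(text):
        # Try the longest window first, then shrink it from the right.
        for size in range(min(max_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + size]
            if candidate in dictionary:
                if unknown:
                    result.append(unknown)
                    unknown = ''
                result.append(candidate)
                pos += size
                break
        else:
            # No prefix matched: treat the current character as unknown.
            unknown += text[pos]
            pos += 1
    if unknown:
        result.append(unknown)
    return result


base = {'南京', '南京市', '长江', '长江大桥', '大桥',
        '市长', '北京', '北京市长', '烤鸭', '南京烤鸭'}
text = '南京市长江大桥'

without_mayor = forward_max_match(text, base, 5)
with_mayor = forward_max_match(text, base | {'南京市长'}, 5)
print(without_mayor)  # ['南京市', '长江大桥']
print(with_mayor)     # ['南京市长', '江', '大桥']
```

Once 南京市长 is in the dictionary, the forward pass commits to it and can no longer recover 长江大桥, which is the kind of ambiguity that motivates matching from the other direction.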

Backward maximum matching (backward-max matching)

Dictionary: word.txt

南京
南京市
长江
长江大桥
大桥
南京市长
市长
北京
北京市长
烤鸭
南京烤鸭

Python implementation:

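class BKMM(object):
    def __init__(self, dic_path):
        # Dictionary words, stored in a set for fast membership tests
        self.dictionary = set()
        # Length of the longest dictionary word
        self.maximum = 0
        # Load the dictionary file, one word per line
        with open(dic_path, 'r', encoding='utf8') as f:
            for line in f:
                line = line.strip()
                if line:
                    self.dictionary.add(line)
                    self.maximum = max(self.maximum, len(line))

    def cut(self, text, max_len):
        result = []  # segmented words, collected from right to left
        index = len(text)  # right edge of the part of the text not yet segmented
        no_word = ''  # out-of-dictionary characters, in reverse order; useful for spotting new words

        while index > 0:
            word = None
            # Backward maximum matching: try the longest window ending at index first
            for first in range(index - max_len if index > max_len else 0, index):
                if text[first:index] in self.dictionary:
                    word = text[first:index]

                    # Flush buffered out-of-dictionary characters, since they are
                    # part of the sentence too. They were collected right to left
                    # (e.g. no_word = '你和'), so reverse them to get '和你'.
                    if no_word != '':
                        result.append(no_word[::-1])
                        no_word = ''  # reset the buffer for the next unknown run

                    # Store the matched dictionary word and move the right edge
                    result.append(word)
                    index = first
                    break
            if word is None:  # no match: move one character into the buffer
                index = index - 1
                no_word += text[index]  # reverse order: '和你' is stored as '你和'
        # Flush out-of-dictionary characters left at the start of the text
        if no_word != '':
            result.append(no_word[::-1])
        return result[::-1]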


if __name__ == '__main__':
    text = '北京市长喜欢南京烤鸭和你南京市长江大桥'
    max_len = 5
    tokenizer = BKMM('word.txt')
    print(tokenizer.cut(text, max_len))

Output:
['北京市长', '喜欢', '南京烤鸭', '和你', '南京市', '长江大桥']

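Bidirectional maximum matching

Algorithm:

1. Run both forward and backward maximum matching and compare the two results.
2. If the results contain a different number of words, take the one with fewer words.
3. If the results contain the same number of words:
- if the segmentations are identical, return either one;
- if the segmentations differ, return the one with fewer single-character words.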
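
The selection rules above can be sketched as follows. This is a minimal illustration under several assumptions, not the original author's implementation: the function names are hypothetical, the dictionary is an in-memory set instead of word.txt, unmatched characters are emitted as single-character tokens (unlike the FMM and BKMM classes, which buffer them), and a tie on the single-character count falls back to the backward result, a case the rules do not cover.

```python
def forward_max_match(text, dictionary, max_len):
    """Left-to-right maximum matching (hypothetical helper)."""
    words, pos = [], 0
    while pos < len(text):
        for size in range(min(max_len, len(text) - pos), 0, -1):
            # A single character is accepted even if it is not in the dictionary.
            if size == 1 or text[pos:pos + size] in dictionary:
                words.append(text[pos:pos + size])
                pos += size
                break
    return words


def backward_max_match(text, dictionary, max_len):
    """Right-to-left maximum matching (hypothetical helper)."""
    words, end = [], len(text)
    while end > 0:
        for size in range(min(max_len, end), 0, -1):
            if size == 1 or text[end - size:end] in dictionary:
                words.append(text[end - size:end])
                end -= size
                break
    return words[::-1]  # words were collected right to left


def bidirectional_max_match(text, dictionary, max_len):
    fwd = forward_max_match(text, dictionary, max_len)
    bwd = backward_max_match(text, dictionary, max_len)
    # Rule 2: different word counts, so prefer the shorter segmentation.
    if len(fwd) != len(bwd):
        return fwd if len(fwd) < len(bwd) else bwd
    # Rule 3a: identical segmentations, so either one will do.
    if fwd == bwd:
        return fwd
    # Rule 3b: prefer fewer single-character words.
    # Assumption: on a tie, return the backward result.
    def singles(words):
        return sum(1 for w in words if len(w) == 1)
    return fwd if singles(fwd) < singles(bwd) else bwd


dictionary = {'南京', '南京市', '长江', '长江大桥', '大桥', '南京市长',
              '市长', '北京', '北京市长', '烤鸭', '南京烤鸭'}
text = '南京市长江大桥'
print(forward_max_match(text, dictionary, 5))        # ['南京市长', '江', '大桥']
print(backward_max_match(text, dictionary, 5))       # ['南京市', '长江大桥']
print(bidirectional_max_match(text, dictionary, 5))  # ['南京市', '长江大桥']
```

Here the forward pass yields three words and the backward pass two, so rule 2 selects the backward segmentation.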

Reference:
[1] https://blog.csdn.net/weixin_42894555/article/details/106708086
