2024-05-28 09:17 门头沟学院算法工程师

关注

中文分词模拟器 - 华为OD统一考试(D卷)

OD统一考试（D卷）

分值： 200分

题解： Java / Python / C++

alt

题目描述

给定一个连续不包含空格字符的字符串，该字符串仅包含英文小写字母及英文标点符号（逗号、句号、分号），同时给定词库，对该字符串进行精确分词。

说明：

精确分词：字符串分词后，不会出现重叠。例如 "ilovechina"，不同切分后可得到 "i", "love", "china"。
标点符号不分词，仅用于断句。
词库：根据常识及词库统计出来的常用词汇。例如：dictionary={"i","love","china","ilovechina","lovechina"}。
分词原则:采用分词顺序优先且最长匹配原则。“ilovechina”，假设分词结果[i,ilove,lo,love,ch,china,lovechina] 则输出 [ilove,china]
- 错误输出：[i, lovechina]，原因："ilove" > 优先于 "lovechina" 成词。
- 错误输出：[i, love, china]，原因："ilove" > "i"，遵循最长匹配原则。

输入描述

字符串长度限制：0 < length < 256
词库长度限制：0 < length < 100000
第一行输入待分词语句 "ilovechina"
第二行输入中文词库 "i, love, china, ch, na, ve, lo, this, is, the, word"

输出描述

按顺序输出分词结果 "i, love, china"

示例1

输入：
ilovechina
i,love,china,ch,na,ve,lo,this,is,the,word

输出：
i,love,china

说明：
输入的字符串被按最长匹配原则分为 "i", "love", "china"。

示例2

输入：
ilovech
i,love,china,ch,na,ve,lo,this,is,the,word

输出：
i,love,ch

说明：
输入的字符串被按最长匹配原则分为 "i", "love", "ch"。

题解

这道题目属于字符串处理和字典树（Trie）应用的题目。题解思路如下：

构建字典树（Trie）：

使用字典树来存储词库中的单词，方便高效地进行查找和匹配。

分词过程：

对输入的字符串进行遍历，从当前位置向后进行匹配。

尽量匹配最长的单词，如果匹配成功，则将该单词加入结果列表。

如果没有匹配到词库中的任何单词，则将当前字符作为一个单词加入结果列表。

时间复杂度：

构建字典树的时间复杂度是 O(n)，其中 n 为词库中所有单词的总长度。

分词的时间复杂度是 O(m)，其中 m 为待分词字符串的长度。

空间复杂度：

字典树的空间复杂度是 O(n)，其中 n 为词库中所有单词的总长度。

Java

import java.util.*;
/**
 * @author code5bug
 */
class TrieNode {
    Map<Character, TrieNode> children = new HashMap<>();
    boolean isEndOfWord = false;
}

class Trie {
    private final TrieNode root;

    public Trie() {
        root = new TrieNode();
    }

    public void insert(String word) {
        TrieNode node = root;
        for (char ch : word.toCharArray()) {
            node.children.putIfAbsent(ch, new TrieNode());
            node = node.children.get(ch);
        }
        node.isEndOfWord = true;
    }

    public List<String> searchAll(String text) {
        List<String> result = new ArrayList<>();
        int length = text.length();
        for (int i = 0; i < length; ) {
            TrieNode node = root;
            int longestMatch = -1;
            for (int j = i; j < length; j++) {
                char ch = text.charAt(j);
                if (!node.children.containsKey(ch)) {
                    break;
                }
                node = node.children.get(ch);
                if (node.isEndOfWord) {
                    longestMatch = j;
                }
            }
            if (longestMatch == -1) {
                result.add(text.substring(i, i + 1));
                i++;
            } else {
                result.add(text.substring(i, longestMatch + 1));
                i = longestMatch + 1;
            }
        }
        return result;
    }
}

public class Main {
    public static void main(String[] args) {
        Scanner scanner = new Scanner(System.in);
        String text = scanner.nextLine().trim();
        String[] dictionaryWords = scanner.nextLine().trim().split(",");

        Trie trie = new Trie();
        for (String word : dictionaryWords) {
            trie.insert(word);
        }

        List<String> segmentedWords = trie.searchAll(text);
        System.out.println(String.join(",", segmentedWords));
    }
}

Python

class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_end_of_word = False

class Trie:
    def __init__(self):
        self.root = TrieNode()
    
    def insert(self, word):
        node = self.root
        for ch in word:
            if ch not in node.children:
                node.children[ch] = TrieNode()
            node = node.children[ch]
        node.is_end_of_word = True
    
    def search_all(self, text):
        result = []
        length = len(text)
        i = 0
        while i < length:
            node = self.root
            longest_match = -1
            for j in range(i, length):
                ch = text[j]
                if ch not in node.children:
                    break
                node = node.children[ch]
                if node.is_end_of_word:
                    longest_match = j
            if longest_match == -1:
                result.append(text[i])
                i += 1
            else:
                result.append(text[i:longest_match+1])
                i = longest_match + 1
        return result

if __name__ == "__main__":
    import sys
    input = sys.stdin.read
    data = input().splitlines()
    text = data[0].strip()
    dictionary_words = data[1].strip().split(',')

    trie = Trie()
    for word in dictionary_words:
        trie.insert(word)

    segmented_words = trie.search_all(text)
    print(','.join(segmented_words))

C++

#include <iostream>
#include <unordered_map>
#include <vector>
#include <string>
using namespace std;

class TrieNode {
public:
    unordered_map<char, TrieNode*> children;
    bool isEndOfWord;
    TrieNode() : isEndOfWord(false) {}
};

class Trie {
private:
    TrieNode* root;
public:
    Trie() {
        root = new TrieNode();
    }

    void insert(const string& word) {
        TrieNode* node = root;
        for (char ch : word) {
            if (node->children.find(ch) == node->children.end()) {
                node->children[ch] = new TrieNode();
            }
            node = node->children[ch];
        }
        node->isEndOfWord = true;
    }

    vector<string> searchAll(const string& text) {
        vector<string> result;
        int length = text.size();
        for (int i = 0; i < length; ) {
            TrieNode* node = root;
            int longestMatch = -1;
            for (int j = i; j < length; j++) {
                char ch = text[j];
                if (node->children.find(ch) == node->children.end()) {
                    break;
                }
                node = node->children[ch];
                if (node->isEndOfWord) {
                    longestMatch = j;
                }
            }
            if (longestMatch == -1) {
                result.push_back(string(1, text[i]));
                i++;
            } else {
                result.push_back(text.substr(i, longestMatch - i + 1));
                i = longestMatch + 1;
            }
        }
        return result;
    }
};

int main() {
    string text;
    getline(cin, text);
    string dictionaryLine;
    getline(cin, dictionaryLine);

    vector<string> dictionaryWords;
    size_t pos = 0;
    while ((pos = dictionaryLine.find(',')) != string::npos) {
        dictionaryWords.push_back(dictionaryLine.substr(0, pos));
        dictionaryLine.erase(0, pos + 1);
    }
    dictionaryWords.push_back(dictionaryLine);

    Trie trie;
    for (const string& word : dictionaryWords) {
        trie.insert(word);
    }

    vector<string> segmentedWords = trie.searchAll(text);
    for (size_t i = 0; i < segmentedWords.size(); i++) {
        cout << segmentedWords[i];
        if (i < segmentedWords.size() - 1) {
            cout << ",";
        }
    }
    cout << endl;
    return 0;
}

相关练习题

alt

希望这个专栏不仅能帮您成功通过华为机试，还能让您熟练掌握算法。

整理题解不易，如果有帮助到您，请给点个赞 ‍❤️‍ 和收藏 ⭐，让更多的人看到。🙏🙏🙏

#面经##春招##秋招##校招##华为#

全部评论

推荐最新楼层

owltown

门头沟学院 C++

有个问题，如果分词表是I,lo,ve,love,china的话，根据字典树的构建love确实会被分出来，但是分词表的顺序lo才是正确的分词结果。这个问题，字典树该怎么解决

点赞回复分享

发布于 2024-09-30 19:32 上海

大大的小苏打

浙江师范大学 C++

有个小疑问,假如是下面这种情况： ilovech i,ilov,ilove,ech 那么按照算法是不是会选择先匹配ilove，然后剩下了一个ch不就无法匹配了么，但是如果是和ilov匹配，然后剩下的就是和ech匹配，想知道题目会出现这种情况么

点赞回复分享

发布于 2024-07-26 16:44 浙江

遥夜fafa

门头沟学院嵌入式工程师

今年真是见鬼了，没想到我也要来看这个

点赞回复分享

发布于 2024-05-28 14:41 广东

05-29 18:03

西安邮电大学产品经理

入职字节一个月，已经学会……

已熟练使用包括但不限于：拉齐、颗粒度、用户心智、还需教育、CC、体感、底层逻辑、降维打击、圈层、cover掉、布局、反哺、闭环、通路、通晒、晒一下、POC、同学你好、这里就不再过多赘述了、同步、给到、review、double check、西快、节点追投、很急加急、sop、辛苦xx一下、冲一下、找人地图、各种文档emoji、拉个会议、发个ONCALL、FYI、to do、Get、Done欢迎字节同学补充👏

希望一切顺顺利利：先解除锁定，再按热水

我的实习日记

点赞评论收藏

05-28 21:59

曼伦商贸_供应链管理(准入职员工)

曼伦内推-曼伦内推码

【品牌市场岗位】 我面的是曼伦·杜蕾斯品牌市场岗位，跟面试官聊下来，我这个岗主要是做媒体投放执行以及Campaign支持。 ⌚️【面试流程】 面试官有：HR和LEADER，一开始HR就明确说明面试分为三个阶段：1️⃣3-5分钟自我介绍2️⃣针对过往经历提问3️⃣岗位问题探讨。 🙋‍♀️【问题汇总】 1️⃣请简单介绍一下你自己。 2️⃣能详细讲一下你在某个项目中的具体角色和负责的工作内容吗？ 3️⃣你最近有看到哪些杜蕾斯广告吗？ 4️⃣怎么看待品牌仍然在做大型户外媒体投放？ 5️⃣最近在户外媒体看到哪些印象深刻的广告？ 6️⃣户外媒体形式及差异？ 7️⃣如何看待从创意岗转向渠道沟通和媒体投放多...

投递曼伦商贸等公司6个岗位 >

点赞评论收藏

04-14 14:42

西北大学产品经理

为什么找了半年没有一个offer？太苦了，到底哪里出问题了呢

吐泡泡的咸鱼：我也工作了几年了，也陆陆续续面试过不少人，就简历来说，第一眼学历不太够，你只能靠你的实习或者论文或者项目经历，然后你没有论文，没有含金量高的比赛和奖项，只能看实习和项目，实习来说，你写的实习经历完全不清楚你想找什么工作？行研？数据分析？且写的太少了，再看项目，这些项目先不说上过大学读过研究生的都知道很水，然后对你想找的岗位有什么帮助呢？项目和实习也完全不匹配啊，你好像在努力将你所有的经历都放在简历里想表现你的优秀，但是对于你想找的岗位来说，有什么用呢？最后只能获得岗位不匹配的评价。所以你需要明白你想要找的岗位要求是什么，是做什么的，比如产品经理，然后再看你的经历里有什么匹配的上这个岗位，或者对这个岗位以及这个岗位所在的公司有价值，再写到你的简历上

点赞评论收藏