首页 > 试题广场 >

字符串出现次数的TopK问题

[编程题]字符串出现次数的TopK问题

热度指数：45538 时间限制：C/C++ 2秒，其他语言4秒空间限制：C/C++ 256M，其他语言512M
算法知识视频讲解

给定一个字符串数组，再给定整数 k ，请返回出现次数前k名的字符串和对应的次数。

返回的答案应该按字符串出现频率由高到低排序。如果不同的字符串有相同出现频率，按字典序排序。

对于两个字符串，大小关系取决于两个字符串从左到右第一个不同字符的 ASCII 值的大小关系。

比如"ah1x"小于"ahb"，"231"<”32“

字符仅包含数字和字母

数据范围：字符串数满足 $0 \le n \le 100000$ ，每个字符串长度 $0 \le n \le 10$ ， $0 \le k \le 2500$

要求：空间复杂度 $O(n)$ ，时间复杂度 $O(n \log k)$

示例1

输入

["a","b","c","b"],2

输出

[["b","2"],["a","1"]]

说明

"b"出现了2次，记["b","2"]，"a"与"c"各出现1次，但是a字典序在c前面，记["a","1"]，最后返回[["b","2"],["a","1"]]

示例2

输入

["123","123","231","32"],2

输出

[["123","2"],["231","1"]]

说明

 "123"出现了2次，记["123","2"]，"231"与"32"各出现1次，但是"231"字典序在"32"前面，记["231","1"]，最后返回[["123","2"],["231","1"]]

示例3

输入

["abcd","abcd","abcd","pwb2","abcd","pwb2","p12"],3

输出

[["abcd","4"],["pwb2","2"],["p12","1"]]

备注:

算法知识视频讲解

Python

不愿吃饼的三文鱼很调皮

#
# return topK string
# @param strings string字符串一维数组 strings
# @param k int整型 the k
# @return string字符串二维数组
#
class Solution:
    def topKstrings(self , strings , k ):
        if not strings:
            return None
        # 用字典统计每个单词的数量
        mydict = {}
        for word in strings:
            if word in mydict:
                mydict[word] += 1
            else:
                mydict[word] = 1
        # 按value域降序排列，返回值是list
        cur = sorted(mydict.items(), key=lambda x: x[1], reverse=True)
        mylist = []
        for item in cur:
            tar = []
            tar.append(item[0])
            tar.append(str(item[1]))
            mylist.append(tar)
        # 设置临时容器，存放相同value值的单词
        temp = [mylist[0]]
        i, j = 0, 1
        ans = []
        while j < len(mylist):
            # 若两个单词的频数相等
            if mylist[j][1] == mylist[i][1]:
                # value值相同则将单词放入temp中，准备字母排序
                temp.append(mylist[j])
                # 指针后移
                i += 1
                j += 1
            else:
                # 将相同value值的按字母排序，加到ans中
                temp.sort()
                ans += temp
                # temp清空，准备记录下一轮
                temp = []
                # 指针后移
                i += 1
                j += 1
                # 开始记录新的一轮
                temp.append(mylist[i])
        # 最后一轮的temp还没有处理，别落下
        temp.sort()
        ans += temp
        return ans[:k]

发表于 2021-07-17 14:00:30 回复(0)

Python

Keosu

from collections import  Counter
class Solution:
    def topKstrings(self , strings , k ):
      c = Counter(strings)
      d = sorted(c.items(), key = lambda x: (-x[1],x[0]))
      return d[:k]

发表于 2021-07-05 16:00:15 回复(0)

Python

HHHChen77

人生苦短我用python

class Solution:
    def topKstrings(self, strings, k ):
        # write code here
        # 哈希统计次数
        count = {}
        for char in strings:
            count[char] = count.get(char, 0) + 1
        # 定义排序函数 (按次数降序,字典序升序)
        result = sorted(count.items(), key = lambda x: (-x[1],x[0]))
        # 排序完的前k个
        return result[:k]

发表于 2021-06-24 02:43:18 回复(0)

Python

牛客523827377号

不知道什么原因，太耗时了，求解答

class Solution:

def topKstrings(self , strings , k ):
# write code here
if not strings:
return ""
lists={}
for i in set(strings):
lists[i]=strings.count(i)
last=sorted(lists.items(),key=lambda x:(-x[1],x[0]))
last=[list(i) for i in last]
return last[0:k]

发表于 2021-06-04 17:14:38 回复(0)

Python

白色纯度

用了 sort 真心木有意思，这里提供一个以空间换时间的方法，目前用时和空间均超过95%的人，并且符合题意要求：

1. 先用字典统计每个字符的词频，同时用 max_count 记录最大词频数

2. 创建一个元素为列表的长度为 max_count + 1 的数组ans，以词频为index，将字符串放到对应的数组中。如果词频相同，那就等于放到了同一index下的数组中。

3. 接下来从最高词频（最大索引）的元素开始，先排序，再根据k的大小按需从ans中抽取满足条件的字符，由于词频即为索引index的值，因此累加即可！

本算法亮点在于用数组的index作为词频，而且获取答案的时候按需索取，k用完之后即可返回答案。

class Solution:
    def topKstrings(self , strings , k):
        res = {}
        result = []
        max_count = 0
        for i in strings:
            res[i] = res.get(i, 0) + 1
            max_count = max(max_count, res[i])
        ans = [[] for _ in range(max_count + 1)]
        for key, value in res.items():
            ans[value].append(key)
        for j in range(max_count + 1):
            if ans[max_count - j] != []:
                ans[max_count - j] = sorted(ans[max_count - j])
                if len(ans[max_count - j]) < k:
                    for ele in ans[max_count - j]:
                        result += [[ele, str(max_count - j)]]
                    k -= len(ans[max_count - j])
                else:
                    for ele in range(k):
                        result += [[ans[max_count - j][ele], str(max_count - j)]]
                    return result

发表于 2021-05-20 23:25:28 回复(1)

Python

胡桃你带我走吧

Python3代码

1）用dict统计字符串出现次数，时间复杂度O(N)；

2）用heapq维护一个【字符串+出现次数】的大顶堆，时间复杂度O(N logK)。由于heapq库里用小于判断堆元素的大小，所有我们要另外写一个对象重载【<】这个操作符，即__lt__函数；

3）从堆顶弹出所有元素，然后逆序列表，时间复杂度O(K logK + K)。

综上，总的时间复杂度是O(N logK)。

class hobj:

def __init__(self, val, key):
self.val = val
self.key = key

def __lt__(self, other):
if self.val == other.val:
return self.key > other.key
return self.val < other.val

class Solution:
def topKstrings(self , strings, k):
import heapq as hp
from collections import defaultdict
sd = defaultdict(int)
for s in strings: sd[s] += 1
h = []
for key, val in sd.items():
if len(h) < k: hp.heappush(h, hobj(val, key))
else: hp.heappushpop(h, hobj(val, key))

ans = []
while h:
ho = hp.heappop(h)
ans.append([ho.key, str(ho.val)])
return ans[::-1]

编辑于 2021-04-18 02:18:39 回复(0)

Python

牛客四二三号

堆排序，前k小用大根堆

python又炸了，超**

#
# return topK string
# @param strings string字符串一维数组 strings
# @param k int整型 the k
# @return string字符串二维数组
#
class Solution:
    def topKstrings(self , strings , k ):
        # write code here
        def topk(li, k):
            heap = li[0:k]  # 取列表前k个元素
            for i in range((k - 2) // 2, -1, -1):  # 前k个元素构建堆
                sift(heap, i, k - 1)
            for i in range(k, len(li)):  # 从列表第k + 1个元素（index为k）继续遍历
                if li[i] < heap[0]:  # 如果遍历的当前元素小于堆顶，则替换堆顶并进行一次调整
                    heap[0] = li[i]
                    sift(heap, 0, k - 1)
            for i in range(k - 1, 0, -1):  # 挨个出数
                heap[0], heap[i] = heap[i], heap[0]
                sift(heap, 0, i - 1)
            return heap  # 返回排好序的前k个数
        
        def sift(li, high, low):
            """
            调整函数（大根堆）
            此时的列表li除根结点（low）外，其它部分已经无需调整了
            大根堆：上层大下层小
            :param li: 列表
            :param low: 堆的根结点位置（堆顶）
            :param high: 堆的最后一个元素位置，唯一作用为不让j越界，间接使得i不越界
            :return:
            """
            i = low  # i指向父亲（初始化时指向堆顶）
            j = 2 * i + 1  # j指向孩子，默认首先指向左孩子
            tmp = li[low]  # 存放堆顶
            while j <= high:  # 只要j没有超过堆的最后一个元素位置，继续循环
                if j + 1 <= high and li[j] < li[j + 1]:  # 如果右孩子存在且右孩子比左孩子更大
                    j += 1  # j指向右孩子
                if li[j] > tmp:  # 孩子比tmp大
                    li[i] = li[j]  # 孩子放到父亲的位置
                    i = j  # i向下移一层（指向原来孩子的位置）
                    j = 2 * i + 1  # 同时j也向下移一层
                else:  # tmp比孩子大
                    #           li[i] = tmp  # tmp放到父亲的位置
                    break
            #   else:  # j超过堆的最后一个元素位置
            #       li[i] = tmp  # 直接将tmp放到堆的最下一层
            li[i] = tmp  # 观察上面两个else，发现while循环结束时始终会执行一次li[i] = tmp，因此可简化
        
        li = list(map(int, strings))
        print(li)
        res = list()
        res = topk(li, k)
        dic = {}
        for i in res:
            dic[i] = dic.get(i, 0) + 1
        fin = list()
        for i in dic:
            fin.append([i, str(dic[i])])
        fin.sort()
        return fin

编辑于 2021-03-30 16:12:43 回复(0)

Python

猫牛

from collections import defaultdict

class Solution:
    def topKstrings(self , strings , k ):
        # write code here
        cnt = defaultdict(int)
        for s in strings:
            cnt[s] += 1
        
        stat = sorted(cnt.items(), key=lambda tp: (-tp[1], tp[0]))[:k]
        return [[k, str(n)] for k, n in stat]

发表于 2021-03-24 23:12:30 回复(0)

提交观点

问题信息

哈希

难度：

8条回答 647收藏 12210浏览

通过挑战的用户

查看代码

想去杭州的布拉...

2023-02-19 14:42:39
乌拉拉肥肥猪

2022-11-15 18:37:21
超级靠谱的内推

2022-11-03 16:38:27
牛客96021...

2022-09-22 11:58:41
_菜鸟_

2022-09-16 10:52:13

字符串出现次数的TopK问题

输入

输出

说明

输入

输出

说明

输入

输出

备注:

问题信息

热门推荐

通过挑战的用户

相关试题