1.背景

需求：统计8月每种商品类别的购买人数

select mer_type, count(distinct uid)
from t -- 表t在100G左右
where dt between '20230801' and '20230831'
group by mer_type

背景：这个任务跑了2h仍未跑出结果，就是因为count distinct在大数据量的情况下，性能巨差，于是想要使用bitmap来对其进行优化！

2.技术原理

2.1 BitMap

2.1.1 定义

BitMap的基本原理就是用一个bit来标记元素是否存在，因为仅用1个bit来存储一个数据，所以可以大大的节省空间；假设要使用BitMap来存储（1，5，1）这几个数字，如何存储呢？

0	1	2	3	4	5	6	7
0	1	0	0	0	1	0	0

2.1.2 使用场景

海量数据量下求不重复的整数的个数

2.1.3 代码实现

以下代码可以直接运行

class Bitmap:
	def __init__(self, size):
		self.size = size
		self.bitmap = [0] * ((size + 31) // 32)
	def set(self, num):
		index = num // 32
		offset = num % 32
		self.bitmap[index] |= (1 << offset)
	def test(self, num):
	  	index = num // 32
		offset = num % 32
		return (self.bitmap[index] & (1 << offset)) != 0
def remove_dup(nums):
  	bitmap = Bitmap(len(nums))
	res = []
	for num in nums:
	  	if not bitmap.test(num):
		  	bitmap.set(num)
			res.append(num)
	return res
 # 测试
nums = [1,2,3,4,1,3]
res = remove_dup(nums)
print(res) # [1,2,3,4]

2.2 RoaringBitMap

2.2.1 BitMap的问题

不管业务中实际的元素基数有多少，它占用的内存空间都恒定不变
数据越稀疏，空间浪费越严重

2.2.2 定义

将数据的前半部分，即216(这里为高16位)部分作为桶的编号，将分为216=65536个桶，RBM中将这些小桶叫做container
存储数据时，按照数据的高16位做为container的编号去找对应的container(找不到就创建对应的container)，再将低16位放入该container中
所以一个RBM是很多container的集合

2.2.3 代码实现

import pyroaring
def remove_dup(nums):
  	bitmap = pyroaring.BitMap()
	res = []
	for num in nums:
	  	if num not in bitmap:
		  	bitmap.add(num)
			res.append(num)
# 测试
nums = [1,2,3,4,1,3]
res = remove_dup(nums)
print(res) # [1,2,3,4]

3.案例分析

需求：统计8月每种商品类别的购买人数

3.1 定义UDF函数

import pyroaring
from pyhive import hive
def remove_dup(nums):
  	bitmap = pyroaring.BitMap()
	res = []
	for num in nums:
	  	if num not in bitmap:
		  	bitmap.add(num)
			res.append(num)
	return len(res)

3.2 创建UDF函数

CREATE FUNCTION remove_dup(nums array) RETURNS int
AS 'SELECT remove_dup(nums) FROM bitmap.py' LANGUAGE PYTHON;

3.3 使用UDF函数

select mer_type, remove_dup(collect_list(uid))
from t -- 表t在100G左右
where dt between '20230801' and '20230831'
group by mer_type

#数据人的面试交流地##晒一晒我的offer##我发现了面试通关密码##如何判断面试是否凉了##牛客在线求职答疑中心#

BitMap在数仓领域的应用【面试加分项】

1.背景

2.技术原理

2.1 BitMap

2.1.1 定义

2.1.2 使用场景

2.1.3 代码实现

2.2 RoaringBitMap

2.2.1 BitMap的问题

2.2.2 定义

2.2.3 代码实现

3.案例分析

3.1 定义UDF函数

3.2 创建UDF函数

3.3 使用UDF函数

全站热榜