Common Spark Transformation Operators (Part 3)
Initializing the data
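All of the examples in this post assume a live SparkContext named sc plus the RDD and HashPartitioner imports. A minimal local setup sketch (the app name and master are placeholders, not from the original post):

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Minimal local setup assumed by every snippet below
val conf = new SparkConf().setAppName("TransformationDemo").setMaster("local[*]")
val sc = new SparkContext(conf)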
println("======================= 原始数据 ===========================")
val data1: RDD[Int] = sc.parallelize(1 to 10, 3)
println(s"原始数据为:${data1.collect.toBuffer}")
val data2: RDD[Int] = sc.parallelize(5 to 15, 2)
println(s"原始数据为:${data2.collect.toBuffer}")
val data3: RDD[Int] = sc.parallelize(List(1, 2, 3, 4, 5, 5, 4, 3, 2, 1))
println(s"原数数据为:${data3.collect.toBuffer}")
Run result (screenshot omitted)
distinct
Removes duplicate elements. An RDD may contain duplicates; distinct returns a new RDD with each element kept once. Because it shuffles the data, this method does not preserve element order and is expensive.
/**
* Return a new RDD containing the distinct elements in this RDD.
*/
// First overload: takes numPartitions, which acts like a modulus under the default HashPartitioner. Elements whose value divides evenly by numPartitions land first in the collected output, then those with remainder 1, and so on: unordered within each partition, ordered across partitions.
def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1)
}
/**
* Return a new RDD containing the distinct elements in this RDD.
*/
// Second overload: delegates to the first, defaulting numPartitions to the current partition count
def distinct(): RDD[T] = withScope {
distinct(partitions.length)
}
Scala version
println("======================= distinct-1 ===========================")
// 如果没有指定numPartitions参数,则为创建数据时的分区数量
val value1: RDD[Int] = data3.distinct()
println(s"经过distinct处理后的数据为:${value1.collect.toBuffer}")
println("======================= distinct-2 ===========================")
// 局部无序,整体有序。以传入的参数numPartitions作为因子,所有的元素除以numPartitions,模为0的排在第一位,之后排模为1的,以此类推
val value2: RDD[Int] = data3.distinct(2)
println(s"经过distinct处理后的数据为:${value2.collect.toBuffer}")
// Returns
// (4, 2, 1, 3, 5)
// 4, 2 ==> remainder 0
// 1, 3, 5 ==> remainder 1
Run result (screenshot omitted)
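Since distinct is just the map → reduceByKey → map pipeline shown in the source above, it can be reproduced by hand. A sketch against the same data3 (a hand-rolled illustration, not how you would normally write it):

// Hand-rolled equivalent of data3.distinct(2), mirroring the source
val manualDistinct: RDD[Int] = data3
  .map(x => (x, null))          // pair each element with a dummy value
  .reduceByKey((x, _) => x, 2)  // collapse duplicate keys into 2 partitions
  .map(_._1)                    // drop the dummy value
println(s"Manual distinct: ${manualDistinct.collect.toBuffer}")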
union
Merges two RDDs into one; duplicates are not removed.
/**
* Return the union of this RDD and another one. Any identical elements will appear multiple
* times (use `.distinct()` to eliminate them).
*/
// Returns the union of this RDD and the other without deduplication; the inputs are concatenated in order
def union(other: RDD[T]): RDD[T] = withScope {
sc.union(this, other)
}
Scala version
println("======================= union ===========================")
val value: RDD[Int] = data1.union(data2)
println(s"经过union处理后的数据为:${value3.collect.toBuffer}")
Run result (screenshot omitted)
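Because union simply concatenates the inputs' partitions (when they don't share a partitioner), the result here should have 3 + 2 = 5 partitions, and the overlapping values 5 through 10 appear twice. A small sketch, chaining .distinct() as the scaladoc suggests:

val unioned: RDD[Int] = data1.union(data2)
println(s"Partitions after union: ${unioned.getNumPartitions}")      // expected 5 = 3 + 2
println(s"union + distinct: ${unioned.distinct().collect.toBuffer}") // duplicates removed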
intersection
Computes the intersection of two RDDs, with duplicates removed. The result comes back unordered, and the operation is expensive (it shuffles internally).
/**
* Return the intersection of this RDD and another one. The output will not contain any duplicate
* elements, even if the input RDDs did.
*
* @note This method performs a shuffle internally.
*/
// First overload: one parameter; returns the intersection of this RDD and the other, with no duplicate elements
// The result is again unordered within partitions but ordered across them; the partition count is the larger of the two input RDDs'
def intersection(other: RDD[T]): RDD[T] = withScope {
this.map(v => (v, null)).cogroup(other.map(v => (v, null)))
.filter { case (_, (leftGroup, rightGroup)) => leftGroup.nonEmpty && rightGroup.nonEmpty }
.keys
}
/**
* Return the intersection of this RDD and another one. The output will not contain any duplicate
* elements, even if the input RDDs did.
*
* @note This method performs a shuffle internally.
*
* @param partitioner Partitioner to use for the resulting RDD
*/
// Second overload: two parameters, the other RDD and an explicit Partitioner
def intersection(
other: RDD[T],
partitioner: Partitioner)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
this.map(v => (v, null)).cogroup(other.map(v => (v, null)), partitioner)
.filter { case (_, (leftGroup, rightGroup)) => leftGroup.nonEmpty && rightGroup.nonEmpty }
.keys
}
/**
* Return the intersection of this RDD and another one. The output will not contain any duplicate
* elements, even if the input RDDs did. Performs a hash partition across the cluster
*
* @note This method performs a shuffle internally.
*
* @param numPartitions How many partitions to use in the resulting RDD
*/
// Third overload: two parameters, the second being numPartitions; internally calls the second overload with the default HashPartitioner(numPartitions), so the result follows the same unordered-within, ordered-across rule as distinct
def intersection(other: RDD[T], numPartitions: Int): RDD[T] = withScope {
intersection(other, new HashPartitioner(numPartitions))
}
Scala version
println("======================= intersection-1 ===========================")
val value1: RDD[Int] = data1.intersection(data2)
println(s"分区数量为:${value1.getNumPartitions}")
println(s"经过intersection处理后的数据为:${value1.collect.toBuffer}")
println("======================= intersection-2 ===========================")
val value2: RDD[Int] = data1.intersection(data2, new HashPartitioner(4))
println(s"分区数量为:${value2.getNumPartitions}")
println(s"经过intersection处理后的数据为:${value2.collect.toBuffer}")
println("======================= intersection-3 ===========================")
val value3: RDD[Int] = data1.intersection(data2, 5)
println(s"分区数量为:${value3.getNumPartitions}")
println(s"经过intersection处理后的数据为:${value3.collect.toBuffer}")
Run result (screenshot omitted)
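The cogroup-based implementation above can also be written out directly. A sketch mirroring the first overload with data1 and data2:

// Hand-rolled equivalent of data1.intersection(data2), mirroring the source
val manualIntersection: RDD[Int] = data1.map(v => (v, null))
  .cogroup(data2.map(v => (v, null)))                                      // group the dummy pairs from both sides by key
  .filter { case (_, (left, right)) => left.nonEmpty && right.nonEmpty }   // keep keys seen on both sides
  .keys
println(s"Manual intersection: ${manualIntersection.collect.toBuffer}")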
subtract
RDD1.subtract(RDD2) returns the elements that appear in RDD1 but not in RDD2.
/**
* Return an RDD with the elements from `this` that are not in `other`.
*
* Uses `this` partitioner/partition size, because even if `other` is huge, the resulting
* RDD will be <= us.
*/
// First overload: one parameter, delegating to the third overload
// Uses this RDD's partitioner if it has one, otherwise a HashPartitioner with this RDD's partition count (so the result keeps this RDD's partitioning, as the scaladoc notes)
def subtract(other: RDD[T]): RDD[T] = withScope {
subtract(other, partitioner.getOrElse(new HashPartitioner(partitions.length)))
}
/**
* Return an RDD with the elements from `this` that are not in `other`.
*/
// Second overload: delegates to the third, using the default HashPartitioner(numPartitions); the result follows the same unordered-within, ordered-across rule as distinct
def subtract(other: RDD[T], numPartitions: Int): RDD[T] = withScope {
subtract(other, new HashPartitioner(numPartitions))
}
/**
* Return an RDD with the elements from `this` that are not in `other`.
*/
// Third overload: two parameters, the second being an explicit Partitioner
def subtract(
other: RDD[T],
p: Partitioner)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
if (partitioner == Some(p)) {
// Our partitioner knows how to handle T (which, since we have a partitioner, is
// really (K, V)) so make a new Partitioner that will de-tuple our fake tuples
val p2 = new Partitioner() {
override def numPartitions: Int = p.numPartitions
override def getPartition(k: Any): Int = p.getPartition(k.asInstanceOf[(Any, _)]._1)
}
// Unfortunately, since we're making a new p2, we'll get ShuffleDependencies
// anyway, and when calling .keys, will not have a partitioner set, even though
// the SubtractedRDD will, thanks to p2's de-tupled partitioning, already be
// partitioned by the right/real keys (e.g. p).
this.map(x => (x, null)).subtractByKey(other.map((_, null)), p2).keys
} else {
this.map(x => (x, null)).subtractByKey(other.map((_, null)), p).keys
}
}
Scala version
println("======================= subtract-1 ===========================")
val value1: RDD[Int] = data1.subtract(data2)
println(s"分区数量为:${value1.getNumPartitions}")
println(s"经过subtract处理后的数据为:${value1.collect.toBuffer}")
println("======================= subtract-2 ===========================")
val value2: RDD[Int] = data1.subtract(data2, new HashPartitioner(4))
println(s"分区数量为:${value2.getNumPartitions}")
println(s"经过subtract处理后的数据为:${value2.collect.toBuffer}")
println("======================= subtract-3 ===========================")
val value3: RDD[Int] = data1.subtract(data2, 5)
println(s"分区数量为:${value3.getNumPartitions}")
println(s"经过subtract处理后的数据为:${value3.collect.toBuffer}")
Run result (screenshot omitted)
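Note that, unlike intersection, the one-argument subtract keeps this RDD's partitioning, so data1.subtract(data2) should come back with data1's 3 partitions. The subtractByKey pipeline can likewise be spelled out; a sketch:

// Hand-rolled equivalent of data1.subtract(data2), mirroring the source
val manualSubtract: RDD[Int] = data1
  .map(x => (x, null))                 // fake (element, null) pairs, as in the source
  .subtractByKey(data2.map((_, null))) // drop keys that also appear in data2
  .keys
println(s"Manual subtract: ${manualSubtract.collect.toBuffer}")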
cartesian
Returns the Cartesian product of two RDDs. The cost grows with the product of their sizes, so this is very expensive.
/**
* Return the Cartesian product of this RDD and another one, that is, the RDD of all pairs of
* elements (a, b) where a is in `this` and b is in `other`.
*/
// The partition count of the result is the product of the two input RDDs' partition counts
def cartesian[U: ClassTag](other: RDD[U]): RDD[(T, U)] = withScope {
new CartesianRDD(sc, this, other)
}
Scala version
println("======================= cartesian ===========================")
val value1: RDD[(Int, Int)] = data1.cartesian(data2)
println(s"分区数量为:${value1.getNumPartitions}")
println(s"经过cartesian处理后的数据为:${value1.collect.toBuffer}")
Run result (screenshot omitted)
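With data1 holding 10 elements in 3 partitions and data2 holding 11 elements in 2, the product should contain 10 × 11 = 110 pairs spread over 3 × 2 = 6 partitions. A quick check:

val product: RDD[(Int, Int)] = data1.cartesian(data2)
println(s"Pair count: ${product.count}")            // expected 110 = 10 * 11
println(s"Partitions: ${product.getNumPartitions}") // expected 6 = 3 * 2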
sample
Sampling operation: draws a subset of the data.
/**
* Return a sampled subset of this RDD.
*
* @param withReplacement can elements be sampled multiple times (replaced when sampled out)
* @param fraction expected size of the sample as a fraction of this RDD's size
* without replacement: probability that each element is chosen; fraction must be [0, 1]
* with replacement: expected number of times each element is chosen; fraction must be greater
* than or equal to 0
* @param seed seed for the random number generator
*
* @note This is NOT guaranteed to provide exactly the fraction of the count
* of the given [[RDD]].
*/
// Returns a sampled subset of this RDD
// withReplacement: whether an element can be drawn more than once
// fraction: when withReplacement is false, the probability that each element is chosen, in [0, 1]
// fraction: when withReplacement is true, the expected number of times each element is chosen, >= 0
// seed: seed for the random number generator; normally left unspecified unless reproducibility matters
def sample(
withReplacement: Boolean,
fraction: Double,
seed: Long = Utils.random.nextLong): RDD[T] = {
require(fraction >= 0,
s"Fraction must be nonnegative, but got ${fraction}")
withScope {
require(fraction >= 0.0, "Negative fraction value: " + fraction)
if (withReplacement) {
new PartitionwiseSampledRDD[T, T](this, new PoissonSampler[T](fraction), true, seed)
} else {
new PartitionwiseSampledRDD[T, T](this, new BernoulliSampler[T](fraction), true, seed)
}
}
}
Scala version
println("======================= sample-1 ===========================")
val value1: RDD[Int] = data1.sample(withReplacement = false, 0.5)
println(s"分区数量为:${value1.getNumPartitions}")
println(s"经过sample抽样的结果为:${value1.collect.toBuffer}")
println("======================= sample-2 ===========================")
val data4: RDD[Int] = data1.repartition(2)
val value2: RDD[Int] = data4.sample(withReplacement = false, 0.5)
println(s"分区数量为:${value2.getNumPartitions}")
println(s"经过sample抽样的结果为:${value2.collect.toBuffer}")
Run result (screenshot omitted)
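Two details worth demonstrating: with withReplacement = true the fraction may exceed 1 (it is the expected count per element, so some elements can show up several times), and fixing the seed makes a sample reproducible across runs. A sketch:

println("======================= sample-3 ===========================")
// With replacement, fraction 2.0 means each element is expected roughly twice;
// the fixed seed (an arbitrary value chosen for this sketch) makes the output repeatable
val value3: RDD[Int] = data1.sample(withReplacement = true, 2.0, seed = 42L)
println(s"Sample with replacement: ${value3.collect.toBuffer}")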