Hadoop MapReduce: Core Principles and Practice
Overview of the Hadoop MapReduce Programming Model
Hadoop MapReduce is a distributed computing framework for parallel processing of large-scale datasets. Its core idea is to break a computation into two phases: Map and Reduce. The Map phase reads the input and transforms it into intermediate key-value pairs, and the Reduce phase aggregates those intermediate results to produce the final output.
MapReduce Core Components
- JobTracker: schedules and manages jobs, assigning tasks to TaskTrackers.
- TaskTracker: executes individual Map or Reduce tasks and reports their status back to the JobTracker.
- InputFormat: defines the format of the input data and how it is split into input splits.
- OutputFormat: defines how the output data is stored.
(JobTracker and TaskTracker are the classic MRv1 daemons; in Hadoop 2 and later their roles are handled by YARN's ResourceManager, NodeManagers, and a per-job ApplicationMaster, but the programming model described below is unchanged.)
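As a small illustration of where the last two components plug in, the hypothetical helper below (not part of the WordCount program that follows) explicitly sets Hadoop's standard TextInputFormat and TextOutputFormat on a job; both happen to be the defaults, so this is purely illustrative.

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class FormatConfigExample {
    // TextInputFormat splits files into lines and feeds <byte offset, line> pairs to the Mapper;
    // TextOutputFormat writes each <key, value> pair as one tab-separated line.
    public static void configureFormats(Job job) {
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
    }
}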
MapReduce Workflow
- Input splits: the input data is divided into splits, and each split is processed by one Map task. The split size usually matches the HDFS block size (128 MB by default in Hadoop 2.x and later).
- Map phase: each Map task processes one input split and emits intermediate key-value pairs. In word counting, for example, the Map output takes the form <word, 1>.
- Shuffle and sort: the Map output is partitioned, sorted, and optionally combined, ensuring that all records with the same key are delivered to the same Reduce task.
- Reduce phase: each Reduce task aggregates the list of values for a key. In word counting, the Reduce output is <word, total_count>.
Writing a MapReduce Program
The following is a simple word-count (WordCount) program:
Mapper class
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits <word, 1> for every token in the input line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Split on whitespace; "\\s+" avoids empty tokens from repeated spaces.
        String[] words = value.toString().split("\\s+");
        for (String w : words) {
            word.set(w);
            context.write(word, one);
        }
    }
}
Reducer class
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the counts for each word and emits <word, total_count>.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
Driver class
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Configures and submits the WordCount job; expects the input and output paths as arguments.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // the output directory must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
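Assuming the three classes above are packaged into a jar (the jar name and HDFS paths below are illustrative), the job can be submitted with the standard hadoop jar command, and the result inspected with hdfs dfs:

hadoop jar wordcount.jar WordCountDriver /user/hadoop/input /user/hadoop/output
hdfs dfs -cat /user/hadoop/output/part-r-00000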
MapReduce Optimization Strategies
- Use a Combiner: aggregate data locally on the Map side to reduce network traffic during the shuffle. In WordCount, the Reducer logic can be reused directly as the Combiner (see the sketch after this list).
- Set a sensible number of Reduce tasks: adjust the count with job.setNumReduceTasks(int) to balance load and avoid data skew.
- Custom partitioner (Partitioner): distribute keys evenly across Reduce tasks so that no single Reduce task is overloaded (also sketched below).
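A minimal sketch of the first and third points, reusing the WordCount classes above (the class and method names here are hypothetical): the Reducer is registered as a Combiner for Map-side aggregation, and a custom Partitioner routes keys by their first character instead of the default hash.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordCountTuning {
    // Hypothetical partitioner: routes each word by its first character so the
    // key space is spread across reducers explicitly rather than by the default hash.
    public static class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            if (key.getLength() == 0) {
                return 0;
            }
            char first = Character.toLowerCase(key.toString().charAt(0));
            return first % numPartitions;
        }
    }

    // Called from the driver before job submission.
    public static void tune(Job job) {
        job.setCombinerClass(WordCountReducer.class);          // local aggregation on the Map side
        job.setPartitionerClass(FirstLetterPartitioner.class); // custom key distribution
        job.setNumReduceTasks(4);                              // illustrative reducer count
    }
}

Reusing the Reducer as the Combiner works here because the Combiner's input and output types must match the Mapper's output types, which WordCountReducer satisfies (<Text, IntWritable> in and out).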
Limitations of MapReduce
- Poorly suited to iterative computation and real-time processing.
- Intermediate results are written to disk, which limits performance.
- Complex DAG (directed acyclic graph) workloads must be expressed as chains of multiple MapReduce jobs.
Alternatives
For more complex workloads, consider newer engines such as Apache Spark or Flink, which support in-memory computation and more flexible programming models, as in the brief sketch below.
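For comparison, here is a minimal word-count sketch using Spark's Java RDD API (assuming a Spark 2.x or later environment; the paths are illustrative). The intermediate data stays in memory across the flatMap/mapToPair/reduceByKey pipeline rather than being materialized to disk between chained MapReduce jobs.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("spark word count");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile(args[0]);
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator()) // tokenize
                    .mapToPair(word -> new Tuple2<>(word, 1))                      // <word, 1>
                    .reduceByKey(Integer::sum);                                    // <word, total_count>
            counts.saveAsTextFile(args[1]);
        }
    }
}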