Using Hadoop

Hadoop WordCount

WordCount is the classic introductory program for Hadoop.

We start by writing a WordCount.java program:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1); // type of the output value
    private Text word = new Text();                            // type of the output key

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

TokenizerMapper: splits each input line into tokens with a StringTokenizer and emits (word, 1) for every token.

IntSumReducer: sums the counts received for each word and writes (word, total).

In the main function:

The Job is configured with six classes: first the class that carries the job (setJarByClass), then the classes responsible for the map, combine, and reduce phases, and finally the classes of the output key and value. The last line submits the job and waits for it to finish.
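Since main reads the input and output paths from args[0] and args[1], the finished job is eventually submitted with two path arguments. A sketch of the general form (the jar is only built in the next section, and the two paths are placeholders):

bin/hadoop jar wc.jar WordCount <HDFS input dir> <HDFS output dir>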

Packaging and running

First, set the environment variable needed for compilation:

export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar

Then compile and package (from the correct directory):

root@huchi:/home/huchi/install_package/hadoop/hadoop-2.9.1# bin/hadoop com.sun.tools.javac.Main WordCount.java
root@huchi:/home/huchi/install_package/hadoop/hadoop-2.9.1# jar cf wc.jar WordCount*.class
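If you want to check what ended up in the jar, jar tf lists its contents (an optional sanity check; it assumes wc.jar is in the current directory):

jar tf wc.jar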

Once the jar has been generated we can use it, but first we need to prepare a couple of .txt files and upload them to HDFS.

hadoop fs -put file /
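To confirm that the upload worked, you can list the directory on HDFS (the /file path matches the put command above):

hadoop fs -ls /file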

The file directory contains a few .txt files.

root@huchi:/home/huchi/install_package/hadoop/hadoop-2.9.1# hadoop fs -cat /file/file1.txt
Hello huchi
Hello hadoop
Hello world
Hello guys
Hello nobody
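With the input in place, the job can be submitted and the result inspected. A sketch, assuming /file as the input directory and /output as a not-yet-existing output directory (MapReduce refuses to write into an existing output path); with a single reducer the counts typically end up in part-r-00000:

bin/hadoop jar wc.jar WordCount /file /output
hadoop fs -cat /output/part-r-00000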