5.3 Counters
Counters in Hadoop are somewhat like logs: they report statistics about what Hadoop did while a job was running.
In the WordCount run from earlier, the console printed the following (you can run the WordCount example again to see it):
Counters: 38
    File System Counters    # 10 counters
        FILE: Number of bytes read=462
        FILE: Number of bytes written=541399
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=38
        HDFS: Number of bytes written=19
        HDFS: Number of read operations=15
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=6
    Map-Reduce Framework    # 20 counters
        Map input records=2
        Map output records=4
        Map output bytes=35
        Map output materialized bytes=49
        Input split bytes=109
        Combine input records=0
        Combine output records=0
        Reduce input groups=3
        Reduce shuffle bytes=49
        Reduce input records=4
        Reduce output records=3
        Spilled Records=8
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=59
        CPU time spent (ms)=0
        Physical memory (bytes) snapshot=0
        Virtual memory (bytes) snapshot=0
        Total committed heap usage (bytes)=242360320
    Shuffle Errors    # 6 counters
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters    # 1 counter
        Bytes Read=19
    File Output Format Counters    # 1 counter
        Bytes Written=19
As the log shows, there are 38 counters in total, organized into 5 groups; names such as File System Counters are group names, and the groups contain 10, 20, 6, 1, and 1 counters respectively.
We do not care about every one of these 38 counters; below we focus on the most important ones.
I. Explanation of Selected Counters
1. File Input Format Counters
File Input Format Counters    # 1 counter
    Bytes Read=19
This means the total number of bytes read from the file in HDFS is 19.
Recall the contents of word.txt:
hello you
hello me
The letters total 5+3+5+2=15 bytes; adding the 2 spaces and the 2 newline characters (one at the end of each line) gives 19 bytes.
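The byte arithmetic above can be double-checked with a few lines of plain Java (the class name ByteCount is ours, purely for illustration):

```java
import java.nio.charset.StandardCharsets;

// Verifies that the two-line input file occupies exactly 19 bytes.
public class ByteCount {

    static int countBytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // The contents of word.txt: two lines, each ending in '\n'
        String fileContents = "hello you\nhello me\n";
        System.out.println(countBytes(fileContents)); // prints 19
    }
}
```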
2. Map-Reduce Framework
Map-Reduce Framework    # 20 counters
    Map input records=2
    Map output records=4
    Map output bytes=35
    Map output materialized bytes=49
    Input split bytes=109
    Combine input records=0
    Combine output records=0
    Reduce input groups=3
    Reduce shuffle bytes=49
    Reduce input records=4
    Reduce output records=3
    Spilled Records=8
    Shuffled Maps =1
    Failed Shuffles=0
    Merged Map outputs=1
    GC time elapsed (ms)=59
    CPU time spent (ms)=0
    Physical memory (bytes) snapshot=0
    Virtual memory (bytes) snapshot=0
    Total committed heap usage (bytes)=242360320
Map input records=2
hello you
hello me
The input file contains exactly 2 lines, so map receives 2 input records.
Map output records=4
Our mapper emits one key-value pair for each word it reads, so the map task's output is:
<hello,1>, <you,1>, <hello,1>, <me,1>
exactly 4 records.
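How the 4 output records arise can be sketched in plain Java using the same StringTokenizer the mapper uses (the class and method names here are illustrative, not Hadoop API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Simulates how many records the mapper emits for the two input lines.
public class MapOutputCount {

    // Splits a line into whitespace-separated words, as the mapper does
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            tokens.add(itr.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String[] lines = {"hello you", "hello me"}; // one map() call per line
        int outputRecords = 0;
        for (String line : lines) {
            outputRecords += tokenize(line).size(); // one record per word
        }
        System.out.println(outputRecords); // prints 4
    }
}
```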
Reduce input records=4
The records output by map are exactly the records input to reduce, so this is also 4.
Reduce input groups=3
The concept of grouping will be covered in detail later. In short, the mapper's output records are grouped so that records with the same key end up in one group; after grouping we have
<hello,{1,1}> <me,{1}> <you,{1}>
exactly 3 groups.
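The grouping step can likewise be simulated in plain Java; a TreeMap stands in for the framework's sort-and-group phase (GroupSim is an illustrative name, not part of Hadoop):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Simulates grouping the map output <hello,1>, <you,1>, <hello,1>, <me,1> by key.
public class GroupSim {

    static Map<String, List<Integer>> group(String[] keys) {
        // TreeMap keeps keys sorted, mirroring the shuffle's sort order
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (String key : keys) {
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(1);
        }
        return groups;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> groups = group(new String[]{"hello", "you", "hello", "me"});
        System.out.println(groups);        // {hello=[1, 1], me=[1], you=[1]}
        System.out.println(groups.size()); // 3, matching Reduce input groups=3
    }
}
```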
Reduce output records=3
The output of the WordCount job is
hello 2
me 1
you 1
exactly 3 lines.
Combine input records=0, Combine output records=0
These belong to the combiner (local aggregation), which we will explain in detail later; they are 0 here because this job does not configure a combiner.
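Although combiners are explained in detail later, the local aggregation a combiner performs can be previewed with a short plain-Java sketch (CombinerSketch and localSum are illustrative names, not Hadoop API). With a combiner set, the 4 map output records would be merged into 3 before the shuffle:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketches the map-side aggregation a combiner would perform.
public class CombinerSketch {

    // Sums the counts for each key within a single map task's output
    static Map<String, Integer> localSum(String[] mapOutputKeys) {
        Map<String, Integer> out = new LinkedHashMap<>();
        for (String key : mapOutputKeys) {
            out.merge(key, 1, Integer::sum);
        }
        return out;
    }

    public static void main(String[] args) {
        // Combine input records would be 4, combine output records 3
        Map<String, Integer> combined = localSum(new String[]{"hello", "you", "hello", "me"});
        System.out.println(combined); // {hello=2, you=1, me=1}
    }
}
```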
II. Custom Counters
A counter is represented by a Counter object. Every counter belongs to a group: counters with the same group name (groupName) automatically belong to the same group. Each counter also has its own name (counterName), which distinguishes it from the other counters in that group.
A counter instance can be obtained as follows:
Counter counter = context.getCounter(groupName, counterName);
For example, suppose we want to count sensitive words, i.e. how many times sensitive words occur in a piece of text, and we treat "hello" as a sensitive word. Building on the WordCount example, we can modify TokenizerMapper as follows:
public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // StringTokenizer is a Java utility class that splits a string on whitespace
        StringTokenizer itr = new StringTokenizer(value.toString());
        // Custom counter
        String groupName = "Custom Group";
        String counterName = "Sensitive words";
        Counter counter = context.getCounter(groupName, counterName);
        // Each time a word appears, add 1 to its count
        while (itr.hasMoreTokens()) {
            String nextToken = itr.nextToken();
            if (nextToken.equals("hello")) { // treat "hello" as a sensitive word; increment on each occurrence
                counter.increment(1);
            }
            word.set(nextToken);
            context.write(word, one);
        }
    }
}
Running the WordCount example again, we can see our custom counter in the console output (now 39 counters in total):
Counters: 39
    File System Counters
        FILE: Number of bytes read=462
        FILE: Number of bytes written=541399
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=38
        HDFS: Number of bytes written=19
        HDFS: Number of read operations=15
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=6
    Map-Reduce Framework
        Map input records=2
        Map output records=4
        Map output bytes=35
        Map output materialized bytes=49
        Input split bytes=109
        Combine input records=0
        Combine output records=0
        Reduce input groups=3
        Reduce shuffle bytes=49
        Reduce input records=4
        Reduce output records=3
        Spilled Records=8
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=38
        CPU time spent (ms)=0
        Physical memory (bytes) snapshot=0
        Virtual memory (bytes) snapshot=0
        Total committed heap usage (bytes)=242360320
    Custom Group    # our custom group name
        Sensitive words=2    # our custom counter's value: "hello" appeared twice
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=19
    File Output Format Counters
        Bytes Written=19