5.3 Counters
Counters in Hadoop are somewhat like logs: they report statistics about what Hadoop did while a job was running.
In the WordCount run from earlier, the console printed the following (you can run the WordCount example again to see it):
Counters: 38
    File System Counters    # 10 counters
        FILE: Number of bytes read=462
        FILE: Number of bytes written=541399
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=38
        HDFS: Number of bytes written=19
        HDFS: Number of read operations=15
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=6
    Map-Reduce Framework    # 20 counters
        Map input records=2
        Map output records=4
        Map output bytes=35
        Map output materialized bytes=49
        Input split bytes=109
        Combine input records=0
        Combine output records=0
        Reduce input groups=3
        Reduce shuffle bytes=49
        Reduce input records=4
        Reduce output records=3
        Spilled Records=8
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=59
        CPU time spent (ms)=0
        Physical memory (bytes) snapshot=0
        Virtual memory (bytes) snapshot=0
        Total committed heap usage (bytes)=242360320
    Shuffle Errors    # 6 counters
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters    # 1 counter
        Bytes Read=19
    File Output Format Counters    # 1 counter
        Bytes Written=19
As the log shows, there are 38 counters in total, organized into 5 groups; names such as File System Counters are group names, and the groups contain 10, 20, 6, 1, and 1 counters respectively.
We do not care about every one of these 38 counters; below we focus on the most important ones.
I. Explanation of Selected Counters
1. File Input Format Counters
File Input Format Counters    # 1 counter
    Bytes Read=19
This means the total number of bytes read from the file in HDFS is 19.
Recall the contents of word.txt:
hello you
hello me
The letters total 5+3+5+2=15 bytes; adding the 2 spaces and the 2 newline characters (one at the end of each line) gives 19 bytes.
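The byte arithmetic above can be double-checked with a few lines of plain Java (the class name ByteCount is ours, purely for illustration):

```java
import java.nio.charset.StandardCharsets;

// Verifies that the two-line input file occupies exactly 19 bytes.
public class ByteCount {

    static int countBytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // The contents of word.txt: two lines, each ending in '\n'
        String fileContents = "hello you\nhello me\n";
        System.out.println(countBytes(fileContents)); // prints 19
    }
}
```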
2. Map-Reduce Framework
Map-Reduce Framework    # 20 counters
    Map input records=2
    Map output records=4
    Map output bytes=35
    Map output materialized bytes=49
    Input split bytes=109
    Combine input records=0
    Combine output records=0
    Reduce input groups=3
    Reduce shuffle bytes=49
    Reduce input records=4
    Reduce output records=3
    Spilled Records=8
    Shuffled Maps =1
    Failed Shuffles=0
    Merged Map outputs=1
    GC time elapsed (ms)=59
    CPU time spent (ms)=0
    Physical memory (bytes) snapshot=0
    Virtual memory (bytes) snapshot=0
    Total committed heap usage (bytes)=242360320
Map input records=2
hello you
hello me
The input file contains exactly 2 lines, so map receives 2 input records.
Map output records=4
Our mapper emits one key-value pair for each word it reads, so the map task's output is:
<hello,1>, <you,1>, <hello,1>, <me,1>
exactly 4 records.
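How the 4 output records arise can be sketched in plain Java using the same StringTokenizer the mapper uses (the class and method names here are illustrative, not Hadoop API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Simulates how many records the mapper emits for the two input lines.
public class MapOutputCount {

    // Splits a line into whitespace-separated words, as the mapper does
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            tokens.add(itr.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String[] lines = {"hello you", "hello me"}; // one map() call per line
        int outputRecords = 0;
        for (String line : lines) {
            outputRecords += tokenize(line).size(); // one record per word
        }
        System.out.println(outputRecords); // prints 4
    }
}
```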
Reduce input records=4
The records output by map are exactly the records input to reduce, so this is also 4.
Reduce input groups=3
The concept of grouping will be covered in detail later. In short, the mapper's output records are grouped so that records with the same key end up in one group; after grouping we have
<hello,{1,1}> <me,{1}> <you,{1}>
exactly 3 groups.
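The grouping step can likewise be simulated in plain Java; a TreeMap stands in for the framework's sort-and-group phase (GroupSim is an illustrative name, not part of Hadoop):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Simulates grouping the map output <hello,1>, <you,1>, <hello,1>, <me,1> by key.
public class GroupSim {

    static Map<String, List<Integer>> group(String[] keys) {
        // TreeMap keeps keys sorted, mirroring the shuffle's sort order
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (String key : keys) {
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(1);
        }
        return groups;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> groups = group(new String[]{"hello", "you", "hello", "me"});
        System.out.println(groups);        // {hello=[1, 1], me=[1], you=[1]}
        System.out.println(groups.size()); // 3, matching Reduce input groups=3
    }
}
```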
Reduce output records=3
The output of the WordCount job is
hello 2
me 1
you 1
exactly 3 lines.
Combine input records=0, Combine output records=0
These belong to the combiner (local aggregation), which we will explain in detail later; they are 0 here because this job does not configure a combiner.
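Although combiners are explained in detail later, the local aggregation a combiner performs can be previewed with a short plain-Java sketch (CombinerSketch and localSum are illustrative names, not Hadoop API). With a combiner set, the 4 map output records would be merged into 3 before the shuffle:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketches the map-side aggregation a combiner would perform.
public class CombinerSketch {

    // Sums the counts for each key within a single map task's output
    static Map<String, Integer> localSum(String[] mapOutputKeys) {
        Map<String, Integer> out = new LinkedHashMap<>();
        for (String key : mapOutputKeys) {
            out.merge(key, 1, Integer::sum);
        }
        return out;
    }

    public static void main(String[] args) {
        // Combine input records would be 4, combine output records 3
        Map<String, Integer> combined = localSum(new String[]{"hello", "you", "hello", "me"});
        System.out.println(combined); // {hello=2, you=1, me=1}
    }
}
```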
II. Custom Counters
A counter is represented by a Counter object. Every counter belongs to a group: counters with the same group name (groupName) automatically belong to the same group. Each counter also has its own name (counterName), which distinguishes it from the other counters in that group.
A counter instance can be obtained as follows:
Counter counter = context.getCounter(groupName, counterName);
For example, suppose we want to count sensitive words, i.e. how many times sensitive words occur in a piece of text, and we treat "hello" as a sensitive word. Building on the WordCount example, we can modify TokenizerMapper as follows:
public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // StringTokenizer is a Java utility class that splits a string on whitespace
        StringTokenizer itr = new StringTokenizer(value.toString());
        // Custom counter
        String groupName = "Custom Group";
        String counterName = "Sensitive words";
        Counter counter = context.getCounter(groupName, counterName);
        // Each time a word appears, add 1 to its count
        while (itr.hasMoreTokens()) {
            String nextToken = itr.nextToken();
            if (nextToken.equals("hello")) { // treat "hello" as a sensitive word; increment on each occurrence
                counter.increment(1);
            }
            word.set(nextToken);
            context.write(word, one);
        }
    }
}
Running the WordCount example again, we can see our custom counter in the console output (now 39 counters in total):
Counters: 39
    File System Counters
        FILE: Number of bytes read=462
        FILE: Number of bytes written=541399
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=38
        HDFS: Number of bytes written=19
        HDFS: Number of read operations=15
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=6
    Map-Reduce Framework
        Map input records=2
        Map output records=4
        Map output bytes=35
        Map output materialized bytes=49
        Input split bytes=109
        Combine input records=0
        Combine output records=0
        Reduce input groups=3
        Reduce shuffle bytes=49
        Reduce input records=4
        Reduce output records=3
        Spilled Records=8
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=38
        CPU time spent (ms)=0
        Physical memory (bytes) snapshot=0
        Virtual memory (bytes) snapshot=0
        Total committed heap usage (bytes)=242360320
    Custom Group    # our custom group name
        Sensitive words=2    # our custom counter's value: "hello" appeared twice
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=19
    File Output Format Counters
        Bytes Written=19