Hadoop Study Notes: Using MapReduce with PHP

cuixiaogang

Building on "Hadoop Study Notes: The Basics" and the general grasp of Hadoop gained there, it is time to put that knowledge into practice.

Core Principle

Hadoop itself is written in Java, but Hadoop Streaming (a standalone JAR) provides a generic mechanism:

  • The Mapper and Reducer do not need the Hadoop Java API; any executable program or script that reads standard input (stdin) and writes standard output (stdout) will do (PHP, Python, Shell, Perl, and so on)
  • The Hadoop framework takes care of:
    • Distributing the input data to the map nodes
    • Launching the Mapper script and feeding it the data via stdin
    • Reading the Mapper's stdout (which must follow the key\tvalue format) and performing the shuffle/sort
    • Feeding the sorted results to the Reducer script via stdin
    • Reading the Reducer's stdout and writing it to HDFS

Core Requirements

For PHP, as a scripting language, to work with Hadoop Streaming, two core requirements must be met:

  • The script must be executable: add execute permission (chmod +x) and name the PHP interpreter on the first line (e.g. #!/usr/bin/php)
  • The script must follow the I/O protocol (a minimal check is sketched after this list):
    • Mapper script: read input from stdin line by line and write the processed result to stdout in key\tvalue format (\t separates key and value)
    • Reducer script: read the Mapper output (already sorted by key) from stdin, aggregate it, and likewise write key\tvalue lines to stdout
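
A quick way to verify both requirements is the sketch below; mapper.php is a hypothetical script whose first line is #!/usr/bin/php, and the only details assumed from the list above are the chmod/shebang convention and the tab-separated output format:

chmod +x mapper.php
echo 'some input line' | ./mapper.php    # should print lines of the form: key<TAB>value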

A Practical Example

Requirements

  • The raw data is stored in HDFS in JSON format
  • Parse the JSON, group by field A, and work out the distinct tag values (field B) for each value of A (a sample record follows this list)
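
For illustration: assuming a raw record looks like the first line below (the field values are invented), the job should end up emitting one tab-separated line per A value, listing that A's distinct B tags:

{"A":"shop_001","B":["red","blue","red"],"createtime":1650000000}

shop_001	red,blue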

PHP Script

Note: if field B has too many distinct tag values, this approach breaks down. In the Reduce phase, the tag values for each A are accumulated in memory and only written to HDFS once all records for that A have been scanned, so a very large number of distinct B values can make the reducer fail with an out-of-memory error.

<?php

// Dispatch on the first CLI argument: "map" or "reduce".
switch (strtolower(trim($argv[1]))) {
    case "map":
        Map();
        break;
    case "reduce":
        Reduce();
        break;
}

function Map() {
    while (!feof(STDIN)) {
        $line = trim(fgets(STDIN));
        if (empty($line)) {
            continue;
        }
        $data = json_decode($line, TRUE);
        $A = isset($data['A']) ? $data['A'] : "";
        $B = isset($data['B']) ? $data['B'] : "";
        $createTime = isset($data['createtime']) ? $data['createtime'] : "";
        if (empty($A) || empty($B) || empty($createTime)) {
            continue;
        }

        // 1609430400 = 2021-01-01 00:00:00 (UTC+8); skip older records
        if ($createTime < 1609430400) {
            continue;
        }

        // Field B is an array of tags: deduplicate and join with commas
        $B = implode(',', array_values(array_unique($B)));

        // Emit key\tvalue, the format Hadoop Streaming expects for shuffle/sort
        echo sprintf("%s\t%s", $A, $B).PHP_EOL;
    }
}

function Reduce() {
    $prevA = "";
    $prevB = [];
    while (!feof(STDIN)) {
        $line = trim(fgets(STDIN));
        if (empty($line)) {
            continue;
        }
        list($A, $B) = explode("\t", $line);

        if ($prevA == "") {
            // First record: start accumulating for this key
            $prevA = $A;
            $prevB = [];
        } else if ($A != $prevA) {
            // Key changed: flush the previous key's distinct tag set
            echo sprintf("%s\t%s", $prevA, implode(',', array_values(array_unique($prevB)))).PHP_EOL;
            $prevA = $A;
            $prevB = [];
        }

        $prevB = array_values(array_unique(array_merge($prevB, explode(',', $B))));
    }

    // Flush the last key
    echo sprintf("%s\t%s", $prevA, implode(',', array_values(array_unique($prevB)))).PHP_EOL;
}

Shell Script

#!/bin/bash
source "/home/<username>/.bashrc"

INPUT="HDFS directory where the raw data is stored"
OUTPUT="HDFS directory for the MR results; ideally an empty (or not yet existing) directory"

# Clean up the result directory first
$HADOOP_HOME/bin/hadoop fs -rmr ${OUTPUT}

# Run the Streaming job. What the options mean:
#   mapred.job.max.map.running      caps the number of map tasks running concurrently for this job
#   mapred.job.max.reduce.running   caps the number of reduce tasks running concurrently for this job
#   yarn.app.mapreduce.am.log.level log level of the YARN application master
#   -input / -output                raw data directory / result directory
#   -mapper / -reducer              the map and reduce commands; the logic is in the PHP script above
#   -file                           ships test.php to every MR node
#   -numReduceTasks                 number of reduce tasks
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming*.jar \
    -D mapred.job.name="[job name]@[hadoop user]" \
    -D mapred.success.file.status=true \
    -D mapred.job.max.map.running=3000 \
    -D mapred.job.max.reduce.running=1000 \
    -D yarn.app.mapreduce.am.log.level=ERROR \
    -D mapred.compress.map.output=true \
    -D mapred.linerecordreader.maxlength=500000000 \
    -D stream.non.zero.exit.is.failure=false \
    -input "$INPUT" \
    -output "$OUTPUT" \
    -mapper "/usr/local/php7/bin/php test.php map" \
    -reducer "/usr/local/php7/bin/php test.php reduce" \
    -file ./test.php \
    -numReduceTasks 300

# Check whether the job completed successfully
$HADOOP_HOME/bin/hadoop fs -stat $OUTPUT/_SUCCESS
if [ $? -ne 0 ]; then
    echo "no hdfs data have done, pls check or waiting"
    exit 1
fi
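
Once the _SUCCESS marker exists, the results can be spot-checked directly from HDFS. A small sketch (part-* is the default naming of Streaming output files):

# Print the first few result lines; -text also decodes compressed output files
$HADOOP_HOME/bin/hadoop fs -text "$OUTPUT"/part-* | head -n 20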

Shell Log

22/02/03 10:42:49 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [./20230203_empty_name.php] [/home/xitong/software/hadoop-2.7.2U22/share/hadoop/tools/lib/hadoop-streaming-2.7.2U23.jar] /tmp/streamjob7350146701406364883.jar tmpDir=null
22/02/03 10:42:53 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
22/02/03 10:42:55 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library from the embedded binaries
22/02/03 10:42:55 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 1dd1dc2cf15665ed438e113fbfecc15772b71e70]
22/02/03 10:42:56 INFO mapred.FileInputFormat: set mapred.input.dir.error.pass : false
22/02/03 10:42:56 INFO mapred.FileInputFormat: Total input paths to process : 1000
22/02/03 10:42:56 INFO Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
22/02/03 10:42:56 INFO Configuration.deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
22/02/03 10:42:56 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.104.22:9866
22/02/03 10:42:56 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.104.93:9866
22/02/03 10:42:56 INFO net.NetworkTopology: Adding a new node: /default-rack/10.162.85.58:9866
22/02/03 10:42:56 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.136.120:9866
22/02/03 10:42:56 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.131.156:9866
22/02/03 10:42:56 INFO net.NetworkTopology: Adding a new node: /default-rack/10.162.94.223:9866
22/02/03 10:42:56 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.130.236:9866
22/02/03 10:42:56 INFO net.NetworkTopology: Adding a new node: /default-rack/10.162.90.56:9866
22/02/03 10:42:56 INFO net.NetworkTopology: Adding a new node: /default-rack/10.162.96.35:9866
22/02/03 10:42:56 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.130.140:9866
22/02/03 10:42:57 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.145.200:9866
22/02/03 10:42:57 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.179.164:9866
22/02/03 10:42:57 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.107.205:9866
22/02/03 10:42:57 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.155.7:9866
22/02/03 10:42:57 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.168.228:9866
22/02/03 10:42:57 INFO net.NetworkTopology: Adding a new node: /default-rack/10.162.97.10:9866
22/02/03 10:42:57 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.172.100:9866
22/02/03 10:42:57 INFO net.NetworkTopology: Adding a new node: /default-rack/10.162.94.15:9866
22/02/03 10:42:57 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.155.45:9866
22/02/03 10:42:57 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.155.30:9866
22/02/03 10:42:58 INFO net.NetworkTopology: Adding a new node: /default-rack/10.162.93.210:9866
22/02/03 10:42:58 INFO net.NetworkTopology: Adding a new node: /default-rack/10.162.81.42:9866
22/02/03 10:42:58 INFO net.NetworkTopology: Adding a new node: /default-rack/10.162.97.167:9866
22/02/03 10:42:58 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.109.89:9866
22/02/03 10:42:58 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.107.200:9866
22/02/03 10:42:58 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.130.139:9866
22/02/03 10:42:58 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.146.181:9866
22/02/03 10:42:58 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.145.207:9866
22/02/03 10:42:58 INFO net.NetworkTopology: Adding a new node: /default-rack/10.162.80.119:9866
22/02/03 10:42:58 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.120.222:9866
22/02/03 10:42:58 INFO net.NetworkTopology: Adding a new node: /default-rack/10.163.9.170:9866
22/02/03 10:42:58 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.165.49:9866
22/02/03 10:42:58 INFO net.NetworkTopology: Adding a new node: /default-rack/10.162.92.153:9866
22/02/03 10:42:58 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.168.47:9866
22/02/03 10:42:58 INFO net.NetworkTopology: Adding a new node: /default-rack/10.162.81.77:9866
22/02/03 10:42:58 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.154.10:9866
22/02/03 10:42:58 INFO net.NetworkTopology: Adding a new node: /default-rack/10.162.94.183:9866
22/02/03 10:42:58 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.151.77:9866
22/02/03 10:42:58 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.170.93:9866
22/02/03 10:42:58 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.132.9:9866
22/02/03 10:42:58 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.102.222:9866
22/02/03 10:42:58 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.119.234:9866
22/02/03 10:42:58 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.106.153:9866
22/02/03 10:42:58 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.141.40:9866
22/02/03 10:42:58 INFO net.NetworkTopology: Adding a new node: /default-rack/10.162.95.215:9866
22/02/03 10:42:58 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.107.228:9866
22/02/03 10:42:58 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.146.50:9866
22/02/03 10:42:58 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.140.214:9866
22/02/03 10:42:58 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.130.174:9866
22/02/03 10:42:58 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.120.141:9866
22/02/03 10:42:58 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.106.161:9866
22/02/03 10:42:59 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.103.53:9866
22/02/03 10:42:59 INFO net.NetworkTopology: Adding a new node: /default-rack/10.161.153.21:9866
22/02/03 10:42:59 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.163.16:9866
22/02/03 10:42:59 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.111.58:9866
22/02/03 10:42:59 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.130.73:9866
22/02/03 10:42:59 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.141.8:9866
22/02/03 10:42:59 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.110.181:9866
22/02/03 10:42:59 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.140.92:9866
22/02/03 10:42:59 INFO net.NetworkTopology: Adding a new node: /default-rack/10.162.91.33:9866
22/02/03 10:42:59 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.119.138:9866
22/02/03 10:42:59 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.168.75:9866
22/02/03 10:42:59 INFO net.NetworkTopology: Adding a new node: /default-rack/10.162.89.91:9866
22/02/03 10:42:59 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.140.9:9866
22/02/03 10:42:59 INFO net.NetworkTopology: Adding a new node: /default-rack/10.162.90.107:9866
22/02/03 10:43:01 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.127.80:9866
22/02/03 10:43:01 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.168.50:9866
22/02/03 10:43:01 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.132.7:9866
22/02/03 10:43:01 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.136.122:9866
22/02/03 10:43:01 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.168.14:9866
22/02/03 10:43:01 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.152.238:9866
22/02/03 10:43:01 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.167.243:9866
22/02/03 10:43:01 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.177.236:9866
22/02/03 10:43:01 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.141.52:9866
22/02/03 10:43:01 INFO net.NetworkTopology: Adding a new node: /default-rack/10.162.86.14:9866
22/02/03 10:43:02 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.106.123:9866
22/02/03 10:43:02 INFO net.NetworkTopology: Adding a new node: /default-rack/10.162.90.87:9866
22/02/03 10:43:02 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.119.154:9866
22/02/03 10:43:02 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.128.24:9866
22/02/03 10:43:02 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.155.227:9866
22/02/03 10:43:02 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.165.11:9866
22/02/03 10:43:02 INFO net.NetworkTopology: Adding a new node: /default-rack/10.162.89.55:9866
22/02/03 10:43:02 INFO net.NetworkTopology: Adding a new node: /default-rack/10.162.92.245:9866
22/02/03 10:43:02 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.140.135:9866
22/02/03 10:43:02 INFO net.NetworkTopology: Adding a new node: /default-rack/10.162.87.24:9866
22/02/03 10:43:02 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.111.236:9866
22/02/03 10:43:02 INFO net.NetworkTopology: Adding a new node: /default-rack/10.161.126.50:9866
22/02/03 10:43:03 INFO net.NetworkTopology: Adding a new node: /default-rack/10.162.80.22:9866
22/02/03 10:43:03 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.130.167:9866
22/02/03 10:43:03 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.105.138:9866
22/02/03 10:43:03 INFO net.NetworkTopology: Adding a new node: /default-rack/10.162.82.27:9866
22/02/03 10:43:03 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.120.154:9866
22/02/03 10:43:03 INFO net.NetworkTopology: Adding a new node: /default-rack/10.162.90.31:9866
22/02/03 10:43:03 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.174.151:9866
22/02/03 10:43:03 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.105.140:9866
22/02/03 10:43:03 INFO net.NetworkTopology: Adding a new node: /default-rack/10.162.98.219:9866
22/02/03 10:43:04 INFO net.NetworkTopology: Adding a new node: /default-rack/10.162.89.152:9866
22/02/03 10:43:04 INFO net.NetworkTopology: Adding a new node: /default-rack/10.162.94.227:9866
22/02/03 10:43:04 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.136.89:9866
22/02/03 10:43:04 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.105.32:9866
22/02/03 10:43:04 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.147.135:9866
22/02/03 10:43:04 INFO net.NetworkTopology: Adding a new node: /default-rack/10.162.84.216:9866
22/02/03 10:43:04 INFO net.NetworkTopology: Adding a new node: /default-rack/10.163.9.8:9866
22/02/03 10:43:04 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.110.153:9866
22/02/03 10:43:04 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.106.181:9866
22/02/03 10:43:05 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.149.153:9866
22/02/03 10:43:05 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.149.205:9866
22/02/03 10:43:05 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.146.169:9866
22/02/03 10:43:05 INFO net.NetworkTopology: Adding a new node: /default-rack/10.160.109.73:9866
22/02/03 10:43:05 INFO net.NetworkTopology: Adding a new node: /default-rack/10.162.96.53:9866
22/02/03 10:43:07 INFO mapreduce.JobSubmitter: number of splits:9000
22/02/03 10:43:07 INFO mapreduce.JobSubmitter: Reset distributed cache file xxxxxxxxxxxxxxxxxxxxxxxxxx/test.php replication to 20
22/02/03 10:43:07 INFO Configuration.deprecation: mapred.linerecordreader.maxlength is deprecated. Instead, use mapreduce.input.linerecordreader.line.maxlength
22/02/03 10:43:07 INFO Configuration.deprecation: mapred.job.max.reduce.running is deprecated. Instead, use mapreduce.job.running.reduce.limit
22/02/03 10:43:07 INFO Configuration.deprecation: mapred.compress.map.output is deprecated. Instead, use mapreduce.map.output.compress
22/02/03 10:43:07 INFO Configuration.deprecation: mapred.job.max.map.running is deprecated. Instead, use mapreduce.job.running.map.limit
22/02/03 10:43:07 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name
22/02/03 10:43:08 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1675284168475_79072
22/02/03 10:43:09 INFO mapred.YARNRunner: priority : submit job with pri null : null
22/02/03 10:43:09 INFO impl.YarnClientImpl: Submitted application application_1675284168475_79072
22/02/03 10:43:09 INFO mapreduce.Job: To kill this application: /usr/bin/hadoop/software/yarn/bin/yarn application -kill application_1675284168475_79072
22/02/03 10:43:09 INFO mapreduce.Job: The url to track the job: http://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx:8888/proxy/application_1675284168475_79072/
22/02/03 10:43:09 INFO mapreduce.Job: Running job: job_1675284168475_79072
22/02/03 10:43:37 INFO mapreduce.Job: Job job_1675284168475_79072 running in uber mode : false
22/02/03 10:43:37 INFO mapreduce.Job: map 0% reduce 0%
22/02/03 10:43:51 INFO mapreduce.Job: map 1% reduce 0%
22/02/03 10:43:52 INFO mapreduce.Job: map 4% reduce 0%
22/02/03 10:43:53 INFO mapreduce.Job: map 10% reduce 0%
22/02/03 10:43:54 INFO mapreduce.Job: map 15% reduce 0%
22/02/03 10:43:56 INFO mapreduce.Job: map 20% reduce 0%
22/02/03 10:43:57 INFO mapreduce.Job: map 24% reduce 0%
22/02/03 10:44:00 INFO mapreduce.Job: map 30% reduce 0%
22/02/03 10:44:01 INFO mapreduce.Job: map 31% reduce 0%
22/02/03 10:44:02 INFO mapreduce.Job: map 32% reduce 0%
22/02/03 10:44:05 INFO mapreduce.Job: map 39% reduce 0%
22/02/03 10:44:06 INFO mapreduce.Job: map 42% reduce 0%
22/02/03 10:44:07 INFO mapreduce.Job: map 46% reduce 0%
22/02/03 10:44:08 INFO mapreduce.Job: map 50% reduce 0%
22/02/03 10:44:09 INFO mapreduce.Job: map 53% reduce 0%
22/02/03 10:44:10 INFO mapreduce.Job: map 55% reduce 0%
22/02/03 10:44:16 INFO mapreduce.Job: map 58% reduce 0%
22/02/03 10:44:17 INFO mapreduce.Job: map 68% reduce 0%
22/02/03 10:44:18 INFO mapreduce.Job: map 71% reduce 0%
22/02/03 10:44:21 INFO mapreduce.Job: map 74% reduce 0%
22/02/03 10:44:22 INFO mapreduce.Job: map 79% reduce 0%
22/02/03 10:44:23 INFO mapreduce.Job: map 82% reduce 0%
22/02/03 10:44:24 INFO mapreduce.Job: map 85% reduce 0%
22/02/03 10:44:25 INFO mapreduce.Job: map 88% reduce 0%
22/02/03 10:44:26 INFO mapreduce.Job: map 91% reduce 0%
22/02/03 10:44:27 INFO mapreduce.Job: map 93% reduce 0%
22/02/03 10:44:28 INFO mapreduce.Job: map 95% reduce 0%
22/02/03 10:44:29 INFO mapreduce.Job: map 97% reduce 0%
22/02/03 10:44:30 INFO mapreduce.Job: map 98% reduce 0%
22/02/03 10:44:32 INFO mapreduce.Job: map 99% reduce 0%
22/02/03 10:44:37 INFO mapreduce.Job: map 100% reduce 3%
22/02/03 10:44:38 INFO mapreduce.Job: map 100% reduce 14%
22/02/03 10:44:39 INFO mapreduce.Job: map 100% reduce 23%
22/02/03 10:44:41 INFO mapreduce.Job: map 100% reduce 31%
22/02/03 10:44:42 INFO mapreduce.Job: map 100% reduce 33%
22/02/03 10:55:06 INFO mapreduce.Job: map 100% reduce 34%
22/02/03 10:57:16 INFO mapreduce.Job: map 100% reduce 33%
22/02/03 10:57:17 INFO mapreduce.Job: map 100% reduce 34%
22/02/03 10:58:09 INFO mapreduce.Job: map 100% reduce 35%
22/02/03 10:58:11 INFO mapreduce.Job: map 100% reduce 36%
22/02/03 11:04:44 INFO mapreduce.Job: map 100% reduce 85%
22/02/03 11:04:45 INFO mapreduce.Job: map 100% reduce 99%
22/02/03 11:04:47 INFO mapreduce.Job: map 100% reduce 100%
22/02/03 11:05:48 INFO mapreduce.Job: Job job_1675284168475_79072 completed successfully
22/02/03 11:05:48 INFO Configuration.deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
22/02/03 11:05:48 INFO mapreduce.Job: Output: hdfs://XXXXXXXXXXXXXXXXXXXXXXXXXXXX/目录
22/02/03 11:05:51 INFO mapreduce.Job: Counters: 71
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=2362738655
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=0
HDFS: Number of bytes written=0
HDFS: Number of read operations=0
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
HDFSOLD: Number of bytes read=1195516955142
HDFSOLD: Number of bytes written=4385728
HDFSOLD: Number of read operations=27900
HDFSOLD: Number of large read operations=0
HDFSOLD: Number of write operations=600
Job Counters
Failed map tasks=14
Failed reduce tasks=1
Killed map tasks=16
Launched map tasks=9024
Launched reduce tasks=301
Other local map tasks=19
Data-local map tasks=6154
Rack-local map tasks=2852
Total time spent by all maps in occupied slots (ms)=203848430
Total time spent by all reduces in occupied slots (ms)=719666142
Total time spent by all map tasks (ms)=101924215
Total time spent by all reduce tasks (ms)=359833071
Total vcore-milliseconds taken by all map tasks=101924215
Total vcore-milliseconds taken by all reduce tasks=359833071
Total megabyte-milliseconds taken by all map tasks=156555594240
Total megabyte-milliseconds taken by all reduce tasks=736938129408
Map-Reduce Framework
Map input records=432106565
Map output records=128992
Map output bytes=4385728
Map output materialized bytes=40796982
Map read wallclock=9905444
Map write wallclock=1215619
Map task wallclock=51319692
Input split bytes=1413000
Combine input records=0
Combine output records=0
Reduce input groups=128992
Reduce shuffle bytes=40796982
Reduce input records=128992
Reduce output records=128992
Reduce read wallclock=2117
Reduce write wallclock=131111
Reduce task wallclock=270540
Reduce copy wallclock=357407719
Reduce sort wallclock=33886
Spilled Records=128992
Shuffled Maps =2700000
Failed Shuffles=1798
Merged Map outputs=2700000
GC time elapsed (ms)=1074185
CPU time spent (ms)=50583180
Physical memory (bytes) snapshot=9056058630144
Virtual memory (bytes) snapshot=30749134852096
Total committed heap usage (bytes)=11166241259520
Peak Map Physical memory (bytes)=1035931648
Peak Map Virtual memory (bytes)=3574439936
Peak Reduce Physical memory (bytes)=788824064
Peak Reduce Virtual memory (bytes)=3807383552
Shuffle Errors
BAD_ID=12
CONNECTION=0
IO_ERROR=1772
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=0
22/02/03 11:05:51 INFO mapreduce.Job: Write counters to _SUCCESS successful
22/02/03 11:05:51 INFO streaming.StreamJob: Output directory: XXXXXXXXXXXXXXXXXXXXXXXXXXXX/目录

Monitoring and Troubleshooting

The log above contains this line:

22/02/03 10:43:09 INFO mapreduce.Job: The url to track the job: http://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx:8888/proxy/application_1675284168475_79072/

This web address can be used to monitor the running state of the MR job.
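
The same information can also be pulled from the command line. A sketch, using the application id printed in the log above (yarn logs requires log aggregation to be enabled on the cluster):

# Overall state and progress of the application
yarn application -status application_1675284168475_79072

# Aggregated task logs, available once the application has finished
yarn logs -applicationId application_1675284168475_79072 | less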

[Figure: job progress overview]

[Figure: Map/Reduce task overview]

[Figure: Map/Reduce record count statistics]