Objective:
Configure Kettle to submit jobs to a Spark cluster.
Environment:
Spark History Server:
172.16.1.126
Spark Gateway:
172.16.1.124
172.16.1.125
172.16.1.126
172.16.1.127
PDI:
172.16.1.105
Hadoop version: CDH 6.3.1
Spark version: 2.4.0-cdh6.3.1
PDI version: 8.3
For connecting Kettle to CDH, see "https://wxy0327.blog.csdn.net/article/details/106406702".
Configuration steps:
1. Copy the Spark library files from CDH to the PDI host
# Run on 172.16.1.126
cd /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/spark
scp -r * 172.16.1.105:/root/spark/
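After the copy, the tree on the PDI host should contain at least the launcher scripts, jars, and conf directories. A quick check (a sketch, assuming the target path above):

```shell
# On 172.16.1.105: verify the copied Spark tree has the expected layout.
ls -d /root/spark/bin /root/spark/jars /root/spark/conf
```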
2. Configure Spark for Kettle
All of the following operations are performed on 172.16.1.105 as the root user.
(1) Back up the original configuration files
cp spark-defaults.conf spark-defaults.conf.bak
cp spark-env.sh spark-env.sh.bak
(2) Edit the spark-defaults.conf file
vim /root/spark/conf/spark-defaults.conf
Contents:
spark.yarn.archive=hdfs://manager:8020/user/spark/lib/spark_jars.zip
spark.hadoop.yarn.timeline-service.enabled=false
spark.eventLog.enabled=true
spark.eventLog.dir=hdfs://manager:8020/user/spark/applicationHistory
spark.yarn.historyServer.address=http://node2:18088
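Note that spark.yarn.archive points at a spark_jars.zip archive in HDFS, which must exist before any job is submitted. If your cluster does not already have it, it can be built from the Spark jars directory; the following is a sketch only, and the paths and host (172.16.1.126) are assumptions based on this environment:

```shell
# Build spark_jars.zip from the CDH Spark jars and publish it to the
# HDFS path referenced by spark.yarn.archive. Run on a host with an
# HDFS client, e.g. 172.16.1.126.
cd /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/spark/jars
zip -q -r /tmp/spark_jars.zip *
hdfs dfs -mkdir -p /user/spark/lib
hdfs dfs -put -f /tmp/spark_jars.zip /user/spark/lib/
```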
(3) Edit the spark-env.sh file
vim /root/spark/conf/spark-env.sh
Contents:
#!/usr/bin/env bash
HADOOP_CONF_DIR=/root/data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/cdh61
SPARK_HOME=/root/spark
(4) Edit the core-site.xml file
vim /root/data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/cdh61/core-site.xml
Uncomment the following block:
<property>
  <name>net.topology.script.file.name</name>
  <value>/etc/hadoop/conf.cloudera.yarn/topology.py</value>
</property>
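With the files in place, a quick sanity check on 172.16.1.105 (a sketch, assuming the paths used above) is to ask the copied Spark distribution for its version:

```shell
# Should report the CDH Spark version (2.4.0-cdh6.3.1) if the libraries
# copied in step 1 are usable on this host.
/root/spark/bin/spark-submit --version
```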
Submitting a Spark job:
1. Modify the Spark sample shipped with PDI
cp /root/data-integration/samples/jobs/Spark\ Submit/Spark\ submit.kjb /root/big_data/
Open the /root/big_data/Spark\ submit.kjb file in Kettle, as shown in Figure 1.
Edit the Spark Submit Sample job entry, as shown in Figure 2.
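For reference, the "Spark PI" job entry effectively builds a spark-submit call. A rough command-line equivalent, reconstructed from the submission log (the deploy mode follows the "yarn-cluster" warning in the log, and the final partitions argument "10" is an assumption, not taken from the sample):

```shell
# Approximate equivalent of the Spark Submit Sample job entry.
/root/spark/bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  /root/spark/examples/jars/spark-examples_2.11-2.4.0-cdh6.3.1.jar 10
```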
2. Save and run the job
The log is as follows:
2020/06/10 10:12:19 - Spoon - Starting job...
2020/06/10 10:12:19 - Spark submit - Start of job execution
2020/06/10 10:12:19 - Spark submit - Starting entry [Spark PI]
2020/06/10 10:12:19 - Spark PI - Submitting Spark Script
2020/06/10 10:12:20 - Spark PI - Warning: Master yarn-cluster is deprecated since 2.0. Please use master "yarn" with specified deploy mode instead.
2020/06/10 10:12:21 - Spark PI - 20/06/10 10:12:21 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2020/06/10 10:12:21 - Spark PI - 20/06/10 10:12:21 INFO client.RMProxy: Connecting to ResourceManager at manager/172.16.1.124:8032
2020/06/10 10:12:21 - Spark PI - 20/06/10 10:12:21 INFO yarn.Client: Requesting a new application from cluster with 3 NodeManagers
2020/06/10 10:12:21 - Spark PI - 20/06/10 10:12:21 INFO conf.Configuration: resource-types.xml not found
2020/06/10 10:12:21 - Spark PI - 20/06/10 10:12:21 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2020/06/10 10:12:21 - Spark PI - 20/06/10 10:12:21 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (2048 MB per container)
2020/06/10 10:12:21 - Spark PI - 20/06/10 10:12:21 INFO yarn.Client: Will allocate AM container, with 1408 MB memory including 384 MB overhead
2020/06/10 10:12:21 - Spark PI - 20/06/10 10:12:21 INFO yarn.Client: Setting up container launch context for our AM
2020/06/10 10:12:21 - Spark PI - 20/06/10 10:12:21 INFO yarn.Client: Setting up the launch environment for our AM container
2020/06/10 10:12:22 - Spark PI - 20/06/10 10:12:22 INFO yarn.Client: Preparing resources for our AM container
2020/06/10 10:12:22 - Spark PI - 20/06/10 10:12:22 INFO yarn.Client: Source and destination file systems are the same. Not copying hdfs://manager:8020/user/spark/lib/spark_jars.zip
2020/06/10 10:12:22 - Spark PI - 20/06/10 10:12:22 INFO yarn.Client: Uploading resource file:/root/spark/examples/jars/spark-examples_2.11-2.4.0-cdh6.3.1.jar -> hdfs://manager:8020/user/root/.sparkStaging/application_1591323999364_0060/spark-examples_2.11-2.4.0-cdh6.3.1.jar
2020/06/10 10:12:22 - Spark PI - 20/06/10 10:12:22 INFO yarn.Client: Uploading resource file:/tmp/spark-281973dd-8233-4f12-b416-36d28b74159c/__spark_conf__2533521329006469303.zip -> hdfs://manager:8020/user/root/.sparkStaging/application_1591323999364_0060/__spark_conf__.zip
2020/06/10 10:12:22 - Spark PI - 20/06/10 10:12:22 INFO spark.SecurityManager: Changing view acls to: root
2020/06/10 10:12:22 - Spark PI - 20/06/10 10:12:22 INFO spark.SecurityManager: Changing modify acls to: root
2020/06/10 10:12:22 - Spark PI - 20/06/10 10:12:22 INFO spark.SecurityManager: Changing view acls groups to:
2020/06/10 10:12:22 - Spark PI - 20/06/10 10:12:22 INFO spark.SecurityManager: Changing modify acls groups to:
2020/06/10 10:12:22 - Spark PI - 20/06/10 10:12:22 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
2020/06/10 10:12:22 - Spark PI - 20/06/10 10:12:22 INFO conf.HiveConf: Found configuration file file:/root/data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/cdh61/hive-site.xml
2020/06/10 10:12:22 - Spark PI - 20/06/10 10:12:22 INFO security.YARNHadoopDelegationTokenManager: Attempting to load user's ticket cache.
2020/06/10 10:12:22 - Spark PI - 20/06/10 10:12:22 INFO yarn.Client: Submitting application application_1591323999364_0060 to ResourceManager
2020/06/10 10:12:22 - Spark PI - 20/06/10 10:12:22 INFO impl.YarnClientImpl: Submitted application application_1591323999364_0060
2020/06/10 10:12:23 - Spark PI - 20/06/10 10:12:23 INFO yarn.Client: Application report for application_1591323999364_0060 (state: ACCEPTED)
2020/06/10 10:12:23 - Spark PI - 20/06/10 10:12:23 INFO yarn.Client:
2020/06/10 10:12:23 - Spark PI - client token: N/A
2020/06/10 10:12:23 - Spark PI - diagnostics: AM container is launched, waiting for AM container to Register with RM
2020/06/10 10:12:23 - Spark PI - ApplicationMaster host: N/A
2020/06/10 10:12:23 - Spark PI - ApplicationMaster RPC port: -1
2020/06/10 10:12:23 - Spark PI - queue: root.users.root
2020/06/10 10:12:23 - Spark PI - start time: 1591755142818
2020/06/10 10:12:23 - Spark PI - final status: UNDEFINED
2020/06/10 10:12:23 - Spark PI - tracking URL: http://manager:8088/proxy/application_1591323999364_0060/
2020/06/10 10:12:24 - Spark submit - Starting entry [Success]
2020/06/10 10:12:24 - Spark submit - Finished job entry [Success] (result=[true])
2020/06/10 10:12:24 - Spark submit - Finished job entry [Spark PI] (result=[true])
2020/06/10 10:12:24 - Spark submit - Job execution finished
2020/06/10 10:12:24 - Spoon - Job has ended.
The Spark History Server Web UI is shown in Figure 3.
Click "application_1591323999364_0061", as shown in Figure 4.
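Besides the web UIs, the submitted application can also be checked from the command line. A sketch, run on any host with a YARN client, using the application ID from the log above:

```shell
# Query the application's state, final status, and tracking URL by ID.
yarn application -status application_1591323999364_0060
```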
References:
- https://help.pentaho.com/Documentation/8.3/Products/Spark_Submit
- https://blog.csdn.net/wzy0623/article/details/51097471
Reposted from: https://blog.csdn.net/wzy0623/article/details/106660089