
Kettle and Hadoop (9): Submitting Spark Jobs


Objective:
Configure Kettle to submit jobs to a Spark cluster.

Environment:
Spark History Server:
172.16.1.126

Spark Gateway:
172.16.1.124
172.16.1.125
172.16.1.126
172.16.1.127

PDI:
172.16.1.105

Hadoop version: CDH 6.3.1
Spark version: 2.4.0-cdh6.3.1
PDI version: 8.3

For connecting Kettle to CDH, see "https://wxy0327.blog.csdn.net/article/details/106406702". Configuration steps:
1. Copy the Spark library files from CDH to the PDI host

  # Run on 172.16.1.126
  cd /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/spark
  scp -r * 172.16.1.105:/root/spark/

2. Configure Spark for Kettle
All of the following steps are performed on 172.16.1.105 as root.
(1) Back up the original configuration files

  cd /root/spark/conf
  cp spark-defaults.conf spark-defaults.conf.bak
  cp spark-env.sh spark-env.sh.bak

(2) Edit the spark-defaults.conf file

vim /root/spark/conf/spark-defaults.conf

Set its contents to:

  spark.yarn.archive=hdfs://manager:8020/user/spark/lib/spark_jars.zip
  spark.hadoop.yarn.timeline-service.enabled=false
  spark.eventLog.enabled=true
  spark.eventLog.dir=hdfs://manager:8020/user/spark/applicationHistory
  spark.yarn.historyServer.address=http://node2:18088
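
The spark.yarn.archive setting points at a zip file on HDFS, but the steps above do not show how that file is created. The following is a minimal sketch of one way to build and upload it, not the author's procedure. It assumes the jars are in /root/spark/jars, that the zip and hdfs commands are available on the host, and that the current user may write to /user/spark/lib; the variable names, the /tmp staging path and the DRY_RUN switch are hypothetical.

```shell
#!/bin/sh
# Sketch: build and upload the archive named by spark.yarn.archive.
# Assumptions (not from the article): jars are in /root/spark/jars, and the
# zip and hdfs commands are available on the host where this runs.

SPARK_JARS_DIR="${SPARK_JARS_DIR:-/root/spark/jars}"
LOCAL_ZIP="${LOCAL_ZIP:-/tmp/spark_jars.zip}"
HDFS_DIR="/user/spark/lib"   # directory part of spark.yarn.archive

# Print each command; execute it only when DRY_RUN is set to 0.
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  fi
}

publish_spark_archive() {
  # -j drops directory prefixes so the jars sit at the root of the zip.
  run zip -q -j "$LOCAL_ZIP" "$SPARK_JARS_DIR"/*.jar &&
  run hdfs dfs -mkdir -p "$HDFS_DIR" &&
  run hdfs dfs -put -f "$LOCAL_ZIP" "$HDFS_DIR/"
}

# Dry run by default: this call only prints the three commands.
publish_spark_archive
```

Running the script with DRY_RUN=0 executes the commands instead of only printing them. The jars are placed at the root of the zip because Spark expects the archive to contain the jar files in its root directory.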

(3) Edit the spark-env.sh file

vim /root/spark/conf/spark-env.sh

Set its contents to:

  #!/usr/bin/env bash
  HADOOP_CONF_DIR=/root/data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/cdh61
  SPARK_HOME=/root/spark
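
Spark on YARN reads the Hadoop client configuration from HADOOP_CONF_DIR, so it can help to confirm that the directory holds the expected site files before submitting anything. The check below is a small sketch; the function name and the list of required files are assumptions, not something stated in the article.

```shell
#!/bin/sh
# Sketch: verify that a directory looks like a usable HADOOP_CONF_DIR.
# The list of required files is an assumption, not taken from the article.

check_hadoop_conf_dir() {
  conf_dir="$1"
  missing=0
  for f in core-site.xml hdfs-site.xml yarn-site.xml; do
    if [ ! -f "$conf_dir/$f" ]; then
      echo "missing: $conf_dir/$f" >&2
      missing=1
    fi
  done
  # Returns 0 when all files exist, 1 otherwise.
  return "$missing"
}

# Example call with the path used in spark-env.sh:
# check_hadoop_conf_dir /root/data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/cdh61
```

The function reports each missing file on standard error and returns a non-zero status if any is absent, so it can be chained with && in front of a submit command.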

(4) Edit the core-site.xml file

vim /root/data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/cdh61/core-site.xml

Uncomment the following property:

  <property>
    <name>net.topology.script.file.name</name>
    <value>/etc/hadoop/conf.cloudera.yarn/topology.py</value>
  </property>

Submitting the Spark job:
1. Modify the Spark sample that ships with PDI
 

cp /root/data-integration/samples/jobs/Spark\ Submit/Spark\ submit.kjb /root/big_data/

Open /root/big_data/Spark\ submit.kjb in Kettle, as shown in Figure 1.

Figure 1

Edit the Spark Submit Sample job entry, as shown in Figure 2.

Figure 2
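
For orientation, the job entry wraps a spark-submit call. The sketch below prints a command line that is roughly equivalent to the "Spark PI" entry seen in the log. The jar path comes from the log output; the class name org.apache.spark.examples.SparkPi and the trailing argument 10 are the usual values for Spark's Pi example and are assumptions here, not settings read from Figure 2. It uses "--master yarn --deploy-mode cluster", the form that Spark's own deprecation warning for "yarn-cluster" recommends.

```shell
#!/bin/sh
# Sketch of a spark-submit call roughly equivalent to the "Spark PI" entry.
# Assumptions: the class name and the trailing argument (10) are the usual
# SparkPi example values, not settings read from Figure 2.

SPARK_HOME="${SPARK_HOME:-/root/spark}"
EXAMPLES_JAR="$SPARK_HOME/examples/jars/spark-examples_2.11-2.4.0-cdh6.3.1.jar"

# Print the command line instead of running it, so it can be reviewed first.
spark_pi_command() {
  printf '%s ' \
    "$SPARK_HOME/bin/spark-submit" \
    --master yarn \
    --deploy-mode cluster \
    --class org.apache.spark.examples.SparkPi \
    "$EXAMPLES_JAR" \
    10
  printf '\n'
}

spark_pi_command
```

Printing the command rather than executing it keeps the sketch safe to run on a host without Spark; the printed line can be pasted into a shell on the PDI host once it has been checked.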

2. Save and run the job

The log output is as follows:

  2020/06/10 10:12:19 - Spoon - Starting job...
  2020/06/10 10:12:19 - Spark submit - Start of job execution
  2020/06/10 10:12:19 - Spark submit - Starting entry [Spark PI]
  2020/06/10 10:12:19 - Spark PI - Submitting Spark Script
  2020/06/10 10:12:20 - Spark PI - Warning: Master yarn-cluster is deprecated since 2.0. Please use master "yarn" with specified deploy mode instead.
  2020/06/10 10:12:21 - Spark PI - 20/06/10 10:12:21 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
  2020/06/10 10:12:21 - Spark PI - 20/06/10 10:12:21 INFO client.RMProxy: Connecting to ResourceManager at manager/172.16.1.124:8032
  2020/06/10 10:12:21 - Spark PI - 20/06/10 10:12:21 INFO yarn.Client: Requesting a new application from cluster with 3 NodeManagers
  2020/06/10 10:12:21 - Spark PI - 20/06/10 10:12:21 INFO conf.Configuration: resource-types.xml not found
  2020/06/10 10:12:21 - Spark PI - 20/06/10 10:12:21 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
  2020/06/10 10:12:21 - Spark PI - 20/06/10 10:12:21 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (2048 MB per container)
  2020/06/10 10:12:21 - Spark PI - 20/06/10 10:12:21 INFO yarn.Client: Will allocate AM container, with 1408 MB memory including 384 MB overhead
  2020/06/10 10:12:21 - Spark PI - 20/06/10 10:12:21 INFO yarn.Client: Setting up container launch context for our AM
  2020/06/10 10:12:21 - Spark PI - 20/06/10 10:12:21 INFO yarn.Client: Setting up the launch environment for our AM container
  2020/06/10 10:12:22 - Spark PI - 20/06/10 10:12:22 INFO yarn.Client: Preparing resources for our AM container
  2020/06/10 10:12:22 - Spark PI - 20/06/10 10:12:22 INFO yarn.Client: Source and destination file systems are the same. Not copying hdfs://manager:8020/user/spark/lib/spark_jars.zip
  2020/06/10 10:12:22 - Spark PI - 20/06/10 10:12:22 INFO yarn.Client: Uploading resource file:/root/spark/examples/jars/spark-examples_2.11-2.4.0-cdh6.3.1.jar -> hdfs://manager:8020/user/root/.sparkStaging/application_1591323999364_0060/spark-examples_2.11-2.4.0-cdh6.3.1.jar
  2020/06/10 10:12:22 - Spark PI - 20/06/10 10:12:22 INFO yarn.Client: Uploading resource file:/tmp/spark-281973dd-8233-4f12-b416-36d28b74159c/__spark_conf__2533521329006469303.zip -> hdfs://manager:8020/user/root/.sparkStaging/application_1591323999364_0060/__spark_conf__.zip
  2020/06/10 10:12:22 - Spark PI - 20/06/10 10:12:22 INFO spark.SecurityManager: Changing view acls to: root
  2020/06/10 10:12:22 - Spark PI - 20/06/10 10:12:22 INFO spark.SecurityManager: Changing modify acls to: root
  2020/06/10 10:12:22 - Spark PI - 20/06/10 10:12:22 INFO spark.SecurityManager: Changing view acls groups to:
  2020/06/10 10:12:22 - Spark PI - 20/06/10 10:12:22 INFO spark.SecurityManager: Changing modify acls groups to:
  2020/06/10 10:12:22 - Spark PI - 20/06/10 10:12:22 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
  2020/06/10 10:12:22 - Spark PI - 20/06/10 10:12:22 INFO conf.HiveConf: Found configuration file file:/root/data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/cdh61/hive-site.xml
  2020/06/10 10:12:22 - Spark PI - 20/06/10 10:12:22 INFO security.YARNHadoopDelegationTokenManager: Attempting to load user's ticket cache.
  2020/06/10 10:12:22 - Spark PI - 20/06/10 10:12:22 INFO yarn.Client: Submitting application application_1591323999364_0060 to ResourceManager
  2020/06/10 10:12:22 - Spark PI - 20/06/10 10:12:22 INFO impl.YarnClientImpl: Submitted application application_1591323999364_0060
  2020/06/10 10:12:23 - Spark PI - 20/06/10 10:12:23 INFO yarn.Client: Application report for application_1591323999364_0060 (state: ACCEPTED)
  2020/06/10 10:12:23 - Spark PI - 20/06/10 10:12:23 INFO yarn.Client:
  2020/06/10 10:12:23 - Spark PI -      client token: N/A
  2020/06/10 10:12:23 - Spark PI -      diagnostics: AM container is launched, waiting for AM container to Register with RM
  2020/06/10 10:12:23 - Spark PI -      ApplicationMaster host: N/A
  2020/06/10 10:12:23 - Spark PI -      ApplicationMaster RPC port: -1
  2020/06/10 10:12:23 - Spark PI -      queue: root.users.root
  2020/06/10 10:12:23 - Spark PI -      start time: 1591755142818
  2020/06/10 10:12:23 - Spark PI -      final status: UNDEFINED
  2020/06/10 10:12:23 - Spark PI -      tracking URL: http://manager:8088/proxy/application_1591323999364_0060/
  2020/06/10 10:12:24 - Spark submit - Starting entry [Success]
  2020/06/10 10:12:24 - Spark submit - Finished job entry [Success] (result=[true])
  2020/06/10 10:12:24 - Spark submit - Finished job entry [Spark PI] (result=[true])
  2020/06/10 10:12:24 - Spark submit - Job execution finished
  2020/06/10 10:12:24 - Spoon - Job has ended.
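
The log ends while the application is still in the ACCEPTED state with final status UNDEFINED, so the outcome has to be checked on the cluster side. As a small sketch, the two helper functions below extract the YARN application ID and the tracking URL from such a log read on standard input. The function names and the log file name in the usage comment are hypothetical.

```shell
#!/bin/sh
# Sketch: pull the YARN application ID and tracking URL out of a job log.
# Both functions read the log text on standard input.

# First application ID that appears in the log.
app_id_from_log() {
  grep -o 'application_[0-9][0-9]*_[0-9][0-9]*' | head -n 1
}

# Value of the first "tracking URL:" line, if present.
tracking_url_from_log() {
  sed -n 's/.*tracking URL: *//p' | head -n 1
}

# Example usage (kettle.log is a hypothetical file holding the log above):
#   id=$(app_id_from_log < kettle.log)
#   yarn application -status "$id"   # shows the state and final status
```

With the ID in hand, yarn application -status reports whether the application eventually finished and with which final status.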

The Spark History Server Web UI is shown in Figure 3.

Figure 3

Click "application_1591323999364_0061" to open the view shown in Figure 4.

Figure 4

Reposted from: https://blog.csdn.net/wzy0623/article/details/106660089