Objective:
Configure Kettle to submit jobs to a Spark cluster.
Environment:
Spark History Server:
172.16.1.126
Spark Gateway:
172.16.1.124
172.16.1.125
172.16.1.126
172.16.1.127
PDI:
172.16.1.105
Hadoop version: CDH 6.3.1
Spark version: 2.4.0-cdh6.3.1
PDI version: 8.3
For connecting Kettle to CDH, see "https://wxy0327.blog.csdn.net/article/details/106406702".
Configuration steps:
1. Copy the Spark library files from CDH to the PDI host
# Run on 172.16.1.126
cd /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/spark
scp -r * 172.16.1.105:/root/spark/
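After the copy, the tree on the PDI host should contain at least the launcher scripts, jars, and conf directories. A quick check (a sketch, assuming the target path above):

```shell
# On 172.16.1.105: verify the copied Spark tree has the expected layout.
ls -d /root/spark/bin /root/spark/jars /root/spark/conf
```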
2. Configure Spark for Kettle
All of the following operations are performed on 172.16.1.105 as the root user.
(1) Back up the original configuration files
cp spark-defaults.conf spark-defaults.conf.bak
cp spark-env.sh spark-env.sh.bak
(2) Edit the spark-defaults.conf file
vim /root/spark/conf/spark-defaults.conf
Contents:
spark.yarn.archive=hdfs://manager:8020/user/spark/lib/spark_jars.zip
spark.hadoop.yarn.timeline-service.enabled=false
spark.eventLog.enabled=true
spark.eventLog.dir=hdfs://manager:8020/user/spark/applicationHistory
spark.yarn.historyServer.address=http://node2:18088
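Note that spark.yarn.archive points at a spark_jars.zip archive in HDFS, which must exist before any job is submitted. If your cluster does not already have it, it can be built from the Spark jars directory; the following is a sketch only, and the paths and host (172.16.1.126) are assumptions based on this environment:

```shell
# Build spark_jars.zip from the CDH Spark jars and publish it to the
# HDFS path referenced by spark.yarn.archive. Run on a host with an
# HDFS client, e.g. 172.16.1.126.
cd /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/spark/jars
zip -q -r /tmp/spark_jars.zip *
hdfs dfs -mkdir -p /user/spark/lib
hdfs dfs -put -f /tmp/spark_jars.zip /user/spark/lib/
```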
(3) Edit the spark-env.sh file
vim /root/spark/conf/spark-env.sh
Contents:
#!/usr/bin/env bash
HADOOP_CONF_DIR=/root/data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/cdh61
SPARK_HOME=/root/spark
(4) Edit the core-site.xml file
vim /root/data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/cdh61/core-site.xml
Uncomment the following block:
<property>
  <name>net.topology.script.file.name</name>
  <value>/etc/hadoop/conf.cloudera.yarn/topology.py</value>
</property>
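With the files in place, a quick sanity check on 172.16.1.105 (a sketch, assuming the paths used above) is to ask the copied Spark distribution for its version:

```shell
# Should report the CDH Spark version (2.4.0-cdh6.3.1) if the libraries
# copied in step 1 are usable on this host.
/root/spark/bin/spark-submit --version
```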
Submitting a Spark job:
1. Modify the Spark sample shipped with PDI
cp /root/data-integration/samples/jobs/Spark\ Submit/Spark\ submit.kjb /root/big_data/
Open the /root/big_data/Spark\ submit.kjb file in Kettle, as shown in Figure 1.
Edit the Spark Submit Sample job entry, as shown in Figure 2.
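For reference, the "Spark PI" job entry effectively builds a spark-submit call. A rough command-line equivalent, reconstructed from the submission log (the deploy mode follows the "yarn-cluster" warning in the log, and the final partitions argument "10" is an assumption, not taken from the sample):

```shell
# Approximate equivalent of the Spark Submit Sample job entry.
/root/spark/bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  /root/spark/examples/jars/spark-examples_2.11-2.4.0-cdh6.3.1.jar 10
```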
2. Save and run the job
The log is as follows:
2020/06/10 10:12:19 - Spoon - Starting job...
2020/06/10 10:12:19 - Spark submit - Start of job execution
2020/06/10 10:12:19 - Spark submit - Starting entry [Spark PI]
2020/06/10 10:12:19 - Spark PI - Submitting Spark Script
2020/06/10 10:12:20 - Spark PI - Warning: Master yarn-cluster is deprecated since 2.0. Please use master "yarn" with specified deploy mode instead.
2020/06/10 10:12:21 - Spark PI - 20/06/10 10:12:21 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2020/06/10 10:12:21 - Spark PI - 20/06/10 10:12:21 INFO client.RMProxy: Connecting to ResourceManager at manager/172.16.1.124:8032
2020/06/10 10:12:21 - Spark PI - 20/06/10 10:12:21 INFO yarn.Client: Requesting a new application from cluster with 3 NodeManagers
2020/06/10 10:12:21 - Spark PI - 20/06/10 10:12:21 INFO conf.Configuration: resource-types.xml not found
2020/06/10 10:12:21 - Spark PI - 20/06/10 10:12:21 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2020/06/10 10:12:21 - Spark PI - 20/06/10 10:12:21 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (2048 MB per container)
2020/06/10 10:12:21 - Spark PI - 20/06/10 10:12:21 INFO yarn.Client: Will allocate AM container, with 1408 MB memory including 384 MB overhead
2020/06/10 10:12:21 - Spark PI - 20/06/10 10:12:21 INFO yarn.Client: Setting up container launch context for our AM
2020/06/10 10:12:21 - Spark PI - 20/06/10 10:12:21 INFO yarn.Client: Setting up the launch environment for our AM container
2020/06/10 10:12:22 - Spark PI - 20/06/10 10:12:22 INFO yarn.Client: Preparing resources for our AM container
2020/06/10 10:12:22 - Spark PI - 20/06/10 10:12:22 INFO yarn.Client: Source and destination file systems are the same. Not copying hdfs://manager:8020/user/spark/lib/spark_jars.zip
2020/06/10 10:12:22 - Spark PI - 20/06/10 10:12:22 INFO yarn.Client: Uploading resource file:/root/spark/examples/jars/spark-examples_2.11-2.4.0-cdh6.3.1.jar -> hdfs://manager:8020/user/root/.sparkStaging/application_1591323999364_0060/spark-examples_2.11-2.4.0-cdh6.3.1.jar
2020/06/10 10:12:22 - Spark PI - 20/06/10 10:12:22 INFO yarn.Client: Uploading resource file:/tmp/spark-281973dd-8233-4f12-b416-36d28b74159c/__spark_conf__2533521329006469303.zip -> hdfs://manager:8020/user/root/.sparkStaging/application_1591323999364_0060/__spark_conf__.zip
2020/06/10 10:12:22 - Spark PI - 20/06/10 10:12:22 INFO spark.SecurityManager: Changing view acls to: root
2020/06/10 10:12:22 - Spark PI - 20/06/10 10:12:22 INFO spark.SecurityManager: Changing modify acls to: root
2020/06/10 10:12:22 - Spark PI - 20/06/10 10:12:22 INFO spark.SecurityManager: Changing view acls groups to:
2020/06/10 10:12:22 - Spark PI - 20/06/10 10:12:22 INFO spark.SecurityManager: Changing modify acls groups to:
2020/06/10 10:12:22 - Spark PI - 20/06/10 10:12:22 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
2020/06/10 10:12:22 - Spark PI - 20/06/10 10:12:22 INFO conf.HiveConf: Found configuration file file:/root/data-integration/plugins/pentaho-big-data-plugin/hadoop-configurations/cdh61/hive-site.xml
2020/06/10 10:12:22 - Spark PI - 20/06/10 10:12:22 INFO security.YARNHadoopDelegationTokenManager: Attempting to load user's ticket cache.
2020/06/10 10:12:22 - Spark PI - 20/06/10 10:12:22 INFO yarn.Client: Submitting application application_1591323999364_0060 to ResourceManager
2020/06/10 10:12:22 - Spark PI - 20/06/10 10:12:22 INFO impl.YarnClientImpl: Submitted application application_1591323999364_0060
2020/06/10 10:12:23 - Spark PI - 20/06/10 10:12:23 INFO yarn.Client: Application report for application_1591323999364_0060 (state: ACCEPTED)
2020/06/10 10:12:23 - Spark PI - 20/06/10 10:12:23 INFO yarn.Client:
2020/06/10 10:12:23 - Spark PI - client token: N/A
2020/06/10 10:12:23 - Spark PI - diagnostics: AM container is launched, waiting for AM container to Register with RM
2020/06/10 10:12:23 - Spark PI - ApplicationMaster host: N/A
2020/06/10 10:12:23 - Spark PI - ApplicationMaster RPC port: -1
2020/06/10 10:12:23 - Spark PI - queue: root.users.root
2020/06/10 10:12:23 - Spark PI - start time: 1591755142818
2020/06/10 10:12:23 - Spark PI - final status: UNDEFINED
2020/06/10 10:12:23 - Spark PI - tracking URL: http://manager:8088/proxy/application_1591323999364_0060/
2020/06/10 10:12:24 - Spark submit - Starting entry [Success]
2020/06/10 10:12:24 - Spark submit - Finished job entry [Success] (result=[true])
2020/06/10 10:12:24 - Spark submit - Finished job entry [Spark PI] (result=[true])
2020/06/10 10:12:24 - Spark submit - Job execution finished
2020/06/10 10:12:24 - Spoon - Job has ended.
The Spark History Server Web UI is shown in Figure 3.
Click "application_1591323999364_0061", as shown in Figure 4.
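Besides the web UIs, the submitted application can also be checked from the command line. A sketch, run on any host with a YARN client, using the application ID from the log above:

```shell
# Query the application's state, final status, and tracking URL by ID.
yarn application -status application_1591323999364_0060
```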
References:
- https://help.pentaho.com/Documentation/8.3/Products/Spark_Submit
- https://blog.csdn.net/wzy0623/article/details/51097471
Reposted from: https://blog.csdn.net/wzy0623/article/details/106660089