Pseudo-Distributed Hadoop + Spark Installation and Configuration, with a WordCount Example
When learning, we mostly work in a pseudo-distributed environment. Below is the installation and configuration of pseudo-distributed Hadoop + Spark, using:
- CentOS: 7.4
- JDK: 1.8
- Hadoop: 2.7.2
- Scala: 2.12.13
- Spark: 3.0.1
1. Configure the virtual machine
Download CentOS 7 and install it in a virtual machine.
1. Configure a static IP
vi /etc/sysconfig/network-scripts/ifcfg-ens33
TYPE=Ethernet
PROXY_METHOD=none
BROWSER_ONLY=no
# change to static
BOOTPROTO=static
DEFROUTE=yes
IPV4_FAILURE_FATAL=no
IPV6INIT=yes
IPV6_AUTOCONF=yes
IPV6_DEFROUTE=yes
IPV6_FAILURE_FATAL=no
IPV6_ADDR_GEN_MODE=stable-privacy
NAME=ens33
UUID=aec3fd78-3c06-4a77-8773-2667fe034ef4
DEVICE=ens33
# change to yes
ONBOOT=yes
# add the IP address, gateway and DNS
IPADDR=192.168.75.120
GATEWAY=192.168.75.2
DNS1=192.168.75.2
Restart the network:
service network restart
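To confirm the static IP took effect, here is a quick check (assuming the 192.168.75.120 address and 192.168.75.2 gateway configured above):
# show the address assigned to ens33
ip addr show ens33
# the gateway should answer
ping -c 3 192.168.75.2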
2. Disable the firewall
# check the firewall status
systemctl status firewalld.service
Active: active (running)
# stop the firewall
systemctl stop firewalld.service
# check the status again
systemctl status firewalld.service
Active: inactive (dead)
# disable the firewall permanently, so it stays off after reboot
systemctl disable firewalld.service
3. Test that yum works
[root@localhost ~]# yum install vim
The download fails:
已加载插件:fastestmirror
Determining fastest mirrors
Could not retrieve mirrorlist http://mirrorlist.centos.org/?release=7&arch=x86_64&repo=os&infra=stock error was
14: curl#6 - "Could not resolve host: mirrorlist.centos.org; 未知的错误"
One of the configured repositories failed (未知),
and yum doesn't have enough cached data to continue. At this point the only
safe thing yum can do is fail. There are a few ways to work "fix" this:
1. Contact the upstream for the repository and get them to fix the problem.
2. Reconfigure the baseurl/etc. for the repository, to point to a working
upstream. This is most often useful if you are using a newer
distribution release than is supported by the repository (and the
packages for the previous distribution release still work).
3. Run the command with the repository temporarily disabled
yum --disablerepo=<repoid> ...
4. Disable the repository permanently, so yum won't use it by default. Yum
will then just ignore the repository until you permanently enable it
again or use --enablerepo for temporary usage:
yum-config-manager --disable <repoid>
or
subscription-manager repos --disable=<repoid>
5. Configure the failing repository to be skipped, if it is unavailable.
Note that yum will try to contact the repo. when it runs most commands,
so will have to try and fail each time (and thus. yum will be be much
slower). If it is a very temporary problem though, this is often a nice
compromise:
yum-config-manager --save --setopt=<repoid>.skip_if_unavailable=true
Cannot find a valid baseurl for repo: base/7/x86_64
Then test connectivity with ping:
[root@seckillmysql ~]# ping 114.114.114.114
PING 114.114.114.114 (114.114.114.114) 56(84) bytes of data.
64 bytes from 114.114.114.114: icmp_seq=1 ttl=128 time=36.6 ms
64 bytes from 114.114.114.114: icmp_seq=2 ttl=128 time=36.9 ms
[root@seckillmysql ~]# ping www.baidu.com
ping: www.baidu.com: 未知的名称或服务
The IP address is reachable but the domain name is not, so this is a DNS problem. Edit the resolver configuration:
[root@seckillmysql ~]# vi /etc/resolv.conf
# add these two lines
nameserver 223.5.5.5
nameserver 223.6.6.6
After that, pinging www.baidu.com works and yum can download packages again.
2. Install Hadoop
1. Install the JDK
1. Upload jdk-8u152-linux-x64.tar.gz to /opt
2. Extract it
tar -zxvf jdk-8u152-linux-x64.tar.gz
3. Rename the directory
mv jdk1.8.0_152 jdk
4. Edit the environment variables
vim /etc/profile
5. Add JAVA_HOME and append it to PATH
# jdk
export JAVA_HOME=/opt/jdk
export PATH=$PATH:$JAVA_HOME/bin
6. Reload the profile
source /etc/profile
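As a quick sanity check, the java command should now resolve from the new PATH and report version 1.8:
java -version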
2. Passwordless SSH login
1. Generate a key pair
ssh-keygen
Just press Enter at every prompt:
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:O5cniA6gS5ovHr7MtIC65ED1pcJoEw463taWcUJF4m4 root@localhost.localdomain
The key's randomart image is:
+---[RSA 2048]----+
| ..o |
| . o |
|.. . o . |
|+ = + o |
|o*.o E .S |
|=oo.+ =. o . |
|=* o.+. + + . |
|#o+ .o o o |
|B%o . |
+----[SHA256]-----+
2. Copy the public key
[root@localhost sbin]# cd /root/.ssh/
[root@localhost .ssh]# ssh-copy-id root@localhost
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/root/.ssh/id_rsa.pub"
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
root@localhost's password:
Number of key(s) added: 1
Now try logging into the machine, with: "ssh 'root@localhost'"
and check to make sure that only the key(s) you wanted were added.
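To verify passwordless login, a quick check is the following, which should print the hostname without asking for a password:
ssh root@localhost hostname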
3. Hadoop
1. Extract the archive
1. Upload hadoop-2.7.2.tar.gz to /opt
2. Extract it
tar -zxvf hadoop-2.7.2.tar.gz
2. Configure the environment
1. Edit the environment variables
vim /etc/profile
2. Add HADOOP_HOME and append it to PATH
# hadoop
export HADOOP_HOME=/opt/hadoop-2.7.2
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
3. Reload the environment variables
source /etc/profile
4. Configure Hadoop
Edit hadoop-env.sh
vim /opt/hadoop-2.7.2/etc/hadoop/hadoop-env.sh
echo $JAVA_HOME
/opt/jdk
Set the JAVA_HOME path in hadoop-env.sh:
export JAVA_HOME=/opt/jdk
Edit core-site.xml
vim /opt/hadoop-2.7.2/etc/hadoop/core-site.xml
<!-- Address of the HDFS NameNode -->
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
</property>
<!-- Directory for files Hadoop generates at runtime -->
<property>
    <name>hadoop.tmp.dir</name>
    <value>/opt/hadoop-2.7.2/data/tmp</value>
</property>
5. Configure HDFS
Edit hdfs-site.xml
vim /opt/hadoop-2.7.2/etc/hadoop/hdfs-site.xml
<!-- Number of HDFS replicas -->
<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>
<!-- Host and port of the secondary NameNode -->
<property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>192.168.75.120:50090</value>
</property>
Replace 192.168.75.120 with your own VM's IP address.
6. Start and test
Format the NameNode:
[root@localhost hadoop-2.7.2]# bin/hdfs namenode -format
Start the NameNode:
[root@localhost hadoop-2.7.2]# sbin/hadoop-daemon.sh start namenode
Start the DataNode:
[root@localhost hadoop-2.7.2]# sbin/hadoop-daemon.sh start datanode
Open http://192.168.75.120:50070 in a browser.
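A quick way to check that both daemons are up is jps, which should list a NameNode and a DataNode process:
[root@localhost hadoop-2.7.2]# jps
# expected entries: NameNode, DataNode, Jps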
To stop them:
[root@localhost hadoop-2.7.2]# sbin/hadoop-daemon.sh stop namenode
[root@localhost hadoop-2.7.2]# sbin/hadoop-daemon.sh stop datanode
If you only want to practice Spark, the steps below can be skipped.
7. Configure YARN
Edit yarn-env.sh
vim /opt/hadoop-2.7.2/etc/hadoop/yarn-env.sh
Set JAVA_HOME:
export JAVA_HOME=/opt/jdk
Edit yarn-site.xml
vim /opt/hadoop-2.7.2/etc/hadoop/yarn-site.xml
<!-- How reducers fetch data -->
<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>
<!-- Address of the YARN ResourceManager -->
<property>
    <name>yarn.resourcemanager.hostname</name>
    <value>192.168.75.120</value>
</property>
8. Configure MapReduce
Edit mapred-env.sh
vim /opt/hadoop-2.7.2/etc/hadoop/mapred-env.sh
Set JAVA_HOME:
export JAVA_HOME=/opt/jdk
Edit mapred-site.xml (in Hadoop 2.7.2 you may need to copy it from mapred-site.xml.template first)
vim /opt/hadoop-2.7.2/etc/hadoop/mapred-site.xml
<property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
</property>
9. Start HDFS
[root@localhost hadoop-2.7.2]# sbin/start-dfs.sh
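start-dfs.sh brings up the NameNode, DataNode and SecondaryNameNode in one step. If you also completed the YARN and MapReduce configuration above, YARN can be started the same way (a sketch; by default the ResourceManager web UI listens on port 8088):
[root@localhost hadoop-2.7.2]# sbin/start-yarn.sh
# check which daemons are running
[root@localhost hadoop-2.7.2]# jps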
3. Install Spark
1. Extract the archive
1. Upload spark-3.0.1-bin-hadoop2.7.tgz to /opt
2. Extract it
tar -zxvf spark-3.0.1-bin-hadoop2.7.tgz
3. Rename the directory
mv spark-3.0.1-bin-hadoop2.7 spark
2. Configure the environment
1. Edit the system environment variables
vim /etc/profile
Add the following:
# spark
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin
export PATH=$PATH:$SPARK_HOME/sbin
Reload the profile
source /etc/profile
2. Modify the Spark configuration
1. Go to the conf directory under the extracted path and rename slaves.template to slaves
mv slaves.template slaves
2. Rename spark-env.sh.template to spark-env.sh
mv spark-env.sh.template spark-env.sh
3. Edit spark-env.sh, adding the JAVA_HOME environment variable and the cluster's master node
export JAVA_HOME=/opt/jdk
SPARK_MASTER_HOST=192.168.75.120
SPARK_MASTER_PORT=7077
**Note: port 7077 plays a role similar to Hadoop 3's internal communication port 8020; check this port against your own Hadoop configuration.**
3. Test Spark
Start the Worker and Master:
sbin/start-all.sh
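After start-all.sh, jps should show a Master and a Worker process, and the standalone master's web UI is reachable on its default port 8080 (a quick check, assuming the 192.168.75.120 address used above):
jps
# Master web UI: http://192.168.75.120:8080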
Test that it works:
bin/spark-shell
21/12/13 04:15:13 WARN Utils: Your hostname, localhost.localdomain resolves to a loopback address: 127.0.0.1; using 192.168.75.120 instead (on interface ens33)
21/12/13 04:15:13 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
21/12/13 04:15:14 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://192.168.75.120:4040
Spark context available as 'sc' (master = local[*], app id = local-1639386925730).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 3.0.1
/_/
Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_152)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
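Note that the banner above shows master = local[*], i.e. this shell ran in local mode. To attach the shell to the standalone cluster started by start-all.sh, pass the master URL explicitly (a sketch, using the master address configured in spark-env.sh):
bin/spark-shell --master spark://192.168.75.120:7077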
4. Run a Spark program
1. Create the project
1. Create a Maven project with JDK 1.8 and Scala 2.12.13
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.manster</groupId>
<artifactId>spark_demo</artifactId>
<version>1.0-SNAPSHOT</version>
<properties>
<spark.version>3.0.1</spark.version>
<scala.version>2.12</scala.version>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.12</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.12</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.12</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-mllib_2.12</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>com.huaban</groupId>
<artifactId>jieba-analysis</artifactId>
<version>1.0.2</version>
</dependency>
</dependencies>
<build>
<plugins><!-- plugin for packaging the jar -->
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>3.2.1</version>
<configuration>
<createDependencyReducedPom>false</createDependencyReducedPom>
</configuration>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<filters>
<filter>
<artifact>*:*</artifact>
<excludes>
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
</excludes>
</filter>
</filters>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>
2. Create a scala directory at the same level as java, then right-click it -> Mark Directory As -> Sources Root
2. The WordCount program
Put word.txt in the project root, at the same level as pom.xml:
hello hadoop scala
spark hello scala
flume hbase spark
Write the program:
package com.manster.spark.demo

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

/**
 * @Author manster
 * @Date 2021/12/13
 */
object WordCountDemo {
  def main(args: Array[String]): Unit = {
    val sparkConf: SparkConf = new SparkConf().setMaster("local").setAppName("WordCountDemo")
    val sparkContext = new SparkContext(sparkConf)
    // read the input file
    val file: RDD[String] = sparkContext.textFile("word.txt")
    // split each line into words
    val split: RDD[String] = file.flatMap(_.split(" "))
    // pair each word with a count of 1
    val map: RDD[(String, Int)] = split.map(word => (word, 1))
    // sum the counts per word
    val reduce: RDD[(String, Int)] = map.reduceByKey(_ + _)
    val res: Array[(String, Int)] = reduce.collect()
    res.foreach(println)
    sparkContext.stop()
  }
}
3. Output
(scala,2)
(spark,2)
(hadoop,1)
(flume,1)
(hello,2)
(hbase,1)
By default a lot of log output is printed; to quiet it down, create a log4j.properties file under resources:
log4j.rootCategory=ERROR, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Set the default spark-shell log level to ERROR. When running the spark-shell, the
# log level for this class is used to overwrite the root logger's log level, so that
# the user can have different defaults for the shell and regular Spark apps.
log4j.logger.org.apache.spark.repl.Main=ERROR

# Settings to quiet third party logs that are too verbose
log4j.logger.org.spark_project.jetty=ERROR
log4j.logger.org.spark_project.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=ERROR
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=ERROR
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR

# SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR
3. Package the jar
To make the test more flexible, we add a command-line argument to the program before packaging:
package com.manster.spark.demo

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

/**
 * @Author manster
 * @Date 2021/12/13
 */
object WordCountDemo {
  def main(args: Array[String]): Unit = {
    // do not hard-code the master here: setMaster("local") would override the
    // --master option passed to spark-submit, so the job would not run on the cluster
    val sparkConf: SparkConf = new SparkConf().setAppName("WordCountDemo")
    val sparkContext = new SparkContext(sparkConf)
    // read the input path passed as the first argument
    val file: RDD[String] = sparkContext.textFile(args(0))
    val split: RDD[String] = file.flatMap(_.split(" "))
    val map: RDD[(String, Int)] = split.map(word => (word, 1))
    val reduce: RDD[(String, Int)] = map.reduceByKey(_ + _)
    val res: Array[(String, Int)] = reduce.collect()
    res.foreach(println)
    sparkContext.stop()
  }
}
Double-click package in the Maven plugins panel; spark_demo-1.0-SNAPSHOT.jar is generated under the target directory.
4. Upload and run
1. Use Xftp or FileZilla to upload word.txt to the /opt/hadoop/ directory
2. Upload the file to HDFS (start Hadoop first with start-dfs.sh)
Create a directory:
hdfs dfs -mkdir /wordcount
Upload the file to HDFS:
hdfs dfs -put word.txt /wordcount
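To confirm the upload, a quick check:
hdfs dfs -ls /wordcount
hdfs dfs -cat /wordcount/word.txt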
3. Upload spark_demo-1.0-SNAPSHOT.jar to the spark directory
4. Run the WordCount job (start Spark first with start-all.sh):
bin/spark-submit \
--class com.manster.spark.demo.WordCountDemo \
--master spark://192.168.75.120:7077 \
./spark_demo-1.0-SNAPSHOT.jar \
hdfs://192.168.75.120:9000/wordcount
- --class: the main class of the program to run
- --master spark://192.168.75.120:7077: standalone deploy mode, connecting to the Spark cluster
- spark_demo-1.0-SNAPSHOT.jar: the jar that contains the class to run
- hdfs://192.168.75.120:9000/wordcount: the program's argument, i.e. the file(s) whose words we want to count
While the job runs, several Java processes are created.
By default the job uses all available cores across the cluster nodes, with 1024 MB of memory per node.
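If you prefer to cap the resources rather than take these defaults, spark-submit accepts explicit executor settings (a sketch; the memory and core values below are arbitrary examples):
bin/spark-submit \
--class com.manster.spark.demo.WordCountDemo \
--master spark://192.168.75.120:7077 \
--executor-memory 512m \
--total-executor-cores 2 \
./spark_demo-1.0-SNAPSHOT.jar \
hdfs://192.168.75.120:9000/wordcount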