
Hadoop Installation Notes

Hadoop installation and basic functional testing (Linux, pseudo-distributed configuration)


Following the official documentation and tips found online, I finally got a single-node Hadoop installation set up and tested; further practice will have to wait until a cluster environment is available.


I. Install the JDK (version 1.6 or higher required)

1. Download the JDK and upload it to any directory on the server.
The official download page for JDK 1.7:
http://www.oracle.com/technetwork/java/javase/downloads/java-se-jdk-7-download-432154.html
2. In that directory run:
sh jdk-6u17-linux-i586-rpm.bin
(adjust the file name to the package you actually downloaded; if you downloaded an rpm package instead, run #chmod 755 jdk-7-linux-x64.rpm => #rpm -ivh jdk-7-linux-x64.rpm, which installs straight into /usr/java/ as jdk1.7.0 and lets you skip step 3 below)
3. The installer displays the license agreement; page through it with Enter or Space.
At the end a prompt appears: Do you agree to the above license terms? [yes or no]
The installer is asking whether you accept the license you just read. Of course you do: type "y" or "yes" and press Enter.
4. At the command line run:
vi /etc/profile
and add the following lines:
export JAVA_HOME=/usr/java/jdk1.7.0
export JAVA_BIN=/usr/java/jdk1.7.0/bin
export PATH=$PATH:$JAVA_HOME/bin
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export JAVA_HOME JAVA_BIN PATH CLASSPATH

5. Go into /usr/bin/ (to create symbolic links):
cd /usr/bin
ln -s -f /usr/java/jdk1.7.0/jre/bin/java
ln -s -f /usr/java/jdk1.7.0/bin/javac
6. Log out and back in (e.g. with su), or run source /etc/profile to make the settings take effect.
7. At the command line run:
java -version
which should print:
java version "1.7.0"
Java(TM) SE Runtime Environment (build 1.7.0-b147)
Java HotSpot(TM) 64-Bit Server VM (build 21.0-b17, mixed mode)
8. JDK 1.7 installation is complete.


II. Install Hadoop

1. Unpack
tar -xzf hadoop-0.21.0.tar.gz, or tar -zxvf hadoop-0.21.0.tar.gz -C <target directory>
# find / -name hadoop*
/root/hadoop/hadoop-0.21.0
2. Set environment variables
export JAVA_HOME=/usr/java/jdk1.7.0
export HADOOP_HOME=/root/hadoop/hadoop-0.21.0
export PATH=$PATH:$HADOOP_HOME/bin
Add the three lines above to ~/.bash_profile (or to /etc/profile, then run source /etc/profile to apply them).
Running export with no arguments shows whether they are set correctly.

3. Verify the installation
hadoop version
Output:
Hadoop 0.21.0
Subversion https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707
Compiled by chrisdo on Fri Feb 19 08:07:34 UTC 2010




III. Hadoop Configuration

Each component in Hadoop is configured using an XML file. Core properties go in core-site.xml, HDFS properties go in hdfs-site.xml, and MapReduce properties go in mapred-site.xml. These files are all located in the conf subdirectory.

default setting example: /HADOOP_INSTALL/docs/(core-default.html, hdfs-default.html, mapred-default.html)
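These XML files are what the Configuration class reads at run time. As a minimal sketch (ShowConf is an illustrative class name, and it assumes the Hadoop core jar and the conf directory are on the classpath):

import org.apache.hadoop.conf.Configuration;

public class ShowConf {
    public static void main(String[] args) {
        // core-default.xml and core-site.xml are loaded automatically
        // when they can be found on the classpath
        Configuration conf = new Configuration();
        // Prints file:/// in standalone mode, or whatever core-site.xml sets
        System.out.println(conf.get("fs.default.name"));
    }
}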

Hadoop running modes:
(1) Standalone (or local) mode: no daemons, everything runs in a single JVM. Local filesystem + local MapReduce job runner.
(2) Pseudo-distributed mode: daemons run on the local machine. HDFS + MapReduce daemons.
(3) Fully distributed mode: daemons run on a cluster of machines. HDFS + MapReduce daemons.

To run Hadoop in a particular mode, you need to do two things: set the appropriate properties, and start the Hadoop daemons.


Key configuration properties for different modes:

Component    Property              Standalone            Pseudo-distributed    Fully distributed
-------------------------------------------------------------------------------------------------
Core         fs.default.name       file:/// (default)    hdfs://localhost/     hdfs://namenode/
HDFS         dfs.replication       N/A                   1                     3 (default)
MapReduce    mapred.job.tracker    local (default)       localhost:8021        jobtracker:8021


Standalone mode requires no changes to the configuration files.


1. Pseudo-distributed mode configuration

Edit the three configuration files under /conf/ and add the following properties:

core-site.xml:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost/</value>
  </property>
</configuration>

hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

mapred-site.xml:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>


2. Configure SSH

First check whether SSH is installed (just try running the ssh command).
Remove the SSH login password by generating a key with an empty passphrase:
#ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
Generating public/private rsa key pair.
Created directory '/root/.ssh'.
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
1f:3a:89:b4:6f:2f:e1:1e:3e:80:9c:53:7b:5f:ae:93 root@DW
Append the key to authorized_keys:
#cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Once that is done, test with #ssh localhost; the first login shows a prompt like:
The authenticity of host 'localhost (127.0.0.1)' can't be established.
RSA key fingerprint is a2:44:5f:79:00:c9:17:3b:b4:b5:47:cf:66:be:c4:0d.
Are you sure you want to continue connecting (yes/no)?
Type yes; subsequent logins will not ask again.



3. Format the HDFS filesystem
#hadoop namenode -format
As long as the three configuration files above are correct, this step should succeed.
The output includes a line like:
11/10/09 11:23:35 INFO common.Storage: Storage directory /tmp/hadoop-root/dfs/name has been successfully formatted.



4. Start and stop the daemons
To start the HDFS and MapReduce daemons, type:
% start-dfs.sh
% start-mapred.sh

Running # start-all.sh starts everything, but I ran into a problem here:
namenode running as process 17031. Stop it first.
localhost: Error: JAVA_HOME is not set.
localhost: Error: JAVA_HOME is not set.
jobtracker running as process 17793. Stop it first.
localhost: Error: JAVA_HOME is not set.
Neither script would run, even though JAVA_HOME was definitely set! Possibly a quirk of the version used here, hadoop 0.21.0.

The solutions found online were useless; reading the scripts themselves showed that JAVA_HOME has to be added manually in conf/hadoop-env.sh, after which everything ran fine.
Startup successful!

Verify that hadoop started properly:
#jps
This lists the running processes: NameNode, JobTracker, SecondaryNameNode, and so on. If the NameNode failed to start, run "bin/stop-all.sh" to stop everything, reformat the namenode, and start again.
Normal output looks like:
20738 TaskTracker
17793 JobTracker
20840 Jps
20495 SecondaryNameNode
20360 DataNode
17031 NameNode


Once started, check the cluster status:
#hadoop dfsadmin -report
http://10.80.18.191:50070


The state of the HDFS filesystem can be checked with:
#hadoop fsck /

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------



IV. Testing Hadoop (wordcount test)

1. Prepare a file to run wordcount on
vi /tmp/test.txt
(type in some arbitrary content, e.g. "mu ha ha ni da ye da ye da", then save and quit)


2. Upload the test file into the DFS filesystem as firstTest
hadoop dfs -copyFromLocal /tmp/test.txt firstTest
(Note: if firstTest does not already exist in DFS it is created automatically; existing DFS directories can be listed with "hadoop dfs -ls /")
Output:
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
11/10/09 14:38:05 INFO security.Groups: Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; cacheTimeout=300000
11/10/09 14:38:05 WARN conf.Configuration: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id

The messages say the command is deprecated =.= — the fs command now replaces dfs.
The original write-up is slightly wrong here: firstTest is not a directory but the file that was put into HDFS. Its content can be viewed with:
hadoop fs -cat firstTest
and matches the content of test.txt above.
List the HDFS directory with:
hadoop dfs -ls /
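As an aside, the same upload can be done through the HDFS Java API; a minimal sketch (UploadTest is an illustrative class name, and it assumes the cluster configuration is on the classpath):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UploadTest {
    public static void main(String[] args) throws Exception {
        // Reads fs.default.name from core-site.xml to locate the cluster
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Java equivalent of: hadoop dfs -copyFromLocal /tmp/test.txt firstTest
        fs.copyFromLocalFile(new Path("/tmp/test.txt"), new Path("firstTest"));
        fs.close();
    }
}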

3. Run wordcount
hadoop jar hadoop-mapred-examples-0.21.0.jar wordcount firstTest result
(on 0.20.2 the examples jar is named differently, i.e. hadoop jar hadoop-0.20.2-examples.jar wordcount firstTest result)
(Note: this command runs wordcount on the firstTest file and writes the counts to result; if result does not exist it is created automatically)
Running it here failed with:
Exception in thread "main" java.io.IOException: Error opening job jar: hadoop-mapred-example0.21.0.jar
i.e. the jar could not be found.
Fix:
hadoop fs -rmr result    (remove the result output left over from the previous attempt)
You must cd into the hadoop installation directory, i.e. the directory containing hadoop-mapred-examples-0.21.0.jar:
#cd /root/hadoop/hadoop-0.21.0
#hadoop jar hadoop-mapred-examples-0.21.0.jar wordcount firstTest result
The job then runs, with output like:
11/10/09 15:28:29 INFO security.Groups: Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; cacheTimeout=300000
11/10/09 15:28:29 WARN conf.Configuration: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
11/10/09 15:28:29 WARN mapreduce.JobSubmitter: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
11/10/09 15:28:30 INFO input.FileInputFormat: Total input paths to process : 1
11/10/09 15:28:30 WARN conf.Configuration: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
11/10/09 15:28:30 INFO mapreduce.JobSubmitter: number of splits:1
11/10/09 15:28:30 INFO mapreduce.JobSubmitter: adding the following namenodes' delegation tokens:null
11/10/09 15:28:30 INFO mapreduce.Job: Running job: job_201110091132_0001
11/10/09 15:28:31 INFO mapreduce.Job: map 0% reduce 0%
11/10/09 15:28:43 INFO mapreduce.Job: map 100% reduce 0%
11/10/09 15:28:49 INFO mapreduce.Job: map 100% reduce 100%
11/10/09 15:28:51 INFO mapreduce.Job: Job complete: job_201110091132_0001
11/10/09 15:28:51 INFO mapreduce.Job: Counters: 33
FileInputFormatCounters
BYTES_READ=42
FileSystemCounters
FILE_BYTES_READ=78
FILE_BYTES_WRITTEN=188
HDFS_BYTES_READ=143
HDFS_BYTES_WRITTEN=52
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
Job Counters
Data-local map tasks=1
Total time spent by all maps waiting after reserving slots (ms)=0
Total time spent by all reduces waiting after reserving slots (ms)=0
SLOTS_MILLIS_MAPS=6561
SLOTS_MILLIS_REDUCES=3986
Launched map tasks=1
Launched reduce tasks=1
Map-Reduce Framework
Combine input records=5
Combine output records=5
Failed Shuffles=0
GC time elapsed (ms)=112
Map input records=5
Map output bytes=62
Map output records=5
Merged Map outputs=1
Reduce input groups=5
Reduce input records=5
Reduce output records=5
Reduce shuffle bytes=78
Shuffled Maps =1
Spilled Records=10
SPLIT_RAW_BYTES=101



4. View the results
hadoop dfs -cat result/part-r-00000
(Note: the results go into a file named "part-r-*****" by default; use "hadoop dfs -ls result" to see what the result directory contains)
hadoop dfs -ls result
Output:
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
11/10/09 15:50:52 INFO security.Groups: Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; cacheTimeout=300000
11/10/09 15:50:52 WARN conf.Configuration: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
Found 2 items
-rw-r--r-- 1 root supergroup 0 2011-10-09 15:28 /user/root/result/_SUCCESS
-rw-r--r-- 1 root supergroup 52 2011-10-09 15:28 /user/root/result/part-r-00000

part-r-00000 is the result file. Note that this path is inside HDFS and cannot be accessed directly from the OS filesystem.
View the result: hadoop dfs -cat result/part-r-00000


5. To re-run: edit the content of test.txt, then delete the previous input and output (or use new file and directory names):
hadoop fs -rmr firstTest
hadoop fs -rmr result
Then run:
hadoop dfs -copyFromLocal /tmp/test.txt firstTest
hadoop jar /root/hadoop/hadoop-0.21.0/hadoop-mapred-examples-0.21.0.jar wordcount firstTest result
View the output:
hadoop dfs -cat result/part-r-00000
Output:
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
11/10/09 16:25:02 INFO security.Groups: Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; cacheTimeout=300000
11/10/09 16:25:02 WARN conf.Configuration: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
ceshi 1

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------


Writing a Hadoop test application in Eclipse


I had not written Java for over three years and was very rusty, so this was also a chance to practice =.=

1. In Eclipse create a new Java project HadoopTest, then create a class (check the option to auto-generate a main method).

2. Add hadoop-0.20.2-core.jar to the project: right-click the project => Build Path => Configure Build Path, and add hadoop-0.20.2-core.jar as an external JAR.

3. Write the test source code, as follows:

import java.io.IOException;

import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.*;

public class TestOne {

    public static void main(String[] args) {
        // Configuration picks up core-site.xml etc. from the classpath
        Configuration conf = new Configuration();
        try {
            // Handle to the filesystem (HDFS on rhel-h1 here)
            FileSystem fs = FileSystem.get(conf);
            // Target file in HDFS; the second argument to create() means overwrite
            Path f = new Path("hdfs://rhel-h1:9000/user/root/test1.txt");
            FSDataOutputStream os = fs.create(f, true);
            // Write 100 short test strings into the file
            for (int i = 0; i < 100; ++i)
                os.writeChars("test" + i);
            os.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

4. Build the project and export it as a JAR (export only the compiled classes).

5. Upload the JAR to the hadoop master and run:
#hadoop jar HadoopTest.jar TestOne

Because the HDFS path in my test code was mistakenly written as "hdfs:///localhost/user/root/test1.txt" =.= , the generated test file ended up under /localhost/user/root in HDFS rather than under /user/root as intended; it should have been Path f = new Path("hdfs:///user/root/test1");
No matter, the test served its purpose.
View the result:
#hadoop fs -cat /user/root/test1.txt
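The same check can also be done from Java instead of hadoop fs -cat; a minimal sketch (TestRead is an illustrative class name, again assuming the cluster configuration is on the classpath):

import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class TestRead {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        InputStream in = null;
        try {
            // Open the file written by TestOne and stream it to stdout
            in = fs.open(new Path("/user/root/test1.txt"));
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}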

To write your own MapReduce job you just implement the corresponding interfaces; the development workflow is the same. The map side extends the Mapper class in the org.apache.hadoop.mapreduce package and overrides the map method, the reduce side extends Reducer and overrides reduce, and in main they are wired in with job.setMapperClass(map.class) and job.setReducerClass(reduce.class). A sketch of that structure follows below.
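A minimal sketch in the style of the bundled wordcount example (MyWordCount, Map and Reduce are illustrative names; it uses the new org.apache.hadoop.mapreduce API mentioned above):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyWordCount {

    // Map: split each input line into words and emit (word, 1)
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce: sum the counts for each word
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values)
                sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "my word count");
        job.setJarByClass(MyWordCount.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. firstTest
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. result
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar the same way as HadoopTest.jar above, it would be run with something like #hadoop jar MyWordCount.jar MyWordCount firstTest result.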



------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Common Hadoop commands


hadoop fs -ls                 list the contents of /user/root; with no path given, the current user's home directory is used
hadoop fs -ls /               list the HDFS root directory
hadoop fs -lsr /              list the root directory and all subdirectories recursively
hadoop fs -put aatest.txt .   upload the local file aatest.txt to HDFS; note the trailing dot, which means the HDFS default directory /user/$USER, where $USER is the login user name
hadoop fs -cat aatest.txt     view the uploaded file
hadoop fs -cat aatest.txt | head
hadoop fs -tail aatest.txt    view the end of the file (the last kilobyte)
hadoop fs -get aatest.txt .   download a file from HDFS to the local machine; here the trailing dot is the local current directory
hadoop fs -rm aatest.txt      delete a file
hadoop fsck / -files -blocks  check file blocks
hadoop dfsadmin -report       show the status of all DataNodes
hadoop fs -copyFromLocal localfilename HDFSfilename    upload a local file to HDFS
hadoop fs -copyToLocal xxxx
hadoop fs -mkdir xxx          create a directory

As a test of -put, I uploaded two files of a few dozen MB each to HDFS (a two-node distributed deployment). The monitoring page at http://10.80.18.191:50070 showed that the first file landed on rhel-h3 (one extra block there) and the second on rhel-h2 (likewise one extra block). From the HDFS command line alone there is no way to tell which node a file is stored on.

hadoop fs -rmr xxx            delete a directory
hadoop job <args>             operate on currently running jobs, e.g. list, kill, etc.
hadoop balancer               rebalance disk usage across the cluster
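Most of these shell commands have FileSystem API counterparts as well; for example, a rough Java equivalent of hadoop fs -ls (ListTest is an illustrative class name, assuming the cluster configuration is on the classpath):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListTest {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Default to the HDFS home directory (/user/$USER), like a bare "hadoop fs -ls"
        Path dir = args.length > 0 ? new Path(args[0]) : fs.getHomeDirectory();
        for (FileStatus f : fs.listStatus(dir)) {
            System.out.println((f.isDir() ? "d" : "-") + "\t" + f.getLen() + "\t" + f.getPath());
        }
    }
}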


------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------


Deploying Hadoop on Windows


1. Install the JDK

2. Install Cygwin
Install Cygwin online via http://cygwin.com/install.html; during installation it helps to add a fast mirror's /pub directory as the download site and install from that.
In the "Select Packages" dialog, be sure to select openssh and openssl under "Net", and sed under "Base".
Selecting vim under "Editors" and subversion under "Devel" is also recommended.

3. Configure environment variables
Add JAVA_HOME, then append the JDK bin directory and Cygwin's bin and usr/bin directories to PATH.

4. Install the SSH service
Inside Cygwin run ssh-host-config and follow the prompts; note that the first question, "should privilege separation be used?", should be answered no.
After installation, start the corresponding service under Computer Management -> Services.

5. SSH configuration (similar to Linux: set up passwordless trust)
$ssh-keygen
Press Enter through all prompts to generate the key files /home/Administrator/.ssh/id_rsa and id_rsa.pub
cp /home/Administrator/.ssh/id_rsa.pub /home/Administrator/.ssh/authorized_keys

Re-enter Cygwin and run ssh localhost, but this fails with the following error:
connection closed by ::1

No solution for this yet.



------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Distributed Hadoop installation steps


Cluster environment:
10.80.18.191 rhel-h1    namenode and jobtracker
10.80.18.192 rhel-h2
10.80.18.193 rhel-h3


1. Install the JDK

java -version must report 1.6 or higher.
All three machines use /usr/java/jdk1.7.0.
#vi /etc/profile
and add the following:
export JAVA_HOME=/usr/java/jdk1.7.0
export JAVA_BIN=/usr/java/jdk1.7.0/bin
export PATH=$PATH:$JAVA_HOME/bin
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export JAVA_HOME JAVA_BIN PATH CLASSPATH

#source /etc/profile    -- apply the changes


2. Create a hadoop user and group on all three machines
(not actually done in this test; root was used instead)
#groupadd hadoop
#useradd hadoop -g hadoop
#passwd hadoop

Add the IP address and hostname of every node to /etc/hosts on all three machines:
10.80.18.191 rhel-h1
10.80.18.192 rhel-h2
10.80.18.193 rhel-h3


3. SSH configuration

Configure SSH so the user can log in without a password:
#ssh-keygen -t rsa -f ~/.ssh/id_rsa    (in the pseudo-distributed install this was #ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa, which sets an empty passphrase)
The private key goes into the file given by -f; the public key is stored under the same name with a .pub suffix.
#cd ~/.ssh
#cp id_rsa.pub authorized_keys
#scp authorized_keys rhel-h2:/root/.ssh    -- push the key to the other nodes
#scp authorized_keys rhel-h3:/root/.ssh
Then go into the .ssh directory on the other two nodes and fix the permissions of authorized_keys:
#chmod 644 authorized_keys
After this, connections from rhel-h1 to the other nodes only ask for a password the first time.

(If the hadoop user's home directory lives on an NFS filesystem, the key can be shared across the whole cluster in one step:
#cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys)

Log in from the master to a slave once to pass the SSH host-key check, then run ssh-add to cache the key.
#ssh rhel-h1
#ssh-add

I additionally configured rhel-h2 and rhel-h3 the same way, so SSH between any pair of the three nodes needs no password.


4. Install hadoop

It is best not to install under the hadoop user's home directory; use /usr/local or /opt instead. Here everything goes under /opt, i.e. /opt/hadoop-0.20.2.
Change the owner:
chown -R hadoop:hadoop hadoop-0.20.2
Set environment variables:
export HADOOP_HOME=/opt/hadoop-0.20.2
export PATH=$PATH:$HADOOP_HOME/bin


5. Hadoop configuration

Configuration files (under /conf):
hadoop-env.sh              environment variable settings
core-site.xml              HDFS and common core settings
hdfs-site.xml              namenode, secondary namenode and datanode settings
mapred-site.xml            jobtracker and tasktracker settings
masters                    list of machines that run the secondary namenode
slaves                     list of machines that run datanodes and tasktrackers
hadoop-metrics.properties  controls how hadoop publishes metrics
log4j.properties           properties for system and audit logging
The -config option can point Hadoop at configuration files kept elsewhere in the filesystem.

The three main configuration files on the namenode can follow the pseudo-distributed setup, as below:

core-site.xml:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://rhel-h1:9000/</value>
  </property>
</configuration>

hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.name.dir</name>   <!-- namenode HDFS storage location -->
    <value>/hdfs</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.data.dir</name>   <!-- datanode HDFS storage location -->
    <value>/hdfs</value>
    <final>true</final>
  </property>
  <property>
    <name>fs.checkpoint.dir</name>
    <value>/hdfs/checkpoint</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

mapred-site.xml:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>rhel-h1:9001</value>
    <final>true</final>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>8</value>
    <final>true</final>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
    <final>true</final>
  </property>
</configuration>

Edit masters under HADOOP_HOME/conf/ to contain the master's hostname, rhel-h1.
Edit slaves under HADOOP_HOME/conf/ to contain the hostnames of all slaves, one per line.
Adding or removing nodes also means editing these two files.

Copy the hadoop directory to the other nodes:
#scp -r /opt/hadoop-0.20.2 rhel-h2:/opt/hadoop-0.20.2
#scp -r /opt/hadoop-0.20.2 rhel-h3:/opt/hadoop-0.20.2

Edit hadoop-env.sh under HADOOP_HOME/conf/ on every node and add JAVA_HOME=/usr/java/jdk1.7.0

Format HDFS:
#hadoop namenode -format
Note that the configured HDFS directories must not be created by hand beforehand, otherwise formatting fails. This tripped me up for a long time: the failed format kept the namenode from starting.

Then start hadoop:
#start-all.sh

After a successful start, confirm with the jps command:
the namenode should be running SecondaryNameNode, NameNode and JobTracker processes
the datanodes should be running DataNode and TaskTracker processes

Once started, check the cluster status:
#hadoop dfsadmin -report
http://10.80.18.191:50070
View a datanode:
http://10.80.18.192:50060/tasktracker.jsp

The state of the HDFS filesystem can be checked with:
#hadoop fsck /


6. Cluster test

The following page can be used to monitor jobtracker activity and view run logs:
http://10.80.18.191:50030/jobtracker.jsp

Run the same test as in the pseudo-distributed setup.

(1) Prepare a file to run wordcount on
vi /tmp/test.txt
(type in some content, e.g. "ceshi", then save and quit)


(2) Upload the test file into the DFS filesystem as firstTest
hadoop dfs -copyFromLocal /tmp/test.txt firstTest
(Note: if firstTest does not already exist in DFS it is created automatically; existing DFS directories can be listed with "hadoop dfs -ls /")
Output:

The file content can be viewed with:
hadoop dfs -cat firstTest
and matches the content of test.txt.
List the HDFS directory with:
hadoop dfs -ls /

(3) Run wordcount
hadoop jar hadoop-0.20.2-examples.jar wordcount firstTest result
(Note: this runs wordcount on everything under firstTest and writes the counts to the result folder; if result does not exist it is created automatically)
As in the pseudo-distributed test, the jar is not found unless the command is run from the hadoop installation directory (/opt/hadoop-0.20.2 here), and any result output left over from a previous run must first be removed with hadoop fs -rmr result:
#cd /opt/hadoop-0.20.2
#hadoop jar hadoop-0.20.2-examples.jar wordcount firstTest result
Output:
[root@rhel-h1 hadoop-0.20.2]# hadoop jar hadoop-0.20.2-examples.jar wordcount firstTest result
11/10/20 17:48:07 INFO input.FileInputFormat: Total input paths to process : 1
11/10/20 17:48:07 INFO mapred.JobClient: Running job: job_201110201745_0001
11/10/20 17:48:08 INFO mapred.JobClient: map 0% reduce 0%
11/10/20 17:48:21 INFO mapred.JobClient: map 100% reduce 0%
11/10/20 17:48:30 INFO mapred.JobClient: map 100% reduce 33%
11/10/20 17:48:36 INFO mapred.JobClient: map 100% reduce 100%
11/10/20 17:48:38 INFO mapred.JobClient: Job complete: job_201110201745_0001
11/10/20 17:48:38 INFO mapred.JobClient: Counters: 17
11/10/20 17:48:38 INFO mapred.JobClient: Job Counters
11/10/20 17:48:38 INFO mapred.JobClient: Launched reduce tasks=1
11/10/20 17:48:38 INFO mapred.JobClient: Rack-local map tasks=1
11/10/20 17:48:38 INFO mapred.JobClient: Launched map tasks=1
11/10/20 17:48:38 INFO mapred.JobClient: FileSystemCounters
11/10/20 17:48:38 INFO mapred.JobClient: FILE_BYTES_READ=18
11/10/20 17:48:38 INFO mapred.JobClient: HDFS_BYTES_READ=6
11/10/20 17:48:38 INFO mapred.JobClient: FILE_BYTES_WRITTEN=68
11/10/20 17:48:38 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=8
11/10/20 17:48:38 INFO mapred.JobClient: Map-Reduce Framework
11/10/20 17:48:38 INFO mapred.JobClient:     Reduce input groups=1
11/10/20 17:48:38 INFO mapred.JobClient: Combine output records=1
11/10/20 17:48:38 INFO mapred.JobClient: Map input records=1
11/10/20 17:48:38 INFO mapred.JobClient: Reduce shuffle bytes=18
11/10/20 17:48:38 INFO mapred.JobClient: Reduce output records=1
11/10/20 17:48:38 INFO mapred.JobClient: Spilled Records=2
11/10/20 17:48:38 INFO mapred.JobClient: Map output bytes=10
11/10/20 17:48:38 INFO mapred.JobClient: Combine input records=1
11/10/20 17:48:38 INFO mapred.JobClient: Map output records=1
11/10/20 17:48:38 INFO mapred.JobClient: Reduce input records=1

At this point http://10.80.18.191:50070/ shows the number of active datanodes; here it reports: Live Nodes 2
The output is clearly different from the pseudo-distributed case.


(4) View the results
hadoop dfs -cat result/part-r-00000
(Note: the results go into a file named "part-r-*****" by default; use "hadoop dfs -ls result" to see what the result directory contains)
hadoop dfs -ls result
Output:
ceshi 1

part-r-00000 is the result file; again, this path is inside HDFS and cannot be accessed directly from the OS filesystem.
View the result: hadoop dfs -cat result/part-r-00000


Clearly, the cluster setup was successful.



7. Dynamically adding a node

Configure the new node in the same way as before (hadoop installation, masters/slaves entries, SSH trust, etc.), then run:
#hadoop-daemon.sh start datanode



------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Loading the Hadoop source code


Create a new Java project, choose "Create project from existing source", select the directory D:\hadoop\common\trunk, and name the project common.

Then in the project properties, under Builders create a new Ant Builder whose Buildfile is D:\hadoop\common\trunk\hadoop-mapreduce-project\build.xml. In the Targets tab, under Manual Build click Set Targets, check (jar  Make hadoop-core.jar) and uncheck compile [default].
Back in the project's Builders page, uncheck the original Java Builder, then save and close.
Then turn off automatic builds and build the project manually; Ant starts downloading the required libraries from the network automatically.

The build produces hadoop-core-0.21.0-dev.jar in the build directory.


When importing other decompiled Hadoop code, an import that fails to compile,
import org.eclipse.jdt.internal.debug.ui.launcher.JavaApplicationLaunchShortcut;
should be changed to:
import org.eclipse.jdt.debug.ui.launchConfigurations.JavaApplicationLaunchShortcut;

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------


HDFS concepts


Safemode:
On startup, the NameNode enters a special state called Safemode. Replication of data blocks does not occur when the NameNode is in the Safemode state. The NameNode receives Heartbeat and Blockreport messages from the DataNodes. A Blockreport contains the list of data blocks that a DataNode is hosting. Each block has a specified minimum number of replicas. A block is considered safely replicated when the minimum number of replicas of that data block has checked in with the NameNode. After a configurable percentage of safely replicated data blocks checks in with the NameNode (plus an additional 30 seconds), the NameNode exits the Safemode state. It then determines the list of data blocks (if any) that still have fewer than the specified number of replicas. The NameNode then replicates these blocks to other DataNodes.
In short: right after startup the namenode sits in safemode, during which no block replication takes place. It collects heartbeats and block reports from the datanodes, checks each block against its minimum replication, and once a configurable percentage of blocks is confirmed safely replicated it leaves safemode and then re-replicates any blocks that are still under-replicated.

checkpoint:
When the NameNode starts up, it reads the FsImage and EditLog from disk, applies all the transactions from the EditLog to the in-memory representation of the FsImage, and flushes out this new version into a new FsImage on disk. It can then truncate the old EditLog because its transactions have been applied to the persistent FsImage. This process is called a checkpoint.

