Hadoop Interacting with HDFS
DataTorrent
HADOOP: Interacting with HDFS
→ What's the “Need”? ←
❏ Big data ocean
❏ Expensive hardware
❏ Frequent failures and difficult recovery
❏ Scaling up with more machines
→ Hadoop ←
Open-source software: a Java framework; initial release December 10, 2011
It provides both:
Storage → [HDFS]
Processing → [MapReduce]
HDFS: Hadoop Distributed File System
→ How Hadoop addresses the need ←
Big data ocean:
Have multiple machines. Each stores a portion of the data, not the entire data set.
Expensive hardware:
Use commodity hardware: simple and cheap.
Frequent failures and difficult recovery:
Keep multiple copies of the data, and keep the copies on different machines.
Scaling up with more machines:
If more processing is needed, add new machines on the fly.
→ HDFS ←
Runs on commodity hardware: doesn't require expensive machines
Large files; write-once, read-many (WORM)
Files are split into blocks
Actual blocks go to DataNodes; the metadata is stored at the NameNode
Blocks are replicated to different nodes
Default configuration:
Block size = 128 MB
Replication factor = 3
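These two defaults determine how a file is physically laid out. A minimal sketch of the arithmetic in plain shell (no cluster needed; the 300 MB file size is a made-up example):

```shell
# Hypothetical 300 MB file stored with the default 128 MB block size
# and replication factor 3.
filesize=$((300 * 1024 * 1024))
blocksize=$((128 * 1024 * 1024))
replication=3

# Number of blocks = ceil(filesize / blocksize)
blocks=$(( (filesize + blocksize - 1) / blocksize ))
echo "blocks: $blocks"            # 3 blocks: 128 MB + 128 MB + 44 MB

# Raw storage consumed across the cluster = logical size * replication
raw_mb=$(( 300 * replication ))
echo "raw storage: ${raw_mb} MB"  # 900 MB of disk for 300 MB of data
```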
→ Where NOT to use Hadoop/HDFS ←
Low-latency data access:
HDFS is optimized for high throughput of data at the expense of latency.
Large number of small files:
The NameNode holds the entire file-system metadata in memory; many small files mean too much metadata compared to actual data.
Multiple writers / arbitrary file modifications:
No support for multiple writers to a file; writes always append to the end of a file.
→ Some Key Concepts ←
❏ NameNode
❏ DataNodes
❏ JobTracker (MR v1)
❏ TaskTrackers (MR v1)
❏ ResourceManager (MR v2)
❏ NodeManagers (MR v2)
❏ ApplicationMasters (MR v2)
→ NameNode & DataNodes ←
❏ NameNode:
Centerpiece of HDFS: the master
Stores only the block metadata: block name, block location, etc.
Critical component; when down, the whole cluster is considered down; single point of failure
Should be configured with more RAM
❏ DataNode:
Stores the actual data: the slave
In constant communication with the NameNode
When one goes down, it does not affect the availability of data or of the cluster
Should be configured with more disk space
❏ SecondaryNameNode:
Doesn't actually act as a NameNode
Stores an image of the primary NameNode at certain checkpoints
Used as a backup to restore the NameNode
→ JobTracker & TaskTrackers ←
❏ JobTracker:
Talks to the NameNode to determine the location of the data
Monitors all TaskTrackers and submits the status of the job back to the client
When down, HDFS is still functional, but no new MR jobs start and existing jobs halt
Replaced by ResourceManager/ApplicationMaster in MRv2
❏ TaskTracker:
Runs on all DataNodes
Communicates with the JobTracker, signaling task progress
TaskTracker failure is not considered fatal
Replaced by NodeManager in MRv2
→ ResourceManager & NodeManager ←
❏ Present in Hadoop v2.0
❏ Equivalent of JobTracker & TaskTracker in v1.0
❏ ResourceManager (RM):
Usually runs at the NameNode; distributes resources among applications
Two main components: Scheduler and ApplicationsManager
❏ NodeManager (NM):
Per-node framework agent
Responsible for containers
Monitors their resource usage
Reports the stats to the RM
The central ResourceManager and the node-specific NodeManagers together are called YARN
→ Hadoop 1.0 vs. 2.0 ←
HDFS 1.0:
Single point of failure
Horizontal-scaling performance issues
HDFS 2.0:
HDFS High Availability
HDFS Snapshots
Improved performance
HDFS Federation
→ Interacting with HDFS ←
Command prompt:
Similar to Linux terminal commands
Unix is the model, POSIX is the API
Web interface:
Similar to browsing an FTP site on the web
Interacting With HDFS
On Command Prompt
→ Notes ←
File paths on HDFS:
hdfs://<namenode>:<port>/path/to/file.txt
hdfs://127.0.0.1:8020/user/USERNAME/demo/data/file.txt
hdfs://localhost:8020/user/USERNAME/demo/data/file.txt
/user/USERNAME/demo/file.txt
demo/file.txt
File system:
Local: local file system (Linux)
HDFS: Hadoop file system
In some places, the terms “file” and “directory” have the same meaning.
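As a sketch of how the path forms above relate: a relative path such as `demo/file.txt` resolves against the user's HDFS home directory, `/user/<username>`. Plain shell string assembly only; `USERNAME` is the placeholder used throughout these slides:

```shell
# Relative HDFS paths resolve against the user's home directory,
# /user/<username>, so these two forms name the same file.
user="USERNAME"
relative="demo/file.txt"
absolute="/user/${user}/${relative}"
echo "$absolute"   # /user/USERNAME/demo/file.txt
```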
→ Before we start ←
Command:
hdfs
Usage:
hdfs [--config confdir] COMMAND
Examples:
hdfs dfs
hdfs dfsadmin
hdfs fsck
hdfs namenode
hdfs datanode
hdfs `dfs` commands
→ General syntax for `dfs` commands ←
hdfs dfs -<COMMAND> [-OPTIONS] <PARAMETERS>
e.g.
hdfs dfs -ls -R /user/USERNAME/demo/data/
0. Do it yourself
Syntax:
hdfs dfs -help [COMMAND ...]
hdfs dfs -usage [COMMAND ...]
Example:
hdfs dfs -help cat
hdfs dfs -usage cat
1. List a file/directory
Syntax:
hdfs dfs -ls [-d] [-h] [-R] <hdfs-dir-path>
Example:
hdfs dfs -ls
hdfs dfs -ls /
hdfs dfs -ls /user/USERNAME/demo/list-dir-example
hdfs dfs -ls -R /user/USERNAME/demo/list-dir-example
2. Create a directory
Syntax:
hdfs dfs -mkdir [-p] <hdfs-dir-path>
Example:
hdfs dfs -mkdir /user/USERNAME/demo/create-dir-example
hdfs dfs -mkdir -p /user/USERNAME/demo/create-dir-example/dir1/dir2/dir3
3. Create a file locally & put it on HDFS
Syntax:
vi filename.txt
hdfs dfs -put [options] <local-file-path> <hdfs-dir-path>
Example:
vi file-copy-to-hdfs.txt
hdfs dfs -put file-copy-to-hdfs.txt /user/USERNAME/demo/put-example/
4. Get a file from HDFS to local
Syntax:
hdfs dfs -get <hdfs-file-path> [local-dir-path]
Example:
hdfs dfs -get /user/USERNAME/demo/get-example/file-copy-from-hdfs.txt ~/demo/
5. Copy from local to HDFS
Syntax:
hdfs dfs -copyFromLocal <local-file-path> <hdfs-file-path>
Example:
hdfs dfs -copyFromLocal file-copy-to-hdfs.txt /user/USERNAME/demo/copyFromLocal-example/
6. Copy to local from HDFS
Syntax:
hdfs dfs -copyToLocal <hdfs-file-path> <local-file-path>
Example:
hdfs dfs -copyToLocal /user/USERNAME/demo/copyToLocal-example/file-copy-from-hdfs.txt ~/demo/
7. Move a file from local to HDFS
Syntax:
hdfs dfs -moveFromLocal <local-file-path> <hdfs-dir-path>
Example:
hdfs dfs -moveFromLocal /path/to/file.txt /user/USERNAME/demo/moveFromLocal-example/
8. Copy a file within HDFS
Syntax:
hdfs dfs -cp <hdfs-source-file-path> <hdfs-dest-file-path>
Example:
hdfs dfs -cp /user/USERNAME/demo/copy-within-hdfs/file-copy.txt /user/USERNAME/demo/data/
9. Move a file within HDFS
Syntax:
hdfs dfs -mv <hdfs-source-file-path> <hdfs-dest-file-path>
Example:
hdfs dfs -mv /user/USERNAME/demo/move-within-hdfs/file-move.txt /user/USERNAME/demo/data/
10. Merge files on HDFS
Syntax:
hdfs dfs -getmerge [-nl] <hdfs-dir-path> <local-file-path>
Examples:
hdfs dfs -getmerge -nl /user/USERNAME/demo/merge-example/ /path/to/all-files.txt
11. View file contents
Syntax:
hdfs dfs -cat <hdfs-file-path>
hdfs dfs -tail <hdfs-file-path>
hdfs dfs -text <hdfs-file-path>
Examples:
hdfs dfs -cat /user/USERNAME/demo/data/cat-example.txt
hdfs dfs -cat /user/USERNAME/demo/data/cat-example.txt | head
12. Remove files/dirs from HDFS
Syntax:
hdfs dfs -rm [options] <hdfs-file-path>
Examples:
hdfs dfs -rm /user/USERNAME/demo/remove-example/remove-file.txt
hdfs dfs -rm -R /user/USERNAME/demo/remove-example/
hdfs dfs -rm -R -skipTrash /user/USERNAME/demo/remove-example/
13. Change file/dir properties
Syntax:
hdfs dfs -chgrp [-R] <NewGroupName> <hdfs-file-path>
hdfs dfs -chmod [-R] <permissions> <hdfs-file-path>
hdfs dfs -chown [-R] <NewOwnerName> <hdfs-file-path>
Examples:
hdfs dfs -chmod -R 777 /user/USERNAME/demo/data/file-change-properties.txt
14. Check the file size
Syntax:
hdfs dfs -du <hdfs-file-path>
Examples:
hdfs dfs -du /user/USERNAME/demo/data/file.txt
hdfs dfs -du -s -h /user/USERNAME/demo/data/
15. Create a zero-byte file in HDFS
Syntax:
hdfs dfs -touchz <hdfs-file-path>
Examples:
hdfs dfs -touchz /user/USERNAME/demo/data/zero-byte-file.txt
16. File test operations
Syntax:
hdfs dfs -test -[defsz] <hdfs-file-path>
Examples:
hdfs dfs -test -e /user/USERNAME/demo/data/file.txt
echo $?
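`-test` reports its result through the shell exit code rather than printed output, the same convention as the local `test` command. A local sketch of the pattern (plain shell against a temporary file, no cluster required):

```shell
# Exit code 0 means the check passed; non-zero means it failed.
tmpfile=$(mktemp)
test -e "$tmpfile" && echo "exit 0: exists" || echo "exit 1: missing"
rm -f "$tmpfile"
test -e "$tmpfile" && echo "exit 0: exists" || echo "exit 1: missing"
```

`hdfs dfs -test -e <path>` followed by `echo $?` (as in the example above) behaves the same way against an HDFS path.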
17. Get file/directory statistics
Syntax:
hdfs dfs -stat [format] <hdfs-file-path>
Format options:
%b - file size in bytes, %g - group name of owner
%n - filename, %o - block size
%r - replication, %u - user name of owner
%y - modification date
18. Get file/dir counts
Syntax:
hdfs dfs -count [-q] [-h] [-v] <hdfs-file-path>
Example:
hdfs dfs -count -v /user/USERNAME/demo/
19. Set the replication factor
Syntax:
hdfs dfs -setrep [-w] [-R] <n> <hdfs-file-path>
Examples:
hdfs dfs -setrep -w -R 2 /user/USERNAME/demo/data/file.txt
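Changing the replication factor changes how much raw disk the file consumes across the cluster. A minimal sketch of the cost (shell arithmetic only; the 1 GB file is a made-up example):

```shell
# Hypothetical 1 GB file: raw cluster storage before and after `-setrep 2`
size_gb=1
before=$(( size_gb * 3 ))   # default replication factor 3
after=$(( size_gb * 2 ))    # after reducing replication to 2
echo "before: ${before} GB, after: ${after} GB"
```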
20. Set the block size
Syntax:
hdfs dfs -D dfs.blocksize=<blocksize> -copyFromLocal <local-file-path> <hdfs-file-path>
Examples:
hdfs dfs -D dfs.blocksize=67108864 -copyFromLocal /path/to/file.txt /user/USERNAME/demo/block-example/
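The magic number in the example is simply 64 MB expressed in bytes; `dfs.blocksize` takes a byte count:

```shell
# 64 MB in bytes: the value passed to -D dfs.blocksize above
echo $((64 * 1024 * 1024))   # 67108864
```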
21. Empty the HDFS trash
Syntax:
hdfs dfs -expunge
Location:
/user/<username>/.Trash
Other hdfs commands (admin)
22. HDFS Admin Commands: fsck
Syntax:
hdfs fsck <hdfs-file-path>
Options:
[-list-corruptfileblocks | [-move | -delete | -openforwrite] [-files [-blocks [-locations | -racks]]]] [-includeSnapshots]
23. HDFS Admin Commands: dfsadmin
Syntax:
hdfs dfsadmin
Options:
[-report [-live] [-dead] [-decommissioning]]
[-safemode enter | leave | get | wait]
[-refreshNodes]
[-refresh <host:ipc_port> <key> [arg1..argn]]
[-shutdownDatanode <datanode:port> [upgrade]]
[-getDatanodeInfo <datanode_host:ipc_port>]
[-help [cmd]]
Examples:
hdfs dfsadmin -report -live
24. HDFS Admin Commands: namenode
Syntax:
hdfs namenode
Options:
[-checkpoint] | [-format [-clusterid cid] [-force] [-nonInteractive]] | [-upgrade [-clusterid cid]] | [-rollback] | [-recover [-force]] | [-metadataVersion]
Examples:
hdfs namenode -help
25. HDFS Admin Commands: getconf
Syntax:
hdfs getconf [-options]
Options:
[-namenodes] [-secondaryNameNodes] [-backupNodes] [-includeFile] [-excludeFile] [-nnRpcAddresses] [-confKey [key]]
Again: THE most important commands!
Syntax:
hdfs dfs -help [options]
hdfs dfs -usage [options]
Examples:
hdfs dfs -help help
hdfs dfs -usage usage
Interacting With HDFS
In Web Browser
Web HDFS
URL:
http://<namenode>:50070/explorer.html
Examples:
http://localhost:50070/explorer.html
http://ec2-52-23-214-111.compute-1.amazonaws.com:50070/explorer.html
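Besides the explorer UI, the same NameNode HTTP port serves the WebHDFS REST API. A sketch of assembling a directory-listing URL (string assembly only; `localhost:50070` matches the examples above, and against a live cluster the URL would be fetched with `curl`):

```shell
# Build a WebHDFS REST URL for listing a directory.
namenode="localhost"
port=50070
path="/user/USERNAME/demo"
url="http://${namenode}:${port}/webhdfs/v1${path}?op=LISTSTATUS"
echo "$url"
# On a running cluster: curl -i "$url"
```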
Resources
• Apache Apex - http://apex.apache.org/
• Subscribe - http://apex.apache.org/community.html
• Download - https://www.datatorrent.com/download/
• Twitter
ᵒ @ApacheApex; Follow - https://twitter.com/apacheapex
ᵒ @DataTorrent; Follow - https://twitter.com/datatorrent
• Meetups - http://www.meetup.com/topics/apache-apex
• Webinars - https://www.datatorrent.com/webinars/
• Videos - https://www.youtube.com/user/DataTorrent
• Slides - http://www.slideshare.net/DataTorrent/presentations
• Startup Accelerator Program - Full featured enterprise product
ᵒ https://www.datatorrent.com/product/startup-accelerator/
We Are Hiring
• [email protected]
• Developers/Architects
• QA Automation Developers
• Information Developers
• Build and Release
• Community Leaders
Thank You!!
APPENDIX
Copy data from one cluster to another
Description:
Copy data between Hadoop clusters
Syntax:
hadoop distcp hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo
hadoop distcp hdfs://nn1:8020/foo/a hdfs://nn1:8020/foo/b hdfs://nn2:8020/bar/foo
hadoop distcp -f hdfs://nn1:8020/srclist.file hdfs://nn2:8020/bar/foo
where srclist.file contains:
hdfs://nn1:8020/foo/a
hdfs://nn1:8020/foo/b