Hadoop Questions


FAQ’s For Data Science

1. What is the biggest data set that you have processed, and how did you process it? What was the result?
2. Tell me two success stories about your analytic or computer science projects. How was the lift (or success) measured?
3. How do you optimize a web crawler to run much faster, extract better information and summarize data to produce cleaner databases?
4. What is probabilistic merging (AKA fuzzy merging)? Is it easier to handle with SQL or other languages? Which languages would you choose for semi-structured text data reconciliation?
5. State any 3 positive and negative aspects about your favorite statistical software.
6. You are about to send one million emails (marketing campaign). How do you optimize delivery and its response? Can both of these be done separately?
7. How would you turn unstructured data into structured data? Is it really necessary? Is it okay to store data as flat text files rather than in an SQL-powered RDBMS?
8. In terms of access speed (assuming both fit within RAM), is it better to have 100 small hash tables or one big hash table in memory? What do you think about in-database analytics?
9. Can you perform logistic regression with Excel? If yes, how can it be done? Would the result be good?
10. Give examples of data that does not have a Gaussian or log-normal distribution. Also give examples of data that has a very chaotic distribution.
11. How can you prove that one improvement you've brought to an algorithm is really an improvement over not doing anything? How familiar are you with A/B testing?
12. What is sensitivity analysis? Is it better to have low sensitivity and low predictive power? How do you perform good cross-validation? What do you think about the idea of injecting noise into your data set to test the sensitivity of your models?
13. Compare logistic regression with decision trees and neural networks. How have these technologies improved over the last 15 years?
14. What is root cause analysis? How do you identify a cause vs. a correlation? Give examples.
15. How do you detect the best rule set for a fraud detection scoring technology? How do you deal with rule redundancy, rule discovery and the combinatorial nature of the problem? Can an approximate solution to the rule set problem be okay? How would you find an okay approximate solution? What factors will help you decide that it is good enough and stop looking for a better one?
16. Which tools do you use for visualization? What do you think of Tableau, R and SAS (for graphs)? How do you efficiently represent 5 dimensions in a chart or in a video?
17. Which is better: too many false positives or too many false negatives?
18. Have you used any of the following: time series models, cross-correlations with time lags, correlograms, spectral analysis, signal processing and filtering techniques? If yes, in which context?
19. What is the computational complexity of a good and fast clustering algorithm? What is a good clustering algorithm? How do you determine the number of clusters? How would you perform clustering on one million unique keywords, assuming you have 10 million data points where each one consists of two keywords and a metric measuring how similar these two keywords are? How would you create this 10-million data point table in the first place?
20. How can you fit non-linear relations between X (say, Age) and Y (say, Income) into a linear model?


FAQ’s For MongoDB

Are null values allowed?
Yes, but only for the members of an object. A null cannot be added to the database collection as it isn't an object. But {} can be added.

Does an update fsync to disk immediately?
No. Writes to disk are lazy by default. A write may only hit the disk a couple of seconds later. For example, if the database receives a thousand increments to an object within one second, it will only be flushed to disk once. (Note: fsync options are available both at the command line and via getLastError.)

How do I do transactions/locking?
MongoDB does not use traditional locking or complex transactions with rollback, as it is designed to be lightweight, fast and predictable in its performance. It can be thought of as analogous to MySQL's MyISAM autocommit model. By keeping transaction support extremely simple, performance is enhanced, especially in a system that may run across many servers.

Why are data files so large?
MongoDB does aggressive preallocation of reserved space to avoid file system fragmentation.

How long does replica set failover take?
It may take 10-30 seconds for the primary to be declared down by the other members and a new primary to be elected. During this window of time, the cluster is down for primary operations, i.e. writes and strongly consistent reads. However, eventually consistent queries may be executed on secondaries at any time (in slaveOk mode), including during this window.

What's a Master or Primary?
This is the node/member which is currently the primary and processes all writes for the replica set. During a failover event in a replica set, a different member can become primary.

What's a Secondary or Slave?
A secondary is a node/member which applies operations from the current primary. This is done by tailing the replication oplog (local.oplog.rs). Replication from primary to secondary is asynchronous; however, the secondary will try to stay as close to current as possible (often this is just a few milliseconds on a LAN).

Is it required to call 'getLastError' to make a write durable?
No. If 'getLastError' (aka 'Safe Mode') is not called, the server behaves exactly as if it had been called. The 'getLastError' call simply allows one to get a confirmation that the write operation was successfully committed. Of course, often you will want that confirmation, but the safety of the write and its durability are independent of it.

Should you start out with a Sharded or with a Non-Sharded MongoDB environment?
We suggest starting with Non-Sharded for simplicity and quick startup, unless your initial data set will not fit on a single server. Upgrading from Non-Sharded to Sharded is easy and seamless, so there is not a lot of advantage in setting up Sharding before your data set is large.

    How does Sharding work with replication? 


    Each Shard is a logical collection of partitioned data. The shard could consist of a single server or a cluster of replicas.

    Using a replica set for each Shard is highly recommended.

When will data be on more than one Shard?
MongoDB Sharding is range-based, so all the objects in a collection lie in chunks. Only when there is more than one chunk is there an option for multiple Shards to get data. Right now the default chunk size is 64 MB, so you need at least 64 MB of data before a migration will occur.

What happens when a document is updated on a chunk that is being migrated?
The update will go through immediately on the old Shard, and then the change will be replicated to the new Shard before ownership transfers.

What happens when a Shard is down or slow when querying?
If a Shard is down, the query will return an error unless the 'Partial' query option is set. If a Shard is responding slowly, Mongos will wait for it.

Can the old files in the 'moveChunk' directory be removed?
Yes, these files are created as backups during normal Shard balancing operations. Once the operations are done they can be deleted. The clean-up process is currently manual, so this needs to be taken care of to free up space.

How do you see the connections used by Mongos?
The following command needs to be used: db._adminCommand("connPoolStats");

If a 'moveChunk' fails, is it necessary to clean up the partially moved docs?
No, chunk moves are consistent and deterministic. The move will retry and, when completed, the data will be only on the new Shard.

What are the disadvantages of MongoDB?
1. A 32-bit edition has a 2 GB data limit. After that it will corrupt the entire DB, including the existing data. A 64-bit edition won't suffer from this bug/feature.
2. A default installation of MongoDB has asynchronous and batch commits turned on. Meaning, it acknowledges a write before it is actually stored and commits all changes in a batch at a later time. If there is a server crash or power failure, all those commits buffered in memory will be lost. This functionality can be disabled, but then it will perform as well as or worse than MySQL.
3. MongoDB is only ideal for implementing things like analytics/caching where the impact of a small data loss is negligible.
4. In MongoDB, it's difficult to represent relationships between data, so you end up doing that manually by creating another collection to represent the relationship between documents in two or more collections.


    FAQ’s For Hadoop Administration 

Explain checkpointing in Hadoop and why is it important?
Checkpointing is an essential part of maintaining and persisting filesystem metadata in HDFS. It's crucial for efficient Namenode recovery and restart, and is an important indicator of overall cluster health.

The Namenode persists filesystem metadata. At a high level, the Namenode's primary responsibility is to store the HDFS namespace, meaning things like the directory tree, file permissions and the mapping of files to block IDs. It is essential that this metadata is safely persisted to stable storage for fault tolerance.

    This filesystem metadata is stored in two different parts: the fsimage and the edit log. The fsimage is a file that

    represents a point-in-time snapshot of the filesystem’s metadata. However, while the fsimage file format is very

    efficient to read, it’s unsuitable for making small incremental updates like renaming a single file. Thus, rather than

    writing a new fsimage every time the namespace is modified, the NameNode instead records the modifying operation

in the edit log for durability. This way, if the NameNode crashes, it can restore its state by first loading the fsimage and then replaying all the operations (also called edits or transactions) in the edit log to catch up to the most recent state

    of the namesystem. The edit log comprises a series of files, called edit log segments, that together represent all the

    namesystem modifications made since the creation of the fsimage.

What is the default block size in HDFS and what are the benefits of having larger block sizes?
Most block-structured file systems use a block size on the order of 4 or 8 KB. By contrast, the default block size in HDFS is 64 MB, and often larger. This allows HDFS to decrease the amount of metadata storage required per file. Furthermore, it allows fast streaming reads of data, by keeping large amounts of data sequentially organized on the disk. As a result, HDFS is expected to hold very large files that are read sequentially. Unlike a file system such as NTFS or EXT, which has numerous small files, HDFS stores a modest number of very large files: hundreds of megabytes, or gigabytes, each.
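For illustration, a minimal Java sketch (property name per Hadoop 1.x; the path and sizes are hypothetical) showing how the configured block size can be read and how it can be overridden for one file at create time:

    // A minimal sketch, assuming a Hadoop 1.x-style configuration.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSizeDemo {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Read the configured default block size (falls back to 64 MB here).
        long defaultBlock = conf.getLong("dfs.block.size", 64L * 1024 * 1024);
        System.out.println("Default block size: " + defaultBlock + " bytes");

        // Override the block size (128 MB) for one specific file at create time.
        FileSystem fs = FileSystem.get(conf);
        FSDataOutputStream out =
            fs.create(new Path("/user/demo/bigfile.dat"), true, 4096, (short) 3, 128L * 1024 * 1024);
        out.close();
        fs.close();
      }
    }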

What are the two main modules which help you interact with HDFS and what are they used for?
user@machine:hadoop$ bin/hadoop moduleName -cmd args...

The moduleName tells the program which subset of Hadoop functionality to use. -cmd is the name of a specific command within this module to execute. Its arguments follow the command name.

The two modules relevant to HDFS are dfs and dfsadmin. The dfs module, also known as 'FsShell', provides basic file manipulation operations and works with objects within the file system. The dfsadmin module manipulates or queries the file system as a whole.

    https://www.facebook.com/chatchindiahttps://www.facebook.com/chatchindiahttps://www.facebook.com/chatchindia

  • 8/20/2019 Hadoop Questions

    6/41

     

    BIG DATA HADOOP BANK 

    6 | P a g e https://www.facebook.com/chatchindia 

How can I setup Hadoop nodes (datanodes/namenodes) to use multiple volumes/disks?
Datanodes can store blocks in multiple directories, typically located on different local disk drives. In order to set up multiple directories one needs to specify a comma-separated list of pathnames as the value of the config parameter dfs.data.dir/dfs.datanode.data.dir. Datanodes will attempt to place equal amounts of data in each of the directories.

The Namenode also supports multiple directories, which store the namespace image and edit logs. In order to set up multiple directories one needs to specify a comma-separated list of pathnames as the value of the config parameter dfs.name.dir/dfs.namenode.name.dir. The Namenode directories are used for namespace data replication, so that the image and log can be restored from the remaining disks/volumes if one of the disks fails.
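For illustration, a minimal sketch of the comma-separated value format. In practice these settings live in hdfs-site.xml; they are shown here via the Java Configuration API, and the disk paths are hypothetical:

    // A minimal sketch, assuming Hadoop 1.x property names; paths are hypothetical.
    import org.apache.hadoop.conf.Configuration;

    public class MultiVolumeConfig {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Datanode: spread blocks over three local disks.
        conf.set("dfs.data.dir", "/disk1/dfs/data,/disk2/dfs/data,/disk3/dfs/data");
        // Namenode: keep the namespace image and edit logs on two separate volumes.
        conf.set("dfs.name.dir", "/disk1/dfs/name,/mnt/backup/dfs/name");
        System.out.println("dfs.data.dir = " + conf.get("dfs.data.dir"));
        System.out.println("dfs.name.dir = " + conf.get("dfs.name.dir"));
      }
    }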

How do you read a file from HDFS?
The following are the steps for doing this:
Step 1: The client uses a Hadoop client program to make the request.
Step 2: The client program reads the cluster config file on the local machine, which tells it where the namenode is located. This has to be configured ahead of time.
Step 3: The client contacts the NameNode and requests the file it would like to read.
Step 4: Client validation is checked by username or by a strong authentication mechanism like Kerberos.
Step 5: The client's validated request is checked against the owner and permissions of the file.
Step 6: If the file exists and the user has access to it, the NameNode responds with the first block id and provides a list of datanodes where a copy of the block can be found, sorted by their distance to the client (reader).
Step 7: The client now contacts the most appropriate datanode directly and reads the block data. This process repeats until all blocks in the file have been read or the client closes the file stream.

If the datanode dies while the file is being read, the library will automatically attempt to read another replica of the data from another datanode. If all replicas are unavailable, the read operation fails and the client receives an exception. If the block location information returned by the NameNode is outdated by the time the client attempts to contact a datanode, a retry will occur if there are other replicas, or the read will fail.
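For illustration, a minimal Java sketch of such a client-side read using the Hadoop FileSystem API; the namenode address and file path below are hypothetical:

    // A minimal sketch of a client-side HDFS read; address and path are hypothetical.
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsRead {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();            // picks up core-site.xml/hdfs-site.xml
        conf.set("fs.default.name", "hdfs://namenode:8020");  // where the Namenode is located
        FileSystem fs = FileSystem.get(conf);                 // the client asks the NameNode for metadata
        Path file = new Path("/user/demo/input.txt");
        BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(file)));
        String line;
        while ((line = reader.readLine()) != null) {          // block data is streamed from datanodes
          System.out.println(line);
        }
        reader.close();
        fs.close();
      }
    }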

What are schedulers and what are the three types of schedulers that can be used in a Hadoop cluster?
Schedulers are responsible for assigning tasks to open slots on tasktrackers. The scheduler is a plug-in within the jobtracker. The three types of schedulers are:


1. FIFO (First In First Out) Scheduler
2. Fair Scheduler
3. Capacity Scheduler

How do you decide which scheduler to use?
The Capacity Scheduler (CS) can be used under the following situations:
1. When you know a lot about your cluster workloads and utilization and simply want to enforce resource allocation.
2. When you have very little fluctuation within queue utilization. The CS's more rigid resource allocation makes sense when all queues are at capacity almost all the time.
3. When you have high variance in the memory requirements of jobs and you need the CS's memory-based scheduling support.
4. When you demand scheduler determinism.

The Fair Scheduler can be used over the Capacity Scheduler under the following conditions:
1. When you have a slow network and data locality makes a significant difference to job runtime, features like delay scheduling can make a dramatic difference in the effective locality rate of map tasks.
2. When you have a lot of variability in the utilization between pools, the Fair Scheduler's pre-emption model achieves much greater overall cluster utilization by giving away otherwise reserved resources when they're not used.
3. When you require jobs within a pool to make equal progress rather than running in FIFO order.

Why are the 'dfs.name.dir' and 'dfs.data.dir' parameters used? Where are they specified and what happens if you don't specify these parameters?
dfs.name.dir specifies the path of the directory in the Namenode's local file system where HDFS metadata is stored, and dfs.data.dir specifies the path of the directory in a Datanode's local file system where HDFS file blocks are stored. These parameters are specified in the hdfs-site.xml config file of all nodes in the cluster, including master and slave nodes.

If these parameters are not specified, the Namenode's metadata and the Datanodes' file block information get stored in /tmp under a hadoop-username directory. This is not a safe place: when nodes are restarted, the data will be lost. It is critical if the Namenode is restarted, as its formatting information will be lost.

What is the file system checking utility FSCK used for? What kind of information does it show? Can FSCK show information about files which are open for writing by a client?
The file system checking utility fsck is used to check and display the health of the file system and the files and blocks in it. When used with a path (bin/hadoop fsck <path> -files -blocks -locations -racks) it recursively shows the health of all files under the path, and when used with '/' it checks the entire file system. By default fsck ignores files still open for writing by a client. To list such files, run fsck with the -openforwrite option.


FSCK checks the file system, prints out a dot for each file found healthy, and prints a message for the ones that are less than healthy, including the ones which have over-replicated blocks, under-replicated blocks, mis-replicated blocks, corrupt blocks and missing replicas.

What are the important configuration files that need to be updated/edited to setup a fully distributed mode of Hadoop cluster 1.x (Apache distribution)?
The configuration files that need to be updated to set up a fully distributed mode of Hadoop are:
1. hadoop-env.sh
2. core-site.xml
3. hdfs-site.xml
4. mapred-site.xml
5. masters
6. slaves

These files can be found in your Hadoop conf directory. If Hadoop daemons are started individually using 'bin/hadoop-daemon.sh start <daemon>', where <daemon> is the name of the daemon, then the masters and slaves files need not be updated and can be empty. This way of starting daemons requires the command to be issued on the appropriate nodes to start the appropriate daemons. If Hadoop daemons are started using 'bin/start-dfs.sh' and 'bin/start-mapred.sh', then the masters and slaves configuration files on the namenode machine need to be updated:
masters: IP address/hostname of the node where the secondarynamenode will run.
slaves: IP addresses/hostnames of the nodes where the datanodes (and eventually the task trackers) will run.

FAQ’s For Hadoop HDFS

What is Big Data?
Big Data is nothing but an assortment of such huge and complex data that it becomes very tedious to capture, store, process, retrieve and analyze it with the help of on-hand database management tools or traditional data processing techniques.

Can you give some examples of Big Data?
There are many real-life examples of Big Data: Facebook generates 500+ terabytes of data per day, NYSE (New York Stock Exchange) generates about 1 terabyte of new trade data per day, and a jet airline collects 10 terabytes of sensor data for every 30 minutes of flying time. All these are day-to-day examples of Big Data!

Can you give a detailed overview about the Big Data being generated by Facebook?


As of December 31, 2012, there were 1.06 billion monthly active users on Facebook and 680 million mobile users. On average, 3.2 billion likes and comments are posted every day on Facebook, and 72% of the web audience is on Facebook. And why not! There are so many activities going on on Facebook, from wall posts, sharing images and videos, to writing comments and liking posts. In fact, Facebook started using Hadoop in mid-2009 and was one of the initial users of Hadoop.

What are the four characteristics of Big Data?
According to IBM, the four characteristics of Big Data are:
Volume: Facebook generating 500+ terabytes of data per day.
Velocity: analyzing 2 million records each day to identify the reasons for losses.
Variety: images, audio, video, sensor data, log files, etc.
Veracity: biases, noise and abnormality in data.

How big is 'Big Data'?
With time, data volume is growing exponentially. Earlier we used to talk about megabytes or gigabytes, but the time has arrived when we talk about data volume in terms of terabytes, petabytes and even zettabytes! Global data volume was around 1.8 ZB in 2011 and is expected to be 7.9 ZB in 2015. It is also known that global information doubles every two years!

    How is analysis of Big Data useful for organizations?  

    Effective analysis of Big Data provides a lot of business advantage as organizations will learn which areas to focus

    on and which areas are less important. Big data analysis provides some early key indicators that can prevent the

    company from a huge loss or help in grasping a great opportunity with open hands! A precise analysis of Big Data

    helps in decision making! For instance, nowadays people rely so much on Facebook and Twitter before buying any

    product or service. All thanks to the Big Data explosion.

Who are 'Data Scientists'?
Data scientists are soon replacing business analysts or data analysts. Data scientists are experts who find solutions to analyze data. Just as we have web analysts, we have data scientists who have good business insight as to how to handle a business challenge. Sharp data scientists are not only involved in dealing with business problems, but also in choosing the relevant issues that can bring value-addition to the organization.

    What is Hadoop? 

    Hadoop is a framework that allows for distributed processing of large data sets across clusters of commodity

    computers using a simple programming model.


    Why do we need Hadoop? 


Every day a large amount of unstructured data is getting dumped into our machines. The major challenge is not to store large data sets in our systems but to retrieve and analyze the big data in organizations, and that too for data present in different machines at different locations. In this situation a necessity for Hadoop arises. Hadoop has the ability to analyze the data present in different machines at different locations very quickly and in a very cost-effective way. It uses the concept of MapReduce, which enables it to divide a query into small parts and process them in parallel. This is also known as parallel computing.

What are some of the characteristics of the Hadoop framework?
The Hadoop framework is written in Java. It is designed to solve problems that involve analyzing large data (e.g. petabytes). The programming model is based on Google's MapReduce and the infrastructure is based on Google's distributed file system (GFS). Hadoop handles large files/data throughput and supports data-intensive distributed applications. Hadoop is scalable, as more nodes can be easily added to it.

Give a brief overview of Hadoop history.
In 2002, Doug Cutting created an open source web crawler project. In 2004, Google published its MapReduce and GFS papers. In 2006, Doug Cutting developed the open source MapReduce and HDFS project. In 2008, Yahoo ran a 4,000-node Hadoop cluster and Hadoop won the terabyte sort benchmark. In 2009, Facebook launched SQL support for Hadoop.

Give examples of some companies that are using the Hadoop structure.
A lot of companies are using the Hadoop structure, such as Cloudera, EMC, MapR, Hortonworks, Amazon, Facebook, eBay, Twitter, Google and so on.

What is the basic difference between a traditional RDBMS and Hadoop?
A traditional RDBMS is used for transactional systems to report and archive the data, whereas Hadoop is an approach to store huge amounts of data in a distributed file system and process it. An RDBMS will be useful when you want to seek one record from Big Data, whereas Hadoop will be useful when you want Big Data in one shot and perform analysis on that later.

What is structured and unstructured data?
Structured data is data that is easily identifiable as it is organized in a structure. The most common form of structured data is a database where specific information is stored in tables, that is, rows and columns. Unstructured data refers to any data that cannot be identified easily. It could be in the form of images, videos, documents, email, logs and random text. It is not in the form of rows and columns.

What are the core components of Hadoop?
The core components of Hadoop are HDFS and MapReduce. HDFS is basically used to store large data sets and MapReduce is used to process such large data sets.


Now, let's get cracking with the hard stuff: What is HDFS?
HDFS is a file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.

    What are the key features of HDFS? 

    HDFS is highly fault-tolerant, with high throughput, suitable for applications with large data sets, streaming access to

    file system data and can be built out of commodity hardware.

What is Fault Tolerance?
Suppose you have a file stored in a system, and due to some technical problem that file gets destroyed. Then there is no chance of getting back the data present in that file. To avoid such situations, Hadoop has introduced the feature of fault tolerance in HDFS. In Hadoop, when we store a file, it automatically gets replicated at two other locations as well. So even if one or two of the systems collapse, the file is still available on the third system.

Replication causes data redundancy, then why is it pursued in HDFS?
HDFS works with commodity hardware (systems with average configurations) that has high chances of crashing at any time. Thus, to make the entire system highly fault-tolerant, HDFS replicates and stores data in different places. Any data on HDFS gets stored at at least 3 different locations. So, even if one of them is corrupted and another is unavailable for some time for any reason, the data can still be accessed from the third one. Hence, there is no chance of losing the data. This replication factor helps us to attain the Hadoop feature called fault tolerance.

Since the data is replicated thrice in HDFS, does it mean that any calculation done on one node will also be replicated on the other two?
Since there are 3 nodes, when we send the MapReduce programs, calculations will be done only on the original data. The master node will know which node exactly has that particular data. If one of the nodes is not responding, it is assumed to have failed. Only then will the required calculation be done on the second replica.

What is throughput? How does HDFS get a good throughput?
Throughput is the amount of work done in a unit of time. It describes how fast the data is getting accessed from the system and is usually used to measure the performance of the system. In HDFS, when we want to perform a task or an action, the work is divided and shared among different systems. So all the systems will be executing the tasks assigned to them independently and in parallel, and the work will be completed in a very short period of time. In this way, HDFS gives good throughput. By reading data in parallel, we decrease the actual time to read data tremendously.

    What is streaming access? 


As HDFS works on the principle of 'Write Once, Read Many', the feature of streaming access is extremely important in HDFS. HDFS focuses not so much on storing the data but on how to retrieve it at the fastest possible speed, especially while analyzing logs. In HDFS, reading the complete data is more important than the time taken to fetch a single record from the data.

What is commodity hardware? Does commodity hardware include RAM?
Commodity hardware is a non-expensive system which is not of high quality or high availability. Hadoop can be installed on any average commodity hardware. We don't need supercomputers or high-end hardware to work on Hadoop. Yes, commodity hardware includes RAM, because there will be some services which will be running on RAM.

    What is a Namenode? 

    Namenode is the master node on which job tracker runs and consists of the metadata. It maintains and manages the

    blocks which are present on the datanodes. It is a high-availability machine and single point of failure in HDFS.

Is the Namenode also commodity hardware?
No. The Namenode can never be commodity hardware because the entire HDFS relies on it. It is the single point of failure in HDFS. The Namenode has to be a high-availability machine.

What is metadata?
Metadata is the information about the data stored in datanodes, such as the location of the file, the size of the file and so on.

    What is a Datanode? 

    Datanodes are the slaves which are deployed on each machine and provide the actual storage. These are responsible

    for serving read and write requests for the clients.

Why do we use HDFS for applications having large data sets and not when there are lots of small files?
HDFS is more suitable for a large amount of data in a single file as compared to a small amount of data spread across multiple files. This is because the Namenode is a very expensive, high-performance system, so it is not prudent to occupy its space with the unnecessary amount of metadata that is generated for multiple small files. So, when there is a large amount of data in a single file, the Namenode will occupy less space. Hence, for optimized performance, HDFS supports large data sets instead of multiple small files.

What is a daemon?
A daemon is a process or service that runs in the background. In general, we use this word in the UNIX environment. The equivalent of a daemon in Windows is "services" and in DOS it is "TSR".


What is a job tracker?
The job tracker is a daemon that runs on a namenode for submitting and tracking MapReduce jobs in Hadoop. It assigns the tasks to the different task trackers. In a Hadoop cluster, there will be only one job tracker but many task trackers. It is the single point of failure for Hadoop and the MapReduce service. If the job tracker goes down, all the running jobs are halted. It receives heartbeats from the task trackers, based on which the job tracker decides whether the assigned task is completed or not.

What is a task tracker?
A task tracker is also a daemon that runs on datanodes. Task trackers manage the execution of individual tasks on slave nodes. When a client submits a job, the job tracker will initialize the job, divide the work and assign the parts to different task trackers to perform MapReduce tasks. While performing this action, the task tracker will be simultaneously communicating with the job tracker by sending heartbeats. If the job tracker does not receive a heartbeat from a task tracker within the specified time, it will assume that the task tracker has crashed and assign that task to another task tracker in the cluster.

Is the Namenode machine the same as the datanode machine in terms of hardware?
It depends upon the cluster you are trying to create. The Hadoop VM can be there on the same machine or on another machine. For instance, in a single-node cluster, there is only one machine, whereas in a development or testing environment, the Namenode and datanodes are on different machines.

What is a heartbeat in HDFS?
A heartbeat is a signal indicating that a node is alive. A datanode sends heartbeats to the Namenode and a task tracker sends its heartbeats to the job tracker. If the Namenode or job tracker does not receive a heartbeat, they will decide that there is some problem in the datanode, or that the task tracker is unable to perform the assigned task.

    Are Namenode and job tracker on the same host? 

    No, in practical environment, Namenode is on a separate host and job tracker is on a separate host.

What is a 'block' in HDFS?
A 'block' is the minimum amount of data that can be read or written. In HDFS, the default block size is 64 MB, in contrast to the block size of 8192 bytes in Unix/Linux. Files in HDFS are broken down into block-sized chunks, which are stored as independent units. HDFS blocks are large compared to disk blocks, particularly to minimize the cost of seeks.

If a particular file is 50 MB, will the HDFS block still consume 64 MB as the default size?


No, not at all! 64 MB is just a unit where the data will be stored. In this particular situation, only 50 MB will be consumed by an HDFS block and 14 MB will be free to store something else. It is the MasterNode that does data allocation in an efficient manner.

    What are the benefits of block transfer? 

     A file can be larger than any single disk in the network. There’s nothing that requires the blocks from a file to be stored

    on the same disk, so they can take advantage of any of the disks in the cluster. Making the unit of abstraction a block

    rather than a file simplifies the storage subsystem. Blocks provide fault tolerance and availability. To insure against

    corrupted blocks and disk and machine failure, each block is replicated to a small number of physically separate

    machines (typically three). If a block becomes unavailable, a copy can be read from another location in a way that is

    transparent to the client.

If we want to copy 10 blocks from one machine to another, but the other machine can copy only 8.5 blocks, can the blocks be broken at the time of replication?
In HDFS, blocks cannot be broken down. Before copying the blocks from one machine to another, the Master node will figure out what the actual amount of space required is, how many blocks are being used, how much space is available, and it will allocate the blocks accordingly.

How is indexing done in HDFS?
Hadoop has its own way of indexing. Depending upon the block size, once the data is stored, HDFS will keep on storing the last part of the data, which will say where the next part of the data will be. In fact, this is the basis of HDFS.

If a datanode is full, how is it identified?
When data is stored in a datanode, the metadata of that data will be stored in the Namenode. So the Namenode will identify if the datanode is full.

If datanodes increase, then do we need to upgrade the Namenode?
While installing the Hadoop system, the Namenode is determined based on the size of the cluster. Most of the time, we do not need to upgrade the Namenode because it does not store the actual data, but just the metadata, so such a requirement rarely arises.

    Are job tracker and task trackers present in separate machines? 


    Yes, job tracker and task tracker are present in different machines. The reason is job tracker is a single point of failure

    for the Hadoop MapReduce service. If it goes down, all running jobs are halted.

When we send data to a node, do we allow settling-in time before sending more data to that node?
Yes, we do.

Does Hadoop always require digital data to process?
Yes. Hadoop always requires digital data to be processed.

On what basis will the Namenode decide which datanode to write to?
As the Namenode has the metadata (information) related to all the datanodes, it knows which datanode is free.

Doesn't Google have its very own version of DFS?
Yes, Google owns a DFS known as the "Google File System (GFS)", developed by Google Inc. for its own use.

Who is a 'user' in HDFS?
A user is like you or me, someone who has some query or who needs some kind of data.

    Is client the end user in HDFS? 

    No, Client is an application which runs on your machine, which is used to interact with the Namenode (job tracker) or

    datanode (task tracker).

    What is the communication channel between client and namenode/datanode? 

    The mode of communication is SSH.

What is a rack?
A rack is a storage area with all the datanodes put together. These datanodes can be physically located at different places. A rack is a physical collection of datanodes which are stored at a single location. There can be multiple racks in a single location.

On what basis will data be stored on a rack?
When the client is ready to load a file into the cluster, the content of the file will be divided into blocks. The client then consults the Namenode and gets 3 datanodes for every block of the file, which indicates where each block should be stored. While placing the datanodes, the key rule followed is "for every block of data, two copies will exist in one rack and the third copy in a different rack". This rule is known as the "Replica Placement Policy".


    Do we need to place 2nd and 3rd data in rack 2 only? 

    Yes, this is to avoid datanode failure.

What if rack 2 and the datanode fail?
If both rack 2 and the datanode present in rack 1 fail, then there is no chance of getting data from them. In order to avoid such situations, we need to replicate that data more times instead of replicating it only thrice. This can be done by changing the value of the replication factor, which is set to 3 by default.
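For illustration, a minimal Java sketch (paths and values are hypothetical) of raising the replication factor, either as the default for new writes or for one existing file:

    // A minimal sketch; "/user/demo/important.dat" and the value 4 are illustrative.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SetReplication {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "4");                  // default for files written by this client
        FileSystem fs = FileSystem.get(conf);
        // Change the replication factor of an already-stored file to 4 copies.
        fs.setReplication(new Path("/user/demo/important.dat"), (short) 4);
        fs.close();
      }
    }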

    What is a Secondary Namenode? Is it a substitute to the Namenode?  

    The secondary Namenode constantly reads the data from the RAM of the Namenode and writes it into the hard disk

    or the file system. It is not a substitute to the Namenode, so if the Namenode fails, the entire Hadoop system goes

    down.

What is the difference between Gen1 and Gen2 Hadoop with regards to the Namenode?
In Gen 1 Hadoop, the Namenode is the single point of failure. In Gen 2 Hadoop, we have what is known as an Active and Passive Namenode structure. If the active Namenode fails, the passive Namenode takes over.

What is MapReduce?
MapReduce is the 'heart' of Hadoop and consists of two parts: 'map' and 'reduce'. Maps and reduces are programs for processing data. 'Map' processes the data first to give some intermediate output, which is further processed by 'Reduce' to generate the final output. Thus, MapReduce allows for distributed processing of the map and reduction operations.

Can you explain how 'map' and 'reduce' work?
The Namenode takes the input, divides it into parts and assigns them to datanodes. These datanodes process the tasks assigned to them, make key-value pairs and return the intermediate output to the reducer. The reducer collects the key-value pairs from all the datanodes, combines them and generates the final output.

What is a 'key-value pair' in HDFS?
A key-value pair is the intermediate data generated by maps and sent to reduces for generating the final output.

What is the difference between the MapReduce engine and the HDFS cluster?
HDFS cluster is the name given to the whole configuration of master and slaves where data is stored. The MapReduce engine is the programming module which is used to retrieve and analyze data.

    Is map like a pointer? 


    No, Map is not like a pointer.

Do we require two servers for the Namenode and the datanodes?
Yes, we need two different servers for the Namenode and the datanodes. This is because the Namenode requires a highly configured system, as it stores information about the location details of all the files stored in different datanodes, whereas the datanodes require a low-configuration system.

    Why are the number of splits equal to the number of maps? 

    The number of maps is equal to the number of input splits because we want the key and value pairs of all the input

    splits.

Is a job split into maps?
No, a job is not split into maps. A split is created for the file. The file is placed on datanodes in blocks. For each split, a map is needed.

Which are the two types of 'writes' in HDFS?
There are two types of writes in HDFS: posted and non-posted writes. A posted write is when we write it and forget about it, without worrying about the acknowledgement. It is similar to our traditional Indian post. In a non-posted write, we wait for the acknowledgement. It is similar to today's courier services. Naturally, a non-posted write is much more expensive than a posted write, though both writes are asynchronous.

Why is 'reading' done in parallel and 'writing' not in HDFS?
Reading is done in parallel because by doing so we can access the data fast. But we do not perform the write operation in parallel. The reason is that if we perform the write operation in parallel, it might result in data inconsistency. For example, if you have a file and two nodes are trying to write data into the file in parallel, then the first node does not know what the second node has written and vice versa. So, this makes it unclear which data should be stored and accessed.

Can Hadoop be compared to a NoSQL database like Cassandra?
Though NoSQL is the closest technology that can be compared to Hadoop, it has its own pros and cons. There is no DFS in NoSQL. Hadoop is not a database; it's a filesystem (HDFS) and a distributed programming framework (MapReduce).

FAQ’s For Hadoop Cluster

Which are the three modes in which Hadoop can be run?
The three modes in which Hadoop can be run are:


    1. standalone (local) mode

    2. Pseudo-distributed mode

    3. Fully distributed mode

What are the features of Standalone (local) mode?
In standalone mode there are no daemons and everything runs on a single JVM. It has no DFS and utilizes the local file system. Standalone mode is suitable only for running MapReduce programs during development. It is one of the least used environments.

    What are the features of Pseudo mode?

    Pseudo mode is used both for development and in the QA environment. In the Pseudo mode all the daemons run on

    the same machine.

Can we call VMs pseudos?
No, VMs are not pseudos, because a VM is something different and pseudo mode is very specific to Hadoop.

What are the features of Fully Distributed mode?
Fully Distributed mode is used in the production environment, where we have 'n' number of machines forming a Hadoop cluster. Hadoop daemons run on a cluster of machines. There is one host on which the Namenode is running, other hosts on which datanodes are running, and then there are machines on which the task trackers are running. We have separate masters and separate slaves in this distribution.

Does Hadoop follow the UNIX pattern?
Yes, Hadoop closely follows the UNIX pattern. Hadoop also has a 'conf' directory, as in the case of UNIX.

In which directory is Hadoop installed?
Cloudera and Apache have the same directory structure. Hadoop is installed in /usr/lib/hadoop-0.20/.

What are the port numbers of the Namenode, job tracker and task tracker?
The web UI port number for the Namenode is 50070, for the job tracker it is 50030 and for the task tracker it is 50060.

What is the Hadoop core configuration?
Hadoop core used to be configured by two xml files:
1. hadoop-default.xml
2. hadoop-site.xml
These files are written in xml format. We have certain properties in these xml files, which consist of a name and a value. But these files do not exist now.

    What are the Hadoop configuration files at present?

    There are 3 configuration files in Hadoop:

    1. core-site.xml 


2. hdfs-site.xml
3. mapred-site.xml
These files are located in the conf/ subdirectory.

How do you exit the Vi editor?
To exit the Vi editor, press ESC, type :q and then press Enter.

    What is a spill factor with respect to the RAM?

    Spill factor is the size after which your files move to the temp file. Hadoop-temp directory is used for this.

Is fs.mapr.working.dir a single directory?
Yes, fs.mapr.working.dir is just one directory.

Which are the three main hdfs-site.xml properties?
The three main hdfs-site.xml properties are:
1. dfs.name.dir, which gives you the location where the metadata will be stored and where DFS is located - on disk or on a remote directory.
2. dfs.data.dir, which gives you the location where the data is going to be stored.
3. fs.checkpoint.dir, which is for the secondary Namenode.
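For illustration, a minimal Java sketch that prints these properties; the hdfs-site.xml location shown is the Cloudera-style path mentioned elsewhere in this document and may differ on your cluster:

    // A minimal sketch; the resource path is an assumption, not a fixed location.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;

    public class ShowHdfsSiteProps {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.addResource(new Path("/usr/lib/hadoop-0.20/conf/hdfs-site.xml"));
        System.out.println("dfs.name.dir      = " + conf.get("dfs.name.dir"));
        System.out.println("dfs.data.dir      = " + conf.get("dfs.data.dir"));
        System.out.println("fs.checkpoint.dir = " + conf.get("fs.checkpoint.dir"));
      }
    }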

    How to come out of the insert mode?

    To come out of the insert mode, press ESC, type :q (if you have not written anything) OR type :wq (if you have written

    anything in the file) and then press ENTER.

What is Cloudera and why is it used?
Cloudera is a distribution of Hadoop; 'cloudera' is also the user created on the VM by default. The Cloudera distribution is built on Apache Hadoop and is used for data processing.

    What happens if you get a ‘connection refused java exception’ when you type hadoop fsck /?

    It could mean that the Namenode is not working on your VM.

We are using the Ubuntu operating system with Cloudera, but from where can we download Hadoop, or does it come by default with Ubuntu?
Hadoop does not come by default with Ubuntu; it is a configuration that you have to download from Cloudera or from Edureka's dropbox and then run on your systems. You can also proceed with your own configuration, but you need a Linux box, be it Ubuntu or Red Hat. There are installation steps at the Cloudera location or in Edureka's dropbox. You can go either way.

    What does ‘jps’ command do?

    This command checks whether your Namenode, datanode, task tracker, job tracker, etc are working or not.


How can I restart the Namenode?
1. Run stop-all.sh and then run start-all.sh, OR
2. Write sudo hdfs (press enter), su-hdfs (press enter), /etc/init.d/ha (press enter) and then /etc/init.d/hadoop-0.20-namenode start (press enter).

What is the full form of fsck?
The full form of fsck is File System Check.

    How can we check whether Namenode is working or not?

    To check whether Namenode is working or not, use the command   /etc/init.d/hadoop-0.20-namenode status or as

    simple as jps. 

What does mapred.job.tracker do?
The mapred.job.tracker property specifies which of your nodes is acting as the job tracker.

What does /etc/init.d do?
/etc/init.d specifies where daemons (services) are placed, or is used to see the status of these daemons. It is very Linux specific, and has nothing to do with Hadoop.

How can we look for the Namenode in the browser?
If you have to look for the Namenode in the browser, you don't use localhost:8021; the port number to look for the Namenode in the browser is 50070.

    How to change from SU to Cloudera?

    To change from SU to Cloudera just type exit.

Which files are used by the startup and shutdown commands?
The slaves and masters files are used by the startup and the shutdown commands.

What do the slaves consist of?
The slaves file consists of a list of hosts, one per line, that host datanode and task tracker servers.

    What do masters consist of?

    Masters contain a list of hosts, one per line, that are to host secondary namenode servers.

    What does hadoop-env.sh do?

    hadoop-env.sh provides the environment for Hadoop to run. JAVA_HOME is set over here.

    Can we have multiple entries in the master files?

    Yes, we can have multiple entries in the Master files.

    Where is hadoop-env.sh file present?

    hadoop-env.sh file is present in the conf  location.

    In Hadoop_PID_DIR, what does PID stands for?

    PID stands for ‘Process ID’.


    What does /var/hadoop/pids do?

    It stores the PID.

What does the hadoop-metrics.properties file do?
hadoop-metrics.properties is used for 'reporting' purposes. It controls the reporting for Hadoop. The default status is 'not to report'.

What are the network requirements for Hadoop?
The Hadoop core uses Shell (SSH) to launch the server processes on the slave nodes. It requires a password-less SSH connection between the master and all the slaves and the secondary machines.

Why do we need password-less SSH in a Fully Distributed environment?
We need password-less SSH in a Fully Distributed environment because when the cluster is live and running in Fully Distributed mode, the communication is too frequent. The job tracker should be able to send a task to a task tracker quickly.

Does this lead to security issues?
No, not at all. A Hadoop cluster is an isolated cluster and generally has nothing to do with the internet. It has a different kind of configuration, so we needn't worry about that kind of security breach, for instance, someone hacking through the internet, and so on. Hadoop has a very secured way to connect to other machines to fetch and to process data.

    On which port does SSH work?

    SSH works on Port No. 22, though it can be configured. 22 is the default Port number.

Can you tell us more about SSH?
SSH is nothing but a secure shell communication; it is a kind of protocol that works on Port No. 22, and when you do an SSH, what you really require is a password.

Why is a password needed in SSH localhost?
A password is required in SSH for security and in a situation where password-less communication is not set up.

Do we need to give a password, even if the key is added in SSH?
Yes, a password is still required even if the key is added in SSH.

What if a Namenode has no data?
If a Namenode has no data, it is not a Namenode. Practically, the Namenode will have some data.

What happens to the job tracker when the Namenode is down?
When the Namenode is down, your cluster is OFF. This is because the Namenode is the single point of failure in HDFS.

    What happens to a Namenode, when job tracker is down?


When the job tracker is down, it will not be functional but the Namenode will be present. So, the cluster is accessible if the Namenode is working, even if the job tracker is not working.

Can you give us some more details about SSH communication between the Masters and the Slaves?
SSH is a password-less secure communication where data packets are sent across to the slave. It has some format into which data is sent across. SSH is not only between masters and slaves, but also between two hosts.

    What is formatting of the DFS?

    Just like we do for Windows, DFS is formatted for proper structuring. It is not usually done as it formats the Namenode

    too.

    Does the HDFS client decide the input split or Namenode?

    No, the Client does not decide. It is already specified in one of the configurations through which input split is already

    configured.

In Cloudera there is already a cluster, but if I want to form a cluster on Ubuntu, can we do it?
Yes, you can go ahead with this! There are installation steps for creating a new cluster. You can uninstall your present cluster and install the new cluster.

    Can we create a Hadoop cluster from scratch?

    Yes we can do that also once we are familiar with the Hadoop environment.

    Can we use Windows for Hadoop?

     Actually, Red Hat Linux or Ubuntu are the best Operating Systems for Hadoop. Windows is not used frequently for

    installing Hadoop as there are many support problems attached with Windows. Thus, Windows is not a preferred

    environment for Hadoop.

FAQ’s For Hadoop MapReduce

What is MapReduce?
It is a framework or a programming model that is used for processing large data sets over clusters of computers using distributed programming.

What are 'maps' and 'reduces'?
'Maps' and 'reduces' are two phases of solving a query in HDFS. 'Map' is responsible for reading data from the input location and, based on the input type, it will generate a key-value pair, that is, an intermediate output, on the local machine. 'Reducer' is responsible for processing the intermediate output received from the mapper and generating the final output.

    What are the four basic parameters of a mapper?


The four basic parameters of a mapper are LongWritable, Text, Text and IntWritable. The first two represent the input parameters and the second two represent the intermediate output parameters.

What are the four basic parameters of a reducer?

The four basic parameters of a reducer are Text, IntWritable, Text and IntWritable. The first two represent the intermediate output parameters and the second two represent the final output parameters.
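To make these parameter types concrete, here is a minimal word-count-style sketch using the newer org.apache.hadoop.mapreduce API; the class names and the tokenising logic are illustrative, not taken from this question bank.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mapper<LongWritable, Text, Text, IntWritable>: input key/value types, then intermediate output key/value types.
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);   // emit intermediate (Text, IntWritable) pairs
                }
            }
        }
    }

    // Reducer<Text, IntWritable, Text, IntWritable>: intermediate key/value types, then final output key/value types.
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));  // emit the final (Text, IntWritable) pair
        }
    }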

What do the master class and the output class do?

The master class is defined to update the master (the job tracker), and the output class is defined to write data to the output location.

What is the input type/format in MapReduce by default?

By default, the input type in MapReduce is ‘text’.

    Is it mandatory to set input and output type/format in MapReduce?

    No, it is not mandatory to set the input and output type/format in MapReduce. By default, the cluster takes the input

    and the output type as ‘text’.

    What does the text input format do?

In the text input format, each line of the file creates a line object. The key is the byte offset of the line and the value is the whole line of text. This is how the data gets processed by a mapper: the mapper receives the ‘key’ as a ‘LongWritable’ parameter and the value as a ‘Text’ parameter.

    What does job conf class do?

MapReduce needs to logically separate the different jobs running on the same cluster. The ‘JobConf’ class helps to do job-level settings, such as declaring a job in the real environment. It is recommended that the job name be descriptive and represent the type of job being executed.

What does conf.setMapperClass do?

conf.setMapperClass sets the mapper class and everything related to the map job, such as reading the data and generating key-value pairs out of the mapper.
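As a hedged illustration of such job-level settings, here is a minimal driver sketch using the old org.apache.hadoop.mapred API that JobConf belongs to; the identity mapper/reducer and the command-line paths are placeholders standing in for real job logic.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    public class PassThroughDriver {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(PassThroughDriver.class);
            conf.setJobName("pass-through");             // descriptive job name, as recommended

            conf.setMapperClass(IdentityMapper.class);   // stand-in for a real mapper
            conf.setReducerClass(IdentityReducer.class); // stand-in for a real reducer

            // Output types match what the identity mapper passes through from the default TextInputFormat.
            conf.setOutputKeyClass(LongWritable.class);
            conf.setOutputValueClass(Text.class);

            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            JobClient.runJob(conf);                      // submit the job to the job tracker
        }
    }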

What do sorting and shuffling do?

Sorting and shuffling are responsible for producing a unique key and a list of values. Grouping the same keys together at one location is known as sorting, and the process by which the intermediate output of the mapper is sorted and sent across to the reducers is known as shuffling.

What does a split do?

Before data is transferred from its location on the hard disk to the map method, there is a phase called the ‘split’. The split pulls a block of data from HDFS into the framework; it does not write anything, but only reads data from the block and passes it to the mapper. By default, the split is taken care of by the framework: its size equals the block size, and it is used to divide a block into a bunch of splits.

How can we change the split size if our commodity hardware has less storage space?

If our commodity hardware has less storage space, we can change the split size by writing a ‘custom splitter’. This customization feature of Hadoop can be invoked from the main method.
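One common way to influence the split size from the driver, not necessarily the exact ‘custom splitter’ the answer has in mind, is sketched below with the newer mapreduce API; the sizes are purely illustrative.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitSizeExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "custom-split-size");

            // Cap each input split at 32 MB and keep it above 16 MB,
            // instead of relying on the HDFS block size.
            FileInputFormat.setMaxInputSplitSize(job, 32L * 1024 * 1024);
            FileInputFormat.setMinInputSplitSize(job, 16L * 1024 * 1024);
        }
    }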

    What does a MapReduce partitioner do?


A MapReduce partitioner makes sure that all the values of a single key go to the same reducer, which allows an even distribution of the map output over the reducers. It redirects the mapper output to the reducers by determining which reducer is responsible for a particular key.

    How is Hadoop different from other data processing tools?

In Hadoop, based on your requirements, you can increase or decrease the number of mappers without worrying about the volume of data to be processed. This is the beauty of parallel processing, in contrast to the other data processing tools available.

    Can we rename the output file?

Yes, we can rename the output file, for example by implementing the MultipleOutputFormat class (or by using the MultipleOutputs API).
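A minimal sketch of the MultipleOutputs approach with the newer API follows; the base file name "wordcounts" and the reducer logic are illustrative, and the older MultipleOutputFormat class the answer mentions works along similar lines.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

    // Reducer that writes its results to files named "wordcounts-r-00000" and so on,
    // instead of the default "part-r-00000".
    public class RenamingReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private MultipleOutputs<Text, IntWritable> mos;

        @Override
        protected void setup(Context context) {
            mos = new MultipleOutputs<>(context);
        }

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            mos.write(key, new IntWritable(sum), "wordcounts"); // third argument: base output file name
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            mos.close();
        }
    }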

Why can we not do aggregation (addition) in a mapper? Why do we need a reducer for that?

We cannot do aggregation (addition) in a mapper because sorting is not done in a mapper; sorting happens only on the reducer side. A mapper instance is initialized per input split and its map method is called record by record, so while aggregating we would lose the value of the previous instance and have no track of the values of previous rows. Only after sorting and shuffling are all the values of a key brought together, which is why the aggregation is done in the reducer.

    What is Streaming?

Streaming is a feature of the Hadoop framework that allows us to write MapReduce programs in any programming language that can accept standard input and produce standard output. It could be Perl, Python or Ruby, and not necessarily Java. However, customization in MapReduce can only be done using Java and not any other programming language.

What is a Combiner?

A ‘Combiner’ is a mini-reducer that performs the local reduce task. It receives the input from the mapper on a particular node and sends its output to the reducer. Combiners help to enhance the efficiency of MapReduce by reducing the amount of data that has to be sent to the reducers.
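A combiner is usually wired in from the driver; the snippet below reuses the word-count classes from the earlier sketch, where the reducer can double as the combiner because summing counts is associative and commutative.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;

    public class CombinerExample {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word-count-with-combiner");
            job.setMapperClass(WordCountMapper.class);
            job.setCombinerClass(WordCountReducer.class); // local reduce on each mapper node
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
        }
    }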

    What is the difference between an HDFS Block and Input Split?

An HDFS Block is the physical division of the data, and an Input Split is the logical division of the data.

What happens in a TextInputFormat?

In TextInputFormat, each line in the text file is a record. The key is the byte offset of the line and the value is the content of the line. For instance, key: LongWritable, value: Text.

What do you know about KeyValueTextInputFormat?

In KeyValueTextInputFormat, each line in the text file is a record. The first separator character divides each line: everything before the separator is the key and everything after it is the value. For instance, key: Text, value: Text.

What do you know about SequenceFileInputFormat?

SequenceFileInputFormat is an input format for reading sequence files. Key and value are user defined. It is a specific compressed binary file format which is optimized for passing data from the output of one MapReduce job to the input of another MapReduce job.
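The input format is normally chosen in the driver; a small sketch follows (the comma separator and the Hadoop 2.x property name for KeyValueTextInputFormat are assumptions for illustration).

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

    public class InputFormatExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Split each line at the first ',' instead of the default tab.
            conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");

            Job job = Job.getInstance(conf, "input-format-demo");
            job.setInputFormatClass(KeyValueTextInputFormat.class); // key: Text, value: Text
            // Alternatives discussed above:
            // job.setInputFormatClass(org.apache.hadoop.mapreduce.lib.input.TextInputFormat.class);
            // job.setInputFormatClass(org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.class);
        }
    }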


What do you know about NLineInputFormat?

NLineInputFormat splits ‘n’ lines of input as one split.
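A minimal sketch of configuring this from the driver; the value of 10 lines per split is only an example.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

    public class NLineExample {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "nline-demo");
            job.setInputFormatClass(NLineInputFormat.class);
            NLineInputFormat.setNumLinesPerSplit(job, 10); // each split carries 10 input lines
        }
    }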

FAQ’s For Hadoop PIG

Can you give us some examples of how Hadoop is used in a real-time environment?

Let us assume that we have an exam consisting of 10 multiple-choice questions, and 20 students appear for that exam. Every student will attempt each question. For each question and each answer option, a key will be generated, so we have a set of key-value pairs for all the questions and all the answer options for every student. Based on the options that the students have selected, you have to analyze and find out how many students have answered correctly. This isn’t an easy task. Here Hadoop comes into the picture! Hadoop helps you solve these problems quickly and without much effort. You may also take the case of how many students have wrongly attempted a particular question.

What is BloomMapFile used for?

The BloomMapFile is a class that extends MapFile, so its functionality is similar to MapFile. BloomMapFile uses dynamic Bloom filters to provide a quick membership test for the keys. It is used in the HBase table format.

    What is PIG?

Pig is a platform for analyzing large data sets. It consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. Pig’s infrastructure layer consists of a compiler that produces sequences of MapReduce programs.

What is the difference between logical and physical plans?

Pig goes through several steps when a Pig Latin script is converted into MapReduce jobs. After performing the basic parsing and semantic checking, it produces a logical plan, which describes the logical operators that have to be executed by Pig during execution. After this, Pig produces a physical plan, which describes the physical operators that are needed to execute the script.

Does ‘ILLUSTRATE’ run an MR job?

No, ILLUSTRATE will not run any MR job; it pulls the data internally. On the console, ILLUSTRATE just shows the output of each stage, not the final output.

Is the keyword ‘DEFINE’ like a function name?

Yes, the keyword ‘DEFINE’ is like a function name. Once you have registered the jar, you have to define it. Whatever logic you have written in your Java program goes into an exported jar, which you register with Pig. The compiler will then check for the function in the exported jar: when the function is not present in the built-in library, it looks into your jar.
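For illustration, here is a minimal Java UDF that could be registered and then referred to with DEFINE; the package, class name and the Pig statements in the comment are hypothetical.

    // In a Pig script, assuming the class below is packaged into myudfs.jar:
    //   REGISTER myudfs.jar;
    //   DEFINE UPPER com.example.UpperCase();
    //   B = FOREACH A GENERATE UPPER(name);
    package com.example;

    import java.io.IOException;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    public class UpperCase extends EvalFunc<String> {
        @Override
        public String exec(Tuple input) throws IOException {
            if (input == null || input.size() == 0 || input.get(0) == null) {
                return null;
            }
            return input.get(0).toString().toUpperCase(); // upper-case the first field
        }
    }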

    Is the keyword ‘FUNCTIONAL’ a User Defined Function (UDF)?


No, the keyword ‘FUNCTIONAL’ is not a User Defined Function (UDF). When using a UDF, we have to override certain functions, and the work is done with the help of those functions. The keyword ‘FUNCTIONAL’, however, is a built-in (pre-defined) function, and therefore it does not work as a UDF.

    Why do we need MapReduce during Pig programming?

Pig is a high-level platform that makes many Hadoop data analysis tasks easier to execute. The language used on this platform is Pig Latin. A program written in Pig Latin is like a query written in SQL, for which we need an execution engine. So, when a program is written in Pig Latin, the Pig compiler converts the program into MapReduce jobs, and MapReduce acts as the execution engine.

Are there any problems which can only be solved by MapReduce and cannot be solved by Pig? In which kinds of scenarios are MR jobs more useful than Pig?

Let us take a scenario where we want to count the population in two cities. I have a data set and a sensor list of different cities, and I want to count the population using one MapReduce job for both cities. Let us assume that one city is Bangalore and the other is Noida. I need to treat the key for Bangalore as similar to Noida, so that the population data of these two cities is brought to one reducer. The idea behind this is to instruct the MapReduce program: whenever you find a city with the name ‘Bangalore’ or a city with the name ‘Noida’, create an alias name that is common to both cities, so that a common key is produced and the records get passed to the same reducer. For this, we have to write a custom partitioner.

In MapReduce, when you create a ‘key’ for a city, you have to consider ‘city’ as the key, so whenever the framework comes across a different city, it treats it as a different key. Hence we need a customized partitioner. There is a provision in MapReduce only, where you can write your custom partitioner and state that if city = Bangalore or Noida, then return the same hash code. However, we cannot create a custom partitioner in Pig. As Pig is not a framework, we cannot direct the execution engine to customize the partitioner. In such scenarios, MapReduce works better than Pig.
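A minimal sketch of such a custom partitioner (not the exact code the answer has in mind); the city names and the aliasing idea come from the scenario above, everything else is an assumption.

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class CityPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            String city = key.toString();
            // Alias the two cities to a common key so both land in the same partition.
            if (city.equals("Bangalore") || city.equals("Noida")) {
                city = "Bangalore-Noida";
            }
            return (city.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }
    // In the driver: job.setPartitionerClass(CityPartitioner.class);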

    Does Pig give any warning when there is a type mismatch or missing field?

No, Pig will not show any warning if there is a missing field or a mismatch; such a warning would in any case be difficult to find in the log file. If any mismatch is found, Pig assumes a null value.

What does co-group do in Pig?

Co-group joins data sets by grouping one particular data set only. It groups the elements by their common field and then returns a set of records containing two separate bags. The first bag consists of the records of the first data set with the common field, and the second bag consists of the records of the second data set with the common field.

    Can we say cogroup is a group of more than 1 data set?

Cogroup can group a single data set, but in the case of more than one data set, cogroup will group all the data sets and join them based on the common field. Hence, we can say that cogroup is both a group of more than one data set and a join of those data sets.

    What does FOREACH do?

FOREACH is used to apply transformations to the data and to generate new data items. The name itself indicates that the respective action is performed for each element of a data bag.


Syntax: FOREACH bagname GENERATE expression1, expression2, ... The expressions mentioned after GENERATE will be applied to the current record of the data bag.

What is a bag?

A bag is one of the data models present in Pig. It is an unordered collection of tuples with possible duplicates. Bags are used to store collections while grouping. The size of a bag is bounded by the size of the local disk, which means the size of the bag is limited; when a bag does not fit in memory, Pig spills it to the local disk and keeps only part of it in memory. There is no necessity that the complete bag should fit into memory. We represent bags with “{}”.

Real BIG DATA Use Cases

Big Data Exploration

Big Data exploration deals with challenges faced by large organizations, such as information being stored in different systems and needing access to this data to complete day-to-day tasks. Big Data exploration allows you to analyse the data and gain valuable insights from it.

    Enhanced 360º Customer Views 

Enhancing existing customer views helps to gain a complete understanding of customers, addressing questions like why they buy, how they prefer to shop, why they change, what they’ll buy next, and what features make them recommend a company to others.

    Security/Intelligence Extension 

Enhancing cyber security and intelligence analysis platforms with Big Data technologies to process and analyze new types of data from social media, emails, sensors and telcos, reduce risks, detect fraud and monitor cyber security in real time, in order to significantly improve intelligence, security and law enforcement insights.

    Operations Analysis 

Operations analysis is about using Big Data technologies to enable a new generation of applications that analyze large volumes of multi-structured data, such as machine and operational data, to improve business. This data can come from IT machines, sensors, meters and GPS devices, and it requires complex analysis and correlation across different types of data sets.

    Data Warehouse Modernization 

    Big Data needs to be integrated with data warehouse capabilities to increase operational efficiency. Getting rid of

    rarely accessed or old data from warehouse and application databases can be done using information integration

    software and tools.


Companies and their Big Data Applications:

Guangdong Mobiles:

A popular mobile group in China, Guangdong uses Hadoop to remove data access bottlenecks and uncover customer usage patterns for precise and targeted market promotions, and Hadoop HBase for automatically splitting data tables across nodes to expand data storage.

    Red Sox: 

The World Series champs come across huge volumes of structured and unstructured data related to the game, such as the weather, the opposing team and pre-game promotions. Big Data allows them to provide forecasts about the game and decide how to allocate resources based on expected variations in the upcoming game.

    Nokia: 

    Big Data has helped Nokia make effective use of their data to understand and improve users’ experience with their

    products. The company leverages data processing and complex analyses to build maps with predictive traffic and

    layered elevation models. Nokia uses Cloudera’s Hadoop platform and Hadoop components like HBase, HDFS,

    Sqoop and Scribe for the above application.

    Huawei: 

The Huawei OceanStor N8000-Hadoop Big Data solution is developed on an advanced clustered architecture and enterprise-level storage capability, integrated with the Hadoop computing framework. This innovative combination helps enterprises get real-time analysis and processing results from exhaustive data computation and analysis, improves decision-making and efficiency, makes management easier and reduces the cost of networking.

    SAS: 

SAS has combined with Hadoop to help data scientists transform Big Data into bigger insights. As a result, SAS has come up with an environment that provides a visual and interactive experience, making it easier to gain insights and explore new trends. Its potent analytical algorithms extract valuable insights from the data, while the in-memory technology allows faster access to data.

    CERN: 

Big Data plays a vital part at CERN, home of the Large Hadron Collider, which collects an unbelievable amount of data: 40 million pictures per second from its 100-megapixel cameras, which give out 1 petabyte of data per second. The data from these cameras needs to be analysed. The lab is experimenting with ways to place more data from its experiments in both relational databases and data stores based on NoSQL technologies, such as Hadoop and Dynamo on Amazon’s S3 cloud storage service.

Buzzdata:


    Buzzdata is working on a Big Data project where it needs to combine all the sources and integrate them in a safe

    location. This creates a great place for journalists to connect and normalize public data.

    Department of Defence: 

The Department of Defense (DoD) has invested approximately $250 million in harnessing and utilizing colossal amounts of data to come up with a system that can exercise control, make autonomous decisions and assist analysts in providing support to operations. The department plans to increase its analytical abilities 100-fold, to extract information from texts in any language, with an equivalent increase in the number of objects, activities and events that analysts can analyze.

    Defence Advanced Research Projects Agency (DARPA): 

DARPA intends to invest approximately $25 million to improve computational techniques and software tools for analyzing large amounts of semi-structured and unstructured data.

    National Institutes of Health: 

With 200 terabytes of data contained in the 1000 Genomes Project, it is all set to be a prime example of Big Data. The datasets are so massive that very few researchers have the computational power to analyse the data.

Big Data Application Examples in different Industries:

Retail/Consumer:

1. Market Basket Analysis and Pricing Optimization
2. Merchandizing and market analysis

    3. Supply-chain management and analytics

    4. Behavior-based targeting

5. Market and consumer segmentations

Finances & Frauds Services:

1. Customer Segmentation
2. Compliance and regulatory reporting

    3. Risk analysis and management.

    4. Fraud detection and security analytics

    5. Medical insurance fraud

    6. CRM

    7. Credit risk, scoring and analysis

8. Trade surveillance and abnormal trading pattern analysis

Health & Life Sciences:

    1. Clinical trials data analysis

    2. Disease pattern analysis

    3. Patient care quality analysis


4. Drug development analysis

Telecommunications:

    1. Price optimization

    2. Customer churn prevention

    3. Call detail record (CDR) analysis

4. Network performance and optimization
5. Mobile user location analysis

Enterprise Data Warehouse:

    1. Enhance EDW by offloading processing and storage

    2. Pre-processing hub before getting to EDW

    Gaming: 

1. Behavioral Analytics

High Tech:

    1. Optimize Funnel Conversion

    2. Predictive Support

    3. Predict Security Threats

    4. Device Analytics

    FACEBOOK

Facebook today is a worldwide phenomenon that has caught up with young and old alike. Launched in 2004 by a bunch of Harvard University students, it was least expected to be such a rage. In a span of just a decade, how did it manage this giant leap?

With around 1.23 billion users and counting, Facebook definitely has an upper hand over other social media websites. What is the reason behind this success? This blog is an attempt to answer some of these queries.

It is quite evident that the existence of a durable storage system and high technological expertise has contributed to the support of all kinds of user data, such as messages, applications and personal information, without which everything would have come to a staggering halt. So what does a website do when its user count exceeds the number of cars in the world? How does it manage such massive data?

Data Centre: The Crux of Facebook

Facebook’s data center is spread across an area of 300,000 sq ft of cutting-edge servers and huge memory banks; it has data spread over 23 million ft of fiber-optic cables. The systems are designed to move data at the speed of light, making sure that once a user logs into his profile, everything works fast. With 30 MW of electricity, they have to make sure they are never out of power. The warehouse stores up to 300 PB of Hive data, with an incoming daily rate of 600 TB.


An ordinary computer is cooled by a heat sink no bigger than a matchbox, but for Facebook’s computers the picture is evidently bigger. Spread over a huge field, there are cooling systems and fans that help balance the temperature of these systems. As the count increases, trucks of storage systems keep pouring in on a daily basis, and employees are now losing count of them.

Hadoop & Cassandra: The Technology Wizards

The use of big data has evolved, and Big Data is crucial for Facebook’s existence. A platform as big as this requires a number of technologies that enable it to solve problems and store massive data. Hadoop is one of the many Big Data technologies employed at Facebook, but on its own it is insufficient for a company that is growing every minute of the day. Hadoop is a highly scalable open-source framework that uses clusters of low-cost servers to solve problems. One of the other technologies used and preferred is Cassandra.

Apache Cassandra was initially developed at Facebook to power its Inbox Search feature by two proficient Indians, Avinash Lakshman and Prashant Malik, the former an author of Amazon Dynamo and the latter a techie. It is an open-source distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure.

Cassandra offers robust support for clusters spanning multiple data centers and aims to run on top of an infrastructure of hundreds of nodes. There are failures at some point, but the manner in which Cassandra manages them makes it possible for anyone to rely on the service.

Facebook, along with other social media websites, avoids using MySQL due to the complexity of getting good results. Cassandra has overpowered the rest and proved its capability in terms of getting quick results. Facebook had originally developed Cassandra to solve its inbox search problem and to be fast and reliable in handling read and write requests at the same time. Facebook is a platform that instantly helps you connect with people far and near, and for this it requires a system that performs and matches the brand.

    WHAT IS HADOOP

So, what exactly is Hadoop?

It is truly said that ‘Necessity is the mother of all inventions’, and Hadoop is amongst the finest inventions in the world of Big Data! Hadoop had to be developed sooner or later, as there was an acute need for a framework that could handle and process Big Data efficiently.

Technically speaking, Hadoop is an open-source software framework that supports data-intensive distributed applications. Hadoop is licensed under the Apache v2 license and is therefore generally known as Apache Hadoop. Hadoop was developed based on a paper originally written by Google on its MapReduce system, and it applies concepts of functional programming. Hadoop is written in the Java programming language and is a top-level Apache project, constructed and used by a global community of contributors. Hadoop was developed by Doug Cutting and Michael J. Cafarella. And the charming yellow elephant you see is basically named after Doug’s son’s toy elephant!


Hadoop Ecosystem:

Once you are familiar with ‘What is Hadoop’, let’s probe into its ecosystem. The Hadoop Ecosystem is nothing but the various components that make Hadoop so powerful, among which HDFS and MapReduce are the core components!

1. HDFS:

The Hadoop Distributed File System (HDFS) is a very robust feature of Apache Hadoop. HDFS is designed to store gigantic amounts of data reliably, to transfer data at an amazing speed among nodes, and to let the system continue working smoothly even if any of the nodes fails. HDFS is very competent at handling the placement of data, processing it and generating the final outcomes. In fact, HDFS manages around 40 petabytes of data at Yahoo! The key components of HDFS are the NameNode, the DataNodes and the Secondary NameNode.

2. MapReduce:

It all started with Google applying the concept of functional programming to solve the problem of how to manage large amounts of data on the internet. Google named it the ‘MapReduce’ system and described it in a paper it published. With the ever-increasing amount of data generated on the web, MapReduce was created in 2004, and Yahoo stepped in to develop Hadoop in order to implement the MapReduce technique in it. The function of MapReduce was to help Google search and index the large quantity of web pages in a matter of seconds, or even a fraction of a second. The key components of MapReduce are the JobTracker, the TaskTrackers and the JobHistoryServer.

3. Apache Pig:

Apache Pig is another component of Hadoop, used to evaluate huge data sets through a high-level language. In fact, Pig was initiated with the idea of creating and executing commands on Big Data sets. The basic attribute of Pig programs is ‘parallelization’, which helps them manage large data sets. Apache Pig consists of a compiler that generates a series of MapReduce programs and a ‘Pig Latin’ language layer that enables SQL-like queries to be run on distributed databases in Hadoop.

(Hadoop ecosystem diagram: http://cdn.edureka.co/blog/wp-content/uploads/2013/03/Hadoop-ecosystem-1.png)


4. Apache Hive:

As the name suggests, Hive is Hadoop’s data warehouse system. It enables quick data summarization for Hadoop, handles queries, evaluates huge data sets located in Hadoop’s file systems, and maintains full support for MapReduce. Another striking feature of Apache Hive is that it provides indexes, such as bitmap indexes, in order to speed up queries. Apache Hive was originally developed by Facebook, but it is now developed and used by other companies too, including Netflix.

5. Apache HCatalog:

Apache HCatalog is another important component of Apache Hadoop, which provides a table and storage management service for data created with the help of Apache Hadoop. HCatalog offers features like a shared schema and data type mechanism, a table abstraction for users, and smooth functioning across other components of Hadoop such as Pig, MapReduce, Streaming and Hive.

6. Apache HBase:

HBase stands for Hadoop DataBase. HBase is a distributed, column-oriented database that uses HDFS for storage. On one hand it manages batch-style computations using MapReduce, and on the other hand it handles point queries (random reads). The key components of Apache HBase are the HBase Master and the RegionServer.

7. Apache ZooKeeper:

Apache ZooKeeper is another significant part of the Hadoop ecosystem. Its major function is to keep a record of configuration information, handle naming, provide distributed synchronization, and provide group services, which are immensely crucial for various distributed systems. In fact, HBase depends upon ZooKeeper for its functioning.

WHY HADOOP

Hadoop can be contagious: its implementation in one organization can lead to another one elsewhere. Thanks to Hadoop being robust and cost-effective, handling humongous data seems much easier now. The ability to include Hive in an EMR workflow is yet another awesome point: it’s incredibly easy to boot up a cluster, install Hive, and be doing simple SQL analytics in no time. Let’s take a look at why Hadoop can be so incredible.

Key features that answer – Why Hadoop?

1. Flexible:

It is a known fact that only 20% of the data in organizations is structured and the rest is unstructured, so it is very crucial to manage the unstructured data, which otherwise goes unattended.