Introduction to Apache Hadoop


Presentation Description

A short intro to the Hadoop open source platform that allows us to manage and process BigData


By: WaqarSher (92 month(s) ago)

Thanks Tariq; for this nice presentation...

Presentation Transcript

Introduction to Apache Hadoop:

Introduction to Apache Hadoop


Agenda Need for a new processing platform ( BigData ) Origin of Hadoop What is Hadoop & what it is not ? Hadoop architecture Hadoop components (Common/HDFS/ MapReduce ) Hadoop ecosystem When should we go for Hadoop ? Real world use cases Questions

Need for a new processing platform (BigData) :

Need for a new processing platform ( BigData ) What is BigData ? - Twitter (over 7 TB/day) - Facebook (over 10 TB/day) - Google (over 20 PB/day) Where does it come from ? Why to take so much of pain ? - Information everywhere, but where is the knowledge? Existing systems (vertical scalibility ) Why Hadoop (horizontal scalibility )?

Origin of Hadoop :

Origin of Hadoop Seminal whitepapers by Google in 2004 on a new programming paradigm to handle data at internet scale Hadoop started as a part of the Nutch project. In Jan 2006 Doug Cutting started working on Hadoop at Yahoo Factored out of Nutch in Feb 2006 First release of Apache Hadoop in September 2007 Jan 2008 - Hadoop became a top level Apache project

Hadoop distributions:

Hadoop distributions Amazon Cloudera MapR HortonWorks Microsoft Windows Azure. IBM InfoSphere B iginsights Datameer EMC Greenplum HD Hadoop distribution Hadapt

What is Hadoop ?:

What is Hadoop ? Flexible infrastructure for large scale computation & data processing on a network of commodity hardware Completely written in java Open source & distributed under Apache license Hadoop Common, HDFS & MapReduce

What Hadoop is not:

What Hadoop is not A replacement for existing data warehouse systems An online transaction processing (OLTP) system A database

Hadoop architecture :

Hadoop architecture High level view (NN, DN, JT, TT) –


HDFS Hadoop distributed file system Default storage for the Hadoop cluster NameNode / DataNode The File System Namespace (similar to our local file system) Master/slave architecture (1 master 'n' slaves) Virtual not physical Provides configurable replication (user specific) Data is stored as chunks (64 MB default, but configurable) across all the nodes

HDFS architecture:

HDFS architecture

Data replication in HDFS.:

Data replication in HDFS.

Rack awareness:

Rack awareness


MapReduce Framework provided by Hadoop to process large amount of data across a cluster of machines in a parallel manner Comprises of three classes – Mapper class Reducer class Driver class Tasktracker / Jobtracker Reducer phase will start only after mapper is done Takes ( k,v ) pairs and emits ( k,v ) pair

MapReduce structure:

MapReduce structure

MapReduce job flow:

MapReduce job flow

Modes of operation:

Modes of operation Standalone mode Pseudo-distributed mode Fully-distributed mode

Hadoop ecosystem:

Hadoop ecosystem

When should we go for Hadoop ? :

When should we go for Hadoop ? Data is too huge Processes are independent Online analytical processing (OLAP) Better scalability Parallelism Unstructured data

Real world use cases:

Real world use cases Clickstream analysis Sentiment analysis Recommendation engines Ad Targeting Search Quality





authorStream Live Help