Hadoop - An Introduction


Presentation Description

Hadoop - An introduction, Big Data, Map Reduce


Presentation Transcript

Hadoop - An Introduction:

Hadoop - An Introduction. Rahul Singh, VP Engineering, WIZIQ


Agenda: Big Data, Evolution of Hadoop, Hadoop Architecture, What is Map Reduce, HDFS, Hadoop Ecosystem, Learning Resources

Big Data:

Big Data

What is Apache Hadoop:

What is Apache Hadoop? The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thereby delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.

Evolution of Hadoop:

Evolution of Hadoop. Concept came from Google’s paper on the Google File System: http://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf. Conceptualized by Doug Cutting, who named it after his son’s toy elephant. An Apache open source project, supported by Yahoo!. Latest release at the time of the presentation: codenamed YARN. Further history: http://gigaom.com/2013/03/04/the-history-of-hadoop-from-4-nodes-to-the-future-of-data/

Introduction to Apache Hadoop:

Introduction to Apache Hadoop

Hadoop Architecture:

Hadoop Architecture

What is Map Reduce ?:

What is Map Reduce ?

Anagram Problem - Sample Map Class:

Anagram Problem - Sample Map Class
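
The code on this slide did not survive in the transcript. As a rough illustration only, the map step of an anagram finder can be sketched in plain Python rather than with the Hadoop Java API. The function name `anagram_map` and the sample input are assumptions made for this sketch, not the presenter's code. The idea is that every word is emitted under a key made of its letters in sorted order, so all anagrams of one another share a key.

```python
def anagram_map(line):
    """Yield (key, word) pairs for one line of input text.

    The key is the word's letters in sorted order, so words that are
    anagrams of each other produce the same key.
    (Hypothetical helper for illustration; not a Hadoop API.)
    """
    for word in line.split():
        word = word.lower()
        yield "".join(sorted(word)), word


# Sample input chosen for this sketch.
pairs = list(anagram_map("listen silent enlist google"))
for key, word in pairs:
    print(key, word)
```

In a real Hadoop job the same logic would live in a mapper's map method, with the framework supplying the input lines and collecting the emitted pairs.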

Anagram Problem – Sample Reducer Class:

Anagram Problem – Sample Reducer Class
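
The reducer code on this slide is also missing from the transcript. A minimal sketch of what a reduce step for the anagram problem might look like, again in plain Python and with hypothetical names (`anagram_reduce`, `grouped`): the reducer receives one sorted-letter key together with every word that was emitted under it, and keeps the group only when it contains at least two distinct words.

```python
def anagram_reduce(key, words):
    """Combine all words that share one sorted-letter key.

    Returns (key, "word1, word2, ...") when the group holds at least two
    distinct words, otherwise None, since a single word has no anagram.
    (Hypothetical helper for illustration; not a Hadoop API.)
    """
    unique = sorted(set(words))
    if len(unique) < 2:
        return None
    return key, ", ".join(unique)


# Input as it might look after the framework has grouped values by key.
grouped = {
    "eilnst": ["listen", "silent", "enlist"],
    "eggloo": ["google"],
}
results = [anagram_reduce(key, words) for key, words in grouped.items()]
results = [r for r in results if r is not None]
print(results)
```

Dropping single-word groups is a design choice of this sketch; a job could equally emit every group and filter later.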

Map Reduce Stages:

Map Reduce Stages
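
The diagram for this slide is not in the transcript. The stages usually described for a MapReduce job are map, shuffle and sort, and reduce. The sketch below simulates them in a single Python process with a word count example; `run_mapreduce`, `word_count_map`, and `word_count_reduce` are names invented for this illustration. A real cluster runs the map and reduce tasks in parallel on many machines and moves data between them over the network.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Run map, shuffle/sort, and reduce over in-memory records."""
    # Map stage: each input record produces zero or more (key, value) pairs.
    mapped = []
    for record in records:
        mapped.extend(mapper(record))

    # Shuffle and sort stage: collect all values that share a key.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)

    # Reduce stage: one reducer call per key, visited in sorted key order.
    return [reducer(key, groups[key]) for key in sorted(groups)]


def word_count_map(line):
    # Emit (word, 1) for every word in the line.
    return [(word.lower(), 1) for word in line.split()]


def word_count_reduce(word, counts):
    # Sum the ones emitted for this word.
    return word, sum(counts)


lines = ["Hadoop stores data", "Hadoop processes data"]
counts = run_mapreduce(lines, word_count_map, word_count_reduce)
print(counts)
```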

Introduction to HDFS:

Introduction to HDFS https://docs.google.com/file/d/0B-zw6KHOtbT4MmRkZWJjYzEtYjI3Ni00NTFjLWE0OGItYTU5OGMxYjc0N2M1/edit

Hadoop Ecosystem:

Hadoop Ecosystem

Hadoop Use Cases http://wiki.apache.org/hadoop/PoweredBy:

Hadoop Use Cases http://wiki.apache.org/hadoop/PoweredBy
“build Amazon's product search indices”
“build the recommender system for behavioral targeting”
“ETL style processing and statistics generation”
“information extraction & search”
“searching and analysis of millions of rental bookings”
“we use Hadoop to summarize of user's tracking data”
“we use Hadoop to store ad serving logs”
“the freedom to query the data in an ad-hoc manner”
“generating web graphs on 100 nodes”
“we use Hadoop for batch-processing large RDF datasets”
“facial similarity and recognition across large datasets”
“We are using Hadoop and Nutch to crawl Blog posts”

What is Hadoop good for?:

What is Hadoop good for? Works best for very large, unstructured data sets in batch-processing mode.

What does Hadoop not do well?:

What does Hadoop not do well? Real-time streaming applications. It also does not handle a large number of small files well.
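
One way to see why many small files are a problem: the HDFS NameNode keeps an in-memory record for every file and every block, and even a tiny file needs a block of its own. The sketch below counts those records for roughly the same amount of data stored two ways. The 128 MiB block size is an assumption (the default in Hadoop 2.x; the setting is configurable and earlier versions used 64 MiB), and `namespace_objects` is a hypothetical helper written for this illustration.

```python
import math

# Assumed block size: 128 MiB, the Hadoop 2.x default (configurable).
BLOCK_SIZE = 128 * 1024 * 1024


def namespace_objects(file_sizes, block_size=BLOCK_SIZE):
    """Return (file_count, block_count) for a list of file sizes in bytes.

    A non-empty file occupies at least one block, however small it is.
    """
    files = len(file_sizes)
    blocks = sum(math.ceil(size / block_size) for size in file_sizes)
    return files, blocks


one_large = namespace_objects([1024 ** 3])            # a single 1 GiB file
many_small = namespace_objects([100 * 1024] * 10000)  # 10,000 files of 100 KiB

print("one large file:", one_large)
print("many small files:", many_small)
```

Under these assumptions the single large file needs 1 file record and 8 block records, while the small files need 10,000 of each for slightly less data.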

Learning Resources:

Learning Resources. Books: Hadoop: The Definitive Guide by Tom White; Hadoop in Action. Others: http://www.thecloudavenue.com/p/hadoopresources.html#!
