Grid Training


Presentation Description

Big data for Business Intelligence


Presentation Transcript

Slide 1: 

Introduction to Grid & Fitment for Next Generation BI
By: Vivek Singh, Data Architect and Author of “The Reverse Journey”

Slide 2: 

Agenda:
- Introduction – Prerequisites of this session
- Thinking of Scale – Need for Grid
- Thinking MapReduce paradigm – Basis of Grid
- Grid (Hadoop Architecture) – Deep Dive into Grid
- Grid Framework – Ecosystem for Grid
- Introduction of Pig Latin – ETL on Grid
- Pig Latin vs SQL – High-level declaration
- Grid Shortcomings for BI – Building your plug-in

Slide 3: 

Thinking of Scale – Need for Grid

Traditional systems:
- RDBMS (structured data): MySQL ~200 GB; Infobright (columnar) ~1 TB; Oracle ~50 TB; Teradata ~100 TB
- Storage (unstructured data): RAID 0–10 ~1 TB; filer ~500 TB; archival tapes (not real-time)
- ETL tools (data processing): Informatica, DataStage, Talend, Netezza, Lucene (index based)

Typical storage cost: ~$40K per TB. Typical processing rate: ~1 GB per minute.
- To store 1 PB: $40K × 1000 = $40 million
- To process 1 TB: 1000 minutes ≈ 17 hours

Slide 4: 

Thinking of Scale – Need for Grid

How traditional systems achieve scalability:
- Multithreading
- Multiple CPUs – parallel processing
- Distributed programming – SMP & MPP
- ETL load distribution – assigning jobs to different nodes for improved throughput
(Example: Oracle RAC architecture)

Slide 5: 

Thinking of Scale – Need for Grid

[Diagram: a typical Web 2.0 log collection workflow. Web servers behind load balancers in multiple data centers (dc1 … dc2, nodes n1–nn) feed a log data highway at ~100 mps per pipe into log storage & processing, which loads the datamarts.]

Slide 6: 

Thinking of Scale – Need for Grid

Think numbers:
- 1,000 nodes per DC, 10 DCs
- 1 KB web-server log record, 1 record per second per node
- In one day: 1000 × 10 × 1 KB × 60 × 60 × 24 = 864 GB
- Storage for a year: 864 GB × 365 ≈ 315 TB

Slide 7: 

Thinking of Scale – Need for Grid

Traditional systems were too costly and too slow to handle the data explosion, and linear scalability was not possible: 10 × 1 CPU ≠ 10 CPUs. A new paradigm was needed to:
- use all available processing power
- parallelize the work
- make data available to all functions

The business need was to be:
- cost effective (commodity hardware)
- scalable
- resilient to failure (fail-over, redundancy)

Slide 8: 

Next topic (agenda recap):
- Introduction – Prerequisites of this session
- Thinking of Scale – Need for Grid
→ Thinking MapReduce paradigm – Basis of Grid
- Grid (Hadoop Architecture) – Deep Dive into Grid
- Grid Framework – Ecosystem for Grid
- Introduction of Pig Latin – ETL on Grid
- Pig Latin vs SQL – High-level declaration
- Grid Shortcomings for BI – Building your plug-in

Slide 9: 

Thinking MapReduce paradigm – Basis of Grid

MapReduce definition: MapReduce is a functional programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real-world tasks are expressible in this model.

Slide 10: 

Thinking MapReduce paradigm – Basis of Grid

MapReduce example:
  Output_List = Map(Input_List)
    Square(1, 2, 3, 4, 5, 6, 7, 8, 9, 10) = (1, 4, 9, 16, 25, 36, 49, 64, 81, 100)
  Output_Element = Reduce(Input_List)
    Sum(1, 4, 9, 16, 25, 36, 49, 64, 81, 100) = 385

Signatures:
  map(k1, v1) → list(k2, v2)
  reduce(k2, list(v2)) → list(v2)
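To make the two halves concrete, here is a minimal plain-Java sketch of the Square/Sum example above (this is ordinary single-machine code, not Hadoop; the class name is illustrative only):

import java.util.List;
import java.util.stream.Collectors;

public class SquareSum {
    public static void main(String[] args) {
        List<Integer> input = List.of(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);

        // "map": apply Square to every element independently
        List<Integer> squared = input.stream()
                                     .map(x -> x * x)
                                     .collect(Collectors.toList());

        // "reduce": merge all mapped values into a single result
        int sum = squared.stream().reduce(0, Integer::sum);

        System.out.println(squared); // [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
        System.out.println(sum);     // 385
    }
}

Because each map call is independent and reduce only combines results, both steps can be spread across many machines, which is exactly what the Grid run-time does.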

Slide 11: 

Thinking MapReduce paradigm – Basis of Grid

MapReduce pseudocode (word count):

map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));

map() functions run in parallel, creating different intermediate values from different input data sets. reduce() functions also run in parallel, each working on a different output key. All values are processed independently. Bottleneck: the reduce phase can't start until the map phase is completely finished. (A runnable Java version of this word count appears on slide 43.)

Slide 12: 

Thinking MapReduce paradigm – Basis of Grid Logical Architecture

Slide 13: 

Thinking MapReduce paradigm – Basis of Grid Logical Workflow

Slide 14: 

Thinking MapReduce paradigm – Basis of Grid Physical Architecture

Slide 15: 

Thinking MapReduce paradigm – Basis of Grid Physical Architecture

Example of grouped intermediate values from a word count:
(brown, (1,1)), (fox, (1,1)), (how, (1)), (now, (1)), (the, (1,1,1))

Slide 16: 

Thinking MapReduce paradigm – Basis of Grid Execution Overview

Slide 17: 

Thinking MapReduce paradigm – Basis of Grid

MapReduce execution recap – master-slave architecture:
Master: JobTracker
- accepts MR jobs submitted by users
- assigns map and reduce tasks to TaskTrackers (slaves)
- monitors task and TaskTracker status; re-executes tasks upon failure
Workers: TaskTrackers
- run map and reduce tasks upon instruction from the JobTracker
- manage storage and transmission of intermediate output

Slide 18: 

Thinking MapReduce paradigm – Basis of Grid

MapReduce paradigm recap:
- Example map functions: individual count, filter, transformation, sort, Pig load
- Example reduce functions: group count, sum, aggregation
- A job can have many map and reduce functions.

Q: Why is MapReduce a paradigm shift, even though it is functional programming, is structured and procedural, and may be written in an object-oriented language?
Take the example of a date-converter job (Unix date to UTC), sketched below.
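A minimal sketch of such a job, using the same old org.apache.hadoop.mapred API as the WordCount on slide 43. The class name, the input layout (one epoch-seconds value per line), and the output format are assumptions for illustration:

import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Map-only job: each input line holds a Unix timestamp (epoch seconds);
// emit (timestamp, formatted UTC string). Pure per-record transformation,
// so no reducer is needed: set conf.setNumReduceTasks(0) in the driver.
public class UnixToUtc extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss 'UTC'");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        long epochSeconds = Long.parseLong(line.toString().trim());
        output.collect(new Text(line),
                       new Text(fmt.format(new Date(epochSeconds * 1000L))));
    }
}

The point of the paradigm shift is visible here: the function itself is ordinary procedural code, but because it touches one record at a time with no shared state, the framework can run thousands of copies of it in parallel.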

Slide 19: 

Thinking MapReduce paradigm – Basis of Grid

Work of the Grid framework: the run-time system (Grid) takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and shuffling and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. Most BI reports use transformation and aggregation, so MapReduce is a very good fit to model them.

Slide 20: 

Next topic (agenda recap):
- Introduction – Prerequisites of this session
- Thinking of Scale – Need for Grid
- Thinking MapReduce paradigm – Basis of Grid
→ Grid (Hadoop Architecture) – Deep Dive into Grid
- Grid Framework – Ecosystem for Grid
- Introduction of Pig Latin – ETL on Grid
- Pig Latin vs SQL – High-level declaration
- Grid Shortcomings for BI – Building your plug-in

Slide 21: 

Grid (Hadoop Architecture) – Deep Dive into Grid

Hadoop definition: the MapReduce architecture drives the Grid (Hadoop) architecture – master-slave. Hadoop consists of:
- MapReduce support architecture
- a distributed file system (DFS)
- support facilities such as combiner, shuffle, sort
- user access, job tracking, etc.

Map-Reduce Architecture : 

Map-Reduce Architecture Grid (Hadoop Architecture) – Deep Dive into Grid

Job Submission : 

Job Submission Grid (Hadoop Architecture) – Deep Dive into Grid

Job Initialization : 

Job Initialization Grid (Hadoop Architecture) – Deep Dive into Grid

Job Scheduling : 

Job Scheduling (final diagram step: launch task) Grid (Hadoop Architecture) – Deep Dive into Grid

Task Execution : 

Task Execution Grid (Hadoop Architecture) – Deep Dive into Grid

Map Task : 

Map Task Grid (Hadoop Architecture) – Deep Dive into Grid

Reduce Task : 

Reduce Task Grid (Hadoop Architecture) – Deep Dive into Grid

Slide 29: 

Grid (Hadoop Architecture) – Deep Dive into Grid

Hadoop File System (HDFS):
- Data is organized into files and directories
- Files are divided into uniform-sized blocks (default 128 MB) and distributed across cluster nodes
- HDFS exposes block placement so that computation can be migrated to the data
- Blocks are replicated (default 3) to handle hardware failure; replication serves both performance and fault tolerance (rack-aware placement)
- HDFS keeps checksums of data for corruption detection and recovery
- Master-worker architecture: a single NameNode and many (thousands of) DataNodes

Slide 30: 

Grid (Hadoop Architecture) – Deep Dive into Grid HDFS Architecture

Slide 31: 

Grid (Hadoop Architecture) – Deep Dive into Grid

Node work assignment:

NameNode
• Manages the filesystem namespace
• File metadata (i.e. “inode”); mapping of each inode to its list of blocks + locations
• Authorization & authentication
• Checkpoints & journals namespace changes
• Maintains the mapping of each DataNode to its list of blocks
• Monitors DataNode health; replicates missing blocks
• Keeps ALL namespace in memory

DataNodes
• Handle block storage on multiple volumes & block integrity
• Clients access the blocks directly from the DataNodes
• Periodically send heartbeats and block reports to the NameNode
• Blocks are stored as files of the underlying OS

Slide 32: 

Grid (Hadoop Architecture) – Deep Dive into Grid

HDFS block data replication:
• A file’s replication factor can be changed dynamically (default 3)
• Block placement is rack aware
• Block under-replication & over-replication is detected by the NameNode
• The balancer application rebalances blocks to even out DataNode utilization
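For example, the replication factor of an existing file can be changed from the command line with the -setrep option shown in the command summary on the next slide (the file here reuses the listing on slide 44):

$ hadoop fs -setrep -w 2 /user/vivek36/raw_test.txt
(-w makes the command wait until re-replication to 2 copies completes)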

Slide 33: 

Grid (Hadoop Architecture) – Deep Dive into Grid

Accessing HDFS – common commands:
- hadoop fs -ls / -lsr / -mkdir / -cat
- hadoop fs -cp
- hadoop fs -get / -copyToLocal
- hadoop fs -put / -copyFromLocal
- hadoop fs -count [-q]
- hadoop fs -rm / -rmr (moves files to .Trash/Current)
- hadoop fs -chmod / -chgrp / -chown
- hadoop fs -test -[ezd] (-e exists / -z is zero length / -d is a directory)

Full usage:
hadoop fs [-fs <local | file system URI>] [-conf <configuration file>]
  [-D <property=value>]
  [-ls <path>] [-lsr <path>] [-du <path>] [-dus <path>]
  [-mv <src> <dst>] [-cp <src> <dst>] [-rm <src>] [-rmr <src>]
  [-put <localsrc> ... <dst>] [-copyFromLocal <localsrc> ... <dst>]
  [-moveFromLocal <localsrc> ... <dst>]
  [-get [-ignoreCrc] [-crc] <src> <localdst>]
  [-getmerge <src> <localdst> [addnl]] [-cat <src>]
  [-copyToLocal [-ignoreCrc] [-crc] <src> <localdst>] [-moveToLocal <src> <localdst>]
  [-mkdir <path>] [-report] [-setrep [-R] [-w] <rep> <path/file>]
  [-touchz <path>] [-test -[ezd] <path>] [-stat [format] <path>]
  [-tail [-f] <path>] [-text <path>]
  [-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
  [-chown [-R] [OWNER][:[GROUP]] PATH...] [-chgrp [-R] GROUP PATH...]
  [-count [-q] <path>] [-help [cmd]]

Slide 34: 

Grid (Hadoop Architecture) – Deep Dive into Grid

Grid framework functionality – some supporting facilities of Hadoop:
- Intermediate partitioning – hash(key) mod R, where R is the number of reducers
- Combiner function – local pre-aggregation on the map side
- Global counters
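The default intermediate partitioning is simple enough to show in full. This sketch of an equivalent partitioner uses the same old org.apache.hadoop.mapred API as the WordCount on slide 43; the class name is illustrative:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Route each intermediate key to reducer hash(key) mod R,
// where R is the number of reduce tasks configured for the job.
public class HashModPartitioner implements Partitioner<Text, IntWritable> {
    public void configure(JobConf job) {
        // no configuration needed
    }
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // mask the sign bit so the modulus is never negative
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

It would be wired in with conf.setPartitionerClass(HashModPartitioner.class); a combiner is set the same way (conf.setCombinerClass), as the WordCount on slide 43 does.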

Slide 35: 

Grid (Hadoop Architecture) – Deep Dive into Grid – Recap. Next: Job Tracking

Jobtracker front page : 

Jobtracker front page Grid (Hadoop Architecture) – Deep Dive into Grid

Job counters : 

Job counters Grid (Hadoop Architecture) – Deep Dive into Grid

Task status : 

Task status Grid (Hadoop Architecture) – Deep Dive into Grid

Drilling down : 

Drilling down Grid (Hadoop Architecture) – Deep Dive into Grid

Drilling down -- logs : 

Drilling down -- logs Grid (Hadoop Architecture) – Deep Dive into Grid

Slide 41: 

Next topic (agenda recap):
- Introduction – Prerequisites of this session
- Thinking of Scale – Need for Grid
- Thinking MapReduce paradigm – Basis of Grid
- Grid (Hadoop Architecture) – Deep Dive into Grid
→ Grid Framework – Ecosystem for Grid
- Introduction of Pig Latin – ETL on Grid
- Pig Latin vs SQL – High-level declaration
- Grid Shortcomings for BI – Building your plug-in

Active Projects on Grid : 

Grid Framework – Ecosystem for Grid Active Projects on Grid

Slide 43: 

Grid Framework – Ecosystem for Grid

Example of MapReduce code:

package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {

    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);   // emit (word, 1) for every token
            }
        }
    }

    public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();  // add up all counts for this word
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);   // the reducer doubles as a map-side combiner
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}

Slide 44: 

Grid Framework – Ecosystem for Grid

Example of Oozie code – workflow.xml:

<workflow-app xmlns='uri:oozie:workflow:0.1' name='vivek-pig-wf'>
    <start to='hadoop1' />
    <action name='hadoop1'>
        <pig>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.compress.map.output</name>
                    <value>true</value>
                </property>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
            <script>../script/simple.pig</script>
            <param>INPUT=${inputDir1}</param>
            <param>OUTPUT=${outputDir1}</param>
        </pig>
        <ok to="end" />
        <error to="fail" />
    </action>
    <kill name="fail">
        <message>Pig Failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name='end' />
</workflow-app>

Property file:

oozie.wf.application.path=hdfs://axoniteblue-nn1.blue.ygrid.yahoo.com:8020/user/vivek36/pigoozie
inputDir1=hdfs://axoniteblue-nn1.blue.ygrid.yahoo.com:8020/user/vivek36
outputDir1=hdfs://axoniteblue-nn1.blue.ygrid.yahoo.com:8020/user/vivek36/output-pig-data
jobTracker=axoniteblue-jt1.blue.ygrid.yahoo.com:50300
nameNode=hdfs://axoniteblue-nn1.blue.ygrid.yahoo.com:8020
queueName=unfunded
#group.name=other

Resulting HDFS layout:

$ hadoop fs -ls
drwx------   - vivek36 users          0 2009-10-05 12:39 /user/vivek36/oozie-wrkflow
drwx------   - vivek36 users          0 2009-10-01 09:08 /user/vivek36/output-pig-data
drwx------   - vivek36 users          0 2009-10-05 12:19 /user/vivek36/pigoozie
-rw-------   3 vivek36 users 5141399432 2009-09-17 09:57 /user/vivek36/raw_test.txt
-rw-------   3 vivek36 users      10964 2009-09-16 08:59 /user/vivek36/raw_test_top100.txt
drwx------   - vivek36 users          0 2009-10-05 12:34 /user/vivek36/script
drwx------   - vivek36 users          0 2009-10-05 12:38 /user/vivek36/viv_aggregate

[vivek36@gwbl4001 ~]$ hadoop fs -ls /user/vivek36/output-pig-data
Found 1 items
drwx------   - vivek36 users          0 2009-10-01 09:08 /user/vivek36/output-pig-data/mapRed

[vivek36@gwbl4001 ~]$ hadoop fs -ls /user/vivek36/output-pig-data/mapRed/*
Found 1 items
-rw-------   3 vivek36 users       1547 2009-10-01 09:07 /user/vivek36/output-pig-data/mapRed/part-00000
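A workflow like this is typically launched and monitored with the Oozie command-line client. A minimal sketch, assuming the property file above is saved as job.properties and the Oozie server URL is supplied by the cluster (or via the OOZIE_URL environment variable):

$ oozie job -oozie http://<oozie-server>:<port>/oozie -config job.properties -run
$ oozie job -oozie http://<oozie-server>:<port>/oozie -info <job-id-returned-by-run>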

Slide 45: 

Next topic (agenda recap):
- Introduction – Prerequisites of this session
- Thinking of Scale – Need for Grid
- Thinking MapReduce paradigm – Basis of Grid
- Grid (Hadoop Architecture) – Deep Dive into Grid
- Grid Framework – Ecosystem for Grid
→ Introduction of Pig Latin – ETL on Grid
- Pig Latin vs SQL – High-level declaration
- Grid Shortcomings for BI – Building your plug-in

Slide 46: 

The Reverse Journey This is a story about a young man faced with a decision - to follow his heart or brain. The heart wants happiness in India, among his family, friends and people who are like him. His brain wants money - without it what security does he have? All his friends are relocating to the USA. He feels isolated. And so he decides to follow 'the rat race'. He travels to America. Will the journey to a foreign land bring happiness? Will money be the answer to his prayers? Or will he finally realise that true joy is the sense of belonging? http://www.amazon.com/gp/search?index=books&linkCode=qs&keywords=9381115354 Pig Coming Next

Slide 47: 

Introduction of Pig Latin – ETL on Grid

What Pig does:
- High-level declarative language
- Flexible data types
- Supports common primitives like JOIN, GROUP
- Supports data flow
- A layer above MR jobs
- Supports UDFs
- Allows customization and optimization of jobs

Slide 48: 

Introduction of Pig Latin – ETL on Grid

Example of a Pig script:

A = load 'raw_test.txt' using PigStorage(',')
    as (ip:chararray, machine:chararray, id:chararray, gate:chararray);
B = group A by ($1, $3);
C = foreach B generate $0, COUNT($1);
store C into 'viv_aggregate' using PigStorage(',');

Slide 49: 

Introduction of Pig Latin – ETL on Grid

Pig script vs SQL:

SELECT a, UPPER(b), COUNT(c)   -- projection, scalar function, aggregate
WHERE a > 10                   -- filter expression
GROUP BY a;                    -- group by

[Diagram: PIG LOAD reads the input files into parallel map tasks, which filter and transform per record (projection); combiners aggregate locally per group key; Hadoop M/R provides the distributed GROUP BY, shuffle & sorting on <group key, val>; reducers aggregate globally per group key and PIG STORE writes part-00000, part-00001, …]

Slide 50: 

Introduction of Pig Latin – ETL on Grid

Pig data flow – how Pig Latin programs are compiled:
Query Parser → Logical Plan → Semantic Checking → Logical Plan → Logical Optimizer → Optimized Logical Plan → Logical-to-Physical Translator → Physical Plan → Physical-to-M/R Translator → MapReduce Plan → MapReduce Launcher (creates a job jar to be submitted to the Hadoop cluster)
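You can inspect these plans yourself: Pig's EXPLAIN statement prints the logical, physical, and MapReduce plans for a relation. A short grunt-shell sketch, reusing the script from slide 48:

grunt> A = load 'raw_test.txt' using PigStorage(',')
           as (ip:chararray, machine:chararray, id:chararray, gate:chararray);
grunt> B = group A by ($1, $3);
grunt> explain B;   -- dumps the logical, physical, and MapReduce plans for B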

Slide 51: 

Introduction of Pig Latin – ETL on Grid: Pig Compilation Data Flow

Slide 52: 

Introduction of Pig Latin – ETL on Grid

Pig data types:

  Type        | Description                  | Example
  ------------|------------------------------|------------------------------------------
  1. Data Atom| an atomic value              | 'alice'
  2. Tuple    | a record                     | (alice, lakers)
  3. Data Bag | collection supporting scan   | {(lakers), (kings)}
  4. Data Map | collection with key lookup   | ['likes'#{(lakers), (kings)}, 'age'#22]
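These types can also appear in a load schema. A small sketch (the file name and field names are illustrative, and the typed bag/map schema syntax assumes a reasonably recent Pig release):

A = load 'users.txt'
    as (name:chararray,                                -- data atom
        team:tuple(city:chararray, mascot:chararray),  -- tuple
        likes:bag{t:(team:chararray)},                 -- data bag
        props:map[]);                                  -- data map, e.g. ['age'#22]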

Slide 53: 

Introduction of Pig Latin – ETL on Grid Pig Data Access

Slide 54: 

Introduction of Pig Latin – ETL on Grid

User-defined function in Pig – Tokenize():

// Splits the input string into words and emits each word as a
// one-field tuple (this uses the early Pig BagEvalFunc UDF API).
public class Tokenize extends BagEvalFunc {
    public void exec(Tuple input, DataCollector output) {
        String str = input.getAtomField(0).strval();
        StringTokenizer st = new StringTokenizer(str);
        while (st.hasMoreTokens()) {
            Tuple t = new Tuple(1);
            t.setField(0, st.nextToken());
            output.add(t);
        }
    }
}

Slide 55: 

Introduction of Pig Latin – ETL on Grid

Some important Pig functions/operators (see the grunt-shell sketch below):
- COGROUP
- JOIN
- ILLUSTRATE
- DUMP
- STORE
- PARALLEL
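A short grunt-shell sketch of the last four (relation names reuse the slide-48 script; the output path is illustrative):

grunt> B = group A by ip parallel 10;                 -- PARALLEL sets the number of reducers
grunt> dump B;                                        -- print the relation to the console
grunt> store B into 'out_dir' using PigStorage(',');  -- write it to HDFS
grunt> illustrate B;                                  -- show sample data at each step of the plan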

Slide 56: 

Introduction of Pig Latin – ETL on Grid: COGROUP vs JOIN (see the sketch below)
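A minimal sketch of the difference the slide's diagram shows (file and field names are illustrative):

A = load 'pages' as (user:chararray, url:chararray);
B = load 'queries' as (user:chararray, query:chararray);

-- COGROUP keeps each input in its own nested bag per key:
C = cogroup A by user, B by user;
-- e.g. (alice, {(alice, cnn.com)}, {(alice, lakers)})

-- JOIN flattens the matching tuples into a cross product:
J = join A by user, B by user;
-- e.g. (alice, cnn.com, alice, lakers)

In other words, JOIN is equivalent to COGROUP followed by flattening the grouped bags.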

Slide 57: 

Introduction of Pig Latin – ETL on Grid

Relook at the Pig code, line by line:

A = load 'raw_test.txt' using PigStorage(',')        -- load comma-separated input
    as (ip:chararray, machine:chararray, id:chararray, gate:chararray);
B = group A by ($1, $3);                             -- group on machine and gate (positions 1 and 3)
C = foreach B generate $0, COUNT($1);                -- emit each group key with the count of its records
store C into 'viv_aggregate' using PigStorage(',');  -- write the results as CSV
