Microsoft Big Data

Presentation Description

This is the introductory presentation on Big Data with Hadoop and Microsoft. Regards, Ing. Eduardo Castro Martinez, PhD


Presentation Transcript

NoSQL Big Data Hadoop with Microsoft:

NoSQL Big Data Hadoop with Microsoft. Ing. Eduardo Castro Martinez, PhD. ecastro@mswindowscr.org | http://tinyurl.com/comunidadwindows | Facebook: ecastrom | Twitter: edocastro | YouTube: eduardocastrom

This presentation is based on the following sources:

This presentation is based on the following sources:
- "What's the Big Deal?" David J. DeWitt
- "Introduction to NoSQL Databases and MapReduce." J Singh
- "Large Scale Machine Translation Architectures." Qin Gao
- "NoSQL." Perry Hoekstra
- "Fitting Microsoft Hadoop Into Your Enterprise BI Strategy." Cindy Gross | @SQLCindy | SQLCAT PM
- "Above the Cloud: Big Data and BI." Denny Lee

Big Agenda:

Big Agenda: What Big Data Is and Isn't | SQL, NoSQL, Hive | Microsoft Hadoop | Data, Insights, Visualization

What Exactly Does “Big Data” Mean?:

What Exactly Does "Big Data" Mean? Massive collections of records (think tens of petabytes), typically housed on large clusters of low-cost processors. Facebook has 2,700 nodes in its cluster with 60 PB of storage. To some, "Big Data" means using a NoSQL system or a parallel relational DBMS.

What is Big Data?:

What is Big Data? Large, complex, unstructured.

Scale Up!:

Scale Up! With the power of the Hubble telescope, we can take amazing pictures 45M light years away, such as the image of the Antennae Galaxies (NGC 4038-4039). Analogous to scale-up: non-commodity, specialized equipment, with a single point of failure.

Scale Out | Commoditized Distribution:

Scale Out | Commoditized Distribution. Hubble can provide an amazing view (the Giant Galactic Nebula, NGC 3503), but how about radio waves? Not just from one area, but from all areas viewed by observatories. SETI@Home: 5.2M participants, 10^21 floating-point operations, 769 teraFLOPS. Analogous to commoditized distributed computing: work is distributed and calculated locally across hundreds or thousands of machines; there are many points of failure, but auto-replication prevents this from being a problem.

Some Big Data Stats:

Some Big Data Stats. 1 zettabyte = 1 million petabytes = 1 billion terabytes = 1 trillion gigabytes. If you like analogies: 35 ZB is enough data to fill a stack of DVDs reaching halfway to Mars. Sources: "Big Data: The Next Frontier for Innovation, Competition and Productivity." US Bureau of Labor Statistics | McKinsey Global Institute Analysis.

Why the Sudden Explosion of Interest?:

Why the Sudden Explosion of Interest? An increased number and variety of data sources that generate large quantities of data: sensors (e.g. location, acoustical, ...), Web 2.0 (e.g. Twitter, wikis, ...), and web clicks. The realization that data was "too valuable" to delete. A dramatic decline in the cost of hardware, especially storage: if storage were still $100/GB, there would be no big data revolution underway.

History of the World, Part 1:

History of the World, Part 1. Relational databases: the mainstay of business. Web-based applications caused spikes, especially for public-facing e-commerce sites. Developers began to front the RDBMS with memcached or integrate other caching mechanisms within the application (e.g. Ehcache).

Scaling Up:

Scaling Up. Issues arise with scaling up when the dataset is just too big, and RDBMSs were not designed to be distributed. People began to look at multi-node database solutions, known as "scaling out" or "horizontal scaling." Different approaches include master-slave and sharding.

Scaling RDBMS – Master/Slave:

Scaling RDBMS – Master/Slave. All writes are written to the master; all reads are performed against the replicated slave databases. Critical reads may be incorrect, as writes may not have been propagated down yet, and large data sets can pose problems as the master needs to duplicate data to the slaves.

Scaling RDBMS - Sharding:

Scaling RDBMS - Sharding. Partitioning (sharding) scales well for both reads and writes, but it is not transparent: the application needs to be partition-aware. You can no longer have relationships/joins across partitions, and referential integrity is lost across shards.
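"Partition-aware" can be made concrete with a small sketch. The following is an illustrative hash-based shard router (the class and method names are invented for this example, not from the presentation): once a table is sharded, the application, not the database, decides which shard holds a key.

```java
// Illustrative hash-based shard router: once a table is sharded, the
// application must know which shard a key lives on; the RDBMS no
// longer hides data placement.
public class ShardRouter {

    // Map a key to one of shardCount shards deterministically.
    public static int shardFor(String key, int shardCount) {
        // floorMod keeps the result non-negative even when hashCode() < 0
        return Math.floorMod(key.hashCode(), shardCount);
    }

    public static void main(String[] args) {
        // The same key always routes to the same shard; two related keys
        // may land on different machines, which is exactly why cross-shard
        // joins and referential integrity are lost.
        System.out.println("user42 -> shard " + shardFor("user42", 4));
        System.out.println("order9 -> shard " + shardFor("order9", 4));
    }
}
```

A join between rows living on different shards has to be performed in the application layer, which is the loss of relationships/joins described above.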

Other ways to scale RDBMS:

Other ways to scale an RDBMS: multi-master replication; INSERT-only workloads (no UPDATEs/DELETEs); avoiding JOINs to reduce query time (which involves de-normalizing data); and in-memory databases.

Managing “Big Data”:

Managing "Big Data". The old guard: use a parallel database system (eBay: 10 PB on 256 nodes). The young turks: use a NoSQL system (Facebook: 20 PB on 2,700 nodes; Bing: 150 PB on 40K nodes).

Twitter Statistics:

Twitter Statistics. In 2007 the average was 5,000 tweets per day. In 2008 that had grown to 300,000. In 2009 tweets per day averaged 2.5 million. In 2010 that number was 35 million tweets per day. In March 2011 alone, an average of 140 million tweets were sent per day. Update: as of June 2011, users on Twitter were averaging 200 million tweets per day. Read more: http://www.marketinggum.com/twitter-statistics-2011-updated-stats/

Twitter Statistics:

(charts: Twitter tweet-volume growth)

Hadoop at Twitter:

(diagram: Hadoop at Twitter)

Facebook’s Hadoop Warehouse (2011):

Facebook's Hadoop Warehouse (2011): a cluster of 2,700 nodes, each with 8 CPUs, 32-48 GB of memory, and 12 disks (1 TB or 2 TB). 19 PB of data in HDFS (50.4 PB with replication). 150 TB (compressed) added daily: 40 TB of new data and 110 TB of derived tables. 150K jobs processed daily; only 500 are MapReduce jobs, the rest are in Hive, and about half are ad hoc queries.

Big Agenda:

Big Agenda: What Big Data Is and Isn't | SQL, NoSQL, Hive | Microsoft Hadoop | Data, Insights, Visualization

I don’t need no NoSQL…. Do I?:

I don't need no NoSQL.... Do I? SQL Server and Hadoop fulfill different needs: structured data is served by SQL (SQL Server); unstructured data by NoSQL (Hadoop).

How do I leverage my #SQLAwesomeness?:

How do I leverage my #SQLAwesomeness? HiveQL looks a lot like T-SQL:

SELECT deviceplatform, state, country FROM hivesampletable LIMIT 200;

All your data are belong to us:

All your data are belong to us: SFTP | Amazon S3 | Azure Data Market | Azure Blob Store | Sqoop to/from relational | Hive ODBC Driver

Big Data is what again? :

Big Data is what again? Hadoop: HDFS + MapReduce | massively parallel processing | streaming | machine learning | unstructured

What is Hadoop:

What is Hadoop? Hadoop is a distributed computing framework with two main components: a distributed file system and a map-reduce implementation. Imagine you have a cluster of 100 computers. Hadoop's distributed file system makes it so you can put data "into Hadoop" and pretend that all the hard drives on your machines have coalesced into one gigantic drive. Under the hood, it breaks each file you give it into 64- or 128-MB chunks called blocks and sends them to different machines in the cluster, replicating each block three times along the way.
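The block arithmetic above is easy to sketch. Assuming the 64 MB block size and three-fold replication just described (the class name here is invented for illustration):

```java
// Back-of-the-envelope HDFS math: a file is split into fixed-size
// blocks, and each block is stored replicationFactor times.
public class HdfsMath {

    // Ceiling division: a partial final block still occupies one block.
    public static long blocksFor(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    // Raw bytes consumed across the cluster once replicated.
    public static long rawBytesStored(long fileBytes, int replicationFactor) {
        return fileBytes * replicationFactor;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long file = 200 * mb;   // a 200 MB input file
        long block = 64 * mb;   // 64 MB blocks, as described above
        System.out.println(blocksFor(file, block) + " blocks");       // 4 blocks
        System.out.println(rawBytesStored(file, 3) / mb + " MB raw"); // 600 MB raw
    }
}
```

A 200 MB file therefore occupies four blocks (three full, one partial) and, with three-fold replication, about 600 MB of raw cluster storage.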

What is Hadoop:

The second main component of Hadoop is its map-reduce framework, which provides a simple way to break analyses over large sets of data into small chunks that can be done in parallel across your 100 machines.

What is Hadoop?:

What is Hadoop? Synonymous with the Big Data movement. Infrastructure to automatically distribute and replicate data across multiple nodes, and to execute and track MapReduce jobs across all of those nodes. Inspired by Google's MapReduce and GFS papers. Components are: Hadoop Distributed File System (HDFS), MapReduce, Job Tracker, and Task Tracker. Based on the "Nutch" project in 2003; became Hadoop in 2005, named after Doug Cutting's son's toy elephant. (Diagram: the MapReduce layer holds the Job Tracker and Task Trackers; the HDFS layer holds the Name Node and Data Nodes. Reference: http://en.wikipedia.org/wiki/File:Hadoop_1.png)

Comparing RDBMS and MapReduce:

Comparing RDBMS and MapReduce (reference: Tom White's Hadoop: The Definitive Guide):

              | Traditional RDBMS        | MapReduce
Data Size     | Gigabytes (terabytes)    | Petabytes (exabytes)
Access        | Interactive and batch    | Batch
Updates       | Read / write many times  | Write once, read many times
Structure     | Static schema            | Dynamic schema
Integrity     | High (ACID)              | Low
Scaling       | Nonlinear                | Linear
DBA Ratio     | 1:40                     | 1:3000

Traditional RDBMS: Move Data to Compute:

Traditional RDBMS: Move Data to Compute. As you process more and more data and want interactive response times, you typically need more expensive hardware, and failures at the disk and network level can be quite problematic. It's all about ACID: atomicity, consistency, isolation, durability. You can work around these problems with more expensive hardware and systems, though distribution becomes harder to do.

Hadoop: Move Compute to the Data:

Hadoop: Move Compute to the Data. Hadoop (and NoSQL in general) follows the MapReduce framework, developed initially by Google (MapReduce and the Google File System) and embraced by the community, which has developed very robust MapReduce algorithms. The Hadoop Distributed File System (HDFS) auto-replicates data to multiple nodes, and a single MapReduce task executes on all/many nodes holding the data in HDFS. It runs on commodity hardware: no need for specialized and expensive network and disk. Not so much ACID as BASE (basically available, soft state, eventually consistent).

Sample Java MapReduce WordCount Function:

// MapReduce is broken out into a map function and a reduce function.
// ------------------------------------------------------------------
// The fields below belong to the enclosing Mapper/Reducer classes:
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
private IntWritable result = new IntWritable();

// Sample map function: tokenizes the string and emits each token,
// e.g. "a b\tc\nd" becomes key-value pairs representing [a, b, c, d]
public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
    }
}

// Sample reduce function: sums the counts for each key
public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
        sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
}
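The Hadoop classes above only run against the Hadoop libraries and a cluster. To trace the same map, shuffle, and reduce flow locally, here is a self-contained plain-Java simulation (the class and method names are invented; this is a sketch of the idea, not Hadoop code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Plain-Java simulation of the WordCount map/shuffle/reduce flow.
public class LocalWordCount {

    // "Map" phase: emit a (token, 1) pair for every token, as the mapper does.
    public static List<Map.Entry<String, Integer>> map(String value) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(value);
        while (itr.hasMoreTokens()) {
            out.add(Map.entry(itr.nextToken(), 1));
        }
        return out;
    }

    // "Shuffle + reduce" phase: group pairs by key and sum the values.
    public static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // "a b \t c \n d a" -> {a=2, b=1, c=1, d=1}
        System.out.println(reduce(map("a b \t c \n d a")));
    }
}
```

In real Hadoop the shuffle happens across the network between mappers and reducers; here it is collapsed into a single in-memory grouping step.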

Executing WordCount Against a Sample File:

Input: the Project Gutenberg eBook of "The Notebooks of Leonardo Da Vinci, Complete" by Leonardo Da Vinci (#3 in our series by Leonardo Da Vinci). The file opens with the standard Project Gutenberg header ("Copyright laws are changing all over the world. Be sure to check the copyright laws for your country before downloading or redistributing this or any other Project Gutenberg eBook. ..."), followed by:

Title: The Notebooks of Leonardo Da Vinci, Complete
Author: Leonardo Da Vinci
...

Code to execute:

hadoop jar AcmeWordCount.jar AcmeWordCount /test/davinci.txt /test/davinci_wordcount

Purpose: to perform a count of the number of words within the said davinci.txt. Sample output:

laws 2
Project 5
...

Querying a Sample WebLog Using HiveQL:

HiveQL is a SQL-like language: you write a SQL-like query, which becomes MapReduce functions. It includes functions like str_to_map so one can perform parsing inside HiveQL.

// Sample generated log
588.891.552.388,-,08/05/2011,11:00:02,W3SVC1,CTSSVR14,-,-,0,-,200,-,GET,/c.gif,Mozilla/5.0 (Windows NT 6.1; rv:5.0) Gecko/20100101 Firefox/5.0,http://foo.live.com/cid-4985109174710/blah?fdkjafdf,[GUID],-,MSFT,&PageID=1234&Region=89191&IsoCy=BR&Lang=1046&Referrer=hotmail.com&ag=2385105&Campaign=&Event=12034

// The GUID's parameter string
[GUID] &PageID=1234&Region=89191&IsoCy=BR&Lang=1046&Referrer=hotmail.com&ag=2385105&Campaign=&Event=12034

select GUID,
       str_to_map(param, "&", "=")["IsoCy"],
       str_to_map(param, "&", "=")["Lang"]
from weblog;
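To see what str_to_map is doing, here is a plain-Java equivalent of splitting the parameter string first on "&" and then on "=" (an illustrative sketch; the class and method names are invented, and this is not Hive's implementation):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Illustrative Java analogue of Hive's str_to_map(text, delim1, delim2).
public class StrToMap {

    public static Map<String, String> strToMap(String text, String delim1, String delim2) {
        Map<String, String> out = new LinkedHashMap<>();
        for (String pair : text.split(Pattern.quote(delim1))) {
            // Split each pair into key and value on the first delim2 only.
            String[] kv = pair.split(Pattern.quote(delim2), 2);
            out.put(kv[0], kv.length > 1 ? kv[1] : "");
        }
        return out;
    }

    public static void main(String[] args) {
        String params = "PageID=1234&Region=89191&IsoCy=BR&Lang=1046";
        System.out.println(strToMap(params, "&", "=").get("IsoCy")); // BR
        System.out.println(strToMap(params, "&", "=").get("Lang"));  // 1046
    }
}
```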

Hadoop ecosystem | open source, commodity:

Hadoop ecosystem | open source, commodity. (Diagram: comparable stacks across companies: Hadoop, Hive, Scribe, Oozie, Pig (Latin), Cassandra, and HBase at web companies; MR/GFS, Bigtable, and Dremel at Google; SimpleDB, Dynamo, and EC2/S3 at Amazon; Hadoop | Azure | Excel | BI | SQL DW | PDW | F# at Microsoft.) Related projects:
- Mahout | scalable machine learning and data mining
- MongoDB | document-oriented database (C++)
- Couchbase | CouchDB (document DB) + Membase (memcache protocol)
- HBase | Hadoop column-store database
- R | statistical computing and graphics
- Pegasus | peta-scale graph mining system
- Lucene | full-featured text search engine library

NoSQL ecosystem | business value:

NoSQL ecosystem | business value:
- 140,000-190,000 more deep analytical talent positions, and 1.5 million more data-savvy managers, in the US alone
- $300 billion potential annual value to US healthcare
- €250 billion potential annual value to Europe's public sector
- 15 out of 17 sectors in the US have more data stored per company than the US Library of Congress
- 50-60% increase in the number of Hadoop developers within organizations already using Hadoop, within a year

"... the migration proved that disaster recovery is possible with Hadoop clusters. This could be an important capability for organizations considering relying on Hadoop (by running Hive atop the Hadoop Distributed File System) as a data warehouse, like Facebook does. As Yang notes, 'Unlike a traditional warehouse using SAN/NAS storage, HDFS-based warehouses lack built-in data-recovery functionality. We showed that it was possible to efficiently keep an active multi-petabyte cluster properly replicated, with only a small amount of lag.'" (How Facebook moved 30 petabytes of Hadoop data.)

PowerPoint Presentation:

This illustrates a new thesis, or collective wisdom, emerging from the Valley: if a technology is not your core value-add, it should be open-sourced, because then others can improve it and potential future employees can learn it. This rising tide has lifted all boats, and is just getting started. (Kovas Boguta: Hadoop & Startups: Where Open Source Meets Business Data.)

Hadoop.JS Executing AcmeWordCount.JS:

// MapReduce functions in JavaScript
// ------------------------------------------------------------------
var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next());
    }
    context.write(key, sum);
};

How Microsoft Addresses Big Data:

- Information Worker: Excel integration, Hive ODBC (PowerPivot, Crescent)
- Data Scientist: analytics on Hadoop (R, Lucene, Mahout, Pegasus)
- Operations: Hadoop on Azure and Windows, AD integration, Kerberos, cluster management and monitoring
- Developer: Visual Studio tools integration, Hadoop JS, HiveQL

Hadoop on Azure and Windows:

An "ocean of data" (EIS/ERP, databases, file systems, OData/RSS, Azure Storage) reached through Java MR, the StreamInsight API, HiveQL, Pig Latin, and CQL.

Azure + Hadoop = Tier-1 in the Cloud:

Azure + Hadoop = Tier-1 in the Cloud. "In short, Hadoop has the potential to make the enterprise compatible with the entire rest of the open-source and startup world..." (Kovas Boguta: Hadoop & Startups: Where Open Source Meets Business Data.)

BI + Tier-1 Big Data = Above the Cloud:

BI + Tier-1 Big Data = Above the Cloud "The sun always shines above the clouds." — Paul F. Davis

Hadoop on Azure Interactive Hive Console:

(screenshot: the Interactive Hive Console in Hadoop on Azure)

Excel Hive Add-In:

(screenshot: the Excel Hive Add-In)

PowerPivot and Hadoop:

Connecting PowerPivot to Hadoop on Azure. (screenshot)

Power View and Hadoop:

Connecting Power View to Hadoop on Azure. (screenshot)

A Definition of Big Data:

The 4 Vs: volume, velocity, variability, variety. Big data: techniques and technologies that make handling data at extreme scale economical.

Hadoop: Auto-replication:

Hadoop processes data in 64 MB chunks and then replicates them to different servers. The replication value is set in hdfs-site.xml, in the dfs.replication node.
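For reference, a minimal hdfs-site.xml fragment setting that property looks like the following (3 is the conventional default replication factor):

```xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```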

PowerPoint Presentation:

(diagram) Source: http://developer.yahoo.com/hadoop/tutorial/module4.html

Detailed map reduce flow:

Detailed map reduce flow. (diagram) Source: http://developer.yahoo.com/hadoop/tutorial/module4.html


Hadoop Ecosystem Snapshot:

Hadoop Ecosystem Snapshot (inspired by Tom White's Hadoop: The Definitive Guide):
- HDFS (Hadoop Distributed File System); external stores (S3, Azure Blobs, Azure Data Market, etc.)
- MapReduce (job scheduling / execution system)
- HBase and Cassandra (column DBs)
- Pig (data flow), Hive (SQL / DW), HCatalog
- Sqoop (SSIS-like data transfer to/from RDBMS, ETL tools, BI reporting)
- ZooKeeper (coordination)
- Serialization (Thrift, Protobuf, Writable)
- Mahout (machine learning); Lucene / Solr (search indexing)

When is big data a big fit?:

When is big data a big fit?
- IT Management: telemetry management, clickstream and application log analysis, sensor data, SLA monitoring
- Cyber Security: forensic analysis
- Online Commerce: sentiment analysis, recommendation engines, search indexing / quality
- Financial Services: risk modeling, threat analysis, fraud detection, credit scoring

Big data is not the only tool:

Big data is not the only tool:
- A replacement for relational? NO
- Simply a VLDB? NO
- Fast for subsets & filtered data? NO
- The answer to everything? NO

VVVVroom:

VVVVroom:
- Volume – beyond what the environment can handle
- Velocity – need decisions fast
- Variety – many formats
- Variability – multiple interpretations

What does Microsoft bring to the table?:

What does Microsoft bring to the table?
- Open Source Apache Hadoop
- Hadoop On Azure - CTP
- Hive ODBC Driver
- Sqoop
- JavaScript

Why is Microsoft Hadoop a fit for my Enterprise?:

Why is Microsoft Hadoop a fit for my Enterprise?
- Elasticity
- Familiar, reusable skills
- Ease of data movement
- Interactivity
- Visualization
- Self service

Who uses Big Data?:

Who uses Big Data?
- Data Scientists / Data Teams
- Information Workers
- Anyone who uses BI now
- Those seeking insights

How do we visualize the results?:

How do we visualize the results?
- Hive ODBC Driver + Excel Add-In
- PowerPivot
- Power View
- Custom tools

Insights to Action:

Insights to Action: discover insights, take action, rinse and repeat.

Microsoft Technologies:

Microsoft Technologies: http://www.microsoft.com/download/en/details.aspx?id=27584


Hopefully, you are not:

Hopefully, you are not ... (image)

Big Summary:

Big Summary: What Big Data Is and Isn't | SQL, NoSQL, Hive | Microsoft Hadoop | Data, Insights, Visualization

Big Data References:

Big Data References
- Hadoop: The Definitive Guide by Tom White
- SQL Server Sqoop: http://bit.ly/rulsjX
- JavaScript: http://bit.ly/wdaTv6
- Twitter: https://twitter.com/#!/search/%23bigdata
- Hive: http://hive.apache.org
- Excel to Hadoop via Hive ODBC: http://tinyurl.com/7c4qjjj
- Hadoop On Azure Videos: http://tinyurl.com/6munnx2
- Klout: http://tinyurl.com/6qu9php
- Microsoft Big Data: http://microsoft.com/bigdata
- Denny Lee: http://dennyglee.com/category/bigdata/
- Carl Nolan: http://tinyurl.com/6wbfxy9
- Cindy Gross: http://tinyurl.com/SmallBitesBigData