Apache MADlib AI/ML

Presentation Description

This presentation gives an overview of the Apache MADlib AI/ML project. It explains Apache MADlib in terms of its functionality, architecture and dependencies, and also gives an SQL example. Links for further information and connecting: http://www.amazon.com/Michael-Frampton/e/B00NIQDOOM/ https://nz.linkedin.com/pub/mike-frampton/20/630/385 https://open-source-systems.blogspot.com/

Presentation Transcript

slide 1:

What Is Apache MADlib ● For scalable in-database analytics ● Open source Apache 2.0 license ● For machine learning in SQL ● At big data scale ● Offers graph, statistics, analytics and deep learning functionality ● Provides data-parallel implementations ● For structured and unstructured data

slide 2:

MADlib Prerequisites ● Currently supports these databases – PostgreSQL (needs the Python extension specified) – Greenplum distributed db – Apache HAWQ v1.12+ distributed db ● Requires the GNU M4 Unix macro processor ● Works with Python 2.6 and 2.7

slide 3:

MADlib Architecture

slide 4:

MADlib Architecture ● MADlib has three main layers ● Python driver functions – Main entry point from user input – Largely responsible for algorithm flow control – Validating input parameters – Executing SQL statements – Evaluating the results – Potentially looping to execute more SQL statements ● Until a convergence criterion has been met
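The driver flow described on this slide can be sketched in a few lines of Python. This is only an illustration of the control-flow pattern, not MADlib source code: it uses the standard-library `sqlite3` module as a stand-in database, and the function name `fit_slope`, the table `points` and the gradient-descent task are all hypothetical choices made for the example.

```python
import sqlite3


def fit_slope(conn, table, max_iter=1000, tol=1e-9, step=0.01):
    """Fit y = w * x by gradient descent, with the gradient computed in SQL.

    Hypothetical driver: validate input, run SQL, evaluate the result,
    and loop until a convergence criterion is met.
    """
    # 1. Validate input parameters before touching the database.
    if not table.isidentifier():
        raise ValueError("table must be a plain identifier")
    if max_iter <= 0 or tol <= 0 or step <= 0:
        raise ValueError("max_iter, tol and step must be positive")

    w = 0.0
    for iteration in range(1, max_iter + 1):
        # 2. Execute a SQL statement: the aggregate does the heavy lifting.
        (grad,) = conn.execute(
            f"SELECT AVG(2.0 * (? * x - y) * x) FROM {table}", (w,)
        ).fetchone()
        if grad is None:
            raise ValueError("table is empty")

        # 3. Evaluate the result and decide whether to loop again.
        new_w = w - step * grad
        if abs(new_w - w) < tol:
            return new_w, iteration
        w = new_w
    return w, max_iter


# Usage: a tiny in-memory table where y = 2 * x exactly.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (x REAL, y REAL)")
conn.executemany(
    "INSERT INTO points VALUES (?, ?)", [(x, 2.0 * x) for x in range(1, 6)]
)
slope, iterations = fit_slope(conn, "points")
print(f"slope={slope:.6f} after {iterations} iterations")
```

The Python side only holds the loop and the stopping rule; each pass over the data happens inside the database, which is the division of labour the slide describes.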

slide 5:

MADlib Architecture ● MADlib has three main layers ● C++ implementation functions – C++ definitions of the core functions/aggregates ● Needed for particular algorithms – Implemented in C++ rather than Python ● For performance reasons

slide 6:

MADlib Architecture ● MADlib has three main layers ● C++ database abstraction layer – Provides a programming interface – Abstracts all the Postgres internal details – Provides support for different back end platforms – Focuses on the internal functionality ● Rather than the platform integration logic

slide 7:

MADlib Data Types and Transformations ● Arrays and Matrices ● Encoding Categorical Variables ● Path ● Pivot ● Sessionize ● Stemming

slide 8:

MADlib Graph Functionality ● All Pairs Shortest Path ● Breadth-First Search ● HITS ● Measures ● PageRank ● Single Source Shortest Path ● Weakly Connected Components

slide 9:

MADlib Model Selection / Sampling ● Model Selection – Cross Validation – Prediction Metrics – Train-Test Split ● Sampling – Balanced Sampling – Stratified Sampling

slide 10:

MADlib Statistics / Supervised Learning ● Statistics – Descriptive Statistics – Inferential Statistics – Probability Functions ● Supervised Learning – Conditional Random Field – k-Nearest Neighbors – Neural Network – Regression Models – Support Vector Machines – Tree Methods

slide 11:

MADlib Time Series / Unsupervised Learning ● Time Series Analysis – ARIMA ● Unsupervised Learning – Association Rules – Clustering – Dimensionality Reduction – Topic Modelling

slide 12:

MADlib Utilities ● Columns to Vector ● Database Functions ● Linear Solvers ● Mini-Batch Preprocessor ● PMML Export ● Term Frequency ● Vector to Columns

slide 13:

MADlib Deep Learning Example SQL ● First define the model configurations to train ● Meaning either model architectures or hyperparameters ● Load them into a model selection table ● The combination of model architectures and hyperparameters ● Constitutes the model configurations to train ● In the picture there are three model configurations ● Represented by the three different purple shapes
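As a rough sketch of what "model configurations" means here, the Python below enumerates every combination of model architecture id, compile parameters and fit parameters, and then builds the SQL call that would load them. The table names `model_arch_library` and `mst_table` and the parameter strings are hypothetical. The function `madlib.load_model_selection_table` and its argument order are written from recollection of the MADlib deep learning documentation, so treat them as an assumption and check them against your MADlib version.

```python
from itertools import product


def model_configurations(model_ids, compile_params, fit_params):
    """Enumerate every (model_id, compile_params, fit_params) combination."""
    combos = product(model_ids, compile_params, fit_params)
    return [
        {"mst_key": key, "model_id": m, "compile_params": c, "fit_params": f}
        for key, (m, c, f) in enumerate(combos, start=1)
    ]


def dollar_quoted_array(items):
    """Render strings as a SQL ARRAY of $$...$$ literals (they contain quotes)."""
    return "ARRAY[" + ", ".join(f"$${item}$$" for item in items) + "]"


def load_mst_sql(arch_table, mst_table, model_ids, compile_params, fit_params):
    """Build the SQL text only; nothing is executed here.

    Assumption: the function name and argument order match the installed
    MADlib release.
    """
    id_array = "ARRAY[" + ", ".join(str(int(i)) for i in model_ids) + "]"
    return (
        "SELECT madlib.load_model_selection_table("
        f"'{arch_table}', '{mst_table}', {id_array}, "
        f"{dollar_quoted_array(compile_params)}, "
        f"{dollar_quoted_array(fit_params)});"
    )


# One architecture, three optimizer settings, one fit setting: 3 configurations.
model_ids = [1]
compile_params = [
    "loss='categorical_crossentropy', optimizer='Adam(lr=0.1)', metrics=['accuracy']",
    "loss='categorical_crossentropy', optimizer='Adam(lr=0.01)', metrics=['accuracy']",
    "loss='categorical_crossentropy', optimizer='Adam(lr=0.001)', metrics=['accuracy']",
]
fit_params = ["batch_size=4, epochs=1"]

configs = model_configurations(model_ids, compile_params, fit_params)
sql = load_mst_sql("model_arch_library", "mst_table",
                   model_ids, compile_params, fit_params)
for config in configs:
    print(config["mst_key"], config["model_id"], config["compile_params"])
print(sql)
```

The three rows printed correspond to the three configurations mentioned on the slide; adding a second architecture id or a second fit setting would double the count.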

slide 14:

MADlib Deep Learning Example SQL

slide 15:

MADlib Deep Learning Example SQL ● Once we have model combinations ● In the model selection table ● Call the fit function to train the models – In parallel. ● In the picture the three orange shapes ● Represent the three models that have been trained
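The idea of training every configuration in parallel can be illustrated with a small, self-contained Python analogy. This is not how MADlib schedules work across database segments; it simply runs a toy training function for each configuration at the same time using a thread pool. The names `fit_one` and `fit_multiple`, the `learning_rate` and `epochs` keys and the data are hypothetical. In MADlib itself the corresponding call is, to the best of my recollection, `madlib.madlib_keras_fit_multiple_model`, which takes the source table, an output table and the model selection table; verify this against your release.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy training data where y = 2 * x exactly.
DATA = [(x, 2.0 * x) for x in range(1, 6)]


def fit_one(config):
    """Train one toy model (y = w * x) for a single configuration."""
    w = 0.0
    for _ in range(config["epochs"]):
        grad = sum(2.0 * (w * x - y) * x for x, y in DATA) / len(DATA)
        w -= config["learning_rate"] * grad
    loss = sum((w * x - y) ** 2 for x, y in DATA) / len(DATA)
    return {"mst_key": config["mst_key"], "weight": w, "loss": loss}


def fit_multiple(configs, max_workers=3):
    """Train all configurations concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fit_one, configs))


# Three configurations that differ only in a hyperparameter.
configs = [
    {"mst_key": 1, "learning_rate": 0.001, "epochs": 50},
    {"mst_key": 2, "learning_rate": 0.01, "epochs": 50},
    {"mst_key": 3, "learning_rate": 0.05, "epochs": 50},
]
results = fit_multiple(configs)
for result in results:
    print(result["mst_key"], round(result["weight"], 4), result["loss"])
```

Each entry in `results` plays the role of one trained model from the slide, and comparing the `loss` values is the kind of model selection step that would follow.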

slide 16:

MADlib Deep Learning Example SQL

slide 17:

Available Books ● See “Big Data Made Easy” – Apress Jan 2015 ● See “Mastering Apache Spark” – Packt Oct 2015 ● See “Complete Guide to Open Source Big Data Stack” – Apress Jan 2018 ● Find the author on Amazon – www.amazon.com/Michael-Frampton/e/B00NIQDOOM/ ● Connect on LinkedIn – www.linkedin.com/in/mike-frampton-38563020

slide 18:

Connect ● Feel free to connect on LinkedIn – www.linkedin.com/in/mike-frampton-38563020 ● See my open source blog at – open-source-systems.blogspot.com/ ● I am always interested in – New technology – Opportunities – Technology-based issues – Big data integration
