Apache Beam


Presentation Description

This presentation gives an overview of the Apache Beam project. It shows that Beam is a means of developing generic data pipelines in multiple languages using the provided SDKs, and that the pipelines execute on a range of supported runners/executors. Links for further information and connecting:
http://www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
https://nz.linkedin.com/pub/mike-frampton/20/630/385
https://open-source-systems.blogspot.com/

Presentation Transcript

slide 1:

What Is Apache Beam
● A unified programming model
● To define and execute data processing pipelines
● For ETL, batch and stream processing
● Open source / Apache 2.0 license
● Written in Java, Python and Go
● Cross-platform support
● Pipelines are defined using the Beam SDKs
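The "unified model" idea above can be sketched in plain Python (this is an illustration of the pipeline shape, not the Beam SDK itself): a pipeline is a fixed chain of read, transform and aggregate steps. With Beam, the same shape is written once against an SDK and can then run in batch or streaming mode on any supported runner.

```python
# Plain-Python sketch (not Beam SDK code) of a batch word-count pipeline:
# read -> transform -> aggregate, the shape Beam pipelines share.
from collections import Counter

def run_pipeline(lines):
    """Count words across the input lines."""
    words = (word for line in lines for word in line.split())  # transform step
    return dict(Counter(words))                                # aggregate step

print(run_pipeline(["hello beam", "hello world"]))
# {'hello': 2, 'beam': 1, 'world': 1}
```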

slide 2:

How Does Beam Work
● Use the provided SDKs to define pipelines
● In Java, Python or Go
● The Beam SDK is isolated in a Docker container
● So pipelines can be run by any execution runner
● A supported group of runners executes the pipeline
● A capability matrix defines
– The relative capabilities of the runners
– See beam.apache.org for the matrix
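The pipeline/runner split described above can be sketched in plain Python (names here are illustrative, not Beam API): the pipeline is defined once as data, and interchangeable "runners" decide how to execute it. Beam's real runners (Flink, Spark, Dataflow, and so on) play the same role for pipelines built with the Beam SDKs.

```python
# Sketch of separating pipeline definition from execution.
from concurrent.futures import ThreadPoolExecutor

def double(x):
    return x * 2

# The pipeline is just a description: a source plus a transform.
pipeline = {"source": [1, 2, 3, 4], "transform": double}

def direct_runner(p):
    # Sequential in-process execution, like Beam's Direct Runner
    # (used for test and development).
    return [p["transform"](x) for x in p["source"]]

def threaded_runner(p):
    # Parallel execution of the *same* pipeline definition.
    with ThreadPoolExecutor(max_workers=2) as ex:
        return list(ex.map(p["transform"], p["source"]))

assert direct_runner(pipeline) == threaded_runner(pipeline) == [2, 4, 6, 8]
```

Both runners produce the same result; only the execution strategy differs, which is the guarantee the capability matrix qualifies for real runners.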

slide 3:

Beam Programming Guide
● A guide for users creating data pipelines
● Examples in Java, Python and Go
● Covers designing, creating and testing pipelines
● Provides multi-language functions for
● PCollections
● Transforms
● Pipeline I/O
● Schemas
● Data encoding / type safety
● Windowing
● Triggers
● Metrics
● State and Timers
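Two of the concepts listed above can be sketched in a few lines of plain Python: a PCollection (an immutable dataset flowing through the pipeline) and transforms chained with the `|` operator, the style the Beam SDKs use. The class and function names here are toy illustrations, not the real SDK types.

```python
# Toy model of PCollection + chained transforms (not the Beam SDK).
class PCollection:
    def __init__(self, elements):
        self.elements = list(elements)

    def __or__(self, transform):
        # Applying a transform yields a *new* collection;
        # the input collection is left unchanged.
        return PCollection(transform(self.elements))

def split_words(lines):
    return [w for line in lines for w in line.split()]

def to_upper(words):
    return [w.upper() for w in words]

out = PCollection(["a b", "c"]) | split_words | to_upper
print(out.elements)  # ['A', 'B', 'C']
```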

slide 4:

Beam Pipelines
● When designing pipelines, consider
– Where is the data stored
– What does the data look like
– What do you want to do with the data
– What should your output data look like
– Where should the output data go
● Use PCollection and PTransform functions to define pipelines
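The design questions above map directly onto a pipeline's three parts: where the data lives (source), what to do with it (transforms), and where results go (sink). A hedged plain-Python sketch, with an in-memory list standing in for real pipeline I/O:

```python
# Sketch of the source -> transforms -> sink pipeline shape.
def run(source, transforms, sink):
    data = list(source)            # where is the data stored / what shape is it
    for t in transforms:           # what do you want to do with the data
        data = [t(x) for x in data]
    sink.extend(data)              # where should the output data go

output = []
run(source=range(3),
    transforms=[lambda x: x + 1, lambda x: x * 10],
    sink=output)
print(output)  # [10, 20, 30]
```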

slide 5:

Beam Example Pipelines

slide 6:

Beam Example Pipelines

slide 7:

Beam Runners
● Supported Beam runners are
– Direct Runner (test and development)
– Apache Apex
– Apache Flink
– Apache Gearpump
– Apache Hadoop MapReduce
– Apache Nemo
– Apache Samza
– Apache Spark
– Google Cloud Dataflow
– Hazelcast Jet
– IBM Streams
– JStorm

slide 8:

Beam Capability Matrix – What Is Computed

slide 9:

Beam Capability Matrix – Where It Is Computed

slide 10:

Beam Capability Matrix – When It Is Computed

slide 11:

Beam Capability Matrix – How It Is Computed
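The four capability-matrix slides above correspond to Beam's core questions: transforms (what is computed), windowing (where in event time), triggers and watermarks (when in processing time), and accumulation (how refinements relate). A plain-Python sketch of just the "where" part, assigning timestamped events to fixed 60-second windows and counting per window (window size and events are made up for illustration):

```python
# Sketch of fixed-window assignment: each event's timestamp decides
# which window it belongs to, and results are aggregated per window.
from collections import defaultdict

def fixed_window_counts(events, size=60):
    """events: (timestamp_seconds, value) pairs -> event count per window."""
    counts = defaultdict(int)
    for ts, _value in events:
        window_start = (ts // size) * size   # start of the window ts falls in
        counts[window_start] += 1
    return dict(counts)

events = [(5, "a"), (30, "b"), (65, "c"), (119, "d"), (120, "e")]
print(fixed_window_counts(events))  # {0: 2, 60: 2, 120: 1}
```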

slide 12:

Available Books
● See “Big Data Made Easy” – Apress, Jan 2015
● See “Mastering Apache Spark” – Packt, Oct 2015
● See “Complete Guide to Open Source Big Data Stack” – Apress, Jan 2018
● Find the author on Amazon
– www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
● Connect on LinkedIn
– www.linkedin.com/in/mike-frampton-38563020

slide 13:

Connect
● Feel free to connect on LinkedIn
– www.linkedin.com/in/mike-frampton-38563020
● See my open source blog at
– open-source-systems.blogspot.com/
● I am always interested in
– New technology
– Opportunities
– Technology-based issues
– Big data integration
