Bringing Learnings from Googley Microservices with gRPC


Presentation Description

Varun Talwar, product manager on Google's gRPC project, discusses the fundamentals and specs of gRPC inside a Google-scale microservices architecture.


Presentation Transcript

slide 1:

Google confidential │ Do not distribute
Bringing learnings from Googley microservices with gRPC
Microservices Summit
Varun Talwar

slide 2:

Contents
1. Context: Why are we here?
2. Learnings from Stubby experience
   a. HTTP/JSON doesn't cut it
   b. Establish a Lingua Franca
   c. Design for fault tolerance and control: Sync/Async, Deadlines, Cancellations, Flow control
   d. Flying blind without stats
   e. Diagnosing with tracing
   f. Load Balancing is critical
3. gRPC
   a. Cross platform matters
   b. Performance and Standards matter: HTTP/2
   c. Pluggability matters: Interceptors, Name Resolvers, Auth plugins
   d. Usability matters

slide 3:


slide 4:

Business Agility

slide 5:

Developer Productivity

slide 6:


slide 7:


slide 8:

Microservices at Google: O(10^10) RPCs per second. Images by Connie Zhou.

slide 9:

Stubby Magic Google

slide 10:

Making Google magic available to all
Diagram labels: Kubernetes, Borg, Stubby

slide 11:


slide 12:

Key learnings
1. HTTP/JSON doesn't cut it
2. Establish a lingua franca
3. Design for fault tolerance and provide control knobs
4. Don't fly blind: Service Analytics
5. Diagnosing problems: Tracing
6. Load Balancing is critical

slide 13:

HTTP/JSON doesn’t cut it
1. WWW, browser growth - bled into services
2. Stateless
3. Text on the wire
4. Loose contracts
5. TCP connection per request
6. Noun-based
7. Harder API evolution
8. Think compute, network on cloud platforms

slide 14:

Establish a lingua franca
1. Protocol Buffers - since 2003
2. Start with IDL
3. Have a language-agnostic way of agreeing on data semantics
4. Code gen in various languages
5. Forward and backward compatibility
6. API evolution

slide 15:

How we roll at Google

slide 16:

Google Cloud Platform
Service Definition (weather.proto)
syntax = "proto3";
service Weather {
  rpc GetCurrent(WeatherRequest) returns (WeatherResponse);
}
message WeatherRequest {
  Coordinates coordinates = 1;
}
message Coordinates {
  fixed64 latitude = 1;
  fixed64 longitude = 2;
}
message WeatherResponse {
  Temperature temperature = 1;
  float humidity = 2;
}
message Temperature {
  float degrees = 1;
  Units units = 2;
}
enum Units {
  FAHRENHEIT = 0;
  CELSIUS = 1;
  KELVIN = 2;
}

slide 17:

Design for fault tolerance and control
● Sync and Async APIs
● Need fault tolerance: Deadlines, Cancellations
● Control Knobs: Flow control, Service Config, Metadata

slide 18:

gRPC Deadlines
First-class feature in gRPC. A deadline is an absolute point in time. The deadline indicates to the server how long the client is willing to wait for an answer. The RPC will fail with the DEADLINE_EXCEEDED status code when the deadline is reached.

slide 19:

Deadline Propagation
Gateway: withDeadlineAfter(200, MILLISECONDS)
Latencies shown in the diagram: 90 ms, 40 ms, 20 ms, 20 ms, 60 ms
Now = 1476600000000, Deadline = 1476600000200
Now = 1476600000040, Deadline = 1476600000200
Now = 1476600000150, Deadline = 1476600000200
Now = 1476600000230, Deadline = 1476600000200
Result: DEADLINE_EXCEEDED (repeated four times in the diagram)
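The timestamps on this slide can be replayed in a few lines. The following is a minimal Python sketch, not gRPC's implementation: `Deadline`, `check`, and `DeadlineExceeded` are hypothetical names. It assumes only what the slides state, namely that the deadline is an absolute point in time and that every hop compares it against its own clock.

```python
from dataclasses import dataclass


class DeadlineExceeded(Exception):
    """Stands in for the DEADLINE_EXCEEDED status code."""


@dataclass(frozen=True)
class Deadline:
    at_ms: int  # absolute point in time, milliseconds since the epoch

    @classmethod
    def after(cls, now_ms: int, timeout_ms: int) -> "Deadline":
        # The timeout is converted to an absolute time once, at the gateway.
        return cls(now_ms + timeout_ms)

    def remaining_ms(self, now_ms: int) -> int:
        return self.at_ms - now_ms


def check(deadline: Deadline, now_ms: int) -> int:
    """Return the remaining budget, or fail the call if the deadline has passed."""
    remaining = deadline.remaining_ms(now_ms)
    if remaining <= 0:
        raise DeadlineExceeded(f"now={now_ms} deadline={deadline.at_ms}")
    return remaining


# Timestamps taken from the slide; every hop sees the same absolute deadline.
deadline = Deadline.after(now_ms=1476600000000, timeout_ms=200)
budgets = [check(deadline, now) for now in (1476600000000, 1476600000040, 1476600000150)]
```

Calling `check(deadline, 1476600000230)` raises `DeadlineExceeded`, which corresponds to the last hop on the slide.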

slide 20:

Cancellation
Deadlines are expected. What about unpredictable cancellations?
• User cancelled request.
• Caller is not interested in the result any more.
• etc.

slide 21:

Cancellation
Diagram: GW with nine nodes marked Busy and nine links marked Active RPC.

slide 22:

Cancellation Propagation
Diagram: GW with nine nodes marked Idle.

slide 23:

Cancellation
Automatically propagated. The RPC fails with the CANCELLED status code. Cancellation status can be accessed by the receiver. The server (receiver) always knows if the RPC is still valid.
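A rough way to picture automatic propagation is a tree of in-flight calls in which cancelling the root cancels everything it started. The Python sketch below is illustrative only; the `Call` class and the backend names are hypothetical and do not mirror any gRPC API.

```python
class Call:
    """A hypothetical in-flight RPC that may have issued child RPCs."""

    def __init__(self, name, parent=None):
        self.name = name
        self.cancelled = False
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def cancel(self):
        """Mark this call CANCELLED and propagate to everything it started."""
        if self.cancelled:
            return
        self.cancelled = True
        for child in self.children:
            child.cancel()

    def is_cancelled(self):
        # The receiver can poll this to stop work that nobody is waiting for.
        return self.cancelled


def active(call):
    """Names of calls in the tree that are still doing work."""
    names = [] if call.cancelled else [call.name]
    for child in call.children:
        names.extend(active(child))
    return names


gw = Call("GW")
backends = [Call(f"backend-{i}", parent=gw) for i in range(3)]
leaf = Call("storage", parent=backends[0])
```

After `gw.cancel()`, `active(gw)` is empty and `leaf.is_cancelled()` is true, matching the Busy to Idle transition in the two diagrams.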

slide 24:

BiDi Streaming - Slow Client
Diagram: Fast Server, Slow Client; Request, Responses. Status codes shown: CANCELLED, UNAVAILABLE, RESOURCE_EXHAUSTED.

slide 25:

BiDi Streaming - Slow Server
Diagram: Fast Client, Slow Server; Requests, Response. Status codes shown: CANCELLED, UNAVAILABLE, RESOURCE_EXHAUSTED.

slide 26:

Flow-Control
Flow-control helps to balance computing power and network capacity between client and server. gRPC supports both client- and server-side flow control. Photo taken by Andrey Borisenko.
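One common way to implement this kind of balancing is credit-based: the receiver advertises a window, the sender spends credit on each send, and credit returns only as the receiver drains its buffer. The sketch below is a simplified Python model under that assumption. It counts messages rather than bytes, and `Sender`, `Receiver`, and `on_window_update` are hypothetical names.

```python
from collections import deque


class Receiver:
    """Slow consumer that grants credit only as it drains its buffer."""

    def __init__(self, window):
        self.buffer = deque()
        self.window = window  # max messages it is willing to buffer

    def deliver(self, msg):
        self.buffer.append(msg)

    def consume(self, n=1):
        """Process up to n buffered messages; return the credit to hand back."""
        done = 0
        while self.buffer and done < n:
            self.buffer.popleft()
            done += 1
        return done


class Sender:
    def __init__(self, receiver):
        self.receiver = receiver
        self.credit = receiver.window  # initial window

    def try_send(self, msg):
        """Send only if there is credit; otherwise tell the caller to wait."""
        if self.credit == 0:
            return False
        self.credit -= 1
        self.receiver.deliver(msg)
        return True

    def on_window_update(self, n):
        self.credit += n


receiver = Receiver(window=2)
sender = Sender(receiver)
results = [sender.try_send(i) for i in range(3)]  # third send is held back
sender.on_window_update(receiver.consume(1))      # receiver drains one message
resumed = sender.try_send(2)                      # now there is credit again
```

The fast side never buffers more than the slow side has agreed to hold, which is the point of the two preceding slow-client and slow-server diagrams.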

slide 27:

Service Config
● Policies where the server tells the client what it should do
● Can specify deadlines, LB policy, and payload size per method of a service
● Loved by SREs: they have more control
● Discovery via DNS
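As an illustration of per-method policy, the Python sketch below parses a small JSON document and looks up the entry for a method. The field names are modelled on gRPC's JSON service config as I understand it and should be checked against the current documentation; the service and method names, the values, and the `method_config` helper are all hypothetical.

```python
import json

# Illustrative config: a load-balancing policy plus per-method settings.
SERVICE_CONFIG_JSON = """
{
  "loadBalancingPolicy": "round_robin",
  "methodConfig": [
    {
      "name": [{"service": "Weather", "method": "GetCurrent"}],
      "timeout": "0.2s",
      "maxRequestMessageBytes": 4096
    },
    {
      "name": [{"service": "Weather"}],
      "timeout": "1s"
    }
  ]
}
"""


def method_config(config, service, method):
    """Pick the most specific entry: exact method first, then service-wide."""
    service_wide = None
    for entry in config.get("methodConfig", []):
        for name in entry["name"]:
            if name.get("service") != service:
                continue
            if name.get("method") == method:
                return entry
            if "method" not in name:
                service_wide = entry
    return service_wide


config = json.loads(SERVICE_CONFIG_JSON)
```

Here `method_config(config, "Weather", "GetCurrent")` yields the 0.2 s entry, while any other method of the same service falls back to the service-wide 1 s entry.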

slide 28:

Metadata Exchange - Common cross-cutting concerns like authentication or tracing rely on the exchange of data that is not part of the declared interface of a service. Deployments rely on their ability to evolve these features at a different rate to the individual APIs exposed by services. Metadata helps in exchange of useful information

slide 29:

Don’t fly blind: Stats
● What is the mean latency per RPC?
● How many RPCs per hour for a service?
● Errors in the last minute/hour?
● How many bytes sent? How many connections to my server?

slide 30:

Data collection by arbitrary metadata is useful
● Any service’s resource usage and performance stats in real time, by almost any arbitrary metadata:
1. Service X can monitor CPU usage in their jobs, broken down by the name of the invoked RPC and the mdb user who sent it.
2. Social can monitor the RPC latency of shared bigtable jobs when responding to their requests, broken down by whether the request originated from a user on web/Android/iOS.
3. Gmail can collect usage on servers, broken down according to POP/IMAP/web/Android/iOS. The layer propagates Gmail's metadata down to every service, even if the request was made by an intermediary job that Gmail doesn't own.
● The stats layer exports data to varz and streamz, and provides stats to many monitoring systems and dashboards.
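The questions on the stats slides (mean latency, call counts, errors, bytes sent) and the breakdown by metadata can be pictured with a small in-memory aggregator. This is a Python sketch only: `RpcStats` is a hypothetical class, the sample numbers are made up, and it does not represent Google's internal stats layer.

```python
from collections import defaultdict


class RpcStats:
    """Per-key counters; the key can be any metadata tuple (method, client type, ...)."""

    def __init__(self):
        self.count = defaultdict(int)
        self.errors = defaultdict(int)
        self.latency_ms = defaultdict(float)
        self.bytes_sent = defaultdict(int)

    def record(self, key, latency_ms, bytes_sent, ok=True):
        self.count[key] += 1
        self.latency_ms[key] += latency_ms
        self.bytes_sent[key] += bytes_sent
        if not ok:
            self.errors[key] += 1

    def mean_latency_ms(self, key):
        # Guard against division by zero for keys that were never recorded.
        return self.latency_ms[key] / self.count[key] if self.count[key] else 0.0


stats = RpcStats()
stats.record(("Weather/GetCurrent", "android"), latency_ms=12.0, bytes_sent=120)
stats.record(("Weather/GetCurrent", "android"), latency_ms=18.0, bytes_sent=80, ok=False)
stats.record(("Weather/GetCurrent", "web"), latency_ms=40.0, bytes_sent=200)
```

Because the key is an arbitrary tuple, the same recorder answers "latency per RPC" and "latency per RPC per client platform" without code changes.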

slide 31:

Diagnosing problems: Tracing
● 1 in 10K requests takes very long. It's an ad query :-( I need to find out.
● Take a sample and store it in a database; this helps identify requests in the sample which took a similar amount of time.
● I didn't get a response from the service. What happened? Which link in the service dependency graph got stuck? Stitch a trace and figure out.
● Where is it taking time for a trace? Hotspot analysis.
● What are all the dependencies for a service?

slide 32:

Load Balancing is important
Iteration 1: Stubby Balancer
Iteration 2: Client side load balancing
Iteration 3: Hybrid
Iteration 4: gRPC-lb

slide 33:

Next gen of load balancing: gRPC LB
● Current client support is intentionally dumb (simplicity).
○ Pick first available - Avoid connection establishment latency
○ Round-robin-over-list - Lists, not sets → ability to represent weights
● For anything more advanced, move the burden to an external "LB Controller", a regular gRPC server, and rely on a client-side implementation of the so-called gRPC LB policy.
Diagram: client, LB Controller, backends. (1) Control RPC, (2) address-list, (3) RR over addresses of address-list.
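The two client-side policies are simple enough to sketch directly. In the Python below, the addresses and the helper names are hypothetical; the point is only that round-robin over a list (rather than a set) lets a repeated entry act as a weight.

```python
import itertools


def pick_first(addresses, is_available):
    """Return the first address that is currently reachable."""
    for addr in addresses:
        if is_available(addr):
            return addr
    raise RuntimeError("no backend available")


def round_robin(address_list):
    """Cycle over a list, not a set, so repeated entries act as weights."""
    return itertools.cycle(address_list)


# Hypothetical address-list, as an LB controller might return it.
# 10.0.0.1 appears twice, so it receives two of every three picks.
address_list = ["10.0.0.1:50051", "10.0.0.1:50051", "10.0.0.2:50051"]
picker = round_robin(address_list)
picks = [next(picker) for _ in range(6)]
```

Keeping the client this simple means any smarter weighting logic lives in the controller, which only has to change the list it hands out.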

slide 34:

In summary, what did we learn?
● Contracts should be strict
● Common language helps
● Common understanding for deadlines, cancellations, flow control
● Common stats/tracing framework is essential for monitoring and debugging
● Common framework allows uniform policy application for control and LB
Single point of integration for logging, monitoring, tracing, service discovery, and load balancing makes lives much easier.

slide 35:


slide 36:

Open source on GitHub for C, C++, Java, Node.js, Python, Ruby, Go, C#, PHP, Objective-C
Repositories: gRPC core, gRPC Java, gRPC Go

slide 37:

Where is the project today?
● 1.0 with stable APIs
● Well documented with an active community
● Reliable, with continuously running tests on GCE
○ Deployable in your environment
● Measured with an open performance dashboard
○ Deployable in your environment
● Well adopted inside and outside Google

slide 38:

More lessons
1. Cross language and cross platform matters
2. Performance and Standards matter: HTTP/2
3. Pluggability matters: Interceptors, Name Resolvers, Auth plugins
4. Usability matters

slide 39:

More lessons
1. Cross language and cross platform matters
2. Performance and Standards matter: HTTP/2
3. Pluggability matters: Interceptors, Name Resolvers, Auth plugins
4. Usability matters

slide 40:

gRPC Principles and Requirements
Coverage and Simplicity: The stack should be available on every popular development platform and easy for someone to build for their platform of choice. It should be viable on CPU- and memory-limited devices.

slide 41:

gRPC Speaks Your Language
Service definitions and client libraries: Java, Go, C/C++, C#, Node.js, PHP, Ruby, Python, Objective-C
Platforms supported: MacOS, Linux, Windows, Android, iOS

slide 42:

Interoperability
Diagram labels: Java Service, Python Service, GoLang Service, C++ Service, connected through gRPC Service and gRPC Stub boxes.

slide 43:

More lessons
1. Cross language and cross platform matters
2. Performance and Standards matter: HTTP/2
3. Pluggability matters: Interceptors, Name Resolvers, Auth plugins
4. Usability matters

slide 44:

HTTP/2 in One Slide
• Single TCP connection.
• No Head-of-line blocking.
• Binary framing layer.
• Request – Stream.
• Header Compression.
Layers: Application (HTTP/2), Session (TLS, optional), Transport (TCP), Network (IP)
HTTP/2: Binary Framing, HEADERS Frame, DATA Frame
HTTP/1.x:
POST /upload HTTP/1.1
Host:
Content-Type: application/json
Content-Length: 27
{"msg": "Welcome to 2016"}

slide 45:

Binary Framing
HTTP/2 breaks down the HTTP protocol communication into an exchange of binary-encoded frames, which are then mapped to messages that belong to a stream, all of which are multiplexed within a single TCP connection.
Diagram: Stream 1, Stream 2, ... Stream N over one TCP connection.
Request HEADERS: :method: GET, :path: /kyiv, :version: HTTP/2, :scheme: https
Response HEADERS: :status: 200, :version: HTTP/2, :server: nginx/1.10.1, ...
DATA: payload
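To make "binary-encoded frames" concrete, the sketch below packs and unpacks the 9-byte HTTP/2 frame header: a 24-bit payload length, an 8-bit type, 8-bit flags, and a reserved bit followed by a 31-bit stream identifier. The `encode_frame` and `decode_frame` helpers are hypothetical names, and the payload is an arbitrary example rather than a real gRPC message.

```python
import struct

DATA, HEADERS = 0x0, 0x1              # frame types
END_STREAM, END_HEADERS = 0x1, 0x4    # flags


def encode_frame(frame_type, flags, stream_id, payload):
    """Prefix payload with the 9-byte frame header."""
    length = len(payload)
    if length >= 1 << 24:
        raise ValueError("payload too large for a single frame")
    # The 24-bit length is split into one high byte and a 16-bit low part.
    header = struct.pack(">BHBBI", length >> 16, length & 0xFFFF,
                         frame_type, flags, stream_id & 0x7FFFFFFF)
    return header + payload


def decode_frame(buf):
    """Return (type, flags, stream_id, payload) for the first frame in buf."""
    hi, lo, frame_type, flags, stream_id = struct.unpack(">BHBBI", buf[:9])
    length = (hi << 16) | lo
    return frame_type, flags, stream_id & 0x7FFFFFFF, buf[9:9 + length]


payload = b'{"msg": "Welcome to 2016"}'
frame = encode_frame(DATA, END_STREAM, stream_id=1, payload=payload)
```

Since every frame carries its stream identifier, frames from many streams can be interleaved on one connection and reassembled on the other side.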

slide 46:

HTTP/1.x vs HTTP/2

slide 47:

gRPC Service Definitions
Unary: Unary RPCs, where the client sends a single request to the server and gets a single response back, just like a normal function call.
Server streaming: The client sends a request to the server and gets a stream to read a sequence of messages back. The client reads from the returned stream until there are no more messages.
Client streaming: The client sends a sequence of messages to the server using a provided stream. Once the client has finished writing the messages, it waits for the server to read them and return its response.
BiDi streaming: Both sides send a sequence of messages using a read-write stream. The two streams operate independently. The order of messages in each stream is preserved.
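The four call shapes map naturally onto plain functions and generators. The Python sketch below only illustrates the shapes and is not generated gRPC code; the function names and payloads are hypothetical, and the BiDi example happens to answer once per request even though the two directions are independent in general.

```python
def unary(request):
    """One request in, one response out."""
    return f"weather for {request}"


def server_streaming(request):
    """One request in, a stream of responses out."""
    for hour in range(3):
        yield f"{request} +{hour}h"


def client_streaming(requests):
    """A stream of requests in, one response once the client is done writing."""
    return sum(1 for _ in requests)


def bidi_streaming(requests):
    """Both sides stream; order is preserved within each direction."""
    for request in requests:
        yield request.upper()
```

For example, `list(server_streaming("kyiv"))` reads the stream until there are no more messages, and `client_streaming(iter(["a", "b"]))` returns only after the input is exhausted.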

slide 48:

BiDi Streaming Use-Cases
Messaging applications. Games / multiplayer tournaments. Moving objects. Sport results. Stock market quotes. Smart home devices. You name it.

slide 49:

Performance
● Open Performance Benchmark and Dashboard
● Benchmarks run in GCE VMs per Pull Request for regression testing.
● gRPC users can run these in their environments.
● Good performance across languages:
○ Java throughput: 500 K RPCs/sec and 1.3 M streaming messages/sec on 32-core VMs
○ Java latency: 320 us for unary ping-pong (netperf: 120 us)
○ C++ throughput: 1.3 M RPCs/sec and 3 M streaming messages/sec on 32-core VMs

slide 50:

More lessons
1. Cross language and cross platform matters
2. Performance and Standards matter: HTTP/2
3. Pluggability matters: Interceptors, Auth
4. Usability matters

slide 51:

gRPC Principles and Requirements
Pluggable: Large distributed systems need security, health-checking, load-balancing and failover, monitoring, tracing, logging, and so on. Implementations should provide extension points to allow for plugging in these features and, where useful, default implementations.

slide 52:

Interceptors
Diagram: Client, client interceptors, server interceptors, Server; Request and Response pass through both sets of interceptors.
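An interceptor is essentially a wrapper that sees the call on the way in and the response on the way out. The following Python sketch chains two of them around a handler. It is framework-agnostic and deliberately does not use the gRPC interceptor API: `chain`, `logging_interceptor`, `auth_interceptor`, the method path, and the token value are all hypothetical.

```python
def logging_interceptor(log):
    def intercept(method, request, metadata, next_call):
        log.append(f"-> {method}")
        response = next_call(method, request, metadata)
        log.append(f"<- {method}")
        return response
    return intercept


def auth_interceptor(token):
    def intercept(method, request, metadata, next_call):
        # Metadata is copied so the caller's dict is not mutated.
        return next_call(method, request, {**metadata, "authorization": f"Bearer {token}"})
    return intercept


def chain(handler, interceptors):
    """Wrap handler so the first interceptor in the list runs outermost."""
    call = handler
    for interceptor in reversed(interceptors):
        # Bind the current values explicitly to avoid late-binding closures.
        call = (lambda nxt, icpt: lambda m, r, md: icpt(m, r, md, nxt))(call, interceptor)
    return call


def handler(method, request, metadata):
    return {"echo": request, "saw_auth": "authorization" in metadata}


log = []
call = chain(handler, [logging_interceptor(log), auth_interceptor("example-token")])
response = call("/Weather/GetCurrent", "kyiv", {})
```

The handler never needs to know that logging or auth exists, which is what lets such cross-cutting features evolve separately from the service code.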

slide 53:

Pluggability
● Auth and Security - TLS (mutual), plugin auth mechanism (e.g. OAuth)
● Proxies
○ Basic: nghttp2, haproxy, traefik
○ Advanced: Envoy, linkerd, Google LB, Nginx (in progress)
● Service Discovery
○ etcd, Zookeeper, Eureka, …
● Monitor and Trace
○ Zipkin, Prometheus, Statsd, Google, DIY

slide 54:

More lessons
1. Cross language and cross platform matters
2. Performance and Standards matter: HTTP/2
3. Pluggability matters: Interceptors, Auth
4. Usability matters

slide 55:

Get Started

slide 56:

Coming soon
1. Server reflection
2. Health Checking
3. Automatic retries
4. Streaming compression
5. Mechanism to do caching
6. Binary Logging
   a. Debugging, auditing, though costly
7. Unit Testing support
   a. Automated mock testing
   b. Don't need to bring up all dependent services just to test
8. Web support

slide 57:

Some early adopters
Microservices: in data centres
Streaming telemetry from network devices
Client-server communication / internal APIs
Mobile apps

slide 58:

Thank you
Twitter: grpcio
Site:
Group:
Repo:

slide 59:


slide 60:

Why gRPC?
Multi-language: 9 languages
Open: Open source and growing community
Strict service contracts: Define and enforce contracts, backward compatible
Performant: 1m+ QPS unary, 3m+ streaming (dashboard)
Pluggable design: Auth, Transport, IDL, LB
Efficiency on wire: 2-3X gains
Streaming APIs: Large payloads, speech, logs
Standard compliant: HTTP/2
Easy to use: Single line installation

slide 61:

The Fallacies of Distributed Computing
● The network is reliable
● Latency is zero
● Bandwidth is infinite
● The network is secure
● Topology doesn't change
● There is one administrator
● Transport cost is zero
● The network is homogeneous

slide 65:

How is gRPC Used?
Direct RPCs: Microservices
Environments: On Prem, GCP, Other Cloud

slide 66:

How is gRPC Used?
Direct RPCs: Microservices
RPCs to access APIs: Google APIs, Your APIs
Environments: On Prem, GCP, Other Cloud

slide 67:

How is gRPC Used?
Direct RPCs: Microservices
RPCs to access APIs: Google APIs, Your APIs
Mobile/Web RPCs: Your Mobile/Web Apps
Environments: On Prem, GCP, Other Cloud

slide 68:

What are the benefits?
Developers: Ease of use, Performance, Versioning, Programming model
Operators: Uniform Monitoring, Debugging/Tracing, Cross platform/language
Architects/Managers: Defined Contracts, Single uniform framework for control, Visibility

slide 69:

gRPC Principles and Requirements
Layered: Key facets of the stack must be able to evolve independently. A revision to the wire-format should not disrupt application layer bindings.

slide 70:

Layered Architecture
Diagram labels: Stub, Code Gen’d Service API, Code Gen Support Code, Channel API, Transport API; Standard applications; Initialization, interceptors, and advanced applications.

slide 71:

Layered Architecture
Diagram labels: HTTP/2; RPC Client-Side App; Channel; Stub, Future Stub, Blocking Stub; ClientCall; RPC Server-side Apps; Tran 1, Tran 2, Tran N; Service Definition (extends generated definition); ServerCall handler; Transport; ServerCall; NameResolver; LoadBalancer; Pluggable Load Balancing and Service Discovery.

slide 72:

Takeaways
HTTP/2 is a high-performance, production-ready, multiplexed, bidirectional protocol.
gRPC
• HTTP/2 transport based, open source, general purpose, standards-based, feature-rich RPC framework.
• Bidirectional streaming over one single TCP connection.
• Netty transport provides asynchronous and non-blocking I/O.
• Deadline and cancellation propagation.
• Client- and server-side flow-control.
• Layered, pluggable, and extensible.
• Supports 10 programming languages.
• Built-in testing support.
• Production-ready (current version is 1.0.1) and growing ecosystem.

slide 73:

Growing Ecosystem

slide 74:

gRPC Gateway
Migration. Testing. Swagger / OpenAPI tooling. Photo taken by Andrey Borisenko.

slide 75:

Metadata and Auth
● Protocol Structure
○ Request → Call Spec, Header Metadata, Messages
○ Response → Header Metadata, Messages, Trailing Metadata, Status
● Generic mechanism for attaching metadata to requests and responses
● Commonly used to attach “bearer tokens” to requests for Auth
○ OAuth2 access tokens
○ JWT (e.g. OpenID Connect ID Tokens)
● Session state for specific Auth mechanisms is encapsulated in an Auth-credentials object
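The last bullet, session state encapsulated in a credentials object, can be sketched as a small class that owns the token and its expiry and hands back request metadata on demand. This is a Python illustration under stated assumptions: `BearerCredentials`, `fake_fetch`, the token strings, and the 60-second lifetime are hypothetical, and no real OAuth2 or JWT library is involved.

```python
import time


class BearerCredentials:
    """Hypothetical credentials object: owns the token and its expiry."""

    def __init__(self, fetch_token, clock=time.time):
        self._fetch_token = fetch_token  # returns (token, lifetime_seconds)
        self._clock = clock
        self._token = None
        self._expires_at = 0.0

    def metadata(self):
        """Return request metadata, refreshing the token when it has expired."""
        if self._token is None or self._clock() >= self._expires_at:
            self._token, lifetime = self._fetch_token()
            self._expires_at = self._clock() + lifetime
        return [("authorization", f"Bearer {self._token}")]


# A fake clock and token source keep the example deterministic.
now = [0.0]
issued = []


def fake_fetch():
    issued.append(f"token-{len(issued) + 1}")
    return issued[-1], 60


creds = BearerCredentials(fake_fetch, clock=lambda: now[0])
first = creds.metadata()
now[0] = 30.0
second = creds.metadata()   # still valid, token is reused
now[0] = 61.0
third = creds.metadata()    # expired, token is refreshed
```

Callers only ever ask for metadata; when and how the token is refreshed stays inside the credentials object.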
