Lyft's Envoy: From Monolith to Service Mesh

Views:
 
     
 

Presentation Description

Matt Klein discusses building Lyft's open source Envoy in order to achieve microservice scale at the 2017 Microservices.com Practitioner Summit.

Comments

Presentation Transcript

slide 1:

Lyfts Envoy: From monolith to service mesh Matt Klein Software Engineer Lyft

slide 2:

Agenda ● Historical Lyft SoA architecture ● State of SoA networking in industry ● What is Envoy ● Envoy deployment Lyft ● Envoy future directions ● QA

slide 3:

Lyft 3.5 years ago PHP / Apache monolith MongoDB Internet Clients AWS ELB Simple No SoA

slide 4:

Lyft 2 years ago PHP / Apache monolith +haproxy/nsq MongoDB Internet Clients AWS external ELB DynamoDB AWS internal ELBs Python services Not simple SoA With monolith and some haproxy/nsq

slide 5:

Agenda ● Historical Lyft SoA architecture ● State of SoA networking in industry ● What is Envoy ● Envoy deployment Lyft ● Envoy future directions ● QA

slide 6:

State of SoA networking in industry ● Languages and frameworks. ● Per language libraries for service calls. ● Protocols HTTP/1 HTTP/2 gRPC databases caching etc.. ● Infrastructures IaaS CaaS on premise etc.. ● Intermediate load balancers AWS ELB F5 etc.. ● Observability output stats tracing and logging. ● Implementations often partial of retry circuit breaking rate limiting timeouts and other distributed systems best practices. ● Authentication and Authorization.

slide 7:

State of SoA networking in industry A really big and confusing mess...

slide 8:

State of SoA networking in industry ● Likely already in a world of hurt or rapidly approaching that point. ● Debugging is difficult or impossible each application exposes different stats and logs with no tracing. ● Limited visibility into infra components such as hosted load balancers databases caches network topologies etc.. ● Multiple and partial implementations of circuit breaking retry and rate limiting If I had a for every time someone told me that retries are “easy” …. ● Furthermore if you do have a good solution you are likely using a library and are locked into a particular technology stack essentially forever. ● Libraries are incredibly painful to upgrade. Think CVEs.

slide 9:

State of SoA networking in industry ● Ultimately robust observability and easy debugging are everything. ● As SoAs become more complicated it is critical that we provide a common solution to all of these problems or developer productivity grinds to a halt and the site goes down … often. Can we do better

slide 10:

Agenda ● Historical Lyft SoA architecture ● State of SoA networking in industry ● What is Envoy ● Envoy deployment Lyft ● Envoy future directions ● QA

slide 11:

What is Envoy The network should be transparent to applications. When network and application problems do occur it should be easy to determine the source of the problem. This sounds great But it turns out it’s really really hard.

slide 12:

What is Envoy ● Out of process architecture: Let’s do a lot of really hard stuff in one place and allow application developers to focus on business logic. ● Modern C++11 code base: Fast and productive. ● L3/L4 filter architecture: A TCP proxy at its core. Can be used for things other than HTTP e.g. MongoDB redis stunnel replacement TCP rate limiter etc.. ● HTTP L7 filter architecture: Make it easy to plug in different functionality. ● HTTP/2 first Including gRPC and a nifty gRPC HTTP/1.1 bridge. ● Service discovery and active health checking. ● Advanced load balancing: Retry timeouts circuit breaking rate limiting shadowing etc. ● Best in class observability: stats logging and tracing. ● Edge proxy: routing and TLS.

slide 13:

Envoy service to service topology Service Cluster Envoy Service Discovery Service Cluster Envoy Service External Services HTTP/2 REST / GRPC

slide 14:

Envoy edge proxy topology “Front” Envoy Edge Proxy Region 1 Internet External Clients HTTP/1.1 HTTP/2 TLS “Front” Envoy Edge Proxy Region 2 Private Infra HTTP/2 TLS Client Auth

slide 15:

Lyft today Legacy monolith +Envoy MongoDB Internet Clients “Front” Envoy via TCP ELB DynamoDB Python services +Envoy Service mesh Awesome Go services +Envoy Stats / tracing direct from Envoy Discovery

slide 16:

Eventually consistent service discovery ● Fully consistent service discovery systems are very popular ZK etcd consul etc.. ● In practice they are hard to run at scale. ● Service discovery is actually an eventually consistent problem. Let’s recognize that and design for it. ● Envoy is designed from the get go to treat service discovery as lossy. ● Active health checking used in combination with service discovery to produce a routable overlay. Discovery Status HC OK HC Failed Discovered Route Don’t Route Absent Route Don’t Route / Delete

slide 17:

Advanced load balancing ● Different service discovery types. ● Zone aware least request load balancing. ● Dynamic stats: Per zone canary specific stats etc. ● Circuit breaking: Max connections requests and retries. ● Rate limiting: Integration with global rate limit service. ● Shadowing: Fork traffic to a test cluster. ● Retries: HTTP router has built in retry capability with different policies. ● Timeouts: Both “outer” including all retries and “inner” per try timeouts. ● Outlier detection: Consecutive 5xx ● Deploy control: Blue/green canary etc.

slide 18:

Observability ● Observability is by far the most important thing that Envoy provides. ● Having all SoA traffic transit through Envoy gives us a single place where we can: ○ Produce consistent statistics for every hop ○ Create and propagate a stable request ID ○ Consistent logging ○ Distributed tracing

slide 19:

Observability

slide 20:

Observability

slide 21:

Observability

slide 22:

Observability

slide 23:

Performance matters for a service proxy ● Throughput is important ultimately for cost and operational scaling but for many companies developer time is worth more than infra costs. ● BUT: Latency and predictability is what matters. And in particular tail latency P99+. ● We already deal with incredibly confusing deployments virtual IaaS multiple languages and runtimes languages that use GC etc.. All of these niceties improve productivity and reduce upfront dev costs but they make debugging really difficult. ● What is leading to sporadic error The IaaS The app GC ● Ability to reason about overall performance and reliability is critical.

slide 24:

Performance matters for a service proxy ● A service proxy provides invaluable benefits but if the proxy itself has tail latencies that are hard to reason about most of the debugging benefits go out the window and you are back to square one. ● Anyone that is trying to sell you a service infra that does not consider the above points is selling you a dream...

slide 25:

Agenda ● Historical Lyft SoA architecture ● State of SoA networking in industry ● What is Envoy ● Envoy deployment Lyft ● Envoy future directions ● QA

slide 26:

Envoy deployment Lyft ● 100 services. ● 10000 hosts. ● 2000000 RPS. ● All service to service traffic REST and gRPC. ● Use gRPC bridge to unlock Python and PHP clients. ● MongoDB proxy. ● DynamoDB proxy. ● External service proxy AWS and other partners. ● Kibana/Elastic Search for logging. ● LightStep for tracing. ● Wavefront for stats.

slide 27:

Agenda ● Historical Lyft SoA architecture ● State of SoA networking in industry ● What is Envoy ● Envoy deployment Lyft ● Envoy future directions ● QA

slide 28:

Envoy future directions ● Redis. ● More outlier detection and ejection SR and latency. ● LB subset support. ● More rate limiting options / open source rate limit service / IP tagging. ● Configuration schema and better error output. ● Work with Google/community to add k8s support. ● Authentication and authorization. ● Envoy ecosystem

slide 29:

Agenda ● Historical Lyft SoA architecture ● State of SoA networking in industry ● What is Envoy ● Envoy deployment Lyft ● Envoy future directions ● QA

slide 30:

QA ● Thanks for coming ● We are super excited about building a community around Envoy. Talk to us if you need help getting started. ● https://lyft.github.io/envoy/ ● Lyft is hiring: Contact us if you want to work on hard scaling problems in a fast moving company: https://www.lyft.com/jobs

authorStream Live Help