A crash course on Service Mesh
Like Kubernetes, the topic of service meshes can get quite complex once you get into the details of implementation. Based on my 3 years of experience with service meshes, I am attempting to simplify the journey of someone looking to makes sense of this vast topic. This is related to my Kubernetes series of posts.
- Introduction
- Do service meshes only work with Kubernetes?
- Features
- Service mesh cons
- Service mesh vs Event mesh
- Service mesh architecture types
- Comparison of different service meshes
- References
Introduction
- At a high level, a service mesh ensures communication between applications.
- Specifically, a service mesh is a tool for adding observability, security, and reliability features to applications by inserting these features at the platform layer rather than the application layer (where libraries like Twitter’s Finagle, Netflix’s Hystrix, and Google’s Stubby were used).
- The term was introduced in 2016 by William Morgan, Buoyant CEO which eventually led to the creation of the Linkerd 1.0 service mesh. You can read his followup blog post here.
Do service meshes only work with Kubernetes?
- Service meshes are usually associated with Kubernetes. K8s network architecture and layered approach are well suited for service meshes.
- Some service meshes like Linkerd 2.x work only with Kubernetes. Others like Istio, Consul and Cilium can work with applications deployed on Virtual Machines too though the setup will likely involve some additional effort.
Features
- Observability
- Organizations can get observability support (e.g., metrics, logs, and traces) as well as dependency or service graphs for each of their services (microservice or not), as they adopt a service mesh.
- Security
- A service mesh can help in setting up a zero trust security model.
- Authentication, authorization and encrypting traffic between services(mTLS) can be taken up by a service mesh.
- Most service meshes provide a certificate authority (CA) to manage keys and certificates for securing service-to-service communication.
- Reliability
- Resiliency features typically include circuit-breaking, latency-aware load balancing, eventually consistent service discovery, retries, timeouts, and deadlines.
- Service meshes also safeguard service reliability by enforcing a timeout on long-running requests. It can ensure services don’t get overloaded by utilizing techniques like circuit breaking.
Service mesh cons
- Some service meshes can be quite resource heavy(eg: Istiod uses 1 vCPU and 1.5 GB of memory).
- Additional network hops for the traffic.
- Operational complexity can significantly rise for some service meshes wth high learning curve.
- With increase in maturity in k8s and in network CNIs, a number of features are already present and your dependency on service mesh may not be as as much as before.
Service mesh vs Event mesh
Event mesh and service mesh complement each other in the enterprise by providing two different but effective communication options.
- Event mesh connects not only microservices but also legacy applications, cloud-native services, devices, and data sources/sinks. These can operate both in cloud and non-cloud environments.
- While event mesh is asynchronous, service mesh supports more traditional synchronous request-reply messaging.
Service mesh architecture types
Sidecar proxy
- The most popular pattern for implementing a service mesh is the sidecar pattern.
- It involves deploying a network proxy for every service instance which handles all communication between the services. This is part of the service mesh data plane which is controlled by the mesh control plane.
- Many service meshes like Istio, Consul, Cilium use Envoy as proxy.
Host based proxy
- A host based proxy approach involves using a shared agent running on each node/vm of a cluster as proxy.
- It supposed to be a leaner alternative to the sidecar approach as Envoy based sidecars require a fair bit of resources to run (eg: Istiod uses 1 vCPU and 1.5 GB of memory).
- There are service meshes out there who use host based proxy along with sidecar based proxy for better security and division of responsibilities. Eg: Istio’s new Ambient mesh (this mode is not ready for production yet though).
Istio’s Ambient mesh which uses both host based(ztunnel) and sidecar proxy(waypoint):
eBPF based
Isovalent’s Cilium service mesh architecture:
- Extended Berkeley Packet Filter (eBPF) is a feature of the Linux kernel that allows applications to do certain types of work in the kernel itself. eBPF can be used to replace iptables rules, and accelerate the data plane by shortening the data path.
- There are efforts being made with eBPF to have an improved performance with sidecar free service meshes. Currently Cilium is a service mesh which uses eBPF and a node based proxy(Envoy). Istio is also experimenting on this with Merbridge.
- The Linux kernel has decades of features and safeguards in it. It looks difficult to have all proxy features of a service mesh in the kernel with eBPF(especially layer 7 features) but many organisations are looking into using eBPF for optimisation. We need to wait and watch to see how the winds blow here.
Comparison of different service meshes
The below table compares 4 prominent service meshes. For others you can look at https://servicemesh.es/(very detailed with a bit of outdated info and lesser meshes) or https://layer5.io/service-mesh-landscape.
Factor | Istio | Linkerd2 | Consul Connect | Cilium |
---|---|---|---|---|
First stable release | 31 Jul 2018 | 18 Sep 2018 | 16 Oct 2017 | 24 Apr 2018 |
Repository | Istio | Linkerd 2.0 | Consul | Cilium |
Language | go (control plane), C++(data plane ie Envoy) | go (control plane), rust (data plane) | go | go |
Supporting organizations | Lyft, Google, IBM, Microsoft | Cloud Native Foundation (CNCF) | HashiCorp | Isovalent |
Workloads | Kubernetes + VMs | Kubernetes only | Kubernetes + VMs | Kubernetes + VMs |
Architecture : Single point of failure | No – uses sidecar per pod | No | No | Partial - node proxy makes the services in affected node vulnerable |
Architecture : Proxy | Sidecar proxy (customised Envoy) | Sidecar proxy (Linkerd2-proxy) | Sidecar proxy (Envoy) | eBPF + node proxy (Envoy) |
Architecture : Security | Sidecar approach ensures high security. | Sidecar approach ensures high security. | Sidecar approach ensures high security. | Host proxy approach isnt considered as secure. |
mTLS | Yes | Yes | Yes | Yes |
Security : Certificate Management | Yes | Yes | Yes (with Vault integration) | Yes |
Communication Protocols | TCP, HTTP/1.x, HTTP/2, gRPC | TCP, HTTP/1.x, HTTP/2, gRPC | TCP, HTTP/1.x, HTTP/2, gRPC | TCP, HTTP/1.x, HTTP/2, gRPC |
Traffic Management | Blue/Green Deployments, Circuit Breaking, Fault Injection, Rate Limiting | Blue/Green Deployments, Fault Injection | Blue/Green Deployments, Circuit Breaking, Fault Injection, Rate Limiting | Blue/Green Deployments, Circuit Breaking, Fault Injection, Rate Limiting |
Multicluster Support | Yes | Yes | Yes | Yes |
Ingress | Istio gateway or Nginx ingress controller | Any | Envoy and Ambassador | Any |
Operations Complexity | High | Low | Medium | Medium (debugging with eBPF can be hard) |
Learning curve | High | Medium | High (plenty of moving parts) | Medium |
Resources footprint | High | Medium | High | Low |
Support | Largest community support | Large community support + Enterprise support | Solid Enterprise support | Community support + Enterprise support |
References
- https://linkerd.io/what-is-a-service-mesh/
- https://konghq.com/learning-center/service-mesh/what-is-a-service-mesh
- https://isovalent.com/blog/post/addressing-bandwidth-exhaustion-with-cilium-bandwidth-manager/
- https://medium.com/elca-it/service-mesh-performance-evaluation-istio-linkerd-kuma-and-consul-d8a89390d630
- https://www.toptal.com/kubernetes/service-mesh-comparison