A Definitive Guide to Cross-Cluster Seamless Access in Multicluster Istio Service Mesh

Explore how to effectively implement cross-cluster seamless access in the Istio multicluster mesh using SPIRE federation, DNS proxy, and east-west gateway technologies. This guide provides detailed configuration examples and steps to help you overcome deployment challenges and ensure efficient, secure communication between services.

Copyright
This is an original article by Jimmy Song. You may repost it, but please credit this source: https://jimmysong.io/en/blog/seamless-cross-cluster-access-istio/

Introduction

As enterprise information systems increasingly adopt microservices architecture, achieving efficient and secure cross-cluster access to services in a multicluster environment has become a crucial challenge. Istio, as a popular service mesh solution, offers a wealth of features to support seamless inter-cluster service connections.

There are several challenges when deploying and using a multicluster service mesh:

  • Cross-cluster service registration, discovery, and routing
  • Identity recognition and authentication between clusters

This article will delve into how to achieve seamless cross-cluster access in a multicluster Istio deployment by implementing SPIRE federation and exposing services via east-west gateways. Through a series of configuration and deployment examples, this article aims to provide readers with a clear guide to understanding and addressing common issues and challenges in multicluster service mesh deployments.

Istio Deployment Models

The Istio documentation divides various deployment models based on clusters, networks, control planes, meshes, trust domains, and tenants.

This article focuses on the hybrid deployment model of multi-cloud + multi-mesh + multi-control plane + multi-trust domain. This is a relatively complex scenario. If you can successfully deploy this model, then other scenarios should also be manageable.

FQDN in Multicluster Istio Service Mesh

For services across different meshes to access each other, they must know each other’s Fully Qualified Domain Name (FQDN). An FQDN typically consists of the service name, namespace, and cluster domain suffix (e.g., svc.cluster.local). In Istio’s multi-cloud or multi-mesh setup, service routing and access are controlled through mechanisms such as ServiceEntry, VirtualService, and Gateway configurations, rather than by altering the FQDN.

The FQDN in a multi-cloud service mesh remains the same as in a single cluster, usually following the format:

<service-name>.<namespace>.svc.cluster.local

You might wonder whether meshID could be used to distinguish meshes. The meshID is mainly used to differentiate and manage multiple Istio meshes within the same environment or across environments; it is not used to construct service FQDNs.

Main roles of meshID:
  • Mesh-level telemetry data aggregation: Differentiates data from different meshes, allowing for monitoring and analysis on a unified platform.
  • Mesh federation: Establishes federation among meshes, allowing for sharing some configurations and services.
  • Cross-mesh policy implementation: Identifies and applies mesh-specific policies, such as security policies and access control.

Cross-Cluster Service Registration, Discovery, and Routing

In the Istio multi-mesh environment, the East-West Gateway plays a key role. It not only handles ingress and egress traffic between meshes but also supports service discovery and connectivity. When one cluster needs to access a service in another cluster, it routes to the target service through this gateway.

The diagram below shows the process of service registration, discovery, and routing across clusters.

In the configuration of Istio multi-mesh, the processes of service registration, discovery, and routing are crucial as they ensure that services in different clusters can discover and communicate with each other. Here are the basic steps in service registration, discovery, and routing in the Istio multi-mesh environment:

1. Service Registration

In each Kubernetes cluster, when a service is deployed, its details are registered with the Kubernetes API Server. This includes the service name, labels, selectors, ports, etc.
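
For illustration, a minimal Service manifest of the kind that gets registered might look like the following sketch; the names and port match the helloworld sample used later in this article:

apiVersion: v1
kind: Service
metadata:
  name: helloworld
  namespace: helloworld
  labels:
    app: helloworld
spec:
  selector:
    app: helloworld
  ports:
  - name: http
    port: 5000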

2. Sync to Istiod

Istiod, serving as the control plane, is responsible for monitoring changes in the status of the Kubernetes API Server. Whenever a new service is registered or an existing service is updated, Istiod automatically detects these changes. Istiod then extracts the necessary service information and builds internal configurations of services and endpoints.

3. Cross-Cluster Service Discovery

To enable a service in one cluster to discover and communicate with a service in another cluster, Istiod needs to synchronize service endpoint information across all relevant clusters. This is usually achieved in one of two ways:

  • DNS Resolution: Istio can be configured to use CoreDNS or a similar service to return cross-cluster service endpoints in DNS queries. When a service tries to resolve another cluster’s service, the DNS query returns the IP addresses of the accessible remote service. In this article, we enable Istio’s DNS proxy to achieve cross-cluster service discovery. If a service exists both locally and remotely, a local DNS query returns only the local service’s ClusterIP. If the service exists only in a remote cluster, the DNS query returns the IP address of the East-West Gateway’s load balancer in the remote cluster, which can also be used for cross-cluster failover.
  • Service Entry Synchronization: By setting specific ServiceEntry configurations, an Envoy proxy in one cluster knows how to find and route to a service in another cluster through the East-West Gateway.
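
For illustration only, a hand-written ServiceEntry of this kind might look like the sketch below: it tells local proxies to reach a remote-only service through cluster-2’s East-West Gateway. The gateway address here is an assumption, and in this article’s setup Istiod builds the equivalent configuration automatically from the remote cluster’s credentials, so you do not need to create it yourself.

apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: helloworld-remote
  namespace: helloworld
spec:
  hosts:
  - helloworld.helloworld.svc.cluster.local
  location: MESH_INTERNAL
  ports:
  - number: 5000
    name: http
    protocol: HTTP
  resolution: STATIC
  endpoints:
  - address: 34.136.67.85   # cluster-2 East-West Gateway load balancer (illustrative)
    ports:
      http: 15443           # the gateway's mTLS auto-passthrough port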

4. Routing and Load Balancing

When Service A needs to communicate with Service B, its Envoy proxy first resolves Service B’s name to an IP address; for a service that exists only in another cluster, this is the load balancer address of the East-West Gateway in Service B’s cluster. The East-West Gateway then routes the request to the target service. Envoy proxies select the best service instance for each request according to the configured load-balancing strategy (e.g., round-robin, least connections).
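
As a brief example, the load-balancing strategy can be set per service with a DestinationRule. This is a minimal sketch; LEAST_REQUEST is one of Istio’s supported simple policies:

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: helloworld-lb
  namespace: helloworld
spec:
  host: helloworld.helloworld.svc.cluster.local
  trafficPolicy:
    loadBalancer:
      simple: LEAST_REQUEST   # alternatives include ROUND_ROBIN and RANDOM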

5. Traffic Management

Istio offers a rich set of traffic management features, such as request routing, fault injection, and traffic mirroring. These rules are defined in the Istio control plane and pushed to the various Envoy proxies for execution. This allows for flexible control and optimization of communication between services in a cross-cluster environment.
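
For example, a VirtualService can split traffic between two versions of a service. This minimal sketch assumes that v1 and v2 subsets are defined in a DestinationRule:

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: helloworld-split
  namespace: helloworld
spec:
  hosts:
  - helloworld.helloworld.svc.cluster.local
  http:
  - route:
    - destination:
        host: helloworld.helloworld.svc.cluster.local
        subset: v1    # subsets assumed defined in a DestinationRule
      weight: 90
    - destination:
        host: helloworld.helloworld.svc.cluster.local
        subset: v2
      weight: 10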

Identity Recognition and Authentication Between Clusters

When services running in different clusters need to communicate with each other, correct identity authentication and authorization are key to ensuring service security. Using SPIFFE helps to identify and verify the identities of services, but in a multi-cloud environment, these identities need to be unique and verifiable.

To this end, we will set up SPIRE federation to assign identities to services across multiple clusters and achieve cross-cluster identity authentication:

  • Using SPIFFE to Identify Service Identities: Under the SPIFFE framework, each workload is assigned a unique identifier of the form spiffe://<trust-domain>/ns/<namespace>/sa/<service-account>. In a multi-cloud environment, giving each cluster its own trust domain ensures that identities remain unique. For example, spiffe://foo.com/ns/default/sa/service1 and spiffe://bar.com/ns/default/sa/service1 differentiate same-named services in different clusters.
  • Using SPIRE Federation to Manage Inter-Cluster Certificates: This enhances the security of the multi-cloud service mesh. SPIRE (SPIFFE Runtime Environment) offers a highly configurable platform for service identity verification and certificate issuance. When using SPIRE federation, cross-cluster service authentication can be achieved by creating a Trust Bundle for each SPIRE cluster.

Here are the steps for implementing SPIRE federation.

1. Configuring Trust Domain

Each cluster is configured as a separate trust domain, so each service within a cluster has a unique SPIFFE ID based on its trust domain. For instance, a service in cluster 1 might have the ID spiffe://cluster1/ns/default/sa/service1, while the same service in cluster 2 would be spiffe://cluster2/ns/default/sa/service1.

2. Establishing Trust Bundle

Configure trust relationships in SPIRE to allow nodes and workloads from different trust domains to mutually verify each other. This involves exchanging and accepting each other’s CA certificates or JWT keys between trust domains, ensuring the security of cross-cluster communication.
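
In the SPIRE Server configuration, this is expressed with a federation block. The sketch below shows the shape of such a configuration for cluster-1; the bundle endpoint address is a placeholder, and the exact values used in this article are in the GitHub repository:

server {
  trust_domain = "foo.com"
  federation {
    bundle_endpoint {
      address = "0.0.0.0"
      port    = 8443
    }
    federates_with "bar.com" {
      bundle_endpoint_url = "https://<cluster-2-bundle-endpoint>:8443"   # placeholder address
      bundle_endpoint_profile "https_spiffe" {
        endpoint_spiffe_id = "spiffe://bar.com/spire/server"
      }
    }
  }
}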

3. Configuring SPIRE Server and Agent

Deploy a SPIRE Server and SPIRE Agent in each cluster. The SPIRE Server is responsible for managing the issuance and renewal of certificates, while the SPIRE Agent handles the secure distribution of certificates and keys to services within the cluster.

Compatibility Issues with Workload Registration when Using SPIRE Federation in Istio
In this article, we use the legacy Kubernetes Workload Registrar in the SPIRE Server to handle workload registration within the cluster. The Kubernetes Workload Registrar has been deprecated since SPIRE v1.5.4 in favor of the SPIRE Controller Manager, which, in my testing, does not work well with Istio.

4. Using the Workload API

Services can request and renew their identity certificates through SPIRE’s Workload API. This way, services can continuously verify their identities and communicate securely with other services, even when operating in different clusters. We will configure the proxies in the Istio mesh to mount the SPIRE Agent’s Unix Domain Socket, giving them access to the Workload API for certificate management.
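
A hedged sketch of the relevant IstioOperator fragment: a custom sidecar injection template mounts the SPIRE Agent’s socket directory into istio-proxy so Envoy can reach the Workload API over the SPIFFE UDS. The paths and template name here are assumptions; the exact configuration is in the GitHub repository.

values:
  sidecarInjectorWebhook:
    templates:
      spire: |
        spec:
          containers:
          - name: istio-proxy
            volumeMounts:
            - name: workload-socket
              mountPath: /run/secrets/workload-spiffe-uds
              readOnly: true
          volumes:
          - name: workload-socket
            hostPath:
              path: /run/spire/sockets   # shared with the SPIRE Agent DaemonSet
              type: Directory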

5. Automating Certificate Rotation

We will use cert-manager as SPIRE’s UpstreamAuthority to configure automatic rotation of service certificates and keys, enhancing the system’s security. With automated rotation, even if certificates are leaked, attackers can only use these certificates for a short period.
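
In the SPIRE Server configuration this uses the cert-manager UpstreamAuthority plugin. A minimal sketch, where the issuer name and namespace are assumptions:

UpstreamAuthority "cert-manager" {
  plugin_data {
    issuer_name  = "spire-ca"          # assumed cert-manager ClusterIssuer name
    issuer_kind  = "ClusterIssuer"
    issuer_group = "cert-manager.io"
    namespace    = "cert-manager"
  }
}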

These steps allow you to establish a cross-cluster, secure service identity verification framework, enabling services in different clusters to securely recognize and communicate with each other, effectively reducing security risks and simplifying certificate management. This configuration not only enhances security but also improves the system’s scalability and flexibility through distributed trust domains.

Multicluster Deployment

The diagram below shows the deployment model for Istio multi-cloud and SPIRE federation.

Figure: Multicloud Mesh Deployment Model

Below, I will demonstrate how to achieve seamless cross-cluster access in a multi-cloud Istio mesh.

  1. Create two Kubernetes clusters in GKE, named cluster-1 and cluster-2.
  2. Deploy SPIRE and set up federation in both clusters.
  3. Install Istio in both clusters, taking care to configure the trust domain, east-west gateway, ingress gateway, the sidecarInjectorWebhook template that mounts the SPIRE Agent’s SPIFFE workload-socket UDS, and the DNS proxy.
  4. Deploy test applications and verify seamless cross-cluster access.

The versions of the components we deployed are as follows:

  • Kubernetes: v1.29.4
  • Istio: v1.22.1
  • SPIRE: v1.5.1
  • cert-manager: v1.15.1

I have saved all commands and step-by-step instructions on GitHub: rootsongjc/istio-multi-cluster. You can follow the instructions in that project. Below are explanations of the main steps.

1. Preparing Kubernetes Clusters

Open Google Cloud Shell or your local terminal, and make sure you have installed the gcloud CLI. Use the following commands to create two clusters:

gcloud container clusters create cluster-1 --zone us-central1-a --num-nodes 3
gcloud container clusters create cluster-2 --zone us-central1-b --num-nodes 3
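
The later commands assume kubeconfig contexts named cluster-1 and cluster-2. One way to set these up (GKE context names follow the gke_<project>_<zone>_<cluster> pattern; PROJECT is assumed to hold your project ID):

gcloud container clusters get-credentials cluster-1 --zone us-central1-a
gcloud container clusters get-credentials cluster-2 --zone us-central1-b
kubectl config rename-context "gke_${PROJECT}_us-central1-a_cluster-1" cluster-1
kubectl config rename-context "gke_${PROJECT}_us-central1-b_cluster-2" cluster-2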

2. Deploying cert-manager

Use cert-manager as the root CA to issue certificates for istiod and SPIRE.

./cert-manager/install-cert-manager.sh

3. Deploying SPIRE Federation

Basic information for SPIRE federation is as follows:

Cluster Alias    Trust Domain
cluster-1        foo.com
cluster-2        bar.com

Note: The trust domain does not need to be a resolvable DNS name, but it must match the trust domain set in the IstioOperator configuration.

Execute the following command to deploy SPIRE federation:

./spire/install-spire.sh

For details on managing identities in Istio using SPIRE, refer to Managing Certificates in Istio with cert-manager and SPIRE.

4. Installing Istio

We will use IstioOperator to install Istio, configuring each cluster with:

  • Automatic Sidecar Injection
  • Ingress Gateway
  • East-West Gateway
  • DNS Proxy
  • SPIRE Integration
  • Access to remote Kubernetes cluster secrets
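
For reference, here is a trimmed IstioOperator fragment for cluster-1 showing how a few of these items are commonly expressed. The meshID and network values are assumptions; the complete configuration is in the GitHub repository.

apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    trustDomain: foo.com                 # must match the SPIRE trust domain
    defaultConfig:
      proxyMetadata:
        ISTIO_META_DNS_CAPTURE: "true"   # enable the sidecar DNS proxy
  values:
    global:
      meshID: mesh1
      multiCluster:
        clusterName: cluster-1
      network: network-1

Access to the remote cluster’s secrets is typically granted with istioctl create-remote-secret, for example:

istioctl create-remote-secret --context=cluster-2 --name=cluster-2 | \
	kubectl apply -f - --context=cluster-1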

Execute the following command to install Istio:

istio/install-istio.sh

Verifying Traffic Federation

To verify the correctness of the multi-cloud installation, we will deploy different versions of the helloworld application in both clusters and then access the helloworld service from cluster-1 to test the following cross-cluster access scenarios:

  1. East-West Traffic Federation: Cross-cluster service redundancy
  2. East-West Traffic Federation: Handling non-local target services
  3. North-South Traffic Federation: Accessing services via a remote ingress gateway

Execute the following command to deploy the helloworld application in both clusters:

./examples/deploy-helloworld.sh

East-West Traffic Federation: Cross-Cluster Service Redundancy

After deploying the helloworld application, access the helloworld service from the sleep pod in cluster-1:

kubectl exec --context=cluster-1 -n sleep deployment/sleep -c sleep \
	-- sh -c "while :; do curl -sS helloworld.helloworld:5000/hello; sleep 1; done"

The diagram below shows the deployment architecture and traffic routing path for this scenario.

Figure: East-West Traffic Federation: Cross-Cluster Service Redundancy

Responses coming from both helloworld-v1 and helloworld-v2 indicate that cross-cluster service redundancy is working.

Verifying DNS

At this point, since the helloworld service exists both locally and in the remote cluster, if you query the DNS name of the helloworld service in cluster-1:

kubectl exec -it deploy/sleep --context=cluster-1 -n sleep -- nslookup helloworld.helloworld.svc.cluster.local

You will get the ClusterIP of the helloworld service in cluster-1.

Verifying Traffic Routing

Next, we will verify the cross-cluster traffic routing path by examining the Envoy proxy configuration.

View the endpoints of the helloworld service in cluster-1:

istioctl proxy-config endpoints deployment/sleep.sleep --context=cluster-1 --cluster "outbound|5000||helloworld.helloworld.svc.cluster.local"

You will see output similar to the following:

ENDPOINT               STATUS      OUTLIER CHECK     CLUSTER
10.76.3.22:5000        HEALTHY     OK                outbound|5000||helloworld.helloworld.svc.cluster.local
34.136.67.85:15443     HEALTHY     OK                outbound|5000||helloworld.helloworld.svc.cluster.local

Of these two endpoints, the first is the helloworld endpoint in cluster-1, and the second is the load balancer address of the istio-eastwestgateway service in cluster-2. Istio uses SNI to set up cross-cluster TLS connections; in cluster-2, the target service is identified by that SNI.

Execute the following command to query the endpoint in cluster-2 based on the previous SNI:

istioctl proxy-config endpoints deploy/istio-eastwestgateway.istio-system --context=cluster-2 --cluster "outbound_.5000_._.helloworld.helloworld.svc.cluster.local"

You will get output similar to the following:

ENDPOINT           STATUS      OUTLIER CHECK     CLUSTER
10.88.2.4:5000     HEALTHY     OK                outbound_.5000_._.helloworld.helloworld.svc.cluster.local

This is the endpoint of the helloworld service in cluster-2.

Through the steps above, you should understand the traffic path for cross-cluster redundant services. Next, we will remove the helloworld service from cluster-1; failover happens automatically, with no Istio configuration changes required.

East-West Traffic Federation: Failover

Execute the following command to scale down the replicas of helloworld-v1 in cluster-1 to 0:

kubectl -n helloworld scale deploy helloworld-v1 --context=cluster-1 --replicas 0

Access the helloworld service again from cluster-1:

kubectl exec --context=cluster-1 -n sleep deployment/sleep -c sleep \
	-- sh -c "while :; do curl -sS helloworld.helloworld:5000/hello; sleep 1; done"

You will still receive responses from helloworld-v2.

Now, directly delete the helloworld service in cluster-1:

kubectl delete service helloworld -n helloworld --context=cluster-1

You will still receive responses from helloworld-v2, indicating that cross-cluster failover is effective.

The diagram below shows the traffic path for this scenario.

Figure: East-West Traffic Federation: Failover

Verifying DNS

At this point, the helloworld service exists only in the remote cluster. If you query the DNS name of the helloworld service in cluster-1:

kubectl exec -it deploy/sleep --context=cluster-1 -n sleep -- nslookup helloworld.helloworld.svc.cluster.local

You will get the address of the East-West Gateway in cluster-2, which receives the cross-cluster traffic on port 15443.

North-South Traffic Federation: Accessing Services via Remote Ingress Gateway

Accessing services in a remote cluster through the ingress gateway is the most traditional way of cross-cluster access. The diagram below shows the traffic path for this scenario.

Figure: North-South Traffic Federation: Accessing Services via Remote Ingress Gateway

Execute the following command to create a Gateway and VirtualService in cluster-2:

kubectl apply --context=cluster-2 \
	-f ./examples/helloworld-gateway.yaml -n helloworld
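
The applied manifest likely resembles Istio’s standard helloworld sample; here is a sketch for reference (the exact file is in the repository):

apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: helloworld-gateway
spec:
  selector:
    istio: ingressgateway
  servers:
  - port:
      number: 80
      name: http
      protocol: HTTP
    hosts:
    - "*"
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: helloworld
spec:
  hosts:
  - "*"
  gateways:
  - helloworld-gateway
  http:
  - match:
    - uri:
        exact: /hello
    route:
    - destination:
        host: helloworld
        port:
          number: 5000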

Get the address of the ingress gateway in cluster-2:

GATEWAY_URL=$(kubectl -n istio-ingress --context=cluster-2 get service istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')

Execute the following validation to access the service via the remote ingress gateway:

kubectl exec --context="${CTX_CLUSTER1}" -n sleep deployment/sleep -c sleep \
	-- sh -c "while :; do curl -s http://$GATEWAY_URL/hello; sleep 1; done"

You will receive responses from helloworld-v2.

Verifying Identity

Execute the following command to obtain the certificate from the sleep pod in the cluster-1 cluster:

istioctl proxy-config secret deployment/sleep -o json --context=cluster-1 | jq -r '.dynamicActiveSecrets[0].secret.tlsCertificate.certificateChain.inlineBytes' | base64 --decode > chain.pem
split -p "-----BEGIN CERTIFICATE-----" chain.pem cert-
openssl x509 -noout -text -in cert-ab
openssl x509 -noout -text -in cert-aa

If you see the following fields in the output message, it indicates that the identity assignment is correct:

Subject: C=US, O=SPIFFE

URI:spiffe://foo.com/ns/sample/sa/sleep

View the identity information in SPIRE:

kubectl --context=cluster-1 exec -i -t -n spire spire-server-0 -c spire-server \
	-- ./bin/spire-server entry show -socketPath /run/spire/sockets/server.sock -spiffeID spiffe://foo.com/ns/sleep/sa/sleep

You will see output similar to the following:

Found 1 entry
Entry ID         : 9b09080d-3b67-44c2-a5b8-63c42ee03a3a
SPIFFE ID        : spiffe://foo.com/ns/sleep/sa/sleep
Parent ID        : spiffe://foo.com/k8s-workload-registrar/cluster-1/node/gke-cluster-1-default-pool-18d66649-z1lm
Revision         : 1
X509-SVID TTL    : default
JWT-SVID TTL     : default
Selector         : k8s:node-name:gke-cluster-1-default-pool-18d66649-z1lm
Selector         : k8s:ns:sleep
Selector         : k8s:pod-uid:6800aca8-7627-4a30-ba30-5f9bdb5acdb2
FederatesWith    : bar.com
DNS name         : sleep-86bfc4d596-rgdkf
DNS name         : sleep.sleep.svc

Recommendations for Production Environments

For production environments, it is recommended to use a unified gateway built on a two-tier architecture: global traffic routing is configured at the Tier-1 edge gateway, and this configuration is translated into Istio configuration and distributed to the ingress gateways of the Tier-2 clusters.

The diagram below shows the deployment of an Istio service mesh using SPIRE federation and a two-tier architecture with TSB.

Figure: Deployment of a Multicluster Istio Service Mesh with SPIRE Federation and a Two-Tier Architecture Using TSB

We divide these four Kubernetes clusters into one Tier-1 cluster (tier1) and three Tier-2 clusters (cp-cluster-1, cp-cluster-2, and cp-cluster-3). An edge gateway is installed in the Tier-1 cluster, while the bookinfo and httpbin applications are installed in the Tier-2 clusters. Each cluster has an independent trust domain, and together these clusters form a SPIRE federation.

The diagram below shows the traffic routing for users accessing bookinfo and httpbin services through the ingress gateway.

Figure: Unified Gateway Architecture

This requires a logical abstraction layer above Istio that is suited to multi-cloud deployments. For details on the unified gateway in TSB, refer to the TSB documentation.

Summary

This article has detailed the key technologies and methods for implementing service identity verification, DNS resolution, and cross-cluster traffic management in an Istio multi-cloud mesh environment. By precisely configuring Istio and SPIRE federation, we have not only enhanced the system’s security but also improved the efficiency and reliability of inter-service communication. Following these steps, you will be able to build a robust, scalable multi-cloud service mesh to meet the complex needs of modern applications.

This blog was initially published at tetrate.io.

Last updated on Nov 22, 2024