Accessing an AWS ElastiCache Cluster from Outside AWS
This week, I needed to access an AWS ElastiCache cluster from outside AWS. This is a problem because, as AWS says in their docs, "the service is designed to be accessed exclusively from within AWS." That's because it's recommended (and rightly so) to keep your cache as close to your application as possible, so AWS assumes that if you are using ElastiCache, your application is also deployed on AWS. But you know what they say about assuming. Besides the fact that Redis is more than just an in-memory cache and can be used as a database or message broker, I can think of several exceptions to this assumption. For example, maybe an application is being moved to AWS one piece at a time and temporarily needs access to the new cloud resources until it is fully deployed on AWS. Maybe an application is hosted on another cloud provider that doesn't offer a managed Redis service, and the engineering team doesn't want to host and manage Redis themselves. Or maybe an engineering team was using a third-party Redis provider that had too many outages and didn't provide enough observability into the cluster for them to rely on. I could keep going, but I won't. The fact of the matter is that there are valid use cases where one needs to access an ElastiCache cluster from outside AWS.
To work around this limitation, AWS recommends launching a NAT instance into the VPC and using iptables to forward traffic to the cluster. Another EC2-based option would be forwarding traffic over SSH. But ElastiCache is a managed service, and launching your own NAT instance adds maintenance work, complexity, and cost, not to mention a lack of redundancy and scalability. When I encountered this problem, I thought of two production-ready solutions that could be employed short-term or long-term. The first is to use a VPN like Tailscale (or any VPN, for that matter) with a subnet router to reach the cluster without opening any additional ports in security groups to allow access from outside the VPC. However, it isn't always possible to introduce a new VPN technology to solve a problem like this, so I want to write about the second solution, which I came up with using AWS-native technologies.
An AWS Native Solution
When trying to get traffic from outside a VPC to a particular service inside a VPC, my first thought is always a load balancer. I have used Network Load Balancers (NLBs) in the past and figured one would fit this use case almost perfectly. I set up an NLB, created an IP-based Target Group, and pointed it at the primary node in the Redis cluster. All that was left was to set up a Security Group on the NLB to allow access from the external IP addresses and a Security Group on the Redis cluster to allow traffic from the NLB, and voilà, we had external access to the cluster. I realize we are essentially using the NLB as a reverse proxy, and that we could have hosted our own reverse proxy like HAProxy or NGINX on an EC2 instance, but this way AWS maintains the infrastructure for us.
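For anyone who prefers to see the steps spelled out, here is a minimal boto3 sketch of that setup. The VPC, subnet, and security group IDs and the primary node's IP are placeholders, and it skips the security group rules themselves, so treat it as an outline of the steps rather than a drop-in script.

```python
import boto3

elbv2 = boto3.client("elbv2")

# Placeholder identifiers; substitute your own VPC, subnets, security group,
# and the current IP of the Redis primary node.
VPC_ID = "vpc-0123456789abcdef0"
SUBNET_IDS = ["subnet-aaaa1111", "subnet-bbbb2222"]
NLB_SECURITY_GROUP = "sg-0123456789abcdef0"
REDIS_PRIMARY_IP = "10.0.1.25"
REDIS_PORT = 6379

# 1. Create an internet-facing NLB in the cluster's VPC.
nlb = elbv2.create_load_balancer(
    Name="redis-external-nlb",
    Type="network",
    Scheme="internet-facing",
    Subnets=SUBNET_IDS,
    SecurityGroups=[NLB_SECURITY_GROUP],
)["LoadBalancers"][0]

# 2. Create an IP-based target group on the Redis port.
target_group = elbv2.create_target_group(
    Name="redis-primary",
    Protocol="TCP",
    Port=REDIS_PORT,
    VpcId=VPC_ID,
    TargetType="ip",
)["TargetGroups"][0]

# 3. Register the primary node's current IP as the only target.
elbv2.register_targets(
    TargetGroupArn=target_group["TargetGroupArn"],
    Targets=[{"Id": REDIS_PRIMARY_IP, "Port": REDIS_PORT}],
)

# 4. Add a TCP listener that forwards to the target group.
elbv2.create_listener(
    LoadBalancerArn=nlb["LoadBalancerArn"],
    Protocol="TCP",
    Port=REDIS_PORT,
    DefaultActions=[
        {"Type": "forward", "TargetGroupArn": target_group["TargetGroupArn"]}
    ],
)
```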
There is just one small problem with this solution: an NLB target group can only point at IP addresses or EC2 instances, not at a DNS name, and since ElastiCache nodes aren't instances you control, IPs are the only option. What happens when there is a failover event and the IP you entered for the Target Group no longer belongs to the primary node that should be receiving write operations? We need to be able to forward traffic from the NLB to the cluster's primary endpoint, which is a domain name. Unfortunately, that is not something AWS supports out of the box.
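You can see the problem by resolving the cluster's primary endpoint yourself (the hostname below is made up). The endpoint name stays stable across failovers, but the IP it resolves to does not, which is exactly what a static IP target can't track:

```python
import socket

# Hypothetical primary endpoint; substitute your cluster's primary endpoint name.
PRIMARY_ENDPOINT = "my-redis.abc123.ng.0001.use1.cache.amazonaws.com"

# After a failover this returns a different IP even though the hostname is unchanged,
# so the IP registered in the target group silently goes stale.
print(socket.gethostbyname(PRIMARY_ENDPOINT))
```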
Hostname as Target for Network Load Balancers
Lucky for me, I stumbled upon an AWS blog post that provides a solution to this problem. I'll let you read through the article for the details, but the TL;DR is that a Lambda function running on a schedule can resolve the DNS name itself and update the target group accordingly. The blog post I linked even has a CloudFormation template ready to launch, although I did add two parameters to the template before using it. The first was a name for the Lambda event rule: those event rule names must be unique, so without parameterizing the template it couldn't be deployed multiple times to the same account. The second was the schedule expression on the rule, because not every NLB needs the DNS checked every five minutes; some need it more often, some less. I'd also recommend checking the box to preserve resources on stack deployment failure because there are a couple of race conditions in the template. That way, if the stack fails to deploy, you can hit retry until it succeeds (or you can fix the race conditions, but I was lazy).
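The Lambda in the actual template handles more edge cases, but the core idea fits in a few lines. This is a simplified sketch of the approach, not the code from the blog post; the target group ARN and endpoint name are placeholders, and it assumes a single target group fronting the primary node.

```python
import socket
import boto3

# Placeholder values; the real template wires these in as parameters / environment variables.
TARGET_GROUP_ARN = (
    "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/redis-primary/abc123"
)
PRIMARY_ENDPOINT = "my-redis.abc123.ng.0001.use1.cache.amazonaws.com"
REDIS_PORT = 6379

elbv2 = boto3.client("elbv2")


def handler(event, context):
    # Resolve the primary endpoint to its current IP.
    current_ip = socket.gethostbyname(PRIMARY_ENDPOINT)

    # Look up what the target group currently points at.
    health = elbv2.describe_target_health(TargetGroupArn=TARGET_GROUP_ARN)
    registered = {t["Target"]["Id"] for t in health["TargetHealthDescriptions"]}

    if registered == {current_ip}:
        return  # DNS still matches the registered target; nothing to do.

    # Register the new primary first, then drop any stale targets.
    elbv2.register_targets(
        TargetGroupArn=TARGET_GROUP_ARN,
        Targets=[{"Id": current_ip, "Port": REDIS_PORT}],
    )
    stale = registered - {current_ip}
    if stale:
        elbv2.deregister_targets(
            TargetGroupArn=TARGET_GROUP_ARN,
            Targets=[{"Id": ip, "Port": REDIS_PORT} for ip in stale],
        )
```

Registering the new IP before deregistering the old one keeps the window in which the target group has no healthy target as small as possible.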
Conclusion
And that's it! You now have a fully functioning way to access an ElastiCache cluster from outside AWS using an NLB. I decided to write this post because there was hardly any existing information on how to solve this problem; everything that did exist seemed to take the NAT approach. The only mention of doing it with an NLB that I could find was a single Reddit post asking whether it was possible, with no follow-up details or definitive answers. Hopefully this article will help some poor soul in the future who finds themselves in this same rare situation.