DevOps approach to infrastructure, security, and scalability
We have re-engineered the LoginRadius CIAM platform to meet enterprise customers’ growing needs. In this process, the DevOps team’s relentless efforts have pushed the limits and enhanced infrastructure, security, and scalability.
In the last couple of years, we re-engineered our CIAM platform to consistently deliver enterprise-grade scalability, throughput, availability, and stability. This was much needed as some of our largest customers have been serving massive user bases for whom seamless and responsive experiences were paramount.
In this transition, the DevOps team has pushed boundaries to improve the infrastructure, availability, and scalability, supporting our platform re-engineering efforts and thus delivering a highly performant CIAM platform.
Here, I write about the DevOps team’s journey, invaluable contributions, and memorable achievements.
There are no two ways about it: we don’t like downtime, and avoiding it is our priority. That means being technically strong in DevOps fundamentals while incorporating cutting-edge technologies and creative engineering.
Firstly, our DevOps team has prioritized upgrades with zero downtime, as this approach benefits our customers immediately as well as in the long term.
We use Kubernetes extensively to deploy, manage, and orchestrate our application and container infrastructure. New Kubernetes versions are released roughly every four months with security and performance improvements. To keep pace, our team devised an upgrade procedure with robust automation, which lets us upgrade our Kubernetes clusters to the latest version with zero downtime.
Secondly, our biggest customers have had some events with unpredictably heavy application loads. Because they told us about these events in advance and relied on us to deliver seamless scalability and throughput, we worked closely with them to provide zero-downtime elastic scalability while keeping costs under control.
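To make the idea of elastic scaling concrete, here is a minimal sketch in Go of the replica calculation documented for the Kubernetes Horizontal Pod Autoscaler. The article does not say which autoscaling mechanism the team uses, so treat this as an illustration only. The function name, the RPS figures, and the replica bounds are assumptions, and the real autoscaler also applies a tolerance band that is left out here.

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas applies the Horizontal Pod Autoscaler rule
// ceil(currentReplicas * currentMetric / targetMetric) and clamps the
// result to [minReplicas, maxReplicas]. All parameter names are illustrative.
func desiredReplicas(current int, currentMetric, targetMetric float64, minReplicas, maxReplicas int) int {
	n := int(math.Ceil(float64(current) * currentMetric / targetMetric))
	if n < minReplicas {
		return minReplicas
	}
	if n > maxReplicas {
		return maxReplicas
	}
	return n
}

func main() {
	// Hypothetical numbers: 10 pods each seeing 900 RPS against a 500 RPS target.
	fmt.Println(desiredReplicas(10, 900, 500, 2, 50))
}
```

The upper bound is what keeps cost in check during a spike, while the lower bound preserves headroom once the event is over.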
The team’s efforts are supported by our earlier rebuilding of APIs in Golang, about which our Lead Architect Vijay Singh has written thoroughly: Why We Re-engineered LoginRadius APIs with Go?
Performance upgrades and zero-downtime efforts are easily perceivable; however, we know that security efforts should be second to none.
While our customers want us to deliver cutting-edge performance for their identity use cases, we earn their trust with robust security and data compliance measures.
Firstly, as a CIAM platform provider, it’s common for our endpoints or our customers’ endpoints to receive malicious traffic. To improve malicious IP address blocking, we have automated blocking at the proxy level based on real-time analysis of factors such as HTTP response codes and each IP address’s malicious score. This blocks bad actors very early, before they can degrade API and infrastructure performance.
Secondly, we have thoroughly reviewed the security posture of various multi-cloud platforms and services. We have moved away from services that didn’t meet our security levels and incorporated much more secure and robust services.
All these efforts have helped us successfully complete third-party penetration testing and provide compliance reporting for ISO standards and SOC 2 with no shortcomings.
New Infrastructure for Disaster Recovery
As our application loads grew, we soon realized that relying on Kubernetes for failover had some downsides. Keeping the control plane up to date became tedious, and the team discovered that API degradation in one region was affecting traffic in another region, because the second region served as the failover for the first.
After much discussion and research, we decided to build a completely new DR infrastructure on AWS Elastic Container Service that fully isolates region-specific traffic and degradation. With this approach, the team re-architected the DR setup at minimal cost and achieved better resilience.
Also, we have successfully completed the yearly disaster recovery execution, achieving an impressive 30% improvement in the time it takes to restore various components of the architecture.
Through careful planning and execution, we have streamlined the disaster recovery process, ensuring critical systems can be brought back online as quickly as possible when extreme events occur.
Further DevOps Improvements
We have worked on many other processes and objectives, including:
Incident Management Handbook: We recently introduced a thoroughly improved incident management handbook that provides comprehensive guidance for various actions based on alerts.
The handbook serves as a single point of reference and contains detailed information on handling specific alerts. The team’s proactive efforts in creating the handbook have significantly reduced the time taken to onboard new site reliability engineering (SRE) members, from a few weeks to just a couple of days.
The handbook has proven to be a valuable resource for the team, providing the necessary information and tools to manage and mitigate incidents effectively.
Fully Automated Custom Domain/SSL Pipeline: Our existing custom domain pipeline had code complexities and could only create or renew certificates every six hours. In the old architecture, this led to certificate synchronization issues across proxy servers, and a failed sync was hard to detect.
In response to this challenge, we architected and implemented a fully automated custom domain and SSL pipeline from the ground up. The new system has significantly reduced the time it takes to create or renew certificates, bringing it down to just 15 minutes. With the new pipeline, certificates synchronize automatically across all proxy servers instantly. Additionally, an alerting system has been implemented to notify us of any sync failures.
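One simple way to detect a failed sync is to compare a fingerprint of the certificate each proxy serves against the certificate that was just issued. The Go sketch below illustrates that idea under stated assumptions. The proxy names, the placeholder certificate bytes, and the outOfSync helper are hypothetical, and the actual pipeline may detect failures differently.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// fingerprint returns the SHA-256 hex digest of a certificate's raw bytes.
func fingerprint(cert []byte) string {
	sum := sha256.Sum256(cert)
	return hex.EncodeToString(sum[:])
}

// outOfSync returns the sorted names of proxies whose served certificate
// differs from the expected one. served maps a (hypothetical) proxy name
// to the certificate bytes that proxy currently serves.
func outOfSync(expected []byte, served map[string][]byte) []string {
	want := fingerprint(expected)
	var stale []string
	for proxy, cert := range served {
		if fingerprint(cert) != want {
			stale = append(stale, proxy)
		}
	}
	sort.Strings(stale)
	return stale
}

func main() {
	// Placeholder byte strings stand in for real PEM-encoded certificates.
	issued := []byte("cert-v2")
	served := map[string][]byte{
		"proxy-a": []byte("cert-v2"),
		"proxy-b": []byte("cert-v1"),
	}
	if stale := outOfSync(issued, served); len(stale) > 0 {
		fmt.Println("ALERT: certificate sync failed on", stale)
	}
}
```

Running a check like this after every issuance or renewal turns a silent mismatch into an explicit alert.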
The DevOps team’s efforts have been monumental in supporting our re-engineering work. Together, they have given us competitive advantages in performance, reliability, and cost-effectiveness that our customers value, and they have helped us scale our platform to support 100k RPS (requests per second), with room to go beyond that if the need arises.
Overall, the LoginRadius DevOps team’s efforts and achievements have exceeded expectations, resulting in a highly performant CIAM platform that meets customers’ demands for scalability, stability, and security.
Originally published on LinkedIn