
Subramanian Lakshmanan
DevOps Engineer
I am an experienced Senior DevOps Engineer with a strong background in building and maintaining reliable, scalable cloud infrastructure and data platforms. Over the past few years, I have worked with leading organizations such as Visa, Play Games24x7, and ZestMoney, focusing on delivering high-availability systems, automating critical operations, and improving the reliability of large-scale data and application environments. My expertise spans AWS, Kubernetes, Docker, Terraform, Ansible, and CI/CD tools such as Jenkins and GitHub Actions. I have deep experience managing complex systems, optimizing cloud costs, building automation in Python and Shell, and ensuring observability through tools like Prometheus and Grafana. More recently, I have expanded my focus to modern data platform engineering, working with Snowflake, Databricks, Segment, and AWS data services (Glue, Lambda, EMR), and combining DevOps and data reliability practices to support scalable data-driven applications. I am passionate about automation, system reliability, proactive monitoring, and continuously driving improvements in infrastructure, operations, and data systems.
Careers
Senior Site Reliability Engineer
Visa Inc.
- Lead reliability initiatives for Visa’s global Data Platform by applying SRE principles across Snowflake, Databricks, Segment, and AWS-native services (S3, Glue, EMR, Lambda).
- Provide Level 3, 24x7 on-call support for critical applications hosted in Linux, Docker, and Kubernetes environments, ensuring high availability and system reliability, including data applications and pipelines running on Databricks Workflows and Airflow.
- Build and maintain scalable CI/CD pipelines using Jenkins and GitHub Actions, ensuring high-quality, consistent, and traceable workflows for enterprise-grade applications.
- Architect, build, and maintain scalable CI/CD pipelines for data workloads using Jenkins and GitHub Actions, ensuring seamless integration and deployment of ETL/ELT jobs.
- Develop automation scripts (Python, Shell) to monitor data ingestion processes and ETL performance and to automate operational workflows.
- Deploy, configure, and manage AWS infrastructure services using Terraform, enabling consistent, reusable, and scalable infrastructure provisioning.
- Automate key processes using shell scripts and Python to generate log output based on application behavior, ensuring faster resolution of support requests.
- Implement robust monitoring for Data Platform services (Snowflake query performance, Databricks job health, Segment event delivery) using Prometheus, Grafana, and AWS CloudWatch.
- Collaborate closely with Data Engineering and Analytics teams to enable faster data access while ensuring data pipeline reliability, scalability, and security.
- Conduct root cause analyses and implement proactive measures to keep the production environment trouble-free and improve system performance.
- Analyze and prioritize incidents based on criticality and impact, implementing appropriate remediation strategies to prevent recurring issues.
Site Reliability Engineer
Play Games24x7
- Supported critical 24x7 on-call operations for large-scale data ingestion and analytics workloads leveraging Kubernetes, Kafka (MSK), and Snowflake.
- Designed and maintained Terraform modules for data infrastructure provisioning including EKS clusters, RDS, S3, and Glue services.
- Led efforts to migrate batch processing pipelines to Databricks, improving compute efficiency and reducing operational overhead.
- Led initiatives to improve cost visibility and optimize infrastructure expenses using QuickSight, driving financial efficiency across the organization.
- Managed and optimized Jenkins pipelines, automating deployment tasks and responding to alerts to ensure smooth, continuous operations.
- Implemented proactive monitoring for data pipelines using custom Prometheus exporters and Grafana dashboards, improving incident detection and response times.
- Worked with stakeholders to improve real-time event processing through Kafka-to-Snowflake pipelines, ensuring data quality and delivery guarantees.
- Optimized cloud resource utilization and cost for data workloads by analyzing billing patterns in QuickSight and applying right-sizing strategies.
- Automated incident response playbooks for common data pipeline failures to reduce mean time to recovery (MTTR).
DevOps Engineer
ZestMoney
- Managed Kubernetes clusters hosting microservices and data pipelines, optimizing pod resource utilization to ensure reliability, high availability, and performance at scale for Snowflake-based data workloads.
- Led the migration of application and data workloads from ECS to EKS (Elastic Kubernetes Service), enhancing system scalability, reliability, and performance.
- Developed auto-scaling strategies and monitoring solutions, ensuring 99.99% system uptime through proactive incident management and automation.
- Designed and implemented auto-scaling and cost optimization strategies for data processing workloads on AWS (EC2 Spot Instances, Databricks clusters).
- Pioneered AWS cost optimization and governance for Data Platform workloads, maximizing financial efficiency without compromising availability or performance.
- Built and maintained CI/CD pipelines for deploying data ingestion frameworks and analytics applications.
- Proactively diagnosed and resolved high-priority application errors and data pipeline failures, including Snowflake ingestion issues, schema drift, and data delivery delays, applying incident response best practices to maintain system reliability.
- Worked closely with Data Engineers to streamline ingestion pipelines for structured and semi-structured data from sources such as Segment, S3, and APIs.
DevOps Engineer
Pramata Knowledge Solutions
- Migrated a monolithic architecture to microservices using Docker, improving deployment flexibility and resource utilization across Kubernetes clusters.
- Designed and implemented CI/CD pipelines in Jenkins to automate build, test, and deployment processes, significantly reducing deployment errors and lead times.
- Managed Kubernetes environments, ensuring high availability, efficient auto-scaling, and streamlined deployment workflows.
- Conducted regular security audits on cloud infrastructure and containerized applications, enforcing industry best practices (IAM hardening, secrets management, image vulnerability scanning).
- Developed Infrastructure-as-Code templates for consistent resource provisioning using Terraform.
- Led post-incident reviews and blameless retrospectives, driving actionable improvements to minimize recurrence of production incidents.
- Collaborated with QA and Development teams to shift performance and security testing left into the CI/CD pipelines.
- Tuned application and infrastructure monitoring using Prometheus and Grafana, setting up alerting to detect anomalies before they impacted end users.
- Automated recurring operational tasks using Python and Shell scripts, enhancing engineering productivity and operational efficiency.
DevOps Engineer
BlinkIN Technologies
- Managed and scaled GCP-based infrastructure using services like Compute Engine, Cloud Functions, Cloud Storage, and VPC networking.
- Automated infrastructure provisioning and configuration management tasks using Terraform and Deployment Manager templates.
- Developed and maintained Jenkins pipelines for continuous integration and deployment of microservices hosted in Docker containers.
- Implemented centralized logging and monitoring systems with Prometheus, Grafana, and Stackdriver to gain full visibility into application and infrastructure performance.
- Designed dynamic scaling policies using GCP Autoscaler to handle traffic spikes and ensure system reliability during peak usage.
- Optimized Docker container images and registry usage, reducing build and deployment times across multiple environments.
- Participated actively in incident response, root cause analysis, and production troubleshooting, improving system uptime and customer experience.
- Created Shell scripts to automate daily operational tasks, backup management, and log rotations, minimizing manual intervention.
DevOps Intern
PeopleClick Techno Solutions
- Assisted in provisioning and managing AWS resources, ensuring efficient allocation of EC2 instances, RDS, and S3 storage.
- Supported the development of CI/CD pipelines using Jenkins, automating build, test, and deployment processes for faster software delivery.
- Diagnosed and resolved technical issues, optimizing AWS resource usage and reducing operational costs.
- Contributed to the implementation of proactive alerting mechanisms to swiftly detect incidents, maintaining system stability and performance.
Technical Support Engineer
CSS Corp
- Provided technical support to Aruba Networks, a key client, while employed with CSS Corp.
- Quickly assessed customer concerns and provided effective troubleshooting to ensure swift issue resolution.
- Managed, updated, and issued customer license keys, ensuring seamless access to essential services.
- Facilitated the activation of customer Access Points, Controllers, and Switches, ensuring optimal functionality and performance.
Education
Anna University
Mechanical Engineering