PhonePe: Site Reliability Engineer – Big Data

| Category | Details |
| --- | --- |
| Company | PhonePe |
| Job Title | Site Reliability Engineer – Big Data |
| Location | Bangalore |
| Experience | 7–11 years |
| Role Focus | Manage and maintain distributed big data ecosystems; ensure reliability, scalability, and security of large-scale production infrastructure |
| Key Responsibilities | Manage Linux/Unix environments and on-call incident response; design and implement automation for provisioning, scaling, upgrades, and patching of clusters; troubleshoot production issues and perform root cause analysis; optimize system performance, resource usage, and workflows; collaborate with teams on system design and integration; enforce security and SRE best practices; develop operational automation scripts/tools; monitor system health using ELK, Grafana, Prometheus, OpenTelemetry |
| Technical Skills | Linux (IP, iptables, IPsec); scripting: Perl, Golang, Python; Hadoop stack: HDFS, HBase, Airflow, YARN, Ranger, Kafka, Pinot; configuration/deployment: Puppet, Salt, Chef, Ansible; DevOps tools: Docker, Git, SaltStack, Ansible; observability/monitoring: ELK, Grafana, Prometheus, OpenTelemetry; networking, cloud infrastructure (AWS, GCP, Azure – good to have) |
| Scope / Scale | Large-scale big data production clusters, supporting critical business services for millions of users and 330+ million transactions/day |
| Soft Skills | Strong collaboration, communication, problem-solving, and independent decision-making |
| Benefits | Medical, critical illness, accidental, life insurance; wellness programs; parental support; mobility benefits; retirement benefits (PF, NPS, Gratuity); higher education assistance, car lease, salary advance |
| Key Differentiator | Focus on infrastructure reliability and automation at scale; distinct from HR, Payroll, or PR roles at PhonePe |

