| Company | PhonePe |
| Job Title | Site Reliability Engineer – Big Data |
| Location | Bangalore |
| Experience | 7–11 years |
| Role Focus | Manage and maintain distributed big data ecosystems; ensure reliability, scalability, and security of large-scale production infrastructure |
| Key Responsibilities | Manage Linux/Unix environments and on-call incident response; design and implement automation for provisioning, scaling, upgrading, and patching clusters; troubleshoot production issues and perform root cause analysis; optimize system performance, resource usage, and workflows; collaborate with teams on system design and integration; enforce security and SRE best practices; develop operational automation scripts and tools; monitor system health using ELK, Grafana, Prometheus, and OpenTelemetry |
| Technical Skills | Linux (IP, iptables, IPsec); Scripting: Perl, Golang, Python; Hadoop stack: HDFS, HBase, Airflow, YARN, Ranger, Kafka, Pinot; Configuration/Deployment: Puppet, Salt, Chef, Ansible; DevOps tools: Docker, Git, SaltStack, Ansible; Observability/Monitoring: ELK, Grafana, Prometheus, OpenTelemetry; Networking and cloud infrastructure (AWS, GCP, Azure – good to have) |
| Scope / Scale | Large-scale big data production clusters, supporting critical business services for millions of users and 330+ million transactions/day |
| Soft Skills | Strong collaboration, communication, problem-solving, and independent decision-making |
| Benefits | Medical, critical illness, accidental, and life insurance; wellness programs; parental support; mobility benefits; retirement benefits (PF, NPS, Gratuity); higher education assistance; car lease; salary advance |
| Key Differentiator | Focus on infrastructure reliability and automation at scale; distinct from HR, Payroll, or PR roles at PhonePe |