
Lead Site Reliability Engineer, DevOps

- Qualys, Inc
- India

6 days ago 2026/05/30

Other Business Support Services

Job description

Come work at a place where innovation and teamwork come together to support the most exciting missions in the world!

Job Title

Senior Site Reliability Engineer (SRE) – Observability & DevOps

Role Summary

We are looking for a Senior SRE who will own and evolve our observability and reliability platform. The ideal candidate has strong Linux fundamentals, hands-on experience with modern monitoring stacks, and the ability to design scalable alerting and metrics pipelines for large, distributed systems.

This role requires both deep technical expertise and a production ownership mindset.

Primary Responsibilities

Observability & Monitoring

Design, implement, and maintain end-to-end observability using:
- Prometheus for metrics collection
- Alertmanager for alert routing, deduplication, and escalation
- Grafana for visualization and dashboards
- AppDynamics for APM, transaction tracing, and application health
Build actionable dashboards for:
- SLIs, SLOs, and error budgets
- Application, infrastructure, and platform health
Reduce alert fatigue by implementing signal-based alerting and proper severity models
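
For readers less familiar with the SLO and error-budget figures such dashboards typically surface, the arithmetic can be sketched in a few lines of Python. This is only an illustration, not the employer's tooling: the function name `error_budget_remaining`, the 99.9% target, and the request counts are all hypothetical.

```python
def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Return the unspent fraction of the error budget for one SLO window.

    slo_target is the availability objective as a fraction (e.g. 0.999).
    The result is 1.0 when nothing has failed and goes negative once
    the budget is overspent.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total <= 0:
        raise ValueError("total must be positive")
    # The budget is the number of requests allowed to fail in the window.
    allowed_failures = (1.0 - slo_target) * total
    return 1.0 - failed / allowed_failures


# Hypothetical window: 1,000,000 requests, 250 failures, 99.9% SLO.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
print(f"error budget remaining: {remaining:.1%}")
```

Alerting on how fast this value shrinks, rather than on every individual failure, is one common way to approach the signal-based alerting mentioned above.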

Data & Metrics Platform

Manage and optimize ClickHouse for:
- High-volume metrics, logs, or traces
- Long-term retention and fast analytical queries
Work on schema design, performance tuning, and cost optimization

Reliability & Operations

Define and track SLIs, SLOs, and SLAs in line with SRE best practices
Participate in incident response, postmortems, and root cause analysis
Drive reliability improvements through automation and capacity planning

Automation & Engineering

Develop tooling and automation using at least one scripting/programming language
Automate monitoring onboarding, alert generation, and dashboard creation
Improve operational efficiencies across DevOps tooling

Required Technical Skills (Must-Have)

Core Skills

Strong Linux fundamentals
- Troubleshooting, performance tuning, networking, system internals
Scripting / Programming (Any one or more):
- Python (preferred), Bash, Go, or similar
Observability Tools (Hands-on):
- Prometheus
- Alertmanager
- Grafana
- AppDynamics
Data Platform:
- Hands-on experience with ClickHouse

Monitoring & Alerting Concepts

Metrics vs logs vs traces
Golden signals (latency, traffic, errors, saturation)
Alert thresholds, routing policies, escalation strategies
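
As a rough illustration of the four golden signals listed above, the following Python sketch derives them from a batch of request samples. Everything here is an assumption made for the example: the `Request` type, the `golden_signals` helper, the choice of p95 for latency, and the use of CPU as the saturation resource do not come from the posting.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class Request:
    latency_ms: float
    status: int  # HTTP status code


def golden_signals(requests: list[Request], window_s: float,
                   cpu_used: float, cpu_capacity: float) -> dict[str, float]:
    """Summarize latency, traffic, errors, and saturation for one window."""
    latencies = [r.latency_ms for r in requests]
    if len(latencies) >= 2:
        # 95th percentile, interpolated within the observed range.
        p95 = quantiles(latencies, n=100, method="inclusive")[94]
    else:
        p95 = latencies[0] if latencies else 0.0
    # Count server-side failures only (5xx responses).
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p95_ms": p95,
        "traffic_rps": len(requests) / window_s,
        "error_rate": errors / len(requests) if requests else 0.0,
        "saturation": cpu_used / cpu_capacity,
    }


# Hypothetical two-second window with four requests.
sample = [Request(10, 200), Request(20, 200), Request(30, 500), Request(40, 200)]
print(golden_signals(sample, window_s=2.0, cpu_used=3.0, cpu_capacity=4.0))
```

In a real deployment these values would normally come from a metrics backend such as Prometheus rather than from in-process lists.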

Preferred / Nice-to-Have Skills

Kubernetes monitoring (Prometheus Operator, kube-state-metrics)
Infrastructure as Code (Terraform, Helm)
CI/CD observability
Cloud platforms (AWS / Azure / GCP)
Experience managing observability at scale (100+ services / platforms)

Senior-Level Expectations

Ability to architect observability solutions, not just operate them
Strong production troubleshooting and incident ownership
Mentoring junior engineers
Influence DevOps and SRE best practices across teams
Communicate clearly with developers and leadership

Experience & Qualification

5-7 years of experience in SRE / DevOps / Production Engineering
Experience operating high-availability, large-scale systems
Proven background in observability-driven reliability improvements

This job post has been translated by AI and may contain minor differences or errors.
