$ whoami

Hamza Karhat, M.Sc.

DevOps Engineer and Site Reliability Lead

$ skills

Python, Rust, OpenStack, AWS, Docker, Kubernetes, Terraform, Ansible

System Architecture Projects

IP Network Management Application Reliability

Improved the reliability and uptime of a business-critical IP Network Management platform through proactive monitoring, automation, and fault-tolerant design practices.

Key Achievements:

• Managed and maintained infrastructure on OpenStack, deploying and scaling virtual machines to support critical service workloads.
• Maintained a fault-tolerant architecture through automatic failover and regular backups.
• Implemented monitoring and alerting with Grafana, using custom dashboards and alerting rules to detect and act on system anomalies in real time.
• Automated operations and maintenance tasks using Ansible, Shell scripting, and Python, reducing manual effort and improving system consistency.
• Increased uptime by 1.2% over my tenure, bringing the application within its SLA.

Kubernetes, Docker, Red Hat, Kafka, Grafana, Ansible, Python

Migration of IP Network Devices to a New Management Application

Oversaw the successful migration of over 600 network devices to the IP Network Management Application under my ownership.

Key Achievements:

• Planned and executed a phased migration strategy to minimize service disruption and maintain network integrity throughout the process.
• Managed configuration updates and connectivity policies across firewalls and routers, ensuring secure device integration into the new system.
• Collaborated with cross-functional teams and stakeholders to coordinate timelines, address technical risks, and ensure alignment with business goals.
• Developed automation scripts in Python and Shell to validate device readiness and streamline post-migration testing.
• Resolved infrastructure and integration challenges in real time, maintaining service availability and ensuring SLA compliance during the transition.

Kubernetes, Docker, Red Hat, Kafka, Grafana, Ansible, Python

Impact & Achievements

System Performance

Reliability Engineering

• Increased uptime by 1.2% and aligned system reliability with SLA targets
• Reduced support tickets by 40% via proactive monitoring and issue detection
• Cut incident response time through automated fault detection pipelines

Application Migration

• Led migration of 600+ devices to a newly developed network management platform
• Maintained full service continuity during phased rollout and firewall policy updates
• Improved stakeholder satisfaction by driving transparent planning and team coordination

Infrastructure & DevOps

Cloud Infrastructure

• Deployed and managed scalable OpenStack environments for critical services
• Built fault-tolerant VM clusters with automated failover and regular backups
• Integrated cloud-init and infrastructure-as-code for repeatable provisioning

Automation & Monitoring

• Automated patching and provisioning with Ansible and Python
• Deployed monitoring stacks using Grafana, Prometheus, and Zabbix
• Reduced manual intervention by 30% through scripting and workflow orchestration

Development & Leadership

Technical Ownership

• Assumed ownership of a business-critical platform and led service reliability
• Defined incident response and root-cause processes
• Earned promotion to Lead SRE for leadership and execution impact

Team & Stakeholder Impact

• Coordinated across DevOps, network, and business teams during migrations
• Led on-call readiness and reliability best practices
• Increased stakeholder confidence through transparent reporting and metrics

$ contact --info

Let's Connect

$ location --current

Montreal, QC

$ ls ./social-links

GitLab

@hamzakarhat

Hamza Karhat