Barbaricum

Barbaricum is a government contracting firm in Washington, D.C., that provides technology, communications, and cyber/intel services in support of U.S. Government national security missions.

Industry: Professional Services
Company size: 251-1K employees
Founded: 2008

Description

  • Monitor and maintain system reliability, availability, and performance across on-premises, cloud, and hybrid environments.
  • Implement proactive monitoring, automated alerting, incident response workflows, and resilience engineering practices.
  • Develop, maintain, and improve automated infrastructure solutions that support reliable and repeatable operations.
  • Implement rollback strategies, recovery approaches, and chaos engineering practices to validate and improve system resilience.
  • Analyze usage patterns, capacity trends, and performance indicators to support scaling and optimization decisions.
  • Develop and maintain real-time dashboards, reports, and metrics for operational visibility and rapid decision-making.
  • Respond to and resolve outages, service impairments, and disruptions while coordinating with technical teams.
  • Conduct post-incident reviews, identify root causes, document lessons learned, and implement preventive measures.
  • Collaborate with developers, cloud engineers, cybersecurity staff, and operations teams to improve reliability and operational standards.
  • Create and maintain system documentation, runbooks, configuration standards, monitoring procedures, and service reliability guidance.

Requirements

  • Bachelor’s degree in Computer Science, Information Technology, Systems Engineering, Cybersecurity, or a related field; master’s degree preferred.
  • 10+ years of experience in site reliability engineering, systems administration, infrastructure operations, cloud operations, DevSecOps, or a similar technical role.
  • Experience in a government, federal, defense, or secure IT environment is strongly preferred.
  • Expert knowledge of site reliability engineering practices, system monitoring, incident management, automation, performance tuning, and operational resilience.
  • Strong understanding of Windows and Linux administration, infrastructure operations, system configuration, service management, and troubleshooting.
  • Experience with automation and configuration management tools such as Ansible, Puppet, or Chef.
  • Proficiency with scripting languages such as Python, Shell, or PowerShell.
  • Knowledge of cloud services and infrastructure across AWS, Microsoft Azure, Google Cloud, or comparable environments.
  • Experience developing automated infrastructure, scripts, monitoring solutions, dashboards, runbooks, and configuration standards.
  • Experience supporting incident response, outage resolution, post-incident reviews, root cause analysis, and operational improvement initiatives.
  • DoD Secret Security Clearance.
  • Certifications related to cloud computing, system administration, site reliability engineering, DevSecOps, or automation are beneficial.
