Recorded Future

Recorded Future is a threat intelligence platform that helps organizations identify and mitigate threats across multiple domains with real-time, unbiased, and actionable intelligence.

Industry: Professional Services
Company size: 251-1K
Founded: 2009
Funding: $58M raised

Description

  • Ensure the platform meets its service-level objectives and its targets for performance, capacity, scalability, reliability, resiliency, security, compliance, supportability, and cost efficiency.
  • Design, implement, and maintain scalable and reliable infrastructure on AWS.
  • Develop and manage observability solutions using tools such as Grafana, ELK, and Prometheus to monitor system health and performance.
  • Automate infrastructure provisioning and configuration using Terraform and Chef.
  • Participate in a 24/7 on-call rotation to respond to and resolve production incidents.
  • Collaborate with engineering teams to ensure applications are designed for high availability and resilience.
  • Perform comprehensive root cause analysis for outages and recurring incidents.
  • Identify performance bottlenecks and systemic issues, then drive proactive improvements.
  • Lead continuous improvement efforts through automation, process optimization, and post-incident reviews.
  • Create clear incident reports and technical documentation.

Requirements

  • 3+ years of experience in a Site Reliability Engineer, DevOps Engineer, or similar role.
  • Hands-on experience with Amazon Web Services (AWS), including AWS networking concepts.
  • Expert-level troubleshooting and diagnostic skills.
  • Proven track record of reducing system downtime.
  • Advanced Linux skills across engineering fundamentals, networking, storage, and operating systems.
  • Experience managing and optimizing observability tools such as Grafana and the ELK Stack.
  • Strong proficiency in Terraform and Chef.
  • Strong preference for automating tasks and using Infrastructure as Code rather than manual changes.
  • Ability to understand complex architectures and stay calm under pressure during outages.
  • Preferred: knowledge of and experience with Kubernetes.
  • Preferred: familiarity with message brokers such as RabbitMQ and Apache Kafka.
  • Preferred: experience with NoSQL databases, particularly MongoDB and Elasticsearch.
  • Preferred: familiarity with OpenTelemetry.
  • Preferred: experience with large distributed systems and microservices architecture.
  • Preferred: experience with CI/CD pipelines.

Benefits

  • Opportunity to join a large, global intelligence company with more than 1,000 intelligence professionals serving over 1,900 clients worldwide.
  • Work for a company with a 4.6-star user rating on G2 and customers including more than 50% of the Fortune 100.
  • Be part of a diverse, inclusive workplace representing over 40 nationalities.
  • Accommodation and special assistance available during the application process.
  • Equal opportunity and affirmative action employer committed to fair hiring practices.
  • Drug-free workplace.
