Recorded Future

Recorded Future is the leading threat intelligence platform, empowering organizations to identify and mitigate threats across various domains with real-time, unbiased, and actionable intelligence.

Professional Services
251-1K
Founded 2009
$58M raised

Description

  • Ensure the platform's performance, capacity, scalability, reliability, resiliency, security, compliance, supportability, and cost efficiency, and that it meets its service-level objectives.
  • Design, implement, and maintain scalable and reliable infrastructure on AWS.
  • Develop and manage observability solutions using tools such as Grafana, ELK, and Prometheus to monitor system health and performance.
  • Automate infrastructure provisioning and configuration using Terraform and Chef.
  • Participate in a 24/7 on-call rotation to respond to and resolve production incidents.
  • Collaborate with engineering teams to ensure applications are designed for high availability and resilience.
  • Perform comprehensive root cause analysis for outages and recurring incidents.
  • Identify performance bottlenecks and systemic issues, then drive proactive improvements.
  • Lead continuous improvement efforts through automation, process optimization, and post-incident reviews.
  • Create clear incident reports and technical documentation.

Requirements

  • 3+ years of experience in a Site Reliability Engineer, DevOps Engineer, or similar role.
  • Hands-on experience with Amazon Web Services (AWS), including AWS networking concepts.
  • Expert-level troubleshooting and diagnostic skills.
  • Proven track record of reducing system downtime.
  • Advanced Linux skills across engineering fundamentals, networking, storage, and operating systems.
  • Experience managing and optimizing observability tools such as Grafana and the ELK Stack.
  • Strong proficiency in Terraform and Chef.
  • Strong preference for automating tasks and using Infrastructure as Code rather than manual changes.
  • Ability to understand complex architectures and stay calm under pressure during outages.
  • Preferred: knowledge of and experience with Kubernetes.
  • Preferred: familiarity with message brokers such as RabbitMQ and Apache Kafka.
  • Preferred: experience with NoSQL databases, particularly MongoDB and Elasticsearch.
  • Preferred: familiarity with OpenTelemetry.
  • Preferred: experience with large distributed systems and microservices architecture.
  • Preferred: experience with CI/CD pipelines.

Benefits

  • Opportunity to join a large, global intelligence company with more than 1,000 intelligence professionals serving over 1,900 clients worldwide.
  • Work for a company with a 4.6-star user rating on G2 and customers including more than 50% of the Fortune 100.
  • Be part of a diverse, inclusive workplace representing over 40 nationalities.
  • Accommodation and special assistance available during the application process.
  • Equal opportunity and affirmative action employer committed to fair hiring practices.
  • Drug-free workplace.
