PlayON! Sports Network

PlayON! Sports Network provides a comprehensive platform for high school sports programs, offering digital ticketing, live streaming, statistics, coaching tools, and social content to enhance community engagement and support student athletes.

Industry: Media
Company size: 51-250 employees
Founded: 2006
Funding: $10M raised

Description

  • Assess and improve system visibility by reviewing dashboards, metrics, and logs and closing observability gaps.
  • Tighten monitoring and alerting for critical services to detect issues earlier and improve response times.
  • Build observability into build and deploy workflows by adding instrumentation and telemetry to release processes.
  • Help define SLIs and SLOs for core user flows and align the team on reliability expectations.
  • Improve incident response by partnering with the Event Commander/on-call rotation and strengthening communication, coordination, and follow-up.
  • Automate routine checks and monitoring tasks to reduce manual effort and free up engineering time.
  • Develop automation, tooling, and monitoring solutions that support high service availability.
  • Partner with application and quality engineering teams on reliability practices, release automation, and testing.
  • Drive operational excellence through incident prevention, blameless postmortems, and capacity planning.
  • Participate in on-call rotations to support critical services and respond quickly to incidents.

Requirements

  • Solid experience in Python for automation, tooling, and data-driven operational work.
  • Proficiency in at least one of Java, C++, or Go.
  • Strong understanding of Linux systems, cloud infrastructure, and modern deployment practices.
  • Experience with AWS, GCP, or Azure.
  • Experience with Docker, Kubernetes, and Terraform.
  • Experience with CI/CD pipelines, version control, and automated testing frameworks.
  • Experience with observability tools such as Prometheus, Grafana, ELK, or Datadog.
  • Experience analyzing logs and metrics to diagnose issues.
  • Proven experience facilitating and documenting Critical User Journeys and translating them into actionable SLAs/SLOs for automation.
  • Strong collaboration and communication skills in cross-functional, high-impact situations.
  • Familiarity with AI-augmented development tools such as Claude and Codex.
  • Nice to have: experience writing or maintaining end-to-end or integration tests for distributed systems.
  • Nice to have: background in performance testing, capacity planning, or chaos engineering.
  • Nice to have: contributions to internal developer tooling or reliability-focused frameworks.
  • Nice to have: exposure to security, compliance, or change management processes in production environments.
  • Nice to have: relevant certifications.

Benefits

  • Multiple medical insurance plans to choose from.
  • Dental, vision, life, and disability insurance.
  • Employee Emergency Fund.
  • Company equity in the form of stock options.
  • Open PTO policy.
  • 401(k) plan with company match.
  • Hybrid/flexible work environment.
