Senior Python Developer: Databricks AI Platform, Alerting & Monitoring

1 month, 1 week ago
Contract
Senior
DevOps and Infrastructure
Xenon7

Xenon7 provides advanced AI solutions and consultancy services, leveraging a team of highly qualified experts and a strong emphasis on research and innovation to address complex industry challenges and enhance operational efficiency.

Internet Software & Services
Founded 2014

Description

  • Build Python-based workflows for MLOps, LLMOps, and application deployment within Databricks.
  • Enhance Databricks workspace onboarding and governance, including Unity Catalog, permissions, and reusable environment setup modules.
  • Integrate Mosaic AI components (Gateway, Model Serving, Agents) into platform automation and deployment pipelines.
  • Support Delta Lake (Bronze/Silver/Gold) architecture and manage MLflow model lifecycles.
  • Implement automated health checks and observability for AWS resources and Databricks applications.
  • Develop event-driven alerting mechanisms using AWS CloudWatch, SNS, and EventBridge.
  • Build Python automations to validate configuration consistency across multiple AWS accounts and detect anomalies or misconfigurations.
  • Create automated service-request workflows that bridge alerting with ticketing and collaboration tools (Slack, Jira, etc.).
  • Design monitoring dashboards and fail-safe/rollback mechanisms to maintain production stability and uptime.

Requirements

  • 6+ years of professional Python development and cloud automation experience, with deep knowledge of Python internals (the GIL, multiprocessing vs. multithreading, memory trade-offs).
  • Hands-on experience with Databricks ecosystem components: Unity Catalog, MLflow, and Mosaic AI.
  • Experience with Delta Lake architecture (Bronze/Silver/Gold) and ML model lifecycle management.
  • Strong proficiency with AWS automation and observability tools: Lambda, API Gateway, CloudWatch, EventBridge, SNS.
  • Experience implementing reliability engineering practices such as Docker image immutability and automated rollback strategies.
  • Familiarity with Service Principal–based authentication for secure Databricks/AWS integration.
  • Experience building event-driven alerting and integrations with ticketing/collaboration systems (Slack, Jira).
  • Ability to work independently in a remote, global environment; immediate availability is highly preferred.
  • A mindset that balances building new AI capabilities with proactive monitoring and a focus on operational uptime (e.g., an SRE/reliability orientation).

Benefits

  • Access to a networked ecosystem of client engagements, thought leadership, and mentorship opportunities.
  • Outcome-focused culture emphasizing autonomy, ownership, and smart execution over hours logged.
  • Opportunity to contribute to leading-edge AI and high-scale cloud infrastructure projects.
