Senior Python Developer: Databricks AI Platform, Alerting & Monitoring

3 weeks, 1 day ago
Contract
Senior
DevOps and Infrastructure
Xenon7

Xenon7 provides advanced AI solutions and consultancy services, leveraging a team of highly qualified experts and a strong emphasis on research and innovation to address complex industry challenges and enhance operational efficiency.

Internet Software & Services
Founded 2014

Description

  • Build Python-based workflows for MLOps, LLMOps, and application deployment within Databricks.
  • Enhance Databricks workspace onboarding and governance, including Unity Catalog, permissions, and reusable environment setup modules.
  • Integrate Mosaic AI components (Gateway, Model Serving, Agents) into platform automation and deployment pipelines.
  • Support Delta Lake (Bronze/Silver/Gold) architecture and manage MLflow model lifecycles.
  • Implement automated health checks and observability for AWS resources and Databricks applications.
  • Develop event-driven alerting mechanisms using AWS CloudWatch, SNS, and EventBridge.
  • Build Python automations to validate configuration consistency across multiple AWS accounts and detect anomalies or misconfigurations.
  • Create automated service-request workflows that bridge alerting with ticketing and collaboration tools (Slack, Jira, etc.).
  • Design monitoring dashboards and fail-safe/rollback mechanisms to maintain production stability and uptime.

Requirements

  • 6+ years of professional Python development and cloud automation experience, including deep knowledge of Python internals (the GIL, multiprocessing vs. multithreading, and memory trade-offs).
  • Hands-on experience with Databricks ecosystem components: Unity Catalog, MLflow, and Mosaic AI.
  • Experience with Delta Lake architecture (Bronze/Silver/Gold) and ML model lifecycle management.
  • Strong proficiency with AWS automation and observability tools: Lambda, API Gateway, CloudWatch, EventBridge, SNS.
  • Experience implementing reliability engineering practices such as Docker image immutability and automated rollback strategies.
  • Familiarity with Service Principal–based authentication for secure Databricks/AWS integration.
  • Experience building event-driven alerting and integrations with ticketing/collaboration systems (Slack, Jira).
  • Ability to work independently in a remote, global environment; immediate availability is highly preferred.
  • A mindset that balances building new AI capabilities with proactive monitoring and a focus on operational uptime (e.g., an SRE/reliability orientation).

Benefits

  • Access to a networked ecosystem of client engagements, thought leadership, and mentorship opportunities.
  • Outcome-focused culture emphasizing autonomy, ownership, and smart execution over hours logged.
  • Opportunity to contribute to leading-edge AI and high-scale cloud infrastructure projects.

Interested in this position?

Apply directly on the company website

Similar Roles

Senior Machine Learning Engineer, AI Platform

Affinity 251-1K IT Services

Affinity is hiring a Senior Machine Learning Engineer for its AI Platform team to build production ML systems that extract, retrieve, and rank insights from massive relationship and business interaction data for its CRM platform.

CI/CD Feature Engineering Machine Learning Python PyTorch Scikit-learn
3 hours, 2 minutes ago

AI Tech Lead - Staff Machine Learning Engineer

Sumo Logic 251-1K Internet Software & Services

Sumo Logic is hiring a Staff Machine Learning Engineer – AI Tech Lead to lead the design and production delivery of agentic AI systems for Security Operations Center use cases at global scale.

Apache Airflow AWS Azure Docker GCP Kubernetes LLM Machine Learning MLflow Python PyTorch System Design Vertex AI
3 hours, 22 minutes ago

AI/ML Engineer (AWS)

Reply 10K-50K Internet Software & Services

Valorem Reply is hiring a Senior AI/ML Engineer in Irvine or Los Angeles to build and evolve AWS-based machine learning and Generative AI applications for enterprise customers.

Agile AWS CI/CD Generative AI LLM Machine Learning Python
5 hours, 23 minutes ago

Site Reliability Engineer (Senior or Staff), Atlas

MongoDB 1K-5K Internet Software & Services

MongoDB is hiring a Senior Site Reliability Engineer for its Atlas team to help support, maintain, and grow a multi-cloud platform for customer-facing production workloads.

AWS Azure DNS GCP Go HTTP Linux Python Ruby TLS
5 hours, 44 minutes ago