Senior Python Developer: Databricks AI Platform, Alerting & Monitoring

1 day, 20 hours ago
Contract
Senior
DevOps and Infrastructure
Xenon7

Xenon7 provides advanced AI solutions and consultancy services, leveraging a team of highly qualified experts and a strong emphasis on research and innovation to address complex industry challenges and enhance operational efficiency.

Internet Software & Services
Founded 2014

Description

  • Build Python-based workflows for MLOps, LLMOps, and application deployment within Databricks.
  • Enhance Databricks workspace onboarding and governance, including Unity Catalog, permissions, and reusable environment setup modules.
  • Integrate Mosaic AI components (Gateway, Model Serving, Agents) into platform automation and deployment pipelines.
  • Support Delta Lake (Bronze/Silver/Gold) architecture and manage MLflow model lifecycles.
  • Implement automated health checks and observability for AWS resources and Databricks applications.
  • Develop event-driven alerting mechanisms using AWS CloudWatch, SNS, and EventBridge.
  • Build Python automations to validate configuration consistency across multiple AWS accounts and detect anomalies or misconfigurations.
  • Create automated service-request workflows that bridge alerting with ticketing and collaboration tools (Slack, Jira, etc.).
  • Design monitoring dashboards and fail-safe/rollback mechanisms to maintain production stability and uptime.

Requirements

  • 6+ years of professional Python development and cloud automation experience, including deep knowledge of Python internals (the GIL, multiprocessing vs. multithreading, memory trade-offs).
  • Hands-on experience with Databricks ecosystem components: Unity Catalog, MLflow, and Mosaic AI.
  • Experience with Delta Lake architecture (Bronze/Silver/Gold) and ML model lifecycle management.
  • Strong proficiency with AWS automation and observability tools: Lambda, API Gateway, CloudWatch, EventBridge, SNS.
  • Experience implementing reliability engineering practices such as Docker image immutability and automated rollback strategies.
  • Familiarity with Service Principal–based authentication for secure Databricks/AWS integration.
  • Experience building event-driven alerting and integrations with ticketing/collaboration systems (Slack, Jira).
  • Ability to work independently in a remote, global environment; immediate availability is highly preferred.
  • A mindset that combines building new AI capabilities with proactive monitoring and a focus on operational uptime (e.g., an SRE/reliability orientation).

Benefits

  • Access to a networked ecosystem of client engagements, thought leadership, and mentorship opportunities.
  • Outcome-focused culture emphasizing autonomy, ownership, and smart execution over hours logged.
  • Opportunity to contribute to leading-edge AI and high-scale cloud infrastructure projects.
