Senior Python Developer: Databricks AI Platform, Alerting & Monitoring

2 months ago
Contract
Senior
DevOps and Infrastructure
Xenon7

Xenon7 provides advanced AI solutions and consultancy services, leveraging a team of highly qualified experts and a strong emphasis on research and innovation to address complex industry challenges and enhance operational efficiency.

Internet Software & Services
Founded 2014

Description

  • Build Python-based workflows for MLOps, LLMOps, and application deployment within Databricks.
  • Enhance Databricks workspace onboarding and governance, including Unity Catalog, permissions, and reusable environment setup modules.
  • Integrate Mosaic AI components (Gateway, Model Serving, Agents) into platform automation and deployment pipelines.
  • Support Delta Lake (Bronze/Silver/Gold) architecture and manage MLflow model lifecycles.
  • Implement automated health checks and observability for AWS resources and Databricks applications.
  • Develop event-driven alerting mechanisms using AWS CloudWatch, SNS, and EventBridge.
  • Build Python automations to validate configuration consistency across multiple AWS accounts and detect anomalies or misconfigurations.
  • Create automated service-request workflows that bridge alerting with ticketing and collaboration tools (Slack, Jira, etc.).
  • Design monitoring dashboards and fail-safe/rollback mechanisms to maintain production stability and uptime.

Requirements

  • 6+ years of professional Python development and cloud automation experience, with deep knowledge of Python internals (the GIL, multiprocessing vs. multithreading, memory trade-offs).
  • Hands-on experience with Databricks ecosystem components: Unity Catalog, MLflow, and Mosaic AI.
  • Experience with Delta Lake architecture (Bronze/Silver/Gold) and ML model lifecycle management.
  • Strong proficiency with AWS automation and observability tools: Lambda, API Gateway, CloudWatch, EventBridge, SNS.
  • Experience implementing reliability engineering practices such as Docker image immutability and automated rollback strategies.
  • Familiarity with Service Principal–based authentication for secure Databricks/AWS integration.
  • Experience building event-driven alerting and integrations with ticketing/collaboration systems (Slack, Jira).
  • Ability to work independently in a remote, global environment; immediate availability is highly preferred.
  • A mindset that combines building new AI capabilities with proactive monitoring and a focus on operational uptime (e.g., an SRE/reliability orientation).

Benefits

  • Access to a networked ecosystem of client engagements, thought leadership, and mentorship opportunities.
  • Outcome-focused culture emphasizing autonomy, ownership, and smart execution over hours logged.
  • Opportunity to contribute to leading-edge AI and high-scale cloud infrastructure projects.
