Faire

Faire is an online wholesale marketplace connecting independent retailers with unique merchandise from around the world. With flexible payment terms, free returns, and personalized recommendations, Faire empowers small businesses to compete with larger...

Textiles, Apparel & Luxury Goods
1K-5K
Founded 2017
$1500M raised

Description

  • Design and operate ML infrastructure, including workspaces, clusters, jobs, and workflows.
  • Productionize ML workloads using Spark, Delta Lake, MLflow, and Databricks Workflows.
  • Teach data scientists how to move models from notebooks into production on the ML platform.
  • Implement Unity Catalog for data governance, lineage, access control, and secure multi-tenant usage.
  • Build CI/CD pipelines for machine learning using Terraform and Git-based workflows such as GitHub Actions.
  • Optimize performance, reliability, and cost across training and inference workloads.
  • Configure IAM and RBAC for sensitive datasets.
  • Establish observability for data quality, model performance, and platform health.
  • Build and maintain technical documentation for the ML platform.

Requirements

  • 8+ years of experience building production ML or data platforms.
  • A degree, preferably graduate level, in Computer Science, Engineering, Statistics, or a related technical field.
  • Strong hands-on expertise with Databricks, Spark, Delta Lake, and MLflow.
  • Proficiency in Python, SQL, and distributed systems concepts.
  • Experience with cloud platforms and infrastructure-as-code.
  • Solid understanding of MLOps best practices, including CI/CD, monitoring, reproducibility, and security.
  • Experience supporting multiple ML teams in a shared platform environment.
  • Experience with Kotlin, PyTorch, Kafka, Snowflake, Fivetran, Iceberg, Datadog, Airflow, CockroachDB, or MySQL is preferred.
  • Experience with AWS, S3, SageMaker, Kubernetes, Docker, GitHub Actions, or Terraform is preferred.
  • Familiarity with the generative AI tools listed in the tech stack, such as Claude Sonnet 4.5 and ChatGPT 5.2.

Benefits

  • Salary range of $224,000 to $308,000 per year in San Francisco.
  • Eligibility for equity.
  • Hybrid work schedule with 3 days per week in the office.
  • Flexibility to work remotely up to 4 weeks per year in hybrid roles.
  • Reasonable accommodation support during the recruitment process.
  • Equal employment opportunity commitment.
  • Access to benefits, though specific plan details are not listed.
