Machine Learning Operations (MLOps) Architect - Generative AI Focus

6 hours, 14 minutes ago
Full-time
Senior
DevOps and Infrastructure
Kapitus

Kapitus, a leading small business financing company, offers fast, flexible term loans and lines of credit from $5k to $5 million. With a focus on simplifying financing for small businesses, Kapitus provides multiple loan options to support business gro...

Diversified Financial Services
251-1K employees
Founded 2006
$490M raised

Description

  • Design and implement scalable ML and LLM infrastructure on AWS using services such as SageMaker, EKS, S3, IAM, Lambda, Step Functions, and CloudWatch.
  • Architect end-to-end ML and Generative AI lifecycle workflows from data ingestion and preprocessing through training, fine-tuning, deployment, monitoring, and retraining.
  • Integrate LLM pipelines, including prompt workflows, RAG architectures, and fine-tuning flows, into the enterprise MLOps stack.
  • Define and enforce CI/CD/CT standards across ML and GenAI workloads.
  • Architect Retrieval-Augmented Generation pipelines, including embedding generation, vector database integration, document chunking, and retrieval monitoring.
  • Design and deploy LLM-based services using managed endpoints or containerized custom inference services.
  • Establish prompt versioning, evaluation frameworks, experiment tracking, and guardrails for LLM systems.
  • Implement model monitoring, observability, and traceability standards for performance, drift, output quality, latency, and token usage.
  • Define SLAs/SLOs and safe deployment strategies such as blue/green, canary, and shadow testing.
  • Implement FinOps practices for ML and GenAI workloads, including cost tracking, optimization, forecasting, and autoscaling strategies.
  • Provide architectural guidance to data science, AI, and engineering teams and drive reusable documentation and standards.

Requirements

  • 6+ years of experience in ML engineering, data engineering, or MLOps roles.
  • Proven experience architecting ML platforms in AWS.
  • Strong hands-on experience with SageMaker, including training, pipelines, and deployment.
  • Experience operationalizing LLM or Generative AI systems in production.
  • Experience building RAG pipelines and integrating vector databases.
  • Experience working with Databricks in production.
  • Experience implementing data governance and catalog systems such as Atlan.
  • Strong understanding of CI/CD principles for ML and GenAI.
  • Experience with Docker and Kubernetes/EKS.
  • Deep knowledge of infrastructure as code tools such as Terraform and CloudFormation.
  • Strong understanding of observability and monitoring for ML systems.
  • Experience implementing cloud cost optimization strategies (FinOps).
  • Strong Python proficiency.
  • Experience with foundation model fine-tuning and parameter-efficient methods.
  • Experience implementing model registries and experiment tracking tools.
  • Experience designing feature stores and embedding stores.
  • Familiarity with AI risk management, bias mitigation, and safety controls.
  • Experience supporting regulated or data-sensitive environments.
  • Platform-level architectural thinking and the ability to integrate GenAI into enterprise ML ecosystems.
  • Ability to balance scalability, governance, security, performance, and cost.
  • Strong technical leadership and cross-functional collaboration skills.
  • Hands-on ability to move from architecture design to implementation.

Benefits

  • Competitive base salary range of $117,800 to $189,000, depending on location, skills, and experience.
  • Eligibility for annual incentive compensation of up to 10%.
  • Comprehensive medical, dental, and employer-paid vision insurance.
  • Flexible spending account for qualified medical, dental, vision, pharmacy, and dependent care expenses.
  • Lifestyle spending account for physical, mental, and financial well-being expenses.
  • 100% company-paid basic short-term and long-term disability insurance, plus vision insurance.
  • Paid maternity and parental leave.
  • Commuter benefits for parking and travel expenses.
  • Tuition reimbursement of up to $5,000 annually, plus conference and career development support.
  • Paid time off and sick time.
  • 401(k) retirement plan with a 25% company match up to 6% of annual salary.
  • Remote candidates may be considered if they reside in eligible states where Kapitus or a subsidiary has a physical presence.
