Senior Data Engineer

Convoy Health

Data Science

United States

Posted on May 22, 2026

About the Role

Convoy Health is an early-stage, venture-backed healthcare AI company. Small founding team. You'll own meaningful surface area from day one and ship product into real customer environments within weeks, not quarters. Our platform combines data engineering, analytics, and agentic AI to turn complex healthcare data into actionable intelligence.

As a Senior Data Engineer, you'll own the data infrastructure end-to-end. Every insight the AI agent surfaces, every dashboard a user opens, every forecast model that runs — it all starts with the data platform you build and operate. This means ingestion pipelines across diverse source systems, a multi-tenant warehouse, transformation layers that normalize heterogeneous healthcare data, and the analytics foundation that serves both human users and autonomous AI agents.

You're expected to use AI-assisted development tools daily and to design data systems with agentic consumption patterns in mind.

What You'll Do

Data Platform & Architecture

  • Design and operate a dual-database architecture: a columnar analytics warehouse with schema-per-tenant isolation and a relational OLTP database for profiles, config, audit logs, and AI conversations
  • Implement and enforce tenant data isolation: schema-level grants, session-level context injection, defense-in-depth filtering, and row-level security policies
  • Build the data foundation that serves both human-facing dashboards and agentic tool-use queries — optimizing for the distinct access patterns of each

Multi-Source Ingestion

  • Build and operate ingestion pipelines for diverse healthcare data sources:
      • Claims: ANSI X12 837I/837P/835/834 and CSV formats via workflow orchestration, serverless parsers, and bulk loading
      • EMR/EHR: HL7 FHIR extracts, flat-file exports from major EHR systems (Epic, Cerner, Athena, eClinicalWorks, etc.)
      • RCM: Revenue cycle data from billing platforms (charge capture, A/R aging, denial management, collections)
      • Financial: General ledger exports, budget files, contract rate tables, fee schedules, capitation rate cards
      • Operational: HRIS staffing data, scheduling system exports, patient volume feeds
      • Payer: Remittance files, eligibility responses, authorization data, quality measure submissions
  • Own the CMS public dataset ingestion pipeline: NCCI edits, NPPES, Provider Utilization, MS-DRG, HCRIS, Hospital Compare, Physician Fee Schedule, and more — with workflow orchestration, schema validation, external tables, and operational copies

Transformation & Modeling

  • Manage the dbt transformation pipeline: input layer ingestion, core data model normalization (claims, members, providers, encounters), and data mart outputs (PMPM, CMS-HCC, readmissions, quality measures, CCSR, chronic conditions, ED classification)
  • Build domain-specific analytical models beyond claims: visit economics (slot utilization, no-show rates, revenue per visit), VBC performance (quality measures, shared savings, risk adjustment), financial forecasting (revenue projections, budget variance, TCOP), and provider benchmarking (peer cohort construction, percentile distributions)
  • Build custom domain tables: contracts, fee schedules, capitation rates, delegation agreements, DOFR surveillance, capitation reconciliation, facility claims audit, TCOP, budget/forecast, JV attribution
  • Operate cross-tenant benchmarking pipelines: zero-copy data sharing, monthly aggregate computation, de-identified metric distributions, CMS national benchmark integration

AI & ML Data Foundation

  • Provide the data foundation for ML models: feature store tables, statistical process control metrics, population surveillance outputs, claim scoring results, vector embedding storage, and feedback loop tables
  • Design and optimize data access patterns for agentic AI workflows — the AI agent queries your data layer autonomously via tool use, so schema design, indexing, and query performance directly impact agent quality
  • Support the analytics query layer: data model definitions mapping to transformation outputs and custom domain tables, tenant security context injection, pre-aggregation optimization, cache invalidation on data refresh

What We're Looking For

Required:

  • 5+ years of data engineering experience with production data platforms at scale
  • Experience with healthcare data — claims formats (837/835/834), clinical data (HL7/FHIR), or revenue cycle data. Deep expertise in at least one; working familiarity with others
  • Deep experience with a cloud columnar warehouse (Redshift, BigQuery, Snowflake, or Databricks) — schema design, bulk loading, external tables, data sharing, query optimization
  • Strong PostgreSQL skills — RLS policies, indexing strategies, transactional DDL, connection pooling
  • Production experience with dbt — model design, incremental materializations, macros, testing, CI/CD integration
  • Experience with cloud data services: object storage, workflow orchestration, serverless compute, event-driven scheduling, data catalogs, encryption
  • Proficiency in SQL and Python for ETL scripting and data validation
  • Experience with schema-per-tenant or row-level-security multi-tenant data architectures
  • Comfort with AI-assisted development — you use AI coding tools daily and understand how to build data systems that AI agents consume
  • Familiarity with HIPAA technical safeguards: encryption at rest/in transit, audit logging, PHI handling
  • Experience building data infrastructure in an early-stage environment; comfortable making architectural decisions without a platform team behind you

Preferred:

  • Experience building data infrastructure consumed by AI/ML systems or autonomous agents (feature stores, tool-use query patterns, RAG data layers)
  • Experience with multiple healthcare data source types: claims + EMR + RCM + financial
  • Familiarity with healthcare data normalization frameworks (Tuva Project or similar)
  • Experience with ANSI X12 EDI parsing (837I, 837P, 835, 834 transaction sets)
  • Exposure to CMS public datasets (NPPES, NCCI, HCRIS, MS-DRG tables, Physician Fee Schedule)
  • Experience with vector databases or extensions (pgvector, Pinecone, or similar) for similarity search
  • Infrastructure-as-code experience (CDK, Terraform, Pulumi, or CloudFormation)
  • Experience operating data pipelines under BAA/HIPAA compliance requirements
  • Founding or early data-engineer experience at a VC-backed healthcare or AI startup (Tempus AI, Lightbeam, Lyra, Virta, Sprinter, Holmusk, Flatiron, Veeva, etc.)
  • Production (not just theoretical) experience building data layers that AI agents query via tool use

Tech Stack

  • Warehouse: Cloud columnar warehouse (Redshift, BigQuery, Snowflake, or similar) with external table and data sharing support
  • Operational DB: PostgreSQL (with RLS, vector extension support)
  • Transformation: dbt, SQL, Python
  • Orchestration: Workflow orchestration (Step Functions, Airflow, Prefect, or similar), event-driven scheduling, serverless compute
  • Storage: Object storage (Parquet, JSON Lines), encryption at rest
  • AI/Agent Layer: LLM platform — the AI agent consumes your data layer via tool use
  • Infrastructure: Infrastructure-as-code, monitoring, IAM
  • Formats: ANSI X12 (837I/837P/835/834), HL7 FHIR, CSV, Parquet, JSON Lines
  • Development: AI-assisted coding tools

Compensation

Competitive salary commensurate with experience, plus equity and benefits.