Data Engineering AI Startup Jobs
Explore startups tagged with Data Engineering and compare hiring activity, company profiles, and direct job links. This page is indexable only when a tag reaches at least 5 companies to avoid thin content.

Databricks
238 jobs
Unified analytics platform for data and AI, helping companies process and analyze big data in the cloud.

Scale AI
80 jobs
Data infrastructure company providing high-quality training data for AI applications, recently partnered with Meta.

Together AI
37 jobs
Cloud platform for running and fine-tuning open-source AI models at scale.

Hightouch
34 jobs
Data activation platform (Reverse ETL, Customer Studio) to sync warehouse data to business tools.

Cribl
29 jobs
Data pipeline platform that gives you control over your observability data.

Astronomer
14 jobs
Company behind Astro, a managed Apache Airflow DataOps platform for data & AI pipelines.

Snorkel AI
14 jobs
Data-centric AI platform for programmatically labeling and managing training data.

PhaseV
5 jobs
ML-driven adaptive trials and clinical development optimization.

Datology AI
4 jobs
AI training data curation platform helping enterprises optimize ML training data at petabyte scale.

Polars
4 jobs
Polars is a blazingly fast DataFrames library written in Rust, offering Python, R, Node.js, and SQL bindings for efficient, multi-threaded data manipulation at scale.

OneSchema
3 jobs
AI-driven CSV and PDF data import automation platform for seamless customer onboarding.

Alloy
Alloy is a data platform for robotics that helps companies process, organize, and search through the massive volumes of sensor, camera, and telemetry data their robots generate. The Sydney-based startup enables natural language search across robot data and automated issue detection, reducing data processing time by up to 90%.

Anaconda
Anaconda provides the world's most popular open-source Python and R distribution for data science and AI development. Serving over 45 million users, its platform enables enterprises to manage packages, environments, and AI workflows at scale with security and governance controls.

Anomalo
Anomalo is an AI-powered enterprise data quality monitoring platform that automatically detects data issues across warehouses and lakes without manual rule configuration. The platform uses machine learning to monitor structured and unstructured datasets for enterprises like Block and Discover Financial.

Apheris
Apheris provides governed, privacy-preserving data access and collaboration for AI and analytics across sensitive datasets.

Artie
Fully managed change data capture (CDC) streaming platform that replicates production databases into data warehouses and lakes in real time. Trusted by Substack, ClickUp, and Alloy, processing over 700 billion rows annually.

Astral
Astral builds high-performance Python developer tooling, including Ruff, uv, and ty, with a focus on fast local workflows and production-grade packaging.

Ayar Labs
Ayar Labs builds optical I/O and in-package photonics technology to reduce data-movement bottlenecks in large-scale AI and high-performance computing systems.

Bindwell
Bindwell is an AI-powered pesticide discovery company whose machine learning models, which run 4x faster than DeepMind's AlphaFold, screen billions of molecules and design safer, more effective crop protection products. Unlike traditional agtech software companies, Bindwell develops and licenses complete proprietary pesticide molecules to major agrochemical companies. Founded by teen entrepreneurs Tyler Rose and Navvye Anand and part of Y Combinator's W25 batch, the company is backed by General Catalyst and Paul Graham.

Biostate AI
A scalable biological data collection service providing multi-omics data for research.

Bronto
Modern logging and observability platform for AI applications and engineering teams, offering fast log ingestion, search, and alerting with a columnar storage architecture.

Colossal Biosciences
Colossal Biosciences is a genetic engineering and de-extinction company using CRISPR technology to restore extinct species like the woolly mammoth and protect critically endangered ecosystems.

Credal.ai
Credal provides a secure AI agent platform for enterprises, enabling teams to build AI agents and MCP-connected workflows across internal data sources with governance controls.

Dagster
Dagster builds open-source and commercial orchestration tooling that helps data teams ship, observe, and scale pipelines with a modern developer experience.

David AI
David AI is the world's first dedicated audio data research lab, building the data layer for next-generation audio AI. Founded by former Scale AI engineers, serving most FAANG companies and major AI labs.

Deepnote
Collaborative cloud data notebook platform for data science and analytics teams.

Definite
Definite combines a cloud data warehouse, metrics layer, notebooks, dashboards, and AI assistant workflows into an all-in-one analytics platform for faster self-serve analysis.

Distyl AI
Data intelligence platform that unifies messy operational data, applies AI agents, and routes insights back into business workflows.

DualBird
DualBird provides a cloud-native hardware-software data and AI infrastructure engine that delivers 10-100x faster performance and 50-90% lower costs through FPGA-based acceleration.

Encharge AI
Developing analog in-memory compute chips and software for energy-efficient AI at the edge.

Eon
Eon is the first cloud backup posture management (CBPM) platform, automating and unifying complex cloud backups into a queryable data lake for fast recovery, compliance, and AI analytics. Founded by the team behind AWS Disaster Recovery, Eon converts idle backup data into an accessible secondary storage layer for enterprise AI workloads.

Espresso AI
Espresso AI uses generative AI and machine learning to automatically optimize SQL queries and reduce cloud compute costs by up to 70-80% for Snowflake data warehouse users. The platform integrates with existing data warehouse setups to analyze and optimize queries in real time using NLP, program synthesis, and reinforcement learning.

Firecrawl
Firecrawl is a web data infrastructure platform that converts websites into clean, structured data optimized for AI applications through a simple API, turning entire websites into LLM-ready markdown or structured data.

Flatfile
AI-assisted data exchange platform that helps teams collect, map, validate, and transform messy customer data before it enters core systems.

Flow Computing
Flow Computing develops Parallel Processing Unit technology to accelerate next-generation CPUs for AI, edge, cloud, and parallel computing workloads.

Fundamental
Fundamental builds large tabular models and enterprise AI infrastructure for prediction and analysis on complex business data, focused on tabular reasoning and decision support.

Grafana Labs
Company behind the open-source Grafana observability stack providing monitoring, logging, and tracing solutions, reaching $400M ARR as a fully remote company across 40+ countries.

Gruve
Gruve delivers AI-native infrastructure, inference systems, and enterprise AI agents for inference-heavy workloads with an emphasis on speed, security, and measurable outcomes.

Hex
Hex is a collaborative analytics workspace that combines notebooks, SQL, data apps, and AI-assisted workflows for data teams.

Junction
Junction (formerly Vital) modernizes healthcare infrastructure with seamless lab testing and device data integration, connecting over 500 wearables and medical devices with 10+ lab networks including Labcorp and Quest across all 50 states.

LlamaIndex
LlamaIndex is a data framework for LLM applications that enables developers to connect, index, and query custom data sources with large language models through their open-source library and LlamaCloud platform.

Mage
Mage is an open-source, AI-native data pipeline platform that enables teams to build, run, and manage data pipelines for integrating and transforming data using Python, SQL, and R. Available as both open-source and enterprise versions, it provides real-time and batch pipeline orchestration.

MotherDuck
MotherDuck is a serverless cloud data warehouse built on the open-source DuckDB engine, enabling fast SQL analytics with no infrastructure to manage. The platform supports hybrid local-cloud execution, allowing analysts to query data seamlessly across laptop and cloud.

Nexthop AI
Nexthop AI builds networking systems for AI-scale data centers, focusing on high-performance switching infrastructure for hyperscale and cloud environments.

Omni
Omni is a modern business intelligence and analytics platform that combines a unified semantic data model with SQL flexibility, enabling AI-powered trustworthy answers in seconds. The platform supports embedded analytics, custom dashboards, and governed data exploration.

Perle
Perle is an AI training data platform that combines human expertise with adaptive workflows to help companies collect, annotate, and evaluate specialized training data for generative AI, LLMs, and RLHF. Their vetted global network of domain experts provides modular solutions for data annotation, enrichment, and adversarial robustness assessment.

Prefect
Prefect builds workflow orchestration and AI infrastructure software that helps teams automate, observe, and manage data and application workflows.

Prior Labs
Prior Labs builds tabular foundation models that understand spreadsheets and databases, enabling instant pattern inference across any dataset without task-specific training. Their flagship model TabPFN, trained on 130 million synthetic datasets, ranks #1 on the TabArena benchmark and scales to 10 million rows, serving Fortune 500 companies like Hitachi.

Profluent Bio
Uses generative AI to design novel proteins and gene editors for therapeutics.

Protege
Protege operates a governed marketplace platform for ethical sourcing of multimodal, real-world AI training data with compliant data exchange capabilities.

Pulse
API-first Document AI that converts PDFs, images, slides, and spreadsheets into structured JSON for RAG, analytics, and automation.

Pytho AI
Provides a unified interface to design AI workflows by connecting data, models, and automations.

Reducto
Reducto provides a high-quality AI document ingestion and parsing API for large language models. The Y Combinator-backed company processes nearly a billion pages monthly for leading AI teams like Harvey and Scale AI.

Relace
Relace provides auxiliary coding models for faster, more reliable AI code generation. Its models are co-optimized with infrastructure, making it easier to deploy production-ready coding agents that achieve state-of-the-art performance across million-line repositories.

Rune
Developer of the world's first DC data centers built exclusively for solar and wind power. Using proprietary chip design and smart controllers, Rune converts stranded and curtailed renewable energy into compute power at generation sites.

San Francisco Compute
SF Compute provides rentable, large, low-cost GPU clusters for AI pre-training workloads. The platform operates as a marketplace connecting AI teams with on-demand high-performance computing capacity, offering flexible access to supercomputing-scale infrastructure with InfiniBand interconnects.

Sapien
Sapien builds AI-native analysts for finance and operations teams, connecting ERP, warehouse, spreadsheet, and operational data. Its agents help CFO and analytics teams find profit drivers, explain variance, and act on messy transaction-level data faster.

Shovels
Shovels builds construction intelligence software that turns fragmented building permit data into actionable market and go-to-market signals through APIs and analytics tools.

Spiral
Spiral is a data infrastructure company that provides a multimodal data platform for AI. The platform unifies governance and exposes a single API for every data modality, including video, audio, geospatial, and text, and is engineered for machine-scale throughput to keep GPUs fully saturated.

Structify
AI-powered data platform that transforms unstructured web data and documents (websites, PDFs, pitch decks, reports) into structured, enterprise-ready datasets. Its proprietary DoRa model navigates and extracts data like a human, enabling real-time web extraction for business intelligence and data workflows.

Supper
AI-native agentic data platform that integrates with SaaS tools and data warehouses, cleanses and normalizes data, and enables self-serve insights through natural language.

Syenta
Syenta develops Localized Electrochemical Manufacturing (LEM) technology for advanced semiconductor chip packaging, enabling scalable, high-density interconnects without traditional lithography. Spun out from the Australian National University, their approach addresses memory bandwidth bottlenecks in AI computing.

Tensormesh
Semantic KV caching layer built for LLM inference, enabling AI applications to reduce inference costs and latency by reusing cached computation across similar prompts.

Tinybird
Tinybird is a real-time data platform that enables data and engineering teams to build real-time data products and APIs at scale. The platform ingests, transforms, and serves large volumes of data with sub-second latency for analytics and operational intelligence.

TinyFish
TinyFish provides enterprise web agents that automate complex web-based workflows and extract structured data from websites at scale. The platform enables Fortune 500 companies like Google and DoorDash to automate web interactions, streamline data collection, and integrate web automation into their business processes.

Tonic AI
Generates realistic synthetic data to power software testing and analytics without exposing sensitive production data.

Tracer
Tracer is the first pipeline monitoring system purpose-built for high-performance computing in life sciences, providing real-time performance metrics, cost breakdowns, and optimization insights for complex computational pipelines.

Transcend
Transcend is an enterprise-grade data privacy infrastructure platform that serves as the compliance layer for customer data. It enables organizations to automate data subject requests, map data across systems, manage consent, and activate data for AI responsibly at scale.

Unlimited Industries
Unlimited Industries is an AI-native construction company that vertically integrates design and build for large-scale infrastructure projects including data centers, energy facilities, and advanced manufacturing. The company's proprietary AI platform can explore tens of thousands of design configurations to optimize costs and timelines, reducing pre-construction engineering from months to weeks. Founded by serial entrepreneurs and backed by Andreessen Horowitz, Unlimited is rethinking how America's critical infrastructure gets built.

Unstructured
Open-source data preprocessing platform that extracts, cleans, and transforms unstructured documents (PDFs, images, HTML, emails) into structured formats optimized for AI and LLM pipelines.

Weka
WEKA builds a cloud and AI data platform that accelerates model training and inference workloads with high-performance, software-defined storage.

ZeroEntropy
ZeroEntropy provides a high-accuracy search API over unstructured data for AI agents and RAG applications. The YC-backed company builds smarter retrieval models enabling AI agents across healthcare, law, and sales.

FAQ

What is the Data Engineering tag page on Fast AI Startup Jobs?
It is a curated landing page that groups AI startup companies tagged with Data Engineering, plus links to their company profiles and available jobs.

How many Data Engineering companies are included?
This page currently lists 72 companies tagged with Data Engineering.

How many jobs are associated with Data Engineering companies?
The companies on this page currently account for 462 listed jobs in our public dataset (subject to regular updates).

What roles are most common at Data Engineering companies?
Based on currently listed jobs for Data Engineering companies, the most common role groups are Engineering (1777), Sales (494), Other (408).

What funding stages are most common among Data Engineering companies?
Common funding stages on this Data Engineering page include Seed (19), Series A (18), Series B (15), Series C (5).

Where do the job links go?
Job links point to official company career pages or public job listings, not re-hosted application forms.

How often is this tag page refreshed?
Data is refreshed on a near-daily cadence as public company and job listings change.