This post was originally published on Medium and is a collaboration between Pete Soderling (founder, Data Council & the Data Community Fund), Sarah Catanzaro (partner, Amplify Partners) and Abe Gong (co-founder, Superconductive).
There are dozens of new tools in the fast-growing data ecosystem today. Together, they are reshaping data work in exciting, productive and often surprising ways. The seeds of the data landscape for the next decade have been planted, and they’re growing wildly.
Turns out, cultivating a new ecosystem is messy.
One symptom of messiness is that many of these tools are perceived as competitive, even when they’re not.
The perception of false competition is not surprising, since:
- There’s a lot of overlapping functionality among tools. Partly this is because tools and products need to “stub out” basic functionality — sometimes in areas that aren’t their primary focus;
- Growth trajectories and category boundaries are still uncertain;
- No single tool is yet ubiquitous, even within a given niche;
- Entrepreneurs are incentivized to tell a “Big Story” both when pitching VCs and selling their product.
So the confusion is natural.
It’s also bad. Bad for tool builders trying to focus, bad for investors trying to assess markets, and especially bad for data scientists and engineers trying to build productive data stacks utilizing the best new tooling options.
Time for Clarity
This blog post grew out of conversations in the startup community among people who were tired of “Wait, aren’t you competitors?”-style questions. It is our attempt to start clearing up the tangle.
We’ve encountered these questions enough to recognize the pattern. We’ve seen the unfortunate drag that they impose on adoption and collaboration. Now it’s time to do something about it.
What does your tool NOT do?
To begin untangling this problem, we reached out to dozens of entrepreneurs and open-source maintainers in the data ecosystem and asked two questions:
1. What is your tool uniquely good at?
2. What does your tool NOT do?
We limited our search to open-source projects and pre-Series B companies. Responses were kept short and sweet, and edited for clarity.
All of the participating tools are listed below, in the order in which they responded.
Confusion in the ecosystem won’t clear up overnight, but our hope is that this post is a good start.
Tools and Answers
- Great Expectations/Superconductive: Great Expectations is uniquely good at testing data systems and creating documentation from those tests. Great Expectations also does data profiling. Great Expectations is highly pluggable and extensible, and is entirely open source. It is NOT a pipeline execution framework or a data versioning tool.
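To make the idea of testing data concrete, here is a minimal plain-Python sketch of a declarative expectation check. It deliberately does not use Great Expectations’ actual API; the function name merely echoes the library’s naming style, and the sample data is hypothetical.

```python
# Hypothetical sketch of a declarative data expectation -- plain Python,
# not Great Expectations' actual API.

def expect_column_values_to_be_between(rows, column, min_value, max_value):
    """Check every row's value for `column` against [min_value, max_value]
    and return a result dict describing the outcome."""
    bad = [r[column] for r in rows if not (min_value <= r[column] <= max_value)]
    return {"success": not bad, "unexpected_values": bad}

# Hypothetical sample data: a negative quantity should fail the check.
orders = [{"qty": 3}, {"qty": 7}, {"qty": -1}]
result = expect_column_values_to_be_between(orders, "qty", 0, 100)
```

The appeal of the pattern is that the same declarative check doubles as documentation of what the data is supposed to look like.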
- Databand: Databand is a DataOps solution that’s uniquely good at monitoring production pipelines, detecting issues at the workflow-code, data, or system level, and helping engineers do root-cause analysis. Databand is NOT a point solution for pipeline orchestration, data quality testing, or data versioning.
- Dolt/Liquidata: Dolt is a SQL database with Git versioning. You can commit to, diff, clone, pull, branch, and merge a SQL database just like you would in Git. Dolt is a uniquely good format for sharing data. Dolt is NOT designed for a specific data use case like feature storage or data transformation. Dolt is a general purpose database that may be applied in those use cases.
- Bayes: Bayes is a visual, exploratory data analysis tool. It guides you through recommended visualizations, and enables easy, interpretable insight sharing with interactive narrative-based reports. Bayes is NOT a business intelligence dashboard, nor is it a code-based notebook for programmers.
- Hex: Hex is a computational notebook platform that is uniquely good at sharing. Users can connect to data, develop their analyses, and then easily build a fully-interactive, beautiful app that anyone in their organization can use. Hex is NOT a ML engineering platform or a charting tool.
- Sisu Data: Sisu is a proactive analytics platform uniquely suited to rapidly exploring complex enterprise data and helping analysts explain why key business metrics are changing. Sisu can test hundreds of millions of hypotheses in seconds and guide users to the highest impact drivers of change. Sisu is NOT a predictive or model-building tool, nor a descriptive dashboard.
- Ascend: Ascend is uniquely good at building, running, and optimizing cloud-only data pipelines with significantly less code. Ascend links data to the code that produces it, enabling declarative data pipelines with automated maintenance, data profiling, lineage tracking, cost optimization, and easy integration to databases, warehouses, notebooks, and BI tools. Ascend is NOT a general-purpose Spark solution, but rather, the data engineering platform on top.
- Dataform: Dataform is uniquely good at helping you manage hundreds of datasets within your data warehouse. Dataform helps teams turn raw data into a suite of well-defined, tested, and documented datasets for analytics. Dataform is NOT an extraction tool.
- DataKitchen: The DataKitchen Platform coordinates the people, processes, tools, and environments in data-analytics organizations — orchestrating everything from development sandboxes to workflows, testing, deployment, data operations, monitoring and maintenance. The DataKitchen Platform does NOT replace tools that perform data integration/preparation, ETL, visualization, artifact storage, distributed computing, test data management, performance management, model creation, AI augmentation, security, governance, risk management and compliance.
- Snorkel: Snorkel is a platform for programmatically building training datasets. In Snorkel, rather than hand-labeling data, users write labeling functions which Snorkel combines using theoretically-grounded modeling techniques. Snorkel is NOT about unsupervised learning; it is a human-in-the-loop platform focused on radically accelerating how users inject their domain knowledge into ML models.
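The labeling-function idea can be sketched in a few lines of plain Python. This is not Snorkel’s API: where Snorkel combines votes with theoretically grounded modeling, this sketch uses a simple majority vote, and the example functions and text are hypothetical.

```python
# Hypothetical sketch of programmatic labeling: small functions each encode
# one piece of domain knowledge and vote on a label (or abstain).
# Not Snorkel's API -- votes are combined here by simple majority.

SPAM, HAM, ABSTAIN = 1, 0, -1

def lf_contains_link(text):
    # Messages with links are often spam.
    return SPAM if "http://" in text or "https://" in text else ABSTAIN

def lf_all_caps(text):
    # Shouting is a weak spam signal.
    return SPAM if text.isupper() else ABSTAIN

def lf_mentions_meeting(text):
    # Work chatter about meetings is usually legitimate.
    return HAM if "meeting" in text.lower() else ABSTAIN

lfs = [lf_contains_link, lf_all_caps, lf_mentions_meeting]

def majority_label(text, labeling_functions):
    """Apply each labeling function and return the majority non-abstain vote."""
    votes = [lf(text) for lf in labeling_functions if lf(text) != ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)

majority_label("WIN BIG at http://spam.example", lfs)  # -> 1 (SPAM)
```

Scaling this up, writing ten labeling functions over a million unlabeled examples is far cheaper than hand-labeling the million examples.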
- Transform (stealth): Transform is building a metrics repository which ensures that businesses can capture metric definitions in a standardized, well-formatted, and organized way to streamline analysis and enable decision-making with confidence and speed. Transform is NOT a data pipelining framework or a business intelligence tool.
- Materialize: Materialize is uniquely good at executing and then maintaining PostgreSQL queries (including joins) on top of streaming data, keeping those queries up to date with millisecond latencies under high throughput. Materialize is NOT a time-series database or a streaming microservices platform.
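As an illustration, a standing query in Materialize reads like ordinary SQL; the source, view, and column names below are hypothetical.

```sql
-- Hypothetical example: maintain a per-page view count over a stream
-- of page-view events. Materialize keeps the result incrementally
-- up to date as new events arrive.
CREATE MATERIALIZED VIEW views_per_page AS
    SELECT page_id, count(*) AS view_count
    FROM page_views
    GROUP BY page_id;

-- Reading the view returns current results with low latency.
SELECT * FROM views_per_page WHERE view_count > 1000;
```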
- DataHub/LinkedIn: DataHub is a search and discovery application powered by metadata, and built to boost AI and data science productivity. It features a unique stream-first distributed metadata architecture that has made it successful at LinkedIn’s people and big data scale. It’s NOT a data integration or processing tool, or an orchestrator for running data quality checks.
- Prefect: Prefect is a workflow orchestration tool that allows you to define flows of tasks using a pure Python API, and deploy them easily using modern, scalable infrastructure. Prefect gives you the semantics you need to make robust pipelines, such as retries, logging, caching, state transition callbacks, failure notifications, and more, without getting in the way of your code. Prefect is NOT a no-code tool or infrastructure provider.
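To illustrate one of those semantics, here is a minimal plain-Python sketch of task retries. This is not Prefect’s actual API, just the general mechanism an orchestrator layers onto your functions; the task name and failure behavior are hypothetical.

```python
# Hypothetical sketch of orchestrator-style retry semantics in plain Python
# (not Prefect's actual API).
import functools

def with_retries(max_retries):
    """Re-run a task up to max_retries extra times before giving up."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == max_retries:
                        raise  # out of retries: surface the failure
        return wrapper
    return decorator

calls = {"n": 0}

@with_retries(max_retries=2)
def flaky_extract():
    # Hypothetical task that fails transiently on its first two attempts.
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "data"
```

The point of delegating this to an orchestrator is that retries, logging, and caching stay out of your business logic entirely.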
- Mara/Project A: Mara is uniquely good at composing SQL, Bash & Python scripts into pipelines, which you can run from the command line or through a web UI. Local execution, no queues, no workers, no magic. Mara is NOT a scheduling, data movement, or dependency detection tool.
- dbt/Fishtown Analytics: dbt excels at creating, maintaining, and documenting DAGs of SQL-based business logic in data warehouses. dbt is NOT a general-purpose job orchestrator.
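For illustration, a dbt model is just a SELECT statement, with `ref()` declaring its place in the DAG; the model and column names below are hypothetical.

```sql
-- models/daily_revenue.sql (hypothetical model and columns)
-- ref() wires this model into dbt's DAG: it now depends on stg_orders,
-- and dbt will build, test, and document the two in order.
select
    order_date,
    sum(amount) as revenue
from {{ ref('stg_orders') }}
group by order_date
```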
- Watchful: Watchful is uniquely good at creating high-quality, probabilistically-labeled training data quickly and at scale without an army of human labelers. Watchful lets you build, test, and prototype models fast through feedback. It is NOT a managed labeling service or an analytics tool.
- Preset: Preset is a SaaS-based data exploration and visualization platform by the makers of Apache Superset. Preset is about visualization and data consumption; it is NOT a computation or orchestration platform.
- Kedro: Kedro shines at software engineering best practices for data and ML pipelines. Kedro enables a seamless transition from experimentation to production using a reproducible analytics workflow, I/O abstraction, and pipeline modeling. Kedro is NOT a workflow orchestrator or experiment tracking framework.
- Toro Data: Toro is uniquely good at helping teams deploy monitoring on their data, suggesting what to monitor for and making it easy to do without writing and deploying code. Toro does NOT clean or transform data, nor does it natively control pipelines or workflows.
- Tecton: Tecton is great at curating and serving features. Tecton is NOT a data processing engine (e.g. Spark) or a model management tool. Instead, it leverages existing data processing engines to process raw batch/streaming/real-time data, turn it into features, and deploy the features for training and serving.
- Dagster/Elementl: Dagster is a data orchestrator that’s uniquely good at structuring data applications for local development, testing, deployment, and ops. Dagster pipeline components are authored in any language or framework and combine to form a unified data application through common metadata and tools. Dagster is NOT a processing engine, or a data warehouse/object store.
- Select Star: Select Star is a data catalog and management tool that solves data discovery problems. It’s uniquely good at helping you understand your data — what data you have, where it lives, how it’s structured, and how it’s being used. Select Star does NOT provide a SQL client or ETL processing.
- Monte Carlo (stealth): Monte Carlo is a data reliability platform that includes data monitoring, troubleshooting, and incident management. Monte Carlo is NOT a testing framework, pipeline, or versioning tool.
- Flyte/Lyft: Flyte is uniquely good at iteratively developing scalable, container-native, and repeatable pipelines that connect disparate distributed systems while putting data flow front and center. Flyte is NOT a machine learning platform, but it can be a core component of one.
Starting a Dialogue
Our goal in compiling this list is twofold. On one hand, we want to give credit to these amazing tools, founders, and OSS leaders for pushing the evolution of our data tooling ecosystem forward. On the other, we want to start a helpful dialogue in the community about the intentional limitations of these tools. No one can do everything better than everyone else, not even innovators!
We hope that this article has shed some helpful light on the wild garden that is our modern data-tooling ecosystem. May it continue to thrive as we cultivate it with intention.