How to get started with data mesh (v0.1)
Scott's opinions only
Feedback welcome (comments, in the Slack, via email, etc.) Help me make this a really useful resource for the community!
This is an opinion piece with some practical advice
Not all domains need to be mapped upfront
Start with 1-3 domains that are bought in
MVP of the data platform = easily produce versioned initial data product(s)
Start to place data engineers into domains but don’t disperse the entire team at the start
If buy-in isn’t 100% -> show the benefit of data mesh at a small scale
Data mesh -> additional resources (data engineers at least) for domains
If you like your tools, use them, just in a decentralized setup
Flow is “things some think re getting started that are wrong” -> “your start plan” -> “recommended content”
This post is designed to help companies get started but it is only Scott’s opinion, not that of the entire data mesh community or Zhamak.
This post is based on information gathered from discussions and content as of early July 2021. Definitely subject to change. Version 0.1 for sure.
If someone is interested in rewriting this for better flow, contact Scott.
Recommended content re some of the points at the end.
Getting Started Misconceptions
I have seen or intuited the following common misconceptions, often from chats in the Slack and a number of private discussions with people/companies struggling to get going with their data mesh journeys. I have tried to give context around why they aren’t true.
You Need to Map/Define all your Domains Upfront
You absolutely do not need to decompose your entire data space into domains to start.
Zhamak has mentioned in a few places, notably her podcast episode with Barry O’Reilly (start ~ min 30), that you want to start with one data product - “Think big, start small, move fast”. In other places, she has mentioned getting going with one to three domains that are bought in on data mesh.
Your domains WILL change/evolve as you implement data mesh, just as they do on the operational side when moving towards microservices and with the natural course of business. As you learn more, you will create new domains, split existing domains that get too large into multiple domains (or subdomains), retire domains, combine elements of domains into new domains, etc.
To sum up: there is no reason to attempt breaking out every domain you have/will have to start. Find the domains that matter with a good first use case.
You Need All Domains/Entire Company Bought in to Data Mesh to Start
Data mesh isn’t something to commit an entire company to lightly. Find domains that are bought in to the concept and have a good initial use case and build out the ROI story as you get more domains interested.
Zhamak has mentioned in a few places starting your data mesh implementation with one to three domains that are bought in. Data mesh may not be perfect for every company/org and figuring out the nuances of implementation at your company will take time and be very different from other companies. This is especially true for domains. E.g. Adevinta has a core data engineering team that manages core data sets - and resulting data products - while domains manage their own data products. See their write-up here.
Delivery Hero talked about how their BUs (equivalent to domains) are moving into their data mesh setup, but slowly. DPG Media has mentioned in the Slack that not every domain/org - e.g. Finance - is participating in their data mesh deployment yet or plans to.
Trying to build to 100% consensus in favor of data mesh prior to launch, especially if things could change considerably for some domains, is not feasible/workable. Find the domains that are feeling the pain (often agility/scalability) and/or find a use case for a first data product - e.g. Intuit needed a single place for customers to see and manage all their subscriptions so they would stop calling customer service to ask. Then, work with those domains to build your first data product(s).
To sum up: you need to find people for whom data mesh is the carrot; it will be much more difficult to get moving if you try to force a domain to comply. Find domains that are bought in and have a good use case to show off to the rest of the company. You can get more domains bought in later.
Your Data Platform Must Support all Use Cases to Start
A self-serve data platform at the start of your implementation needs to essentially be CI/CD for data products. Build for re-use but only build for the initial data product(s) and add functionality later.
The self-serve data platform pillar of data mesh ends up confusing some people. Self-serve for whom? The end answer is “yes, for both the data product producer and consumer”. But to start?
The most important early piece of the data platform is making producing a reliable data product as easy as possible - within reason - for the data product producers. You want a repeatable process that includes observability, monitoring, CI/CD, etc., just as you would for a microservice.
There is a misconception (following this one) that producing data products is a considerable amount of work. If you do not have a platform that eases production, it probably is. Again, just like in microservices. Creating a data product may be a lot of work but actually producing (equivalent of deploying in microservices) shouldn’t be.
And, there is always a desire to support everything that is coming from future data products. If you don’t have a data product serving events, you don’t need to build the tooling out to support serving events yet. MVP for your data platform should support initial use case(s) and be built for re-use in future data products. But nothing more.
To sum up: build out your data platform MVP to make production of data products - as in the versioned product including the data - as easy as possible for the data product producers. Data consumers are important but start by removing headaches for the producers so people can actually trust the data products and domains are/remain bought in that data mesh is worth it.
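As a rough illustration of what “CI/CD for data products” could look like at MVP scale, here is a minimal sketch in which a versioned data product is only published once a few basic checks pass. Everything in it is an assumption made for illustration: the `DataProduct` fields, the `check` rules and the in-memory `Registry` are hypothetical stand-ins for whatever catalog and pipeline tooling you already have.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DataProduct:
    # Hypothetical descriptor; the field names are assumptions for illustration.
    domain: str
    name: str
    version: str       # e.g. "1.0.0"
    owner: str
    schema: dict       # column name -> type name
    description: str = ""


def check(product: DataProduct, rows: list) -> list:
    """Return a list of human-readable problems (an empty list means OK)."""
    problems = []
    if not product.owner:
        problems.append("missing owner")
    if not product.description:
        problems.append("missing description")
    for i, row in enumerate(rows):
        # Every row must have exactly the columns the schema declares.
        if set(row) != set(product.schema):
            problems.append(f"row {i} does not match schema")
    return problems


@dataclass
class Registry:
    # In-memory stand-in for a real catalog or artifact store.
    published: dict = field(default_factory=dict)

    def publish(self, product: DataProduct, rows: list) -> str:
        problems = check(product, rows)
        if problems:
            raise ValueError("; ".join(problems))
        key = f"{product.domain}.{product.name}@{product.version}"
        if key in self.published:
            # Published versions are immutable; changes need a new version.
            raise ValueError(f"{key} already published; bump the version")
        self.published[key] = product
        return key


if __name__ == "__main__":
    registry = Registry()
    subscriptions = DataProduct(
        domain="billing",
        name="subscriptions",
        version="1.0.0",
        owner="billing-team",
        schema={"customer_id": "str", "plan": "str"},
        description="Active subscriptions per customer",
    )
    print(registry.publish(subscriptions, [{"customer_id": "c1", "plan": "pro"}]))
```

The point of the sketch is the shape, not the checks themselves: the producer fills in a small descriptor, and the platform does the repeatable part (validate, version, publish). Support for events, streaming, etc. can be added when a data product actually needs it.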
Data Mesh Just Means Extra Work - Domains Have no Reason to be Bought In
This one is VERY common. It seems to be the “gotcha” that some people assume destroys the entire concept.
In most orgs, moving to data mesh would mean additional resources for a domain, not just the additional responsibilities of serving data as a product
The most common additional resource is data engineers - newly hired and placed inside the domain and/or moved from the central data engineering team into the domains
Some domains will likely want to participate even if it means more responsibility
Potentially look for domains where ad hoc data requests eat up a large portion of their “interrupt buffer/budget”
Domains feel the pain of the current data setup too with fragile pipelines and lack of understanding re data consumer wants/needs
Play to the desire: many domains may not be able to do everything they want because they can’t get the data they need/want; data mesh could change that - but be realistic about what it can offer
CMC Markets (presenting at our July 29 meetup) talked about this in their QCon Plus conference talk (behind a paywall unfortunately)
If a domain’s data is easy to query, that will likely mean more quality analysis of the data, which could mean better insights -> a better product, which helps all members of that team
So will data mesh mean more “work” for a domain? Yes, probably. The offset between creating data products and reduction in ad hoc data requests is likely not 1:1. But, there are additional benefits to the domain AND, in most companies, there should also be additional resources for the domain - data engineers being a big one.
To sum up: data mesh probably means more work for the domain but many domains will likely be bought in to data mesh in spite of more work. Also, data engineers should be deployed into the domains to better help them tackle creating and maintaining/evolving their data products.
You Need to Reorg the Entire Company Upfront for Data Mesh
Your company is not either a 1 or a 0 relative to data mesh. You can have some domains moving towards data mesh while others are not.
An initial data mesh implementation should not impact most of your teams. Potentially, you may begin to split your data engineering team into platform and non-platform but even that feels premature.
You should probably deploy a data engineer or two into the initial domains doing data mesh but it’s not going to be your entire data engineering team moving into domains at the start.
As you get going, there will likely be significant changes in needs (e.g. how you handle data literacy) and in org structure, but those don’t need to be solved upfront. Your big goal here is to enable your domain teams to produce valuable/reliable data products.
To sum up: your initial data mesh deployment doesn’t have to disrupt your current org setup for 95%+ of your people (depending on company size). A data engineer or two probably moves into each initial domain and that’s probably it to start.
Governance Goes Out the Window or Is All-Consuming
For some reason, many think if you implement data mesh, you suddenly have zero control or governance. And that all definitions no longer mean anything.
The definitions part is probably from the data silo days of data marts. But there is a requirement in data mesh for teams to define their data in documentation (and even make it discoverable, so that goes beyond a simple definition).
When you are starting out, agreeing to some global terms is highly advisable. New terms will bubble up that will need global definitions but trying to define everything upfront is a massive waste of brain power. Some terms, especially “customer”, should be defined globally but others should be defined on an as-needed basis. Don’t make the mistake of trying to control for all situations upfront; things will evolve that you didn’t foresee and that is fine.
Access control decisions should be, in general, part of the data product creation process. Credera put out a great piece that initially brought this concept to my attention.
If you have an event-based architecture, you probably want to create a (or better yet, choose an existing) specification early on for harmonization across domains. There should be some thought to how you harmonize data in general too, not just events. But, if your team is prepared for breaking changes, you don’t even have to do that.
To sum up: governance is still important in data mesh; domains don’t get to do whatever they want. But there is a balance: do what you have to do upfront, e.g. automation and a few global definitions, and leave the rest for later, so you can move quickly but still retain required control. Consider how you will harmonize data across domains and potentially use a specification.
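To make “access control decisions as part of data product creation” a bit more concrete, here is a minimal sketch where a data product simply cannot be created without an access policy from its owner. The `AccessPolicy` shape, the `"everyone"` role and the PII rule are all hypothetical, chosen only to illustrate the idea; your own policy model will look different.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AccessPolicy:
    # Hypothetical policy shape: who may read, and whether the data is sensitive.
    allowed_roles: frozenset
    contains_pii: bool


def create_data_product(name: str, owner: str,
                        policy: Optional[AccessPolicy] = None) -> dict:
    """Refuse to create a data product unless its owner supplied a policy."""
    if policy is None:
        raise ValueError(f"{name}: access policy is required at creation")
    # Example of one global rule enforced automatically (an assumption here).
    if policy.contains_pii and "everyone" in policy.allowed_roles:
        raise ValueError(f"{name}: PII products cannot be open to everyone")
    return {"name": name, "owner": owner, "policy": policy}


def can_read(product: dict, role: str) -> bool:
    """Check a role against the policy attached when the product was created."""
    roles = product["policy"].allowed_roles
    return "everyone" in roles or role in roles


if __name__ == "__main__":
    policy = AccessPolicy(allowed_roles=frozenset({"finance-analyst"}),
                          contains_pii=True)
    product = create_data_product("billing.subscriptions", "billing-team", policy)
    print(can_read(product, "finance-analyst"))
    print(can_read(product, "marketing-analyst"))
```

With a single data product, a check like this can be a manual review step; the code only shows where the decision sits in the flow, which is with the data product owner at creation time rather than with a central team after the fact.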
Throw Out All Your Tooling - All Your Previous $ Spent is Wasted
This one is quick: data mesh calls for tools to be used. If anyone says you have to build everything yourself in data mesh, they are oh so very wrong 😅 If you like what you have, use it in a decentralized way 😎 Data lakes are great. One centralized data lake isn’t*.
Your Getting Started Plan
I started this post with a lot of misconceptions. And the pushback to those misconceptions builds up to a plan, but here it is laid out plainly.
If there is one piece of advice: do the needful now, save the hopeful for later.
It can be tough to figure out what NEEDS to be done upfront but many of the “big scaries” don’t need to be solved for perfectly before getting going. E.g. how to do access control in a completely automated way is not required when you have a single data product.
Start with one to three domains that are bought in to the concept of data mesh.
Their reasoning for being bought in isn’t really relevant.
Good first use case(s) is important to show ROI.
Build out an MVP of a self-serve data platform to make producing the initial data products easy on the domains.
Don’t overbuild for use cases you don’t have yet.
Your data product production should essentially work like microservices deployments (so DataOps+) and be lightweight on the domain teams.
Deploy data engineers - if available - only into your chosen domains.
No need to push all data engineers into either platform team or into domains at the start.
Each data product should have access governance from the data product owner at product creation.
If you need data harmonization at the start, choose or create a specification.
If not, be prepared to make breaking changes to your initial data products; there isn’t a requirement to create specifications.
Potentially look to already created specifications.
Look to reuse tools you already know (and love?).
There is no reason to get rid of tools that you like. Love your data lake tooling? Great, move to smaller lakes managed by the domains.
Create initial global definitions but be prepared for those to evolve.
A potential (but also dangerous) easy path is domain_term for some definitions so you do not have to, say, globally define every potentially reusable term upfront. Just be prepared again for potentially breaking changes.
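The domain_term idea from the last point can be sketched as a tiny glossary in which a handful of terms are global and everything else is prefixed with its domain. The `Glossary` class, its method names and the example definitions are hypothetical; the sketch only shows the naming convention and why promoting a term to global later is a breaking change.

```python
class Glossary:
    """Hypothetical glossary: a few global terms plus domain_term entries."""

    def __init__(self):
        self.terms = {}

    def define_global(self, term: str, definition: str) -> None:
        self.terms[term] = definition

    def define_local(self, domain: str, term: str, definition: str) -> None:
        # Domain-scoped terms are stored under the key "<domain>_<term>".
        self.terms[f"{domain}_{term}"] = definition

    def lookup(self, term: str, domain: str = None) -> str:
        # Prefer the domain's own definition, then fall back to the global one.
        if domain is not None and f"{domain}_{term}" in self.terms:
            return self.terms[f"{domain}_{term}"]
        return self.terms[term]  # raises KeyError if the term is undefined

    def promote(self, domain: str, term: str) -> None:
        """Make a domain-scoped term global.

        This is the breaking change: the old "<domain>_<term>" key
        disappears, so consumers that referenced it must be updated.
        """
        self.terms[term] = self.terms.pop(f"{domain}_{term}")


if __name__ == "__main__":
    glossary = Glossary()
    glossary.define_global("customer", "A person or org with a paid order")
    glossary.define_local("billing", "churn", "Subscription cancelled")
    glossary.define_local("marketing", "churn", "No visit in 90 days")
    print(glossary.lookup("churn", domain="billing"))
    print(glossary.lookup("customer", domain="billing"))
```

Two domains can hold different definitions of the same word without blocking each other, which is the appeal. The danger is the same thing: nobody is forced to reconcile them until a consumer needs both.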
Other useful bits:
Find some people to bounce your ideas off of to figure out what is necessary for getting started (I obviously recommend the community Slack for finding them…). Or you can sign up for free data mesh reviews with Scott which usually end up as 60% review, 40% counseling session… 🤣
For every part of your data mesh implementation, ask “is this a one-off, a few-off, or something that needs to be done a lot?” If it is something that will happen a lot, try to automate it as soon as possible. The more self-serve your implementation becomes, the happier everyone will be 😄
If your application (microservices) teams are not currently organized in domains (via Domain Driven Design or similar), the going may be tough.
Internal buy-in is on you. DML will be launching something about great data mesh buy-in content soon but you have to build up enough support to get to a trial run.
Recommended Content
DPG Media, both their presentation at our meetup and their recent write-up. Both cover evolution of their team setup, especially the deployment model of data engineers into domains including a shared data engineer pool for smaller domains.
Adevinta post (on Medium) about their evolution from data products as produced by their core analytics team to domains producing data products. There is a lot of good content around how to make data products discoverable and some core tenets for aligning domain teams with data mesh.
Delivery Hero presentation covering how only some of their BUs (domain equivalent) have moved towards data mesh.
Zhamak on Barry O’Reilly podcast: great commentary starting around min 30 re what you need to get started.
Kolibri Games’ presentation at our meetup re their evolution towards data mesh including their team evolution.
Y’all, just join our meetup group…
*Data Lakehouse gets squishy here because essentially, you pump the raw data in and then create the data product in the data lake. Why not just create the data product?