The Data Menu vs Data Creation Question
Do we only create data products from what we have or do we create data to add to our data products?
There is an interesting trend in how people talk about data products, especially in relation to data mesh. Almost every single person is talking about creating data products out of the data already available from operational systems. They ask: to create my data product, what is available to package as is or transform for consumption?
But, when do we start to think about creating incremental data, data that isn’t part of our day-to-day operations? And who the heck owns that responsibility?
The Question at Hand
So, the real crux of the question is: would it be valuable, with a good return on investment, to create incremental data on the operational plane specifically to be used for analytics?
The over-arching answer definitely feels like “it depends”. If you create incremental data about what eye color your users have, unless you are a makeup company or something, that data is probably worthless and costs money to create and store.
We will also leave the question of ethics to the end. Because if you are collecting incremental data on your users/prospects/customers/employees/etc., depending on what it is, that could be pretty bad. Google stopped using the motto “don’t be evil” for a reason…
The Current State
It seems pretty evident, reading between the lines, that the operational plane is the initial reason data is generated. It is not just the place where the data is created; serving the operational plane is the reason the data exists. Companies may purchase outside datasets but mostly, to create their data products, they are looking at the menu of what already exists internally.
This makes sense. Data mesh requires a pretty big shift and realignment for most companies so asking domains to not only take on creating data products but also asking them to generate incremental data that might be interesting seems like a big ask.
The industry is just trying to get its head around data mesh in general - it feels too early to start talking about generating information purely for analytical purposes that the domain isn’t already generating. Small tweaks to what they are doing to improve analytic “yield” - yes. Major increase - no.
But…Here be (Data) Swamp Dragons
A pretty likely end state of trying to generate a lot of incremental data that might be useful in data products but isn’t necessary to outward-facing products is… a data swamp. That’s kinda what was causing the data swamp issues in the first place.
Yes, we might be able to prevent it but…we are trying to get away from having crap data that we only keep because it is harder to figure out if we should delete said crap data than just paying to store it.
So where do we draw the line? Is it more of a pull - data product consumers asking for incremental data as part of a data product? How would a domain even approach the question of what data would be valuable to create? Zhamak’s overlapping figure of usable, feasible, and valuable re a product also comes to mind…especially feasible.
1…2…3… NOT IT - Who Drives Incremental Data Creation? What About Retirement?
There is the actual underlying incremental data to be created. If that is from customer interactions, the operational plane should own the actual creation of the click stream. But who should own driving that incremental data to be created in the first place? Should the domains have to come up with lots of interesting ideas that people might want to consume?
Doesn’t seem like there is a great answer here yet. Harkening back to “The Intuit Way” as we call it (see their blog post): domain teams should be doing lots of interviews with potential data product consumers to figure out what those data product consumers want.
It seems silly to ask domains to generate every piece of incremental data that data product consumers could want. But, the domains know the data best so they should also be on the lookout for ways to add additional value to their data products, both of their own accord and from requests from consumers. Push and pull with a balance maybe?
There also need to be mechanisms to track consumption of this incremental data in data products. If it turns out the incremental data wasn’t useful, there needs to be a way to easily retire the collection of it on the operational plane. Luckily, the domains should be the ones best equipped to do that. If it’s just a new microservice, easy-peasy. If not, feature flags might be the best way to turn it off? This whitepaper (pdf; not gated) by Accenture has some thoughts about having a data generation lifecycle (see slide 11) which is an interesting concept.
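To make the track-then-retire idea a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `IncrementalField` class, the `collection_enabled` boolean standing in for a real feature flag, and the 90-day window are hypothetical, not something from data mesh literature or the Accenture paper. The idea is simply that each consumer read gets recorded, and collection is switched off for anything nobody has read recently.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class IncrementalField:
    """A piece of data collected purely for analytics (hypothetical model)."""
    name: str
    # Stands in for a feature flag on the operational plane (assumption).
    collection_enabled: bool = True
    reads: list[date] = field(default_factory=list)

    def record_read(self, day: date) -> None:
        """Log that a data product consumer read this field on `day`."""
        self.reads.append(day)

    def reads_since(self, cutoff: date) -> int:
        """Count recorded reads on or after `cutoff`."""
        return sum(1 for day in self.reads if day >= cutoff)


def retire_unused(fields, today, window_days=90, min_reads=1):
    """Turn off collection for fields with too few recent reads.

    Returns the names of the fields retired by this call. The window and
    threshold defaults are arbitrary placeholders, not recommendations.
    """
    cutoff = today - timedelta(days=window_days)
    retired = []
    for f in fields:
        if f.collection_enabled and f.reads_since(cutoff) < min_reads:
            f.collection_enabled = False
            retired.append(f.name)
    return retired
```

In practice the domain would probably want a human to review the retirement candidates rather than flipping the flag automatically, but the bookkeeping would look roughly like this.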
Incremental Data - Mass Data Collection Ethics
The thing people hopefully keep in mind is the ethics of what data is collected and why. There are tons of companies doing some pretty unethical things with data that, at best, are bad. But collecting data that isn’t about product interactions could also cause all sorts of brand issues re PII and why you are collecting it. So we hope domains also keep ethics in mind - some incremental data may have value but your values are more valuable (see what we did there).
There aren’t hard and fast rules but ethics in data collection and analysis should always be a concern for every company. Especially when you catch yourself thinking about what data you might be able to collect instead of what data you could ethically collect to drive a good ROI.
Note:
DML is not the data mesh community; this is an opinion piece