Are we ready for Data Mesh?
Over the last decade, organizations have realized the immense value of data and have embarked on transformation journeys of their enterprise architectures to become data-driven organizations.
A major realization in this journey came in the form of fragmented data silos, which made it next to impossible to leverage data for advanced decision-making. Hence, organizations started moving towards a new architecture, leveraging the cloud, with a centralized Data Lake and Data Warehouse fed by ETL pipelines that transform the data.
The current architecture (in its simplified form) can be represented as below –
This architecture does help organizations in their quest to be data driven; however, it has the following main issues –
1. The owners of the data engineering [ETL] pipelines have no control over the data sources and no domain knowledge of the data. A lot of coordination is required for data engineering pipelines and data governance.
2. There is no end-to-end ownership of data, which leads to a lack of trust in the data by its consumers. Data discovery often involves manual intervention and is not a seamless, automated process.
3. Creation of data swamps under the guise of data lakes. Since the data source owners, pipeline owners, and storage owners are all different, track of the data available in the lake is lost over time, and the lake ends up holding a lot of undiscoverable data.
4. Separation of the operational and analytical data planes, requiring dual investment in the operational and analytical stacks.
5. The monolithic nature of the Data Lake / Data Warehouse becomes a bottleneck when adding a new feature to the product, introducing a new service, or optimizing a workflow.
Data Mesh offers to fix the above issues by creating distributed, domain-oriented data stores using the following four principles –
1. Domain-oriented ownership — Data Mesh proposes aligning the data architecture with business domains, with each domain team taking complete end-to-end lifecycle ownership of its data and offering DATA AS A PRODUCT to consumers and other domains. If we think about it, this is not a new concept. Conway's Law states that 'Organizations design systems that mirror their own communication structure.' Enterprise architectures therefore often have teams aligned to different domains, each taking complete ownership of the technical stack for that domain. For example, the IT stack of the telecom industry is by and large aligned to domains like CRM, Billing, Order Fulfilment, Provisioning, and Networking, with the respective domain teams owning their respective stacks. Similar parallels can arguably be drawn with other industries (though in telecom the domain-based organizational structure is more prominent due to the standard eTOM model). What separates this principle from otherwise domain-based alignment, however, is that there is no separation between the operational and analytical planes: the domain team has complete end-to-end ownership of the 'Data Product', from the source of the data to its consumption. This includes any data engineering pipelines required to make the data consumable by the business, as well as governance of the data [quality management, master data management, metadata management, lineage, business glossary, etc.]. This principle aims to reduce coordination around data pipelines and data governance and instead enable autonomous domain teams to manage their own data.
Given that large organizations today are adopting hybrid multi-cloud environments for their transformation journeys, these domain data products will clearly not be located on a single cloud but will be split across multiple clouds [and some might still be in transition on-premises].
To comply with the above principle, it is imperative to have a database that can ingest data at petabyte scale; support both operational workloads [row-based read/write] and analytical workloads [columnar, high throughput, write once — read many]; and support the multiple data models needed to service the domain. There are multi-model databases today, but they have not proven themselves at petabyte scale; conversely, the likes of Snowflake / Redshift / Synapse work at petabyte scale but are not multi-model.
2. Data as a Product — This principle introduces a new unit of logical architecture called the DATA PRODUCT. A Data Product encapsulates all the structural components — data, code, policy, and infrastructure dependencies — and shares data as a product autonomously. This proposes a peer-to-peer approach to data collaboration when serving and consuming data, enabling consumers to discover and use data directly from the source. Data is treated as a product and its users as customers, which means the data is made discoverable, trustworthy, usable, valuable, interoperable, and secure. The Data Product is managed against Service Level Objectives, with metrics monitored and adhered to. To provide Data as a Product, the new roles of Data Product Owner and Data Product Developer are defined: cross-functional roles that understand the domain and are responsible for developing, serving, and maintaining the domain's data product to the satisfaction of its data consumers. In a microservices architecture, data serves the code: it maintains state so that the code can do its job of serving business capabilities. With a data product in a Data Mesh, this relationship is inverted: code serves data; the transformation logic exists to create the data and ultimately serve it. Data products share data with each other and are interconnected in a mesh.
To meet the principle of Data as a Product, we need interoperable APIs and protocols that allow fast transfer of data in a hybrid multi-cloud scenario.
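To make the Data Product idea concrete, the sketch below shows what such a unit of architecture might look like as code: metadata, SLOs, output ports, and publication to a catalog that only indexes products while the data stays with the domain. All names (`DataProduct`, `OutputPort`, the SLO fields, the S3 path) are hypothetical illustrations, not a standard Data Mesh API.

```python
from dataclasses import dataclass, field

@dataclass
class SLO:
    freshness_minutes: int   # max age of data when served
    completeness_pct: float  # expected fraction of records present

@dataclass
class OutputPort:
    name: str
    format: str              # e.g. "parquet", "json", "sql-view"
    endpoint: str            # interoperable address consumers can call

@dataclass
class DataProduct:
    domain: str
    name: str
    owner: str               # the Data Product Owner role
    slo: SLO
    ports: list = field(default_factory=list)

    def describe(self) -> dict:
        """Machine-readable metadata that makes the product discoverable."""
        return {
            "domain": self.domain,
            "name": self.name,
            "owner": self.owner,
            "slo": {"freshness_minutes": self.slo.freshness_minutes,
                    "completeness_pct": self.slo.completeness_pct},
            "ports": [p.name for p in self.ports],
        }

# The catalog is only an index for discovery; data stays with the domain.
catalog: dict = {}

def publish(product: DataProduct) -> None:
    catalog[f"{product.domain}/{product.name}"] = product.describe()

billing_usage = DataProduct(
    domain="billing",
    name="usage-records",
    owner="billing-data-team",
    slo=SLO(freshness_minutes=60, completeness_pct=99.5),
    ports=[OutputPort("daily-usage", "parquet", "s3://billing/usage/daily")],
)
publish(billing_usage)
print(catalog["billing/usage-records"]["owner"])  # billing-data-team
```

Consumers would discover a product through the catalog and then pull data directly from its output ports, preserving the peer-to-peer collaboration the principle describes.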
If we now look back at the issues plaguing the current centralised architecture, having domain teams own the Data Product end to end would eliminate the issues created by fragmented ownership. A single data stack for the operational and analytical planes would also rationalise investments and data pipelines.
While the first two principles are about organising and serving the data to eliminate the issues with the current architecture, the remaining two are about enabling the first two.
3. Self-Serve Data Infrastructure Platform — This principle proposes that the underlying infrastructure be provided as a self-serve platform on which autonomous domain teams build their data products. The self-serve platform provides common capabilities that domain teams can leverage to reduce duplication of effort, contain the cost of operation, and, most importantly, avoid large-scale inconsistencies and incompatibilities across domains. By hiding redundant complexity, the platform allows domain teams to be made up of technology generalists. It enables autonomous domain teams (without depending on a centralised platform team) to focus on creating business domain data products end to end. It should promote declarative modelling [like Kubernetes and Terraform] when provisioning the underlying infrastructure. The platform should also enable seamless interoperability between domain Data Products and generate the logs and metrics needed to truly enable the principle of Data as a Product. It is the self-serve platform that implements the policies laid out by Federated Computational Governance.
To meet the principle of a self-serve platform, we need a killer product that can provision across hybrid multi-cloud scenarios yet offer simple declarative interfaces, hiding all the underlying complexity so that domain owners and developers can easily provision the underlying data product platforms.
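The declarative, Kubernetes/Terraform-style model mentioned above can be sketched as follows: the domain team declares *what* it needs, and a (hypothetical) platform reconciler diffs the declaration against current state to work out the provisioning steps. The spec keys and the `reconcile` function are illustrative assumptions, not a real platform API.

```python
# A domain team's declarative spec for its data product platform.
desired = {
    "dataProduct": "billing/usage-records",
    "storage": {"kind": "columnar", "size": "10TB", "cloud": "aws"},
    "pipeline": {"schedule": "hourly"},
    "monitoring": {"slo_dashboard": True},
}

def reconcile(desired: dict, current: dict) -> list:
    """Return provisioning actions needed to move current state to desired.
    Mirrors the Kubernetes/Terraform pattern: declare state, diff, apply."""
    actions = []
    for component, spec in desired.items():
        if component == "dataProduct":
            continue  # identity of the product, not a provisionable resource
        if current.get(component) != spec:
            verb = "update" if component in current else "create"
            actions.append((verb, component, spec))
    return actions

# First run against an empty environment: everything gets created.
print(reconcile(desired, current={}))
# After convergence, the same declaration yields no work.
print(reconcile(desired, current=dict(desired)))  # []
```

The point of the pattern is idempotence: domain teams can re-apply the same declaration safely, and the platform hides which cloud APIs are actually invoked underneath.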
4. Federated Computational Governance — While the first three principles ensure that data is distributed across domains and served as a product, we also need uniformity and interoperability across domains. This is where the fourth principle comes into play. It proposes a federated governance team that provides global standards and protocols, applied to each data product, to create a healthy ecosystem of interoperable data products. The principle proposes an incentive and accountability structure that balances the autonomy and agility of domains with the global conformance, interoperability, and security of the mesh. The governance model delegates responsibility for the modelling and quality of the data to individual domains, and automates the computational instructions that assure data is secure, compliant, of good quality, and usable. The federated governance team decides on the metrics for Data Product SLOs, such as user satisfaction rating, quality metrics, usage, timeliness, completeness, features, discoverability rank, interoperability, and a consistent set of APIs. This provides a feedback loop: for example, if a domain Data Product is used infrequently or has very few users, it will eventually rank low in discovery and, over time, be flagged to the domain team for discontinuation.
To enable the fourth principle as well, we need a governance product that can work in a hybrid multi-cloud environment to enforce common rules on the domain data platforms and collect the metrics.
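"Computational" governance means the global rules are encoded as code and evaluated automatically against each product's metrics, rather than enforced through manual review. A minimal sketch, where the policy names, metric names, and thresholds are all assumptions for illustration:

```python
# Global policies set by the federated governance team (illustrative values).
POLICIES = {
    "min_monthly_users": 5,        # below this, flag for review/retirement
    "min_quality_score": 0.9,      # global data-quality floor
    "require_standard_api": True,  # interoperability rule for the mesh
}

def evaluate(product_metrics: dict) -> list:
    """Return the list of policy violations for one data product."""
    violations = []
    if product_metrics["monthly_users"] < POLICIES["min_monthly_users"]:
        violations.append("low usage: candidate for discontinuation")
    if product_metrics["quality_score"] < POLICIES["min_quality_score"]:
        violations.append("quality below the global floor")
    if POLICIES["require_standard_api"] and not product_metrics["standard_api"]:
        violations.append("non-standard API: breaks interoperability")
    return violations

# A product with healthy quality but almost no users gets flagged,
# closing the feedback loop described above.
metrics = {"monthly_users": 3, "quality_score": 0.95, "standard_api": True}
print(evaluate(metrics))  # ['low usage: candidate for discontinuation']
```

In a real mesh, the self-serve platform would collect these metrics from every domain and run such checks continuously, feeding the results back into discovery ranking and domain-team dashboards.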
So, in summary, a Data Mesh can be explained with the figure below –
There is surely some technical catching up to be done before organizations can embrace Data Mesh in its true spirit; however, that should not deter them from their journey towards implementing it. They can start with a combination of data stores and a wrapper around them to offer Data as a Product, and once the right technology is available in the market, the data stores can be replaced.
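The interim "combination of data stores plus a wrapper" approach can be sketched as a thin facade: one data-product interface over a row-oriented operational store and a column-oriented analytical store. The in-memory dict and list below merely stand in for real databases; the class and method names are hypothetical.

```python
class DataProductFacade:
    """Presents one data product over separate operational and analytical stores."""

    def __init__(self):
        self.row_store = {}     # stands in for an operational (row-based) database
        self.column_store = []  # stands in for an analytical (columnar) store

    def write(self, key, record):
        """Operational path: row-based read/write."""
        self.row_store[key] = record
        self.column_store.append(record)  # replicate for the analytical path

    def read(self, key):
        """Operational lookup of a single record by key."""
        return self.row_store[key]

    def analyze(self, field):
        """Analytical path: scan one field across all records (here, the mean)."""
        values = [r[field] for r in self.column_store if field in r]
        return sum(values) / len(values) if values else None

product = DataProductFacade()
product.write("cust-1", {"amount": 10})
product.write("cust-2", {"amount": 30})
print(product.read("cust-1"))     # {'amount': 10}
print(product.analyze("amount"))  # 20.0
```

Consumers see a single product interface from day one; when a petabyte-scale multi-model database matures, the two stores behind the facade can be swapped out without changing the contract.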