The evolution of data architecture has been driven by the growing importance of data in organizations. From traditional data warehouses to modern data fabric and data mesh approaches, each generation of architecture has addressed specific challenges and opened up new opportunities.

  • The 70s: Hierarchical and network databases

    In the 1970s, computer systems were dominated by centrally managed mainframes. Data was organized using hierarchical or network database models. In the hierarchical model, records form a tree in which every element has exactly one parent, representing one-to-many relationships; in the network model, an element can be linked to many other elements, which also allows many-to-many relationships. The sketch below shows the difference on a small example.
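
    A minimal illustration in Python, with invented record names rather than a real 1970s DBMS API:

    ```python
    # Hierarchical model: every record has exactly one parent (a tree).
    # Here, each course belongs to exactly one department.
    hierarchical = {
        "Sales": {"courses": ["Negotiation", "CRM Basics"]},
        "IT":    {"courses": ["Networking"]},
    }

    # Network model: records may participate in many-to-many links.
    # A course can be linked to several departments at once.
    network_links = [
        ("Sales", "Negotiation"),
        ("Sales", "SQL"),   # SQL is linked to two departments ...
        ("IT", "SQL"),      # ... which a strict hierarchy cannot express.
        ("IT", "Networking"),
    ]

    # Navigating the network model means following links, not walking a tree.
    print([dept for dept, course in network_links if course == "SQL"])  # ['Sales', 'IT']
    ```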

  • The 80s: The client-server model

    In the late 1980s and early 1990s, a new paradigm of data architecture emerged with the advent of the client-server model. It marked a move away from centralized mainframe systems towards distributed systems in which responsibilities were divided between servers (providers of resources or services) and clients (consumers of those services). For databases, this meant that the database management system (DBMS) could run on a server, while users and applications accessed the data from client machines. This approach revolutionized scalability and accessibility and simplified the management of growing data volumes and an increasing number of users.
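
    The division of labour can be sketched with a few lines of standard-library Python: the server process owns the data and answers queries, while the client holds no data and only sends requests over the network. The toy protocol is purely illustrative, not a real DBMS wire format.

    ```python
    import socket
    import threading

    # Toy "DBMS server": owns the data and answers queries over TCP.
    DATA = {"customer:1": "Alice", "customer:2": "Bob"}

    def serve(listener: socket.socket) -> None:
        conn, _ = listener.accept()
        with conn:
            key = conn.recv(1024).decode()            # the client's "query"
            conn.sendall(DATA.get(key, "NOT FOUND").encode())

    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))                   # let the OS pick a free port
    listener.listen(1)
    threading.Thread(target=serve, args=(listener,), daemon=True).start()

    # Toy "client": holds no data, just asks the server.
    client = socket.create_connection(listener.getsockname())
    client.sendall(b"customer:2")
    print(client.recv(1024).decode())                 # Bob
    client.close()
    ```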

  • The 90s: Traditional Data Warehousing

    In the 1990s, the concept of data warehousing fundamentally changed how companies approached the storage and analysis of data. At its core, a data warehouse is a large, centralized repository for data from various sources.

    The architecture uses a three-tier structure: the data source layer, the data warehouse layer, and the front-end client layer. ETL processes (Extract, Transform, Load) were used to pull data from different operational databases, convert it into a consistent format, and then load it into the data warehouse. The data was typically stored in a relational database and organized based on an OLAP cube model (Online Analytical Processing), which allowed for complex analytical and ad-hoc queries.
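
    The ETL pattern itself takes only a few lines to sketch. The example below uses SQLite as a stand-in for both the operational source and the warehouse; table and column names are illustrative.

    ```python
    import sqlite3

    # Stand-in for an operational source system.
    source = sqlite3.connect(":memory:")
    source.execute("CREATE TABLE orders (id INTEGER, amount_cents INTEGER, country TEXT)")
    source.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                       [(1, 1999, "ch"), (2, 4550, "DE"), (3, 1200, "Ch")])

    # Stand-in for the data warehouse.
    warehouse = sqlite3.connect(":memory:")
    warehouse.execute("CREATE TABLE fact_orders (id INTEGER, amount REAL, country TEXT)")

    # Extract: pull raw rows from the source.
    rows = source.execute("SELECT id, amount_cents, country FROM orders").fetchall()

    # Transform: convert to a consistent format (decimal amounts, uppercase country codes).
    cleaned = [(id_, cents / 100.0, country.upper()) for id_, cents, country in rows]

    # Load: write the conformed rows into the warehouse fact table.
    warehouse.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", cleaned)
    print(warehouse.execute("SELECT * FROM fact_orders").fetchall())
    ```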

    [Figure: Data Warehouse architecture]

  • The 2000s: Big Data and Hadoop

    In the 2000s, the proliferation of the internet, social media, and IoT devices led to a drastic increase in data volume, variety, and velocity—giving rise to what is known as "Big Data." Traditional data warehouses could no longer effectively handle these heterogeneous, large volumes of data generated at high speeds.

    The open-source framework Hadoop revolutionized data architecture starting in 2005. It was specifically designed for processing massive amounts of data in computer clusters. The framework introduced the concept of distributed storage and processing, meaning that data was no longer confined to a single storage location but could be stored and processed across multiple nodes.
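
    Hadoop's programming model, MapReduce, splits work into a map phase that runs on each node and a reduce phase that aggregates the shuffled results. The word-count sketch below imitates the three phases in plain Python; real Hadoop jobs are typically written in Java and actually distribute these steps across a cluster.

    ```python
    from collections import defaultdict

    # Each "block" stands in for a chunk of data stored on a different node.
    blocks = ["big data big clusters", "data lakes and data streams"]

    # Map: each node independently emits (key, value) pairs.
    mapped = [(word, 1) for block in blocks for word in block.split()]

    # Shuffle: pairs are grouped by key across all nodes.
    grouped = defaultdict(list)
    for word, count in mapped:
        grouped[word].append(count)

    # Reduce: the values for each key are aggregated.
    print({word: sum(counts) for word, counts in grouped.items()})
    # {'big': 2, 'data': 3, 'clusters': 1, 'lakes': 1, 'and': 1, 'streams': 1}
    ```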

  • The 2010s: Cloud and Data Lake Architectures

    In the 2010s, the concept of cloud computing emerged as a new paradigm, providing scalable resources as a service over the internet. This development had significant impacts on data architecture and led to the creation of data lakes. Unlike traditional data warehouses, which use an ETL process (Extract, Transform, Load) to ingest data, data lakes employ an Extract-Load-Transform (ELT) process: data extracted from various sources is first loaded, untransformed, into cost-effective BLOB storage and transformed there; only the results are then transferred to a data warehouse, which relies on more expensive block storage.
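
    The difference to ETL is purely the ordering of the steps: raw data is landed first and transformed later, when and where it is needed. A minimal sketch, with a Python list standing in for the BLOB store:

    ```python
    import json

    # Load first: raw records land in cheap object storage exactly as extracted.
    blob_store = []
    for raw in ['{"id": 1, "amount_cents": 1999}', '{"id": 2, "amount_cents": 4550}']:
        blob_store.append(raw)                    # no parsing, no cleaning yet

    # Transform later: structure is imposed only when analysis needs it.
    def to_warehouse_row(raw: str) -> tuple:
        record = json.loads(raw)
        return (record["id"], record["amount_cents"] / 100.0)

    print([to_warehouse_row(raw) for raw in blob_store])  # [(1, 19.99), (2, 45.5)]
    ```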

    The need to process large volumes of data in real time gave rise to the Lambda and Kappa architecture models. The Lambda architecture employs a hybrid approach, utilizing both batch and stream processing to gain accurate and up-to-date insights. All incoming data is captured and stored as an append-only log, creating an immutable historical record. This architecture is divided into three layers: the Batch Layer, the Speed Layer, and the Serving Layer. In the Kappa architecture, all data is ingested and processed as an unbounded stream of events. This architecture consists of three main components: stream ingestion, stream processing, and long-term storage. 
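
    The defining trick of the Lambda architecture is the merge in the Serving Layer: a query combines a view precomputed by the Batch Layer with the increments the Speed Layer has accumulated since the last batch run. A minimal sketch with invented counters:

    ```python
    # Batch Layer: a view precomputed from the complete, immutable history.
    batch_view = {"page_views:/home": 10_000}     # state as of the last batch run

    # Speed Layer: increments from events that arrived after that run.
    speed_view = {"page_views:/home": 42}

    # Serving Layer: a query merges both views for an accurate, fresh answer.
    def query(key: str) -> int:
        return batch_view.get(key, 0) + speed_view.get(key, 0)

    print(query("page_views:/home"))  # 10042

    # In a Kappa architecture the batch view disappears: the same number would
    # be produced by replaying the single event stream through one processor.
    ```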

    [Figure: Data Lake architecture]

  • The 2020s: Data Lakehouse

    Data Lakehouses represent a new generation of data platforms: a Data Lakehouse combines the advantages of Data Lakes and Data Warehouses to store structured, semi-structured, and unstructured data in a unified Data Lake. This eliminates the need for separate data silos and allows data teams to perform analyses and derive insights directly from raw data without the need to move or duplicate data. The Medallion Architecture, also known as the "Multi-Hop" architecture, is used for the logical organization of data in a Lakehouse. Its goal is to gradually and progressively improve the structure and quality of data as it flows through each layer of the architecture (Bronze – Silver – Gold).
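
    A minimal sketch of the multi-hop idea: each layer is derived from the previous one, so quality improves step by step while every stage remains reproducible from the raw Bronze data. The record fields are invented for the example.

    ```python
    # Bronze: raw events exactly as ingested, including duplicates and noise.
    bronze = [
        {"user": " alice ", "amount": "19.99"},
        {"user": " alice ", "amount": "19.99"},   # duplicate
        {"user": "bob",     "amount": "45.50"},
    ]

    # Silver: cleaned, typed, and deduplicated.
    silver = {(r["user"].strip(), float(r["amount"])) for r in bronze}

    # Gold: business-level aggregates, ready for reporting.
    gold = {}
    for user, amount in silver:
        gold[user] = gold.get(user, 0.0) + amount

    print(gold)  # {'alice': 19.99, 'bob': 45.5}
    ```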

    [Figure: Data Lakehouse architecture]

  • The 2020s: Data Fabric

    The Data Fabric represents the fourth generation of data platform architecture. Its goal is to make data available anytime and anywhere. A Data Fabric consists of a network of data platforms such as Data Warehouses, Data Lakes, IoT/Edge devices, and transactional databases that interact with each other and are distributed across the enterprise's computing ecosystem. One node in the fabric can supply raw data to another, which then performs analyses. These analyses can be provided as REST APIs within the fabric, allowing them to be used by transactional systems for decision-making. Data assets can be published in various categories, enabling the creation of an enterprise-wide data marketplace. 
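
    A single fabric node can be sketched with the Python standard library alone: it computes an analysis over its local data and publishes the result as a small REST endpoint that other nodes or transactional systems can call. The endpoint path and payload are invented for the example.

    ```python
    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    # Local data held by this node (e.g., raw data supplied by another node).
    ORDERS = [{"country": "CH", "amount": 19.99}, {"country": "DE", "amount": 45.50}]

    class FabricNode(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/analytics/revenue-by-country":
                self.send_response(404)
                self.end_headers()
                return
            revenue = {}                          # the node's "analysis"
            for order in ORDERS:
                revenue[order["country"]] = revenue.get(order["country"], 0.0) + order["amount"]
            body = json.dumps(revenue).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)

    # Other systems in the fabric can now call
    # GET http://127.0.0.1:8080/analytics/revenue-by-country (blocks until stopped).
    HTTPServer(("127.0.0.1", 8080), FabricNode).serve_forever()
    ```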

    [Figure: Data Fabric architecture]

  • Future Concept: Data Mesh 

    Data Mesh is an architectural concept for organizing data in large enterprises. Instead of being stored and managed centrally, data in a Data Mesh is decentralized: it remains within the individual domains or business areas, and mechanisms are introduced to enable access and exchange between these domains.

    Data Mesh is typically based on four principles: domain orientation, self-service, data productization, and infrastructure automation. By implementing a Data Mesh, companies can respond more flexibly to changes, as data management is tailored to the specific needs of individual business areas, while simultaneously increasing the scalability and reusability of data.
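
    The "data as a product" principle can be made concrete as a contract that each domain publishes for its data: an owner, a schema, quality guarantees, and a self-service output port. The sketch below is one hypothetical way to express such a contract; all field names and the storage location are invented.

    ```python
    from dataclasses import dataclass, field

    @dataclass
    class DataProduct:
        """A domain-owned data product, as it might appear in a mesh catalog."""
        domain: str                   # owning business domain
        name: str
        output_port: str              # self-service access point (URI, table, topic)
        schema: dict                  # the contract consumers can rely on
        quality_checks: list = field(default_factory=list)

    orders = DataProduct(
        domain="sales",
        name="orders",
        output_port="s3://sales/orders/",        # hypothetical location
        schema={"order_id": "int", "amount": "float", "country": "str"},
        quality_checks=["no_null_order_id", "amount_positive"],
    )
    print(f"{orders.domain}/{orders.name} -> {orders.output_port}")
    ```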

    [Figure: Data Mesh architecture]

Comparison of Data Architectures

Data Warehouse remains the most common Data Architecture Model

Although new architectures like Data Lakes and Data Meshes are gaining importance, Data Warehouses remain the most common data architecture variant today. They have established themselves as a proven method for centrally storing and analyzing large volumes of structured data. Companies value their reliability and stability, which they have demonstrated over the years. Additionally, Data Warehouses are closely integrated with Business Intelligence (BI) and analytics tools, enabling seamless analysis of stored data.

Another important aspect is the ability of Data Warehouses to efficiently store and process historical data. This allows companies to identify trends, patterns and changes over time and make informed decisions. The centralized storage and management of data in Data Warehouses also support high data quality and consistency, which is crucial for businesses.

Modern Data Warehouse technologies also offer scalability options that allow companies to expand their infrastructure as needed to keep up with the growth of data volumes. 

Architecture Selection Must Be Based on the Organization's Needs

There is no universal architecture suitable for all use cases and every company. Rather, the choice of the appropriate architecture is determined by a variety of factors. These include both current and future use cases, the diversity of the data landscape, as well as the technologies and platforms used. Each organization has its own requirements and challenges that may necessitate a tailored architecture. Therefore, it is essential to develop an architecture that meets both current and future needs while being flexible enough to adapt to changing requirements.

  • DATA WAREHOUSE

    Technology: DBMS

    Platforms: On-prem or Cloud

    Data sources: Structured 

    Data Integration: Batch 

    Data models: Dimensional, data vault 

    Data quality: Assured

    Data Governance: Centralized

    Importance of metadata: Medium

    Usage: Standard reports, ad-hoc analysis

  • DATA LAKE

    Technology: Object Stores

    Platforms: Cloud

    Data sources: All Data

    Data Integration: Copy

    Data models: Schema-less

    Data quality: Unverified

    Data Governance: Undefined

    Importance of metadata: Low

    Usage: Data Science

  • LAMBDA/KAPPA

    Technology: Streaming

    Platforms: On-prem and/or Cloud

    Data sources: Structured and semi-structured

    Data Integration: Stream and Batch

    Data models: Stream and modeled

    Data quality: Monitoring of streams

    Data governance: Minimally defined

    Importance of metadata: Low to medium

    Usage: AI-driven real-time analytics

  • DATA LAKEHOUSE

    Technology: DBMS and Object Stores

    Platforms: Cloud

    Data sources: Hybrid

    Data integration: Copy and Batch

    Data models: Hybrid

    Data quality: Partially assured

    Data governance: Centralized

    Importance of metadata: Medium 

    Usage: Standard reports, ad hoc analysis, data science

  • DATA FABRIC

    Technology: Data virtualization

    Platforms: On-prem and/or Cloud

    Data sources: Structured

    Data integration: Virtual

    Data models: Dimensional, data vault

    Data quality: Monitoring

    Data governance: Hybrid

    Importance of metadata: High

    Usage: Standard reports, ad hoc analysis

  • DATA MESH

    Technology: Various formats and data catalogs

    Platforms: On-prem and/or Cloud

    Data sources: Hybrid

    Data integration: Copy, Batch, Stream

    Data models: Hybrid

    Data quality: Decentralized

    Data governance: Decentralized

    Importance of metadata: High

    Usage: Standard reports, ad hoc analysis, Data Science, AI-driven real-time analysis

Data architectures overview

[Figure: History of data architectures]

As experienced Modern Intelligence experts for holistic end-to-end Data Intelligence, we support companies in the selection and construction of a customized and future-proof data architecture. Get in touch!

Our data management services