Chapter 12: The Future of Data Systems
So far, this book has been mostly about describing things as they are at present. In this final chapter, we will shift our perspective toward the future and discuss how things should be.
If you have a problem such as “I want to store some data and look it up again later,” there is no one right solution, but many different approaches that are each appropriate in different circumstances. A software implementation typically has to pick one particular approach. It’s hard enough to get one code path robust and performing well — trying to do everything in one piece of software almost guarantees that the implementation will be poor. Thus, the most appropriate choice of software tool also depends on the circumstances. Every piece of software, even a so-called “general-purpose” database, is designed for a particular usage pattern.
Combining Specialized Tools by Deriving Data
For example, it is common to need to integrate an OLTP database with a full-text search index in order to handle queries for arbitrary keywords. Although some databases (such as PostgreSQL) include a full-text indexing feature, which can be sufficient for simple applications, more sophisticated search facilities require specialist information retrieval tools. Conversely, search indexes are generally not very suitable as a durable system of record, and so many applications need to combine two different tools in order to satisfy all of the requirements.
As the number of different representations of the data increases, the integration problem becomes harder. The need for data integration often only becomes apparent if you zoom out and consider the dataflows across an entire organization.
Reasoning about dataflows
When copies of the same data need to be maintained in several storage systems in order to satisfy different access patterns, you need to be very clear about the inputs and outputs. Writing to the database is the only way of supplying new input into this system. Allowing the application to write directly to both the search index and the database introduces a problem: two clients may concurrently send conflicting writes, and the two storage systems may process them in different orders, in which case they become permanently inconsistent with each other.
If it is possible for you to funnel all user input through a single system that decides on an ordering for all writes, it becomes much easier to derive other representations of the data by processing the writes in the same order. Whether you use change data capture or an event sourcing log is less important than simply the principle of deciding on a total order.
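To make this principle concrete, here is a minimal sketch in Python (the class and function names are invented for illustration, not taken from any real system): all writes are appended to a single log that fixes a total order, and each derived store, a toy system of record and a toy inverted index, catches up by replaying that log in order.

```python
# A minimal sketch: all writes go through one append-only log, and each
# derived store applies the events in log order. Not a production design.

class WriteLog:
    """The single place that decides a total order for all writes."""
    def __init__(self):
        self.events = []              # the log: an ordered list of events

    def append(self, event):
        self.events.append(event)
        return len(self.events) - 1   # the event's offset in the total order


class DerivedStore:
    """A consumer that builds its own representation by replaying the log."""
    def __init__(self, apply_fn):
        self.state = {}
        self.apply_fn = apply_fn
        self.offset = 0               # how far into the log this store has processed

    def catch_up(self, log):
        while self.offset < len(log.events):
            self.apply_fn(self.state, log.events[self.offset])
            self.offset += 1


def apply_to_database(state, event):
    state[event["id"]] = event["doc"]             # system of record: whole documents

def apply_to_search_index(state, event):
    for word in event["doc"].split():             # toy inverted index: word -> doc ids
        state.setdefault(word.lower(), set()).add(event["id"])


log = WriteLog()
database = DerivedStore(apply_to_database)
search_index = DerivedStore(apply_to_search_index)

log.append({"id": 1, "doc": "hello world"})
log.append({"id": 2, "doc": "goodbye world"})

database.catch_up(log)
search_index.catch_up(log)
# Both stores saw the same events in the same order, so they cannot disagree
# in the way that concurrent dual writes can.
```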
Derived data versus distributed transactions
The classic approach for keeping different data systems consistent with each other involves distributed transactions. How does the approach of using derived data systems fare in comparison to distributed transactions?
At an abstract level, they achieve a similar goal by different means. Distributed transactions decide on an ordering of writes by using locks for mutual exclusion, while CDC and event sourcing use a log for ordering. Distributed transactions use atomic commit to ensure that changes take effect exactly once, while log-based systems are often based on deterministic retry and idempotence.
The biggest difference is that transaction systems usually provide linearizability, which implies useful guarantees such as reading your own writes. On the other hand, derived data systems are often updated asynchronously, and so they do not by default offer the same timing guarantees.
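To illustrate the idempotence side of that comparison, here is a hedged sketch of a log consumer that tolerates redelivery after a retry. It assumes each event carries a unique, monotonically increasing offset, as Kafka-style logs provide; the class name and event format are made up for the example.

```python
# A minimal sketch of idempotent log consumption: applying the same event
# twice has no further effect, so deterministic retries are safe.

class IdempotentConsumer:
    def __init__(self):
        self.view = {}
        self.last_applied_offset = -1   # in a real system, persisted atomically with the view

    def handle(self, offset, event):
        if offset <= self.last_applied_offset:
            return                      # duplicate delivery after a retry: safely ignored
        self.view[event["key"]] = event["value"]
        self.last_applied_offset = offset


consumer = IdempotentConsumer()
consumer.handle(0, {"key": "a", "value": 1})
consumer.handle(0, {"key": "a", "value": 1})   # redelivered after a crash: no double effect
consumer.handle(1, {"key": "a", "value": 2})
print(consumer.view)                           # {'a': 2}
```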
Batch and Stream Processing
I would say that the goal of data integration is to make sure that data ends up in the right form in all the right places. Doing so requires consuming inputs, transforming, joining, filtering, aggregating, training models, evaluating, and eventually writing to the appropriate outputs. Batch and stream processors are the tools for achieving this goal. The outputs of batch and stream processes are derived datasets such as search indexes, materialized views, recommendations to show to users, aggregate metrics, and so on.
Batch and stream processing have a lot of principles in common; the most fundamental difference is that stream processors operate on unbounded datasets, whereas batch process inputs are of a known, finite size. There are also many detailed differences in the ways the processing engines are implemented, but these distinctions are beginning to blur.
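The shared principles can be sketched in a few lines of Python: the same per-record logic runs over a bounded list (a batch) or an unbounded generator (a stream); only the way results are taken out differs. The names here are illustrative and do not correspond to any real framework's API.

```python
# A sketch: identical processing logic over bounded and unbounded inputs.
import itertools
import time

def count_by_key(records):
    """Incremental counting that works for batch and stream alike."""
    counts = {}
    for record in records:
        counts[record] = counts.get(record, 0) + 1
        yield dict(counts)            # emit the running result after each record

batch_input = ["a", "b", "a"]                        # known, finite size
final_result = list(count_by_key(batch_input))[-1]   # batch: take the final answer
print(final_result)                                  # {'a': 2, 'b': 1}

def unbounded_clicks():
    while True:                                      # a stream never ends
        yield "a"
        time.sleep(0.01)

# stream: take a snapshot of the running result instead of waiting for the end
for snapshot in itertools.islice(count_by_key(unbounded_clicks()), 3):
    print(snapshot)
```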
Maintaining derived state
Derived data systems could be maintained synchronously, just like a relational database updates secondary indexes synchronously within the same transaction as writes to the table being indexed. However, asynchrony is what makes systems based on event logs robust: it allows a fault in one part of the system to be contained locally, whereas distributed transactions abort if any one participant fails, so they tend to amplify failures by spreading them to the rest of the system.
Reprocessing data for application evolution
When maintaining derived data, batch and stream processing are both useful. Stream processing allows changes in the input to be reflected in derived views with low delay, whereas batch processing allows large amounts of accumulated historical data to be reprocessed in order to derive new views onto an existing dataset.
In particular, reprocessing existing data provides a good mechanism for maintaining a system, evolving it to support new features and changed requirements. Without reprocessing, schema evolution is limited to simple changes like adding a new optional field to a record, or adding a new type of record. On the other hand, with reprocessing it is possible to restructure a dataset into a completely different model in order to better serve new requirements. Derived views allow gradual evolution.
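As a hedged illustration of such a restructuring, the following Python sketch re-derives a per-user profile view from a retained log of raw events, replacing a hypothetical one-row-per-event model; the event and field names are invented for the example.

```python
# A sketch of reprocessing for schema evolution: a batch job rebuilds a
# completely restructured view from the full history of raw events.

raw_events = [   # immutable input: the retained historical event log
    {"user": "alice", "event": "page_view", "url": "/home"},
    {"user": "alice", "event": "purchase",  "amount": 30},
    {"user": "bob",   "event": "page_view", "url": "/home"},
]

def rebuild_user_profiles(events):
    """The old view was one record per event; the new view is one profile per user."""
    profiles = {}
    for e in events:
        p = profiles.setdefault(e["user"], {"page_views": 0, "total_spend": 0})
        if e["event"] == "page_view":
            p["page_views"] += 1
        elif e["event"] == "purchase":
            p["total_spend"] += e["amount"]
    return profiles

new_view = rebuild_user_profiles(raw_events)
print(new_view)
# {'alice': {'page_views': 1, 'total_spend': 30}, 'bob': {'page_views': 1, 'total_spend': 0}}
# Because the raw events are retained, the old and new views can be served side
# by side while clients migrate gradually.
```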
The lambda architecture
If batch processing is used to reprocess historical data, and stream processing is used to process recent updates, then how do you combine the two? The lambda architecture is a proposal in this area that has gained a lot of attention.
The core idea of the lambda architecture is that incoming data should be recorded by appending immutable events to an always-growing dataset, similarly to event sourcing. From these events, read-optimized views are derived. The lambda architecture proposes running two different systems in parallel: a batch processing system such as Hadoop MapReduce, and a separate stream processing system such as Storm.
In the lambda approach, the stream processor consumes the events and quickly produces an approximate update to the view; the batch processor later consumes the same set of events and produces a corrected version of the derived view. The reasoning behind this design is that batch processing is simpler and thus less prone to bugs, while stream processors are thought to be less reliable and harder to make fault-tolerant. Moreover, the stream process can use fast approximate algorithms while the batch process uses slower exact algorithms.
However, it has a number of practical problems:
- Having to maintain the same logic to run both in a batch and in a stream processing framework is significant additional effort.
- Since the stream pipeline and the batch pipeline produce separate outputs, they need to be merged in order to respond to user requests (a sketch of such a merge follows this list).
- Although it is great to have the ability to reprocess the entire historical dataset, doing so frequently is expensive on large datasets.
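To make the merge problem concrete, here is a minimal sketch with invented names and data: the batch view is exact but only covers events up to the last batch run, the streaming view covers what has arrived since, and every read has to stitch the two together.

```python
# A sketch of the lambda-style serving layer: every read combines the exact
# (but stale) batch view with the fresh (but approximate) streaming view.

batch_view = {"page:/home": 5000}      # exact counts up to the last completed batch run
realtime_view = {"page:/home": 42}     # approximate counts for events seen since then

def read_count(key):
    # The serving layer must query both systems and combine their answers;
    # the result is only as correct as this stitching logic.
    return batch_view.get(key, 0) + realtime_view.get(key, 0)

print(read_count("page:/home"))        # 5042
```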
Unbundling Databases
At a most abstract level, databases, Hadoop, and operating systems all perform the same functions: they store some data, and they allow you to process and query that data. A database stores data in records of some data model (rows in tables, documents, vertices in a graph, etc.) while an operating system’s filesystem stores data in files — but at their core, both are “information management” systems. The Hadoop ecosystem is somewhat like a distributed version of Unix.
The meta-database of everything
In this light, I think that the dataflow across an entire organization starts looking like one huge database. Whenever a batch, stream, or ETL process transports data from one place and form to another place and form, it is acting like the database subsystem that keeps indexes or materialized views up to date. Viewed like this, batch and stream processors are like elaborate implementations of triggers, stored procedures, and materialized view maintenance routines.
Designing Applications Around Dataflow
The approach of unbundling databases by composing specialized storage and processing systems with application code is also becoming known as the “database inside-out” approach.
Things happen inside a database when a trigger fires due to a data change, or when a secondary index is updated to reflect a change in the table being indexed. Unbundling the database means taking this idea and applying it to the creation of derived datasets outside of the primary database: caches, full-text search indexes, machine learning, or analytics systems. We can use stream processing and messaging systems for this purpose.
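As a hedged sketch of this idea, the following Python code plays the role of a trigger but runs as application code outside the database: it consumes an assumed change stream of an orders table and maintains a derived cache of order totals per customer. The change-event format is invented for the example.

```python
# A sketch of "unbundling": an external stream processor, not an internal
# database trigger, keeps a derived cache up to date from a change stream.

change_stream = [
    {"op": "insert", "order_id": 1, "customer": "alice", "amount": 20},
    {"op": "insert", "order_id": 2, "customer": "alice", "amount": 15},
    {"op": "delete", "order_id": 1, "customer": "alice", "amount": 20},
]

totals_cache = {}   # the derived dataset, living outside the primary database

def on_change(event):
    """Reacts to each change event, like a trigger would inside the database."""
    delta = event["amount"] if event["op"] == "insert" else -event["amount"]
    totals_cache[event["customer"]] = totals_cache.get(event["customer"], 0) + delta

for event in change_stream:
    on_change(event)

print(totals_cache)   # {'alice': 15}
```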
Dataflow: Interplay between state changes and application code
Thinking about applications in terms of dataflow implies renegotiating the relationship between application code and state management. Instead of treating a database as a passive variable that is manipulated by the application, we think much more about the interplay and collaboration between state, state changes, and code that processes them. Application code responds to state changes in one place by triggering state changes in another place.
Stream processors and services
The currently trendy style of application development involves breaking down functionality into a set of services that communicate via synchronous network requests such as REST APIs. The advantage of such a service-oriented architecture over a single monolithic application is primarily organizational scalability through loose coupling: different teams can work on different services, which reduces coordination effort between teams (as long as the services can be deployed and updated independently).
Observing Derived State
At an abstract level, the dataflow systems discussed in the last section give you a process for creating derived datasets (such as search indexes, materialized views, and predictive models) and keeping them up to date. Let’s call that process the write path: whenever some piece of information is written to the system, it may go through multiple stages of batch and stream processing, and eventually every derived dataset is updated to incorporate the data that was written. Figure 12–1 shows an example of updating a search index.
But why do you create the derived dataset in the first place? Most likely because you want to query it again at a later time. This is the read path: when serving a user request you read from the derived dataset, perhaps perform some more processing on the results, and construct the response to the user.
Taken together, the write path and the read path encompass the whole journey of the data, from the point where it is collected to the point where it is consumed (probably by another human). The write path is the portion of the journey that is precomputed — i.e., that is done eagerly as soon as the data comes in, regardless of whether anyone has asked to see it. The read path is the portion of the journey that only happens when someone asks for it. If you are familiar with functional programming languages, you might notice that the write path is similar to eager evaluation, and the read path is similar to lazy evaluation.
The derived dataset is the place where the write path and the read path meet, as illustrated in Figure 12–1. It represents a trade-off between the amount of work that needs to be done at write time and the amount that needs to be done at read time.
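The trade-off can be sketched on a toy search index (an illustration only, not how any particular search engine works): the eager version pays the indexing cost on the write path, while the lazy extreme defers all of the work to the read path.

```python
# A sketch of shifting work between the write path and the read path.

documents = {}        # system of record
inverted_index = {}   # derived dataset maintained eagerly on the write path

def write(doc_id, text):
    documents[doc_id] = text
    for word in text.split():   # eager: the index is updated whether or not anyone searches
        inverted_index.setdefault(word, set()).add(doc_id)

def search_with_index(word):
    return inverted_index.get(word, set())   # read path: cheap lookup

def search_without_index(word):
    # lazy extreme: no precomputation, scan every document at read time
    return {d for d, text in documents.items() if word in text.split()}

write(1, "hello world")
write(2, "goodbye world")
print(search_with_index("world"))      # {1, 2}: the work was paid at write time
print(search_without_index("world"))   # {1, 2}: the work is paid on every read
```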
Aiming for Correctness
With stateless services that only read data, it is not a big deal if something goes wrong: you can fix the bug and restart the service, and everything returns to normal. Stateful systems such as databases are not so simple: they are designed to remember things forever (more or less), so if something goes wrong, the effects also potentially last forever — which means they require more careful thought.
We want to build applications that are reliable and correct (i.e., programs whose semantics are well defined and understood, even in the face of various faults). For a topic that is so important, our understanding and our engineering methods are surprisingly flaky.
Doing the Right Thing
Every system is built for a purpose; every action we take has both intended and unintended consequences. The purpose may be as simple as making money, but the consequences for the world may reach far beyond that original purpose. We, the engineers building these systems, have a responsibility to carefully consider those consequences and to consciously decide what kind of world we want to live in.
We talk about data as an abstract thing, but remember that many datasets are about people: their behavior, their interests, their identity. We must treat such data with humanity and respect. Users are humans too, and human dignity is paramount.
Software development increasingly involves making important ethical choices. There are guidelines to help software engineers navigate these issues, such as the ACM’s Software Engineering Code of Ethics and Professional Practice, but they are rarely discussed, applied, or enforced in practice. As a result, engineers and product managers sometimes take a very cavalier attitude to privacy and potential negative consequences of their products.
A technology is not good or bad in itself — what matters is how it is used and how it affects people.