
Textbook: Designing Data Intensive Applications

#system design

Preface

Intro

Hello! All of these notes will be in shorthand form, to avoid rewriting the textbook from scratch. Additionally, I am reading/writing this in my vscode, i know i know, editing text in vscode, an editor for code, WITH CODE IN THE NAME, disgraceful, buttt the rest of my LeetCode notes are here so ¯\_(ツ)_/¯. As such, the notes look much prettier (and colored) in vscode, so if you want to read along, plz check my github! - jonathan

Additionally, if this looks bad in full screen, that is because I only read the notes in 2/3rds split screen and I don't have the need to fix it, again: ¯\_(ツ)_/¯.

Backend SWEs have new common buzzwords relating to the storage and processing of data.

NoSQL! BIG DATA! Web scale, Sharding, EvEnTuAl CoNsIsTeNcY, ACID!!! (no not the drug) The CAP Theorem, Clooooud services, (you get the point).

Data intensive applications are pushing the boundaries of what is possible by inventing new buzzwords, strategies, and technologies.

Fortunately, regardless of the rapid change, basic principles remain true no matter what tool or version you are using.

We are here to learn the 'why' of those basic principles.

The goal of these notes is to organize shorthand to learn these concepts.

Who Should Read This Book?

Fellow developers, old friends, random strangers on the internet who have stumbled upon my chicken scratch of notes. Welcome all!

But again, if you are backend these notes are for you.

  1. Role wise this is for:
  • software engineers, the builders

  • software architects, the designers

  • basically anybody making decisions on the architecture of the systems you are building.

  • for people with natural curiosity for what is going on under the hood, rule number 1: its never magic (thanks bobby ;) ) rule number 2: i will not bail you out of jail (also thank you bobby, rip cmsc417 gang)

  2. App wise, this is for:
  • scalable data systems (e.g., supporting web or mobile apps with millions of users)
  • minimizing downtime (e.g., making apps available)
  • maintainable systems (e.g., making systems easier to scale and update as technologies change)
  3. Counter Argument wise, this is for:
  • when someone complains "you're not google or amazon, stop worrying about scale and just use a relational database". You can slap them with a fish and say "fair, while building for scale you don't need is wasted effort, it may lock you into an inflexible design. You should know it's important to choose the right tool for the job and while relational databases are important, they are not the final word on dealing with data". They will then view you as a worthy adversary and challenge you to a fight to the death (dont forget what you learned in your 4 years of computer science undergrad and give them the classic fork() and os.kill_child() leading to instant victory, for legal reasons this is a joke please dont stop reading its like 4 in the morning im sorry, the rest of the notes dont have jokes, also if you laughed at that youre a nerd :) )

Scope Of This Book

Again, the architecture of data systems.

We cover the principles and tradeoffs to data systems and explore design decisions taken by different products.

We leave deployment, operations, security, management, and other areas, to be covered by other books.

Outline Of This Book

  1. Part I: The Basics: our bearings, and definitions

We define the fundamental ideas for the design of data intensive applications.

Chapter | Topic
--------|------
1 | Top down point of view. What are we trying to achieve? Reliability, scalability, maintainability.
2 | Start by comparing different data models and query languages, and see how they are appropriate to different situations.
3 | Storage engines. How databases arrange data on disk so that we can find it again efficiently.
4 | Formats for data encoding (serialization) and evolution of schemas over time.
  2. Part II: From a single machine to a distributed system

Shift from a single machine to data distributed across multiple machines, as is often necessary for scalability, and discuss the variety of unique challenges that arise.

Chapter | Topic
--------|------
5 | Replication
6 | Partitioning and sharding
7 | Transactions
8 | Dive into more details on the problems prevalent in distributed systems
9 | What it means to achieve consistency and consensus in a distributed system
  3. Part III: Heterogeneous systems, where no one database can do it all

Learn about systems that derive some datasets from other datasets, as happens in heterogeneous systems where no one database can do everything well and applications need to connect different databases, caches, indexes, and so on.

Chapter | Topic
--------|------
10 | Batch processing approach to derived data.
11 | Build upon batch processing with stream processing.
12 | Tie everything together and discuss approaches for building reliable, scalable, and maintainable applications in the future.

References and Further Reading

Nothing new here!

Most of everything in this book has been said before! Free references and links as we go along.

Part I: Foundations of Data Systems

Intro

First 4 chapters: fundamental ideas that apply to all data systems, regardless of single machine or distributed across a cluster of machines.

Chapter | Topic
--------|------
1 | Reliability, Scalability, and Maintainability. Terminology and how to achieve these.
2 | Compare data models and query languages in different situations.
3 | Internals of storage engines. Data on disk. Storage engine optimization for different workloads. Performance effects of database choices.
4 | Comparing data encoding and serialization formats, in an environment where requirements and schemas change over time.

Chapter 1: Reliable, Scalable, and Maintainable Applications

Intro

The Internet was built so well that most people think of it as a natural resource like the ocean or mountains, rather than as something man-made.

The problem facing apps today is that they are data intensive instead of compute intensive. The amount, complexity, and speed at which data needs to be processed ends up creating the most issues; rarely is CPU power a limiting factor for these applications.

To start, most data intensive apps today are built from the same standard building blocks and principles:

  • Database: To store data
  • Caches: Store result of expensive operations, to avoid double work
  • Search Indexes: Allow to query by keywords or filter databases
  • Batch Processes: Periodically process large amounts of accumulated data in chunks
  • etc.

These building blocks are so established and well built that most 'sane' engineers wouldn't think of writing a new data storage engine from scratch, because perfectly good ones already exist.

But while there are many existing database systems, each has different characteristics that may benefit different applications as requirements change.

Thinking About Data Systems: Data Systems and API Layers

Data Systems

We can lump databases, queues, caches, etc., each with different patterns, performance, and implementations, under the same umbrella term 'data systems', as each of those acts as a building block for the larger data system.

No single tool can meet all data processing and storage needs, so the work is broken down into tasks handled by different tools, which are then abstracted into a single 'data system'.

And as new data storage and processing tools continue to emerge, this leads to trouble as traditional boundaries blur and categories fade.

  • Redis: Both a datastore and message queue
  • Apache Kafka: A message queue with database-like durability

API Layers

API layers hide implementation details for individual systems, as well as how different systems interact with one another, allowing developers to create a new special-purpose data system from smaller, general-purpose data systems.

The API serves to hide those implementation details from clients and developers, presenting the composition as one special-purpose data system built from smaller, general-purpose components.

This special-purpose data system can coordinate tasks across components to provide contracts or guarantees, such as cache invalidation upon writes so clients see consistent results (sketched below).

This gives API wrappers responsibility for both data storage and system design.
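
To make this concrete, here is a minimal Python sketch (the Cache, Database, and UserStore classes are made-up stand-ins, not any particular product's API) of an API wrapper that composes a cache and a primary database and keeps them consistent by invalidating the cache on writes:

    # Minimal sketch: compose two general-purpose components (a key-value cache
    # and a primary database) into one special-purpose data system behind an API.
    class Cache:
        def __init__(self):
            self._data = {}
        def get(self, key):
            return self._data.get(key)
        def set(self, key, value):
            self._data[key] = value
        def delete(self, key):
            self._data.pop(key, None)

    class Database:
        def __init__(self):
            self._rows = {}
        def read(self, key):
            return self._rows.get(key)
        def write(self, key, value):
            self._rows[key] = value

    class UserStore:
        """The special-purpose data system: callers never touch the cache or DB directly."""
        def __init__(self, cache, db):
            self.cache = cache
            self.db = db

        def get_user(self, user_id):
            # Read path: check the cache first, fall back to the database on a miss.
            user = self.cache.get(user_id)
            if user is None:
                user = self.db.read(user_id)
                if user is not None:
                    self.cache.set(user_id, user)
            return user

        def update_user(self, user_id, profile):
            # Write path: write to the database, then invalidate the cache so the
            # next read sees a consistent result (the guarantee this API promises).
            self.db.write(user_id, profile)
            self.cache.delete(user_id)

    store = UserStore(Cache(), Database())
    store.update_user(170, {"name": "bob"})
    print(store.get_user(170))  # {'name': 'bob'}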

Diagram: Simple Example of an Architecture for Data System

                            +--------+
                            | Client |          
                            +--------+          
                                ^               
                                |                            
                                |    Client Requests                
                                |               
                                v  
                        +---------------------+                 Asynchronous tasks
                        |   Application Code  | _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
                     /  +---------------------+ \                                       |
 Reads requests     /           |                \                                      |
 and first checks  /            |                 \  Search                             |
 if the data      /             |                  \ Requests                           |
 is cached       /              |                   \                                   |
   +-----------+        +----------------------+     +-----------+                +-----------+  
   | In-memory |        |   Primary Database   |     | Full-text |                |  Message  |
   |   Cache   |        +----------------------+     |   index   |                |  Queue    | 
   +-----------+                |                    +-----------+                +-----------+
                ^               |                   /                                   |
                 \              | Capture          /                                    |
                  \             | changes to      /                                     |
                   \            | data           /                                      |
        Invalidate  \           v               /   Apply updates                       |
        or update    \  +---------------------+     to search index          +---------------------+
        the cache       |   Application Code  |                              |   Application Code  |
                        +---------------------+                              +---------------------+
                                                                                        |
                                                                                        |   e.g. send email
                                                                                        V

                                                                                    'outside world'

And this is where the interesting questions arise!

  • How do you provide good performance?
  • How do we handle increase in load?
  • etc.

Let's put that in our back pocket for now and continue.

Reliability: Fault-Tolerant Systems vs Hardware, Software, and Human Errors

Intro

Reliability: a system should 'continue to work correctly, even when things go wrong.'

The flip side of reliability, or what causes it to break, is faults and failures. A fault is a bad result from a single component within the larger data system. A failure is a bad result from the data system as a whole.

We make our systems fault-tolerant by triggering faults deliberately to ensure faults are handled gracefully when they occur naturally.

Reliability is defined as a system continuing to work correctly, performing the correct function at the desired level of performance, even in the face of adversity: hardware faults, software faults, human error, etc..

Basic expectations for reliable software can include:

  • Application performs function user expected
  • Tolerates user making mistakes or using software in unexpected ways
  • Acceptable performance for the required use case under the expected load and volume
  • System prevents unauthorized access and abuse

In summary:

A system that anticipates faults is fault-tolerant or resilient.

A fault is our system design wrapper word for all the things that can go wrong. A fault is usually defined as one component of the system deviating from its expectation.

A failure is when the system as a whole stops providing the required service to the user.

It is nearly impossible to reduce the probability of faults to zero, which is why we prefer fault-tolerant systems.

Systems become fault-tolerant by triggering faults deliberately to test our tolerance. Many critical bugs are due to poor error handling. Thus, by continuously testing we ensure faults will be handled gracefully when they occur naturally.
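
As a toy illustration (not a real chaos-testing tool, and the FaultInjectingClient / EchoClient names are made up), a wrapper like the following can inject faults into a small fraction of calls so the caller's error handling is exercised continuously rather than only during a real outage:

    # Toy sketch of deliberately triggering faults to exercise error-handling paths.
    import random

    class EchoClient:
        def call(self, request):
            return f"ok: {request}"

    class FaultInjectingClient:
        def __init__(self, real_client, fault_rate=0.01):
            self.real_client = real_client
            self.fault_rate = fault_rate

        def call(self, request):
            if random.random() < self.fault_rate:
                raise ConnectionError("injected fault")  # simulate a failing component
            return self.real_client.call(request)

    client = FaultInjectingClient(EchoClient(), fault_rate=0.2)
    for i in range(5):
        try:
            print(client.call(i))
        except ConnectionError:
            print(f"request {i} failed; the caller handles it gracefully (retry/fallback)")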

Hardware Faults

Intro

Hardware faults are fairly common when discussing system failures

Hardware faults that can cause full system failures:

  • Hard disk crashes
  • Faulty RAM
  • Power grid blackout
  • Unplugging the wrong network cable

In data centers especially, hardware faults are common. For a given storage cluster that contains 10,000 disks, one disk dies per day on average, as the MTTF (mean time to failure) of a hard disk is about 10 to 50 years (with an MTTF around 30 years, roughly 10,000 days, 10,000 disks yield about one failure per day).

Solution: Single Hardware Redundancy

The pattern being, when one physical component dies, the redundant component can take its place while its original is replaced.

The first solution to hardware faults is to add redundancy to individual hardware components: RAID configurations for disks, dual power supplies for servers, hot-swappable CPUs, or datacenters with batteries and diesel generators for backup power.

While this does not stop the physical hardware from breaking, it's helpful in keeping machines running uninterrupted for years.

Solution: Shift to Software Fault Tolerance for Loss of Entire Machines

A shift towards systems that can tolerate the loss of entire machines, by using software fault-tolerance techniques in addition to hardware redundancy, increases availability.

As data volumes and application computing demands increase, applications have begun using a larger number of machines, which proportionally increases the rate of hardware faults.

In some cloud platforms (AWS), it is fairly common for virtual machine instances to become unavailable without warning as the platforms are designed to prioritize flexibility and elasticity over single-machine reliability.

This fault tolerance has operational advantages: single-server systems require planned downtime if you need to reboot the machine, but a system that can tolerate machine failure can be patched one node at a time, without downtime of the entire system.

Applications that require high availability have shifted to software fault tolerance as a solution for hardware faults.

Software Errors

Intro

Software faults are harder to anticipate because they are 'non-random' across nodes. They tend to cause more system failures than 'random' hardware faults. They are 'non-random' because they occur in response to actions such as bad input.

Among the first that come to mind as reasons behind full system failures:

  • Software bug that crashes app instance on bad input
  • Runaway process using up shared resources, CPU, memory, disk, bandwidth
  • Dependent services slowing other services down
  • Cascading faults from service to service leading to fault

These 'non-random' faults lie dormant for a long time until they are triggered by an unusual set of circumstances.

Solution: Multiple Small Fault Tolerant Solutions Working Together

There is no one quick solution to faults in software, but lots of small solutions put together can help the odds:

  • Think about assumptions and interactions between systems
  • Thorough testing
  • Process Isolation
  • Allowing processes to crash and restart (see the sketch after this list)
  • Measuring, monitoring, and analyzing system behavior in production
  • Using some full system guarantee (incoming messages in queue = outgoing messages)

Layers of fault-tolerance within a software system enable the overall system to become more fault-tolerant.
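
A minimal sketch of the 'let it crash and restart' item from the list above, with a made-up worker and a supervisor loop (real systems use process supervisors or orchestrators for this):

    # Supervisor loop: run a worker, and if it dies, restart it rather than
    # continuing in a possibly corrupted state.
    import random
    import time

    def worker(job):
        # Hypothetical worker that occasionally hits a dormant bug.
        if random.random() < 0.3:
            raise RuntimeError("unexpected state, crashing on purpose")
        return f"processed {job}"

    def supervise(jobs, max_restarts=10):
        restarts = 0
        i = 0
        while i < len(jobs):
            try:
                print(worker(jobs[i]))
                i += 1
            except RuntimeError as e:
                restarts += 1
                if restarts > max_restarts:
                    raise  # escalate: the fault has become a failure
                print(f"worker crashed ({e}), restarting (attempt {restarts})")
                time.sleep(0.01)  # brief backoff before restarting

    supervise(["a", "b", "c"])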

Human Errors

Intro

Humans are known to be unreliable, even with the best intentions.

A study of large internet services showed that configuration errors by operators were the leading cause of outages, whereas hardware faults caused only 10-25% of outages.

Solution: Separate Small Fault Tolerant Approaches Working Together

Human configuration error can occur anywhere in the system, making it hard to anticipate. But lots of small solutions can help:

  • Design systems with minimal opportunity for error: APIs, admin interfaces, good abstraction
  • Decouple places where people make the most mistakes from places where they can cause failures
  • Test thoroughly on all levels, from unit tests to whole system integration tests
  • Allow quick and easy recovery from human errors to minimize impact
  • Telemetry for performance metrics and error rates

We make the system as fault-tolerant as possible and anticipate a wide array of human errors in order to account for the future.

How Important is Reliability?

Reliability matters just as much for mundane applications, such as photo applications for parents, as it does for power stations and air traffic control.

In development, there will be situations where we can sacrifice reliability in order to reduce development cost, but we should always be aware of the potential effects of cutting corners.

Scalability: Plans to Cope With Load, System Load Parameters, Performance Distribution

Intro

Scalability: as a system grows in data volume, traffic volume, complexity, etc., there should be a reasonable solution ready to deal with that growth.

Scalability is a system's ability to cope with increased load across different target parameters. 'if the system grows in a particular way, what are our options for coping with that growth' or 'how can we add computing resources to handle the additional load'

If a system is reliable today, there is no guarantee that it will work reliably in the future as load changes.

Increased load is a common reason for degradation, as jumping from 10,000 concurrent users to 100,000, or from 1 million to 10 million, will require the system to process much larger volumes of data than before.

Describing Load: Scaling Twitter With Fan Out

Intro

Load: described by load parameters dependent on the architecture of your system.

Load parameters for a system could be:

  • Requests per second to a web server.
  • Ratio of reads to writes in a database.
  • Number of simultaneously active users in a chat room
  • Hit rate on a cache
  • The average case or a bottleneck of some specific case
Twitter Load Parameters

Twitter's two main operations are PostTweet() and HomeTimeline(), which essentially end up being writes() and reads().

Thus, our load parameters are the number of reads and writes to a database, as well as the ratio of reads to writes.

Twitter Write() and Read()
  1. PostTweet()

User: can publish a new message to their followers: ~4.6k requests/sec on average, over ~12k requests/sec at peak

  2. HomeTimeline()

User: can view tweets posted by the people they follow: ~300k requests/sec.

Twitter Problem To Solve: Where to Fan Out

Handling 12,000 writes per second is simple.

The issue lies not with tweet volume, but with fan out from tweeter to follower. Each user follows many people, and each user is followed by many people. A user has ~75 followers on average.

There are two main ways of implementing these two operations, one focusing on the Read() action and one focused on the Write() action.

1. SQL Merge: Fan Out on Read

Optimized for Write: Write() frequent, Read() infrequent

Idea: Do no extra work on Write/PostTweet(); just insert the tweet into a global tweets table.

Main Work: on Read() -> query + merge

User PostTweet():

  1. Insert into global tweets table

User HomeTimeline():

  1. Look up list of people they follow
  2. Fetch tweets for each of those users,
  3. Merge them and sort by timestamp
Read()
User requests timeline → query who they are following and merge:

    -- 1. FROM: start with the base table of all tweets
    SELECT 
        tweets.*, -- SELECT: grab tweet info (id, text, timestamp, etc.)
        users.*   --         grab user info (name, username, etc.)
    FROM tweets

    -- 2. JOIN users: merge tweets with the users table based on author id
    --    Users Table: user info
    --    Tweets Table: tweet info
    --    Result Table: columns (user + tweet)
    JOIN users 
        ON tweets.sender_id = users.id

    -- 3. JOIN follows: merge the previous result with the follows table based on author
    --    Follows Table: follower -> followed
    --    Result Table: columns (user + tweet + follower (reader) + followee (author))
    JOIN follows 
        ON follows.followee_id = users.id

    -- 4. WHERE: filter to keep only rows where the current user is the follower
    --    Result Action: return (tweets.* + users.*) where
    --                   follows.follower_id matches the current user's id
    WHERE follows.follower_id = current_user

    -- 5. ORDER BY: sort by tweets.timestamp DESC to show newest tweets first
    ORDER BY tweets.timestamp DESC

    -- 6. LIMIT 50: only return the first 50 tweets
    LIMIT 50

     For current logged-in user: 170
                                  |
| - - - - - - - - - - - - - - - - |
|
|     follows table                       tweets table
|     +---------------------------+       +----------------------------------------------+
|     | follower_id | followee_id |       | id (tweet_id) | sender_id | text | timestamp |
|     +---------------------------+       +----------------------------------------------+
| - > | 170         | 12          |       | 20           | 12        | "hi"  | 114297    |
      +---------------------------+       +----------------------------------------------+
                    / |                                     ^           
                   /  | - - - - - - - - - - - - - - - - - - | 
                  /                                                         
                 /                                         
                /  users table                                      
               /   +-----------------------------------+  
              /    | id  | screen_name | profile_image |
             /     +-----------------------------------+   
            - - - >| 12  | alice       | alice.jpg     |
                   | 170 | bob         | bob.jpg       |
                   +-----------------------------------+
2. Cache User Timeline: Fan Out on Write

Optimized for Read: Read() frequent, Write() infrequent

Idea: Do little extra work on read, grab precomputed 'mailbox' (cache) of tweets for each user.

Main Work: on Write() -> copy tweet into many followers cache

User PostTweet():

  1. Find all their followers
  2. Insert that tweet into each follower's cached home timeline.

User HomeTimeline():

  1. Return their cached timeline, no merging required
Write()
User posts a tweet → update all timelines cache:

                                   +--------+
                                   |  User  |
                                   +--------+
    POST Tweet:                        |
    4.6 k writes/sec                   |
                                       v
                                +---------------+
                                |  All Tweets   | 
                                | (Primary DB)  |     
                                +---------------+
                                       | 
    Average: 75 followers        |-----------------|-----------------|
    Write to 75 Timeline Caches  |                 |                 |
    4.6 * 75 = 345               v                 v                 v
    345k writes/sec     +--------------+   +--------------+   +--------------+
                        | Timeline     |   | Timeline     |   | Timeline     |  
                        | Cache: User1 |   | Cache: User2 |   | Cache: User3 |
                        +--------------+   +--------------+   +--------------+
    READ Timeline:
    300k reads/sec  
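
The diagram above, written as a minimal Python sketch of fan-out on write (timelines are plain in-memory lists here; the followers and timelines structures are illustrative, not Twitter's actual storage):

    from collections import defaultdict

    followers = defaultdict(set)   # author_id -> set of follower ids
    timelines = defaultdict(list)  # user_id   -> cached home timeline (newest first)

    def follow(follower_id, author_id):
        followers[author_id].add(follower_id)

    def post_tweet(author_id, text):
        # Main work happens at write time: copy the tweet into every follower's cache.
        tweet = {"author": author_id, "text": text}
        for follower_id in followers[author_id]:
            timelines[follower_id].insert(0, tweet)

    def home_timeline(user_id, limit=50):
        # Read is cheap: return the precomputed 'mailbox', no merging required.
        return timelines[user_id][:limit]

    follow(170, 12)
    post_tweet(12, "hi")
    print(home_timeline(170))  # [{'author': 12, 'text': 'hi'}]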
Twitter Iterations: Tried Solution 1, then Solution 2 as Reads() >> Writes()

The first version of Twitter used approach 1. However, systems struggled to keep up so they switched to approach 2.

It ended up performing better, as the average rate of published tweets is almost two orders of magnitude lower than the rate of home timeline reads. So it's preferable to do more work at write time and less at read time.

The downside of approach 2 is posting a tweet now requires more work. On average, a tweet is delivered to about 75 followers. So 4.6k tweets per second becomes 345k writes per second to user's home timeline caches.

Twitter Iterations: Celebrity Fan Out Problem

Some users have over 30 million followers, which means a single tweet may result in over 30 million writes to timelines, and as Twitter tries to deliver tweets to followers within five seconds, this creates a significant challenge.

Twitter Iterations: Celebrity Fan Out Solution: Celebrity Merge + User Cache Timeline Hybrid

Twitter eventually moved to a hybrid of both approaches, with most users' tweets being fanned out to timelines at the time they are posted (Solution 2: Cache).

But tweets from celebrities with a large number of followers are fetched separately and merged into the user's home timeline when it is read (Solution 1: Merge), to avoid the overload of writing to 30 million timeline caches.
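
A hedged sketch of the hybrid, extending the fan-out-on-write sketch above: ordinary users are fanned out at write time, while authors above a follower threshold keep their own list of recent tweets that is merged in at read time. The threshold and data structures are illustrative assumptions, not Twitter's actual design.

    from collections import defaultdict

    CELEBRITY_THRESHOLD = 1_000_000

    followers = defaultdict(set)          # author_id -> follower ids
    following = defaultdict(set)          # user_id   -> author ids they follow
    timelines = defaultdict(list)         # user_id   -> cached non-celebrity timeline
    celebrity_tweets = defaultdict(list)  # celebrity_id -> their recent tweets
    clock = 0                             # stand-in for a timestamp

    def follow(follower_id, author_id):
        followers[author_id].add(follower_id)
        following[follower_id].add(author_id)

    def post_tweet(author_id, text):
        global clock
        clock += 1
        tweet = {"author": author_id, "text": text, "ts": clock}
        if len(followers[author_id]) >= CELEBRITY_THRESHOLD:
            celebrity_tweets[author_id].append(tweet)   # no fan-out for celebrities
        else:
            for f in followers[author_id]:
                timelines[f].insert(0, tweet)           # fan-out on write

    def home_timeline(user_id, limit=50):
        merged = list(timelines[user_id])
        for author_id in following[user_id]:
            if len(followers[author_id]) >= CELEBRITY_THRESHOLD:
                merged.extend(celebrity_tweets[author_id])  # merge at read time
        merged.sort(key=lambda t: t["ts"], reverse=True)
        return merged[:limit]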

We will revisit this example in Chapter 12 after having covered more technical ground.

Describing Performance: Testing Load Effect On Performance

Intro

Once you have described load and load parameters on your architecture and system, you can investigate what happens when load increases.

To ensure scalability, it is important to test how different loads affect performance.

  • When you increase a load parameter and keep system resources unchanged, how is the performance of your system affected?

  • When you increase a load parameter, how much do you need to increase the resources if you want to keep performance unchanged?

Both of these require performance numbers, so let's define how to quantify the performance of a system.

Performance Distribution: p50, p95, p99, p999 and Client Load Generators

Throughput could be the first thing that comes to mind when thinking about numerical performance, such as in batch processing systems.

Throughput would be the number of records we can process per second, or the total time to complete a job on a dataset of a certain size

However, in online systems, the response time is usually more important

Response time would be the total time between a client sending a request and receiving a response

A single request sent multiple times will get slightly different response times each time it's sent. In practice, a single system is handling a variety of different requests, leading the response time to vary. Therefore, we need to think of response time not as a single number, but as a distribution.

Outliers may seem slower because they could be processing more data, but even in the case where all requests do the same work, random additional latency could be introduced by the loss of a packet, a TCP retransmission, a garbage collection pause, etc.

Diagram: High Percentile Response Time
                        Response Time
                        (ms)
  99th percentile ----  |          *                                          
                        |          *                                          
                        |          *                                         
                        |          *                 *                       
  95th percentile ----  |  *       *       *         *        *              
                        |  *       *       *         *        *              
  Mean  --------------  |  *       *   *   *       *   *    *   *   *        
  Median -------------  |  *   *   *   *   *   *   *   *    *   *   *        
                        |  *   *   *   *   *   *   *   *    *   *   *        
                        |  *   *   *   *   *   *   *   *    *   *   *        
                        ----------------------------------------------------
                          R1  R2  R3  R4  R5  R6  R7  R8  R9  R10  R11     Request Number

Average response time is commonly reported, but note that 'average' does not refer to any particular formula (in practice it usually means the arithmetic mean), and even then it is not a good metric.

Usually percentiles are better, starting with the median. If the median response time is 200 ms, half of your requests return in less than 200 ms, and half of the requests take longer than that. The median is also known as the 50th percentile or p50.

The outliers are then the 95th, 99th, and 99.9th percentiles (p95, p99, p999). Similarly, if p95 is 1.5 seconds, that means 95 out of 100 requests take less than 1.5 seconds and 5 out of 100 take 1.5 seconds or more.

These outliers are known as tail latencies and are important because they directly affect the user's experience of a service.

In practice, tail latencies may be the most important, as the customers with the slowest requests are often those that have the most data on their accounts (because they've made the most purchases) and are thus valuable customers.

Happy customer = larger profit. A study found that a 100ms increase in response time reduces sales by 1%.

But on the other hand, optimizing the p999 may be too expensive and difficult as it is easily affected by random events outside of your control with diminishing returns.

Another use of percentiles is in service level objectives (SLOs) and service level agreements (SLAs), which are contracts defining the expected performance and availability of a service.

  • Def: Head-of-line blocking: when a small number of slow requests hold up the processing of subsequent requests, clogging the limited parallelism of a system

It only takes a small number of slow requests to slow down a system as a whole, which is why we must measure response times on the client side rather than server side.

When generating load artificially to test the scalability of a system, the load-generating client needs to keep sending requests independently of the response time. If a client waits for a previous request to complete before sending the next one, that behavior keeps the queues shorter in the test than they would be in reality due to head-of-line blocking, which would skew the measurements.
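
A small sketch of that idea: an open-loop load generator that fires requests on a fixed schedule regardless of how long earlier requests take (send_request is a placeholder for calling the real system under test):

    # Open-loop load generation: don't wait for the previous response before
    # sending the next request, so queueing shows up the way it would in production.
    import random
    import threading
    import time

    def send_request(i):
        time.sleep(random.uniform(0.01, 0.2))  # simulated service latency
        print(f"request {i} done")

    def open_loop_load(rate_per_sec=20, duration_sec=2):
        interval = 1.0 / rate_per_sec
        deadline = time.time() + duration_sec
        threads, i = [], 0
        while time.time() < deadline:
            t = threading.Thread(target=send_request, args=(i,))
            t.start()                # fire-and-forget on a fixed schedule
            threads.append(t)
            i += 1
            time.sleep(interval)
        for t in threads:
            t.join()

    open_loop_load()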

Percentiles In Practice: High Percentiles

High percentiles become especially important in backend services that are called multiple times as part of serving a single end-user request as a single slow response time slows down the entire user response time.

Even if you make the calls in parallel, the client response will still wait on the slowest of the parallel calls.

  • Def: Tail Latency Application: effect of multiple parallel calls from a single client call, where overall response must wait on the slowest parallel response, increasing the chance of a high percentile slow call

Even if a small percentage of backend calls are slow, the chance of getting a slow call increases with multiple backend calls, leading to a higher proportion of end-user requests being slow.

Measuring High Percentiles: Naive and Algorithms

It is ideal to add response time percentiles to monitoring dashboards for your backend services, calculated on an ongoing basis in efficient ways with minimal CPU and memory cost.

A naive implementation is to keep a rolling window of response times for requests in the last 10 minutes, and every minute calculate the median and various percentiles over that window, though this costs CPU and memory.

A naive mistake is to average the percentiles of two different machines. This is mathematically meaningless, as a percentile depends on the distribution of all combined requests, not an average of the 95th or 50th percentiles between machines. Machine A might have fast requests while machine B has slow ones, and averaging those separately loses the true combined distribution.
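
A small sketch with made-up numbers: compute percentiles from the raw samples (nearest-rank method), and note that averaging two machines' p95s is not the p95 of the combined distribution:

    def percentile(samples, p):
        # Nearest-rank percentile; p is an integer percent (e.g., 95).
        ordered = sorted(samples)
        rank = (len(ordered) * p + 99) // 100  # ceiling of len*p/100, in integer math
        return ordered[rank - 1]

    machine_a = [20, 21, 22, 22, 23, 24, 24, 25, 26, 500]      # mostly fast, one slow outlier
    machine_b = [90, 95, 97, 98, 99, 100, 101, 102, 105, 110]  # uniformly slower
    combined = machine_a + machine_b

    p95_a = percentile(machine_a, 95)   # 500
    p95_b = percentile(machine_b, 95)   # 110
    print("average of the two p95s:", (p95_a + p95_b) / 2)            # 305.0 -- meaningless
    print("p95 of combined distribution:", percentile(combined, 95))  # 110 -- the real number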

Algorithms designed specifically for good approximations of percentiles at minimal CPU and memory cost exist:

  1. Forward Decay

Streaming algorithm for approximating metrics like percentiles over a rolling time window. Older data is 'decayed' (weighted less) automatically so recent values have more impact. Useful for monitoring metrics over time without storing every data point

  2. t-digest

Compact data structure for approximating percentiles, and very accurate for extreme tails (p99, p999). Works by clustering data points into centroids, then computing percentiles from those clusters. Efficient in memory and CPU, even for large datasets

  3. HdrHistogram

High Dynamic Range Histogram. Stores response time counts in a histogram. Tracks latencies from microseconds to hours in a single histogram structure. Supports fast percentile calculations and is suitable for high-throughput monitoring

Approaches for Coping with Load: Ensuring Performance While Increasing Load

There are many ways to deal with scaling and there is no 'magic scaling sauce'

The basis for a scalable architecture lies in assumptions about which operations will be common and which will be rare: our load parameters.

An architecture that is appropriate for one level of load is unlikely to cope with 10 times that load. For fast growing services, it is common to rethink the architecture on every order of magnitude load increase, or even more often than that.

  • Def: Scaling Up (Vertical Scaling): moving to a more powerful machine

  • Def: Scaling Out (Horizontal Scaling): distributing load across multiple smaller machines

  • Def: Shared-Nothing Architecture: distributing load across multiple machines

In reality, good architecture involves a pragmatic mixture of approaches. For example, using several fairly powerful machines can lead to a simpler and cheaper approach than a large number of small virtual machines.

  • Def: Elastic: systems that can automatically add computing resources when they detect a load increase. Useful when load is highly unpredictable.

Distributing stateless services across multiple machines is fairly straightforward.

Distributing stateful data systems from a single node to a distributed set will introduce a lot of additional complexity. Thus, it's common wisdom to keep your database on a single node until scaling cost or high availability requirements force you to make it distributed.

  • Def: Magic Scaling Sauce: there is no such thing as a generic, one-size-fits-all scalable architecture (informally known as magic scaling sauce). The problem may vary with the volume of reads, writes, data to store, complexity, etc. Thus the architecture of a system that operates at a large scale is usually highly specific to the application.

The foundation for an architecture that scales well for a particular application is built around assumptions of which operations will be common and which are rare, indicated by our load parameters.

The issue lies in incorrect assumptions, which will lead to the engineering effort for scaling to be at best wasted and at worst be counterproductive.

Maintainability: Happy Operators, Simple Abstractions, Happy Future Developers

Intro

Over time, different people will build, work on, and add to systems, both maintaining current behavior and adapting the system, and they should all be able to work on it 'productively'.

The most expensive cost of software is not in its initial development, but in its ongoing maintenance: fixing bugs, keeping systems operational, investigating failures, adapting to new platforms, repaying technical debt, etc.

The unfortunate part is that many developers dislike maintenance of these 'legacy' systems. Thus to avoid discomfort we should design software in such a way that it will hopefully minimize pain during maintenance and thus save us from creating 'legacy' software ourselves.

These three design principles for software systems aim to create standards for development:

Operability: Making Life Easy For Operations

Make it easy for operations teams to keep the systems running smoothly

Operations teams are vital to keeping a software system running smoothly. A good operations team is typically responsible for the following:

  • Monitoring health of system and restoring services if it goes into a bad state

  • Tracking down cause of problems, such as system failures or degraded performance

  • Keeping software and platforms up to date, including security patches

Good operability means making routine tasks easy, allowing the operations team to focus their efforts on high-value activities. Data systems can do various things to make routine tasks easy, including:

  • Providing visibility into runtime behavior and internals of the system, with good monitoring

  • Providing support for automation and integration with standard tools

  • Avoiding dependency on individual machines, allowing machines to be taken down for maintenance while the system as a whole continues uninterrupted

  • Documentation and easy to understand operational model

Simplicity: Managing Complexity

Make it easy for new engineers to understand the system, by removing as much complexity as possible from the system.

Small software projects can have delightfully simple and expressive code, but as projects get larger, they often become very complex and difficult to understand, slowing down everyone who needs to work on the system and further increasing the cost of maintenance.

Making a system simpler could mean:

  • Removing accidental complexity
  • Abstracting implementation to hide large chunks of functionality under a simple to understand facade
  • Avoiding tight coupling of modules, tangled dependencies, inconsistent naming, weak hacks at solving performance problems, etc.

Evolvability: Making Changes Easy

Make it easy for engineers to make changes to the system in the future, adapting it for unanticipated use cases as requirements change. Also known as extensibility, modifiability, or plasticity.

It's unlikely that a system's requirements will remain unchanged forever. They are more likely in constant flux, as we learn new facts, encounter unanticipated use cases, business priorities change, etc.

The ease with which you can modify a data system is closely linked to its simplicity and abstractions, as simple, easy-to-understand systems are easier to modify than complex ones.

But since this is an important idea, we separate it from simplicity in order to call out agility for change at the data system level.

Summary

Functional Requirements: what it should do

Nonfunctional Requirements: general properties such as security, reliability, compliance, scalability, compatibility, and maintainability, usually translatable to some metric

Reliability: Making systems that work even when faults occur

Scalability: Having strategies for keeping performance good, even when load increases

Maintainability: Making life easier for engineering and operations teams that need to work with the system

There is no easy fix for making applications reliable, scalable, or maintainable, but patterns and techniques exist that help us work towards these goals.

Chapter 2: Data Models and Query Languages

Intro

The limits of my language mean the limits of my world

Data models are perhaps the most important part of developing software, because they have such a profound effect not only on how the software is written, but also on how we think about the problem we are solving

Most applications are built by layering one data model on top of another. For each new layer, the key question is how is it represented in terms of the next-lower layer.

Layer 1. Most apps look at the real world (people, organizations, goods, actions, etc.) and model it in terms of objects or data structures, and APIs manipulate those data structures

Layer 2. When we store these data structures, we express them in a general purpose data model, such as JSON or XML, tables in a relational database, or a graph model.

Layer 3. The database software decides how to represent the JSON, XML, relational, or graph data in terms of bytes in memory, on disk, or on the network. That representation allows the data to be queried, searched, manipulated, and processed.

Layer 4. Hardware engineers figured out how to represent bytes in terms of electrical currents, pulses of light, magnetic fields, and more.

Each layer hides the complexity of the layers below it by providing a clean data model. In complex applications there may be more intermediary layers, such as APIs built upon APIs, but the idea of layers still applies.

In this chapter we will:

  • Look at a range of general purpose data models for storage and querying
  • Compare the relational model, the document model, and a few graph based data models
  • Look at various query languages and compare their use cases

Relational Model vs Document Model

Intro

The best known model today is probably that of SQL, based on the relational model proposed by Edgar Codd in 1970.

  • Def: Relational Model: Data is organized into relations (called tables in SQL), where each relation is an unordered collection of tuples (rows in SQL).

The goal of the relational model is to hide implementation details of the internal representation of data in the database behind a cleaner interface.

In the 1970s and early 1980s, the network model and hierarchical model were the main alternatives to the relational model, but the relational model's dominance has lasted around 25-30 years, which is an eternity in computing history.

Much of what we use today on the web is still powered by relational databases, whether it be online publishing, discussion, social networking, SaaS, etc.

The Birth of NoSQL

Now in the 2010s, NoSQL is the latest attempt to overthrow the relational model's dominance.

NoSQL stands for Not Only SQL, and there are several forces behind the adoption of NoSQL database:

  • Need for greater scalability than relational databases can easily achieve, including very large datasets or high write throughput

  • Widespread preference for free and open source software over commercial database products

  • Specialized query operations not supported well by the relational model

  • Frustration with the restrictiveness of relational schemas, and a desire for more dynamic and expressive data models

  • Def: Polyglot Persistence: Relational databases will continue to be used alongside a broad variety of nonrelational datastores

The Object-Relational Mismatch

If data is stored in relational tables, an awkward translation layer is required between objects in the application code and the database model of tables, rows, and columns. This disconnect between relational and object-oriented is called an impedance mismatch.

Most development done today is done in object-oriented programming languages. This leads to a common criticism of the SQL data model: an extra translation layer is needed between relational tables and objects in application code.

  • Def: Object-relational mapping (ORM): frameworks to reduce boilerplate required for relational to object-oriented translation
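
A minimal sketch of the translation an ORM automates, using the standard-library sqlite3 module; the users schema and the User class are made up for illustration:

    # Relational rows on one side, application objects on the other:
    # the from_row mapping below is the hand-written glue an ORM would generate.
    import sqlite3
    from dataclasses import dataclass

    @dataclass
    class User:
        id: int
        first_name: str
        last_name: str

        @classmethod
        def from_row(cls, row):
            return cls(id=row[0], first_name=row[1], last_name=row[2])

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT)")
    conn.execute("INSERT INTO users VALUES (1, 'Ada', 'Lovelace')")

    row = conn.execute("SELECT id, first_name, last_name FROM users WHERE id = ?", (1,)).fetchone()
    user = User.from_row(row)   # cross the relational/object boundary
    print(user)                 # User(id=1, first_name='Ada', last_name='Lovelace')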

One to Many, Many to One, and Many to Many Relationships

One to Many: LinkedIn Profile

A profile as a whole can be identified by a unique identifier, user_id. Fields like first_name, last_name appear exactly once per user, so they can be modeled as columns on the user table.

However, most people have had more than one job in their career, and people may have a varying number of periods of education. This leads to a one-to-many relationship from the user to these items, which can be represented in many ways (a sketch of the document form follows this list):

  • Traditional SQL (prior to SQL:1999): the most common normalized representation is to put positions, education, and contact info in separate tables, with a foreign key reference to the users table.

  • Later versions of the SQL standard added support for structured datatypes and XML data, allowing multi-valued data to be stored in a single row, with support for querying and indexing inside those documents. These features are more or less supported by MS SQL Server, PostgreSQL, Oracle, etc.

  • We can encode jobs, education, and contact info as a JSON or XML document and store it in a text column in the database, but in this setup you cannot use the database to query for values inside the encoded column.

  • The JSON model reduces the impedance mismatch between application code and the storage layer, but comes with problems as well (to be discussed), although the lack of a schema is often cited as an advantage.
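
A sketch of what the document form of a profile might look like (shown as a Python dict, which maps directly onto JSON); field names and values are illustrative:

    import json

    # One-to-many items (positions, education) live inside the document itself,
    # so the whole profile can be fetched with a single read and no joins.
    profile = {
        "user_id": 251,
        "first_name": "Alice",
        "last_name": "Chen",
        "positions": [
            {"job_title": "Software Engineer", "organization": "ExampleCorp"},
            {"job_title": "Intern", "organization": "StartupCo"},
        ],
        "education": [
            {"school_name": "State University", "start": 2015, "end": 2019},
        ],
        "region_id": "us:7",  # many-to-one: an ID, not free text (see the next section)
    }

    print(json.dumps(profile, indent=2))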

Many to One: LinkedIn Job Location

For common items across profiles such as job location, region_id and industry_id are given as IDs, not as plain-text strings 'greater seattle area' or 'philanthropy'.

The advantage of using an ID is that because it has no meaning to humans, it never has to change. The ID can remain the same even if the information it identifies changes. Anything meaningful to humans may need to change in the future, and if that information is duplicated (e.g., the location string stored in multiple user profiles), all redundant copies need to be updated, which incurs write overhead and risks inconsistencies.

  • Def: Normalization: aims to remove duplication among values in databases

Normalizing data requires many-to-one relationships. Many people live in one particular region, many people work in one particular industry, and this does not fit nicely into the document model.

If the database itself does not support joins, you have to emulate a join in application code by making multiple queries to the database. In our case, the lists of regions and industries are small and slow-changing enough that the application can keep them in memory, but the work of making the join is still shifted from the database to the application code (sketched below).
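
A sketch of that application-side join, with made-up data and a stand-in fetch_profile function:

    # The small, slow-changing regions list is kept in application memory and the
    # 'join' from region_id to a display name happens in application code.
    regions = {
        "us:7": "Greater Seattle Area",
        "us:12": "Greater Boston Area",
    }

    def fetch_profile(user_id):
        # Stand-in for a single document-database read returning the whole profile.
        return {"user_id": user_id, "first_name": "Alice", "region_id": "us:7"}

    def render_profile(user_id):
        profile = fetch_profile(user_id)
        profile["region_name"] = regions.get(profile["region_id"], "Unknown")
        return profile

    print(render_profile(251))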

Many to Many:

Now if we add more features to a single user profile, such as converting school and organization to IDs that point to school and organization profiles, or adding comments on other users' profiles, there are now many-to-many relationships between the profiles themselves.

Are Document Databases Repeating History?

The Network Model
The Relational Model
Comparison to Document Databases

Relational vs Document Databases Today

Which Data Model leads to Simpler Application Code
Schema Flexibility in the Document Model
Data Locality for Queries
Convergence of Document and Relational Databases

Query Languages for Data

Graph Like Data Models

Summary

Chapter 3: Storage and Retrieval

Intro

Chapter 4: Encoding and Evolution

Intro

Part II: Distributed Data

Chapter 5: Replication

Chapter 6: Partitioning

Chapter 7: Transactions

Chapter 8: The Trouble with Distributed Systems

Chapter 9: Consistency and Consensus

Part III: Derived Data

Chapter 10: Batch Processing

Chapter 11: Stream Processing

Chapter 12: The Future of Data Systems