Big Data is Dead
Abstract
The article discusses the evolution of the "Big Data" concept and argues that the era of Big Data is over. It examines how the predicted "data cataclysm" never materialized and how most organizations do not actually hold data at sizes that require specialized big data technologies. The article also explores why data size is not the primary challenge, and how the separation of storage and compute, along with techniques like column projection and partition pruning, has made it easier to manage data at scale.
Q&A
[01] The Decline of the "Big Data" Narrative
1. What were the key predictions and claims made about the growth of data and the need for new technologies to handle it?
- The article states that for over a decade, the narrative was that data was growing so rapidly that traditional systems would not be able to handle it, and that new "Big Data" technologies were needed to address this problem.
- This was exemplified by the "scare" slide used in many pitch decks, which showed exponential growth in data generation and the need for new solutions.
2. How does the article challenge this narrative?
- The article argues that the predicted "data cataclysm" has not actually materialized, as data sizes have not grown as rapidly as predicted, and hardware has improved at an even faster rate.
- It provides data from the author's experience at Google BigQuery, showing that the majority of customers had less than a terabyte of data, contrary to the dire predictions.
- Industry analysts also confirmed that most enterprises have data warehouses smaller than a terabyte, far smaller than the massive data sizes the narrative portrayed.
3. What factors contributed to the overestimation of data growth?
- The article suggests that data is not distributed evenly across organizations: a small number of companies hold enormous amounts, while most applications never need to process massive volumes.
- It also explains that data sizes grow much faster than the compute needed to query them, since most queries touch only recent data, and that techniques like column projection and partition pruning can sharply reduce the amount of data a query actually scans (a minimal sketch follows below).
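As a concrete illustration, here is a minimal Python sketch of column projection and partition pruning using PyArrow; the dataset path, column names, and date partitioning scheme are hypothetical, and a real warehouse applies the same ideas inside its own query engine.

```python
# Minimal sketch of column projection and partition pruning with PyArrow.
# The dataset path, columns, and partition layout are hypothetical.
import pyarrow.dataset as ds

# Assume events are written as Parquet files partitioned by date, e.g.
# events/event_date=2024-01-15/part-0.parquet
dataset = ds.dataset("events/", format="parquet", partitioning="hive")

# Column projection: read only the two columns the query needs.
# Partition pruning: the filter on event_date skips whole directories,
# so files for non-matching dates are never opened or scanned.
table = dataset.to_table(
    columns=["user_id", "revenue"],
    filter=ds.field("event_date") >= "2024-01-01",
)

print(table.group_by("user_id").aggregate([("revenue", "sum")]))
```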
[02] The Changing Data Landscape
1. How has the separation of storage and compute impacted data management?
- The article explains that the rise of scalable and fast object storage like S3 and GCS has allowed customers to decouple storage and compute, providing more flexibility in how data is managed.
- This means customers are no longer tied to a single form factor and can scale storage and compute independently (a minimal sketch follows below).
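A minimal sketch of that decoupling, assuming a Parquet dataset sitting in an S3 bucket (the bucket name, region, and column names are made up): the data stays in object storage, and whichever machine runs the script supplies the compute.

```python
# Minimal sketch of decoupled storage and compute: data lives in object
# storage (S3 here), while compute is whatever machine runs this script.
# Bucket name, region, and columns are hypothetical.
import pyarrow.dataset as ds
from pyarrow import fs

s3 = fs.S3FileSystem(region="us-east-1")

# Any number of independent compute environments (a laptop, a batch job,
# a warehouse cluster) can point at the same storage and scale separately.
dataset = ds.dataset("my-company-datalake/events/", filesystem=s3, format="parquet")
table = dataset.to_table(columns=["order_id", "amount"])
print(f"rows read: {table.num_rows}")
```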
2. What trends have emerged in the usage of different data management systems?
- The article notes a resurgence in traditional database systems like SQLite, PostgreSQL, and MySQL, while "NoSQL" and "NewSQL" systems have stagnated.
- It also observes a shift from on-premise to cloud-based analytical systems, but no significant growth in demand for massive scale-out analytical systems.
3. How do data access patterns and data age impact data management requirements?
- The article explains that a large share of data accesses target data less than 24 hours old, while data more than a month old is queried only rarely.
- As a result, the working set is often far smaller than the total data size, reducing the need for massive scale-out solutions (see the back-of-the-envelope sketch below).
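A back-of-the-envelope sketch with hypothetical numbers shows why the working set stays small even as total storage keeps growing:

```python
# Back-of-the-envelope sketch (hypothetical numbers) of why the working set
# is much smaller than the total data size when queries favor recent data.
daily_ingest_gb = 10               # new data produced per day
retention_years = 5                # how long history is kept
hot_window_days = 1                # most queries touch less than a day of data

total_gb = daily_ingest_gb * 365 * retention_years
working_set_gb = daily_ingest_gb * hot_window_days

print(f"total stored: {total_gb:,} GB")          # 18,250 GB (~18 TB)
print(f"working set:  {working_set_gb} GB")      # 10 GB
print(f"hot fraction: {working_set_gb / total_gb:.4%}")
```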
[03] The Costs and Challenges of Keeping Data
1. What are the economic incentives to reduce the amount of data processed?
- The article notes that there is acute economic pressure to process less data, since the cost of a query grows with the amount of data it scans, whether that shows up as a per-byte-scanned bill or as the need to run more and larger nodes.
- Reducing the amount of data processed leads to cost savings, faster queries, and room to run more concurrent workloads (the arithmetic sketch below illustrates the effect).
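To make the incentive concrete, here is a small arithmetic sketch; the $5-per-TB rate and the table sizes are illustrative assumptions, not any vendor's actual pricing.

```python
# Illustrative cost arithmetic for pay-per-byte-scanned pricing.
# The $5/TB rate and table sizes are assumptions for this sketch only.
price_per_tb_scanned = 5.00

table_tb = 2.0                      # full table size
needed_columns_fraction = 0.1       # query reads 10% of the columns
recent_partitions_fraction = 0.05   # and only 5% of the date partitions

full_scan_cost = table_tb * price_per_tb_scanned
pruned_scan_cost = (table_tb * needed_columns_fraction
                    * recent_partitions_fraction * price_per_tb_scanned)

print(f"full scan:   ${full_scan_cost:.2f} per query")    # $10.00
print(f"pruned scan: ${pruned_scan_cost:.2f} per query")  # $0.05
```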
2. What are the potential legal and operational challenges of keeping large amounts of data?
- The article discusses how keeping large amounts of historical data can create regulatory compliance issues and legal liability, and how data gradually suffers "bit rot" as its meaning and context become harder to maintain over time.
3. How should organizations approach decisions about retaining historical data?
- The article suggests that organizations should ask why they are keeping historical data, whether anyone actually uses it, and whether storing aggregates or deleting older data would be more cost-effective (a rollup sketch follows below).
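One hedged way to act on that advice is to roll old raw events up into daily aggregates and keep raw detail only for a recent window. The column names, 90-day cutoff, and file paths below are hypothetical, and the timestamp column is assumed to be a timezone-naive datetime.

```python
# Minimal sketch: replace old raw events with daily aggregates.
# Assumes a Parquet file with columns ts (naive datetime), user_id, revenue.
import pandas as pd

events = pd.read_parquet("events.parquet")
cutoff = pd.Timestamp.today() - pd.Timedelta(days=90)

old = events[events["ts"] < cutoff]
recent = events[events["ts"] >= cutoff]

# Keep a compact daily rollup of the old data instead of every raw row.
daily = (
    old.assign(day=old["ts"].dt.date)
       .groupby(["day", "user_id"], as_index=False)
       .agg(revenue=("revenue", "sum"), event_count=("revenue", "size"))
)

daily.to_parquet("events_daily_rollup.parquet")
recent.to_parquet("events.parquet")  # raw detail kept only for the hot window
```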
[04] Defining "Big Data"
1. How has the definition of "Big Data" evolved over time?
- The article provides two definitions of "Big Data": 1) "whatever doesn't fit on a single machine", and 2) "when the cost of keeping data around is less than the cost of figuring out what to throw away".
- It notes that the number of workloads that qualify as "Big Data" under the first definition has been decreasing as hardware capabilities have improved.
2. What questions can organizations ask to determine if they are a "Big Data One-Percenter"?
- The article suggests asking questions like: Do you have more than a terabyte of data? Do you regularly process more than 100 GB of data per query? Do you need to scale out to multiple nodes to get reasonable query performance?
- If the answer to any of these questions is "no", the article suggests the organization may not need to worry about "Big Data" and could benefit from newer data tools designed for more moderate data sizes (see the small sketch below).
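Read as a decision rule, the checklist could be sketched like this; the thresholds mirror the questions above, and the example inputs are hypothetical.

```python
# Tiny sketch of the "Big Data One-Percenter" checklist as a decision rule.
# Thresholds mirror the questions above; inputs are hypothetical.
def is_big_data_one_percenter(total_data_tb: float,
                              typical_query_gb: float,
                              needs_multiple_nodes: bool) -> bool:
    """True only if every checklist answer is 'yes'."""
    return (total_data_tb > 1
            and typical_query_gb > 100
            and needs_multiple_nodes)

# A fairly typical organization: well under the thresholds on every count.
print(is_big_data_one_percenter(total_data_tb=0.3,
                                typical_query_gb=5,
                                needs_multiple_nodes=False))  # False
```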