Delta Lake: The Definitive Guide - Denny Lee

Delta Lake: The Definitive Guide

Denny Lee

Publisher

Publication date

2021-04-01

ISBN

9781098104528

Rating

★★★★★
Book description

Analysis and machine learning models are only as good as the data they're built on. Querying processed data and getting insights from it requires a robust data pipeline--and an effective storage solution that ensures data quality, data integrity, and performance.

This guide introduces you to Delta Lake, an open-source format that enables building a lakehouse architecture on top of existing storage systems such as S3, ADLS, GCS, and HDFS. Delta Lake enhances Apache Spark and makes it easy to store and manage massive amounts of complex data by supporting data integrity, data quality, and performance. Data engineers, data scientists, and data practitioners will learn how to build reliable data lakes and data pipelines at scale using Delta Lake.

- Understand key data reliability challenges and how to tackle them

- Learn how to use Delta Lake to realize data reliability improvements

- Concurrently run streaming and batch jobs against your data lake

- Execute update, delete, and merge commands against your data lake

- Use time travel to roll back and examine previous versions of your data

- Learn best practices to build effective, high-quality end-to-end data pipelines for real-world use cases

- Integrate with other data technologies like Presto, Athena, Redshift and other BI tools

Learn how thousands of companies are processing exabytes of data per month with their lakehouse architecture using Delta Lake.

Table of contents
1. Basic Operations on Delta Lakes
2. Time Travel with Delta Lake
3. Continuous Applications with Delta Lake
User reviews
Only 3 chapters.
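A good introduction to the evolution from data warehouse to data lake to lakehouse. Delta Lake is essentially a data analytics framework that wraps object storage and adds ACID support. One thing to note: Delta Lake supports merging incremental data but is not good at it. It can provide ACID guarantees for that scenario, but because every affected Parquet file has to be rewritten, the cost that merging introduces can be disastrous. It fits better when the data is naturally well partitioned, or when the workload is append-only with no modifications. For workloads with frequent data modifications, there are better incremental architectures.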
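The reviewer's point about rewriting Parquet files can be illustrated with a toy model. The sketch below is not Delta Lake code and does not use its API; `DataFile` and `copy_on_write_merge` are hypothetical names, and each "file" is just an immutable tuple of rows. It assumes a simple copy-on-write rule: any file containing at least one matched key is rewritten in full.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    """Hypothetical stand-in for one immutable Parquet file."""
    name: str
    rows: tuple  # tuple of (key, value) pairs


def copy_on_write_merge(files, updates):
    """Apply updates (key -> new value) to existing keys.

    Every file holding at least one matched key is rewritten in full;
    untouched files are carried over as-is.
    Returns (new_files, rows_rewritten).
    """
    new_files = []
    rows_rewritten = 0
    for f in files:
        if any(key in updates for key, _ in f.rows):
            rewritten = tuple((k, updates.get(k, v)) for k, v in f.rows)
            new_files.append(DataFile(f.name + "-v2", rewritten))
            rows_rewritten += len(rewritten)
        else:
            new_files.append(f)
    return new_files, rows_rewritten


# Four files of 1,000 rows each, holding keys 0..3999.
files = [
    DataFile(f"part-{i}", tuple((k, "old") for k in range(i * 1000, (i + 1) * 1000)))
    for i in range(4)
]

# Four updated keys spread across all files vs. four keys in a single file.
scattered = {5: "new", 1005: "new", 2005: "new", 3005: "new"}
clustered = {5: "new", 6: "new", 7: "new", 8: "new"}

scattered_files, scattered_cost = copy_on_write_merge(files, scattered)
clustered_files, clustered_cost = copy_on_write_merge(files, clustered)

print("rows rewritten (scattered keys):", scattered_cost)
print("rows rewritten (clustered keys):", clustered_cost)
```

Under this model, the same number of changed rows costs four times as much when the keys are scattered across files, which matches the reviewer's remark that well-partitioned or append-only data is the more comfortable case.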