Hadoop: The Definitive Guide

Tom White

Publication date

2015-04-11

ISBN

9781491901632

Rating

★★★★★
Book description

Get ready to unlock the power of your data. With the fourth edition of this comprehensive guide, you’ll learn how to build and maintain reliable, scalable, distributed systems with Apache Hadoop. This book is ideal for programmers looking to analyze datasets of any size, and for administrators who want to set up and run Hadoop clusters.

Using Hadoop 2 exclusively, author Tom White presents new chapters on YARN and several Hadoop-related projects such as Parquet, Flume, Crunch, and Spark. You’ll learn about recent changes to Hadoop, and explore new case studies on Hadoop’s role in healthcare systems and genomics data processing.

Learn fundamental components such as MapReduce, HDFS, and YARN

Explore MapReduce in depth, including steps for developing applications with it

Set up and maintain a Hadoop cluster running HDFS and MapReduce on YARN

Learn two data formats: Avro for data serialization and Parquet for nested data

Use data ingestion tools such as Flume (for streaming data) and Sqoop (for bulk data transfer)

Understand how high-level data processing tools like Pig, Hive, Crunch, and Spark work with Hadoop

Learn the HBase distributed database and the ZooKeeper distributed configuration service

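AI reading guide
Key highlights
  • A comprehensive look at the Hadoop 2 architecture, covering the core principles of HDFS, MapReduce, and YARN
  • In-depth coverage of data integrity, compression, serialization, and the underlying mechanisms of Hadoop I/O
  • An introduction to related ecosystem projects such as Spark and Flume, broadening the reader's view of big data processing
Who should read it
  • Programmers and system administrators who want to build reliable, scalable distributed systems
  • Technical staff who need to analyze datasets of any size and run ad hoc analyses
  • Newcomers interested in low-level system design who want a solid Hadoop foundation
Before you read
  • Focus on the first two parts (fundamentals and MapReduce); the later parts on ecosystem projects can be skimmed
  • Reading the English original directly is recommended; the quality of the Chinese translation is uneven and can be misleading
  • Read alongside a concrete project; reading for theory alone is dry and hard to digest
Reader consensus
  • A classic, authoritative guide to Hadoop, with thorough explanations of principles and a clear knowledge structure
  • Detailed but long; suitable as a reference or for systematic introductory study
  • Although MapReduce programming is rarely used now, understanding its ideas is essential to mastering big data

This guide was generated from the book's description, table of contents, excerpts, short comments, and reviews; it is not a substitute for a close reading of the full text.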

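Highlighted excerpts
  • "* The architecture of HDFS is described in “The Hadoop Distributed File System” by Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler (Proceedings of MSST2010, May 2010, http://storageconference.org/2010/Papers/MSST/Shvachko.pdf). † “Scaling Hadoop to 4000 nodes at Yahoo!,” http:"
  • "In many cases, MapReduce can be seen as a complement to a relational database management system (RDBMS). MapReduce is a good fit for problems that need to analyze the whole dataset in a batch fashion, particularly for ad hoc analysis. An RDBMS is good for point queries and updates, where the dataset has been indexed to deliver low-latency retrieval and fast updates of a relatively small amount of data. MapReduce suits applications where the data is written once and read many times, whereas a relational database is good for datasets that are continually updated."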
  • "MapReduce is a programming model for data processing. MapReduce works by breaking the processing into two phases: the map phase and the reduce phase. Each phase has key-value pairs as input and output, the types of which may be chosen by the programmer. The programmer also specifies two functions: t"
  • "Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits, or just splits. Hadoop creates one map task for each split, which runs the user-defined map function for each record in the split. Hadoop does its best to run the map task on a node where the input data resides in"
  • "HDFS is a filesystem designed for storing very large files with streaming data access patterns (write-once, read-many-times pattern), running on clusters of commodity hardware. HDFS blocks (>64 MB) are large compared to disk blocks, and the reason is to minimize the cost of seeks. Map tasks in MapReduce"
  • "One important aspect of this design is that the client contacts datanodes directly to retrieve data and is guided by the namenode to the best datanode for each block. This design allows HDFS to scale to a large number of concurrent clients, since the data traffic is spread across all the datanodes i"
  • "As the client writes data (step 3), DFSOutputStream splits it into packets, which it writes to an internal queue, called the data queue. The data queue is consumed by the Data Streamer, whose responsibility it is to ask the namenode to allocate new blocks by picking a list of suitable datanodes to"
  • "HDFS Federation, introduced in the 0.23 release series, allows a cluster to scale by adding namenodes, each of which manages a portion of the filesystem namespace. For example, one namenode might manage all the files rooted under /user, say, and a second namenode might handle files under /share. T"
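The excerpts above describe MapReduce as a map phase and a reduce phase, each taking and producing key-value pairs, with the programmer supplying the two functions. A minimal sketch of that idea in plain Python follows. It is an in-memory simulation, not the Hadoop API: the names `map_words`, `reduce_counts`, and `run_job` are hypothetical, the word-count task and sample lines are made up for illustration, and a real job would distribute map tasks across input splits rather than loop in a single process.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

KeyValue = Tuple[str, int]


def map_words(_line_number: int, line: str) -> Iterator[KeyValue]:
    """Map function: emit (word, 1) for every word in one input record."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word: str, counts: Iterable[int]) -> Iterator[KeyValue]:
    """Reduce function: sum all values emitted for one key."""
    yield word, sum(counts)


def run_job(
    records: Iterable[Tuple[int, str]],
    mapper: Callable[[int, str], Iterator[KeyValue]],
    reducer: Callable[[str, Iterable[int]], Iterator[KeyValue]],
) -> List[KeyValue]:
    """Run map, group values by key (a stand-in for the shuffle), then reduce."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)

    results: List[KeyValue] = []
    for key in sorted(groups):  # process keys in sorted order
        results.extend(reducer(key, groups[key]))
    return results


# Illustrative input: each record is (line number, line text).
lines = ["the quick brown fox", "the lazy dog", "the fox"]
word_counts = run_job(list(enumerate(lines)), map_words, reduce_counts)
print(word_counts)
```

The grouping step in `run_job` stands in for the work the framework does between the two phases; only the two small functions carry the application logic, which is the division of labor the excerpt describes.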
About the author
Tom White has been an Apache Hadoop committer since February 2007, and is a member of the Apache Software Foundation. He works for Cloudera, a company set up to offer Hadoop support and training. Previously he was an independent Hadoop consultant, working with companies to set up, use, and extend Hadoop. He has written numerous articles for O'Reilly, java.net and IBM's developerWorks, and has spoken at several conferences, including at ApacheCon 2008 on Hadoop. Tom has a Bachelor's degree in Mathematics from the University of Cambridge and a Master's in Philosophy of Science from the University of Leeds, UK.
Table of contents
Hadoop Fundamentals
Chapter 1: Meet Hadoop
Data!
Data Storage and Analysis
Querying All Your Data

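User reviews
Reading the first two parts is enough; there's no need to go deep on the related Pig, Hive, and Spark material unless you plan to use them in practice. Having read the three Google papers in an undergraduate course, I got through this book quickly.
Finished it; this was my first exposure to big data. The book's coverage is quite comprehensive: the first part explains the principles, the middle covers Hadoop-based projects in detail, and the end gives concrete application examples. There are many places I still don't fully understand and need to read further.
2016 NO.4: Accessible yet deep, with very thorough explanations of the principles. The core is the Hadoop Fundamentals and MapReduce chapters, but the later Related Projects part is also concise and highlights the key points. For example, the Flume chapter mentions some key points that aren't covered even in the tutorial on the official Flume site.
Luckily I don't need to write Java when I use it.
Damn, it's long. It introduces most of the tools in the ecosystem and works well as a summary and review. Readers without hands-on experience can read the core MR and YARN material in the first two parts, then skim the rest to learn what each tool is for.
No need to read it in great detail; the Hadoop ecosystem rarely programs directly against MapReduce these days. Hadoop-Spark-Flink.
Just started (less than halfway through Part I), and everything is explained remarkably clearly, as you'd expect from a foundation member writing about a system he helped design... Even the instructions for installing and configuring a Hadoop cluster are more detailed than the lousy official documentation (…); the difference is obvious 🤣 (Update Dec 5, 2020) Skipped the introductions to Pig, Hive, and the other Apache ecosystem components, as well as the case studies. I came away with the illusion of having completely mastered Hadoop. Still, sorry for learning something nobody uses (half annoyed).
Read Part I, Hadoop Fundamentals, carefully and got a lot out of it as a newcomer. Skimmed Part IV, Related Projects, to get a rough idea of the surrounding ecosystem. I wish the theory went deeper; I've really come to love system design.
Excellent
A classic
Saved for later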