Hadoop: The Definitive Guide

Name: Hadoop: The Definitive Guide
Availability: InStock
Rating: 8.3 (274 reviews)
Author: Unknown author
ISBN: 9780596521998

Publisher

O'Reilly Media, Inc.

Publication date

2009-01-01

ISBN

9780596521998

Rating

★★★★★

Tags

hadoop MapReduce distributed Cloud

Book description

Apache Hadoop is ideal for organizations with a growing need to store and process massive application datasets. Hadoop: The Definitive Guide is a comprehensive resource for using Hadoop to build reliable, scalable, distributed systems. Programmers will find details for analyzing large datasets with Hadoop, and administrators will learn how to set up and run Hadoop clusters. The book includes case studies that illustrate how Hadoop solves specific problems.

Organizations large and small are adopting Apache Hadoop to deal with huge application datasets. Hadoop: The Definitive Guide provides you with the key for unlocking the wealth this data holds. Hadoop is ideal for storing and processing massive amounts of data, but until now, information on this open-source project has been lacking, especially with regard to best practices. This comprehensive resource demonstrates how to use Hadoop to build reliable, scalable, distributed systems.

With case studies that illustrate how Hadoop solves specific problems, this book helps you:

* Learn the Hadoop Distributed File System (HDFS), including ways to use its many APIs to transfer data
* Write distributed computations with MapReduce, Hadoop's most vital component
* Become familiar with Hadoop's data and IO building blocks for compression, data integrity, serialization, and persistence
* Learn the common pitfalls and advanced features for writing real-world MapReduce programs
* Design, build, and administer a dedicated Hadoop cluster
* Use HBase, Hadoop's database for structured and semi-structured data

And more.

Hadoop: The Definitive Guide is still in progress, but you can get started on this technology with the Rough Cuts edition, which lets you read the book online or download it in PDF format as the manuscript evolves.

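AI reading guide

Key highlights

A comprehensive look at the Hadoop architecture, covering the core principles of HDFS and MapReduce
In-depth coverage of building distributed systems, with practical guidance on cluster setup and operations
Case studies showing how Hadoop solves massive data-processing problems

Who should read it

Programmers and system administrators who want to learn the Hadoop ecosystem systematically
Engineers who need to process large datasets and build scalable systems
Computer science students interested in distributed storage and batch-processing computation models

Before you read

Read the English original directly if you can; the Chinese translation is uneven in quality and error-prone
Requires a solid Java foundation, including reflection, multithreading, and network programming
Some editions may be out of date; consult the latest documentation to follow how the ecosystem has evolved

Reader consensus

Widely regarded as the classic introductory text, with a clear and thorough account of Hadoop's background and development
The English original is detailed and authoritative, but the Chinese translation is widely criticized as unreliable
Good for beginners building a knowledge framework, but needs hands-on practice to overcome the harder parts

This guide was generated from the book description, table of contents, excerpts, short comments, and reviews; it is not a substitute for a close reading of the full text.

Notable excerpts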

"* The architecture of HDFS is described in “The Hadoop Distributed File System” by Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler (Proceedings of MSST2010, May 2010, http:// storageconference.org/2010/Papers/MSST/Shvachko.pdf). † “Scaling Hadoop to 4000 nodes at Yahoo!,” http:"
"在许多情况下，可以视Mapreduce为关系型数据库管理系统的补充。MapReduce比较适合以批处理的方式处理需要分析整个数据集的问题，尤其是即席分析。RDBMS适用于点查询和更新，数据集被索引后，数据库系统能够提供低延迟的数据检索和快速的少量数据更新。MapReduce适合数据一次写入、多次读取的应用，而关系型数据库更适合持续更新数据集."
"MapReduce is a programming model for data processing. MapReduce works by breaking the processing into two phases: the map phase and the reduce phase. Each phase has key-value pairs as input and output, the types of which may be chosen by the programmer. The programmer also specifies two functions: t"
"Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits, or just splits. Hadoop creates one map task for each split, which runs the userdefined map function for each record in the split. Hadoop does its best to run the map task on a node where the input data resides in"
"HDFS is a filesystem designed for storing very large files with streaming data access patterns(write-once, read-many-times pattern), running on clusters of commodity hardware. HDFS blocks(>64M) are large compared to disk blocks, and the reason is to minimize the cost of seeks. Map tasks in MapReduce"
"One important aspect of this design is that the client contacts datanodes directly to retrieve data and is guided by the namenode to the best datanode for each block. This design allows HDFS to scale to a large number of concurrent clients, since the data traffic is spread across all the datanodes i"
"As the client writes data (step 3), DFSOutputStream splits it into packets, which it writes to an internal queue, called the data queue . The data queue is consumed by the Data Streamer , whose responsibility it is to ask the namenode to allocate new blocks by picking a list of suitable datanodes to"
"HDFS Federation, introduced in the 0.23 release series, allows a cluster to scale by adding namenodes, each of which manages a portion of the filesystem namespace. For example, one namenode might manage all the files rooted under /user , say, and a second namenode might handle files under /share . T"