Programming Collective Intelligence (集体智慧编程)

Toby Segaran (USA)

Publication date

2015-02-28

ISBN

9787121254437

Rating

★★★★★

Tags

Algorithms

About the book

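Set against the background of machine learning and computational statistics, this book explains how to mine and analyze data and resources on the Web, how to analyze information such as user experience, marketing, and personal taste and draw useful conclusions from it, and how to use sophisticated algorithms to obtain, collect, and analyze user data and feedback from websites in order to create new user value and business value. The coverage is thorough and includes collaborative filtering (for recommending related products), cluster analysis (discovering similar subsets within large datasets), core search engine technology (crawlers, indexing, query engines, the PageRank algorithm, and so on), optimization algorithms that search through massive amounts of information and reach conclusions through statistical analysis, Bayesian filtering (spam filtering, text filtering), decision trees for prediction and decision modeling, information matching techniques for social networks, and applications of machine learning and artificial intelligence.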

This book is an excellent choice for Web developers, architects, application engineers, and similar readers.

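AI reading guide
Key points
  • Implements machine learning algorithms in Python code, with an emphasis on practical application
  • Covers collaborative filtering, clustering, search ranking, and optimization algorithms
  • Explains how recommender systems work through Web data mining case studies
Who should read it
  • Python developers who want an introduction to machine learning and data mining
  • Beginners interested in how recommender systems and search engines work
  • Engineers who lack project experience and need to get hands-on with algorithms quickly
Before you read
  • The code in the book is based on Python 2; watch for syntax differences as you read
  • The theory is covered only briefly; consider supplementing the mathematical foundations with other textbooks
  • Some data sources are out of date; you will need to find replacement data to run the code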
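The Python 2 caveat above is easier to act on with a concrete picture of what usually changes. The following is a minimal sketch of common Python 2 to Python 3 differences, not a listing from the book; the `describe` helper and the sample `prefs` data are hypothetical and exist only for illustration.

```python
# Python 2 idioms that commonly appear in older listings (shown as comments,
# because they are not valid Python 3), with a Python 3 equivalent:
#   print "score:", score        ->  print("score:", score)
#   7 / 2 == 3                   ->  7 // 2 == 3, while 7 / 2 == 3.5
#   prefs.has_key("reader_a")    ->  "reader_a" in prefs
#   import urllib2               ->  import urllib.request (roughly)


def describe(prefs):
    """Return one formatted line per reader, in sorted order."""
    lines = []
    # items() is a view in Python 3 (a list in Python 2); sorted() accepts both.
    for name, score in sorted(prefs.items()):
        lines.append("%s: %.1f" % (name, score))
    return lines


# Hypothetical sample data, not taken from the book.
prefs = {"reader_b": 3.5, "reader_a": 2.5}

true_quotient = 7 / 2    # true division in Python 3
floor_quotient = 7 // 2  # floor division, which is what 7 / 2 gave in Python 2

if "reader_a" in prefs:  # replaces prefs.has_key("reader_a")
    print(describe(prefs))
print(true_quotient, floor_quotient)
```

Division is the change most likely to alter numeric results silently, so it is worth checking any averaging or normalization code first when porting.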
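Reader consensus
  • The code is clear and intuitive, making this an excellent hands-on introduction to machine learning
  • Theoretical depth is limited; not suited to readers who want to dig deep into how the algorithms work
  • The application scenarios are close to everyday life, which helps in understanding the practical value of the algorithms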

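This guide was generated from the book's description, table of contents, excerpts, short comments, and reviews; it is not a substitute for a close reading of the full text.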

Selected excerpts
  • "Next, get a list of random people to make up the dataset. Fortunately, Hot or Not provides an API call that returns a list of people with specified criteria. In this example, the only criteria will be that the people have “meet me” profiles, since only from these profiles can you get other informa"
  • "What Does This Have to Do with the Articles Matrix? So far, what you have is a matrix of articles with word counts. The goal is to factorize this matrix, which means finding two smaller matrices that can be multiplied together to reconstruct this one. The two smaller matrices are: The features matri"
  • "Another feature that applies more evenly to a couple of companies is this one: Feature 2 (46151801.813632453, 'GOOG') (24298994.720555616, 'YHOO') (10606419.91092159, 'PG') (7711296.6887903402, 'CVX') (4711899.0067871698, 'BIIB') (4423180.7694432881, 'XOM') (3430492.5096612777, 'DNA') (2882726.88776"
  • "Because new connections are only created when necessary, this method has to return a default value if there are no connections. For links from words to the hidden layer, the default value will be –0.2 so that, by default, extra words will have a slightly negative effect on the activation level of a "
  • "Pearson Correlation Score A slightly more sophisticated way to determine the similarity between people’s interests is to use a Pearson correlation coefficient. The correlation coefficient is a measure of how well two sets of data fit on a straight line. The formula for this is more complicated t"
  • "Simulated annealing is an optimization method inspired by physics. Annealing is the process of heating up an alloy and then cooling it down slowly. Because the atoms are first made to jump around a lot and then gradually settle into a low energy state, the atoms can find a low energy configuration."
  • "The flight scheduling example works because moving a person from the second to the third flight of the day would probably change the overall cost by a smaller amount than moving that person to the eighth flight of the day would. If the flights were in random order, the optimization methods would wor"
  • "Squaring the numbers is common practice because it makes large differences count for even more. This means an algorithm that is very close most of the time but far off occasionally will fare worse than an algorithm that is always somewhat close. This is often desired behavior, but there are situatio"
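The Pearson correlation excerpt above can be made concrete with a minimal sketch. This is not the book's listing; the function names `pearson` and `shared_ratings` and the sample ratings are hypothetical, chosen only to show how the coefficient measures how well two people's ratings fit on a straight line.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("sequences must be non-empty and of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    denom = sqrt(var_x * var_y)
    if denom == 0:
        # Assumption: treat a constant rating sequence as uncorrelated.
        return 0.0
    return cov / denom


def shared_ratings(prefs, a, b):
    """Return the ratings of a and b restricted to items both have rated."""
    shared = [item for item in prefs[a] if item in prefs[b]]
    return [prefs[a][i] for i in shared], [prefs[b][i] for i in shared]


# Hypothetical ratings: bob rates everything twice as high as alice,
# carol's tastes run in the opposite direction.
ratings = {
    "alice": {"film_a": 1.0, "film_b": 2.0, "film_c": 3.0},
    "bob": {"film_a": 2.0, "film_b": 4.0, "film_c": 6.0, "film_d": 1.0},
    "carol": {"film_a": 3.0, "film_b": 2.0, "film_c": 1.0},
}

print(pearson(*shared_ratings(ratings, "alice", "bob")))
print(pearson(*shared_ratings(ratings, "alice", "carol")))
```

In this sample, alice and bob get a score of 1 even though bob's ratings are consistently higher, because the coefficient depends on how the ratings move together and not on their absolute scale. Alice and carol get -1 because their ratings move in opposite directions.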
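About the author
Toby Segaran is the director of software development at Genstruct, a computational biology company, where he designs algorithms and applies data mining techniques to help understand drug mechanisms. He also works with several other companies and a number of open source projects, helping them analyze and extract value from the data they collect. In addition, he has built several free web applications, including the popular tasktoy and Lazybase. He enjoys skiing and wine tasting, blogs at blog.kiwitobes.com, and lives in San Francisco.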
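Table of contents
Preface
Chapter 1: Introduction to Collective Intelligence
What Is Collective Intelligence?
What Is Machine Learning?
Limits of Machine Learning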

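User reviews
Feels best suited to reading when you are building models.
An excellent code-level machine learning tutorial; you can put what you learn to use right away.
Exercises completed on GitHub: https://github.com/wanggang3333/CollectiveIntelligence Although this is the 2015 edition, it already feels somewhat dated.
For experienced readers this book really isn't much good, three stars at most; for complete beginners who just want an overview, I can grudgingly give it four. Most of the content is traditional machine learning, but the explanations are all rushed and nothing is covered thoroughly. There is quite a lot of code in the book, but I have to say the code is really bad. I don't know how this book got such a high rating; at best it serves as an introductory outline to browse.
Having skimmed through these classics of the early internet era, I still feel they are no match for a video open course.
The code is concise and clear, reducing the complex to the simple, and it explains the principles of the underlying algorithms; tying the material to real-world work and everyday life makes the key points stick. A good introductory book on data mining.
Suitable for students who lack project experience; it says relatively little about theory and principles.
I could follow how theory and practice come together, but some of the theory was unclear to me and I had to look up the related concepts myself.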
Belatedly marking this as read.