"Pig provides an engine for executing data flows in parallel on Hadoop. It includes a language, Pig Latin ..."
"To be mathematically precise, a Pig Latin script describes a directed acyclic graph (DAG), where the edges are data flows and the nodes are operators that process the data."
"That is, one reducer will get 10 or more times the data than other reducers. Pig has join and order by operators that will handle this case and (in some cases) rebalance the reducers."
"Users = load 'users' as (name, age); Fltrd = filter Users by age >= 18 and age <= 25; Pages = load 'pages' as (user, url); Jnd = join Fltrd by name, Pages by user; Grpd = group Jnd by url; Smmd = foreach Grpd generate group, COUNT(Jnd) as clicks; Srtd = order Smmd by clicks desc; Top5 = limit Srtd 5"
"Because Hadoop is a distributed system and usually processes data in parallel, when it outputs data to a “file” it creates a directory with the file’s name, and each writer creates a separate part file in that directory."
"The only thing Pig needs to know to run on your cluster is the location of your cluster’s NameNode and JobTracker. The NameNode is the manager of HDFS, and the JobTracker coordinates MapReduce job"
"Casts to bytearrays are never allowed because Pig does not know how to represent the various data types in binary format."
"Pig does these joins in MapReduce by using the map phase to annotate each record with which input it came from. It then uses the join key as the shuffle key. Thus join forces a new reduce phase"