Friday 19 September 2014

Pregel: Distributed Graph Processing Framework My Notes

Pregel: Distributed Graph Processing Framework

Motivation:-

Efficient processing of large graph is facing following problems.
  • Poor locality of memory access
  • Very little work per vertex
  • Changing degree of parallelism over the course of execution

No scalable general purpose system available for implementing arbitrary graph algorithm over arbitrary graph representation in large-scale distributed environment.

Algorithm implementation to process large graph can be done by one of the following options.
1) Designing custom distributed infrastructure which require considerable implementation effort for each new algorithm or graph representation.
2) Using available distributed computing platform which are not always well suited for graph processing like MapReduce.
3) Use of single computer graph algorithm like BGL, LEDA, NetworkX, JDSL which limit the scalability
4) Using existing parallel graph system like BGL & CGMgraph but they do not address fault tolerances or other distributed system issue.

Proposed Solution:-

Valiant’s Bulk Synchronous Parallel Model


Vertex-Centric approach to solve problem
Pragel computations consist of sequences of iterations called super Supersteps.
During each Superstep S following operations can be performed,
  1. It can compute user defined function for each vertex V
  2. Each V can read message, sent it while S-1 Superstep.
  3. It can send message to V that will be received in S+1 Superstep
  4. It can modify state of V & outgoing edge. It can also change graph topology
Note: Messages are typically sent along outgoing edge, but message can be sent to any vertex whose identifier is known.

Computation Model:-
Input: Directed Graph
Each vertex has Vertex-Identifier and modifiable user defined value. Each directed edge has Source Vertices, modifiable user defined value and target vertex identifier.
Pregel computation consists of,
  • Input
  • Supersteps separated by global synchronization points
  • Algorithm termination
  • Output
Each vertex compute in parallel with same user defined function.
Algorithm Termination: It terminates when every vertex voting for halt.
Example,
Consider Superstep 0 when all vertexes are active. A vertex deactivates itself by voting for halt. If halted vertex receives any messages then it again activate. To go again in deactivate stage, vertex must vote for halt again. Algorithm terminates when every vertex vote for halt.
Vertex & edge can be added & removed during computations.


Advantage over MapReduce:-
  1. Graph algorithm can be written as series of chained MapReduce invocation. This has bad performance & usability. Pregel overcome this problems.
  2. Pregel keeps vertices & edges on machine where it performs computation & only use network for message passing. But heavy network bandwidth is used in MapReduce.
  3. MapReduce is functional type programming, so expressing graph algorithm as chained MapReduce require passing entire state of the graph from one stage to next stage producing large overhead on communication & associated serializability.
  4. There is a need to co-ordinate the steps of chained MapReduce which adds programming complexity. That is avoided by Pregel because of Bulk Synchronization Model.

Sample Problem Solved using Pregel:-

Objective: Find maximum number
Here dotted lines are messages. Dark vertices have voted to halt. In each step supersteps, vertices send maximum vertex value to their neighbor vertices. If vertex itself is bigger than its neighbor vertex then it votes for halt. Algorithm terminates when every vertex halt.


References:-


 [1] G. Malewicz, M. H. Austern, A. J. C. Bik, J. C. Dehnert,I. Horn, N. Leiser, and G. zajkowski. Pregel: A System for Large-Scale Graph Processing. In SIGMOD, 2011.

1 comment: