Pregel: Distributed Graph Processing Framework
Motivation:-
Efficient processing of large graph is facing following problems.
Poor locality of memory access
Very little work per vertex
Changing degree of parallelism over the course of execution
No scalable general purpose system available for implementing
arbitrary graph algorithm over arbitrary graph representation in
large-scale distributed environment.
Algorithm implementation to process large graph can be done by one of
the following options.
1) Designing custom distributed infrastructure which require
considerable implementation effort for each new algorithm or graph
representation.
2) Using available distributed computing platform which are not
always well suited for graph processing like MapReduce.
3) Use of single computer graph algorithm like BGL, LEDA, NetworkX,
JDSL which limit the scalability
4) Using existing parallel graph system like BGL & CGMgraph but
they do not address fault tolerances or other distributed system
issue.
Proposed Solution:-
Valiant’s Bulk Synchronous Parallel Model
Vertex-Centric approach to solve problem
Pragel computations consist of sequences of iterations called super
Supersteps.
During each Superstep S following operations can be performed,
It can compute user defined function for each vertex V
Each V can read message, sent it while S-1 Superstep.
It can send message to V that will be received in S+1 Superstep
It can modify state of V & outgoing edge. It can also change
graph topology
Note: Messages are typically sent along outgoing edge, but message
can be sent to any vertex whose identifier is known.
Computation Model:-
Input: Directed Graph
Each vertex has Vertex-Identifier and modifiable user defined value.
Each directed edge has Source Vertices, modifiable user defined value
and target vertex identifier.
Pregel computation consists of,
Each vertex compute in parallel with same user defined function.
Algorithm Termination: It terminates when every vertex voting
for halt.
Example,
Consider Superstep 0 when all vertexes are active. A vertex
deactivates itself by voting for halt. If halted vertex receives any
messages then it again activate. To go again in deactivate stage,
vertex must vote for halt again. Algorithm terminates when every
vertex vote for halt.
Vertex & edge can be added & removed during computations.
Advantage over MapReduce:-
Graph algorithm can be written as series of chained MapReduce
invocation. This has bad performance & usability. Pregel
overcome this problems.
Pregel keeps vertices & edges on machine where it performs
computation & only use network for message passing. But heavy
network bandwidth is used in MapReduce.
MapReduce is functional type programming, so expressing graph
algorithm as chained MapReduce require passing entire state of the
graph from one stage to next stage producing large overhead on
communication & associated serializability.
There is a need to co-ordinate the steps of chained MapReduce which
adds programming complexity. That is avoided by Pregel because of
Bulk Synchronization Model.
Sample Problem Solved using Pregel:-
Objective: Find maximum number
Here dotted lines are messages. Dark vertices have voted to halt. In
each step supersteps, vertices send maximum vertex value to their
neighbor vertices. If vertex itself is bigger than its neighbor
vertex then it votes for halt. Algorithm terminates when every vertex
halt.
References:-
[1] G. Malewicz, M. H. Austern, A. J. C. Bik, J. C. Dehnert,I. Horn,
N. Leiser, and G. zajkowski. Pregel: A System for Large-Scale Graph
Processing. In SIGMOD, 2011.