Friday 9 December 2016

Execute Python data science packages from Java

Goal: To be able to execute Python scripts from Java, where the scripts use Python data science packages like NumPy, SciPy, and pandas.

Experiment 1: Jython with PythonInterpreter
Pros:
  1. Can run a single Python statement or a whole Python script
  2. Objects inside Python can be accessed from Java
Cons:
  1. Because it runs on top of the Jython distribution of Python, third-party packages built on CPython (e.g. pandas, pymongo) might not be available
  2. The JyNI project helps to add some third-party CPython-style packages, but it is at an early stage, and importing data science packages like NumPy and pandas is not supported
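
For reference, a minimal sketch of the PythonInterpreter approach (assumes the Jython 2.7 standalone jar on the classpath; PythonInterpreter is AutoCloseable from 2.7 onward):

import org.python.core.PyObject;
import org.python.util.PythonInterpreter;

public class JythonDemo {
    public static void main(String[] args) {
        // The interpreter runs Python code inside the JVM, no external process
        try (PythonInterpreter interp = new PythonInterpreter()) {
            interp.exec("s = 'Hello from Jython'");   // run a single statement
            PyObject s = interp.get("s");             // pull the Python object into Java
            System.out.println(s.asString());
            interp.execfile("src/myscript.py");       // run a whole script file
        }
    }
}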

Experiment 2: Jython with ScriptEngine
Skipping this part for now. It is very similar to the “PythonInterpreter” approach, and the core problem of supporting CPython packages is still there.
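
For completeness, a minimal sketch of the JSR-223 route (assumes jython-standalone on the classpath, which I understand registers itself under the engine name "python"):

import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;
import javax.script.ScriptException;

public class ScriptEngineDemo {
    public static void main(String[] args) throws ScriptException {
        // Returns null if no Python-capable engine is on the classpath
        ScriptEngine engine = new ScriptEngineManager().getEngineByName("python");
        engine.eval("s = 'Hello from JSR-223'");
        System.out.println(engine.get("s"));  // read a variable back from the engine scope
    }
}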

Experiment 3: Java Embedded Python (Initial Experiment Successful)
Pros:
  1. Works with the default Python packages
  2. Can run whole Python scripts
  3. Objects inside Python can be accessed from Java
  4. Supports third-party CPython packages
Cons:
  1. Complicated setup

Summary:
Jython:
A Jython-based solution might not work because the Java-based Python interpreter does not support CPython-based packages like pandas and NumPy. The best effort made to make Jython compatible with CPython-based packages is the “JyNI” project, but JyNI also does not support NumPy or pandas as of now.
JEP:
JEP is the best option for our use case so far. It works with the CPython interpreter, which we use by default, so it allows us to use all the packages/libraries supported by regular CPython. Though it has been tested successfully for importing packages like “pandas” and “nltk”, more tests are required to find the boundaries of its compatibility.

Java Embedded Python (JEP)


Prerequisites:
  1. JDK 6+
  2. Python 2.6+

Installation:
  1. sudo pip install jep

How to run JEP Shell?
The jep run script is installed in the /usr/local/bin directory. Open a terminal and type “jep” to launch the JEP shell.

How to set up and run a Python script in Java using JEP?
  1. Set the environment variables:
    1. LD_PRELOAD="/usr/lib/x86_64-linux-gnu/libpython2.7.so"
    2. LD_LIBRARY_PATH="/usr/lib:/usr/local/lib/python2.7/dist-packages/"
Note: if you are not sure of the values you should set for these environment variables, open the file “/usr/local/bin/jep” and check the values used there.
  2. Add the jar file “/usr/local/lib/python2.7/dist-packages/jep/jep-3.5.0.jar” to your classpath or build path.
  3. Instantiate the “Jep” class and use it to execute Python code.

Example 1:

Main.java:

import jep.Jep;
import jep.JepException;

public class Main {
    public static void main(String[] args) throws JepException {
        Jep jep = new Jep();
        jep.eval("import sys");
        jep.eval("s = 'Hello World'");
        jep.eval("print s");                       // Python 2 print statement
        // Pull the Python variable back into Java
        String javaString = jep.getValue("s").toString();
        System.out.println("Java String: " + javaString);
        // Run a whole script file
        jep.runScript("src/myscript.py");
        jep.close();                               // release the embedded interpreter
    }
}

myscript.py:

for i in range(1, 5):
    print i



Thursday 12 March 2015

Worker assignment for an Apache Giraph job

An Apache Giraph job must be assigned a number of workers (which is basically the count of mappers). But how many workers should we assign for the best runtime of our job?
Well, here are a few tips I got from my friend Semih from Stanford University, who also developed GPS, an alternative to the Apache Giraph project.

Tips:

Each node should be configured to have a mapper count equal to its number of processor cores. So if the processor of your Hadoop node has 2 cores, then set mapred.map.tasks=2.

Each job must be assigned workers in multiples of the Hadoop node count. Let's say you have an 8-node cluster; then you should assign the "-w" value as 7 or 15 or 23, etc. Note: the worker count starts from zero (0).
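
To make the arithmetic concrete (my reading of the tip, since Giraph runs its master coordinator in one of the map tasks): with 8 nodes and mapred.map.tasks=2 you get 8 × 2 = 16 map slots; one slot goes to the master, leaving 15 for workers, so "-w 15" fills the cluster exactly, and "-w 7" fills exactly half of it.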

Again, thanks to Semih, a PhD scholar from Stanford University, for these tips.

Firefox page refresh on a given interval

Apache Giraph converts its job into a MapReduce job, so after launching a Giraph job I can view the progress on the job tracker portal. Now, Apache Giraph runs in iterations called supersteps, so if I want to know the current job/superstep status, I need to keep pressing F5 to refresh. Then I thought I might find something for auto-refresh, and I found a very good addon for Firefox called ReloadEvery. It offers fixed-interval refresh options for any page: just right-click and you can find “Reload Every”. A great tool, and it works best for my problem.

Keylogger for Firefox

A few days back, I was thinking of testing a keylogger for Linux but could not find anything good. Then I thought of finding one for the browser. I found a few keylogger addons, but kl works great for me. It does not store any info about visited pages; it is just a pure key-press tracker.

Monday 9 March 2015

Writeup I sent to Semih at Stanford University for my Giraph Tester project.


Question: How to write and run your own Apache Giraph code (Computation / InputFormat / OutputFormat)?
Answer: There are two ways, to my knowledge.
Way 1:
1. Write your custom computation code.
2. Copy that file into the “giraph-examples” folder of the Apache Giraph source code.
3. Compile the whole Giraph source code again with Maven.
4. Use the Giraph jar to run your code, similar to the example given on the Giraph quick start page (a sample command follows the problem list below).
Problems:
Very slow process.
Hadoop pseudo-distributed setup required.
Input and output files are in HDFS.
Every time, you need to compile the Giraph source code with Maven.
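
The quick-start-style run command looks roughly like this (the jar version, computation class, and HDFS paths depend on your build and setup):

hadoop jar giraph-examples-1.1.0-for-hadoop-1.2.1-jar-with-dependencies.jar \
    org.apache.giraph.GiraphRunner \
    org.apache.giraph.examples.SimpleShortestPathsComputation \
    -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat \
    -vip /user/hduser/input/tiny_graph.txt \
    -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
    -op /user/hduser/output/shortestpaths \
    -w 1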

Way 2:
1. Use Eclipse or another IDE.
2. Add jar files from the Apache Hadoop lib folder & the Apache Giraph lib folder to your build path.
3. Write your custom InputFormat/OutputFormat/Computation Java code.
4. Write your Giraph runner Java code (the Giraph job runner file).
5. Run and test your code with a single click.
Advantages:
Faster than Way 1.
No Hadoop setup required.
Input and output files are on the local file system only.

In the development phase we make a lot of changes in our code and need fast results on our sample test input file, so Way 2 works better in this case. A sketch of such a local runner follows.
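
A minimal sketch, assuming Giraph 1.1's InternalVertexRunner test utility and the stock example computation and I/O formats (the graph data here is made up):

import org.apache.giraph.conf.GiraphConfiguration;
import org.apache.giraph.examples.SimpleShortestPathsComputation;
import org.apache.giraph.io.formats.IdWithValueTextOutputFormat;
import org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat;
import org.apache.giraph.utils.InternalVertexRunner;

public class LocalGiraphRunner {
    public static void main(String[] args) throws Exception {
        GiraphConfiguration conf = new GiraphConfiguration();
        conf.setComputationClass(SimpleShortestPathsComputation.class);
        conf.setVertexInputFormatClass(JsonLongDoubleFloatDoubleVertexInputFormat.class);
        conf.setVertexOutputFormatClass(IdWithValueTextOutputFormat.class);
        // Tiny in-memory graph, one vertex per line: [id, value, [[dest, edgeWeight], ...]]
        String[] graph = {
            "[0,0,[[1,1],[2,2]]]",
            "[1,0,[[2,1]]]",
            "[2,0,[]]"
        };
        // Runs the whole job in-process; no Hadoop cluster or HDFS needed
        Iterable<String> results = InternalVertexRunner.run(conf, graph);
        for (String line : results) {
            System.out.println(line);
        }
    }
}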

Question: How can we debug in these two cases?
Answer: We have already discussed the two ways to write/run Giraph code.
For Way 1, Semih and his team at Stanford University have developed a tool called Graft, which is now part of the Apache Giraph project.
For Way 2, I am not sure if Graft can also work. If not, then we can build another debugger that works for Way 2.


Question: How to approach building a debugger for Way 2?
Answer: The basic idea is to trace the state of vertices, edges & messages at each superstep, store it in JSON format, and plot it using a custom graph visualization program.


Question: How much progress have I made on building the debugger for Way 2?
Answer: I have partially developed a graph visualization program inspired by Graft. I have defined my own JSON format and am using it to plot graphs for the corresponding supersteps.


Question: Where am I stuck?
Answer: I am not able to figure out how to trace the program. I must store the trace in JSON format and visualize it.
I can either write my own trace method, which the user calls, feeding the current vertex & message status in as parameters (a sketch of this option follows below).
Or I may change the original Java source code of standard files (like BasicComputation) and repackage them in a jar, so the user would use my modified jar files instead of the original giraph-lib jars.
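
A rough sketch of the first option; the class name, file name, and JSON fields here are all hypothetical, not part of Giraph. In Way 2 everything runs in one JVM, so appending to a single local file is good enough:

import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

public class SuperstepTracer {
    private static PrintWriter out;

    private static synchronized PrintWriter writer() throws IOException {
        if (out == null) {
            // One JSON object per line, appended across supersteps
            out = new PrintWriter(new FileWriter("trace.json", true), true);
        }
        return out;
    }

    // The user calls this from compute(), once per vertex per superstep
    public static synchronized void trace(long superstep, Object vertexId,
                                          Object value, Iterable<?> messages) throws IOException {
        StringBuilder sb = new StringBuilder();
        sb.append("{\"superstep\":").append(superstep)
          .append(",\"vertex\":\"").append(vertexId)
          .append("\",\"value\":\"").append(value)
          .append("\",\"messages\":[");
        String sep = "";
        for (Object m : messages) {
            sb.append(sep).append('"').append(m).append('"');
            sep = ",";
        }
        sb.append("]}");
        writer().println(sb);
    }
}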

Friday 19 September 2014

Pregel: Distributed Graph Processing Framework (My Notes)


Motivation:-

Efficient processing of large graphs faces the following problems:
  • Poor locality of memory access
  • Very little work per vertex
  • Changing degree of parallelism over the course of execution

No scalable general-purpose system was available for implementing arbitrary graph algorithms over arbitrary graph representations in a large-scale distributed environment.

Implementing an algorithm to process a large graph can be done by one of the following options:
1) Designing a custom distributed infrastructure, which requires considerable implementation effort for each new algorithm or graph representation.
2) Using an available distributed computing platform that is not always well suited for graph processing, like MapReduce.
3) Using a single-computer graph library like BGL, LEDA, NetworkX, or JDSL, which limits scalability.
4) Using an existing parallel graph system like Parallel BGL & CGMgraph, but these do not address fault tolerance or other distributed-system issues.

Proposed Solution:-

Valiant’s Bulk Synchronous Parallel Model


Vertex-centric approach to solve the problem
Pregel computations consist of sequences of iterations called supersteps.
During each superstep S, the following operations can be performed:
  1. It can compute a user-defined function for each vertex V
  2. Each V can read the messages sent to it in superstep S-1
  3. It can send messages to other vertices that will be received in superstep S+1
  4. It can modify the state of V & its outgoing edges; it can also change the graph topology
Note: Messages are typically sent along outgoing edges, but a message can be sent to any vertex whose identifier is known.

Computation Model:-
Input: Directed Graph
Each vertex has a vertex identifier and a modifiable user-defined value. Each directed edge has a source vertex, a modifiable user-defined value, and a target vertex identifier.
A Pregel computation consists of:
  • Input
  • Supersteps separated by global synchronization points
  • Algorithm termination
  • Output
Each vertex computes in parallel with the same user-defined function.
Algorithm termination: the algorithm terminates when every vertex votes to halt.
For example:
Consider superstep 0, when all vertices are active. A vertex deactivates itself by voting to halt. If a halted vertex receives any messages, it is activated again. To return to the deactivated state, the vertex must vote to halt again. The algorithm terminates when every vertex has voted to halt.
Vertices & edges can be added & removed during the computation.


Advantages over MapReduce:-
  1. A graph algorithm can be written as a series of chained MapReduce invocations, but this has poor performance & usability. Pregel overcomes these problems.
  2. Pregel keeps vertices & edges on the machine where it performs the computation & uses the network only for message passing, whereas MapReduce uses heavy network bandwidth.
  3. MapReduce is a functional style of programming, so expressing a graph algorithm as chained MapReduce jobs requires passing the entire state of the graph from one stage to the next, producing a large overhead in communication & associated serialization.
  4. The steps of a chained MapReduce need to be coordinated, which adds programming complexity. That is avoided in Pregel because of the bulk synchronous model.

Sample Problem Solved using Pregel:-

Objective: Find the maximum value
In the paper's accompanying figure, dotted lines are messages and dark vertices have voted to halt. In each superstep, vertices send the maximum value seen so far to their neighbor vertices. If a vertex learns nothing larger than its current value, it votes to halt. The algorithm terminates when every vertex has halted.
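
A Giraph-style sketch of this computation (Giraph is an open-source implementation of the Pregel model; the type parameters and class name here are my own choices):

import java.io.IOException;

import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;

public class MaxValueComputation
        extends BasicComputation<LongWritable, DoubleWritable, FloatWritable, DoubleWritable> {
    @Override
    public void compute(Vertex<LongWritable, DoubleWritable, FloatWritable> vertex,
                        Iterable<DoubleWritable> messages) throws IOException {
        double max = vertex.getValue().get();
        for (DoubleWritable m : messages) {
            max = Math.max(max, m.get());   // read messages sent in superstep S-1
        }
        // Broadcast only when we learn a larger value (and always in superstep 0)
        if (getSuperstep() == 0 || max > vertex.getValue().get()) {
            vertex.setValue(new DoubleWritable(max));
            sendMessageToAllEdges(vertex, new DoubleWritable(max));
        }
        vertex.voteToHalt();   // reactivated automatically if a message arrives
    }
}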


References:-


 [1] G. Malewicz, M. H. Austern, A. J. C. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: A System for Large-Scale Graph Processing. In SIGMOD, 2010.