A Spark+Scala implementation of neighborhood sampling and feature learning for large graphs.
This repository includes an implementation of node2vec:
- Scalable node2vec
  - Second-order random walk (a sketch of the transition rule is shown after this list)
  - Memory-efficient graph data structure, introduced at Spark Summit 2017 in "Random Walks on Large Scale Graphs with Apache Spark" by Min Shen (LinkedIn)
  - Leverages graph partition information to optimize communication and speed up the random walk computation
- Compatible with Spark-JobServer
 
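The second-order walk biases each step with the return parameter p and the in-out parameter q, as in the node2vec paper: stepping back to the previous vertex is weighted by 1/p, staying within the previous vertex's neighborhood by 1, and moving further away by 1/q. The Scala sketch below illustrates only this transition rule; the types and names (Graph, bias, step) are assumptions for exposition, not this repository's actual API.

```scala
import scala.util.Random

object SecondOrderWalkSketch {
  // vertex -> weighted adjacency list; an illustrative in-memory stand-in
  // for the repository's distributed, memory-efficient structure
  type Graph = Map[Int, Seq[(Int, Double)]]

  // node2vec bias for stepping from `curr` to `next`, having arrived from `prev`
  def bias(graph: Graph, prev: Int, next: Int, p: Double, q: Double): Double =
    if (next == prev) 1.0 / p                          // distance 0: return to prev
    else if (graph(prev).exists(_._1 == next)) 1.0     // distance 1: stay near prev
    else 1.0 / q                                       // distance 2: move outward

  // Sample the next vertex proportionally to edge weight * bias
  def step(graph: Graph, prev: Int, curr: Int, p: Double, q: Double, rnd: Random): Int = {
    val scored = graph(curr).map { case (x, w) => (x, w * bias(graph, prev, x, p, q)) }
    var r = rnd.nextDouble() * scored.map(_._2).sum
    scored.find { case (_, s) => r -= s; r <= 0 }.map(_._1).getOrElse(scored.last._1)
  }
}
```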
Requirements:
- Scala 2.11 or later
- Maven 3+
- Java 8+
- Apache Spark 2.2.0 or later
- (Optional) Spark-JobServer 0.8.0 or later
 
Get the random walk application source code:
- Using git: git clone git@github.com:data61/stellar-random-walk.git
- Using HTTP: download the zip file from the repository page and unzip it.
 
Go to the source code directory. A pre-built jar file, named randomwalk-0.0.1-SNAPSHOT.jar, is available at ./target; this is the jar file you use to run the application. If you want to build the jar from source, you need Apache Maven installed; then run:

   mvn clean package

Download Apache Spark 2.2.0 or later (e.g., release 2.2.1, pre-built for Apache Hadoop 2.7).
 
To run the application on your machine, you can use the spark-submit script. Go to the Apache Spark directory and run the application with the following command:

   bin/spark-submit --class au.csiro.data61.randomwalk.Main [random walk dir]/target/randomwalk-0.0.1-SNAPSHOT.jar
To run the application with Spark-JobServer, first make sure that the prerequisites are installed:
- Apache Spark (e.g., release 2.2.1, pre-built for Apache Hadoop 2.7)
- Java Virtual Machine (e.g., 9.0.1)
- sbt
 
Clone the Spark-JobServer repository:

   git clone https://github.com/spark-jobserver/spark-jobserver.git
Create a .bashrc with the paths to the JVM, sbt, and Spark; e.g., for macOS users it would be the following:

   export SBT=/usr/local/Cellar/sbt/1.1.0
   export SPARK_HOME=~/spark-2.2.1-bin-hadoop2.7
   export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk-9.0.1.jdk
   export PATH=$JAVA_HOME/bin:$SBT/bin:$PATH

Run:

   source .bashrc
Go to the Spark folder and run start-all.sh:

   cd spark-2.2.1-bin-hadoop2.7/
   sbin/start-all.sh
From the Spark-JobServer directory, run the sbt shell:

   cd spark-jobserver/job-server/
   sbt
Once inside the sbt shell, run reStart.

Go to http://localhost:8090/ and make sure that the Spark Job Server UI is working. (Note: in Chrome, binaries were not updated properly, while in Firefox it was OK.)
Upload the randomwalk jar to the server:

   curl --data-binary @randomwalk/target/randomwalk-0.0.1-SNAPSHOT.jar localhost:8090/jars/randomwalk

Submit a job:

   curl -d "rw.input = --cmd randomwalk --numWalks 1 --p 1 --q 1 --walkLength 10 --rddPartitions 10 --directed false --input [random walk dir]/src/test/resources/karate.txt --output [output dir] --partitioned false" 'localhost:8090/jobs?appName=randomwalk&classPath=au.csiro.data61.randomwalk.Main'

Check the status in the Spark Job-server UI.
 
The following options are available (default values are shown after the colon):
   --walkLength <value>     walkLength: 80
   --numWalks <value>       numWalks: 10
   --p <value>              return parameter p: 1.0
   --q <value>              in-out parameter q: 1.0
   --rddPartitions <value>  Number of RDD partitions in running Random Walk and Word2vec: 200
   --weighted <value>       weighted: true
   --directed <value>       directed: false
   --w2vPartitions <value>  Number of partitions in word2vec: 10
   --input <value>          Input edge-file/paths-file: empty
   --output <value>         Output path: empty
   --singleOutput <value>   Whether to write the output in a single file: true
   --cmd <value>            command: randomwalk/embedding/node2vec (to run randomwalk + embedding)
   --partitioned <value>    Whether the graph is partitioned: false
   --lr <value>             Learning rate in word2vec: 0.025
   --iter <value>           Number of iterations in word2vec: 10
   --dim <value>            Number of dimensions in word2vec: 128
   --window <value>         Window size in word2vec: 10
For example:
   bin/spark-submit --class au.csiro.data61.randomwalk.Main ./randomwalk/target/randomwalk-0.0.1-SNAPSHOT.jar \
   --cmd randomwalk --numWalks 1 --p 1 --q 1 --walkLength 10 --rddPartitions 10 \
   --input [input edge list] --output [output directory] --partitioned false
You can choose the algorithm to run with the --cmd option:
- randomwalk: run just the second-order random walk.
- embedding: run just word2vec, given paths as input.
- node2vec: run random walk + embedding.
 
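For example, assuming walks from a previous randomwalk run are available as input, the embedding step alone could be invoked along these lines (bracketed paths are placeholders):

   bin/spark-submit --class au.csiro.data61.randomwalk.Main ./randomwalk/target/randomwalk-0.0.1-SNAPSHOT.jar \
   --cmd embedding --input [paths file] --output [output directory] \
   --dim 128 --window 10 --iter 10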
The input graph must be an edge list with integer vertex IDs. For example:
src1-id dst1-id
src1-id dst2-id
... 
If the graph is weighted, it must include the weight in the last column for each edge. For example:
src1-id dst1-id 1.0
If the graph is partitioned, each edge should have a partition number, i.e., should be assigned to a partition. The partition number must be in the third column of the edge list. For example:
src1-id dst1-id 1 1.0
src1-id dst2-id 1 1.0
src3-id dst1-id 2 1.0
... 
The application itself will replicate (cut) vertices that span multiple partitions.
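To make the accepted layouts concrete, here is a small illustrative Scala parser for the three line formats described above; it is a sketch under the stated format assumptions, not the repository's actual loader:

```scala
// One edge per input line; partition and weight columns are optional
case class Edge(src: Int, dst: Int, partition: Option[Int], weight: Double)

def parseEdge(line: String, weighted: Boolean, partitioned: Boolean): Edge =
  line.trim.split("\\s+") match {
    case Array(s, d) if !weighted && !partitioned =>
      Edge(s.toInt, d.toInt, None, 1.0)                    // "src dst"
    case Array(s, d, w) if weighted && !partitioned =>
      Edge(s.toInt, d.toInt, None, w.toDouble)             // "src dst weight"
    case Array(s, d, part, w) if partitioned =>
      Edge(s.toInt, d.toInt, Some(part.toInt), w.toDouble) // "src dst partition weight"
    case _ =>
      throw new IllegalArgumentException(s"Unexpected edge format: $line")
  }
```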
The application writes its output to disk in the directory given by the --output parameter. In this directory, three folders may be created:
- /path: the random walk results.
- /vec: vector representations of vertices.
- /bin: the word2vec model's metadata.
 
The output of the random walk (in the /path directory) is in the format of tab-separated integers (vertex-ids), where each line represents the random steps starting from a vertex. The number of lines is equal to the number of generated paths (--numWalks * |V|). The result is written as plain text, partitioned into --rddPartitions files unless you set --singleOutput to true.
The embeddings are also in the format of tab-separated numbers per line, where the first number is the vertex-id and the rest of the numbers in that line are the vertex's vector representation. The result is written as plain text, partitioned into --rddPartitions files unless you set --singleOutput to true.
An example of randomwalk output:
4	2	4	2	4
32	26	32	33	16
20	2	31	2	31
...
An example of vector representations:
34	-0.055403516	-0.007816158	0.053844377	-0.038890388	0.13484207	-0.07744342
12	-0.025323598	0.054739047	-0.077121474	-0.016575111	-0.041371193	-0.046176624
8	-0.07687951	0.120588094	-0.11744399	-0.058044944	-0.061426725	-0.13111477
...
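For downstream use, such embedding lines can be read back into vectors. The following Scala sketch assumes the tab-separated layout described above and a single plain-text output file (e.g., produced with --singleOutput true):

```scala
import scala.io.Source

// Parse "/vec" lines of the form: vertex-id<TAB>x1<TAB>x2<TAB>... into a map
def loadEmbeddings(path: String): Map[Int, Array[Double]] = {
  val src = Source.fromFile(path)
  try {
    src.getLines()
      .map(_.trim.split("\t"))
      .map(cols => cols.head.toInt -> cols.tail.map(_.toDouble))
      .toMap
  } finally src.close()
}
```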
References:
- Grover, Aditya, and Jure Leskovec. "node2vec: Scalable feature learning for networks." Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016.
- Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013).
- Min Shen (LinkedIn). "Random Walks on Large Scale Graphs with Apache Spark." Spark Summit 2017.