adding Dockerfiles
parent 0d9c4a7392
commit 676c30f5f4

.dockerignore (Normal file, 3 changes)
@@ -0,0 +1,3 @@
.git
Dockerfile*
.gitignore

Dockerfile (Normal file, 6 changes)
@@ -0,0 +1,6 @@
FROM gcr.io/tensorflow/tensorflow:1.3.0

RUN pip install networkx==1.11
RUN rm /notebooks/*

COPY . /notebooks

Dockerfile.gpu (Normal file, 6 changes)
@@ -0,0 +1,6 @@
FROM gcr.io/tensorflow/tensorflow:1.3.0-gpu

RUN pip install networkx==1.11
RUN rm /notebooks/*

COPY . /notebooks

README.md (42 changes)
@@ -7,21 +7,21 @@
### Overview

This directory contains code necessary to run the GraphSage algorithm.
GraphSage can be viewed as a stochastic generalization of graph convolutions, and it is especially useful for massive, dynamic graphs that contain rich feature information.
See our [paper](https://arxiv.org/pdf/1706.02216.pdf) for details on the algorithm.

*Note:* GraphSage now also has better support for training on smaller, static graphs and graphs that don't have node features.
The original algorithm and paper are focused on the task of inductive generalization (i.e., generating embeddings for nodes that were not present during training),
but many benchmarks/tasks use simple static graphs that do not necessarily have features.
To support this use case, GraphSage now includes optional "identity features" that can be used with or without other node attributes.
Including identity features will increase the runtime, but also potentially increase performance (at the usual risk of overfitting).
See the section on "Running the code" below.

The example_data subdirectory contains a small example of the protein-protein interaction data,
which includes 3 training graphs + one validation graph and one test graph.
The full Reddit and PPI datasets (described in the paper) are available on the [project website](http://snap.stanford.edu/graphsage/).

If you make use of this code or the GraphSage algorithm in your work, please cite the following paper:

@inproceedings{hamilton2017inductive,
author = {Hamilton, William L. and Ying, Rex and Leskovec, Jure},
@@ -38,30 +38,46 @@ Recent versions of TensorFlow, numpy, scipy, and networkx are required.

The example_unsupervised.sh and example_supervised.sh files contain example usages of the code, which use the unsupervised and supervised variants of GraphSage, respectively.

If your benchmark/task does not require generalizing to unseen data, we recommend you try setting the "--identity_dim" flag to a value in the range [64,256].
This flag will make the model embed unique node ids as attributes, which will increase the runtime and number of parameters but also potentially increase the performance.
Note that you should set this flag and *not* try to pass dense one-hot vectors as features (due to sparsity).
The "dimension" of identity features specifies how many parameters there are per node in the sparse identity-feature lookup table.
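
As a rough illustration of why a lookup table is preferable to dense one-hot inputs, here is a minimal pure-Python sketch. The helper names (`make_identity_table`, `lookup`, `one_hot_matmul`) are hypothetical and are not part of the GraphSage code; the sketch only shows that the table holds `num_nodes * identity_dim` parameters and that reading one row gives the same result as multiplying a one-hot vector by the whole table.

```python
import random


def make_identity_table(num_nodes, identity_dim, seed=0):
    """Hypothetical helper: one row of `identity_dim` parameters per node."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(identity_dim)]
            for _ in range(num_nodes)]


def lookup(table, node_index):
    """Sparse access: read a single row, work proportional to identity_dim."""
    return table[node_index]


def one_hot_matmul(table, node_index):
    """Dense equivalent: one-hot vector times the table, touching every row."""
    num_nodes = len(table)
    dim = len(table[0])
    one_hot = [1.0 if i == node_index else 0.0 for i in range(num_nodes)]
    return [sum(one_hot[i] * table[i][d] for i in range(num_nodes))
            for d in range(dim)]


table = make_identity_table(num_nodes=5, identity_dim=4)
num_params = len(table) * len(table[0])  # parameters grow linearly with node count
same = lookup(table, 2) == one_hot_matmul(table, 2)
```

The dense variant does work proportional to the number of nodes for every lookup, which is the cost the identity-feature flag is meant to avoid.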

Note that example_unsupervised.sh sets a very small max iteration number, which can be increased to improve performance.
We generally found that performance continued to improve even after the loss was very near convergence (i.e., even when the loss was decreasing at a very slow rate).

*Note:* For the PPI data, and any other multi-output dataset that allows individual nodes to belong to multiple classes, it is necessary to set the `--sigmoid` flag during supervised training. By default the model assumes that the dataset is in the "one-hot" categorical setting.
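
The difference between the two settings can be sketched in a few lines of plain Python. This is an illustration only, not the repository's loss code; the logits and the 0.5 decision threshold are made-up assumptions.

```python
import math


def softmax(logits):
    """One-hot setting: scores compete and sum to 1, so a single class wins."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(logits):
    """Multi-label setting: each class gets its own independent probability."""
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]


logits = [3.0, 2.5, -4.0]  # made-up scores for three classes
soft = softmax(logits)
sig = sigmoid(logits)

# With independent sigmoids, several classes can be predicted at once.
predicted_labels = [i for i, p in enumerate(sig) if p > 0.5]
```

With softmax the two high-scoring classes split the probability mass between them, whereas the sigmoid outputs let both be predicted for the same node.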

#### Docker

You can run GraphSage inside a [docker](https://docs.docker.com/) image. After cloning the project, build and run the image as follows:

    $ docker build -t graphsage .
    $ docker run -it graphsage bash

or start a Jupyter Notebook instead of bash:

    $ docker run -it -p 8888:8888 graphsage

You can also run the GPU image using [nvidia-docker](https://github.com/NVIDIA/nvidia-docker):

    $ docker build -t graphsage:gpu -f Dockerfile.gpu .
    $ nvidia-docker run -it graphsage:gpu bash

#### Input format
As input, at minimum the code requires a --train_prefix option, which specifies the following data files:

* <train_prefix>-G.json -- A networkx-specified json file describing the input graph. Nodes have 'val' and 'test' attributes specifying if they are a part of the validation and test sets, respectively.
* <train_prefix>-id_map.json -- A json-stored dictionary mapping the graph node ids to consecutive integers.
* <train_prefix>-class_map.json -- A json-stored dictionary mapping the graph node ids to classes.
* <train_prefix>-feats.npy [optional] --- A numpy-stored array of node features; ordering given by id_map.json. Can be omitted and only identity features will be used.
* <train_prefix>-walks.txt [optional] --- A text file specifying random walk co-occurrences (one pair per line) (*only for unsupervised version of graphsage)
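
A minimal sketch of writing the two dictionary files for a toy graph, using only the standard library. The node ids, labels, and the `toy` prefix are made up; the file holding the node-to-class mapping is written as `-class_map.json` here, which is an assumption about the intended name. The `-G.json` file is omitted; it would typically be produced by serialising a networkx graph whose nodes carry the `val` and `test` attributes shown below (for example with `json_graph.node_link_data`).

```python
import json
import os
import tempfile

# Toy node list with made-up ids; 'val' and 'test' mark the evaluation splits.
nodes = [
    {"id": "a", "val": False, "test": False},
    {"id": "b", "val": False, "test": False},
    {"id": "c", "val": True, "test": False},
    {"id": "d", "val": False, "test": True},
]

# Graph node ids -> consecutive integers (also the row order of feats.npy).
id_map = {node["id"]: index for index, node in enumerate(nodes)}

# Graph node ids -> classes (a single made-up integer label per node).
class_map = {"a": 0, "b": 1, "c": 0, "d": 1}

out_dir = tempfile.mkdtemp()
train_prefix = os.path.join(out_dir, "toy")  # the value passed as --train_prefix

for suffix, payload in (("-id_map.json", id_map), ("-class_map.json", class_map)):
    with open(train_prefix + suffix, "w") as handle:
        json.dump(payload, handle)

# Read one file back to confirm it round-trips.
with open(train_prefix + "-id_map.json") as handle:
    reloaded = json.load(handle)
```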

To run the model on a new dataset, you need to make data files in the format described above.
To run random walks for the unsupervised model (and to generate the <prefix>-walks.txt file),
you can use the `run_walks` function in `graphsage.utils`.
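
For readers who want to see what "random walk co-occurrences" means in practice, here is a small standalone sketch. It is not the repository's `run_walks` implementation: the function name `random_walk_pairs`, its parameters, and the tab separator used when formatting lines are all assumptions made for illustration.

```python
import random


def random_walk_pairs(adjacency, num_walks=2, walk_len=3, seed=0):
    """Return (start, visited) pairs collected from short random walks.

    `adjacency` maps each node to a list of neighbours and is assumed to
    describe an undirected graph, so every reachable node has a neighbour.
    """
    rng = random.Random(seed)
    pairs = []
    for start in sorted(adjacency):
        if not adjacency[start]:
            continue  # isolated nodes cannot start a walk
        for _ in range(num_walks):
            current = start
            for _ in range(walk_len):
                current = rng.choice(adjacency[current])
                if current != start:  # a node co-occurring with itself is useless
                    pairs.append((start, current))
    return pairs


# Path graph 0 - 1 - 2 - 3 with made-up integer ids.
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
pairs = random_walk_pairs(adjacency)

# One pair per line, as the walks file description requires.
lines = ["{}\t{}".format(a, b) for a, b in pairs]
```

Longer walks and more walks per node yield more co-occurrence pairs, at the cost of a larger file.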

#### Model variants
The user must also specify a --model, the variants of which are described in detail in the paper:
* graphsage_mean -- GraphSage with mean-based aggregator
* graphsage_seq -- GraphSage with LSTM-based aggregator
@@ -70,7 +86,7 @@ The user must also specify a --model, the variants of which are described in det
* n2v -- an implementation of [DeepWalk](https://arxiv.org/abs/1403.6652) (called n2v for short in the code.)

#### Logging directory
Finally, a --base_log_dir should be specified (it defaults to the current directory).
The output of the model and log files will be stored in a subdirectory of the base_log_dir.
The path to the logged data will be of the form `<sup/unsup>-<data_prefix>/graphsage-<model_description>/`.
The supervised model will output F1 scores, while the unsupervised model will train embeddings and store them.
@@ -86,5 +102,5 @@ The `eval_scripts` directory contains examples of feeding the embeddings into si
#### Acknowledgements

The original version of this code base was forked from https://github.com/tkipf/gcn/, and we owe many thanks to Thomas Kipf for making his code available.
We also thank Yuanfang Li and Xin Li who contributed to a course project that was based on this work.
Please see the [paper](https://arxiv.org/pdf/1706.02216.pdf) for funding details and additional (non-code related) acknowledgements.