Merge ppi eval script modification with branch 'master' of https://github.com/williamleif/GraphSAGE

RexYing 2017-10-13 13:29:41 -07:00
commit d77df9ef65
14 changed files with 288 additions and 85 deletions

3
.dockerignore Normal file

@ -0,0 +1,3 @@
.git
Dockerfile*
.gitignore

6
Dockerfile Normal file

@ -0,0 +1,6 @@
FROM gcr.io/tensorflow/tensorflow:1.3.0
RUN pip install networkx==1.11
RUN rm /notebooks/*
COPY . /notebooks

6
Dockerfile.gpu Normal file

@ -0,0 +1,6 @@
FROM gcr.io/tensorflow/tensorflow:1.3.0-gpu
RUN pip install networkx==1.11
RUN rm /notebooks/*
COPY . /notebooks


@ -1,4 +1,4 @@
## GraphSAGE: Inductive Representation Learning on Large Graphs ## GraphSage: Representation Learning on Large Graphs
#### Authors: [William L. Hamilton](http://stanford.edu/~wleif) (wleif@stanford.edu), [Rex Ying](http://joy-of-thinking.weebly.com/) (rexying@stanford.edu) #### Authors: [William L. Hamilton](http://stanford.edu/~wleif) (wleif@stanford.edu), [Rex Ying](http://joy-of-thinking.weebly.com/) (rexying@stanford.edu)
#### [Project Website](http://snap.stanford.edu/graphsage/) #### [Project Website](http://snap.stanford.edu/graphsage/)
@ -6,55 +6,87 @@
### Overview ### Overview
This directory contains code necessary to run the GraphSAGE algorithm. This directory contains code necessary to run the GraphSage algorithm.
GraphSage can be viewed as a stochastic generalization of graph convolutions, and it is especially useful for massive, dynamic graphs that contain rich feature information.
See our [paper](https://arxiv.org/pdf/1706.02216.pdf) for details on the algorithm. See our [paper](https://arxiv.org/pdf/1706.02216.pdf) for details on the algorithm.
*Note:* GraphSage now also has better support for training on smaller, static graphs and graphs that don't have node features.
The original algorithm and paper are focused on the task of inductive generalization (i.e., generating embeddings for nodes that were not present during training),
but many benchmarks/tasks use simple static graphs that do not necessarily have features.
To support this use case, GraphSage now includes optional "identity features" that can be used with or without other node attributes.
Including identity features will increase the runtime, but also potentially increase performance (at the usual risk of overfitting).
See the section on "Running the code" below.
The example_data subdirectory contains a small example of the protein-protein interaction data, The example_data subdirectory contains a small example of the protein-protein interaction data,
which includes 3 training graphs + one validation graph and one test graph. which includes 3 training graphs + one validation graph and one test graph.
The full Reddit and PPI datasets (described in the paper) are available on the [project website](http://snap.stanford.edu/graphsage/). The full Reddit and PPI datasets (described in the paper) are available on the [project website](http://snap.stanford.edu/graphsage/).
If you make use of this code or the GraphSAGE algorithm in your work, please cite the following paper: If you make use of this code or the GraphSage algorithm in your work, please cite the following paper:
@article{hamilton2017inductive, @inproceedings{hamilton2017inductive,
author = {Hamilton, William L. and Ying, Rex and Leskovec, Jure}, author = {Hamilton, William L. and Ying, Rex and Leskovec, Jure},
title = {Inductive Representation Learning on Large Graphs}, title = {Inductive Representation Learning on Large Graphs},
journal = {arXiv preprint, arXiv:1603.04467}, booktitle = {NIPS},
year = {2017} year = {2017}
} }
### Requirements ### Requirements
Recent versions of TensorFlow, numpy, scipy, and networkx are required. Recent versions of TensorFlow, numpy, scipy, and networkx are required (but networkx must be <=1.11). To guarantee that you have the right package versions, you can use [docker](https://docs.docker.com/) to easily set up a virtual environment. See the Docker subsection below for more info.
#### Docker
If you do not have [docker](https://docs.docker.com/) installed, you will need to install it. (Just click on the preceding link; the installation is pretty painless.)
You can run GraphSage inside a [docker](https://docs.docker.com/) image. After cloning the project, build and run the image as follows:
$ docker build -t graphsage .
$ docker run -it graphsage bash
or start a Jupyter Notebook instead of bash:
$ docker run -it -p 8888:8888 graphsage
You can also run the GPU image using [nvidia-docker](https://github.com/NVIDIA/nvidia-docker):
$ docker build -t graphsage:gpu -f Dockerfile.gpu .
$ nvidia-docker run -it graphsage:gpu bash
### Running the code ### Running the code
The example_unsupervised.sh and example_supervised.sh files contain example usages of the code, which use the unsupervised and supervised variants of GraphSAGE, respectively. The example_unsupervised.sh and example_supervised.sh files contain example usages of the code, which use the unsupervised and supervised variants of GraphSage, respectively.
If your benchmark/task does not require generalizing to unseen data, we recommend you try setting the "--identity_dim" flag to a value in the range [64,256].
This flag will make the model embed unique node ids as attributes, which will increase the runtime and number of parameters but also potentially increase the performance.
Note that you should set this flag and *not* try to pass dense one-hot vectors as features (due to sparsity).
The "dimension" of identity features specifies how many parameters there are per node in the sparse identity-feature lookup table.
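As a rough illustration of what identity features amount to, the sketch below keeps one row of `identity_dim` values per node and prepends it to that node's attributes, if it has any. It is plain Python rather than the repository's TensorFlow code; `make_identity_table` and `node_input` are hypothetical names, and in the real model the table would be trained rather than left at its random initial values.

```python
import random


def make_identity_table(num_nodes, identity_dim, seed=0):
    """One row of identity_dim values per node (hypothetical helper)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(identity_dim)]
            for _ in range(num_nodes)]


def node_input(node_idx, identity_table, features=None):
    """Identity row first, then the node's own attributes (if given)."""
    row = list(identity_table[node_idx])
    if features is not None:
        row += list(features[node_idx])
    return row


table = make_identity_table(num_nodes=5, identity_dim=4)
feats = [[float(i), float(i) * 2] for i in range(5)]  # toy 2-d attributes

x = node_input(3, table, feats)           # identity features + attributes
x_featureless = node_input(3, table)      # identity features only
num_params = len(table) * len(table[0])   # grows linearly with the node count
```

The parameter count is `num_nodes * identity_dim`, which is why the flag increases model size and ties the model to the nodes seen during training.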
Note that example_unsupervised.sh sets a very small max iteration number, which can be increased to improve performance. Note that example_unsupervised.sh sets a very small max iteration number, which can be increased to improve performance.
We generally found that performance continued to improve even after the loss was very near convergence (i.e., even when the loss was decreasing at a very slow rate). We generally found that performance continued to improve even after the loss was very near convergence (i.e., even when the loss was decreasing at a very slow rate).
*Note:* For the PPI data, and any other multi-output dataset that allows individual nodes to belong to multiple classes, it is necessary to set the `--sigmoid` flag during supervised training. By default the model assumes that the dataset is in the "one-hot" categorical setting. *Note:* For the PPI data, and any other multi-output dataset that allows individual nodes to belong to multiple classes, it is necessary to set the `--sigmoid` flag during supervised training. By default the model assumes that the dataset is in the "one-hot" categorical setting.
#### Input format #### Input format
As input, at minimum the code requires that a --train_prefix option is specified which specifies the following data files: As input, at minimum the code requires that a --train_prefix option is specified which specifies the following data files:
* <train_prefix>-G.json -- A networkx-specified json file describing the input graph. Nodes have 'val' and 'test' attributes specifying if they are a part of the validation and test sets, respectively. * <train_prefix>-G.json -- A networkx-specified json file describing the input graph. Nodes have 'val' and 'test' attributes specifying if they are a part of the validation and test sets, respectively.
* <train_prefix>-id_map.json -- A json-stored dictionary mapping the graph node ids to consecutive integers. * <train_prefix>-id_map.json -- A json-stored dictionary mapping the graph node ids to consecutive integers.
* <train_prefix>-class_map.json -- A json-stored dictionary mapping the graph node ids to classes. * <train_prefix>-class_map.json -- A json-stored dictionary mapping the graph node ids to classes.
* <train_prefix>-feats.npy --- A numpy-stored array of node features; ordering given by id_map.json * <train_prefix>-feats.npy [optional] --- A numpy-stored array of node features; ordering given by id_map.json. Can be omitted and only identity features will be used.
* <train_prefix>-walks.txt --- A text file specifying random walk co-occurrences (one pair per line) (*only for unsupervised version of graphsage) * <train_prefix>-walks.txt [optional] --- A text file specifying random walk co-occurrences (one pair per line) (*only for unsupervised version of graphsage)
To run the model on a new dataset, you need to make data files in the format described above. To run the model on a new dataset, you need to make data files in the format described above.
To run random walks for the unsupervised model (and to generate the <prefix>-walks.txt file) To run random walks for the unsupervised model (and to generate the <prefix>-walks.txt file)
you can use the `run_walks` function in `graphsage.utils`. you can use the `run_walks` function in `graphsage.utils`.
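For a new dataset, the two JSON dictionaries are the simplest files to produce. The following is a minimal sketch using only the standard library; `write_maps` is a hypothetical helper, the toy node ids and labels are made up, and the id-to-class file is assumed here to be named `<train_prefix>-class_map.json`. The graph file and the optional feature array are not covered.

```python
import json
import os
import tempfile


def write_maps(prefix, node_ids, labels):
    """Write <prefix>-id_map.json and <prefix>-class_map.json (sketch)."""
    # Node ids -> consecutive integers, in the order the nodes are listed.
    id_map = {str(n): i for i, n in enumerate(node_ids)}
    # Node ids -> class (a list per node in the multi-label case).
    class_map = {str(n): labels[n] for n in node_ids}
    with open(prefix + "-id_map.json", "w") as f:
        json.dump(id_map, f)
    with open(prefix + "-class_map.json", "w") as f:
        json.dump(class_map, f)
    return id_map, class_map


out_dir = tempfile.mkdtemp()
prefix = os.path.join(out_dir, "toy")
node_ids = ["a", "b", "c"]
labels = {"a": [1, 0], "b": [0, 1], "c": [1, 1]}  # multi-label toy data
id_map, class_map = write_maps(prefix, node_ids, labels)
```

With list-valued labels like these, the `--sigmoid` flag would be needed for supervised training.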
#### Model variants #### Model variants
The user must also specify a --model, the variants of which are described in detail in the paper: The user must also specify a --model, the variants of which are described in detail in the paper:
* graphsage_mean -- GraphSAGE with mean-based aggregator * graphsage_mean -- GraphSage with mean-based aggregator
* graphsage_seq -- GraphSAGE with LSTM-based aggregator * graphsage_seq -- GraphSage with LSTM-based aggregator
* graphsage_pool -- GraphSAGE with max-pooling aggregator * graphsage_maxpool -- GraphSage with max-pooling aggregator (as described in the NIPS 2017 paper)
* gcn -- GraphSAGE with GCN-based aggregator * graphsage_meanpool -- GraphSage with mean-pooling aggregator (a variant of the pooling aggregator, where the element-wise mean replaces the element-wise max).
* gcn -- GraphSage with GCN-based aggregator
* n2v -- an implementation of [DeepWalk](https://arxiv.org/abs/1403.6652) (called n2v for short in the code.) * n2v -- an implementation of [DeepWalk](https://arxiv.org/abs/1403.6652) (called n2v for short in the code.)
#### Logging directory #### Logging directory
@ -67,7 +99,7 @@ Note that the full log outputs and stored embeddings can be 5-10Gb in size (on t
#### Using the output of the unsupervised models #### Using the output of the unsupervised models
The unsupervised variants of GraphSAGE will output embeddings to the logging directory as described above. The unsupervised variants of GraphSage will output embeddings to the logging directory as described above.
These embeddings can then be used in downstream machine learning applications. These embeddings can then be used in downstream machine learning applications.
The `eval_scripts` directory contains examples of feeding the embeddings into simple logistic classifiers. The `eval_scripts` directory contains examples of feeding the embeddings into simple logistic classifiers.


@ -31,11 +31,11 @@ def run_regression(train_embeds, train_labels, test_embeds, test_labels):
if __name__ == '__main__': if __name__ == '__main__':
parser = ArgumentParser("Run evaluation on citation data.") parser = ArgumentParser("Run evaluation on citation data.")
parser.add_argument("dataset_dir", help="Path to directory containing the dataset.") parser.add_argument("dataset_dir", help="Path to directory containing the dataset.")
parser.add_argument("data_dir", help="Path to directory containing the learned node embeddings.") parser.add_argument("embed_dir", help="Path to directory containing the learned node embeddings.")
parser.add_argument("setting", help="Either val or test.") parser.add_argument("setting", help="Either val or test.")
args = parser.parse_args() args = parser.parse_args()
dataset_dir = args.dataset_dir dataset_dir = args.dataset_dir
data_dir = args.data_dir data_dir = args.embed_dir
setting = args.setting setting = args.setting
print("Loading data...") print("Loading data...")


@ -32,11 +32,11 @@ def run_regression(train_embeds, train_labels, test_embeds, test_labels):
if __name__ == '__main__': if __name__ == '__main__':
parser = ArgumentParser("Run evaluation on PPI data.") parser = ArgumentParser("Run evaluation on PPI data.")
parser.add_argument("dataset_dir", help="Path to directory containing the dataset.") parser.add_argument("dataset_dir", help="Path to directory containing the dataset.")
parser.add_argument("data_dir", help="Path to directory containing the learned node embeddings. Set to 'feat' for raw features.") parser.add_argument("embed_dir", help="Path to directory containing the learned node embeddings. Set to 'feat' for raw features.")
parser.add_argument("setting", help="Either val or test.") parser.add_argument("setting", help="Either val or test.")
args = parser.parse_args() args = parser.parse_args()
dataset_dir = args.dataset_dir dataset_dir = args.dataset_dir
data_dir = args.data_dir data_dir = args.embed_dir
setting = args.setting setting = args.setting
print("Loading data...") print("Loading data...")


@ -24,11 +24,11 @@ def run_regression(train_embeds, train_labels, test_embeds, test_labels):
if __name__ == '__main__': if __name__ == '__main__':
parser = ArgumentParser("Run evaluation on Reddit data.") parser = ArgumentParser("Run evaluation on Reddit data.")
parser.add_argument("dataset_dir", help="Path to directory containing the dataset.") parser.add_argument("dataset_dir", help="Path to directory containing the dataset.")
parser.add_argument("data_dir", help="Path to directory containing the learned node embeddings. Set to 'feat' for raw features.") parser.add_argument("embed_dir", help="Path to directory containing the learned node embeddings. Set to 'feat' for raw features.")
parser.add_argument("setting", help="Either val or test.") parser.add_argument("setting", help="Either val or test.")
args = parser.parse_args() args = parser.parse_args()
dataset_dir = args.dataset_dir dataset_dir = args.dataset_dir
data_dir = args.data_dir data_dir = args.embed_dir
setting = args.setting setting = args.setting
print("Loading data...") print("Loading data...")
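The rename in these three evaluation scripts only changes the attribute name on the parsed namespace; positional arguments are still matched by their position on the command line, so existing invocations should keep working. A small self-contained check, in which the three argument values are made-up placeholders:

```python
from argparse import ArgumentParser

parser = ArgumentParser("Run evaluation on PPI data.")
parser.add_argument("dataset_dir", help="Path to directory containing the dataset.")
parser.add_argument("embed_dir", help="Path to directory containing the learned node embeddings.")
parser.add_argument("setting", help="Either val or test.")

# Placeholder values; parse_args normally reads sys.argv instead.
args = parser.parse_args(["example_data", "unsup-example", "test"])

# The scripts keep the local variable name and read the new attribute.
data_dir = args.embed_dir
```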


@ -116,12 +116,12 @@ class GCNAggregator(Layer):
return self.act(output) return self.act(output)
class PoolingAggregator(Layer): class MaxPoolingAggregator(Layer):
""" Aggregates via max-pooling over MLP functions. """ Aggregates via max-pooling over MLP functions.
""" """
def __init__(self, input_dim, output_dim, model_size="small", neigh_input_dim=None, def __init__(self, input_dim, output_dim, model_size="small", neigh_input_dim=None,
dropout=0., bias=False, act=tf.nn.relu, name=None, concat=False, **kwargs): dropout=0., bias=False, act=tf.nn.relu, name=None, concat=False, **kwargs):
super(PoolingAggregator, self).__init__(**kwargs) super(MaxPoolingAggregator, self).__init__(**kwargs)
self.dropout = dropout self.dropout = dropout
self.bias = bias self.bias = bias
@ -194,12 +194,91 @@ class PoolingAggregator(Layer):
return self.act(output) return self.act(output)
class TwoLayerPoolingAggregator(Layer): class MeanPoolingAggregator(Layer):
""" Aggregates via mean-pooling over MLP functions.
"""
def __init__(self, input_dim, output_dim, model_size="small", neigh_input_dim=None,
dropout=0., bias=False, act=tf.nn.relu, name=None, concat=False, **kwargs):
super(MeanPoolingAggregator, self).__init__(**kwargs)
self.dropout = dropout
self.bias = bias
self.act = act
self.concat = concat
if neigh_input_dim is None:
neigh_input_dim = input_dim
if name is not None:
name = '/' + name
else:
name = ''
if model_size == "small":
hidden_dim = self.hidden_dim = 512
elif model_size == "big":
hidden_dim = self.hidden_dim = 1024
self.mlp_layers = []
self.mlp_layers.append(Dense(input_dim=neigh_input_dim,
output_dim=hidden_dim,
act=tf.nn.relu,
dropout=dropout,
sparse_inputs=False,
logging=self.logging))
with tf.variable_scope(self.name + name + '_vars'):
self.vars['neigh_weights'] = glorot([hidden_dim, output_dim],
name='neigh_weights')
self.vars['self_weights'] = glorot([input_dim, output_dim],
name='self_weights')
if self.bias:
self.vars['bias'] = zeros([self.output_dim], name='bias')
if self.logging:
self._log_vars()
self.input_dim = input_dim
self.output_dim = output_dim
self.neigh_input_dim = neigh_input_dim
def _call(self, inputs):
self_vecs, neigh_vecs = inputs
neigh_h = neigh_vecs
dims = tf.shape(neigh_h)
batch_size = dims[0]
num_neighbors = dims[1]
# [nodes * sampled neighbors] x [hidden_dim]
h_reshaped = tf.reshape(neigh_h, (batch_size * num_neighbors, self.neigh_input_dim))
for l in self.mlp_layers:
h_reshaped = l(h_reshaped)
neigh_h = tf.reshape(h_reshaped, (batch_size, num_neighbors, self.hidden_dim))
neigh_h = tf.reduce_mean(neigh_h, axis=1)
from_neighs = tf.matmul(neigh_h, self.vars['neigh_weights'])
from_self = tf.matmul(self_vecs, self.vars["self_weights"])
if not self.concat:
output = tf.add_n([from_self, from_neighs])
else:
output = tf.concat([from_self, from_neighs], axis=1)
# bias
if self.bias:
output += self.vars['bias']
return self.act(output)
class TwoMaxLayerPoolingAggregator(Layer):
""" Aggregates via pooling over two MLP functions. """ Aggregates via pooling over two MLP functions.
""" """
def __init__(self, input_dim, output_dim, model_size="small", neigh_input_dim=None, def __init__(self, input_dim, output_dim, model_size="small", neigh_input_dim=None,
dropout=0., bias=False, act=tf.nn.relu, name=None, concat=False, **kwargs): dropout=0., bias=False, act=tf.nn.relu, name=None, concat=False, **kwargs):
super(TwoLayerPoolingAggregator, self).__init__(**kwargs) super(TwoMaxLayerPoolingAggregator, self).__init__(**kwargs)
self.dropout = dropout self.dropout = dropout
self.bias = bias self.bias = bias
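The new `MeanPoolingAggregator` reduces the transformed neighbor vectors with `tf.reduce_mean` over the neighbor axis; per the README, this is the max-pooling aggregator with the element-wise max replaced by an element-wise mean. A plain-Python sketch of just that reduction step for a single node follows. `pool_neighbors` is a hypothetical name, and the MLP and weight matrices are left out.

```python
def pool_neighbors(neigh_h, reduce="mean"):
    """Element-wise pooling over one node's sampled neighbor vectors."""
    # zip(*rows) groups the values of each feature dimension together.
    columns = list(zip(*neigh_h))
    if reduce == "mean":
        return [sum(col) / len(col) for col in columns]
    if reduce == "max":
        return [max(col) for col in columns]
    raise ValueError("reduce must be 'mean' or 'max'")


# Three sampled neighbors, each already mapped to a 2-d hidden vector.
neigh_h = [[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]]

mean_pooled = pool_neighbors(neigh_h, "mean")
max_pooled = pool_neighbors(neigh_h, "max")
```

Both variants are invariant to the order of the sampled neighbors; they differ only in whether one large activation can dominate the pooled vector.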


@ -42,15 +42,15 @@ class EdgeMinibatchIterator(object):
self.train_edges = self.edges = np.random.permutation(edges) self.train_edges = self.edges = np.random.permutation(edges)
if not n2v_retrain: if not n2v_retrain:
self.train_edges = self._remove_isolated(self.train_edges) self.train_edges = self._remove_isolated(self.train_edges)
self.val_edges = [e for e in G.edges_iter() if G[e[0]][e[1]]['train_removed']] self.val_edges = [e for e in G.edges() if G[e[0]][e[1]]['train_removed']]
else: else:
if fixed_n2v: if fixed_n2v:
self.train_edges = self.val_edges = self._n2v_prune(self.edges) self.train_edges = self.val_edges = self._n2v_prune(self.edges)
else: else:
self.train_edges = self.val_edges = self.edges self.train_edges = self.val_edges = self.edges
print(len([n for n in G.nodes_iter() if not G.node[n]['test'] and not G.node[n]['val']]), 'train nodes') print(len([n for n in G.nodes() if not G.node[n]['test'] and not G.node[n]['val']]), 'train nodes')
print(len([n for n in G.nodes_iter() if G.node[n]['test'] or G.node[n]['val']]), 'test nodes') print(len([n for n in G.nodes() if G.node[n]['test'] or G.node[n]['val']]), 'test nodes')
self.val_set_size = len(self.val_edges) self.val_set_size = len(self.val_edges)
def _n2v_prune(self, edges): def _n2v_prune(self, edges):
@ -59,13 +59,18 @@ class EdgeMinibatchIterator(object):
def _remove_isolated(self, edge_list): def _remove_isolated(self, edge_list):
new_edge_list = [] new_edge_list = []
missing = 0
for n1, n2 in edge_list: for n1, n2 in edge_list:
if not n1 in self.G.node or not n2 in self.G.node:
missing += 1
continue
if (self.deg[self.id2idx[n1]] == 0 or self.deg[self.id2idx[n2]] == 0) \ if (self.deg[self.id2idx[n1]] == 0 or self.deg[self.id2idx[n2]] == 0) \
and (not self.G.node[n1]['test'] or self.G.node[n1]['val']) \ and (not self.G.node[n1]['test'] or self.G.node[n1]['val']) \
and (not self.G.node[n2]['test'] or self.G.node[n2]['val']): and (not self.G.node[n2]['test'] or self.G.node[n2]['val']):
continue continue
else: else:
new_edge_list.append((n1,n2)) new_edge_list.append((n1,n2))
print("Unexpected missing:", missing)
return new_edge_list return new_edge_list
def construct_adj(self): def construct_adj(self):
@ -153,7 +158,7 @@ class EdgeMinibatchIterator(object):
def label_val(self): def label_val(self):
train_edges = [] train_edges = []
val_edges = [] val_edges = []
for n1, n2 in self.G.edges_iter(): for n1, n2 in self.G.edges():
if (self.G.node[n1]['val'] or self.G.node[n1]['test'] if (self.G.node[n1]['val'] or self.G.node[n1]['test']
or self.G.node[n2]['val'] or self.G.node[n2]['test']): or self.G.node[n2]['val'] or self.G.node[n2]['test']):
val_edges.append((n1,n2)) val_edges.append((n1,n2))
@ -200,8 +205,8 @@ class NodeMinibatchIterator(object):
self.adj, self.deg = self.construct_adj() self.adj, self.deg = self.construct_adj()
self.test_adj = self.construct_test_adj() self.test_adj = self.construct_test_adj()
self.val_nodes = [n for n in self.G.nodes_iter() if self.G.node[n]['val']] self.val_nodes = [n for n in self.G.nodes() if self.G.node[n]['val']]
self.test_nodes = [n for n in self.G.nodes_iter() if self.G.node[n]['test']] self.test_nodes = [n for n in self.G.nodes() if self.G.node[n]['test']]
self.no_train_nodes_set = set(self.val_nodes + self.test_nodes) self.no_train_nodes_set = set(self.val_nodes + self.test_nodes)
self.train_nodes = set(G.nodes()).difference(self.no_train_nodes_set) self.train_nodes = set(G.nodes()).difference(self.no_train_nodes_set)


@ -7,7 +7,7 @@ import graphsage.layers as layers
import graphsage.metrics as metrics import graphsage.metrics as metrics
from .prediction import BipartiteEdgePredLayer from .prediction import BipartiteEdgePredLayer
from .aggregators import MeanAggregator, PoolingAggregator, SeqAggregator, GCNAggregator, TwoLayerPoolingAggregator from .aggregators import MeanAggregator, MaxPoolingAggregator, MeanPoolingAggregator, SeqAggregator, GCNAggregator
flags = tf.app.flags flags = tf.app.flags
FLAGS = flags.FLAGS FLAGS = flags.FLAGS
@ -191,12 +191,13 @@ class SampleAndAggregate(GeneralizedModel):
def __init__(self, placeholders, features, adj, degrees, def __init__(self, placeholders, features, adj, degrees,
layer_infos, concat=True, aggregator_type="mean", layer_infos, concat=True, aggregator_type="mean",
model_size="small", model_size="small", identity_dim=0,
**kwargs): **kwargs):
''' '''
Args: Args:
- placeholders: Stanford TensorFlow placeholder object. - placeholders: Stanford TensorFlow placeholder object.
- features: Numpy array with node features. - features: Numpy array with node features.
NOTE: Pass a None object to train in featureless mode (identity features for nodes)!
- adj: Numpy array with adjacency lists (padded with random re-samples) - adj: Numpy array with adjacency lists (padded with random re-samples)
- degrees: Numpy array with node degrees. - degrees: Numpy array with node degrees.
- layer_infos: List of SAGEInfo namedtuples that describe the parameters of all - layer_infos: List of SAGEInfo namedtuples that describe the parameters of all
@ -204,16 +205,17 @@ class SampleAndAggregate(GeneralizedModel):
- concat: whether to concatenate during recursive iterations - concat: whether to concatenate during recursive iterations
- aggregator_type: how to aggregate neighbor information - aggregator_type: how to aggregate neighbor information
- model_size: one of "small" and "big" - model_size: one of "small" and "big"
- identity_dim: Set to positive int to use identity features (slow and cannot generalize, but better accuracy)
''' '''
super(SampleAndAggregate, self).__init__(**kwargs) super(SampleAndAggregate, self).__init__(**kwargs)
if aggregator_type == "mean": if aggregator_type == "mean":
self.aggregator_cls = MeanAggregator self.aggregator_cls = MeanAggregator
elif aggregator_type == "seq": elif aggregator_type == "seq":
self.aggregator_cls = SeqAggregator self.aggregator_cls = SeqAggregator
elif aggregator_type == "pool": elif aggregator_type == "maxpool":
self.aggregator_cls = PoolingAggregator self.aggregator_cls = MaxPoolingAggregator
elif aggregator_type == "pool_2": elif aggregator_type == "meanpool":
self.aggregator_cls = TwoLayerPoolingAggregator self.aggregator_cls = MeanPoolingAggregator
elif aggregator_type == "gcn": elif aggregator_type == "gcn":
self.aggregator_cls = GCNAggregator self.aggregator_cls = GCNAggregator
else: else:
@ -224,11 +226,22 @@ class SampleAndAggregate(GeneralizedModel):
self.inputs2 = placeholders["batch2"] self.inputs2 = placeholders["batch2"]
self.model_size = model_size self.model_size = model_size
self.adj_info = adj self.adj_info = adj
self.features = tf.Variable(tf.constant(features, dtype=tf.float32), trainable=False) if identity_dim > 0:
self.embeds = tf.get_variable("node_embeddings", [adj.get_shape().as_list()[0], identity_dim])
else:
self.embeds = None
if features is None:
if identity_dim == 0:
raise Exception("Must have a positive value for identity feature dimension if no input features given.")
self.features = self.embeds
else:
self.features = tf.Variable(tf.constant(features, dtype=tf.float32), trainable=False)
if not self.embeds is None:
self.features = tf.concat([self.embeds, self.features], axis=1)
self.degrees = degrees self.degrees = degrees
self.concat = concat self.concat = concat
self.dims = [features.shape[1]] self.dims = [(0 if features is None else features.shape[1]) + identity_dim]
self.dims.extend([layer_infos[i].output_dim for i in range(len(layer_infos))]) self.dims.extend([layer_infos[i].output_dim for i in range(len(layer_infos))])
self.batch_size = placeholders["batch_size"] self.batch_size = placeholders["batch_size"]
self.placeholders = placeholders self.placeholders = placeholders
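The input width of the first layer now depends on both the node features and `identity_dim`. The sketch below restates that rule outside TensorFlow; `layer_dims` is a hypothetical helper that takes the feature width (or `None` for featureless mode) instead of the feature array itself.

```python
def layer_dims(feature_dim, identity_dim, layer_output_dims):
    """First entry is the input width; the rest are per-layer output widths."""
    if feature_dim is None and identity_dim == 0:
        # Mirrors the exception raised in the model constructor.
        raise ValueError("identity_dim must be positive when no input features are given")
    first = (0 if feature_dim is None else feature_dim) + identity_dim
    return [first] + list(layer_output_dims)


with_both = layer_dims(50, 64, [128, 128])        # features plus identity
identity_only = layer_dims(None, 64, [128, 128])  # featureless mode
features_only = layer_dims(50, 0, [128, 128])     # behaviour before this change
```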


@ -2,7 +2,7 @@ import tensorflow as tf
import graphsage.models as models import graphsage.models as models
import graphsage.layers as layers import graphsage.layers as layers
from graphsage.aggregators import MeanAggregator, PoolingAggregator, SeqAggregator, GCNAggregator, TwoLayerPoolingAggregator from graphsage.aggregators import MeanAggregator, MaxPoolingAggregator, MeanPoolingAggregator, SeqAggregator, GCNAggregator
flags = tf.app.flags flags = tf.app.flags
FLAGS = flags.FLAGS FLAGS = flags.FLAGS
@ -13,7 +13,7 @@ class SupervisedGraphsage(models.SampleAndAggregate):
def __init__(self, num_classes, def __init__(self, num_classes,
placeholders, features, adj, degrees, placeholders, features, adj, degrees,
layer_infos, concat=True, aggregator_type="mean", layer_infos, concat=True, aggregator_type="mean",
model_size="small", sigmoid_loss=False, model_size="small", sigmoid_loss=False, identity_dim=0,
**kwargs): **kwargs):
''' '''
Args: Args:
@ -35,10 +35,10 @@ class SupervisedGraphsage(models.SampleAndAggregate):
self.aggregator_cls = MeanAggregator self.aggregator_cls = MeanAggregator
elif aggregator_type == "seq": elif aggregator_type == "seq":
self.aggregator_cls = SeqAggregator self.aggregator_cls = SeqAggregator
elif aggregator_type == "pool": elif aggregator_type == "meanpool":
self.aggregator_cls = PoolingAggregator self.aggregator_cls = MeanPoolingAggregator
elif aggregator_type == "pool_2": elif aggregator_type == "maxpool":
self.aggregator_cls = TwoLayerPoolingAggregator self.aggregator_cls = MaxPoolingAggregator
elif aggregator_type == "gcn": elif aggregator_type == "gcn":
self.aggregator_cls = GCNAggregator self.aggregator_cls = GCNAggregator
else: else:
@ -48,13 +48,23 @@ class SupervisedGraphsage(models.SampleAndAggregate):
self.inputs1 = placeholders["batch"] self.inputs1 = placeholders["batch"]
self.model_size = model_size self.model_size = model_size
self.adj_info = adj self.adj_info = adj
self.features = tf.Variable(tf.constant(features, dtype=tf.float32), trainable=False) if identity_dim > 0:
self.embeds = tf.get_variable("node_embeddings", [adj.get_shape().as_list()[0], identity_dim])
else:
self.embeds = None
if features is None:
if identity_dim == 0:
raise Exception("Must have a positive value for identity feature dimension if no input features given.")
self.features = self.embeds
else:
self.features = tf.Variable(tf.constant(features, dtype=tf.float32), trainable=False)
if not self.embeds is None:
self.features = tf.concat([self.embeds, self.features], axis=1)
self.degrees = degrees self.degrees = degrees
self.concat = concat self.concat = concat
self.num_classes = num_classes self.num_classes = num_classes
self.sigmoid_loss = sigmoid_loss self.sigmoid_loss = sigmoid_loss
self.dims = [(0 if features is None else features.shape[1]) + identity_dim]
self.dims = [features.shape[1]]
self.dims.extend([layer_infos[i].output_dim for i in range(len(layer_infos))]) self.dims.extend([layer_infos[i].output_dim for i in range(len(layer_infos))])
self.batch_size = placeholders["batch_size"] self.batch_size = placeholders["batch_size"]
self.placeholders = placeholders self.placeholders = placeholders


@ -39,13 +39,14 @@ flags.DEFINE_float('dropout', 0.0, 'dropout rate (1 - keep probability).')
flags.DEFINE_float('weight_decay', 0.0, 'weight for l2 loss on embedding matrix.') flags.DEFINE_float('weight_decay', 0.0, 'weight for l2 loss on embedding matrix.')
flags.DEFINE_integer('max_degree', 128, 'maximum node degree.') flags.DEFINE_integer('max_degree', 128, 'maximum node degree.')
flags.DEFINE_integer('samples_1', 25, 'number of samples in layer 1') flags.DEFINE_integer('samples_1', 25, 'number of samples in layer 1')
flags.DEFINE_integer('samples_2', 10, 'number of users samples in layer 2') flags.DEFINE_integer('samples_2', 10, 'number of samples in layer 2')
flags.DEFINE_integer('samples_3', 0, 'number of users samples in layer 3. (Only or mean model)') flags.DEFINE_integer('samples_3', 0, 'number of users samples in layer 3. (Only for mean model)')
flags.DEFINE_integer('dim_1', 128, 'Size of output dim (final is 2x this, if using concat)') flags.DEFINE_integer('dim_1', 128, 'Size of output dim (final is 2x this, if using concat)')
flags.DEFINE_integer('dim_2', 128, 'Size of output dim (final is 2x this, if using concat)') flags.DEFINE_integer('dim_2', 128, 'Size of output dim (final is 2x this, if using concat)')
flags.DEFINE_boolean('random_context', True, 'Whether to use random context or direct edges') flags.DEFINE_boolean('random_context', True, 'Whether to use random context or direct edges')
flags.DEFINE_integer('batch_size', 512, 'minibatch size.') flags.DEFINE_integer('batch_size', 512, 'minibatch size.')
flags.DEFINE_boolean('sigmoid', False, 'whether to use sigmoid loss') flags.DEFINE_boolean('sigmoid', False, 'whether to use sigmoid loss')
flags.DEFINE_integer('identity_dim', 0, 'Set to positive value to use identity embedding features of that dimension. Default 0.')
#logging, saving, validation settings etc. #logging, saving, validation settings etc.
flags.DEFINE_string('base_log_dir', '.', 'base directory for logging and saving embeddings') flags.DEFINE_string('base_log_dir', '.', 'base directory for logging and saving embeddings')
@ -124,13 +125,14 @@ def train(train_data, test_data=None):
features = train_data[1] features = train_data[1]
id_map = train_data[2] id_map = train_data[2]
class_map = train_data[4] class_map = train_data[4]
if isinstance(class_map.values()[0], list): if isinstance(list(class_map.values())[0], list):
num_classes = len(class_map.values()[0]) num_classes = len(list(class_map.values())[0])
else: else:
num_classes = len(set(class_map.values())) num_classes = len(set(class_map.values()))
# pad with dummy zero vector if not features is None:
features = np.vstack([features, np.zeros((features.shape[1],))]) # pad with dummy zero vector
features = np.vstack([features, np.zeros((features.shape[1],))])
context_pairs = train_data[3] if FLAGS.random_context else None context_pairs = train_data[3] if FLAGS.random_context else None
placeholders = construct_placeholders(num_classes) placeholders = construct_placeholders(num_classes)
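The `list(...)` wrapper matters on Python 3, where `dict.values()` returns a view that cannot be indexed, so `class_map.values()[0]` raises a `TypeError`. A minimal standalone version of the class-counting logic, with `count_classes` as a hypothetical name for it:

```python
def count_classes(class_map):
    """Number of classes for multi-label (list) or categorical labels."""
    # Materialize the view so the first label can be inspected on Python 3.
    first = list(class_map.values())[0]
    if isinstance(first, list):
        # Multi-label case: one indicator per class.
        return len(first)
    # Categorical case: count the distinct label values.
    return len(set(class_map.values()))


multi_label = {"a": [1, 0, 1], "b": [0, 0, 1]}
categorical = {"a": 0, "b": 2, "c": 0}

n_multi = count_classes(multi_label)
n_cat = count_classes(categorical)
```

`next(iter(class_map.values()))` would avoid copying every value, which may be preferable for large graphs.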
@ -164,6 +166,7 @@ def train(train_data, test_data=None):
layer_infos, layer_infos,
model_size=FLAGS.model_size, model_size=FLAGS.model_size,
sigmoid_loss = FLAGS.sigmoid, sigmoid_loss = FLAGS.sigmoid,
identity_dim = FLAGS.identity_dim,
logging=True) logging=True)
elif FLAGS.model == 'gcn': elif FLAGS.model == 'gcn':
# Create model # Create model
@@ -180,6 +183,7 @@ def train(train_data, test_data=None):
                                     model_size=FLAGS.model_size,
                                     concat=False,
                                     sigmoid_loss = FLAGS.sigmoid,
+                                    identity_dim = FLAGS.identity_dim,
                                     logging=True)
     elif FLAGS.model == 'graphsage_seq':
@@ -195,9 +199,10 @@ def train(train_data, test_data=None):
                                     aggregator_type="seq",
                                     model_size=FLAGS.model_size,
                                     sigmoid_loss = FLAGS.sigmoid,
+                                    identity_dim = FLAGS.identity_dim,
                                     logging=True)
-    elif FLAGS.model == 'graphsage_pool':
+    elif FLAGS.model == 'graphsage_maxpool':
         sampler = UniformNeighborSampler(adj_info)
         layer_infos = [SAGEInfo("node", sampler, FLAGS.samples_1, FLAGS.dim_1),
                        SAGEInfo("node", sampler, FLAGS.samples_2, FLAGS.dim_2)]
@@ -210,7 +215,25 @@ def train(train_data, test_data=None):
                                     aggregator_type="pool",
                                     model_size=FLAGS.model_size,
                                     sigmoid_loss = FLAGS.sigmoid,
+                                    identity_dim = FLAGS.identity_dim,
                                     logging=True)
+    elif FLAGS.model == 'graphsage_meanpool':
+        sampler = UniformNeighborSampler(adj_info)
+        layer_infos = [SAGEInfo("node", sampler, FLAGS.samples_1, FLAGS.dim_1),
+                       SAGEInfo("node", sampler, FLAGS.samples_2, FLAGS.dim_2)]
+        model = SupervisedGraphsage(num_classes, placeholders,
+                                    features,
+                                    adj_info,
+                                    minibatch.deg,
+                                    layer_infos=layer_infos,
+                                    aggregator_type="meanpool",
+                                    model_size=FLAGS.model_size,
+                                    sigmoid_loss = FLAGS.sigmoid,
+                                    identity_dim = FLAGS.identity_dim,
+                                    logging=True)
     else:
         raise Exception('Error: model name unrecognized.')
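
Review note: the two data-handling changes in `train()` above (Python 3 friendly access to `class_map.values()` and the `None` guard around feature padding) can be checked in isolation. The sketch below is only an illustration; the helper names `infer_num_classes` and `pad_features` are hypothetical and do not appear in the commit.

```python
import numpy as np


def infer_num_classes(class_map):
    """Hypothetical helper mirroring the num_classes logic in the hunk above.

    Multi-label maps store a list per node, so the class count is the list
    length; single-label maps store one hashable label per node.
    """
    first = next(iter(class_map.values()))  # works on Python 2 and 3
    if isinstance(first, list):
        return len(first)
    return len(set(class_map.values()))


def pad_features(features):
    """Hypothetical helper: append one all-zero row, or pass None through."""
    if features is None:
        return None
    return np.vstack([features, np.zeros((features.shape[1],))])
```

`next(iter(...))` avoids building a full list just to read one value; `list(class_map.values())[0]`, as used in the commit, behaves the same way.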


@@ -43,6 +43,7 @@ flags.DEFINE_boolean('random_context', True, 'Whether to use random context or d
 flags.DEFINE_integer('neg_sample_size', 20, 'number of negative samples')
 flags.DEFINE_integer('batch_size', 512, 'minibatch size.')
 flags.DEFINE_integer('n2v_test_epochs', 1, 'Number of new SGD epochs for n2v.')
+flags.DEFINE_integer('identity_dim', 0, 'Set to positive value to use identity embedding features of that dimension. Default 0.')
 #logging, saving, validation settings etc.
 flags.DEFINE_boolean('save_embeddings', True, 'whether to save embeddings for all nodes after training')
@@ -115,7 +116,7 @@ def save_val_embeddings(sess, model, minibatch_iter, size, out_dir, mod=""):
     with open(out_dir + name + mod + ".txt", "w") as fp:
         fp.write("\n".join(map(str,nodes)))
-def construct_placeholders(feature_size):
+def construct_placeholders():
     # Define placeholders
     placeholders = {
         'batch1' : tf.placeholder(tf.int32, shape=(None), name='batch1'),
@@ -133,12 +134,12 @@ def train(train_data, test_data=None):
     features = train_data[1]
     id_map = train_data[2]
-    # pad with dummy zero vector
-    features = np.vstack([features, np.zeros((features.shape[1],))])
-    feature_size = features.shape[1]
+    if not features is None:
+        # pad with dummy zero vector
+        features = np.vstack([features, np.zeros((features.shape[1],))])
     context_pairs = train_data[3] if FLAGS.random_context else None
-    placeholders = construct_placeholders(feature_size)
+    placeholders = construct_placeholders()
     minibatch = EdgeMinibatchIterator(G,
             id_map,
             placeholders, batch_size=FLAGS.batch_size,
@@ -159,6 +160,7 @@ def train(train_data, test_data=None):
                                    minibatch.deg,
                                    layer_infos=layer_infos,
                                    model_size=FLAGS.model_size,
+                                   identity_dim = FLAGS.identity_dim,
                                    logging=True)
     elif FLAGS.model == 'gcn':
         # Create model
@@ -173,6 +175,7 @@ def train(train_data, test_data=None):
                                    layer_infos=layer_infos,
                                    aggregator_type="gcn",
                                    model_size=FLAGS.model_size,
+                                   identity_dim = FLAGS.identity_dim,
                                    concat=False,
                                    logging=True)
@@ -186,11 +189,12 @@ def train(train_data, test_data=None):
                                    adj_info,
                                    minibatch.deg,
                                    layer_infos=layer_infos,
+                                   identity_dim = FLAGS.identity_dim,
                                    aggregator_type="seq",
                                    model_size=FLAGS.model_size,
                                    logging=True)
-    elif FLAGS.model == 'graphsage_pool':
+    elif FLAGS.model == 'graphsage_maxpool':
         sampler = UniformNeighborSampler(adj_info)
         layer_infos = [SAGEInfo("node", sampler, FLAGS.samples_1, FLAGS.dim_1),
                        SAGEInfo("node", sampler, FLAGS.samples_2, FLAGS.dim_2)]
@@ -200,9 +204,25 @@ def train(train_data, test_data=None):
                                    adj_info,
                                    minibatch.deg,
                                    layer_infos=layer_infos,
-                                   aggregator_type="pool",
+                                   aggregator_type="maxpool",
                                    model_size=FLAGS.model_size,
+                                   identity_dim = FLAGS.identity_dim,
                                    logging=True)
+    elif FLAGS.model == 'graphsage_meanpool':
+        sampler = UniformNeighborSampler(adj_info)
+        layer_infos = [SAGEInfo("node", sampler, FLAGS.samples_1, FLAGS.dim_1),
+                       SAGEInfo("node", sampler, FLAGS.samples_2, FLAGS.dim_2)]
+        model = SampleAndAggregate(placeholders,
+                                   features,
+                                   adj_info,
+                                   minibatch.deg,
+                                   layer_infos=layer_infos,
+                                   aggregator_type="meanpool",
+                                   model_size=FLAGS.model_size,
+                                   identity_dim = FLAGS.identity_dim,
+                                   logging=True)
     elif FLAGS.model == 'n2v':
         model = Node2VecModel(placeholders, features.shape[0],
                               minibatch.deg,
@@ -354,7 +374,7 @@ def train(train_data, test_data=None):
 def main(argv=None):
     print("Loading training data..")
-    train_data = load_data(FLAGS.train_prefix)
+    train_data = load_data(FLAGS.train_prefix, load_walks=True)
     print("Done loading training data..")
     train(train_data)
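
Review note: the model code that consumes `identity_dim` is not part of this section, so the following is only a guess at the idea behind the flag, written in NumPy rather than TensorFlow. It assumes that identity features are a per-node embedding table that either stands in for missing input features or is concatenated with them; `build_input_features` and its parameters are hypothetical names. In a real model the table would be a trainable variable, not a fixed random matrix.

```python
import numpy as np


def build_input_features(features, num_nodes, identity_dim, seed=0):
    """Hypothetical sketch: combine optional node features with identity embeddings.

    features     -- array of shape (num_nodes, d), or None if the dataset has none
    identity_dim -- 0 disables identity embeddings (the flag's default)
    """
    rng = np.random.default_rng(seed)
    identity = None
    if identity_dim > 0:
        # Stand-in for a trainable embedding table, one row per node.
        identity = rng.normal(scale=0.1, size=(num_nodes, identity_dim))

    if features is None:
        if identity is None:
            raise ValueError("need node features or identity_dim > 0")
        return identity
    if identity is None:
        return features
    # Assumed behaviour: append identity columns to the real features.
    return np.hstack([features, identity])
```

Under this reading, a dataset without a `-feats.npy` file only makes sense when `identity_dim` is positive, which is why the sketch raises an error otherwise.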


@@ -4,13 +4,14 @@ import numpy as np
 import random
 import json
 import sys
+import os
 from networkx.readwrite import json_graph
 WALK_LEN=5
 N_WALKS=50
-def load_data(prefix, normalize=True):
+def load_data(prefix, normalize=True, load_walks=False):
     G_data = json.load(open(prefix + "-G.json"))
     G = json_graph.node_link_graph(G_data)
     if isinstance(G.nodes()[0], int):
@@ -18,29 +19,33 @@ def load_data(prefix, normalize=True):
     else:
         conversion = lambda n : n
-    feats = np.load(prefix + "-feats.npy")
+    if os.path.exists(prefix + "-feats.npy"):
+        feats = np.load(prefix + "-feats.npy")
+    else:
+        print("No features present.. Only identity features will be used.")
+        feats = None
     id_map = json.load(open(prefix + "-id_map.json"))
-    id_map = {conversion(k):int(v) for k,v in id_map.iteritems()}
+    id_map = {conversion(k):int(v) for k,v in id_map.items()}
     walks = []
     class_map = json.load(open(prefix + "-class_map.json"))
-    if isinstance(class_map.values()[0], list):
+    if isinstance(list(class_map.values())[0], list):
         lab_conversion = lambda n : n
     else:
         lab_conversion = lambda n : int(n)
-    class_map = {conversion(k):lab_conversion(v) for k,v in class_map.iteritems()}
+    class_map = {conversion(k):lab_conversion(v) for k,v in class_map.items()}
     ## Make sure the graph has edge train_removed annotations
     ## (some datasets might already have this..)
     print("Loaded data.. now preprocessing..")
-    for edge in G.edges_iter():
+    for edge in G.edges():
         if (G.node[edge[0]]['val'] or G.node[edge[1]]['val'] or
             G.node[edge[0]]['test'] or G.node[edge[1]]['test']):
             G[edge[0]][edge[1]]['train_removed'] = True
         else:
             G[edge[0]][edge[1]]['train_removed'] = False
-    if normalize:
+    if normalize and not feats is None:
         from sklearn.preprocessing import StandardScaler
         train_ids = np.array([id_map[n] for n in G.nodes() if not G.node[n]['val'] and not G.node[n]['test']])
         train_feats = feats[train_ids]
@@ -48,9 +53,10 @@ def load_data(prefix, normalize=True):
         scaler.fit(train_feats)
         feats = scaler.transform(feats)
-    with open(prefix + "-walks.txt") as fp:
-        for line in fp:
-            walks.append(map(conversion, line.split()))
+    if load_walks:
+        with open(prefix + "-walks.txt") as fp:
+            for line in fp:
+                walks.append(map(conversion, line.split()))
     return G, feats, id_map, walks, class_map
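
Review note: since this commit moves `load_data` toward Python 3 (`items()`, `list(...values())`), one remaining detail may be worth a follow-up. Under Python 3, `map(...)` returns a lazy iterator, so `walks.append(map(conversion, line.split()))` stores one-shot iterators rather than lists. The sketch below shows the optional-file pattern from the hunk above together with a `list(map(...))` variant; `load_optional_features` and `read_walks` are hypothetical helper names, not functions in the repository.

```python
import os

import numpy as np


def load_optional_features(prefix):
    """Hypothetical helper: return the feature matrix, or None if the file is absent."""
    path = prefix + "-feats.npy"
    return np.load(path) if os.path.exists(path) else None


def read_walks(path, conversion=int):
    """Hypothetical helper: read one whitespace-separated walk per line.

    list(...) materialises each walk so it can be indexed and iterated
    more than once on Python 3.
    """
    walks = []
    with open(path) as fp:
        for line in fp:
            walks.append(list(map(conversion, line.split())))
    return walks
```

Whether downstream code needs real lists depends on how `walks` is consumed, which this section does not show.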