Initial commit of cleaned repo.

This commit is contained in:
williamleif 2017-05-29 08:35:30 -07:00
commit 40f5f4828f
23 changed files with 1898310 additions and 0 deletions

109
.gitignore vendored Normal file
View File

@ -0,0 +1,109 @@
# Custom
*.idea
*.png
*.pdf
tmp/
*.txt
*swp*
*.sw?
gcn_back
.DS_STORE
*.aux
*.log
*.out
*.bbl
*.synctex.gz
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
*.egg-info/
.installed.cfg
*.egg
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*,cover
.hypothesis/
# Translations
*.mo
*.pot
# Django stuff:
*.log
local_settings.py
# Flask stuff:
instance/
.webassets-cache
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
# PyBuilder
target/
# IPython Notebook
.ipynb_checkpoints
# pyenv
.python-version
# celery beat schedule file
celerybeat-schedule
# dotenv
.env
# virtualenv
venv/
ENV/
# Spyder project settings
.spyderproject
# Rope project settings
.ropeproject
*.pickle
*.pkl

21
LICENSE.txt Normal file
View File

@ -0,0 +1,21 @@
The MIT License
Copyright (c) 2017 William L. Hamilton, Rex Ying
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.

41
README.md Normal file
View File

@ -0,0 +1,41 @@
# GraphSAGE code
## Overview
This directory contains code necessary to run the GraphSAGE algorithm.
See our paper for details on the algorithm: TODO arxiv link.
The example_data subdirectory contains a small example of the PPI data,
which includes 3 training networks + one validation network and one test network.
The full Reddit and PPI datasets are available at: TODO
The Web of Science data can be released to groups or individuals with valid WoS access licenses.
## Requirements
Recent versions of TensorFlow, numpy, scipy, and networkx are required.
## Running the code
The example_unsupervised.sh and example_supervised.sh files contain example usages of the code.
(example_unsupervised.sh sets a very small max iteration number, which can be increased to improve performance.)
As input, at minimum the code requires that a --train_prefix option is specified which specifies the following data files:
* <train_prefix>-G.json -- "A networkx-specified json file describing the input graph."
* <train_prefix>-id_map.json -- "A json-stored dictionary mapping the graph node ids to consecutive integers."
* <train_prefix>-id_map.json -- "A json-stored dictionary mapping the graph node ids to classes."
* <train_prefix>-feats.npy --- "A numpy-stored array of node features; ordering given by id_map.json"
* <train_prefix>-walks.txt --- "A text file specifying random walk co-occurrences (one pair per line)" (*only for unsupervised)
The user must also specify a --model, the variants of which are described in detail in the paper:
* graphsage_mean -- GraphSAGE with mean-based aggregator
* graphsage_seq -- GraphSAGE with LSTM-based aggregator
* graphsage_pool -- GraphSAGE with max-pooling aggregator
* gcn -- GraphSAGE with GCN-based aggregator
* n2v -- an implementation of DeepWalk (called n2v for short everywhere)
Finally, a --base_log_dir should be specified (it defaults to the current directory).
The output of the model and log files will be stored in a subdirectory of the base_log_dir.
The supervised model will output F1 scores, while the unsupervised model will train embeddings and store them.
The unsupervised embeddings will be stored at val.npy with val.txt specifying the order of embeddings as a per-line list of node ids.
Note that the full log outputs and stored embeddings can be 5-10Gb in size (on the full data).
The other inputs and hyperparameters are described in the TensorFlow flags.

1
example_data/ppi-G.json Normal file

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

BIN
example_data/ppi-feats.npy Normal file

Binary file not shown.

File diff suppressed because one or more lines are too long

1895817
example_data/ppi-walks.txt Normal file

File diff suppressed because it is too large Load Diff

1
example_supervised.sh Executable file
View File

@ -0,0 +1 @@
python -m graphsage.supervised_train --train_prefix ./example_data/ppi --model graphsage_mean --sigmoid

1
example_unsupervised.sh Executable file
View File

@ -0,0 +1 @@
python -m graphsage.unsupervised_train --train_prefix ./example_data/ppi --model graphsage_mean --max_total_steps 1000

2
graphsage/__init__.py Normal file
View File

@ -0,0 +1,2 @@
from __future__ import print_function
from __future__ import division

371
graphsage/aggregators.py Normal file
View File

@ -0,0 +1,371 @@
import tensorflow as tf
from .layers import Layer, Dense
from .inits import glorot, zeros
class MeanAggregator(Layer):
"""
Aggregates via mean followed by matmul and non-linearity.
"""
def __init__(self, input_dim, output_dim, neigh_input_dim=None,
dropout=0., bias=False, act=tf.nn.relu,
name=None, concat=False, **kwargs):
super(MeanAggregator, self).__init__(**kwargs)
self.dropout = dropout
self.bias = bias
self.act = act
self.concat = concat
if neigh_input_dim is None:
neigh_input_dim = input_dim
if name is not None:
name = '/' + name
else:
name = ''
with tf.variable_scope(self.name + name + '_vars'):
self.vars['neigh_weights'] = glorot([neigh_input_dim, output_dim],
name='neigh_weights')
self.vars['self_weights'] = glorot([input_dim, output_dim],
name='self_weights')
if self.bias:
self.vars['bias'] = zeros([self.output_dim], name='bias')
if self.logging:
self._log_vars()
self.input_dim = input_dim
self.output_dim = output_dim
def _call(self, inputs):
self_vecs, neigh_vecs = inputs
neigh_vecs = tf.nn.dropout(neigh_vecs, 1-self.dropout)
self_vecs = tf.nn.dropout(self_vecs, 1-self.dropout)
neigh_means = tf.reduce_mean(neigh_vecs, axis=1)
# [nodes] x [out_dim]
from_neighs = tf.matmul(neigh_means, self.vars['neigh_weights'])
from_self = tf.matmul(self_vecs, self.vars["self_weights"])
if not self.concat:
output = tf.add_n([from_self, from_neighs])
else:
output = tf.concat([from_self, from_neighs], axis=1)
# bias
if self.bias:
output += self.vars['bias']
return self.act(output)
class GCNAggregator(Layer):
"""
Aggregates via mean followed by matmul and non-linearity.
Same matmul parameters are used self vector and neighbor vectors.
"""
def __init__(self, input_dim, output_dim, neigh_input_dim=None,
dropout=0., bias=False, act=tf.nn.relu, name=None, concat=False, **kwargs):
super(GCNAggregator, self).__init__(**kwargs)
self.dropout = dropout
self.bias = bias
self.act = act
self.concat = concat
if neigh_input_dim is None:
neigh_input_dim = input_dim
if name is not None:
name = '/' + name
else:
name = ''
with tf.variable_scope(self.name + name + '_vars'):
self.vars['weights'] = glorot([neigh_input_dim, output_dim],
name='neigh_weights')
if self.bias:
self.vars['bias'] = zeros([self.output_dim], name='bias')
if self.logging:
self._log_vars()
self.input_dim = input_dim
self.output_dim = output_dim
def _call(self, inputs):
self_vecs, neigh_vecs = inputs
neigh_vecs = tf.nn.dropout(neigh_vecs, 1-self.dropout)
self_vecs = tf.nn.dropout(self_vecs, 1-self.dropout)
means = tf.reduce_mean(tf.concat([neigh_vecs,
tf.expand_dims(self_vecs, axis=1)], axis=1), axis=1)
# [nodes] x [out_dim]
output = tf.matmul(means, self.vars['weights'])
# bias
if self.bias:
output += self.vars['bias']
return self.act(output)
class PoolingAggregator(Layer):
""" Aggregates via max-pooling over MLP functions.
"""
def __init__(self, input_dim, output_dim, model_size="small", neigh_input_dim=None,
dropout=0., bias=False, act=tf.nn.relu, name=None, concat=False, **kwargs):
super(PoolingAggregator, self).__init__(**kwargs)
self.dropout = dropout
self.bias = bias
self.act = act
self.concat = concat
if neigh_input_dim is None:
neigh_input_dim = input_dim
if name is not None:
name = '/' + name
else:
name = ''
if model_size == "small":
hidden_dim = self.hidden_dim = 512
elif model_size == "big":
hidden_dim = self.hidden_dim = 1024
self.mlp_layers = []
self.mlp_layers.append(Dense(input_dim=neigh_input_dim,
output_dim=hidden_dim,
act=tf.nn.relu,
dropout=dropout,
sparse_inputs=False,
logging=self.logging))
with tf.variable_scope(self.name + name + '_vars'):
self.vars['neigh_weights'] = glorot([hidden_dim, output_dim],
name='neigh_weights')
self.vars['self_weights'] = glorot([input_dim, output_dim],
name='self_weights')
if self.bias:
self.vars['bias'] = zeros([self.output_dim], name='bias')
if self.logging:
self._log_vars()
self.input_dim = input_dim
self.output_dim = output_dim
self.neigh_input_dim = neigh_input_dim
def _call(self, inputs):
self_vecs, neigh_vecs = inputs
neigh_h = neigh_vecs
dims = tf.shape(neigh_h)
batch_size = dims[0]
num_neighbors = dims[1]
# [nodes * sampled neighbors] x [hidden_dim]
h_reshaped = tf.reshape(neigh_h, (batch_size * num_neighbors, self.neigh_input_dim))
for l in self.mlp_layers:
h_reshaped = l(h_reshaped)
neigh_h = tf.reshape(h_reshaped, (batch_size, num_neighbors, self.hidden_dim))
neigh_h = tf.reduce_max(neigh_h, axis=1)
from_neighs = tf.matmul(neigh_h, self.vars['neigh_weights'])
from_self = tf.matmul(self_vecs, self.vars["self_weights"])
if not self.concat:
output = tf.add_n([from_self, from_neighs])
else:
output = tf.concat([from_self, from_neighs], axis=1)
# bias
if self.bias:
output += self.vars['bias']
return self.act(output)
class TwoLayerPoolingAggregator(Layer):
""" Aggregates via pooling over two MLP functions.
"""
def __init__(self, input_dim, output_dim, model_size="small", neigh_input_dim=None,
dropout=0., bias=False, act=tf.nn.relu, name=None, concat=False, **kwargs):
super(TwoLayerPoolingAggregator, self).__init__(**kwargs)
self.dropout = dropout
self.bias = bias
self.act = act
self.concat = concat
if neigh_input_dim is None:
neigh_input_dim = input_dim
if name is not None:
name = '/' + name
else:
name = ''
if model_size == "small":
hidden_dim_1 = self.hidden_dim_1 = 512
hidden_dim_2 = self.hidden_dim_2 = 256
elif model_size == "big":
hidden_dim_1 = self.hidden_dim_1 = 1024
hidden_dim_2 = self.hidden_dim_2 = 512
self.mlp_layers = []
self.mlp_layers.append(Dense(input_dim=neigh_input_dim,
output_dim=hidden_dim_1,
act=tf.nn.relu,
dropout=dropout,
sparse_inputs=False,
logging=self.logging))
self.mlp_layers.append(Dense(input_dim=hidden_dim_1,
output_dim=hidden_dim_2,
act=tf.nn.relu,
dropout=dropout,
sparse_inputs=False,
logging=self.logging))
with tf.variable_scope(self.name + name + '_vars'):
self.vars['neigh_weights'] = glorot([hidden_dim_2, output_dim],
name='neigh_weights')
self.vars['self_weights'] = glorot([input_dim, output_dim],
name='self_weights')
if self.bias:
self.vars['bias'] = zeros([self.output_dim], name='bias')
if self.logging:
self._log_vars()
self.input_dim = input_dim
self.output_dim = output_dim
self.neigh_input_dim = neigh_input_dim
def _call(self, inputs):
self_vecs, neigh_vecs = inputs
neigh_h = neigh_vecs
dims = tf.shape(neigh_h)
batch_size = dims[0]
num_neighbors = dims[1]
# [nodes * sampled neighbors] x [hidden_dim]
h_reshaped = tf.reshape(neigh_h, (batch_size * num_neighbors, self.neigh_input_dim))
for l in self.mlp_layers:
h_reshaped = l(h_reshaped)
neigh_h = tf.reshape(h_reshaped, (batch_size, num_neighbors, self.hidden_dim_2))
neigh_h = tf.reduce_max(neigh_h, axis=1)
from_neighs = tf.matmul(neigh_h, self.vars['neigh_weights'])
from_self = tf.matmul(self_vecs, self.vars["self_weights"])
if not self.concat:
output = tf.add_n([from_self, from_neighs])
else:
output = tf.concat([from_self, from_neighs], axis=1)
# bias
if self.bias:
output += self.vars['bias']
return self.act(output)
class SeqAggregator(Layer):
""" Aggregates via a standard LSTM.
"""
def __init__(self, input_dim, output_dim, model_size="small", neigh_input_dim=None,
dropout=0., bias=False, act=tf.nn.relu, name=None, concat=False, **kwargs):
super(SeqAggregator, self).__init__(**kwargs)
self.dropout = dropout
self.bias = bias
self.act = act
self.concat = concat
if neigh_input_dim is None:
neigh_input_dim = input_dim
if name is not None:
name = '/' + name
else:
name = ''
if model_size == "small":
hidden_dim = self.hidden_dim = 128
elif model_size == "big":
hidden_dim = self.hidden_dim = 256
with tf.variable_scope(self.name + name + '_vars'):
self.vars['neigh_weights'] = glorot([hidden_dim, output_dim],
name='neigh_weights')
self.vars['self_weights'] = glorot([input_dim, output_dim],
name='self_weights')
if self.bias:
self.vars['bias'] = zeros([self.output_dim], name='bias')
if self.logging:
self._log_vars()
self.input_dim = input_dim
self.output_dim = output_dim
self.neigh_input_dim = neigh_input_dim
self.cell = tf.contrib.rnn.BasicLSTMCell(self.hidden_dim)
def _call(self, inputs):
self_vecs, neigh_vecs = inputs
dims = tf.shape(neigh_vecs)
batch_size = dims[0]
initial_state = self.cell.zero_state(batch_size, tf.float32)
used = tf.sign(tf.reduce_max(tf.abs(neigh_vecs), axis=2))
length = tf.reduce_sum(used, axis=1)
length = tf.maximum(length, tf.constant(1.))
length = tf.cast(length, tf.int32)
with tf.variable_scope(self.name) as scope:
try:
rnn_outputs, rnn_states = tf.nn.dynamic_rnn(
self.cell, neigh_vecs,
initial_state=initial_state, dtype=tf.float32, time_major=False,
sequence_length=length)
except ValueError:
scope.reuse_variables()
rnn_outputs, rnn_states = tf.nn.dynamic_rnn(
self.cell, neigh_vecs,
initial_state=initial_state, dtype=tf.float32, time_major=False,
sequence_length=length)
batch_size = tf.shape(rnn_outputs)[0]
max_len = tf.shape(rnn_outputs)[1]
out_size = int(rnn_outputs.get_shape()[2])
index = tf.range(0, batch_size) * max_len + (length - 1)
flat = tf.reshape(rnn_outputs, [-1, out_size])
neigh_h = tf.gather(flat, index)
from_neighs = tf.matmul(neigh_h, self.vars['neigh_weights'])
from_self = tf.matmul(self_vecs, self.vars["self_weights"])
output = tf.add_n([from_self, from_neighs])
if not self.concat:
output = tf.add_n([from_self, from_neighs])
else:
output = tf.concat([from_self, from_neighs], axis=1)
# bias
if self.bias:
output += self.vars['bias']
return self.act(output)

31
graphsage/inits.py Normal file
View File

@ -0,0 +1,31 @@
import tensorflow as tf
import numpy as np
# DISCLAIMER:
# Parts of this code file are derived from
# https://github.com/tkipf/gcn
# (A full license with proper attributions will be provided in the
# public repo of this code base)
def uniform(shape, scale=0.05, name=None):
"""Uniform init."""
initial = tf.random_uniform(shape, minval=-scale, maxval=scale, dtype=tf.float32)
return tf.Variable(initial, name=name)
def glorot(shape, name=None):
"""Glorot & Bengio (AISTATS 2010) init."""
init_range = np.sqrt(6.0/(shape[0]+shape[1]))
initial = tf.random_uniform(shape, minval=-init_range, maxval=init_range, dtype=tf.float32)
return tf.Variable(initial, name=name)
def zeros(shape, name=None):
"""All zeros."""
initial = tf.zeros(shape, dtype=tf.float32)
return tf.Variable(initial, name=name)
def ones(shape, name=None):
"""All ones."""
initial = tf.ones(shape, dtype=tf.float32)
return tf.Variable(initial, name=name)

118
graphsage/layers.py Normal file
View File

@ -0,0 +1,118 @@
from __future__ import division
from __future__ import print_function
import tensorflow as tf
from graphsage.inits import zeros
flags = tf.app.flags
FLAGS = flags.FLAGS
# DISCLAIMER:
# Boilerplate parts of this code file were originally forked from
# https://github.com/tkipf/gcn
# which itself was very inspired by the keras package
# (A full license with non-anonymized attributions will be provided in the
# public repo of this code base)
# global unique layer ID dictionary for layer name assignment
_LAYER_UIDS = {}
def get_layer_uid(layer_name=''):
"""Helper function, assigns unique layer IDs."""
if layer_name not in _LAYER_UIDS:
_LAYER_UIDS[layer_name] = 1
return 1
else:
_LAYER_UIDS[layer_name] += 1
return _LAYER_UIDS[layer_name]
class Layer(object):
"""Base layer class. Defines basic API for all layer objects.
Implementation inspired by keras (http://keras.io).
# Properties
name: String, defines the variable scope of the layer.
logging: Boolean, switches Tensorflow histogram logging on/off
# Methods
_call(inputs): Defines computation graph of layer
(i.e. takes input, returns output)
__call__(inputs): Wrapper for _call()
_log_vars(): Log all variables
"""
def __init__(self, **kwargs):
allowed_kwargs = {'name', 'logging', 'model_size'}
for kwarg in kwargs.keys():
assert kwarg in allowed_kwargs, 'Invalid keyword argument: ' + kwarg
name = kwargs.get('name')
if not name:
layer = self.__class__.__name__.lower()
name = layer + '_' + str(get_layer_uid(layer))
self.name = name
self.vars = {}
logging = kwargs.get('logging', False)
self.logging = logging
self.sparse_inputs = False
def _call(self, inputs):
return inputs
def __call__(self, inputs):
with tf.name_scope(self.name):
if self.logging and not self.sparse_inputs:
tf.summary.histogram(self.name + '/inputs', inputs)
outputs = self._call(inputs)
if self.logging:
tf.summary.histogram(self.name + '/outputs', outputs)
return outputs
def _log_vars(self):
for var in self.vars:
tf.summary.histogram(self.name + '/vars/' + var, self.vars[var])
class Dense(Layer):
"""Dense layer."""
def __init__(self, input_dim, output_dim, dropout=0.,
act=tf.nn.relu, placeholders=None, bias=True, featureless=False,
sparse_inputs=False, **kwargs):
super(Dense, self).__init__(**kwargs)
self.dropout = dropout
self.act = act
self.featureless = featureless
self.bias = bias
self.input_dim = input_dim
self.output_dim = output_dim
# helper variable for sparse dropout
self.sparse_inputs = sparse_inputs
if sparse_inputs:
self.num_features_nonzero = placeholders['num_features_nonzero']
with tf.variable_scope(self.name + '_vars'):
self.vars['weights'] = tf.get_variable('weights', shape=(input_dim, output_dim),
dtype=tf.float32,
initializer=tf.contrib.layers.xavier_initializer(),
regularizer=tf.contrib.layers.l2_regularizer(FLAGS.weight_decay))
if self.bias:
self.vars['bias'] = zeros([output_dim], name='bias')
if self.logging:
self._log_vars()
def _call(self, inputs):
x = inputs
x = tf.nn.dropout(x, 1-self.dropout)
# transform
output = tf.matmul(x, self.vars['weights'])
# bias
if self.bias:
output += self.vars['bias']
return self.act(output)

42
graphsage/metrics.py Normal file
View File

@ -0,0 +1,42 @@
import tensorflow as tf
# DISCLAIMER:
# Parts of this code file were originally forked from
# https://github.com/tkipf/gcn
# which itself was very inspired by the keras package
# (A full license with de-anonymized attributions will be provided in the
# public repo of this code base)
def masked_logit_cross_entropy(preds, labels, mask):
loss = tf.nn.sigmoid_cross_entropy_with_logits(logits=preds, labels=labels)
loss = tf.reduce_sum(loss, axis=1)
mask = tf.cast(mask, dtype=tf.float32)
mask /= tf.maximum(tf.reduce_sum(mask), tf.constant([1.]))
loss *= mask
return tf.reduce_mean(loss)
def masked_softmax_cross_entropy(preds, labels, mask):
loss = tf.nn.softmax_cross_entropy_with_logits(logits=preds, labels=labels)
# loss = tf.reduce_sum(loss, axis=1)
mask = tf.cast(mask, dtype=tf.float32)
mask /= tf.maximum(tf.reduce_sum(mask), tf.constant([1.]))
loss *= mask
return tf.reduce_mean(loss)
def masked_l2(preds, actuals, mask):
"""Softmax cross-entropy loss with masking."""
loss = tf.nn.l2(preds, actuals)
mask = tf.cast(mask, dtype=tf.float32)
mask /= tf.reduce_mean(mask)
loss *= mask
return tf.reduce_mean(loss)
def masked_accuracy(preds, labels, mask):
"""Accuracy with masking."""
correct_prediction = tf.equal(tf.argmax(preds, 1), tf.argmax(labels, 1))
accuracy_all = tf.cast(correct_prediction, tf.float32)
mask = tf.cast(mask, dtype=tf.float32)
mask /= tf.reduce_mean(mask)
accuracy_all *= mask
return tf.reduce_mean(accuracy_all)

291
graphsage/minibatch.py Normal file
View File

@ -0,0 +1,291 @@
from __future__ import division
from __future__ import print_function
import numpy as np
np.random.seed(123)
class EdgeMinibatchIterator(object):
""" This minibatch iterator iterates over batches of sampled edges or
random pairs of co-occuring edges.
"""
def __init__(self, G, id2idx,
placeholders, context_pairs=None,batch_size=100, max_degree=25, num_neg_samples=20,
n2v_retrain=False, fixed_n2v=False,
**kwargs):
self.G = G
self.nodes = G.nodes()
self.id2idx = id2idx
self.placeholders = placeholders
self.batch_size = batch_size
self.max_degree = max_degree
self.num_neg_samples = num_neg_samples
self.batch_num = 0
self.nodes = np.random.permutation(G.nodes())
self.adj, self.deg = self.construct_adj()
self.test_adj = self.construct_test_adj()
if context_pairs is None:
edges = G.edges()
else:
edges = context_pairs
self.train_edges = self.edges = np.random.permutation(edges)
if not n2v_retrain:
self.train_edges = self._remove_isolated(self.train_edges)
self.val_edges = [e for e in G.edges_iter() if G[e[0]][e[1]]['train_removed']]
else:
if fixed_n2v:
self.train_edges = self.val_edges = self._n2v_prune(self.edges)
else:
self.train_edges = self.val_edges = self.edges
print(len([n for n in G.nodes_iter() if not G.node[n]['test'] and not G.node[n]['val']]), 'train nodes')
print(len([n for n in G.nodes_iter() if G.node[n]['test'] or G.node[n]['val']]), 'test nodes')
self.val_set_size = len(self.val_edges)
def _n2v_prune(self, edges):
is_val = lambda n : self.G.node[n]["val"] or self.G.node[n]["test"]
return [e for e in edges if not is_val(e[1])]
def _remove_isolated(self, edge_list):
new_edge_list = []
for n1, n2 in edge_list:
if (self.deg[self.id2idx[n1]] == 0 or self.deg[self.id2idx[n2]] == 0) \
and (not self.G.node[n1]['test'] or self.G.node[n1]['val']) \
and (not self.G.node[n2]['test'] or self.G.node[n2]['val']):
continue
else:
new_edge_list.append((n1,n2))
return new_edge_list
def construct_adj(self):
adj = len(self.id2idx)*np.ones((len(self.id2idx)+1, self.max_degree))
deg = np.zeros((len(self.id2idx),))
for nodeid in self.G.nodes():
if self.G.node[nodeid]['test'] or self.G.node[nodeid]['val']:
continue
neighbors = np.array([self.id2idx[neighbor]
for neighbor in self.G.neighbors(nodeid)
if (not self.G[nodeid][neighbor]['train_removed'])])
deg[self.id2idx[nodeid]] = len(neighbors)
if len(neighbors) == 0:
continue
if len(neighbors) > self.max_degree:
neighbors = np.random.choice(neighbors, self.max_degree, replace=False)
elif len(neighbors) < self.max_degree:
neighbors = np.random.choice(neighbors, self.max_degree, replace=True)
adj[self.id2idx[nodeid], :] = neighbors
return adj, deg
def construct_test_adj(self):
adj = len(self.id2idx)*np.ones((len(self.id2idx)+1, self.max_degree))
for nodeid in self.G.nodes():
neighbors = np.array([self.id2idx[neighbor]
for neighbor in self.G.neighbors(nodeid)])
if len(neighbors) == 0:
continue
if len(neighbors) > self.max_degree:
neighbors = np.random.choice(neighbors, self.max_degree, replace=False)
elif len(neighbors) < self.max_degree:
neighbors = np.random.choice(neighbors, self.max_degree, replace=True)
adj[self.id2idx[nodeid], :] = neighbors
return adj
def end(self):
return self.batch_num * self.batch_size > len(self.train_edges) - self.batch_size + 1
def batch_feed_dict(self, batch_edges):
batch1 = []
batch2 = []
for node1, node2 in batch_edges:
batch1.append(self.id2idx[node1])
batch2.append(self.id2idx[node2])
feed_dict = dict()
feed_dict.update({self.placeholders['batch_size'] : len(batch_edges)})
feed_dict.update({self.placeholders['batch1']: batch1})
feed_dict.update({self.placeholders['batch2']: batch2})
return feed_dict
def next_minibatch_feed_dict(self):
start = self.batch_num * self.batch_size
self.batch_num += 1
batch_edges = self.train_edges[start : start + self.batch_size]
return self.batch_feed_dict(batch_edges)
def val_feed_dict(self, size=None):
edge_list = self.val_edges
if size is None:
return self.batch_feed_dict(edge_list)
else:
ind = np.random.permutation(len(edge_list))
val_edges = [edge_list[i] for i in ind[:min(size, len(ind))]]
return self.batch_feed_dict(val_edges)
def incremental_val_feed_dict(self, size, iter_num):
edge_list = self.val_edges
val_edges = edge_list[iter_num*size:min((iter_num+1)*size,
len(edge_list))]
return self.batch_feed_dict(val_edges), (iter_num+1)*size >= len(self.val_edges), val_edges
def incremental_embed_feed_dict(self, size, iter_num):
node_list = self.nodes
val_nodes = node_list[iter_num*size:min((iter_num+1)*size,
len(node_list))]
val_edges = [(n,n) for n in val_nodes]
return self.batch_feed_dict(val_edges), (iter_num+1)*size >= len(node_list), val_edges
def label_val(self):
train_edges = []
val_edges = []
for n1, n2 in self.G.edges_iter():
if (self.G.node[n1]['val'] or self.G.node[n1]['test']
or self.G.node[n2]['val'] or self.G.node[n2]['test']):
val_edges.append((n1,n2))
else:
train_edges.append((n1,n2))
return train_edges, val_edges
def shuffle(self):
""" Re-shuffle the training set.
Also reset the batch number.
"""
self.train_edges = np.random.permutation(self.train_edges)
self.nodes = np.random.permutation(self.nodes)
self.batch_num = 0
class NodeMinibatchIterator(object):
"""
This minibatch iterator iterates over nodes for supervised learning.
"""
def __init__(self, G, id2idx,
placeholders, label_map, num_classes, context_pairs=None,
batch_size=100, max_degree=25,
**kwargs):
self.G = G
self.nodes = G.nodes()
self.id2idx = id2idx
self.placeholders = placeholders
self.batch_size = batch_size
self.max_degree = max_degree
self.batch_num = 0
self.label_map = label_map
self.num_classes = num_classes
self.adj, self.deg = self.construct_adj()
self.test_adj = self.construct_test_adj()
self.val_nodes = [n for n in self.G.nodes_iter() if self.G.node[n]['val']]
self.test_nodes = [n for n in self.G.nodes_iter() if self.G.node[n]['test']]
self.no_train_nodes_set = set(self.val_nodes + self.test_nodes)
self.train_nodes = set(G.nodes()).difference(self.no_train_nodes_set)
# don't train on nodes that only have edges to test set
self.train_nodes = [n for n in self.train_nodes if self.deg[id2idx[n]] > 0]
def _make_label_vec(self, node):
label = self.label_map[node]
if isinstance(label, list):
label_vec = np.array(label)
else:
label_vec = np.zeros((self.num_classes))
class_ind = self.label_map[node]
label_vec[class_ind] = 1
return label_vec
def construct_adj(self):
adj = len(self.id2idx)*np.ones((len(self.id2idx)+1, self.max_degree))
deg = np.zeros((len(self.id2idx),))
for nodeid in self.G.nodes():
if self.G.node[nodeid]['test'] or self.G.node[nodeid]['val']:
continue
neighbors = np.array([self.id2idx[neighbor]
for neighbor in self.G.neighbors(nodeid)
if (not self.G[nodeid][neighbor]['train_removed'])])
deg[self.id2idx[nodeid]] = len(neighbors)
if len(neighbors) == 0:
continue
if len(neighbors) > self.max_degree:
neighbors = np.random.choice(neighbors, self.max_degree, replace=False)
elif len(neighbors) < self.max_degree:
neighbors = np.random.choice(neighbors, self.max_degree, replace=True)
adj[self.id2idx[nodeid], :] = neighbors
return adj, deg
def construct_test_adj(self):
adj = len(self.id2idx)*np.ones((len(self.id2idx)+1, self.max_degree))
for nodeid in self.G.nodes():
neighbors = np.array([self.id2idx[neighbor]
for neighbor in self.G.neighbors(nodeid)])
if len(neighbors) == 0:
continue
if len(neighbors) > self.max_degree:
neighbors = np.random.choice(neighbors, self.max_degree, replace=False)
elif len(neighbors) < self.max_degree:
neighbors = np.random.choice(neighbors, self.max_degree, replace=True)
adj[self.id2idx[nodeid], :] = neighbors
return adj
def end(self):
return self.batch_num * self.batch_size > len(self.train_nodes) - self.batch_size
def batch_feed_dict(self, batch_nodes, val=False):
batch1id = batch_nodes
batch1 = [self.id2idx[n] for n in batch1id]
labels = np.vstack([self._make_label_vec(node) for node in batch1id])
feed_dict = dict()
feed_dict.update({self.placeholders['batch_size'] : len(batch1)})
feed_dict.update({self.placeholders['batch']: batch1})
feed_dict.update({self.placeholders['labels']: labels})
return feed_dict, labels
def node_val_feed_dict(self, size=None, test=False):
if test:
val_nodes = self.test_nodes
else:
val_nodes = self.val_nodes
if not size is None:
val_nodes = np.random.choice(val_nodes, size, replace=True)
# add a dummy neighbor
ret_val = self.batch_feed_dict(val_nodes)
return ret_val[0], ret_val[1]
def incremental_node_val_feed_dict(self, size, iter_num, test=False):
if test:
val_nodes = self.test_nodes
else:
val_nodes = self.val_nodes
val_node_subset = val_nodes[iter_num*size:min((iter_num+1)*size,
len(val_nodes))]
# add a dummy neighbor
ret_val = self.batch_feed_dict(val_node_subset)
return ret_val[0], ret_val[1], (iter_num+1)*size >= len(val_nodes), val_node_subset
def next_minibatch_feed_dict(self):
start = self.batch_num * self.batch_size
self.batch_num += 1
batch_nodes = self.train_nodes[start : start + self.batch_size]
return self.batch_feed_dict(batch_nodes)
def incremental_embed_feed_dict(self, size, iter_num):
node_list = self.nodes
val_nodes = node_list[iter_num*size:min((iter_num+1)*size,
len(node_list))]
return self.batch_feed_dict(val_nodes), (iter_num+1)*size >= len(node_list), val_nodes
def shuffle(self):
""" Re-shuffle the training set.
Also reset the batch number.
"""
self.train_nodes = np.random.permutation(self.train_nodes)
self.batch_num = 0

485
graphsage/models.py Normal file
View File

@ -0,0 +1,485 @@
from collections import namedtuple
import tensorflow as tf
import math
import graphsage.layers as layers
import graphsage.metrics as metrics
from .prediction import BipartiteEdgePredLayer
from .aggregators import MeanAggregator, PoolingAggregator, SeqAggregator, GCNAggregator, TwoLayerPoolingAggregator
flags = tf.app.flags
FLAGS = flags.FLAGS
# DISCLAIMER:
# Boilerplate parts of this code file were originally forked from
# https://github.com/tkipf/gcn
# which itself was very inspired by the keras package
# (A full license with proper attributions will be provided in the
# public repo of this code base)
class Model(object):
def __init__(self, **kwargs):
allowed_kwargs = {'name', 'logging', 'model_size'}
for kwarg in kwargs.keys():
assert kwarg in allowed_kwargs, 'Invalid keyword argument: ' + kwarg
name = kwargs.get('name')
if not name:
name = self.__class__.__name__.lower()
self.name = name
logging = kwargs.get('logging', False)
self.logging = logging
self.vars = {}
self.placeholders = {}
self.layers = []
self.activations = []
self.inputs = None
self.outputs = None
self.loss = 0
self.accuracy = 0
self.optimizer = None
self.opt_op = None
def _build(self):
raise NotImplementedError
def build(self):
""" Wrapper for _build() """
with tf.variable_scope(self.name):
self._build()
# Build sequential layer model
self.activations.append(self.inputs)
for layer in self.layers:
hidden = layer(self.activations[-1])
self.activations.append(hidden)
self.outputs = self.activations[-1]
# Store model variables for easy access
variables = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope=self.name)
self.vars = {var.name: var for var in variables}
# Build metrics
self._loss()
self._accuracy()
self.opt_op = self.optimizer.minimize(self.loss)
def predict(self):
pass
def _loss(self):
raise NotImplementedError
def _accuracy(self):
raise NotImplementedError
def save(self, sess=None):
if not sess:
raise AttributeError("TensorFlow session not provided.")
saver = tf.train.Saver(self.vars)
save_path = saver.save(sess, "tmp/%s.ckpt" % self.name)
print("Model saved in file: %s" % save_path)
def load(self, sess=None):
if not sess:
raise AttributeError("TensorFlow session not provided.")
saver = tf.train.Saver(self.vars)
save_path = "tmp/%s.ckpt" % self.name
saver.restore(sess, save_path)
print("Model restored from file: %s" % save_path)
class MLP(Model):
def __init__(self, placeholders, dims, categorical=True, **kwargs):
super(MLP, self).__init__(**kwargs)
self.dims = dims
self.input_dim = dims[0]
self.output_dim = dims[-1]
self.placeholders = placeholders
self.categorical = categorical
self.inputs = placeholders['features']
self.labels = placeholders['labels']
self.optimizer = tf.train.AdamOptimizer(learning_rate=FLAGS.learning_rate)
self.build()
def _loss(self):
# Weight decay loss
for var in self.layers[0].vars.values():
self.loss += FLAGS.weight_decay * tf.nn.l2_loss(var)
# Cross entropy error
if self.categorical:
self.loss += metrics.masked_softmax_cross_entropy(self.outputs, self.placeholders['labels'],
self.placeholders['labels_mask'])
# L2
else:
diff = self.labels - self.outputs
self.loss += tf.reduce_sum(tf.sqrt(tf.reduce_sum(diff * diff, axis=1)))
def _accuracy(self):
if self.categorical:
self.accuracy = metrics.masked_accuracy(self.outputs, self.placeholders['labels'],
self.placeholders['labels_mask'])
def _build(self):
self.layers.append(layers.Dense(input_dim=self.input_dim,
output_dim=self.dims[1],
act=tf.nn.relu,
dropout=self.placeholders['dropout'],
sparse_inputs=False,
logging=self.logging))
self.layers.append(layers.Dense(input_dim=self.dims[1],
output_dim=self.output_dim,
act=lambda x: x,
dropout=self.placeholders['dropout'],
logging=self.logging))
def predict(self):
return tf.nn.softmax(self.outputs)
class GeneralizedModel(Model):
"""
Base class for models that aren't constructed from traditional, sequential layers.
Subclasses must set self.outputs in _build method
(Removes the layers idiom from build method of the Model class)
"""
def __init__(self, **kwargs):
super(GeneralizedModel, self).__init__(**kwargs)
def build(self):
""" Wrapper for _build() """
with tf.variable_scope(self.name):
self._build()
# Store model variables for easy access
variables = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope=self.name)
self.vars = {var.name: var for var in variables}
# Build metrics
self._loss()
self._accuracy()
self.opt_op = self.optimizer.minimize(self.loss)
# SAGEInfo is a namedtuple that specifies the parameters
# of the recursive sampled GCN layers
SAGEInfo = namedtuple("SAGEInfo",
['layer_name', # name of the layer (to get feature embedding etc.)
'neigh_sampler', # callable neigh_sampler constructor
'num_samples',
'output_dim' # the output (i.e., hidden) dimension
])
class SampleAndAggregate(GeneralizedModel):
"""
Implementation of a standard 2-step graph convolutional network
Uses random sampling on neighborhoods
"""
def __init__(self, placeholders, features, adj, degrees,
layer_infos, concat=True, aggregator_type="mean",
model_size="small",
**kwargs):
'''
Args:
- layer_infos: List of SGCInfo namedtuples that describe the parameters of all
the recursive layers. See SGCInfo definition above.
'''
super(SampleAndAggregate, self).__init__(**kwargs)
if aggregator_type == "mean":
self.aggregator_cls = MeanAggregator
elif aggregator_type == "seq":
self.aggregator_cls = SeqAggregator
elif aggregator_type == "pool":
self.aggregator_cls = PoolingAggregator
elif aggregator_type == "pool_2":
self.aggregator_cls = TwoLayerPoolingAggregator
elif aggregator_type == "gcn":
self.aggregator_cls = GCNAggregator
else:
raise Exception("Unknown aggregator: ", self.aggregator_cls)
# get info from placeholders...
self.inputs1 = placeholders["batch1"]
self.inputs2 = placeholders["batch2"]
self.model_size = model_size
self.adj_info = adj
self.features = tf.Variable(tf.constant(features, dtype=tf.float32), trainable=False)
self.degrees = degrees
self.concat = concat
self.dims = [features.shape[1]]
self.dims.extend([layer_infos[i].output_dim for i in range(len(layer_infos))])
self.batch_size = placeholders["batch_size"]
self.placeholders = placeholders
self.layer_infos = layer_infos
self.optimizer = tf.train.AdamOptimizer(learning_rate=FLAGS.learning_rate)
self.build()
def sample(self, inputs, layer_infos, batch_size=None):
""" Sample neighbors to be the supportive fields for multi-layer convolutions.
Args:
inputs: batch inputs
batch_size: the number of inputs (different for batch inputs and negative samples).
"""
if batch_size is None:
batch_size = self.batch_size
samples = [inputs]
# size of convolution support at each layer per node
support_size = 1
support_sizes = [support_size]
for k in range(len(layer_infos)):
t = len(layer_infos) - k - 1
support_size *= layer_infos[t].num_samples
sampler = layer_infos[t].neigh_sampler
node = sampler((samples[k], layer_infos[t].num_samples))
samples.append(tf.reshape(node, [support_size * batch_size,]))
support_sizes.append(support_size)
return samples, support_sizes
def aggregate(self, samples, input_features, dims, num_samples, support_sizes, batch_size=None,
aggregators=None, name=None, concat=False, model_size="small"):
""" At each layer, aggregate hidden representations of neighbors to compute the hidden representations
at next layer.
Args:
samples: a list of samples of variable hops away for convolving at each layer of the
network. Length is the number of layers + 1. Each is a vector of node indices.
input_features: the input features for each sample of various hops away.
dims: a list of dimensions of the hidden representations from the input layer to the
final layer. Length is the number of layers + 1.
num_samples: list of number of samples for each layer.
support_sizes: the number of nodes to gather information from for each layer.
batch_size: the number of inputs (different for batch inputs and negative samples).
Returns:
The hidden representation at the final layer for all nodes in batch
"""
if batch_size is None:
batch_size = self.batch_size
# length: number of layers + 1
hidden = [tf.nn.embedding_lookup(input_features, node_samples) for node_samples in samples]
new_agg = aggregators is None
if new_agg:
aggregators = []
for layer in range(len(num_samples)):
if new_agg:
dim_mult = 2 if concat and (layer != 0) else 1
# aggregator at current layer
if layer == len(num_samples) - 1:
aggregator = self.aggregator_cls(dim_mult*dims[layer], dims[layer+1], act=lambda x : x,
dropout=self.placeholders['dropout'],
name=name, concat=concat, model_size=model_size)
else:
aggregator = self.aggregator_cls(dim_mult*dims[layer], dims[layer+1],
dropout=self.placeholders['dropout'],
name=name, concat=concat, model_size=model_size)
aggregators.append(aggregator)
else:
aggregator = aggregators[layer]
# hidden representation at current layer for all support nodes that are various hops away
next_hidden = []
# as layer increases, the number of support nodes needed decreases
for hop in range(len(num_samples) - layer):
dim_mult = 2 if concat and (layer != 0) else 1
neigh_dims = [batch_size * support_sizes[hop],
num_samples[len(num_samples) - hop - 1],
dim_mult*dims[layer]]
h = aggregator((hidden[hop],
tf.reshape(hidden[hop + 1], neigh_dims)))
next_hidden.append(h)
hidden = next_hidden
return hidden[0], aggregators
def _build(self):
labels = tf.reshape(
tf.cast(self.placeholders['batch2'], dtype=tf.int64),
[self.batch_size, 1])
self.neg_samples, _, _ = (tf.nn.fixed_unigram_candidate_sampler(
true_classes=labels,
num_true=1,
num_sampled=FLAGS.neg_sample_size,
unique=False,
range_max=len(self.degrees),
distortion=0.75,
unigrams=self.degrees.tolist()))
# perform "convolution"
samples1, support_sizes1 = self.sample(self.inputs1, self.layer_infos)
samples2, support_sizes2 = self.sample(self.inputs2, self.layer_infos)
num_samples = [layer_info.num_samples for layer_info in self.layer_infos]
self.outputs1, self.aggregators = self.aggregate(samples1, [self.features], self.dims, num_samples,
support_sizes1, concat=self.concat, model_size=self.model_size)
self.outputs2, _ = self.aggregate(samples2, [self.features], self.dims, num_samples,
support_sizes2, aggregators=self.aggregators, concat=self.concat,
model_size=self.model_size)
neg_samples, neg_support_sizes = self.sample(self.neg_samples, self.layer_infos,
FLAGS.neg_sample_size)
self.neg_outputs, _ = self.aggregate(neg_samples, [self.features], self.dims, num_samples,
neg_support_sizes, batch_size=FLAGS.neg_sample_size, aggregators=self.aggregators,
concat=self.concat, model_size=self.model_size)
dim_mult = 2 if self.concat else 1
self.link_pred_layer = BipartiteEdgePredLayer(dim_mult*self.dims[-1],
dim_mult*self.dims[-1], self.placeholders, act=tf.nn.sigmoid,
bilinear_weights=False,
name='edge_predict')
self.outputs1 = tf.nn.l2_normalize(self.outputs1, 1)
self.outputs2 = tf.nn.l2_normalize(self.outputs2, 1)
self.neg_outputs = tf.nn.l2_normalize(self.neg_outputs, 1)
def build(self):
self._build()
# TF graph management
self._loss()
self._accuracy()
self.loss = self.loss / tf.cast(self.batch_size, tf.float32)
grads_and_vars = self.optimizer.compute_gradients(self.loss)
clipped_grads_and_vars = [(tf.clip_by_value(grad, -5.0, 5.0) if grad is not None else None, var)
for grad, var in grads_and_vars]
self.grad, _ = clipped_grads_and_vars[0]
self.opt_op = self.optimizer.apply_gradients(clipped_grads_and_vars)
def _loss(self):
for aggregator in self.aggregators:
for var in aggregator.vars.values():
self.loss += FLAGS.weight_decay * tf.nn.l2_loss(var)
self.loss = self.link_pred_layer.loss(self.outputs1, self.outputs2, self.neg_outputs)
self.loss = self.loss / tf.cast(self.batch_size, tf.float32)
tf.summary.scalar('loss', self.loss)
def _accuracy(self):
# shape: [batch_size]
aff = self.link_pred_layer.affinity(self.outputs1, self.outputs2)
# shape : [batch_size x num_neg_samples]
self.neg_aff = self.link_pred_layer.neg_cost(self.outputs1, self.neg_outputs)
self.neg_aff = tf.reshape(self.neg_aff, [self.batch_size, FLAGS.neg_sample_size])
_aff = tf.expand_dims(aff, axis=1)
self.aff_all = tf.concat(axis=1, values=[self.neg_aff, _aff])
size = tf.shape(self.aff_all)[1]
_, indices_of_ranks = tf.nn.top_k(self.aff_all, k=size)
_, self.ranks = tf.nn.top_k(-indices_of_ranks, k=size)
self.mrr = tf.reduce_mean(tf.div(1.0, tf.cast(self.ranks[:, -1] + 1, tf.float32)))
tf.summary.scalar('mrr', self.mrr)
class Node2VecModel(GeneralizedModel):
def __init__(self, placeholders, dict_size, degrees, name=None,
nodevec_dim=50, lr=0.001, **kwargs):
""" Simple version of Node2Vec algorithm.
Args:
dict_size1: the total number of nodes in set1.
dict_size2: the total number of nodes in set2.
nodevec_dim: dimension of the vector representation of node.
lr: learning rate of optimizer.
"""
super(Node2VecModel, self).__init__(**kwargs)
self.placeholders = placeholders
self.degrees = degrees
self.inputs1 = placeholders["batch1"]
self.inputs2 = placeholders["batch2"]
self.batch_size = placeholders['batch_size']
self.hidden_dim = nodevec_dim
# following the tensorflow word2vec tutorial
self.target_embeds = tf.Variable(
tf.random_uniform([dict_size, nodevec_dim], -1, 1),
name="target_embeds")
self.context_embeds = tf.Variable(
tf.truncated_normal([dict_size, nodevec_dim],
stddev=1.0 / math.sqrt(nodevec_dim)),
name="context_embeds")
self.context_bias = tf.Variable(
tf.zeros([dict_size]),
name="context_bias")
self.optimizer = tf.train.GradientDescentOptimizer(learning_rate=lr)
self.build()
def _build(self):
labels = tf.reshape(
tf.cast(self.placeholders['batch2'], dtype=tf.int64),
[self.batch_size, 1])
self.neg_samples, _, _ = (tf.nn.fixed_unigram_candidate_sampler(
true_classes=labels,
num_true=1,
num_sampled=FLAGS.neg_sample_size,
unique=True,
range_max=len(self.degrees),
distortion=0.75,
unigrams=self.degrees.tolist()))
self.outputs1 = tf.nn.embedding_lookup(self.target_embeds, self.inputs1)
self.outputs2 = tf.nn.embedding_lookup(self.context_embeds, self.inputs2)
self.outputs2_bias = tf.nn.embedding_lookup(self.context_bias, self.inputs2)
self.neg_outputs = tf.nn.embedding_lookup(self.context_embeds, self.neg_samples)
self.neg_outputs_bias = tf.nn.embedding_lookup(self.context_bias, self.neg_samples)
self.link_pred_layer = BipartiteEdgePredLayer(self.hidden_dim, self.hidden_dim,
self.placeholders, bilinear_weights=False)
def build(self):
self._build()
# TF graph management
self._loss()
self._minimize()
self._accuracy()
def _minimize(self):
self.opt_op = self.optimizer.minimize(self.loss)
def _loss(self):
aff = tf.reduce_sum(tf.multiply(self.outputs1, self.outputs2), 1) + self.outputs2_bias
neg_aff = tf.matmul(self.outputs2, tf.transpose(self.neg_outputs)) + self.neg_outputs_bias
true_xent = tf.nn.sigmoid_cross_entropy_with_logits(
labels=tf.ones_like(aff), logits=aff)
negative_xent = tf.nn.sigmoid_cross_entropy_with_logits(
labels=tf.zeros_like(neg_aff), logits=neg_aff)
loss = tf.reduce_sum(true_xent) + tf.reduce_sum(negative_xent)
self.loss = loss / tf.cast(self.batch_size, tf.float32)
tf.summary.scalar('loss', self.loss)
def _accuracy(self):
# shape: [batch_size]
aff = self.link_pred_layer.affinity(self.outputs1, self.outputs2)
# shape : [batch_size x num_neg_samples]
self.neg_aff = self.link_pred_layer.neg_cost(self.outputs1, self.neg_outputs)
self.neg_aff = tf.reshape(self.neg_aff, [self.batch_size, FLAGS.neg_sample_size])
_aff = tf.expand_dims(aff, axis=1)
self.aff_all = tf.concat(axis=1, values=[self.neg_aff, _aff])
size = tf.shape(self.aff_all)[1]
_, indices_of_ranks = tf.nn.top_k(self.aff_all, k=size)
_, self.ranks = tf.nn.top_k(-indices_of_ranks, k=size)
self.mrr = tf.reduce_mean(tf.div(1.0, tf.cast(self.ranks[:, -1] + 1, tf.float32)))
tf.summary.scalar('mrr', self.mrr)

View File

@ -0,0 +1,30 @@
from __future__ import division
from __future__ import print_function
from graphsage.layers import Layer
import tensorflow as tf
flags = tf.app.flags
FLAGS = flags.FLAGS
"""
Classes that are used to sample node neighborhoods during
convolutions.
"""
class UniformNeighborSampler(Layer):
"""
Uniformly samples neighbors.
Assumes that adj lists are padded with random re-sampling
"""
def __init__(self, adj_info, **kwargs):
super(UniformNeighborSampler, self).__init__(**kwargs)
self.adj_info = adj_info
def _call(self, inputs):
ids, num_samples = inputs
adj_lists = tf.nn.embedding_lookup(self.adj_info, ids)
adj_lists = tf.transpose(tf.random_shuffle(tf.transpose(adj_lists)))
adj_lists = tf.slice(adj_lists, [0,0], [-1, num_samples])
return adj_lists

99
graphsage/prediction.py Normal file
View File

@ -0,0 +1,99 @@
from __future__ import division
from __future__ import print_function
from graphsage.inits import glorot, zeros
from graphsage.layers import Layer
import tensorflow as tf
flags = tf.app.flags
FLAGS = flags.FLAGS
class BipartiteEdgePredLayer(Layer):
def __init__(self, input_dim1, input_dim2, placeholders, dropout=False, act=tf.nn.sigmoid,
bias=False, bilinear_weights=False, **kwargs):
"""
Args:
bilinear_weights: use a bilinear weight for affinity calculation: u^T A v. If set to
false, it is assumed that input dimensions are the same and the affinity will be
based on dot product.
"""
super(BipartiteEdgePredLayer, self).__init__(**kwargs)
self.input_dim1 = input_dim1
self.input_dim2 = input_dim2
self.act = act
self.bias = bias
self.eps = 1e-7
self.bilinear_weights = bilinear_weights
if dropout:
self.dropout = placeholders['dropout']
else:
self.dropout = 0.
# output a likelihood term
self.output_dim = 1
with tf.variable_scope(self.name + '_vars'):
# bilinear form
if bilinear_weights:
#self.vars['weights'] = glorot([input_dim1, input_dim2],
# name='pred_weights')
self.vars['weights'] = tf.get_variable(
'pred_weights',
shape=(input_dim1, input_dim2),
dtype=tf.float32,
initializer=tf.contrib.layers.xavier_initializer())
if self.bias:
self.vars['bias'] = zeros([self.output_dim], name='bias')
if self.logging:
self._log_vars()
def affinity(self, inputs1, inputs2):
""" Affinity score between batch of inputs1 and inputs2.
Args:
inputs1: tensor of shape [batch_size x feature_size].
"""
# shape: [batch_size, input_dim1]
if self.bilinear_weights:
prod = tf.matmul(inputs2, tf.transpose(self.vars['weights']))
self.prod = prod
result = tf.reduce_sum(inputs1 * prod, axis=1)
else:
result = tf.reduce_sum(inputs1 * inputs2, axis=1)
return result
def neg_cost(self, inputs1, neg_samples):
""" For each input in batch, compute the sum of its affinity to negative samples.
Returns:
Tensor of shape [batch_size x num_neg_samples]. For each node, a list of affinities to
negative samples is computed.
"""
if self.bilinear_weights:
inputs1 = tf.matmul(inputs1, self.vars['weights'])
neg_aff = tf.matmul(inputs1, tf.transpose(neg_samples))
return neg_aff
def loss(self, inputs1, inputs2, neg_samples):
""" negative sampling loss.
Args:
neg_samples: tensor of shape [num_neg_samples x input_dim2]. Negative samples for all
inputs in batch inputs1.
"""
aff = self.affinity(inputs1, inputs2)
neg_aff = self.neg_cost(inputs1, neg_samples)
true_xent = tf.nn.sigmoid_cross_entropy_with_logits(
labels=tf.ones_like(aff), logits=aff)
negative_xent = tf.nn.sigmoid_cross_entropy_with_logits(
labels=tf.zeros_like(neg_aff), logits=neg_aff)
loss = tf.reduce_sum(true_xent) + tf.reduce_sum(negative_xent)
return loss
def weights_norm(self):
return tf.nn.l2_norm(self.vars['weights'])

View File

@ -0,0 +1,100 @@
import tensorflow as tf
import graphsage.models as models
import graphsage.layers as layers
from graphsage.aggregators import MeanAggregator, PoolingAggregator, SeqAggregator, GCNAggregator, TwoLayerPoolingAggregator
flags = tf.app.flags
FLAGS = flags.FLAGS
class SupervisedGraphsage(models.SampleAndAggregate):
def __init__(self, num_classes,
placeholders, features, adj, degrees,
layer_infos, concat=True, aggregator_type="mean",
model_size="small", sigmoid_loss=False,
**kwargs):
models.GeneralizedModel.__init__(self, **kwargs)
if aggregator_type == "mean":
self.aggregator_cls = MeanAggregator
elif aggregator_type == "seq":
self.aggregator_cls = SeqAggregator
elif aggregator_type == "pool":
self.aggregator_cls = PoolingAggregator
elif aggregator_type == "pool_2":
self.aggregator_cls = TwoLayerPoolingAggregator
elif aggregator_type == "gcn":
self.aggregator_cls = GCNAggregator
else:
raise Exception("Unknown aggregator: ", self.aggregator_cls)
# get info from placeholders...
self.inputs1 = placeholders["batch"]
self.model_size = model_size
self.adj_info = adj
self.features = tf.Variable(tf.constant(features, dtype=tf.float32), trainable=False)
self.degrees = degrees
self.concat = concat
self.num_classes = num_classes
self.sigmoid_loss = sigmoid_loss
self.dims = [features.shape[1]]
self.dims.extend([layer_infos[i].output_dim for i in range(len(layer_infos))])
self.batch_size = placeholders["batch_size"]
self.placeholders = placeholders
self.layer_infos = layer_infos
self.optimizer = tf.train.AdamOptimizer(learning_rate=FLAGS.learning_rate)
self.build()
def build(self):
samples1, support_sizes1 = self.sample(self.inputs1, self.layer_infos)
num_samples = [layer_info.num_samples for layer_info in self.layer_infos]
self.outputs1, self.aggregators = self.aggregate(samples1, [self.features], self.dims, num_samples,
support_sizes1, concat=self.concat, model_size=self.model_size)
dim_mult = 2 if self.concat else 1
self.outputs1 = tf.nn.l2_normalize(self.outputs1, 1)
dim_mult = 2 if self.concat else 1
self.node_pred = layers.Dense(dim_mult*self.dims[-1], self.num_classes,
dropout=self.placeholders['dropout'],
act=lambda x : x)
# TF graph management
self.node_preds = self.node_pred(self.outputs1)
self._loss()
grads_and_vars = self.optimizer.compute_gradients(self.loss)
clipped_grads_and_vars = [(tf.clip_by_value(grad, -5.0, 5.0) if grad is not None else None, var)
for grad, var in grads_and_vars]
self.grad, _ = clipped_grads_and_vars[0]
self.opt_op = self.optimizer.apply_gradients(clipped_grads_and_vars)
self.preds = self.predict()
def _loss(self):
# Weight decay loss
for aggregator in self.aggregators:
for var in aggregator.vars.values():
self.loss += FLAGS.weight_decay * tf.nn.l2_loss(var)
for var in self.node_pred.vars.values():
self.loss += FLAGS.weight_decay * tf.nn.l2_loss(var)
# classification loss
if self.sigmoid_loss:
self.loss += tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(
logits=self.node_preds,
labels=self.placeholders['labels']))
else:
self.loss += tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
logits=self.node_preds,
labels=self.placeholders['labels']))
tf.summary.scalar('loss', self.loss)
def predict(self):
if self.sigmoid_loss:
return tf.nn.sigmoid(self.node_preds)
else:
return tf.nn.softmax(self.node_preds)

View File

@ -0,0 +1,315 @@
from __future__ import division
from __future__ import print_function
import os
import time
import tensorflow as tf
import numpy as np
import sklearn
from sklearn import metrics
from graphsage.supervised_models import SupervisedGraphsage
from graphsage.models import SAGEInfo
from graphsage.minibatch import NodeMinibatchIterator
from graphsage.neigh_samplers import UniformNeighborSampler
from graphsage.utils import load_data
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
# Set random seed
seed = 123
np.random.seed(seed)
tf.set_random_seed(seed)
# Settings
flags = tf.app.flags
FLAGS = flags.FLAGS
tf.app.flags.DEFINE_boolean('log_device_placement', False,
"""Whether to log device placement.""")
#core params..
flags.DEFINE_string('model', 'graphsage_mean', 'model names. See README for possible values.')
flags.DEFINE_float('learning_rate', 0.01, 'initial learning rate.')
flags.DEFINE_string("model_size", "small", "Can be big or small; model specific def'ns")
flags.DEFINE_string('train_prefix', '', 'prefix identifying training data. must be specified.')
# left to default values in main experiments
flags.DEFINE_integer('epochs', 10, 'number of epochs to train.')
flags.DEFINE_float('dropout', 0.0, 'dropout rate (1 - keep probability).')
flags.DEFINE_float('weight_decay', 0.0, 'weight for l2 loss on embedding matrix.')
flags.DEFINE_integer('max_degree', 128, 'maximum node degree.')
flags.DEFINE_integer('samples_1', 25, 'number of samples in layer 1')
flags.DEFINE_integer('samples_2', 10, 'number of users samples in layer 2')
flags.DEFINE_integer('samples_3', 0, 'number of users samples in layer 3. (Only or mean model)')
flags.DEFINE_integer('dim_1', 128, 'Size of output dim (final is 2x this, if using concat)')
flags.DEFINE_integer('dim_2', 128, 'Size of output dim (final is 2x this, if using concat)')
flags.DEFINE_boolean('random_context', True, 'Whether to use random context or direct edges')
flags.DEFINE_integer('batch_size', 512, 'minibatch size.')
flags.DEFINE_boolean('sigmoid', False, 'whether to use sigmoid loss')
#logging, saving, validation settings etc.
flags.DEFINE_string('base_log_dir', '.', 'base directory for logging and saving embeddings')
flags.DEFINE_integer('validate_iter', 5000, "how often to run a validation minibatch.")
flags.DEFINE_integer('validate_batch_size', 256, "how many nodes per validation sample.")
flags.DEFINE_integer('gpu', 1, "which gpu to use.")
flags.DEFINE_integer('print_every', 5, "How often to print training info.")
flags.DEFINE_integer('max_total_steps', 10**10, "Maximum total number of iterations")
os.environ["CUDA_VISIBLE_DEVICES"]=str(FLAGS.gpu)
GPU_MEM_FRACTION = 0.8
def calc_f1(y_true, y_pred):
if not FLAGS.sigmoid:
y_true = np.argmax(y_true, axis=1)
y_pred = np.argmax(y_pred, axis=1)
else:
y_pred[y_pred > 0.5] = 1
y_pred[y_pred <= 0.5] = 0
return metrics.f1_score(y_true, y_pred, average="micro"), metrics.f1_score(y_true, y_pred, average="macro")
# Define model evaluation function
def evaluate(sess, model, minibatch_iter, size=None):
t_test = time.time()
feed_dict_val, labels = minibatch_iter.node_val_feed_dict(size)
node_outs_val = sess.run([model.preds, model.loss],
feed_dict=feed_dict_val)
mic, mac = calc_f1(labels, node_outs_val[0])
return node_outs_val[1], mic, mac, (time.time() - t_test)
def log_dir():
log_dir = FLAGS.base_log_dir + "/sup-" + FLAGS.train_prefix.split("/")[-2]
log_dir += "/{model:s}_{model_size:s}_{lr:0.4f}/".format(
model=FLAGS.model,
model_size=FLAGS.model_size,
lr=FLAGS.learning_rate)
if not os.path.exists(log_dir):
os.makedirs(log_dir)
return log_dir
def incremental_evaluate(sess, model, minibatch_iter, size, test=False):
t_test = time.time()
finished = False
val_losses = []
val_preds = []
labels = []
iter_num = 0
finished = False
while not finished:
feed_dict_val, batch_labels, finished, _ = minibatch_iter.incremental_node_val_feed_dict(size, iter_num, test=test)
node_outs_val = sess.run([model.preds, model.loss],
feed_dict=feed_dict_val)
val_preds.append(node_outs_val[0])
labels.append(batch_labels)
val_losses.append(node_outs_val[1])
iter_num += 1
val_preds = np.vstack(val_preds)
labels = np.vstack(labels)
f1_scores = calc_f1(labels, val_preds)
return np.mean(val_losses), f1_scores[0], f1_scores[1], (time.time() - t_test)
def construct_placeholders(num_classes):
# Define placeholders
placeholders = {
'labels' : tf.placeholder(tf.float32, shape=(None, num_classes), name='labels'),
'batch' : tf.placeholder(tf.int32, shape=(None), name='batch1'),
'dropout': tf.placeholder_with_default(0., shape=(), name='dropout'),
'batch_size' : tf.placeholder(tf.int32, name='batch_size'),
}
return placeholders
def train(train_data, test_data=None):
G = train_data[0]
features = train_data[1]
id_map = train_data[2]
class_map = train_data[4]
if isinstance(class_map.values()[0], list):
num_classes = len(class_map.values()[0])
else:
num_classes = len(set(class_map.values()))
# pad with dummy zero vector
features = np.vstack([features, np.zeros((features.shape[1],))])
context_pairs = train_data[3] if FLAGS.random_context else None
placeholders = construct_placeholders(num_classes)
minibatch = NodeMinibatchIterator(G,
id_map,
placeholders,
class_map,
num_classes,
batch_size=FLAGS.batch_size,
max_degree=FLAGS.max_degree,
context_pairs = context_pairs)
adj_info = tf.Variable(tf.constant(minibatch.adj, dtype=tf.int32), trainable=False, name="adj_info")
if FLAGS.model == 'graphsage_mean':
# Create model
sampler = UniformNeighborSampler(adj_info)
if FLAGS.samples_3 != 0:
layer_infos = [SAGEInfo("node", sampler, FLAGS.samples_1, FLAGS.dim_1),
SAGEInfo("node", sampler, FLAGS.samples_2, FLAGS.dim_2),
SAGEInfo("node", sampler, FLAGS.samples_3, FLAGS.dim_2)]
elif FLAGS.samples_2 != 0:
layer_infos = [SAGEInfo("node", sampler, FLAGS.samples_1, FLAGS.dim_1),
SAGEInfo("node", sampler, FLAGS.samples_2, FLAGS.dim_2)]
else:
layer_infos = [SAGEInfo("node", sampler, FLAGS.samples_1, FLAGS.dim_1)]
model = SupervisedGraphsage(num_classes, placeholders,
features,
adj_info,
minibatch.deg,
layer_infos,
model_size=FLAGS.model_size,
sigmoid_loss = FLAGS.sigmoid,
logging=True)
elif FLAGS.model == 'gcn':
# Create model
sampler = UniformNeighborSampler(adj_info)
layer_infos = [SAGEInfo("node", sampler, FLAGS.samples_1, 2*FLAGS.dim_1),
SAGEInfo("node", sampler, FLAGS.samples_2, 2*FLAGS.dim_2)]
model = SupervisedGraphsage(num_classes, placeholders,
features,
adj_info,
minibatch.deg,
layer_infos=layer_infos,
aggregator_type="gcn",
model_size=FLAGS.model_size,
concat=False,
sigmoid_loss = FLAGS.sigmoid,
logging=True)
elif FLAGS.model == 'graphsage_seq':
sampler = UniformNeighborSampler(adj_info)
layer_infos = [SAGEInfo("node", sampler, FLAGS.samples_1, FLAGS.dim_1),
SAGEInfo("node", sampler, FLAGS.samples_2, FLAGS.dim_2)]
model = SupervisedGraphsage(num_classes, placeholders,
features,
adj_info,
minibatch.deg,
layer_infos=layer_infos,
aggregator_type="seq",
model_size=FLAGS.model_size,
sigmoid_loss = FLAGS.sigmoid,
logging=True)
elif FLAGS.model == 'graphsage_pool':
sampler = UniformNeighborSampler(adj_info)
layer_infos = [SAGEInfo("node", sampler, FLAGS.samples_1, FLAGS.dim_1),
SAGEInfo("node", sampler, FLAGS.samples_2, FLAGS.dim_2)]
model = SupervisedGraphsage(num_classes, placeholders,
features,
adj_info,
minibatch.deg,
layer_infos=layer_infos,
aggregator_type="pool",
model_size=FLAGS.model_size,
sigmoid_loss = FLAGS.sigmoid,
logging=True)
else:
raise Exception('Error: model name unrecognized.')
config = tf.ConfigProto(log_device_placement=FLAGS.log_device_placement)
config.gpu_options.allow_growth = True
#config.gpu_options.per_process_gpu_memory_fraction = GPU_MEM_FRACTION
config.allow_soft_placement = True
# Initialize session
sess = tf.Session(config=config)
merged = tf.summary.merge_all()
summary_writer = tf.summary.FileWriter(log_dir(), sess.graph)
# Init variables
sess.run(tf.global_variables_initializer())
# Train model
total_steps = 0
avg_time = 0.0
epoch_val_costs = []
train_adj_info = tf.assign(adj_info, minibatch.adj)
val_adj_info = tf.assign(adj_info, minibatch.test_adj)
for epoch in range(FLAGS.epochs):
minibatch.shuffle()
iter = 0
print('Epoch: %04d' % (epoch + 1))
epoch_val_costs.append(0)
while not minibatch.end():
# Construct feed dictionary
feed_dict, labels = minibatch.next_minibatch_feed_dict()
feed_dict.update({placeholders['dropout']: FLAGS.dropout})
t = time.time()
# Training step
outs = sess.run([merged, model.opt_op, model.loss, model.preds], feed_dict=feed_dict)
train_cost = outs[2]
if iter % FLAGS.validate_iter == 0:
# Validation
sess.run(val_adj_info.op)
if FLAGS.validate_batch_size == -1:
val_cost, val_f1_mic, val_f1_mac, duration = incremental_evaluate(sess, model, minibatch, FLAGS.batch_size)
else:
val_cost, val_f1_mic, val_f1_mac, duration = evaluate(sess, model, minibatch, FLAGS.validate_batch_size)
sess.run(train_adj_info.op)
epoch_val_costs[-1] += val_cost
if total_steps % FLAGS.print_every == 0:
summary_writer.add_summary(outs[0], total_steps)
# Print results
avg_time = (avg_time * total_steps + time.time() - t) / (total_steps + 1)
if total_steps % FLAGS.print_every == 0:
train_f1_mic, train_f1_mac = calc_f1(labels, outs[-1])
print("Iter:", '%04d' % iter,
"train_loss=", "{:.5f}".format(train_cost),
"train_f1_mic=", "{:.5f}".format(train_f1_mic),
"train_f1_mac=", "{:.5f}".format(train_f1_mac),
"val_loss=", "{:.5f}".format(val_cost),
"val_f1_mic=", "{:.5f}".format(val_f1_mic),
"val_f1_mac=", "{:.5f}".format(val_f1_mac),
"time=", "{:.5f}".format(avg_time))
iter += 1
total_steps += 1
if total_steps > FLAGS.max_total_steps:
break
if total_steps > FLAGS.max_total_steps:
break
print("Optimization Finished!")
sess.run(val_adj_info.op)
val_cost, val_f1_mic, val_f1_mac, duration = incremental_evaluate(sess, model, minibatch, FLAGS.batch_size)
print("Full validation stats:",
"loss=", "{:.5f}".format(val_cost),
"f1_micro=", "{:.5f}".format(val_f1_mic),
"f1_macro=", "{:.5f}".format(val_f1_mac),
"time=", "{:.5f}".format(duration))
with open(log_dir() + "val_stats.txt", "w") as fp:
fp.write("loss={:.5f} f1_micro={:.5f} f1_macro={:.5f} time={:.5f}".
format(val_cost, val_f1_mic, val_f1_mac, duration))
print("Writing test set stats to file (don't peak!)")
val_cost, val_f1_mic, val_f1_mac, duration = incremental_evaluate(sess, model, minibatch, FLAGS.batch_size, test=True)
with open(log_dir() + "test_stats.txt", "w") as fp:
fp.write("loss={:.5f} f1_micro={:.5f} f1_macro={:.5f}".
format(val_cost, val_f1_mic, val_f1_mac))
def main(argv=None):
print("Loading training data..")
train_data = load_data(FLAGS.train_prefix)
print("Done loading training data..")
train(train_data)
if __name__ == '__main__':
tf.app.run()

View File

@ -0,0 +1,363 @@
from __future__ import division
from __future__ import print_function
import os
import time
import tensorflow as tf
import numpy as np
from graphsage.models import SampleAndAggregate, SAGEInfo, Node2VecModel
from graphsage.minibatch import EdgeMinibatchIterator
from graphsage.neigh_samplers import UniformNeighborSampler
from graphsage.utils import load_data
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
# Set random seed
seed = 123
np.random.seed(seed)
tf.set_random_seed(seed)
# Settings
flags = tf.app.flags
FLAGS = flags.FLAGS
tf.app.flags.DEFINE_boolean('log_device_placement', False,
"""Whether to log device placement.""")
#core params..
flags.DEFINE_string('model', 'graphsage', 'model names. See README for possible values.')
flags.DEFINE_float('learning_rate', 0.00001, 'initial learning rate.')
flags.DEFINE_string("model_size", "small", "Can be big or small; model specific def'ns")
flags.DEFINE_string('train_prefix', '', 'name of the object file that stores the training data. must be specified.')
# left to default values in main experiments
flags.DEFINE_integer('epochs', 1, 'number of epochs to train.')
flags.DEFINE_float('dropout', 0.0, 'dropout rate (1 - keep probability).')
flags.DEFINE_float('weight_decay', 0.0, 'weight for l2 loss on embedding matrix.')
flags.DEFINE_integer('max_degree', 100, 'maximum node degree.')
flags.DEFINE_integer('samples_1', 25, 'number of samples in layer 1')
flags.DEFINE_integer('samples_2', 10, 'number of users samples in layer 2')
flags.DEFINE_integer('dim_1', 128, 'Size of output dim (final is 2x this, if using concat)')
flags.DEFINE_integer('dim_2', 128, 'Size of output dim (final is 2x this, if using concat)')
flags.DEFINE_boolean('random_context', True, 'Whether to use random context or direct edges')
flags.DEFINE_integer('neg_sample_size', 20, 'number of negative samples')
flags.DEFINE_integer('batch_size', 512, 'minibatch size.')
flags.DEFINE_integer('n2v_test_epochs', 1, 'Number of new SGD epochs for n2v.')
#logging, saving, validation settings etc.
flags.DEFINE_boolean('save_embeddings', True, 'whether to save embeddings for all nodes after training')
flags.DEFINE_string('base_log_dir', '.', 'base directory for logging and saving embeddings')
flags.DEFINE_integer('validate_iter', 5000, "how often to run a validation minibatch.")
flags.DEFINE_integer('validate_batch_size', 256, "how many nodes per validation sample.")
flags.DEFINE_integer('gpu', 1, "which gpu to use.")
flags.DEFINE_integer('print_every', 50, "How often to print training info.")
flags.DEFINE_integer('max_total_steps', 10**10, "Maximum total number of iterations")
os.environ["CUDA_VISIBLE_DEVICES"]=str(FLAGS.gpu)
GPU_MEM_FRACTION = 0.8
def log_dir():
log_dir = FLAGS.base_log_dir + "/unsup-" + FLAGS.train_prefix.split("/")[-2]
log_dir += "/{model:s}_{model_size:s}_{lr:0.6f}/".format(
model=FLAGS.model,
model_size=FLAGS.model_size,
lr=FLAGS.learning_rate)
if not os.path.exists(log_dir):
os.makedirs(log_dir)
return log_dir
# Define model evaluation function
def evaluate(sess, model, minibatch_iter, size=None):
t_test = time.time()
feed_dict_val = minibatch_iter.val_feed_dict(size)
outs_val = sess.run([model.loss, model.ranks, model.mrr],
feed_dict=feed_dict_val)
return outs_val[0], outs_val[1], outs_val[2], (time.time() - t_test)
def incremental_evaluate(sess, model, minibatch_iter, size):
t_test = time.time()
finished = False
val_losses = []
val_mrrs = []
iter_num = 0
while not finished:
feed_dict_val, finished, _ = minibatch_iter.incremental_val_feed_dict(size, iter_num)
iter_num += 1
outs_val = sess.run([model.loss, model.ranks, model.mrr],
feed_dict=feed_dict_val)
val_losses.append(outs_val[0])
val_mrrs.append(outs_val[2])
return np.mean(val_losses), np.mean(val_mrrs), (time.time() - t_test)
def save_val_embeddings(sess, model, minibatch_iter, size, out_dir, mod=""):
val_embeddings = []
finished = False
seen = set([])
nodes = []
iter_num = 0
name = "val"
while not finished:
feed_dict_val, finished, edges = minibatch_iter.incremental_embed_feed_dict(size, iter_num)
iter_num += 1
outs_val = sess.run([model.loss, model.mrr, model.outputs1],
feed_dict=feed_dict_val)
#ONLY SAVE FOR embeds1 because of planetoid
for i, edge in enumerate(edges):
if not edge[0] in seen:
val_embeddings.append(outs_val[-1][i,:])
nodes.append(edge[0])
seen.add(edge[0])
if not os.path.exists(out_dir):
os.makedirs(out_dir)
val_embeddings = np.vstack(val_embeddings)
np.save(out_dir + name + mod + ".npy", val_embeddings)
with open(out_dir + name + mod + ".txt", "w") as fp:
fp.write("\n".join(map(str,nodes)))
def construct_placeholders(feature_size):
# Define placeholders
placeholders = {
'batch1' : tf.placeholder(tf.int32, shape=(None), name='batch1'),
'batch2' : tf.placeholder(tf.int32, shape=(None), name='batch2'),
# negative samples for all nodes in the batch
'neg_samples': tf.placeholder(tf.int32, shape=(None,),
name='neg_sample_size'),
'dropout': tf.placeholder_with_default(0., shape=(), name='dropout'),
'batch_size' : tf.placeholder(tf.int32, name='batch_size'),
}
return placeholders
def train(train_data, test_data=None):
G = train_data[0]
features = train_data[1]
id_map = train_data[2]
# pad with dummy zero vector
features = np.vstack([features, np.zeros((features.shape[1],))])
feature_size = features.shape[1]
context_pairs = train_data[3] if FLAGS.random_context else None
placeholders = construct_placeholders(feature_size)
minibatch = EdgeMinibatchIterator(G,
id_map,
placeholders, batch_size=FLAGS.batch_size,
max_degree=FLAGS.max_degree,
num_neg_samples=FLAGS.neg_sample_size,
context_pairs = context_pairs)
adj_info = tf.Variable(tf.constant(minibatch.adj, dtype=tf.int32), trainable=False, name="adj_info")
if FLAGS.model == 'graphsage_mean':
# Create model
sampler = UniformNeighborSampler(adj_info)
layer_infos = [SAGEInfo("node", sampler, FLAGS.samples_1, FLAGS.dim_1),
SAGEInfo("node", sampler, FLAGS.samples_2, FLAGS.dim_2)]
model = SampleAndAggregate(placeholders,
features,
adj_info,
minibatch.deg,
layer_infos=layer_infos,
model_size=FLAGS.model_size,
logging=True)
elif FLAGS.model == 'gcn':
# Create model
sampler = UniformNeighborSampler(adj_info)
layer_infos = [SAGEInfo("node", sampler, FLAGS.samples_1, 2*FLAGS.dim_1),
SAGEInfo("node", sampler, FLAGS.samples_2, 2*FLAGS.dim_2)]
model = SampleAndAggregate(placeholders,
features,
adj_info,
minibatch.deg,
layer_infos=layer_infos,
aggregator_type="gcn",
model_size=FLAGS.model_size,
concat=False,
logging=True)
elif FLAGS.model == 'graphsage_seq':
sampler = UniformNeighborSampler(adj_info)
layer_infos = [SAGEInfo("node", sampler, FLAGS.samples_1, FLAGS.dim_1),
SAGEInfo("node", sampler, FLAGS.samples_2, FLAGS.dim_2)]
model = SampleAndAggregate(placeholders,
features,
adj_info,
minibatch.deg,
layer_infos=layer_infos,
aggregator_type="seq",
model_size=FLAGS.model_size,
logging=True)
elif FLAGS.model == 'graphsage_pool':
sampler = UniformNeighborSampler(adj_info)
layer_infos = [SAGEInfo("node", sampler, FLAGS.samples_1, FLAGS.dim_1),
SAGEInfo("node", sampler, FLAGS.samples_2, FLAGS.dim_2)]
model = SampleAndAggregate(placeholders,
features,
adj_info,
minibatch.deg,
layer_infos=layer_infos,
aggregator_type="pool",
model_size=FLAGS.model_size,
logging=True)
elif FLAGS.model == 'n2v':
model = Node2VecModel(placeholders, features.shape[0],
minibatch.deg,
#2x because graphsage uses concat
nodevec_dim=2*FLAGS.dim_1,
lr=FLAGS.learning_rate)
else:
raise Exception('Error: model name unrecognized.')
config = tf.ConfigProto(log_device_placement=FLAGS.log_device_placement)
config.gpu_options.allow_growth = True
#config.gpu_options.per_process_gpu_memory_fraction = GPU_MEM_FRACTION
config.allow_soft_placement = True
# Initialize session
sess = tf.Session(config=config)
merged = tf.summary.merge_all()
summary_writer = tf.summary.FileWriter(log_dir(), sess.graph)
# Init variables
sess.run(tf.global_variables_initializer())
# Train model
train_shadow_mrr = None
shadow_mrr = None
total_steps = 0
avg_time = 0.0
epoch_val_costs = []
train_adj_info = tf.assign(adj_info, minibatch.adj)
val_adj_info = tf.assign(adj_info, minibatch.test_adj)
for epoch in range(FLAGS.epochs):
minibatch.shuffle()
iter = 0
print('Epoch: %04d' % (epoch + 1))
epoch_val_costs.append(0)
while not minibatch.end():
# Construct feed dictionary
#feed_dict = construct_minibatch_feed_dict(features, G, y_train, train_mask, placeholders)
feed_dict = minibatch.next_minibatch_feed_dict()
feed_dict.update({placeholders['dropout']: FLAGS.dropout})
t = time.time()
# Training step
outs = sess.run([merged, model.opt_op, model.loss, model.ranks, model.aff_all,
model.mrr, model.outputs1], feed_dict=feed_dict)
train_cost = outs[2]
train_mrr = outs[5]
if train_shadow_mrr is None:
train_shadow_mrr = train_mrr#
else:
train_shadow_mrr -= (1-0.99) * (train_shadow_mrr - train_mrr)
if iter % FLAGS.validate_iter == 0:
# Validation
sess.run(val_adj_info.op)
val_cost, ranks, val_mrr, duration = evaluate(sess, model, minibatch, size=FLAGS.validate_batch_size)
sess.run(train_adj_info.op)
epoch_val_costs[-1] += val_cost
if shadow_mrr is None:
shadow_mrr = val_mrr
else:
shadow_mrr -= (1-0.99) * (shadow_mrr - val_mrr)
if total_steps % FLAGS.print_every == 0:
summary_writer.add_summary(outs[0], total_steps)
# Print results
avg_time = (avg_time * total_steps + time.time() - t) / (total_steps + 1)
if total_steps % FLAGS.print_every == 0:
print("Iter:", '%04d' % iter,
"train_loss=", "{:.5f}".format(train_cost),
"train_mrr=", "{:.5f}".format(train_mrr),
"train_mrr_ema=", "{:.5f}".format(train_shadow_mrr), # exponential moving average
"val_loss=", "{:.5f}".format(val_cost),
"val_mrr=", "{:.5f}".format(val_mrr),
"val_mrr_ema=", "{:.5f}".format(shadow_mrr), # exponential moving average
"time=", "{:.5f}".format(avg_time))
iter += 1
total_steps += 1
if total_steps > FLAGS.max_total_steps:
break
if total_steps > FLAGS.max_total_steps:
break
print("Optimization Finished!")
if FLAGS.save_embeddings:
sess.run(val_adj_info.op)
save_val_embeddings(sess, model, minibatch, FLAGS.validate_batch_size, log_dir())
if FLAGS.model == "n2v":
# stopping the gradient for the already trained nodes
train_ids = tf.constant([[id_map[n]] for n in G.nodes_iter() if not G.node[n]['val'] and not G.node[n]['test']],
dtype=tf.int32)
test_ids = tf.constant([[id_map[n]] for n in G.nodes_iter() if G.node[n]['val'] or G.node[n]['test']],
dtype=tf.int32)
update_nodes = tf.nn.embedding_lookup(model.context_embeds, tf.squeeze(test_ids))
no_update_nodes = tf.nn.embedding_lookup(model.context_embeds,tf.squeeze(train_ids))
update_nodes = tf.scatter_nd(test_ids, update_nodes, tf.shape(model.context_embeds))
no_update_nodes = tf.stop_gradient(tf.scatter_nd(train_ids, no_update_nodes, tf.shape(model.context_embeds)))
model.context_embeds = update_nodes + no_update_nodes
sess.run(model.context_embeds)
# run random walks
from graphsage.utils import run_random_walks
nodes = [n for n in G.nodes_iter() if G.node[n]["val"] or G.node[n]["test"]]
start_time = time.time()
pairs = run_random_walks(G, nodes, num_walks=50)
walk_time = time.time() - start_time
test_minibatch = EdgeMinibatchIterator(G,
id_map,
placeholders, batch_size=FLAGS.batch_size,
max_degree=FLAGS.max_degree,
num_neg_samples=FLAGS.neg_sample_size,
context_pairs = pairs,
n2v_retrain=True,
fixed_n2v=True)
start_time = time.time()
print("Doing test training for n2v.")
test_steps = 0
for epoch in range(FLAGS.n2v_test_epochs):
test_minibatch.shuffle()
while not test_minibatch.end():
feed_dict = test_minibatch.next_minibatch_feed_dict()
feed_dict.update({placeholders['dropout']: FLAGS.dropout})
outs = sess.run([model.opt_op, model.loss, model.ranks, model.aff_all,
model.mrr, model.outputs1], feed_dict=feed_dict)
if test_steps % FLAGS.print_every == 0:
print("Iter:", '%04d' % test_steps,
"train_loss=", "{:.5f}".format(outs[1]),
"train_mrr=", "{:.5f}".format(outs[-2]))
test_steps += 1
train_time = time.time() - start_time
save_val_embeddings(sess, model, minibatch, FLAGS.validate_batch_size, log_dir(), mod="-test")
print("Total time: ", train_time+walk_time)
print("Walk time: ", walk_time)
print("Train time: ", train_time)
def main(argv=None):
print("Loading training data..")
train_data = load_data(FLAGS.train_prefix)
print("Done loading training data..")
train(train_data)
if __name__ == '__main__':
tf.app.run()

70
graphsage/utils.py Normal file
View File

@ -0,0 +1,70 @@
import numpy as np
import random
import json
from networkx.readwrite import json_graph
WALK_LEN=5
N_WALKS=50
def load_data(prefix, normalize=True):
G_data = json.load(open(prefix + "-G.json"))
G = json_graph.node_link_graph(G_data)
if isinstance(G.nodes()[0], int):
conversion = lambda n : int(n)
else:
conversion = lambda n : n
feats = np.load(prefix + "-feats.npy")
id_map = json.load(open(prefix + "-id_map.json"))
id_map = {conversion(k):int(v) for k,v in id_map.iteritems()}
walks = []
class_map = json.load(open(prefix + "-class_map.json"))
if isinstance(class_map.values()[0], list):
lab_conversion = lambda n : n
else:
lab_conversion = lambda n : int(n)
class_map = {conversion(k):lab_conversion(v) for k,v in class_map.iteritems()}
## Make sure the graph has edge train_removed annotations
## (some datasets might already have this..)
print("Loaded data.. now preprocessing..")
for edge in G.edges_iter():
if (G.node[edge[0]]['val'] or G.node[edge[1]]['val'] or
G.node[edge[0]]['test'] or G.node[edge[1]]['test']):
G[edge[0]][edge[1]]['train_removed'] = True
else:
G[edge[0]][edge[1]]['train_removed'] = False
if normalize:
from sklearn.preprocessing import StandardScaler
train_ids = np.array([id_map[n] for n in G.nodes() if not G.node[n]['val'] and not G.node[n]['test']])
train_feats = feats[train_ids]
scaler = StandardScaler()
scaler.fit(train_feats)
feats = scaler.transform(feats)
with open(prefix + "-walks.txt") as fp:
for line in fp:
walks.append(map(conversion, line.split()))
return G, feats, id_map, walks, class_map
def run_random_walks(G, nodes, num_walks=N_WALKS):
print("Subgraph for walks is of size", len(G))
pairs = []
for count, node in enumerate(nodes):
if G.degree(node) == 0:
continue
for i in range(num_walks):
curr_node = node
for j in range(WALK_LEN):
next_node = random.choice(G.neighbors(curr_node))
# self co-occurrences are useless
if curr_node != node:
pairs.append((node,curr_node))
curr_node = next_node
if count % 1000 == 0:
print(count)
return pairs