# A not so basic neural network on python

My previous post on implementing a basic Neural Network on python got a lot of attention staying one whole day on HN front page. I was very happy about that but more about the feedback I got. The community gave me a lot of tips and tricks on how to improve. So now I am presenting an improved version which supports multiple hidden layers, more optimization options using minibatches and a more maintainable/understandable code (or so I believe).

I kept reading a lot about Neural Networks mainly by watching some videos from Geoffrey's Hinton Neural Network course on coursera. Also reading more on deeplearning.net and trying to read some papers. That last task was definitely the hardest one because of the complexity of the papers. If I cannot even read them I can just imagine how hard is to write them.

About the Neural networks course I didn't like it as much as the Machine Learning course. The main reason is that I had to watch most videos 3 or 4 times before understanding (something). This is definitely my fault because the material is great but it focuses more on the theory (math) which I am not that good at and not on the implementation which I am more interested. But I take it as a learning experience and I will try to finish those videos.

Some people criticized (besides my orthography, sorry about my spanglish :P) that I didn't use any cross-validation and that the iris and digits examples are way to simple datasets. I agree that both things are necessary but I am not trying to make this post as a scientific paper with complete benchmarks of different optimization algorithms I leave that to the researchers who have spent years studying neural networks. But in on this case I am using the MNIST dataset to compare.

My objective was to learn about them and to have some personal implementation of the algorithm, just to say that I have done it :P

## Implementation¶

I was very tempted to use theano and but at the end I decided to do a pure numpy implementation.

It is defenitly possible (and not that hard neither that easy) to create a theano implementation and I took some of the ideas from their site such as creating a class for each layer because it makes easier to combine and create different implementations and makes the code more mantainable. But other than the optimization I still cannot see the value of using theano, of course the speed up is important but I thing I prefer the Numba approach.

The implementation is very similar to the previous post, the differences are:

- Support for multiple hidden layers. I still train the network using a big 1d-array of weights because that allows to use the scipy optimizations.
- A more "smart" random initialization for the arrays.
- Support for mini-batch optimization, which is a must have with bigger datasets.
- Created a Layer class which weights are mapped to slices of the big weight array that is optimized so they are sharing that part of memory. That allows to a more maintainable code and probably to include pre-training in a future.

One feature that I think is nice is that when fitting one can actually interrupt (not restart) the ipython kernel and the last weights are going to be saved in the class (`NN.coef_`

) so one doesn't have to start the training from nothing, also I added a callback on each epoch so one can for example see the score in a validation dataset.

### Basic classes and functions¶

```
import math
import numpy as np
from scipy import optimize
```

```
class Layer(object):
def __init__(self, n_in=None, n_out=None, W=None, random_state=None, activation=None):
if random_state is None:
rnd = np.random.RandomState()
else:
rnd = random_state
if W is None:
self.W = rnd.uniform(size=(n_out, n_in + 1))
else:
self.W = W
self.activation = activation
def output(self, input):
data = np.insert(input, 0, 1, axis=1)
linear_output = np.dot(data, self.W)
return self.activation(linear_output)
```

```
def sigmoid(z):
return np.divide(1., (1 + np.exp(-z)))
```

```
class SigmoidLayer(Layer):
def __init__(self, n_in=None, n_out=None, W=None, random_state=None):
Layer.__init__(self, n_in, n_out, W, random_state, activation=sigmoid)
```

### Cost function and gradient¶

```
def unpack_weigths(weights, weights_meta):
start_pos = 0
for layer in weights_meta:
end_pos = start_pos + layer[0] * (layer[1])
yield weights[start_pos:end_pos].reshape((layer[0], layer[1]))
start_pos = end_pos
```

```
def cost(weights, X, y, weights_meta, num_labels):
# Forward
act_prev = np.insert(X, 0, 1, axis=1)
for weight in unpack_weigths(weights, weights_meta):
z = np.dot(act_prev, weight)
activation = sigmoid(z)
act_prev = np.insert(activation, 0, 1, axis=1)
Y = np.eye(num_labels)[y]
h = activation
costPositive = -Y * np.log(h)
costNegative = (1 - Y) * np.log(1 - h)
J = np.sum(costPositive - costNegative) / X.shape[0]
return J
```

```
def unpack_weigths_inv(weights, weights_meta):
end_pos = len(weights)
for layer in reversed(weights_meta):
start_pos = end_pos - layer[0] * (layer[1])
yield weights[start_pos:end_pos].reshape((layer[0], layer[1]))
end_pos = start_pos
```

```
def cost_prime(weights, X, y, weights_meta, num_labels):
Y = np.eye(num_labels)[y]
Deltas = [np.zeros(shape) for shape in weights_meta]
data = np.insert(X, 0, 1, axis=1)
for i, row in enumerate(data):
# Forward
#row = np.array([row])
act_prev = row
activations = (act_prev, )
for weight in unpack_weigths(weights, weights_meta):
z = np.dot(act_prev, weight)
activation = sigmoid(z)
act_prev = np.append(1, activation)
activations = activations + (act_prev, )
# Backprop
prev_delta = activations[-1][1:] - Y[i, :].T # last delta
deltas = (prev_delta, ) # deltas[0] == delta2
for act, weight in zip(reversed(activations[1:-1]), unpack_weigths_inv(weights, weights_meta)):
delta = np.dot(weight, prev_delta)[1:] * (act[1:] * (1 - act[1:])).T
deltas = (delta, ) + deltas
prev_delta = delta
# Accumulate errors
for delta, act, i in zip(deltas, activations[:-1], range(len(Deltas))):
Deltas[i] = Deltas[i] + np.dot(delta[np.newaxis].T, act[np.newaxis]).T
for i in range(len(Deltas)):
Deltas[i] = Deltas[i] / X.shape[0]
return np.concatenate(tuple([D.reshape(-1) for D in Deltas]))
```

### Optimization¶

This class is simulating a `MinibatchOpti`

module.

```
class MinibatchOpti(object):
@staticmethod
def minibatches(X, y=None, batch_size=50, random_state=None):
if random_state is None:
rnd = np.random.RandomState()
elif isinstance(random_state, int):
rnd = np.random.RandomState(random_state)
else:
rnd = random_state
m = X.shape[0]
batch_size = batch_size if batch_size >= 1 else int(math.floor(m * batch_size))
max_batchs = int(math.floor(m / batch_size))
while True:
random_indices = rnd.choice(np.arange(m), m, replace=False)
for i in range(max_batchs):
batch_indices = np.arange(i * batch_size, (i + 1) * batch_size)
indices = random_indices[batch_indices]
if y is None:
yield X[indices]
else:
yield X[indices], y[indices]
@staticmethod
def GD(fun, weights, jac, X, y, options, args=()):
weights -= options['learning_rate'] * jac(weights, X, y, *args)
options['learning_rate'] = options['learning_rate'] * options['learning_rate_decay']
@staticmethod
def GD_momentum(fun, weights, jac, X, y, options, args=()):
bigjump = options['momentum'] * options['step']
weights -= bigjump
correction = options['learning_rate'] * jac(weights, X, y, *args)
weights -= correction
options['step'] = bigjump + correction
options['learning_rate'] = options['learning_rate'] * options['learning_rate_decay']
options['momentum'] = options['momemtum_decay'] * options['momentum']
@staticmethod
def RMSPROP(fun, weights, jac, X, y, options, args=()):
gradient = jac(weights, X, y, *args)
options['moving_mean_squared'] = options['decay'] * options['moving_mean_squared'] \
+ (1 - options['decay']) * gradient ** 2
weights -= gradient / np.sqrt(options['moving_mean_squared'] + 1e-8)
@staticmethod
def CG(fun, weights, jac, X, y, options, args=()):
ans = optimize.minimize(fun, weights, jac=jac, method='CG', args=(X, y) + args, options={'maxiter': options['mb_maxiter']})
weights[:] = ans.x
@staticmethod
def LBFGSB(fun, weights, jac, X, y, options, args=()):
ans = optimize.minimize(fun, weights, jac=jac, method='L-BFGS-B', args=(X, y) + args, options={'maxiter': options['mb_maxiter']})
weights[:] = ans.x
@staticmethod
def minimize(fun, weights, jac, X, y, method, batch_size=50, tol=1e-6, maxiter=100, args=None,
verbose=False, options=None, random_state=None, callback=None):
if method == 'GD':
assert 'learning_rate' in options, 'GD needs a learning rate'
if 'learning_rate_decay' not in options:
options['learning_rate_decay'] = 1
if 'momentum' in options:
if 'momemtum_decay' not in options:
options['momemtum_decay'] = 1
options['step'] = 0
update = MinibatchOpti.GD_momentum
else:
update = MinibatchOpti.GD
elif method == 'RMSPROP':
options['moving_mean_squared'] = 1
update = MinibatchOpti.RMSPROP
elif method == 'CG':
update = MinibatchOpti.CG
elif method == 'L-BFGS-B':
update = MinibatchOpti.LBFGSB
else:
raise Exception('Optimization method not found')
i = 1
prev_cost = 1e8
for _X, _y in MinibatchOpti.minibatches(X, y, batch_size, random_state=random_state):
update(fun, weights, jac, _X, _y, options, args=args)
new_cost = fun(weights, X, y, *args)
diff = new_cost - prev_cost
if np.abs(diff) < tol:
if verbose >= 1:
print 'Minimum tolerance reached in %i iterations' % i
break
if i >= maxiter:
if verbose >= 1 :
print 'Maximum number of iterations reached'
break
if verbose >= 2:
print i, newJ
if callback is not None:
stop = callback(i, weights)
if stop == True:
break
prev_cost = new_cost
i = i + 1
```

### The sklearn-stlye class¶

```
class NN(object):
def __init__(self, hidden_layers, coef0=None, random_state=None,
opti_method='GD', batch_size=50, maxiter=100, tol=1e-6, verbose=1,
opti_options=None, callback=None):
self.hidden_layers = hidden_layers
self.coef_ = None if coef0 is None else np.copy(coef0)
if random_state is None:
self.rnd = np.random.RandomState()
elif isinstance(random_state, int):
self.rnd = np.random.RandomState(random_state)
else:
self.rnd = random_state
self.opti_method = opti_method
self.batch_size = batch_size
self.verbose = verbose
self.maxiter = maxiter
self.tol = tol
self.opti_options = {} if opti_options is None else opti_options
self.callback = callback
def rand_init(self, weights_info, random_state):
w_sizes = []
for layer_info in weights_info:
w_sizes.append(layer_info[0] * layer_info[1])
ans = np.zeros(sum(w_sizes))
# "Smart" random initialization
start_pos = 0
for i, layer in enumerate(weights_info):
end_pos = start_pos + layer[0] * (layer[1])
gap = 4 * np.sqrt(6. / (layer[0] + layer[1]))
ans[start_pos:end_pos] = random_state.uniform(low=-gap, high=gap, size=w_sizes[i])
start_pos = end_pos
return ans
def predict_proba(self, X):
output = self.layers[0].output(X)
for layer in self.layers[1:]:
output = layer.output(output)
return output
def predict(self, X):
return self.predict_proba(X).argmax(1)
def fit(self, X, y):
layers = list(self.hidden_layers) # Copy
layers.insert(0, X.shape[1])
layers.insert(len(layers), len(np.unique(y)))
self.weights_info = [(layers[i] + 1, layers[i + 1]) for i in range(len(layers) - 1)]
self.opti_options = self.opti_options.copy()
if self.coef_ is None:
self.coef_ = self.rand_init(self.weights_info, self.rnd)
# Unpack the weights and assign them to the layers
self.layers = []
start_pos = 0
for w_info in self.weights_info:
end_pos = start_pos + w_info[0] * (w_info[1])
weight = self.coef_[start_pos:end_pos].reshape((w_info[0], w_info[1]))
self.layers.append(SigmoidLayer(W=weight))
start_pos = end_pos
args = (self.weights_info, len(np.unique(y)))
MinibatchOpti.minimize(cost, self.coef_, cost_prime, X, y, method=self.opti_method,
random_state=self.rnd, batch_size=self.batch_size, maxiter=self.maxiter,
tol=self.tol, args=args, verbose=self.verbose, options=self.opti_options,
callback=self.callback)
```

## MNINST¶

Lets see how it good it does the using the MINST dataset: 50k rows for training and 10k validations, 748 features.

For all the optimization algorithms I am going to use 100 iterations using a batch size of 100. Also I only timed the fitting once because it takes a long time and I am lazy, also the difference between iterations when the times are that long are not that significant.

My setup is: Macbook Pro 2.5 GHz Inter Core i5. 16 GB RAM.

```
import cPickle, gzip, numpy
f = gzip.open('mnist.pkl.gz', 'rb')
train_set, valid_set, test_set = cPickle.load(f)
f.close()
```

```
X_train, y_train = train_set[0], train_set[1]
X_valid, y_valid = valid_set[0], valid_set[1]
```

```
from sklearn.metrics import accuracy_score
```

### Gradient decent¶

First let's try using a simple gradient decent, the bad part is that one needs to input that pesky learning rate.

```
options = {}
options['learning_rate'] = 0.3
options['learning_rate_decay'] = 1
```

```
nn = NN([25], opti_method='GD', maxiter=100, opti_options=options, random_state=1234)
```

```
%timeit -n1 -r1 nn.fit(X_train, y_train)
```

```
accuracy_score(nn.predict(X_valid), y_valid)
```

### RMSPROP¶

To remove the learning rate an easy solution is to use RMSPROP.

```
options = {}
options['decay'] = 0.9
```

```
nn = NN([25], opti_method='RMSPROP', maxiter=100, opti_options=options, random_state=1234)
```

```
%timeit -n1 -r1 nn.fit(X_train, y_train)
```

```
accuracy_score(nn.predict(X_valid), y_valid)
```

### Conjugate gradient¶

Finally my personal favorite is just to use the scipy optimization methods, one has to be careful though because most are for full-batch optimization. But the Conjugate Gradient and L-BFGS-B works on mini-batches. Personally I prefer CG because gave me better results.

```
options = {}
options['mb_maxiter'] = 10
```

```
nn = NN([25], opti_method='CG', maxiter=100, opti_options=options, random_state=1234)
```

```
%timeit -n1 -r1 nn.fit(X_train, y_train)
```

```
accuracy_score(nn.predict(X_valid), y_valid)
```

### Comparison¶

Let's look at the results in a beautiful pandas DataFrame.

```
import pandas as pd
```

```
pd.DataFrame([[284, 0.8316], [297, 0.1863], [314, 0.8607]], index=['GD', 'RMSPROP', 'CG'], columns=['Time [s]', 'Accuracy'])
```

## Conclusions¶

Even though gradient decent did well I still prefer the scipy optimization methods, maybe because they are just more robust and don't have a learning rate. It takes more time but is not the end of the world. I don't know what is wrong with my RMSPROP implementation I looked at some implementations online and couldn't found the error, if someone sees anything suspicious let me know.

Some improvements I would love to do and ideas I want to test:

- Use Numba to speed up the process, in theory having the cost function, its gradient and optimization in different "well" defined functions it should be easy.
- Try more optimization methods or why not use an existing implementation because there are many python implementations of these and more algorithms online, just to name a few I found: python-rl and climin.
- Do pretraining using RBMs: I am just starting to read about them.
- Use minibatches to some point and then switch to bigger minibatches or even full batch optimization.
- People suggested me to use a lookup table for the sigmoid. I tried but the accuracy I got was really bad, I have to check that code again.

I have to admit that I cheated a little bit on the post. Because the neural network supports multiple hidden layers but I only used one!
**Why?**: Mainly because it takes a long time to train those networks and I am lazy. Second, I couldn't find a way to make those networks converge without using the scipy advanced optimization methods.

From what I understand the problem are the initial weights, and as a side note I spent almost 2 days trying to find a bug while training the MNIST dataset and it was that I started all the layers weights with similar values. The initial weights are **very** important.

I think I have learned enough about neural networks for my personal satisfaction, so this is probably the last iteration of my Neural network. I will definitely keep looking at the updates of NN research but not as almost fulltime as I been doing the past weeks. Is an imposible mission to catch up that amount of information in my free time. Having spent almost a month reading about NN my (obvious) conclusion is that researchers are just scratching the surface of what is possible, and the next years are going to be exciting.

If you are looking for a more complete implementation of a deep belif network there is free python (gpu) implementation by one person on Geoffrey Hinton group which I haven't look but seams promising and hopefully he will release more code.