TensorFlow Tutorial for Data Scientist – Part 2

TensorFlow Neural Networks

Create a new Jupyter notebook with a Python 2.7 kernel. Name it TensorFlow Neural Networks. Let's import all the required modules.

%pylab inline

import os
import numpy as np
import pandas as pd
from scipy.misc import imread  # removed in SciPy >= 1.2; use imageio.imread on newer installs
from sklearn.metrics import accuracy_score
import tensorflow as tf

Set a seed value, so that we can control our model's randomness

# To stop potential randomness
seed = 128
rng = np.random.RandomState(seed)

Set directory paths

root_dir = os.path.abspath('../')
data_dir = os.path.join(root_dir, 'tensorflow-tutorial/data')
sub_dir = os.path.join(root_dir, 'tensorflow-tutorial/sub')

# check for existence (in a notebook only the last expression is echoed,
# so wrap each call in print to see all three)
os.path.exists(root_dir)
os.path.exists(data_dir)
os.path.exists(sub_dir)

Read the datasets. These are in .csv format and contain a filename along with the appropriate label.

train = pd.read_csv(os.path.join(data_dir, 'Train', 'train.csv'))
test = pd.read_csv(os.path.join(data_dir, 'Test.csv'))

train.head()

Read a sample image and display it

img_name = rng.choice(train.filename)
filepath = os.path.join(data_dir, 'Train', 'Images', 'train', img_name)

img = imread(filepath, flatten=True)

pylab.imshow(img, cmap='gray')
pylab.axis('off')
pylab.show()

Show the image as a numpy array

img

Store all our images as numpy arrays

temp = []
for img_name in train.filename:
    image_path = os.path.join(data_dir, 'Train', 'Images', 'train', img_name)
    img = imread(image_path, flatten=True)
    img = img.astype('float32')
    temp.append(img)
    
train_x = np.stack(temp)

temp = []
for img_name in test.filename:
    image_path = os.path.join(data_dir, 'Train', 'Images', 'test', img_name)
    img = imread(image_path, flatten=True)
    img = img.astype('float32')
    temp.append(img)

test_x = np.stack(temp)

Split the data 70:30 into a training set and a validation set

split_size = int(train_x.shape[0]*0.7)

train_x, val_x = train_x[:split_size], train_x[split_size:]
train_y, val_y = train.label.values[:split_size], train.label.values[split_size:]
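
A quick shape check (a hypothetical snippet, not part of the original tutorial) confirms the split. For example, with 49,000 training images, as this dataset's filenames suggest, you would see roughly:

print train_x.shape, val_x.shape
# (34300, 28, 28) (14700, 28, 28)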

Define some helper functions

def dense_to_one_hot(labels_dense, num_classes=10):
    """Convert class labels from scalars to one-hot vectors"""
    num_labels = labels_dense.shape[0]
    index_offset = np.arange(num_labels) * num_classes
    labels_one_hot = np.zeros((num_labels, num_classes))
    labels_one_hot.flat[index_offset + labels_dense.ravel()] = 1
    
    return labels_one_hot

def preproc(unclean_batch_x):
    """Convert values to range 0-1"""
    temp_batch = unclean_batch_x / unclean_batch_x.max()
    
    return temp_batch

def batch_creator(batch_size, dataset_length, dataset_name):
    """Create batch with random samples and return appropriate format"""
    batch_mask = rng.choice(dataset_length, batch_size)
    
    # look up train_x / test_x by name and flatten each image to a vector
    batch_x = eval(dataset_name + '_x')[batch_mask].reshape(-1, input_num_units)
    batch_x = preproc(batch_x)
    
    batch_y = None
    if dataset_name == 'train':
        batch_y = eval(dataset_name).loc[batch_mask, 'label'].values
        batch_y = dense_to_one_hot(batch_y)
        
    return batch_x, batch_y
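
As a quick sanity check (a hypothetical snippet, not part of the original pipeline), dense_to_one_hot turns integer labels into rows of a one-hot matrix:

print dense_to_one_hot(np.array([0, 3, 9]))
# [[ 1.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
#  [ 0.  0.  0.  1.  0.  0.  0.  0.  0.  0.]
#  [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  1.]]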

Define a neural network architecture with three layers: input, hidden, and output. The numbers of neurons in the input and output layers are fixed: the input is our 28 x 28 image (784 values) and the output is a 10 x 1 vector representing the digit classes. We take 500 neurons in the hidden layer; this number can vary according to your needs. We also assign values to the remaining variables.

### set all variables

# number of neurons in each layer
input_num_units = 28*28
hidden_num_units = 500
output_num_units = 10

# define placeholders
x = tf.placeholder(tf.float32, [None, input_num_units])
y = tf.placeholder(tf.float32, [None, output_num_units])

# set remaining variables
epochs = 5
batch_size = 128
learning_rate = 0.01

### define weights and biases of the neural network (refer to an introduction to neural networks if any of these terms are unfamiliar)

weights = {
    'hidden': tf.Variable(tf.random_normal([input_num_units, hidden_num_units], seed=seed)),
    'output': tf.Variable(tf.random_normal([hidden_num_units, output_num_units], seed=seed))
}

biases = {
    'hidden': tf.Variable(tf.random_normal([hidden_num_units], seed=seed)),
    'output': tf.Variable(tf.random_normal([output_num_units], seed=seed))
}

Create the neural network's computational graph

hidden_layer = tf.add(tf.matmul(x, weights['hidden']), biases['hidden'])
hidden_layer = tf.nn.relu(hidden_layer)

output_layer = tf.matmul(hidden_layer, weights['output']) + biases['output']

Define the cost function of our neural network

cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits = output_layer, labels=y))
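
If this op feels like a black box, here is a minimal numpy sketch of what it computes for a single example. This is an illustration only, not TensorFlow's actual implementation:

import numpy as np

def softmax_cross_entropy(logits, one_hot_label):
    # softmax: exponentiate and normalize (shift by the max for numerical stability)
    exps = np.exp(logits - logits.max())
    probs = exps / exps.sum()
    # cross-entropy: negative log-probability assigned to the true class
    return -np.sum(one_hot_label * np.log(probs))

logits = np.array([2.0, 1.0, 0.1])
label = np.array([1.0, 0.0, 0.0])  # true class is index 0
print softmax_cross_entropy(logits, label)  # ~0.417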

Set the optimizer, i.e. our backpropagation algorithm. Here we use Adam, an adaptive variant of gradient descent.

optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cost)

Initialize all the variables

init = tf.global_variables_initializer()

Create a session, and run the neural network in the session. Then validate the model's accuracy on the validation set we just created.

with tf.Session() as sess:
    # create initialized variables
    sess.run(init)
    
    ### for each epoch, do:
    ###   for each batch, do:
    ###     create pre-processed batch
    ###     run optimizer by feeding batch
    ###     find cost and reiterate to minimize
    
    for epoch in range(epochs):
        avg_cost = 0
        total_batch = int(train.shape[0]/batch_size)
        for i in range(total_batch):
            batch_x, batch_y = batch_creator(batch_size, train_x.shape[0], 'train')
            _, c = sess.run([optimizer, cost], feed_dict = {x: batch_x, y: batch_y})
            
            avg_cost += c / total_batch
            
        print "Epoch:", (epoch+1), "cost =", "{:.5f}".format(avg_cost)
    
    print "\nTraining complete!"
    
    
    # find predictions on val set
    pred_temp = tf.equal(tf.argmax(output_layer, 1), tf.argmax(y, 1))
    accuracy = tf.reduce_mean(tf.cast(pred_temp, "float"))
    print "Validation Accuracy:", accuracy.eval({x: val_x.reshape(-1, input_num_units), y: dense_to_one_hot(val_y)})
    
    predict = tf.argmax(output_layer, 1)
    pred = predict.eval({x: test_x.reshape(-1, input_num_units)})

Test the model and visualize its predictions

img_name = rng.choice(test.filename)
filepath = os.path.join(data_dir, 'Train', 'Images', 'test', img_name)

img = imread(filepath, flatten=True)

# test image filenames start at 49000.png, so subtract the offset
# to index into the pred array
test_index = int(img_name.split('.')[0]) - 49000
print "Prediction is: ", pred[test_index]

pylab.imshow(img, cmap='gray')
pylab.axis('off')
pylab.show()

Continue to part 3, http://intellij.my/2017/08/07/tensorflow-tutorial-for-data-scientist-part-3/.

TensorFlow Tutorial for Data Scientist – Part 1

Setup environment

Install Python 2.7.x, https://www.python.org/downloads/. Then install TensorFlow using pip, https://www.tensorflow.org/install/.

pip install tensorflow

Install Jupyter via pip.

pip install jupyter

Create a new folder called tensorflow-tutorial and cd into it in a terminal, then run the jupyter notebook command, as shown below.
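
mkdir tensorflow-tutorial
cd tensorflow-tutorial
jupyter notebook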

Useful TensorFlow operators

The official documentation carefully lays out all available math ops: https://www.tensorflow.org/api_docs/python/math_ops.html.

Some specific examples of commonly used operators include:

tf.add(x, y): adds two tensors of the same type, x + y
tf.subtract(x, y): subtracts tensors of the same type, x - y (tf.sub in TensorFlow releases before 1.0)
tf.multiply(x, y): multiplies two tensors element-wise (tf.mul before 1.0)
tf.pow(x, y): takes the element-wise power of x to y
tf.exp(x): equivalent to pow(e, x), where e is Euler's number (2.718...)
tf.sqrt(x): equivalent to pow(x, 0.5)
tf.div(x, y): takes the element-wise division of x and y
tf.truediv(x, y): same as tf.div, except it casts the arguments to float
tf.floordiv(x, y): same as truediv, except it rounds the result down to an integer
tf.mod(x, y): takes the element-wise remainder from division
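
A quick way to try a few of these is with constant tensors. This is a minimal sketch using the TensorFlow 1.0+ op names:

import tensorflow as tf

x = tf.constant([10, 7])
y = tf.constant([3, 2])

with tf.Session() as sess:
    print sess.run(tf.add(x, y))       # [13  9]
    print sess.run(tf.subtract(x, y))  # [7 5]
    print sess.run(tf.multiply(x, y))  # [30 14]
    print sess.run(tf.floordiv(x, y))  # [3 3]
    print sess.run(tf.mod(x, y))       # [1 1]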

Create a new Jupyter notebook with a Python 2.7 kernel. Name it TensorFlow Operators. Let's write a small program to add two numbers.

# import tensorflow
import tensorflow as tf

# build computational graph
a = tf.placeholder(tf.int16)
b = tf.placeholder(tf.int16)

addition = tf.add(a, b)

# initialize variables (this graph has none, but running the
# initializer is a good habit)
init = tf.global_variables_initializer()

# create session and run the graph; the with block closes the
# session automatically, so no explicit sess.close() is needed
with tf.Session() as sess:
    sess.run(init)
    print "Addition: %i" % sess.run(addition, feed_dict={a: 2, b: 3})

Exercise: Try all these operations and check the output: tf.add(x, y), tf.subtract(x, y), tf.multiply(x, y), tf.pow(x, y), tf.sqrt(x), tf.div(x, y) and tf.mod(x, y).

Continue to part 2, http://intellij.my/2017/08/07/tensorflow-tutorial-for-data-scientist-part-2.

Space Cluster – Search tool for data researchers on NASA datasets

Space Cluster is a search tool that helps data researchers use data more efficiently. The problem: it is hard to find relevant datasets on data.nasa.gov, and making logical connections between datasets is a challenge. Space Cluster is an AI-powered search engine that helps data researchers find relevant data. It maintains data integrity between datasets using keyword frequency and provides interactive visualization of keywords and datasets. The technology: an LDA model (clustering) + D3.js (visualization).


Demo: https://spacecluster.herokuapp.com/