TensorFlow Tutorial for Data Scientist – Part 4

TensorFlow Convolutional Neural Networks (CNN)


Create a new Jupyter notebook with a Python 2.7 kernel and name it TensorFlow CNN. In this tutorial we will train a simple image classifier for birds. Open your Chrome browser and install the Fatkun Batch Download Image extension. Google the keyword malabar pied hornbill, select Images, and click the Fatkun Batch Download Image icon at the top right. Choose This tab and a new window will appear.


Unselect any images that are not related to the malabar pied hornbill category, then click Save Image. Make sure there are at least 75 images to train on. Wait until all images finish downloading, then copy them into <your_working_space> > tf_files > birds > malabar pied hornbill. Repeat the same steps for each of the following categories (a small folder-check sketch follows the list).

sacred kingfisher
pied kingfisher
common hoopoe
layard's parakeet
owl
sparrow
brahminy kite
sparrowhawk
wallcreeper
bornean ground cuckoo
blue crowned hanging parrot
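
Before retraining, it can help to confirm that each category folder actually contains enough images. Below is a minimal folder-check sketch; the tf_files/birds path follows the layout described above, so adjust it to your own working space:

import os

# Path to the image folders created above; adjust to your working space.
image_dir = 'tf_files/birds'

for category in sorted(os.listdir(image_dir)):
    folder = os.path.join(image_dir, category)
    if os.path.isdir(folder):
        count = len([f for f in os.listdir(folder)
                     if f.lower().endswith(('.jpg', '.jpeg', '.png'))])
        print('%s: %d images' % (category, count))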

Download the retrain script (https://raw.githubusercontent.com/datomnurdin/tensorflow-python/master/retrain.py) to the current directory (<your_working_space>). Go to the terminal/command line and cd into the <your_working_space> directory. Run the following command to retrain on all the images. It takes around 30 minutes to finish.

python retrain.py \
  --bottleneck_dir=tf_files/bottlenecks \
  --model_dir=tf_files/inception \
  --output_graph=tf_files/retrained_graph.pb \
  --output_labels=tf_files/retrained_labels.txt \
  --image_dir <your_absolute_path>/<your_working_space>/tf_files/birds

Create a prediction script (detect.py) and load the generated model into it.

import tensorflow as tf
import sys

# change this as you see fit
image_path = sys.argv[1]

# Read in the image_data
image_data = tf.gfile.FastGFile(image_path, 'rb').read()

# Loads label file, strips off carriage return
label_lines = [line.rstrip() for line 
                   in tf.gfile.GFile("tf_files/retrained_labels.txt")]

# Unpersists graph from file
with tf.gfile.FastGFile("tf_files/retrained_graph.pb", 'rb') as f:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())
    _ = tf.import_graph_def(graph_def, name='')

with tf.Session() as sess:
    # Feed the image_data as input to the graph and get first prediction
    softmax_tensor = sess.graph.get_tensor_by_name('final_result:0')
    
    predictions = sess.run(softmax_tensor, \
             {'DecodeJpeg/contents:0': image_data})
    
    # Sort to show labels of first prediction in order of confidence
    top_k = predictions[0].argsort()[-len(predictions[0]):][::-1]
    
    for node_id in top_k:
        human_string = label_lines[node_id]
        score = predictions[0][node_id]
        print('%s (score = %.5f)' % (human_string, score))

Predict an image from the terminal/command line.

python detect.py test_image.png

Continue with part 5.

TensorFlow Tutorial for Data Scientist – Part 3

TensorFlow Deep Learning

Create a new Jupyter notebook with a Python 2.7 kernel and name it TensorFlow Deep Learning. Let's import all the required modules.

import tensorflow as tf
import tempfile
import pandas as pd
import urllib

Define the base feature columns that will be the building blocks used by both the wide part and the deep part of the model.

tf.logging.set_verbosity(tf.logging.ERROR)

# Categorical base columns.
gender = tf.contrib.layers.sparse_column_with_keys(column_name="gender", keys=["Female", "Male"])
race = tf.contrib.layers.sparse_column_with_keys(column_name="race", keys=[
  "Amer-Indian-Eskimo", "Asian-Pac-Islander", "Black", "Other", "White"])
education = tf.contrib.layers.sparse_column_with_hash_bucket("education", hash_bucket_size=1000)
relationship = tf.contrib.layers.sparse_column_with_hash_bucket("relationship", hash_bucket_size=100)
workclass = tf.contrib.layers.sparse_column_with_hash_bucket("workclass", hash_bucket_size=100)
occupation = tf.contrib.layers.sparse_column_with_hash_bucket("occupation", hash_bucket_size=1000)
native_country = tf.contrib.layers.sparse_column_with_hash_bucket("native_country", hash_bucket_size=1000)

# Continuous base columns.
age = tf.contrib.layers.real_valued_column("age")
age_buckets = tf.contrib.layers.bucketized_column(age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])
education_num = tf.contrib.layers.real_valued_column("education_num")
capital_gain = tf.contrib.layers.real_valued_column("capital_gain")
capital_loss = tf.contrib.layers.real_valued_column("capital_loss")
hours_per_week = tf.contrib.layers.real_valued_column("hours_per_week")

The wide model is a linear model with a wide set of sparse and crossed feature columns:

wide_columns = [
  gender, native_country, education, occupation, workclass, relationship, age_buckets,
  tf.contrib.layers.crossed_column([education, occupation], hash_bucket_size=int(1e4)),
  tf.contrib.layers.crossed_column([native_country, occupation], hash_bucket_size=int(1e4)),
  tf.contrib.layers.crossed_column([age_buckets, education, occupation], hash_bucket_size=int(1e6))
]

The Deep Model: Neural Network with Embeddings

deep_columns = [
  tf.contrib.layers.embedding_column(workclass, dimension=8),
  tf.contrib.layers.embedding_column(education, dimension=8),
  tf.contrib.layers.embedding_column(gender, dimension=8),
  tf.contrib.layers.embedding_column(relationship, dimension=8),
  tf.contrib.layers.embedding_column(native_country, dimension=8),
  tf.contrib.layers.embedding_column(occupation, dimension=8),
  age, education_num, capital_gain, capital_loss, hours_per_week
]

Combining Wide and Deep Models into one

model_dir = tempfile.mkdtemp()
m = tf.contrib.learn.DNNLinearCombinedClassifier(
    fix_global_step_increment_bug=True,
    model_dir=model_dir,
    linear_feature_columns=wide_columns,
    dnn_feature_columns=deep_columns,
    dnn_hidden_units=[100, 50])

Process input data

# Define the column names for the data sets.
COLUMNS = ["age", "workclass", "fnlwgt", "education", "education_num",
  "marital_status", "occupation", "relationship", "race", "gender",
  "capital_gain", "capital_loss", "hours_per_week", "native_country", "income_bracket"]
LABEL_COLUMN = 'label'
CATEGORICAL_COLUMNS = ["workclass", "education", "marital_status", "occupation",
                       "relationship", "race", "gender", "native_country"]
CONTINUOUS_COLUMNS = ["age", "education_num", "capital_gain", "capital_loss",
                      "hours_per_week"]

# Download the training and test data to temporary files.
# Alternatively, you can download them yourself and change train_file and
# test_file to your own paths.
train_file = tempfile.NamedTemporaryFile()
test_file = tempfile.NamedTemporaryFile()
urllib.urlretrieve("http://mlr.cs.umass.edu/ml/machine-learning-databases/adult/adult.data", train_file.name)
urllib.urlretrieve("http://mlr.cs.umass.edu/ml/machine-learning-databases/adult/adult.test", test_file.name)

# Read the training and test data sets into Pandas dataframe.
df_train = pd.read_csv(train_file, names=COLUMNS, skipinitialspace=True)
df_test = pd.read_csv(test_file, names=COLUMNS, skipinitialspace=True, skiprows=1)
df_train[LABEL_COLUMN] = (df_train['income_bracket'].apply(lambda x: '>50K' in x)).astype(int)
df_test[LABEL_COLUMN] = (df_test['income_bracket'].apply(lambda x: '>50K' in x)).astype(int)

def input_fn(df):
  # Creates a dictionary mapping from each continuous feature column name (k) to
  # the values of that column stored in a constant Tensor.
  continuous_cols = {k: tf.constant(df[k].values)
                     for k in CONTINUOUS_COLUMNS}
  # Creates a dictionary mapping from each categorical feature column name (k)
  # to the values of that column stored in a tf.SparseTensor.
  categorical_cols = {k: tf.SparseTensor(
      indices=[[i, 0] for i in range(df[k].size)],
      values=df[k].values,
      dense_shape=[df[k].size, 1])
                      for k in CATEGORICAL_COLUMNS}
  # Merges the two dictionaries into one.
  feature_cols = dict(continuous_cols.items() + categorical_cols.items())
  # Converts the label column into a constant Tensor.
  label = tf.constant(df[LABEL_COLUMN].values)
  # Returns the feature columns and the label.
  return feature_cols, label

def train_input_fn():
  return input_fn(df_train)

def eval_input_fn():
  return input_fn(df_test)

Training and evaluating the model

m.fit(input_fn=train_input_fn, steps=200)
results = m.evaluate(input_fn=eval_input_fn, steps=1)
for key in sorted(results):
    print("%s: %s" % (key, results[key]))

Continue with part 4: http://intellij.my/2017/08/08/tensorflow-tutorial-for-data-scientist-part-4/.

TensorFlow Tutorial for Data Scientist – Part 2

TensorFlow Neural Networks

Create a new Jupyter notebook with a Python 2.7 kernel and name it TensorFlow Neural Networks. Let's import all the required modules.

%pylab inline

import os
import numpy as np
import pandas as pd
from scipy.misc import imread
from sklearn.metrics import accuracy_score
import tensorflow as tf

Set a seed value so that we can control our model's randomness.

# To stop potential randomness
seed = 128
rng = np.random.RandomState(seed)

Set directory paths

root_dir = os.path.abspath('../')
data_dir = os.path.join(root_dir, 'tensorflow-tutorial/data')
sub_dir = os.path.join(root_dir, 'tensorflow-tutorial/sub')

# check for existence
os.path.exists(root_dir)
os.path.exists(data_dir)
os.path.exists(sub_dir)

Read the datasets. These are in .csv format and contain a filename along with the appropriate labels.

train = pd.read_csv(os.path.join(data_dir, 'Train', 'train.csv'))
test = pd.read_csv(os.path.join(data_dir, 'Test.csv'))

train.head()

Read an image

img_name = rng.choice(train.filename)
filepath = os.path.join(data_dir, 'Train', 'Images', 'train', img_name)

img = imread(filepath, flatten=True)

pylab.imshow(img, cmap='gray')
pylab.axis('off')
pylab.show()

Show the image as a NumPy array

img

Store all our images as NumPy arrays

temp = []
for img_name in train.filename:
    image_path = os.path.join(data_dir, 'Train', 'Images', 'train', img_name)
    img = imread(image_path, flatten=True)
    img = img.astype('float32')
    temp.append(img)
    
train_x = np.stack(temp)

temp = []
for img_name in test.filename:
    image_path = os.path.join(data_dir, 'Train', 'Images', 'test', img_name)
    img = imread(image_path, flatten=True)
    img = img.astype('float32')
    temp.append(img)

test_x = np.stack(temp)

Split the data 70:30 into train and validation sets

split_size = int(train_x.shape[0]*0.7)

train_x, val_x = train_x[:split_size], train_x[split_size:]
train_y, val_y = train.label.values[:split_size], train.label.values[split_size:]

Define some helper functions

def dense_to_one_hot(labels_dense, num_classes=10):
    """Convert class labels from scalars to one-hot vectors"""
    num_labels = labels_dense.shape[0]
    index_offset = np.arange(num_labels) * num_classes
    labels_one_hot = np.zeros((num_labels, num_classes))
    labels_one_hot.flat[index_offset + labels_dense.ravel()] = 1
    
    return labels_one_hot

def preproc(unclean_batch_x):
    """Convert values to range 0-1"""
    temp_batch = unclean_batch_x / unclean_batch_x.max()
    
    return temp_batch

def batch_creator(batch_size, dataset_length, dataset_name):
    """Create batch with random samples and return appropriate format"""
    batch_mask = rng.choice(dataset_length, batch_size)
    
    batch_x = eval(dataset_name + '_x')[[batch_mask]].reshape(-1, input_num_units)
    batch_x = preproc(batch_x)
    
    if dataset_name == 'train':
        batch_y = eval(dataset_name).ix[batch_mask, 'label'].values
        batch_y = dense_to_one_hot(batch_y)
        
    return batch_x, batch_y

Define a neural network architecture with 3 layers: input, hidden and output. The number of neurons in the input and output layers is fixed, since the input is our 28 x 28 image and the output is a 10 x 1 vector representing the class. We take 500 neurons in the hidden layer; this number can vary according to your needs. We also assign values to the remaining variables.

### set all variables

# number of neurons in each layer
input_num_units = 28*28
hidden_num_units = 500
output_num_units = 10

# define placeholders
x = tf.placeholder(tf.float32, [None, input_num_units])
y = tf.placeholder(tf.float32, [None, output_num_units])

# set remaining variables
epochs = 5
batch_size = 128
learning_rate = 0.01

### define weights and biases of the neural network (refer this article if you don't understand the terminologies)

weights = {
    'hidden': tf.Variable(tf.random_normal([input_num_units, hidden_num_units], seed=seed)),
    'output': tf.Variable(tf.random_normal([hidden_num_units, output_num_units], seed=seed))
}

biases = {
    'hidden': tf.Variable(tf.random_normal([hidden_num_units], seed=seed)),
    'output': tf.Variable(tf.random_normal([output_num_units], seed=seed))
}

Create the neural network's computational graph

hidden_layer = tf.add(tf.matmul(x, weights['hidden']), biases['hidden'])
hidden_layer = tf.nn.relu(hidden_layer)

output_layer = tf.matmul(hidden_layer, weights['output']) + biases['output']

Define the cost of our neural network

cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits = output_layer, labels=y))

Set the optimizer, i.e. our backpropagation algorithm (here Adam; gradient descent is another option)

optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cost)

Initialize all the variables

init = tf.global_variables_initializer()

Create a session and run the neural network in it. Then validate the model's accuracy on the validation set we just created.

with tf.Session() as sess:
    # create initialized variables
    sess.run(init)
    
    ### for each epoch, do:
    ###   for each batch, do:
    ###     create pre-processed batch
    ###     run optimizer by feeding batch
    ###     find cost and reiterate to minimize
    
    for epoch in range(epochs):
        avg_cost = 0
        total_batch = int(train.shape[0]/batch_size)
        for i in range(total_batch):
            batch_x, batch_y = batch_creator(batch_size, train_x.shape[0], 'train')
            _, c = sess.run([optimizer, cost], feed_dict = {x: batch_x, y: batch_y})
            
            avg_cost += c / total_batch
            
        print "Epoch:", (epoch+1), "cost =", "{:.5f}".format(avg_cost)
    
    print "\nTraining complete!"
    
    
    # find predictions on val set
    pred_temp = tf.equal(tf.argmax(output_layer, 1), tf.argmax(y, 1))
    accuracy = tf.reduce_mean(tf.cast(pred_temp, "float"))
    print "Validation Accuracy:", accuracy.eval({x: val_x.reshape(-1, input_num_units), y: dense_to_one_hot(val_y)})
    
    predict = tf.argmax(output_layer, 1)
    pred = predict.eval({x: test_x.reshape(-1, input_num_units)})

Test the model and visualize its predictions

img_name = rng.choice(test.filename)
filepath = os.path.join(data_dir, 'Train', 'Images', 'test', img_name)

img = imread(filepath, flatten=True)

test_index = int(img_name.split('.')[0]) - 49000
print "Prediction is: ", pred[test_index]

pylab.imshow(img, cmap='gray')
pylab.axis('off')
pylab.show()
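
The sub_dir path defined at the beginning was never used; one natural use is to write the test-set predictions out as a submission file. A minimal sketch (the sub01.csv file name and the filename/label column names are illustrative assumptions, not part of the original tutorial):

# Pair each test filename with its predicted label and save it to sub_dir.
submission = pd.DataFrame({'filename': test.filename, 'label': pred})
submission.to_csv(os.path.join(sub_dir, 'sub01.csv'), index=False)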

Continue with part 3: http://intellij.my/2017/08/07/tensorflow-tutorial-for-data-scientist-part-3/.

TensorFlow Tutorial for Data Scientist – Part 1

Set up the environment

Install Python 2.7.x, https://www.python.org/downloads/. Then install TensorFlow using pip, https://www.tensorflow.org/install/.

pip install tensorflow

Install Jupyter via pip.

pip install jupyter

Create a new folder called tensorflow-tutorial and cd into it via the terminal. Run the jupyter notebook command.

Useful TensorFlow operators

The official documentation carefully lays out all available math ops: https://www.tensorflow.org/api_docs/Python/math_ops.html.

Some specific examples of commonly used operators include:

tf.add(x, y): add two tensors of the same type, x + y
tf.sub(x, y): subtract tensors of the same type, x - y
tf.mul(x, y): multiply two tensors element-wise
tf.pow(x, y): take the element-wise power of x to y
tf.exp(x): equivalent to pow(e, x), where e is Euler's number (2.718...)
tf.sqrt(x): equivalent to pow(x, 0.5)
tf.div(x, y): take the element-wise division of x and y
tf.truediv(x, y): same as tf.div, except it casts the arguments to float
tf.floordiv(x, y): same as truediv, except it rounds the final answer down to an integer
tf.mod(x, y): take the element-wise remainder from division

Create a new Jupyter notebook with a Python 2.7 kernel and name it TensorFlow operators. Let's write a small program to add two numbers.

# import tensorflow
import tensorflow as tf

# build computational graph
a = tf.placeholder(tf.int16)
b = tf.placeholder(tf.int16)

addition = tf.add(a, b)

# initialize variables
init = tf.global_variables_initializer()

# create session and run the graph
with tf.Session() as sess:
    sess.run(init)
    print "Addition: %i" % sess.run(addition, feed_dict={a: 2, b: 3})

# the session is closed automatically when the with block exits

Exercise: Try all these operations and check the output. tf.add(x, y), tf.sub(x, y), tf.mul(x, y), tf.pow(x, y), tf.sqrt(x), tf.div(x, y) & tf.mod(x, y).
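
As a starting point for the exercise, the same placeholder-and-session pattern works for the other operators. A minimal sketch using a few of them (note that from TensorFlow 1.0 onwards tf.sub, tf.mul and tf.div were renamed tf.subtract, tf.multiply and tf.divide, so use whichever names your installed version provides):

import tensorflow as tf

x = tf.placeholder(tf.float32)
y = tf.placeholder(tf.float32)

# a few of the operators from the table above
ops = {
    'add': tf.add(x, y),
    'pow': tf.pow(x, y),
    'sqrt': tf.sqrt(x),
    'mod': tf.mod(x, y),
}

with tf.Session() as sess:
    for name, op in ops.items():
        print "%s: %s" % (name, sess.run(op, feed_dict={x: 9.0, y: 2.0}))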

Continue with part 2: http://intellij.my/2017/08/07/tensorflow-tutorial-for-data-scientist-part-2.

Image Recognition using Tensorflow

Step 1

Install Tensorflow using PIP, https://www.tensorflow.org/install/.

Step 2

Download bird images from Google Images using this Chrome extension, https://chrome.google.com/webstore/detail/fatkun-batch-download-ima/nnjjahlikiabnchcpehcpkdeckfgnohf?hl=en. Create 12 folders, one per bird, inside tf_files > birds. Make sure each bird folder contains at least 60-70 images, all of the same category.


Step 3

Use this command to retrain on your custom images:

sudo python retrain.py --bottleneck_dir=tf_files/bottlenecks --model_dir=tf_files/inception --output_graph=tf_files/retrained_graph.pb --output_labels=tf_files/retrained_labels.txt --image_dir /tf_files/birds

Step 4

Predict

python detect.py sample.png

Reference: https://github.com/datomnurdin/tensorflow-python

Classification on Bank Marketing dataset

The Bank Marketing dataset was used in Wisaeng, K. (2013). A comparison of different classification techniques for bank direct marketing. International Journal of Soft Computing and Engineering (IJSCE), 3(4), 116-119.

The data is related to the direct marketing campaigns (phone calls) of a Portuguese banking institution. The campaigns were based on phone calls, and often more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed ('yes') or not ('no').

There are four datasets:
1) bank-additional-full.csv with all examples (41188) and 20 inputs, ordered by date (from May 2008 to November 2010), very close to the data analyzed in [Moro et al., 2014].
2) bank-additional.csv with 10% of the examples (4119), randomly selected from 1), and 20 inputs.
3) bank-full.csv with all examples and 17 inputs, ordered by date (an older version of this dataset with fewer inputs).
4) bank.csv with 10% of the examples and 17 inputs, randomly selected from 3) (an older version of this dataset with fewer inputs).
The smallest datasets are provided to test more computationally demanding machine learning algorithms (e.g., SVM).

The classification goal is to predict whether the client will subscribe (yes/no) to a term deposit (variable y).

The columns in this dataset are:

  • age
  • job
  • marital
  • education
  • default
  • housing
  • loan
  • contact
  • month
  • day_of_week
  • duration
  • campaign
  • pdays
  • previous
  • poutcome
  • emp.var.rate
  • cons.price.idx
  • cons.conf.idx
  • euribor3m
  • nr.employed
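
To get a first look at the data and the class balance of the target variable y, here is a minimal sketch; it assumes bank-additional-full.csv has been downloaded from the UCI repository into the working directory (the UCI files are semicolon-separated):

import pandas as pd

# The UCI bank marketing files use ';' as the field separator.
df = pd.read_csv('bank-additional-full.csv', sep=';')

print(df.shape)                 # expected: (41188, 21) -> 20 inputs plus the target y
print(df['y'].value_counts())   # distribution of 'yes' vs 'no'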

 

Data Mining Syllabus – PyMathCamp

Demand for data science talent is exploding. McKinsey estimates that by 2018 a workforce of 500,000 data scientists will be needed in the US alone. The resulting talent gap must be filled by a new generation of data scientists. The term data scientist is quite ambiguous. The Center for Data Science at New York University describes data science as

the study of the generalizable extraction of knowledge from data [using] mathematics, machine learning, artificial intelligence, statistics, databases and optimization, along with a deep understanding of the craft of problem formulation to engineer effective solutions


As you can see, a data scientist is a professional with a multidisciplinary profile. Optimizing the value of data is dependent on the skills of the data scientists who process the data.

Intellij.my is offering these essentials with PyMathCamp. This course is your stepping stone to becoming a data scientist. Key concepts in data acquisition, preparation, exploration and visualization, along with examples of how to build interactive data science solutions, are presented using IPython notebooks.
You will learn to write Python code and apply data science techniques to many fields of interest, for example finance, robotics, marketing, gaming, computer vision, speech recognition and many more. By the end of this course, you will know how to build machine learning models and derive insights from data.

The course is organized into 11 chapters. The major components of PyMathCamp are:

1) Data management (extract, transform, load, storing, cleaning and transformation)

We begin by studying data warehousing and OLAP, data cube technology and multidimensional databases. (Chapters 2, 3 and 4)

2) Data Mining (machine learning technology, math and statistics)

Descriptive statistics are applied for data exploration, along with mining frequent patterns, associations and correlations. We will also learn more about the different types of machine learning methodology through Python programming. (Chapter 5)

3) Data Analysis/Prescription (classification, regression, clustering, visualization)

At this stage, we are ready to dive into data modelling with different types of machine learning methods. PyMathCamp includes many different machine learning techniques to analyse and mine data, including linear regression, logistic regression, support vector machines, ensembling and clustering among numerous others. Model construction and validation are studied. This rigorous data modelling process is further enhanced with graphical visualisation, and the end result leads to insight for intelligent decision making. (Chapters 6 and 7)

Source: Pethuru (2014)

Encapsulating data science intelligence and investing in modelling is vital for any organization to be successful.

Hence, we will use the data mining knowledge gained from the above chapters to analyse, extract and mine different types of data for value: more specifically, spatial and spatiotemporal data, object, multimedia, text, time series and web data. (Chapters 8, 9 and 10)

After spending a few months learning and programming with PyMathCamp, we will end the course by updating you with the latest applications and trends of data mining. (Chapter 11)

In conclusion, PyMathCamp is the perfect course for students who might not have the rigorous technical and programming background required to do data science on their own.

Credit to: Joe Choong

“Future belongs to those who figure out how to collect and use data successfully.” 

Muhammad Nurdin, CEO of IntelliJ.


Classification on Adult dataset

The Adult dataset was used in Ron Kohavi's 1996 paper, Scaling Up the Accuracy of Naive-Bayes Classifiers: a Decision-Tree Hybrid.

Predict whether income exceeds $50K/yr based on census data. This is also known as the "Census Income" dataset. Extraction was done by Barry Becker from the 1994 Census database. The prediction task is to determine whether a person makes over 50K a year.

The columns in this dataset are:

  • age
  • workclass
  • fnlwgt
  • education
  • education-num
  • marital-status
  • occupation
  • relationship
  • race
  • sex
  • capital-gain
  • capital-loss
  • hours-per-week
  • native-country

The model was generated using a Random Forest approach from scikit-learn (http://scikit-learn.org/stable/), together with Pandas (http://pandas.pydata.org/) and NumPy (http://www.numpy.org/).
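
A minimal sketch of that pipeline, assuming adult.data has been downloaded from the UCI repository; the column names follow the list above plus the income label, and the split and forest parameters are illustrative:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

COLUMNS = ["age", "workclass", "fnlwgt", "education", "education-num",
           "marital-status", "occupation", "relationship", "race", "sex",
           "capital-gain", "capital-loss", "hours-per-week",
           "native-country", "income"]

df = pd.read_csv('adult.data', names=COLUMNS, skipinitialspace=True)

# Binary label: 1 if income is above 50K, otherwise 0.
y = (df['income'] == '>50K').astype(int)

# One-hot encode the categorical columns so the forest can consume them.
X = pd.get_dummies(df.drop('income', axis=1))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))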

Sample adult data

Summary of numerical fields

Number of examples for each income class

Missing values per column (True means the column has missing values, otherwise False)

Generated model output