Classification on Bank Marketing dataset

The Bank Marketing dataset was used in Wisaeng, K. (2013). A comparison of different classification techniques for bank direct marketing. International Journal of Soft Computing and Engineering (IJSCE), 3(4), 116-119.

The data is related to the direct marketing campaigns of a Portuguese banking institution. The campaigns were based on phone calls; often, more than one contact with the same client was required in order to assess whether the product (bank term deposit) would be subscribed (‘yes’) or not (‘no’).

There are four datasets:
1) bank-additional-full.csv with all examples (41188) and 20 inputs, ordered by date (from May 2008 to November 2010), very close to the data analyzed in [Moro et al., 2014]
2) bank-additional.csv with 10% of the examples (4119), randomly selected from 1), and 20 inputs.
3) bank-full.csv with all examples and 17 inputs, ordered by date (older version of this dataset with fewer inputs).
4) bank.csv with 10% of the examples and 17 inputs, randomly selected from 3) (older version of this dataset with fewer inputs).
The smallest datasets are provided to test more computationally demanding machine learning algorithms (e.g., SVM).

The classification goal is to predict whether the client will subscribe (yes/no) to a term deposit (variable y); a minimal code sketch follows the column list below.

The columns in this dataset are:

  • age
  • job
  • marital
  • education
  • default
  • housing
  • loan
  • contact
  • month
  • day_of_week
  • duration
  • campaign
  • pdays
  • previous
  • poutcome
  • emp.var.rate
  • cons.price.idx
  • cons.conf.idx
  • euribor3m
  • nr.employed
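
As a minimal sketch (not from the original post), the data can be loaded and a baseline classifier fitted with pandas and scikit-learn. The file name follows the UCI distribution, which is semicolon-separated:

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Load the full dataset (semicolon-separated in the UCI distribution).
    df = pd.read_csv("bank-additional-full.csv", sep=";")

    # One-hot encode the categorical inputs and map the target y to 0/1.
    X = pd.get_dummies(df.drop(columns="y"))
    y = (df["y"] == "yes").astype(int)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("test accuracy:", clf.score(X_test, y_test))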

 

Intellij’s Trainings of the Month

This May we will have 3 events.

MOBILE IOS DEV USING SWIFT 2.0
16 – 17 May 2016
The workshop breaks down the process of becoming an iOS developer.
Level: Beginner to Intermediate.
Sign up at: http://peatix.com/event/168172

=======

MOBILE ANDROID DEVELOPMENT
23 – 24 May 2016
The workshop breaks down the process of becoming an Android developer.
Level: Beginner to Intermediate.
Sign up at: http://peatix.com/event/168176

=======
TEACH YOUR ROBOTS TO SEE, PERCEIVE SURROUNDINGS, AND COMPLETE TASKS.
28 – 29 May 2016.
OpenCV training using Python, with plenty of hands-on, practical exercises.
9.00am to 6.00pm at Ovul Damansara.
Sign up at: http://peatix.com/event/167151

Grab your seats now!

Let’s change the world.

email: nurdin@intellij.my | call: 01126252058


Data Mining Syllabus – PyMathCamp

Demand for data science talent is exploding. McKinsey estimates that by 2018, a 500,000-strong workforce of data scientists will be needed in the US alone. The resulting talent gap must be filled by a new generation of data scientists. The term data scientist is quite ambiguous. The Center for Data Science at New York University describes data science as,

the study of the generalizable extraction of knowledge from data [using] mathematics, machine learning, artificial intelligence, statistics, databases and optimization, along with a deep understanding of the craft of problem formulation to engineer effective solutions


As you can see, a data scientist is a professional with a multidisciplinary profile. Optimizing the value of data is dependent on the skills of the data scientists who process the data.

Intellij.my is offering these essentials with PyMathCamp. This course is your stepping stone to becoming a data scientist. Key concepts in data acquisition, preparation, exploration and visualization, along with examples of how to build interactive data science solutions, are presented using IPython notebooks.
You will learn to write Python code and apply data science techniques to many fields of interest, for example finance, robotics, marketing, gaming, computer vision, speech recognition and many more. By the end of this course, you will know how to build machine learning models and derive insights from data.

The course is organized into 11 chapters. The major components of PyMathCamp are:

1) Data management (extract, transform, load, storing, cleaning and transformation)

We begin by studying data warehousing and OLAP, data cube technology and multidimensional databases. (Chapters 2, 3 and 4)

2) Data Mining (machine learning technology, math and statistics)

Descriptive statistics are applied for data exploration, and we cover mining frequent patterns, associations and correlations. We will also learn more about the different types of machine learning methodology through Python programming. (Chapter 5)

3) Data Analysis/Prescription (classification, regression, clustering, visualization)

At this stage, we are ready to dive into data modelling with different types of machine learning methods. PyMathCamp includes many different machine learning techniques to analyse and mine data, including linear regression, logistic regression, support vector machines, ensemble methods and clustering, among numerous others. Model construction and validation are studied. This rigorous data modelling process is further enhanced with graphical visualisation. The end result will lead to insight for intelligent decision making. (Chapters 6 and 7)

Source: Pethuru (2014)

Encapsulating data science intelligence and investing in modelling is vital for any organization to be successful.

Hence, we will use the data mining knowledge gained from the above chapters to analyse, extract and mine different types of data for value: more specifically, spatial and spatiotemporal data, object, multimedia, text, time-series and web data. (Chapters 8, 9 and 10)

After spending a few months learning and programming with PyMathCamp, we will end the course by bringing you up to date with the latest applications and trends in data mining. (Chapter 11)

In conclusion, PyMathCamp is the perfect course for students who might not have the rigorous technical and programming background required to do data science on their own.

Credit to: Joe Choong

“The future belongs to those who figure out how to collect and use data successfully.”

Muhammad Nurdin, CEO of IntelliJ.


Dimensionality Reduction on Github Event using PCA approach

This case study uses the GitHub Event dataset, focusing on Malaysia’s developers.

The data comes from GitHub’s read-only Events API; these events power the various activity streams on the site.


The columns in this dataset are:

  1. a_login
  2. e_CommitCommentEvent
  3. e_CreateEvent
  4. e_DeleteEvent
  5. e_DeploymentEvent
  6. e_DeploymentStatusEvent
  7. e_DownloadEvent
  8. e_FollowEvent
  9. e_ForkEvent
  10. e_ForkApplyEvent
  11. e_GistEvent
  12. e_GollumEvent
  13. e_IssueCommentEvent
  14. e_IssuesEvent
  15. e_MemberEvent
  16. e_MembershipEvent
  17. e_PageBuildEvent
  18. e_PublicEvent
  19. e_PullRequestEvent
  20. e_PullRequestReviewCommentEvent
  21. e_PushEvent
  22. e_ReleaseEvent
  23. e_RepositoryEvent
  24. e_StatusEvent
  25. e_TeamAddEvent
  26. e_WatchEvent

Sample Github Event data.
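
A minimal loading sketch; the file name is hypothetical, since the post does not give one, and we assume one row per a_login with one column per event type:

    import pandas as pd

    # Hypothetical file name for the exported event counts.
    events = pd.read_csv("github_events_malaysia.csv")
    print(events.shape)
    print(events.head())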

Lower dimension representation of our data frame.

Explained variance ratio.
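
A sketch of the two steps above using scikit-learn, continuing from the loading sketch; standardising the counts before PCA is our assumption, as the post does not show its preprocessing:

    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    # Drop the identifier column and standardise the event counts.
    X = events.drop(columns="a_login")
    X_scaled = StandardScaler().fit_transform(X)

    # Project onto the first two principal components.
    pca = PCA(n_components=2)
    X_2d = pca.fit_transform(X_scaled)

    # Fraction of the total variance captured by each component.
    print(pca.explained_variance_ratio_)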

Plot of the data frame.
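
One way to produce such a plot from the projection above, assuming matplotlib:

    import matplotlib.pyplot as plt

    plt.scatter(X_2d[:, 0], X_2d[:, 1], alpha=0.5)
    plt.xlabel("PC1")
    plt.ylabel("PC2")
    plt.title("GitHub events projected onto the first two components")
    plt.show()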

Re-scaled mean per a_login across all the events.
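
One plausible reading of this step (the post does not spell out its scaling) is the per-login mean over the event columns, min-max rescaled to [0, 1]:

    # Mean event count per login, across all event columns.
    mean_per_login = events.set_index("a_login").mean(axis=1)

    # Min-max rescale to [0, 1].
    rescaled = (mean_per_login - mean_per_login.min()) / (
        mean_per_login.max() - mean_per_login.min())
    print(rescaled.sort_values(ascending=False).head())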

Bubble plot chart (a_login mean).

Bubble plot chart (a_login sum).
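
A sketch of such a bubble plot, sizing each point in the PCA projection by the per-login mean; swapping .mean(axis=1) for .sum(axis=1) gives the second chart:

    import matplotlib.pyplot as plt

    # Bubble sizes from the per-login mean event count, scaled up for visibility.
    sizes = events.set_index("a_login").mean(axis=1)

    plt.scatter(X_2d[:, 0], X_2d[:, 1], s=sizes * 50, alpha=0.4)
    plt.xlabel("PC1")
    plt.ylabel("PC2")
    plt.title("Bubble plot sized by per-login mean event count")
    plt.show()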

Regression on Airfoil Self-Noise dataset using Linear Regression approach

This case study uses the Airfoil Self-Noise dataset.

The NASA data set comprises different size NACA 0012 airfoils at various wind tunnel speeds and angles of attack. The span of the airfoil and the observer position were the same in all of the experiments.

The columns in this dataset are:

  1. A = Frequency
  2. B = Angle of attack
  3. C = Chord length
  4. D = Free-stream velocity
  5. E = Suction side displacement thickness
  6. F = Scaled sound pressure level

Sample Airfoil Self-Noise data.
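
A minimal loading sketch, assuming the UCI file airfoil_self_noise.dat (whitespace-separated, no header row):

    import pandas as pd

    # Columns A..F as listed above.
    cols = ["A", "B", "C", "D", "E", "F"]
    df = pd.read_csv("airfoil_self_noise.dat", sep=r"\s+", header=None, names=cols)
    print(df.shape)  # expect (1503, 6)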

Prediction variables (attributes)

  1. Frequency, in Hertz.
  2. Angle of attack, in degrees.
  3. Chord length, in meters.
  4. Free-stream velocity, in meters per second.
  5. Suction side displacement thickness, in meters.

Target variables

  1. Scaled sound pressure level, in decibels.

Shape of the DataFrame.

There are 1503 observations in the dataset.

Scatter plots.
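
A sketch of the scatter plots, plotting each predictor against the target F:

    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(1, 5, figsize=(18, 3), sharey=True)
    for ax, col in zip(axes, ["A", "B", "C", "D", "E"]):
        ax.scatter(df[col], df["F"], s=8, alpha=0.3)
        ax.set_xlabel(col)
    axes[0].set_ylabel("F (scaled sound pressure level)")
    plt.show()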

Use Statsmodels to estimate the model coefficients for the Airfoil Self-Noise data with B (angle of attack):

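A sketch of the fit with the statsmodels formula API; the post reports an intercept of about 126.309388 and a B coefficient of about 0.008927:

    import statsmodels.formula.api as smf

    # Ordinary least squares: F as a linear function of B (angle of attack).
    lm = smf.ols(formula="F ~ B", data=df).fit()
    print(lm.params)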

Interpreting Model Coefficients

Interpretation of the angle of attack coefficient (β1)

  • A “unit” increase in angle of attack is associated with a 0.008927 “unit” increase in F (scaled sound pressure level).

Using the Model for Prediction

Suppose the angle of attack is 70. What would we predict for the scaled sound pressure level? (First approach for prediction)

126.309388 + (0.008927 * 70) = 126.934278

Thus, we would predict scaled sound pressure level of 126.934278.

Use Statsmodels to make the prediction: (Second approach for prediction)

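A sketch using the fitted model’s predict method:

    import pandas as pd

    # Predict F at an angle of attack of 70.
    X_new = pd.DataFrame({"B": [70]})
    print(lm.predict(X_new))  # about 126.934278, matching the manual calculation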

Plotting the Least Squares Line

Make predictions for the smallest and largest observed values of x, and then use the predicted values to plot the least squares line:

DataFrame with the minimum and maximum values of B.

 

Predictions for those x values.

Least squares line.
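
A sketch of the three steps above: build a DataFrame with the extremes of B, predict at those values, and draw the fitted line over the scatter:

    import pandas as pd
    import matplotlib.pyplot as plt

    # Minimum and maximum observed values of B.
    X_line = pd.DataFrame({"B": [df["B"].min(), df["B"].max()]})

    # Predictions for those x values.
    preds = lm.predict(X_line)

    # Least squares line over the raw data.
    plt.scatter(df["B"], df["F"], s=8, alpha=0.3)
    plt.plot(X_line["B"], preds, c="red", linewidth=2)
    plt.xlabel("B (angle of attack)")
    plt.ylabel("F (scaled sound pressure level)")
    plt.show()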

 

Confidence intervals for the model coefficients.
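
These come straight from the fitted results object:

    # 95% confidence intervals (default) for the intercept and the B coefficient.
    print(lm.conf_int())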