Dimensionality Reduction on Github Event using PCA approach

This case study uses the GitHub Event dataset and focuses on Malaysia’s developers.

This is a read-only API to the GitHub events. These events power the various activity streams on the site.

The columns in this dataset are:

  1. a_login
  2. e_CommitCommentEvent
  3. e_CreateEvent
  4. e_DeleteEvent
  5. e_DeploymentEvent
  6. e_DeploymentStatusEvent
  7. e_DownloadEvent
  8. e_FollowEvent
  9. e_ForkEvent
  10. e_ForkApplyEvent
  11. e_GistEvent
  12. e_GollumEvent
  13. e_IssueCommentEvent
  14. e_IssuesEvent
  15. e_MemberEvent
  16. e_MembershipEvent
  17. e_PageBuildEvent
  18. e_PublicEvent
  19. e_PullRequestEvent
  20. e_PullRequestReviewCommentEvent
  21. e_PushEvent
  22. e_ReleaseEvent
  23. e_RepositoryEvent
  24. e_StatusEvent
  25. e_TeamAddEvent
  26. e_WatchEvent

Sample Github Event data.

Lower dimension representation of our data frame.

Explained variance ratio.
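
A minimal sketch of how a lower-dimension representation and its explained variance ratio can be computed, assuming NumPy is available. The event counts, the three columns chosen and the use of two components are hypothetical, and the sketch centers the data without standardizing it; the original analysis may have used a different library or preprocessing.

```python
import numpy as np

def pca(X, n_components):
    """Project the rows of X onto the top principal components using SVD."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal axes, ordered by decreasing variance.
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained_variance = S ** 2 / (X.shape[0] - 1)
    ratio = explained_variance / explained_variance.sum()
    projected = X_centered @ Vt[:n_components].T
    return projected, ratio[:n_components]

# Hypothetical per-user counts for e_PushEvent, e_WatchEvent, e_ForkEvent.
counts = [[120, 4, 1], [15, 40, 9], [60, 10, 2], [3, 55, 12], [90, 8, 3]]
projected, ratio = pca(counts, n_components=2)
print(projected.shape)  # one row per user, one column per component
print(ratio)            # share of total variance kept by each component
```

The ratio shows how much of the spread in the original event columns survives in each retained component, which is what guides the choice of how many components to keep.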

Plot on the data frame.

Re-scaled mean per a_login across all the events.
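
One way to read this step, sketched under the assumption that "re-scaled" means min-max scaling to the range 0 to 1 (the text does not say which scaling was used). The logins and counts are hypothetical.

```python
def rescaled_means(event_counts):
    """Mean event count per login, min-max scaled to [0, 1]."""
    means = {login: sum(c) / len(c) for login, c in event_counts.items()}
    lo, hi = min(means.values()), max(means.values())
    span = hi - lo
    # If every login has the same mean, map all of them to 0.0.
    return {login: (m - lo) / span if span else 0.0 for login, m in means.items()}

# Hypothetical logins with counts for three event columns.
sample = {"user_a": [10, 0, 2], "user_b": [1, 1, 1], "user_c": [30, 6, 0]}
scaled = rescaled_means(sample)
print(scaled)
```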

Bubble plot chart (a_login mean).

Bubble plot chart (a_login sum).

Intelligent Traffic Light Control using Machine Learning Algorithms

One theoretical foundation of intelligent traffic light control is computational learning theory, which analyzes the computational complexity of machine learning algorithms. There are two main types of machine learning: supervised learning and unsupervised learning (regression is a form of supervised learning). This work deals mainly with supervised learning. Supervised learning is learning where the sample dataset is labeled with useful information. There are two variable types in supervised learning: categorical and continuous. A categorical (nominal) variable is one that has two or more categories, for example male and female. A continuous variable can take any value within a range, for example a waiting time of 12.5 seconds; a variable that can only take on a certain number of values, such as 1 or 2, is discrete. We therefore hypothesize that intelligent traffic light control can be achieved with a machine learning algorithm trained on supervised (labeled) sample data.

Sample training dataset from driver’s behavior to determine traffic light status.

In an intelligent traffic light system, we assume the system is equipped with a suitably sophisticated communication and sensor network. The traffic lights are able to communicate with each other, so they can coordinate their resources to counter ever-increasing travelling times and reduce waiting times at red traffic lights. The information gathered by the sensor network is applied inside the traffic light so it can study drivers’ behavior. Besides that, drivers get traffic information from a mobile app provided by the government, so they can plan well before they drive to their destination. A more advanced traffic light system detects when an emergency vehicle such as a police car or ambulance is passing through and keeps the light green to avoid any collision with other vehicles.

Intelligent traffic light control with a communication and sensor network system.

Based on observations and experimental data retrieved from traffic light sensors, drivers’ behavior can be collected and converted into valuable information to reduce waiting times at red traffic lights. The collected data is also transformed into information using machine learning algorithms to create an efficient and accurate model for predictive analysis. With predictive analysis, waiting times can be reduced even though the limited resources of the current infrastructure lead to ever-increasing travelling times. There are many machine learning algorithms, such as neural networks, linear regression, random forests, KNN and many more. The more data is collected, the more accurate the model becomes, because a model is not only valid for a certain period; it needs to be supervised and updated from time to time.
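
As an illustration of one of the algorithms named above, here is a minimal KNN sketch using only the Python standard library. The features (queue length and mean approach speed), the labels and every data point are hypothetical; the real sensor data and model are not described in the text.

```python
import math
from collections import Counter

def knn_predict(training, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    nearest = sorted(training, key=lambda row: math.dist(row[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical labeled samples:
# (queue length in vehicles, mean approach speed in km/h) -> light status
training = [
    ((12, 10), "green"), ((10, 15), "green"), ((9, 12), "green"),
    ((1, 50), "red"), ((2, 45), "red"), ((0, 60), "red"),
]
print(knn_predict(training, (11, 14)))
```

In practice the features would usually be scaled first, since KNN distances are dominated by whichever feature has the largest numeric range.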

Deep learning understanding.

Regression on Airfoil Self-Noise dataset using Linear Regression approach

This case study uses the Airfoil Self-Noise dataset.

The NASA data set comprises different size NACA 0012 airfoils at various wind tunnel speeds and angles of attack. The span of the airfoil and the observer position were the same in all of the experiments.

The columns in this dataset are:

  1. A = Frequency
  2. B = Angle of attack
  3. C = Chord length
  4. D = Free-stream velocity
  5. E = Suction side displacement thickness
  6. F = Scaled sound pressure level

Sample Airfoil Self-Noise data

Prediction variables (attributes)

  1. Frequency, in hertz.
  2. Angle of attack, in degrees.
  3. Chord length, in meters.
  4. Free-stream velocity, in meters per second.
  5. Suction side displacement thickness, in meters.

Target variables

  1. Scaled sound pressure level, in decibels.

Shape of the DataFrame

There are 1503 observations in the dataset.

Scatter plots

Use Statsmodels to estimate the model coefficients for the Airfoil Self-Noise data with B (angle of attack):

model coefficients for the Airfoil Self-Noise data
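
For a single predictor, the least-squares estimates have a closed form. The sketch below computes them in plain Python as a stand-in for the Statsmodels fit; it is not the original call, and the sample values for B and F are hypothetical.

```python
def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical values standing in for columns B (angle of attack) and F (sound level).
b = [0.0, 2.0, 4.0, 6.0, 8.0]
f = [126.0, 126.5, 125.8, 126.9, 126.3]
intercept, slope = fit_simple_ols(b, f)
print(intercept, slope)
```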

Interpreting Model Coefficients

Interpreting the angle of attack coefficient (β1)

  • A “unit” increase in angle of attack is associated with a 0.008927 “unit” increase in F (scaled sound pressure level).

Using the Model for Prediction

Suppose the angle of attack were 70. What scaled sound pressure level would we predict? (First approach for prediction)

126.309388 + (0.008927 * 70) = 126.934278

Thus, we would predict a scaled sound pressure level of 126.934278.
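
The same arithmetic as a small function, using the intercept and slope reported above. The function name is hypothetical.

```python
INTERCEPT = 126.309388  # β0 as reported in the text
SLOPE = 0.008927        # β1 as reported in the text

def predict_sound_level(angle_of_attack):
    """Predict F (scaled sound pressure level) from B (angle of attack)."""
    return INTERCEPT + SLOPE * angle_of_attack

print(round(predict_sound_level(70), 6))
```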

Use Statsmodels to make the prediction: (Second approach for prediction)

Statsmodels to make the prediction

Plotting the Least Squares Line

Make predictions for the smallest and largest observed values of x, and then use the predicted values to plot the least squares line:
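
A sketch of that step in plain Python: because the fitted model is a straight line, predictions at the smallest and largest x are enough to draw it. The coefficients are the ones reported earlier; the angle values and the function name are hypothetical.

```python
def least_squares_endpoints(x_values, intercept, slope):
    """Return (x, predicted y) at the smallest and largest observed x."""
    x_min, x_max = min(x_values), max(x_values)
    return [(x, intercept + slope * x) for x in (x_min, x_max)]

# Hypothetical observed angles of attack (column B).
angles = [0.0, 4.0, 12.5, 20.0]
endpoints = least_squares_endpoints(angles, intercept=126.309388, slope=0.008927)
print(endpoints)
```

The two points can then be handed to any plotting routine and joined with a line over the scatter plot.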

DataFrame with the minimum and maximum values of B

predictions for those x values

least squares line

confidence intervals for the model coefficients

Data Visualization (Scatter Plot) on Forest Fires dataset

The Forest Fires dataset was used in the 2008 paper by D. Zhang, Y. Tian and P. Zhang, Kernel-Based Nonparametric Regression Method.

In [Cortez and Morais, 2007], the output ‘area’ was first transformed with a ln(x+1) function. Then, several Data Mining methods were applied. After fitting the models, the outputs were post-processed with the inverse of the ln(x+1) transform. Four different input setups were used. The experiments were conducted using a 10-fold (cross-validation) x 30 runs. Two regression metrics were measured: MAD and RMSE. A Gaussian support vector machine (SVM) fed with only 4 direct weather conditions (temp, RH, wind and rain) obtained the best MAD value: 12.71 +- 0.01 (mean and confidence interval within 95% using a t-student distribution). The best RMSE was attained by the naive mean predictor. An analysis to the regression error curve (REC) shows that the SVM model predicts more examples within a lower admitted error. In effect, the SVM model predicts better small fires, which are the majority.
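
The ln(x+1) transform, its inverse and the two error metrics can be sketched with the standard library. This assumes MAD denotes the mean absolute deviation (average absolute error); the function names are hypothetical and the sample areas are illustrative, apart from 1090.84, which is the maximum quoted for this dataset.

```python
import math

def log_transform(area):
    """ln(x + 1), which keeps an area of 0 at 0."""
    return math.log1p(area)

def inverse_log_transform(value):
    """exp(y) - 1, the inverse of ln(x + 1)."""
    return math.expm1(value)

def mad(actual, predicted):
    """Mean absolute deviation between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

areas = [0.0, 0.36, 6.43, 1090.84]  # illustrative burned areas in ha
transformed = [log_transform(a) for a in areas]
restored = [inverse_log_transform(t) for t in transformed]
print(transformed)
print(restored)
```

Errors are measured after the inverse transform, so MAD and RMSE are expressed in hectares rather than in log units.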

The columns in this dataset are:

  • X
  • Y
  • month
  • day
  • FFMC
  • DMC
  • DC
  • ISI
  • temp
  • RH
  • wind
  • rain
  • area

The scatter plot was generated using Pandas (http://pandas.pydata.org/) and Matplotlib (http://matplotlib.org/).

Sample forest fires data

Prediction variables (attributes)

  1. X – x-axis spatial coordinate within the Montesinho park map: 1 to 9
  2. Y – y-axis spatial coordinate within the Montesinho park map: 2 to 9
  3. month – month of the year: ‘jan’ to ‘dec’
  4. day – day of the week: ‘mon’ to ‘sun’
  5. FFMC – FFMC index from the FWI system: 18.7 to 96.20
  6. DMC – DMC index from the FWI system: 1.1 to 291.3
  7. DC – DC index from the FWI system: 7.9 to 860.6
  8. ISI – ISI index from the FWI system: 0.0 to 56.10
  9. temp – temperature in Celsius degrees: 2.2 to 33.30
  10. RH – relative humidity in %: 15.0 to 100
  11. wind – wind speed in km/h: 0.40 to 9.40
  12. rain – outside rain in mm/m2 : 0.0 to 6.4

Target variables

  1. area – the burned area of the forest (in ha): 0.00 to 1090.84 (this output variable is very skewed towards 0.0, thus it may make sense to model with the logarithm transform).

Shape of the DataFrame

There are 517 observations in the dataset.

Scatter plots

Classification on Adult dataset

The Adult dataset was used in Ron Kohavi’s 1996 paper, Scaling Up the Accuracy of Naive-Bayes Classifiers: a Decision-Tree Hybrid.

Predict whether income exceeds $50K/yr based on census data. Also known as “Census Income” dataset. Extraction was done by Barry Becker from the 1994 Census database. Prediction task is to determine whether a person makes over 50K a year.

The columns in this dataset are:

  • age
  • workclass
  • fnlwgt
  • education
  • education-num
  • marital-status
  • occupation
  • relationship
  • race
  • sex
  • capital-gain
  • capital-loss
  • hours-per-week
  • native-country

The model was generated using a Random Forest approach with scikit-learn (http://scikit-learn.org/stable/), Pandas (http://pandas.pydata.org/) and NumPy (http://www.numpy.org/).

Sample adult data

Summary of numerical fields

Number of examples for each income class
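
A minimal sketch of counting examples per class with the standard library. The label list is hypothetical; it assumes the two income classes are written as "<=50K" and ">50K".

```python
from collections import Counter

# Hypothetical income labels, one per example.
incomes = ["<=50K", ">50K", "<=50K", "<=50K", ">50K"]
class_counts = Counter(incomes)
print(class_counts)
```

A large gap between the two counts signals class imbalance, which is worth knowing before judging a classifier by accuracy alone.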

True means the column has missing values; False means it does not.
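
A sketch of a per-column missing-value check in plain Python. It assumes missing entries appear as None, an empty string or a "?" marker; the rows and the helper names are hypothetical, and the original check was probably done with Pandas.

```python
def is_missing(value):
    """Treat None, empty strings and '?' markers as missing."""
    if value is None:
        return True
    return isinstance(value, str) and value.strip() in {"", "?"}

def columns_with_missing(rows):
    """Map each column name to True if any row has a missing value in it."""
    return {col: any(is_missing(row[col]) for row in rows) for col in rows[0]}

# Hypothetical rows using three of the dataset's columns.
rows = [
    {"age": 39, "workclass": "State-gov", "occupation": "Adm-clerical"},
    {"age": 54, "workclass": " ?", "occupation": "Sales"},
    {"age": 28, "workclass": "Private", "occupation": "Tech-support"},
]
missing = columns_with_missing(rows)
print(missing)
```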

Model Output generated.

PyMathCamp aims to produce modern innovators through data science & mathematics

Innovative thinking and the necessary skill sets are crucial for solving real-world problems. As we approach the future, problems will become more complex. Malaysia is in dire need of modern innovators to develop state-of-the-art solutions to them. And innovative thinking alone is not enough to develop a solution.

With a shortage of data science and mathematics talent, Malaysia will have a tough time finding local intellectual resources to solve local problems.

Yes, it is true that Malaysia can outsource to foreign experts, but it is not right to depend on them all the time. Even with that dependency, the supply is still insufficient. First, technology transfer can be very expensive; second, foreign workers take time to adapt to the local structure before they can develop a suitable solution. The more time taken, the more money spent.

Malaysia is lacking innovators.

“Malaysia may not have enough engineers, architects, and other professionals, to achieve Vision 2020 based on the low level of interest by our students in science, technology, engineering, and mathematics (STEM). If the situation goes on, Malaysia may have to depend on foreign workers to attain developed status, warn expert.” Star Sunday.

Wawasan 2020 is getting nearer, yet we are still unable to show that we can ‘supply’ the vision.

Here we are, wanting to provide high-impact education focused on data science and mathematics to ALL Malaysians for FREE, so that the whole nation can change millions of lives for the better.

Introducing PyMathCamp.

PyMathCamp will be an online learning platform that teaches data science and mathematics using programming languages such as Python, C++ or R, to prepare future Malaysian innovators who can act to solve problems.

The online learning platform will help them learn how to code and further their careers in science, technology, engineering and mathematics (STEM). How?

How can data science and mathematics create innovators?

Data science and mathematics are not “subjects in the class, stay in the class”. They are basic necessities for all kinds of businesses: health, agriculture, finance, social sciences, maritime sciences, planetary sciences, meteorology, geography, and many more. You name it. STEM is WIDE.

Data science, in simple words, is the study of how to gather interesting data. How interesting the data is depends on the person searching for it. Data is an ocean. Whatever subject a person wants to explore, he or she must learn the science of pulling it from the ocean (of data), cleaning it, grooming it and presenting it informatively.

Mathematics, on the other hand, is what makes life measurable, down to basic things like genomics. Mathematics demands wisdom, judgment and maturity. We can make errors on the way to a solution, alter our methods or start all over. In life, reality mostly doesn’t allow us to redo anything, but under a ‘measurable condition’ we are allowed to attempt to change things.

By defining their importance in state-of-the-art programming, we can see how both subjects are keys to economic prosperity. Without the talents above, we will have difficulty obtaining interesting parameters. To obtain them, data science and mathematics must be learnt.

Modern students of PyMathCamp should expect the following:

Students will be able to create emphatic solutions. They will be able to build advanced innovations through data science and mathematics and deliver curing value to others.

Expect a variety of topics such as data exploration, visualization, feature engineering, predictive analytics, predictive modeling, clustering, big data pipelines, metrics and many more.

All trainers and mentors are expert, highly trained and experienced Malaysians. They specialize in data science, computer vision, big data, machine learning, artificial intelligence and more.

Students are also expected to find their own solutions by leveraging our programming community portal and discussion group (chit chat). For open source development, PyMathCamp will be integrated with GitHub.

We have an evidence-based method to improve every user’s learning curve all the way to the finish line.

Note that PyMathCamp will be committed only to two specific fields: data science and mathematics.

There will be no age limit.

PyMathCamp will focus on Python, C++ or R because they are beginner-friendly (easy to use and understand), well supported for mathematics and the mother tongue of artificial intelligence. Truly in-demand skill sets for sure.

And it is free. Yup. No charges.

Carpe diem.

Seize the day.

We want to build a smart society to build smart structures.

We want to produce an intelligent society. Malaysia needs a smart society to help the nation grow together, to achieve Wawasan 2020 and beyond.

Beyond filling job vacancies, we aim for students to be able to invent advanced solutions and create intelligent startups to solve society’s problems. This is actually our deepest aim. We want students to be modern innovators.

In simple words, PyMathCamp is preparing Malaysians for the amazing (automated) future.

Join PyMathCamp.

IntelliJ is a deeply value-oriented company.

We want to educate Malaysians and bring the Malaysian mind to an advanced level, starting small and FOR FREE, which is essential to turning Malaysia into an economically prosperous place.

We want to produce marketable Malaysians, in this self-serving economy, with high-impact education as the first line of defense.

We pray that every mission of ours enriches all lives.

“The future belongs to those who figure out how to collect and use data successfully.”

Muhammad Nurdin, CEO of IntelliJ.
