Data Visualization (Scatter Plot) on Forest Fires dataset

The Forest Fires dataset was used in the 2008 paper by D. Zhang, Y. Tian and P. Zhang, Kernel-Based Nonparametric Regression Method.

In [Cortez and Morais, 2007], the output ‘area’ was first transformed with a ln(x+1) function. Several data mining methods were then applied and, after the models were fitted, their outputs were post-processed with the inverse of the ln(x+1) transform. Four different input setups were used. The experiments were run as 30 repetitions of 10-fold cross-validation. Two regression metrics were measured: MAD and RMSE. A Gaussian support vector machine (SVM) fed with only the 4 direct weather conditions (temp, RH, wind and rain) obtained the best MAD value: 12.71 ± 0.01 (mean and 95% confidence interval, using a Student's t distribution). The best RMSE was attained by the naive mean predictor. An analysis of the regression error curve (REC) shows that the SVM model predicts more examples within a lower admitted error. In effect, the SVM model is better at predicting small fires, which are the majority.
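The transform and the two metrics described above can be sketched in a few lines of NumPy. This is only an illustration, not the code used in the paper: the function names are made up here, and MAD is assumed to mean the mean absolute deviation between observed and predicted areas.

```python
import numpy as np


def to_log_area(area):
    """Forward transform ln(x + 1), applied to the burned area before fitting."""
    return np.log1p(np.asarray(area, dtype=float))


def from_log_area(log_area):
    """Inverse transform exp(y) - 1, applied to model outputs after fitting."""
    return np.expm1(np.asarray(log_area, dtype=float))


def mad(y_true, y_pred):
    """Mean absolute deviation (assumed reading of MAD)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


def rmse(y_true, y_pred):
    """Root mean squared error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```

Because ln(0 + 1) = 0, fires with zero burned area stay at zero after the transform, which is why ln(x+1) is preferred over a plain logarithm for this kind of target. RMSE squares each error, so a few very large fires can dominate it, whereas MAD weights all errors linearly.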

The columns in this dataset are:

  • X
  • Y
  • month
  • day
  • FFMC
  • DMC
  • DC
  • ISI
  • temp
  • RH
  • wind
  • rain
  • area

The scatter plots were generated using Pandas and Matplotlib.

Sample forest fires data
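A minimal sketch of how the sample rows could be loaded and inspected with Pandas. The file name `forestfires.csv` and the helper `load_forest_fires` are assumptions made for illustration; the column check simply mirrors the column list given earlier.

```python
import pandas as pd

# Column names as listed in this post.
EXPECTED_COLUMNS = [
    "X", "Y", "month", "day", "FFMC", "DMC", "DC",
    "ISI", "temp", "RH", "wind", "rain", "area",
]


def load_forest_fires(path):
    """Read the CSV at `path` and verify that the expected columns are present."""
    df = pd.read_csv(path)
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"missing expected columns: {missing}")
    return df


if __name__ == "__main__":
    # Hypothetical local path; adjust to wherever the dataset is stored.
    df = load_forest_fires("forestfires.csv")
    print(df.head())  # first five rows as a quick sample
```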

Prediction variables (attributes)

  1. X – x-axis spatial coordinate within the Montesinho park map: 1 to 9
  2. Y – y-axis spatial coordinate within the Montesinho park map: 2 to 9
  3. month – month of the year: ‘jan’ to ‘dec’
  4. day – day of the week: ‘mon’ to ‘sun’
  5. FFMC – FFMC index from the FWI system: 18.7 to 96.2
  6. DMC – DMC index from the FWI system: 1.1 to 291.3
  7. DC – DC index from the FWI system: 7.9 to 860.6
  8. ISI – ISI index from the FWI system: 0.0 to 56.1
  9. temp – temperature in degrees Celsius: 2.2 to 33.3
  10. RH – relative humidity in %: 15.0 to 100
  11. wind – wind speed in km/h: 0.4 to 9.4
  12. rain – outside rain in mm/m2: 0.0 to 6.4

Target variable

  1. area – the burned area of the forest (in ha): 0.00 to 1090.84 (this output variable is heavily skewed towards 0.0, so it may make sense to model it with a logarithm transform).

shape of the DataFrame

There are 517 observations in the dataset.
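The observation count can be read from the DataFrame's `shape` attribute, which is a `(rows, columns)` tuple. A small sketch, where `describe_shape` is a hypothetical helper name:

```python
import pandas as pd


def describe_shape(df):
    """Return a short summary built from df.shape, a (rows, columns) tuple."""
    n_rows, n_cols = df.shape
    return f"{n_rows} observations, {n_cols} columns"
```

Applied to the full dataset, this should report the 517 observations mentioned above along with the 13 listed columns.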

Scatter plots
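One way such scatter plots could be drawn with Matplotlib is sketched below. The helper name `scatter_against_area`, the choice to plot a predictor against `area`, and the marker styling are all assumptions for illustration; the original plots may have used different column pairs.

```python
import matplotlib.pyplot as plt
import pandas as pd


def scatter_against_area(df, x_column, ax=None):
    """Scatter one predictor column against the burned area and return the Axes."""
    if ax is None:
        _, ax = plt.subplots()
    # Small, semi-transparent markers help when many points overlap near area = 0.
    ax.scatter(df[x_column], df["area"], s=12, alpha=0.6)
    ax.set_xlabel(x_column)
    ax.set_ylabel("area (ha)")
    ax.set_title(f"{x_column} vs. area")
    return ax
```

For example, `scatter_against_area(df, "temp")` followed by `plt.show()` would display temperature against burned area. Since `area` is heavily skewed towards zero, plotting ln(area + 1) on the y-axis may make the pattern easier to see.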
