The Forest Fires dataset was used in the 2008 paper by D. Zhang, Y. Tian and P. Zhang, Kernel-Based Nonparametric Regression Method.
In [Cortez and Morais, 2007], the output ‘area’ was first transformed with a ln(x+1) function. Several data mining methods were then applied and, after the models were fitted, their outputs were post-processed with the inverse of the ln(x+1) transform. Four different input setups were tested. The experiments used 30 runs of 10-fold cross-validation, and two regression metrics were measured: mean absolute deviation (MAD) and root mean squared error (RMSE). A Gaussian support vector machine (SVM) fed with only the 4 direct weather conditions (temp, RH, wind and rain) obtained the best MAD value: 12.71 ± 0.01 (mean and 95% confidence interval, computed with a Student's t-distribution). The best RMSE was attained by the naive mean predictor. An analysis of the regression error characteristic (REC) curve shows that the SVM model predicts more examples within a lower admitted error. In effect, the SVM model is better at predicting small fires, which are the majority.
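The transform, its inverse, and the two metrics described above can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function names are hypothetical, the burned-area values are made up rather than taken from the dataset, and fitting the naive mean predictor in the transformed space is an assumption.

```python
import math

def log_transform(areas):
    """Apply ln(x + 1) to each burned-area value."""
    return [math.log1p(a) for a in areas]

def inverse_log_transform(values):
    """Invert ln(x + 1), i.e. compute exp(y) - 1."""
    return [math.expm1(v) for v in values]

def mad(y_true, y_pred):
    """Mean absolute deviation between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error between targets and predictions."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))

# Made-up burned areas in ha (illustrative only, not rows from the dataset).
areas = [0.0, 0.0, 0.5, 2.0, 10.0, 100.0]
log_areas = log_transform(areas)

# Naive mean predictor: predict the mean of the transformed target for every
# example, then map the prediction back to hectares (assumed setup).
mean_log = sum(log_areas) / len(log_areas)
naive_pred = inverse_log_transform([mean_log] * len(areas))

naive_mad = mad(areas, naive_pred)
naive_rmse = rmse(areas, naive_pred)
print(f"MAD = {naive_mad:.2f}, RMSE = {naive_rmse:.2f}")
```

Because RMSE squares the errors, a few large fires dominate it, while MAD weights every example equally. This is one way to see why a model can lead on one metric and not on the other.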
The columns in this dataset are:
Sample forest fires data
Prediction variables (attributes)
- X – x-axis spatial coordinate within the Montesinho park map: 1 to 9
- Y – y-axis spatial coordinate within the Montesinho park map: 2 to 9
- month – month of the year: ‘jan’ to ‘dec’
- day – day of the week: ‘mon’ to ‘sun’
- FFMC – FFMC index from the FWI system: 18.7 to 96.20
- DMC – DMC index from the FWI system: 1.1 to 291.3
- DC – DC index from the FWI system: 7.9 to 860.6
- ISI – ISI index from the FWI system: 0.0 to 56.10
- temp – temperature in Celsius degrees: 2.2 to 33.30
- RH – relative humidity in %: 15.0 to 100
- wind – wind speed in km/h: 0.40 to 9.40
- rain – outside rain in mm/m2: 0.0 to 6.4
- area – the burned area of the forest (in ha): 0.00 to 1090.84. This is the output variable; it is heavily skewed towards 0.0, so it may make sense to model it with a logarithm transform.
There are 517 observations in the dataset.
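As a rough sanity check, the ranges in the column list above can be encoded as a small schema and used to validate a row after loading. The sketch below is a minimal illustration: the names COLUMN_RANGES, MONTHS, DAYS and validate_row are hypothetical, the intermediate month and day labels are assumed to be the usual three-letter abbreviations, and the example row is made up.

```python
# Numeric ranges copied from the column list; the dictionary name is hypothetical.
COLUMN_RANGES = {
    "X": (1, 9),
    "Y": (2, 9),
    "FFMC": (18.7, 96.20),
    "DMC": (1.1, 291.3),
    "DC": (7.9, 860.6),
    "ISI": (0.0, 56.10),
    "temp": (2.2, 33.30),
    "RH": (15.0, 100),
    "wind": (0.40, 9.40),
    "rain": (0.0, 6.4),
    "area": (0.00, 1090.84),
}

# Assumed labels: the source only states 'jan' to 'dec' and 'mon' to 'sun'.
MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]
DAYS = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]

def validate_row(row):
    """Return a list of human-readable problems found in one row (a dict)."""
    problems = []
    for name, (low, high) in COLUMN_RANGES.items():
        value = row[name]
        if not low <= value <= high:
            problems.append(f"{name}={value} outside [{low}, {high}]")
    if row["month"] not in MONTHS:
        problems.append(f"unknown month {row['month']!r}")
    if row["day"] not in DAYS:
        problems.append(f"unknown day {row['day']!r}")
    return problems

# Made-up row for illustration; it is not an actual observation.
example_row = {
    "X": 7, "Y": 5, "month": "aug", "day": "fri",
    "FFMC": 90.0, "DMC": 100.0, "DC": 500.0, "ISI": 8.0,
    "temp": 20.0, "RH": 40.0, "wind": 4.0, "rain": 0.0, "area": 0.0,
}
print(validate_row(example_row))
```

Running the same check over all 517 rows would flag values that fall outside the documented ranges, for example after a unit conversion or a parsing error.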