Analysis 1: Regression

The UCI Machine Learning repository contains a variety of real-world datasets. For this example, we will analyse forest fire data, taken from here:

https://archive.ics.uci.edu/ml/datasets/Forest+Fires

For each data point in the sample, the amount of burnt land is quantified by the variable area, which gives the hectares burnt. Further information on the dataset is available in the accompanying paper here:

http://www.dsi.uminho.pt/~pcortez/fires.pdf

Split the forest fire dataset into a training set (80% of the data) and a testing set (20% of the data). Include in your report a section on the data preparation, including how you converted the date information from non-numeric to numeric data.
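The preparation step above can be sketched outside a spreadsheet as follows. This is a minimal illustration, not a prescribed method: it assumes the dates are stored as lowercase three-letter month and day abbreviations (for example "aug" and "fri"), and the function names `encode_date` and `split_train_test` are hypothetical. Mapping months to 1-12 and days to 1-7 is only one possible numeric encoding.

```python
import random

# Assumed encodings: three-letter abbreviations mapped to their calendar position.
MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]
DAYS = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]


def encode_date(month, day):
    """Convert month/day abbreviations to integers (month 1-12, day 1-7)."""
    return MONTHS.index(month.lower()) + 1, DAYS.index(day.lower()) + 1


def split_train_test(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of the rows and split it into training and testing sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]
```

Shuffling before splitting avoids a test set made up only of the last rows of the file, and the fixed seed means the same split can be reported and reproduced.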

Then:

• Build a regression model using k nearest neighbours that can predict the burnt area for new data points. To make the prediction, use a weighted average of the area values of the nearest neighbours. Be careful to specify in your report which variables were used as the input variables (you can use all of them, but you need to describe each one).

• For the same dataset, build a multiple linear regression model in Excel that can predict the area.
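The k nearest neighbours prediction described in the first bullet can be sketched as below. The brief does not say how the neighbours should be weighted, so this sketch assumes inverse-distance weights with Euclidean distance; the function name `knn_predict` and the parameters `k` and `eps` are hypothetical.

```python
import math


def knn_predict(train_X, train_y, query, k=3, eps=1e-9):
    """Predict a target value as the inverse-distance-weighted average
    of the targets of the k training points closest to the query."""
    # Pair each training target with its Euclidean distance to the query.
    by_distance = sorted(
        (math.dist(x, query), y) for x, y in zip(train_X, train_y)
    )
    nearest = by_distance[:k]
    # Closer neighbours get larger weights; eps avoids division by zero
    # when the query coincides with a training point.
    weights = [1.0 / (d + eps) for d, _ in nearest]
    weighted_sum = sum(w * y for w, (_, y) in zip(weights, nearest))
    return weighted_sum / sum(weights)
```

Because the distance mixes all input variables, variables measured on large scales will dominate it unless the inputs are rescaled first; whether and how to rescale is a choice worth stating in the report.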

By evaluating the performance of your models on the test set, explain which model gives more accurate predictions and how this was assessed. When performing the validation on the testing set, it might be useful to use the one-way table approach taken in the by-hand neural network practical. Here, you can consider using What-If Analysis → Data Table... as a way of evaluating the models on the testing set.
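One way to compare the two models on the test set is to compute an error measure for each from the actual and predicted areas. The brief does not name a metric, so the two below (mean absolute error and root mean squared error) are assumptions, and the function names are hypothetical.

```python
import math


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Square root of the average squared difference; penalises large misses more."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))
```

The model with the lower error on the test set would be judged the more accurate under that metric. The two metrics can disagree, since the root mean squared error is more sensitive to a few large misses.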