HW4
In this homework you will use the multilayer perceptron (a neural network) for regression, sklearn.neural_network.MLPRegressor - see more here
The file Training and Validation Data holds 425 rows of data, with 21 columns of input (i.e. 21 attributes) and 6 columns of output (the last 6 columns in the file are the output). The data comes from real world measurements. The instructor has reserved 131 rows of data that will not be available to you. She will use them for testing/grading purposes. You are to use the MLPRegressor in a Jupyter notebook to train an artificial neural network called net that predicts the 6 outputs when given the 21 inputs (attributes) of one 21-dimensional data point. Grading will be competitive, based on the performance of your network on the test set. For example, when I trained the net I got the following actual output vs. NN-predicted output for one single 21-dimensional data point:
actual_out = [6.26, 73.41, 70.18, 1.68, 43.82, 94.66]
NN_predict_out = [6.30, 73.05, 68.64, 1.03, 42.60, 95.30]
Your goal is to find a NN whose predictions approximate the actual output of a data point about this closely.
Submit the Jupyter notebook with your final trained network (named net) and any intermediate networks that you may have worked with before renaming your best network to net. Your notebook must contain at least 2 different NNs that you have tried.
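One way to put a number on "how close" is the absolute error per output. The sketch below uses the two example vectors above; the names abs_err and mean_abs_err are illustrative and not part of the assignment, and this is not necessarily the measure used for grading.

```python
import numpy as np

# The example vectors from the text above.
actual_out = np.array([6.26, 73.41, 70.18, 1.68, 43.82, 94.66])
NN_predict_out = np.array([6.30, 73.05, 68.64, 1.03, 42.60, 95.30])

# Absolute error for each of the 6 outputs, and their average.
abs_err = np.abs(actual_out - NN_predict_out)
mean_abs_err = abs_err.mean()

print(np.round(abs_err, 2))
print(round(float(mean_abs_err), 4))
```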
One problem with the data is that column 1 holds categorical (non-numeric) data, which cannot be stored in a numeric numpy array. You can convert the data matrix from 21 to 23 input columns either in Excel (as explained next) or in a pandas data frame in DataSpell. In Excel, replace the first column with three columns: the first new column holds a 1 (one) if the corresponding entry in the original first column is h and a 0 otherwise, the second holds a 1 (one) if the entry is hl (h followed by l as in “love”) and a 0 otherwise, and the third holds a 1 (one) if the entry is l (l as in “love”) and a 0 otherwise. For instance:
hl
l
h
l
would convert to:
0 1 0
0 0 1
1 0 0
0 0 1
To reformat this way, load the file into Excel and:
- Create a new column by clicking on the B at the top of sheet and selecting Insert: columns
- Repeat the above step two more times to obtain three total new columns.
- Click on cell B1 and insert the text: =IF(A1="h",1,0) (the formula tests the original category, which is still in column A).
- Click Ctrl-C to copy cell B1.
- Drag select cells B1 through B425 and click Ctrl-V to paste. (OR, to populate the cells B2 through B425 with the correct value, you may “grab” with your mouse the lower-right corner of cell B1, and “drag” it down that column.)
- Repeat the preceding steps for columns C and D, testing for hl in column C and for l in column D. Each formula must still refer to the original category in column A.
- Select columns B, C, and D by drag selecting B, C, and D at the top of sheet, and then, Ctrl-C to copy,
- Select Edit:Paste Special:Values:OK to replace the formulas in those columns by the actual values to which they evaluate. Check to make sure that everything looks as it should.
- Delete column A by clicking the A at the top of the sheet and selecting Edit:delete.
- Select File:Save-as...:text(Windows). Save into your folder using the name trainVal.txt. Click Yes as often as necessary to complete the save.
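If you prefer the pandas route to the Excel steps above, the same conversion could be sketched roughly as follows. The toy frame raw stands in for the real data (its single numeric column is made up), and the names raw, indicators, and encoded are illustrative only.

```python
import pandas as pd

# Toy stand-in for the raw data: column 0 is the category, column 1 is numeric.
raw = pd.DataFrame({0: ["hl", "l", "h", "l"], 1: [0.5, 1.5, 2.5, 3.5]})

# One 0/1 indicator column per category, in the order h, hl, l.
indicators = pd.DataFrame(
    {cat: (raw[0] == cat).astype(int) for cat in ["h", "hl", "l"]}
)

# Drop the original category column and put the three indicators in front.
encoded = pd.concat([indicators, raw.drop(columns=0)], axis=1)

print(encoded.to_numpy())
```

Building the indicators by explicit comparison keeps the column order fixed at h, hl, l, which matches the Excel layout.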
Load the file into a data frame df in your DataSpell notebook using
import pandas as pd
df = pd.read_csv('trainVal.txt', sep="\t", header=None)
Convert it into a numpy array with
data = df.to_numpy()
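It may be worth checking the loaded array before going further. The sketch below uses an in-memory string as a made-up stand-in for trainVal.txt (two rows, five columns); with the real file you would expect data.shape to be (425, 29) after the conversion.

```python
import io
import pandas as pd

# Two made-up tab-separated rows standing in for trainVal.txt.
fake_file = io.StringIO("0\t1\t0\t2.5\t7.1\n1\t0\t0\t3.0\t6.4\n")

df = pd.read_csv(fake_file, sep="\t", header=None)
data = df.to_numpy()

# A numeric dtype means no stray text (such as h or hl) survived the conversion.
print(data.shape)
print(data.dtype)
```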
Break the data into two sets, training and validation (or test), using your own judgement as to the relative sizes of those sets. We have done something similar in our labs - review that work if you do not remember.
Break each of the training and validation sets into matrices. Good names might be X_test, y_test, X_train, y_train. Most likely your X_test and X_train have 23 columns (inputs or attributes) and y_test and y_train have 6 columns (outputs).
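A minimal sketch of the split, assuming the first 23 columns are inputs and the last 6 are outputs. The random array is only a placeholder for your real data, and holding out 85 rows is an arbitrary choice, not a recommendation. train_test_split from sklearn.model_selection shuffles the rows before splitting.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder with the same shape as the converted data set (425 rows, 23 + 6 columns).
rng = np.random.default_rng(0)
data = rng.random((425, 29))

X = data[:, :23]   # inputs (attributes)
y = data[:, 23:]   # the last 6 columns are the outputs

# Hold out 85 rows for validation; the remaining 340 are used for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=85, random_state=1
)

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
```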
Create a feedforward neural net in variable net that matches the data. You may experiment with multiple nets (minimum two different nets), but the one you finally submit must be called net.
Train the net using the training set and test it using the validation (test) sets that you created. Refresh your memory on how to execute this step by reviewing our labs on NN.
For instance, the following creates and trains a network with two hidden layers of 30 and 20 neurons. [Of course, the input size is 23 and the output layer has 6 neurons.] Keep in mind that 400 and 100 neurons might work better - give it a try! Feel free to try other configurations as well.
from sklearn.neural_network import MLPRegressor
mlp = MLPRegressor(random_state=1, max_iter = 1000, hidden_layer_sizes = (30,20))
mlp.fit(X_train, y_train) # train the NN
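# For a regressor, score() returns R^2 (coefficient of determination; 1.0 is a perfect fit), reported here as "accuracy"
print("Accuracy on training set: {:.2f}".format(mlp.score(X_train, y_train)))
print("Accuracy on test set: {:.2f}".format(mlp.score(X_test, y_test)))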
Note: the output of the trained net when the net is applied to the validation set can be computed as
predict_test = mlp.predict(X_test)
predict_test
Make sure you check if some of these matrices need to be transposed in order to fit your net.
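As a rough illustration of the shape check: scikit-learn expects one row per data point and one column per attribute, so an array laid out the other way round needs a transpose, and a single data point needs to be reshaped into one row. The random data and the names X_by_columns, X_fixed, and one_point are assumptions for the sketch.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Random placeholder data: 60 data points, 23 inputs, 6 outputs.
rng = np.random.default_rng(0)
X_train = rng.random((60, 23))
y_train = rng.random((60, 6))

mlp = MLPRegressor(random_state=1, max_iter=200, hidden_layer_sizes=(30, 20))
mlp.fit(X_train, y_train)

# An array with data points along the columns is the wrong way round for fit/predict.
X_by_columns = X_train.T        # shape (23, 60)
X_fixed = X_by_columns.T        # back to (60, 23): one row per data point

# A single data point has shape (23,); predict() needs a 2-D array with one row.
one_point = X_train[0]
one_prediction = mlp.predict(one_point.reshape(1, -1))

print(X_fixed.shape, one_prediction.shape)
```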
Experiment as you wish, but make sure that your best performing network is in variable named net before uploading your notebook.
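One possible way to keep track of several candidates and end up with the best one in net is sketched below. The data is synthetic, the two configurations are arbitrary, and choosing by validation score is just one reasonable criterion; the names candidates and best_sizes are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

# Synthetic placeholder data: 200 points, 23 inputs, 6 outputs.
rng = np.random.default_rng(0)
X = rng.random((200, 23))
y = X @ rng.random((23, 6))
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=40, random_state=1
)

# Train each configuration and remember its validation score.
candidates = {}
for sizes in [(30, 20), (50,)]:
    model = MLPRegressor(random_state=1, max_iter=500, hidden_layer_sizes=sizes)
    model.fit(X_train, y_train)
    candidates[sizes] = (model.score(X_test, y_test), model)

# Keep the configuration with the highest validation score as net.
best_sizes = max(candidates, key=lambda s: candidates[s][0])
net = candidates[best_sizes][1]

print(best_sizes, round(candidates[best_sizes][0], 2))
```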
At the end of your notebook, write the answers to these questions in markdown, or execute code that displays the answers:
1) Size of training set
2) Size of validation/test set
3) Your net configuration (number of neurons in each layer + number of hidden layers) For example: 23 inputs x 20 neurons x 15 neurons x 6 outputs, a network with two hidden layers.
4) Accuracy on training
5) Accuracy on validation/test
6) The output given by
predict_test = mlp.predict(X_test)
predict_test
7) The output given by
y_test
8) The output given by
[predict_test[0,:], y_test[0,:]]
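The questions above could be answered in code along these lines. This is only a sketch on random placeholder data with an arbitrary 20 x 15 configuration; in your notebook the arrays and net would be the ones you actually built.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Random placeholder data standing in for your own split.
rng = np.random.default_rng(0)
X_train, X_test = rng.random((80, 23)), rng.random((20, 23))
W = rng.random((23, 6))
y_train, y_test = X_train @ W, X_test @ W

net = MLPRegressor(random_state=1, max_iter=300, hidden_layer_sizes=(20, 15))
net.fit(X_train, y_train)

# Layer sizes: inputs, each hidden layer, outputs.
layers = [X_train.shape[1], *net.hidden_layer_sizes, y_train.shape[1]]

print("1) training rows:", X_train.shape[0])
print("2) validation rows:", X_test.shape[0])
print("3)", " x ".join(map(str, layers)),
      "with", len(net.hidden_layer_sizes), "hidden layers")
print("4) training score: {:.2f}".format(net.score(X_train, y_train)))
print("5) validation score: {:.2f}".format(net.score(X_test, y_test)))

predict_test = net.predict(X_test)
print("6)", predict_test)
print("7)", y_test)
print("8)", [predict_test[0, :], y_test[0, :]])
```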