
I want to plot the learning error curve of a neural net with respect to the number of training examples. Here is the code:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

myList=[]
myList2=[]
w=[]

dataset=np.loadtxt("data", delimiter=",")
X=dataset[:, 0:6]
Y=dataset[:,6]
clf = MLPClassifier(hidden_layer_sizes=(2, 3), activation='tanh')

# split the data between training and testing
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=33)

# begin with a few training examples
X_eff=X_train[0:int(len(X_train)/150), : ]
Y_eff=Y_train[0:int(len(Y_train)/150)]

k=int(len(X_train)/150)-1


for m in range(140):

    print(m)

    w.append(k)

    # train the model and store the training error
    A=clf.fit(X_eff,Y_eff)
    myList.append(1-A.score(X_eff,Y_eff))

    # compute the testing error
    myList2.append(1-A.score(X_test,Y_test))

    # add some more training examples
    X_eff=np.vstack((X_eff,X_train[k+1:k+101,:]))
    Y_eff=np.hstack((Y_eff,Y_train[k+1:k+101]))
    k=k+100

plt.figure(figsize=(8, 8))
plt.subplots_adjust()
plt.title("Training and test error")
plt.plot(w,myList,label="training error")
plt.plot(w,myList2,label="test error")
plt.legend()
plt.show()

However, I get a very strange result: the curves fluctuate, the training error stays very close to the testing error, and that does not appear to be normal. Where is the mistake? I can't understand why there are so many ups and downs and why the training error does not increase as would be expected. Any help would be appreciated!

EDIT: the dataset I am using is https://archive.ics.uci.edu/ml/datasets/Chess+%28King-Rook+vs.+King%29 where I got rid of the classes having fewer than 1000 instances. I manually re-encoded the literal data.

  • Is there a reason why you gradually expand the training set?
    – Flomp
    Commented Aug 28, 2017 at 12:55
  • @Flomp That's how the learning curve is plotted. Commented Aug 28, 2017 at 13:01
  • Without the actual data it's very hard to say. Have you tried tuning the parameters of the MLP, like the hidden layers or the activation function? tanh IMO usually gives this kind of curve. Maybe try changing that; try 'relu' or 'logistic' in its place. Commented Aug 28, 2017 at 13:33
  • I'm sorry but I don't know much about it. Are the results different when using other activation functions? Commented Aug 28, 2017 at 14:50
  • You opened a bounty for this, meaning you aren't satisfied with the answers (that's fine), but both answers directly answer the stated question. The bounty won't help if people don't know what you're looking for. Would you mind explaining why the provided answers don't fit the bill? I would love to adapt one of them to make it work for you.
    – Andnp
    Commented Aug 31, 2017 at 17:22

3 Answers


I think that the reason you're seeing this kind of curve is that the performance metric you are measuring is different from the performance metric that you are optimizing.

Optimization metric

The neural network minimizes a loss function, and in the case of tanh activations, I assume you are using a modified version of the cross-entropy loss. If you were to plot the loss over time, you would see a more monotonically decreasing error function, as you expect. (Not actually monotonic, because neural networks are non-convex, but that's beside the point.)

Performance metric

The performance metric that you are measuring is the percent accuracy, which is different from the loss. Why are these different? The loss function tells us how much error we have in a differentiable way (which is important for fast optimization methods). The accuracy metric tells us how well we predict, which is useful for application of the neural network.
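To make the distinction concrete, here is a hedged snippet computing both metrics on the same fitted model; clf, X_test and Y_test are assumed from the question's code:

from sklearn.metrics import accuracy_score, log_loss

# accuracy counts right/wrong predictions; log_loss scores the predicted
# probabilities, which is closer to what the optimizer actually minimizes
print("accuracy:", accuracy_score(Y_test, clf.predict(X_test)))
print("cross-entropy loss:", log_loss(Y_test, clf.predict_proba(X_test)))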

Putting it together

Because you are plotting the performance of a related metric, you can expect the plot to look similar to that of your optimized metric. However, because they are not the same, you may be introducing some unaccounted-for variance in your plot (as evidenced by the plot you posted).

There are a couple of ways to fix this.

  1. Plot the loss instead of the accuracy. This doesn't fix your problem if you actually need the accuracy plot, but it will give you much smoother curves (see the sketch after this list).
  2. Plot an average over multiple runs. Save the accuracy plots over 20 independent runs of your algorithm (i.e. train the network 20 times), then average them together and plot the result. That will greatly reduce the variance.
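For the first suggestion, sklearn already records the optimized loss for you; a minimal sketch, assuming the clf, X_train and Y_train from the question (loss_curve_ is only populated for the sgd and adam solvers):

import matplotlib.pyplot as plt

clf.fit(X_train, Y_train)

# MLPClassifier stores the training loss per epoch in loss_curve_
plt.plot(clf.loss_curve_)
plt.xlabel("epoch")
plt.ylabel("training loss")
plt.title("MLPClassifier loss curve")
plt.show()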

TL;DR

Don't expect the accuracy plot to always be smooth and monotonically decreasing; it won't be.


After question edit:

Now that you've added your dataset, I see a few other things that may be causing the issues that you're seeing.

Information in magnitude

The dataset defines the rank and file (row and column) of several chess pieces, input as integers from 1 to 8. But is 2 really "one better" than 1? Is 6 really "four better" than 2? I don't think this is the case for chess positions.

Imagine I am building a classifier that takes money as an input. Is there some amount of information being portrayed by the magnitude of my values? Yes, $1 is quite different from $100; and we can tell that there is a relationship based on the magnitude.

For a chess game, does row 1 mean something different than row 8? Not at all; in fact, these dimensions are symmetrical! Using a bias unit in your network can help account for the symmetry by "rescaling" your inputs to be effectively from [-3, 4], which is now centered(ish) around 0.
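As a tiny illustration of that shift (subtracting 4 is simply what maps the 1-8 coordinates onto the [-3, 4] range mentioned above):

import numpy as np

coords = np.arange(1, 9)   # chess files/ranks encoded as 1..8
centered = coords - 4      # -> [-3 -2 -1  0  1  2  3  4]
print(centered)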

Solutions

I think, however, you would get the most mileage out of tile-coding or one-hot encoding each of your features (a minimal sketch follows). Don't allow the network to rely on the information contained in the magnitude of each feature, as that may be pushing the network into bad local optima.
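Here is one way to do the one-hot encoding with sklearn's OneHotEncoder; X is assumed to be the 6-column integer feature matrix from the question:

from sklearn.preprocessing import OneHotEncoder

# each integer board coordinate becomes its own group of binary columns,
# so the network can no longer read meaning into the magnitudes
encoder = OneHotEncoder(sparse_output=False)  # use sparse=False on older sklearn versions
X_onehot = encoder.fit_transform(X)
print(X.shape, "->", X_onehot.shape)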

  • Well, I was not expecting a monotonic function, but here the variance seems indeed very high. Moreover, I made similar curves with SVMs and there I got quite smooth, monotonic curves. What interests me is actually the accuracy (or the error), because I am interested in comparing it to VC bounds
    – MysteryGuy
    Commented Aug 28, 2017 at 15:03
  • However, thanks for your explanation about the difference between accuracy and loss (if I've understood well, accuracy is a "good or bad classification" measure, practical, while the loss is more "mathematical"). So +1 for that, but I can't accept your answer because it does not meet my requirements
    – MysteryGuy
    Commented Aug 28, 2017 at 15:08
  • Maybe I'm unclear as to the requirements then. If you can make the question more clear, maybe you can get the answers you're looking for.
    – Andnp
    Commented Aug 28, 2017 at 15:37
  • There are a lot of differences between the optimization of an SVM and an ANN; I am not surprised that SVMs had a smoother curve. There certainly could be some bug in the code causing this, but the most obvious issue to me is that the ANN accuracy curve is not guaranteed to be smooth. In fact, it rarely is.
    – Andnp
    Commented Aug 28, 2017 at 15:45
  • Thanks for your answer; it seems indeed that one-hot encoding is a critical point in this case... Do you have a proper example of how to do that? I was not really convinced by the one from Scikit-Learn
    – MysteryGuy
    Commented Sep 8, 2017 at 7:39

In addition to the previous answers, you should also keep in mind that you might have to tweak the learning rate of the network (by setting learning_rate_init=value in the initializer; note that in sklearn's MLPClassifier the numeric rate is learning_rate_init, while learning_rate only selects the schedule). If you choose the rate too big, you will jump from one local minimum to another, or circle around these points, and won't actually converge.
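A minimal sketch of lowering the rate, mirroring the architecture from the question; the value 0.0001 is illustrative, not tuned:

from sklearn.neural_network import MLPClassifier

clf = MLPClassifier(hidden_layer_sizes=(2, 3),
                    activation='tanh',
                    learning_rate_init=0.0001,  # default is 0.001
                    max_iter=500)               # allow more epochs to converge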

Furthermore, please also plot the loss and not just the accuracy of your network. This will give you better insight into its behavior.

Also, keep in mind that you have to use a lot of training and test data to get a more or less "smooth", or even representative, curve; if you are using just a few (maybe a few hundred) data points, the resulting metrics will not be very accurate, as they contain a lot of stochastic noise. To reduce this, you should not train the network with the same examples in the same order every time, but rather shuffle your training data and perhaps split it up into mini-batches. I am very confident that you can solve, or at least reduce, your problem by minding these aspects and implementing them.

Depending on your kind of problem, you should change the activation function to something different from tanh. For classification, a OneHotEncoder might also be useful (if your data is not already one-hot encoded); sklearn offers an implementation of this, too. A hedged configuration sketch combining these suggestions follows.
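All parameter values below are illustrative, not tuned; the sketch just shows where these knobs live on MLPClassifier:

from sklearn.neural_network import MLPClassifier

clf = MLPClassifier(hidden_layer_sizes=(2, 3),
                    activation='relu',  # instead of tanh
                    batch_size=64,      # smaller mini-batches
                    shuffle=True,       # reshuffle training data each epoch
                    max_iter=500)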

  • I tried the relu activation function, as advised by many, and other architectures, but the curves were similar to the ones I had previously
    – MysteryGuy
    Commented Sep 6, 2017 at 12:14
  • @MysteryGuy Have you also decreased the learning rate? Did you try using the OneHotEncoder? Maybe you could supply us with the data set, so that we can try it out ourselves.
    – zimmerrol
    Commented Sep 6, 2017 at 12:15
  • About mini-batches, do you have any implementation other than stackoverflow.com/questions/38157972/… ?
    – MysteryGuy
    Commented Sep 8, 2017 at 11:13
  • I mean, is it possible to deal with that directly in sklearn?
    – MysteryGuy
    Commented Sep 8, 2017 at 11:14
  • @MysteryGuy I did not know that the MLPClassifier does this internally. You might try setting batch_size to something smaller (like 64) in the initialization. Also, you might increase the total number of epochs by setting max_iter to some higher value.
    – zimmerrol
    Commented Sep 8, 2017 at 12:40

Randomize training set and repeat

If you would like a fair comparison of the effect of the number of training samples on the accuracy, I suggest randomly picking n_samples from your training set instead of adding 100 samples to the previous batch. You would also repeat the fit N_repeat times for each n_samples value.

This would give something like the following (not tested):

n_samples_array = np.arange(100, len(X_train), 100)
N_repeat = 10
myList, myList2 = [], []

for n_samples in n_samples_array:
    print(n_samples)

    # repeat the fit several times and take the mean
    myList_tmp, myList2_tmp = [], []
    for repeat in range(N_repeat):
        # randomly pick n_samples training examples without replacement
        selection = np.random.choice(len(X_train), n_samples, replace=False)

        # train the model and store the training error
        A = clf.fit(X_train[selection], Y_train[selection])
        myList_tmp.append(1 - A.score(X_train[selection], Y_train[selection]))

        # compute the testing error
        myList2_tmp.append(1 - A.score(X_test, Y_test))

    myList.append(np.mean(myList_tmp))
    myList2.append(np.mean(myList2_tmp))

Warm start

When you use the fit function, you restart the optimization from scratch. If you would like to see the improvement in your optimization when adding a few samples to the same previously trained network, you can use the option warm_start=True (a minimal sketch follows the quoted documentation).

As per the documentation:

warm_start : bool, optional, default False

When set to True, reuse the solution of the previous call to fit as initialization, otherwise, just erase the previous solution.
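Here is a minimal sketch of that idea, reusing the names from the question; the chunk size of 100 is illustrative:

from sklearn.neural_network import MLPClassifier

clf = MLPClassifier(hidden_layer_sizes=(2, 3), activation='tanh',
                    warm_start=True, max_iter=200)

for end in range(100, len(X_train) + 1, 100):
    # each fit call continues from the previously learned weights
    clf.fit(X_train[:end], Y_train[:end])
    print(end, 1 - clf.score(X_test, Y_test))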

  • Thanks for your answer! However, although my testing error is decreasing (which is fine), my training error is decreasing too; that isn't normal, is it?
    – MysteryGuy
    Commented Aug 31, 2017 at 11:10
  • If you have fewer training samples, you tend to overfit, and therefore the training error can be very low. However, your network is extremely small (2 hidden layers with 2 and 3 neurons respectively). Hard to say without seeing the data (on a scatter plot, for example). I wouldn't be surprised if the network didn't really learn anything, given the low number of neurons and the low prediction accuracy (r-square).
    – nbeuchat
    Commented Aug 31, 2017 at 17:33
  • Actually, my last simulation was done on a neural network containing 10 hidden layers with 100 neurons each
    – MysteryGuy
    Commented Sep 1, 2017 at 6:46
  • @MysteryGuy I see, so you basically have more than 100k free parameters for fewer than 15k data points. Did you try the warm_start option? It might help with smoothing your curve. Also, I would recommend using another activation function: relu is currently the most widely used in deep learning. Sigmoids have issues with very slow learning due to the vanishing gradient problem.
    – nbeuchat
    Commented Sep 5, 2017 at 14:41
