
I am trying to create an ML model (regression) using various techniques like SMR, Logistic Regression, and others. With all of these techniques, I'm not able to get an efficiency of more than 35%. Here's what I'm doing:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X_data = [X_data_distance]                      # single feature: distance
X_data = np.vstack(X_data).astype(np.float64)
X_data = X_data.T                               # shape (10000, 1)
y_data = X_data_orders                          # shape (10000,)
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.33, random_state=42)
svr_rbf = SVC(kernel='rbf', C=1.0)              # SVC (classification) with an RBF kernel
svr_rbf.fit(X_train, y_train)
plt.plot(X_data_distance, svr_rbf.predict(X_data), color='red', label='RBF model')

For the plot, I'm getting the following:

[plot: the red RBF-model predictions over the data points]

I have tried various parameter tuning: changing the parameter C and gamma, and even trying different kernels, but nothing changes the accuracy. I even tried SVR and Logistic Regression instead of SVC, but nothing helps. I also tried different scaling for the training input data, like StandardScaler() and scale().

I used this as a reference

What should I do?

  • Seems like the prediction you get could be interpreted as valid... Commented Nov 17, 2018 at 18:15
  • Assuming that "efficiency" means error rate, then that seems pretty good for this data. Commented Nov 17, 2018 at 18:17
  • @MatthieuBrucher, forgive me but I don't quite understand what you mean. Commented Nov 17, 2018 at 18:19
  • @GordonLinoff, I'm hoping for more of a bell shaped output like the data points. Commented Nov 17, 2018 at 18:20
  • put a link to the dataset you're using please
    – Yahya
    Commented Nov 17, 2018 at 18:32

1 Answer


As a rule of thumb, we usually follow this convention:

  1. For a small number of features, go with Logistic Regression.
  2. For a lot of features but not a lot of data, go with SVM.
  3. For a lot of features and a lot of data, go with a Neural Network.

Because your dataset has 10K cases, it'd be better to use Logistic Regression, because SVM will take forever to finish!


Nevertheless, because your dataset contains a lot of classes, there is a chance of class imbalance in your implementation. Thus I tried to work around this problem by using StratifiedKFold instead of train_test_split, which doesn't guarantee balanced classes in the splits.

Moreover, I used GridSearchCV with StratifiedKFold to perform cross-validation in order to tune the parameters and try all the different solvers!

So the full implementation is as follows:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, StratifiedShuffleSplit
import numpy as np


def getDataset(path, x_attr, y_attr):
    """
    Extract dataset from CSV file
    :param path: location of csv file
    :param x_attr: list of Features Names
    :param y_attr: Y header name in CSV file
    :return: tuple, (X, Y)
    """
    df = pd.read_csv(path)
    X = np.array(df[x_attr]).reshape(len(df), len(x_attr))
    Y = np.array(df[y_attr])
    return X, Y

def stratifiedSplit(X, Y):
    """
    Single stratified 80/20 hold-out split that preserves class proportions
    :return: tuple, (X_train, X_test, Y_train, Y_test)
    """
    sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
    train_index, test_index = next(sss.split(X, Y))
    X_train, X_test = X[train_index], X[test_index]
    Y_train, Y_test = Y[train_index], Y[test_index]
    return X_train, X_test, Y_train, Y_test


def run(X_data, Y_data):
    """
    Tune LogisticRegression with GridSearchCV + StratifiedKFold and print
    the train and test accuracy of the best estimator
    """
    X_train, X_test, Y_train, Y_test = stratifiedSplit(X_data, Y_data)
    # note: only 'liblinear' and 'saga' support the 'l1' penalty; the remaining
    # solvers support 'l2' only, so those combinations may fail during the search
    param_grid = {'C': [0.01, 0.1, 1, 10, 100, 1000], 'penalty': ['l1', 'l2'],
                  'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']}
    model = LogisticRegression(random_state=0)
    clf = GridSearchCV(model, param_grid, cv=StratifiedKFold(n_splits=10))
    clf.fit(X_train, Y_train)
    print(accuracy_score(Y_train, clf.best_estimator_.predict(X_train)))
    print(accuracy_score(Y_test, clf.best_estimator_.predict(X_test)))


X_data, Y_data = getDataset("data - Sheet1.csv", ['distance'], 'orders')

run(X_data, Y_data)

Despite all the attempts with all the different algorithms, the accuracy didn't exceed 36%!


Why is that?

If you want to make a person recognize/classify another person by their T-shirt color, you cannot say: hey, if it's red that means he's John, if it's red it's Peter, and if it's red it's Aisling!! He would say: "really, what the heck is the difference?!"

And that's exactly what is in your dataset!

Simply run print(len(np.unique(X_data))) and print(len(np.unique(Y_data))) and you'll find that the numbers are very odd. In a nutshell, you have:

Number of Cases: 10000 !!
Number of Classes: 118 !!
Number of Unique Inputs (i.e. Features): 66 !!
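
As a quick check you can reproduce these counts yourself (a minimal sketch, assuming X_data is the (10000, 1) feature array and Y_data the targets loaded by getDataset above), and also count how many of the distinct distance values are shared by more than one class:

import numpy as np

print("Number of cases:", len(Y_data))                     # 10000
print("Number of classes:", len(np.unique(Y_data)))        # 118
print("Number of unique inputs:", len(np.unique(X_data)))  # 66

# how many of the distinct distance values are shared by more than one class
shared = sum(len(np.unique(Y_data[X_data.ravel() == v])) > 1
             for v in np.unique(X_data))
print("Inputs shared by more than one class:", shared)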

All the classes share a huge amount of information, which makes it impressive to reach even 36% accuracy!

In other words, you have no informative features, which leads to a lack of uniqueness in each class's model!


What to do? I believe you are not allowed to remove some classes, so the only two solutions you have are:

  1. Either live with this very valid result.

  2. Or add more informative feature(s).


Update

Now that you have provided the same dataset but with more features (i.e. the complete set of features), the situation is different.

I recommend you do the following:

  1. Pre-process your dataset (i.e. prepare it by imputing missing values or deleting rows containing missing values, converting dates to some unique values (example), etc.).

  2. Check which features are most important to the Orders classes; you can achieve that by using forests of trees to evaluate the importance of features. Here is a complete and simple example of how to do that in Scikit-Learn (a rough sketch also follows this list).

  3. Create a new version of the dataset, but this time keep Orders as the Y response and the above-found features as the X variables.

  4. Follow the same GridSearchCV and StratifiedKFold procedure that I showed you in the implementation above.
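
Here is a rough sketch of steps 2-4. The CSV path, the column names, the use of ExtraTreesClassifier for the feature-importance step, and keeping the top 5 features are all assumptions for illustration; adapt them to your actual dataset, which must already be fully numeric after step 1:

import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# hypothetical file/column names; data assumed to be numeric after pre-processing
df = pd.read_csv("full_data.csv").dropna()
X_all = df.drop(columns=["orders"])   # all candidate features
Y = df["orders"].values               # Orders as the Y response

# step 2: rank the features with a forest of trees
forest = ExtraTreesClassifier(n_estimators=250, random_state=0)
forest.fit(X_all, Y)
importances = pd.Series(forest.feature_importances_, index=X_all.columns)
top_features = importances.sort_values(ascending=False).head(5).index.tolist()
print("Most informative features:", top_features)

# step 3: new version of the dataset with only the informative features
X = X_all[top_features].values

# step 4: the same GridSearchCV + StratifiedKFold procedure as above
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, stratify=Y, random_state=0)
param_grid = {'C': [0.01, 0.1, 1, 10, 100, 1000], 'penalty': ['l1', 'l2'],
              'solver': ['liblinear', 'saga']}   # both support 'l1' and 'l2'
clf = GridSearchCV(LogisticRegression(random_state=0, max_iter=5000),
                   param_grid, cv=StratifiedKFold(n_splits=10))
clf.fit(X_train, Y_train)
print(accuracy_score(Y_test, clf.best_estimator_.predict(X_test)))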


Hint

As mentioned by Vivek Kumar in the comments below, a stratify parameter has been added to the train_test_split function in a Scikit-learn update.

It works by passing the array-like ground truth to stratify, so you don't need my workaround in the stratifiedSplit(X, Y) function above.
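
For example (a minimal sketch, assuming the same X_data and Y_data as above):

from sklearn.model_selection import train_test_split

# stratify=Y_data keeps the class proportions identical in both splits
X_train, X_test, Y_train, Y_test = train_test_split(
    X_data, Y_data, test_size=0.2, stratify=Y_data, random_state=0)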

  • Hi @Yahya, thanks a lot for the explanation. Here's another link to the dataset. What do you suggest I do in this case: docs.google.com/spreadsheets/d/… Commented Nov 18, 2018 at 5:14
  • thanks a lot!! I changed my model and now, with steps, I achieved an accuracy of 80% on predicting the amount spent. Commented Nov 18, 2018 at 13:50
  • @SarvagyaGupta Glad I could help :)
    – Yahya
    Commented Nov 18, 2018 at 15:10
  • train_test_split also has a stratify param along with test_size. Is there a special need to use the custom stratifiedSplit method here? Commented Nov 19, 2018 at 7:34
  • @VivekKumar Yeah, you're right and I'm aware of that; it has been added in a Scikit-learn update. However, it threw an exception when I used it. I forgot what it was, and when I checked I did not have time to fix it, as my main focus was on finding the reason rather than how to implement it. I will add it to the answer as a hint.
    – Yahya
    Commented Nov 19, 2018 at 9:35
