21

I am running a logistic regression on tf-idf features computed from a text column. This is the only column I use in my logistic regression. How can I ensure the parameters for this are tuned as well as possible?

I would like to be able to run through a set of steps which would ultimately allow me to say that my Logistic Regression classifier is running as well as it possibly can.

from sklearn import metrics,preprocessing,cross_validation
from sklearn.feature_extraction.text import TfidfVectorizer
import sklearn.linear_model as lm
import numpy as np
import pandas as p
loadData = lambda f: np.genfromtxt(open(f, 'r'), delimiter=' ')

print "loading data.."
traindata = list(np.array(p.read_table('train.tsv'))[:, 2])
testdata = list(np.array(p.read_table('test.tsv'))[:, 2])
y = np.array(p.read_table('train.tsv'))[:, -1]

tfv = TfidfVectorizer(min_df=3, max_features=None, strip_accents='unicode',
                      analyzer='word', token_pattern=r'\w{1,}', 
                      ngram_range=(1, 2), use_idf=1, smooth_idf=1, 
                      sublinear_tf=1)

rd = lm.LogisticRegression(penalty='l2', dual=True, tol=0.0001, 
                           C=1, fit_intercept=True, intercept_scaling=1.0, 
                           class_weight=None, random_state=None)

X_all = traindata + testdata
lentrain = len(traindata)

print "fitting pipeline"
tfv.fit(X_all)
print "transforming data"
X_all = tfv.transform(X_all)

X = X_all[:lentrain]
X_test = X_all[lentrain:]

print "20 Fold CV Score: ", np.mean(cross_validation.cross_val_score(rd, X, y, cv=20, scoring='roc_auc'))

print "training on full data"
rd.fit(X, y)
pred = rd.predict_proba(X_test)[:, 1]
testfile = p.read_csv('test.tsv', sep="\t", na_values=['?'], index_col=1)
pred_df = p.DataFrame(pred, index=testfile.index, columns=['label'])
pred_df.to_csv('benchmark.csv')
print "submission file created.."
5
  • 1
    Could you please explain better what you are trying to achieve? What hyperparameters are you trying to tune? Logistic regression does not have any hyperparameters.
    – George
    Commented Feb 16, 2014 at 20:58
  • 1
    @George Apologies for not being clear. I just want to ensure that the parameters I pass into my Logistic Regression are the best possible ones. I would like to be able to run through a set of steps which would ultimately allow me to say that my Logistic Regression classifier is running as well as it possibly can. Commented Feb 16, 2014 at 21:04
  • 2
    @George scikit-learn's logistic regression takes several regularization parameters.
    – Fred Foo
    Commented Feb 17, 2014 at 10:02
  • 1
    So it is not a plain logistic regression, but an L1- or L2-regularized version?
    – George
    Commented Feb 17, 2014 at 10:44
  • 1
    @George Logistic regression in scikit-learn also has a C parameter that controls the sparsity of the model. Commented Nov 10, 2017 at 21:05

3 Answers

35

You can use grid search to find the best value of C. A smaller C specifies stronger regularization.

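>>> from sklearn.linear_model import LogisticRegression
>>> # GridSearchCV lived in sklearn.grid_search before scikit-learn 0.18
>>> from sklearn.model_selection import GridSearchCV
>>> param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000] }
>>> clf = GridSearchCV(LogisticRegression(penalty='l2'), param_grid)
>>> clf
GridSearchCV(cv=None,
             estimator=LogisticRegression(C=1.0, intercept_scaling=1,
               dual=False, fit_intercept=True, penalty='l2', tol=0.0001),
             param_grid={'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]})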
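The claim that a smaller C means stronger regularization can be checked empirically. The following is a minimal sketch on synthetic data (the dataset and the names `X_demo`, `y_demo`, and `coef_norms` are assumptions made for illustration, not part of the question): it fits the same L2-penalized model for several values of C and prints the norm of the learned weight vector.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, used only to illustrate the effect of C.
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)

coef_norms = {}
for C in [0.001, 0.1, 10, 1000]:
    model = LogisticRegression(C=C, max_iter=5000)
    model.fit(X_demo, y_demo)
    # Smaller C means a stronger penalty, which pulls the weights toward zero.
    coef_norms[C] = float(np.linalg.norm(model.coef_))
    print(f"C={C:<7} ||w|| = {coef_norms[C]:.3f}")
```

The weight norm should grow as C increases, since a large C lets the model fit the training data more closely.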

See the GridSearchCV documentation for more details on your application.
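Because the question also involves a TfidfVectorizer, one option is to wrap the vectorizer and the classifier in a Pipeline and search over both at once. The sketch below assumes a tiny made-up corpus (`texts`, `labels`) and an arbitrary choice of candidate values; neither comes from the question, and the grid would need adjusting for real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny made-up corpus so the sketch runs on its own; replace with the real text column.
texts = ["good movie", "great film", "loved this movie", "really great acting",
         "bad movie", "terrible film", "hated this movie", "really bad acting"] * 5
labels = [1, 1, 1, 1, 0, 0, 0, 0] * 5

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(sublinear_tf=True)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# "step__param" addresses a parameter of a named pipeline step.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "tfidf__min_df": [1, 2],
    "clf__C": [0.01, 0.1, 1, 10, 100],
}

search = GridSearchCV(pipe, param_grid, scoring="roc_auc", cv=5)
search.fit(texts, labels)
print(search.best_params_)
print("best CV AUC: %.3f" % search.best_score_)
```

Putting the vectorizer inside the pipeline means it is refit on the training part of every fold, so the cross-validation score is not influenced by vocabulary or idf statistics taken from the held-out part.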

4
  • 1
    Many thanks for this. Can you please explain to me what the C value does? I can find no specific information on this anywhere. Commented Feb 17, 2014 at 0:57
  • 1
    courses.cs.washington.edu/courses/cse599c1/13wi/slides/… See the formula on page 3, where lambda is the coefficient of the regularization term. sklearn instead divides the objective by lambda, so that the coefficient of the sum of squared weights equals 1; C therefore plays the role of 1/lambda. I guess they do this to be consistent with the cost function in SVM.
    – lennon310
    Commented Feb 17, 2014 at 1:11
  • 1
    I've been experimenting with Scikit-Optimize and it runs really quickly using a model based optimization.
    – O.rka
    Commented Jul 9, 2017 at 20:59
  • 1
    @lennon310 Can we use a C of 10 or 100 if my log loss keeps improving?
    – Anjith
    Commented Mar 27, 2019 at 21:48
6

You can use the code below for a more general search:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

LR = LogisticRegression()
# Note: 'l1' is only supported by the 'liblinear' and 'saga' solvers,
# so some of the combinations below will fail to fit.
LRparam_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
    'penalty': ['l1', 'l2'],
    # 'max_iter': list(range(100,800,100)),
    'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
}
LR_search = GridSearchCV(LR, param_grid=LRparam_grid, refit = True, verbose = 3, cv=5)

# fitting the model for grid search (X_train and y_train are assumed to exist)
LR_search.fit(X_train , y_train)
LR_search.best_params_
# summarize
print('Mean Accuracy: %.3f' % LR_search.best_score_)
print('Config: %s' % LR_search.best_params_)
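Not every solver supports every penalty: 'newton-cg', 'lbfgs', and 'sag' accept only 'l2' among the two penalties listed, while 'liblinear' and 'saga' accept both 'l1' and 'l2'. One way to keep the search to valid combinations is to pass a list of grids instead of a single dict. This is a sketch on synthetic data (the `make_classification` stand-in for `X_train`/`y_train`, the name `valid_grid`, and the raised `max_iter` are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train / y_train.
X_train, y_train = make_classification(n_samples=300, n_features=10, random_state=0)

C_values = [0.001, 0.01, 0.1, 1, 10, 100, 1000]
# A list of dicts searches each sub-grid separately, so only valid
# solver/penalty pairs are ever tried.
valid_grid = [
    {"solver": ["newton-cg", "lbfgs", "sag"], "penalty": ["l2"], "C": C_values},
    {"solver": ["liblinear", "saga"], "penalty": ["l1", "l2"], "C": C_values},
]

# error_score="raise" surfaces any failing configuration instead of scoring it as NaN.
LR_search = GridSearchCV(LogisticRegression(max_iter=5000), valid_grid,
                         cv=5, error_score="raise")
LR_search.fit(X_train, y_train)
print("Mean accuracy: %.3f" % LR_search.best_score_)
print("Config: %s" % LR_search.best_params_)
```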
3
  • 3
    Please note that this code doesn't work, because there are illegal hyperparameter configurations. And for some baffling reason, sklearn silently fails instead of raising errors. Use error_score='raise' to view them. Commented Feb 15, 2021 at 13:25
  • I believe your error is not caused directly by the code. Sometimes the data type and the data distribution can cause problems in the computations. Commented Feb 16, 2021 at 17:36
  • @ShihabShahriarKhan I think maybe the problem comes from the hyperparameter "max_iter". It works better now. Can you double-check? Commented Feb 28, 2022 at 16:50
3

Grid search is a brute-force way of finding the optimal parameters because it trains and tests every possible combination. A more efficient approach is Bayesian optimization, which learns from past evaluation scores and takes less computation time.
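Bayesian optimization needs a third-party library (scikit-optimize is mentioned in a comment above). As a stand-in that uses only scikit-learn and SciPy, the sketch below uses randomized search instead. Note that this is not Bayesian optimization: it does not learn from past evaluations, it only avoids enumerating a full grid by sampling a fixed number of candidates. The synthetic data and the search range for C are assumptions chosen for illustration.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data so the sketch runs on its own.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

search = RandomizedSearchCV(
    LogisticRegression(max_iter=5000),
    # Sample C on a log scale between 1e-3 and 1e3 rather than from a fixed list.
    param_distributions={"C": loguniform(1e-3, 1e3)},
    n_iter=15,          # number of sampled candidates, independent of grid size
    cv=5,
    scoring="roc_auc",
    random_state=0,
)
search.fit(X_demo, y_demo)
print(search.best_params_)
print("best CV AUC: %.3f" % search.best_score_)
```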

1
  • 21
    Would you please share some example source code for your solution? Otherwise, this post does not seem to provide a quality answer to the question and should be posted as a comment to the question. Commented Aug 5, 2018 at 15:14
