Fine-tuning parameters in Logistic Regression

Question

I am running a logistic regression on tf-idf features computed from a text column. This is the only column I use in my logistic regression. How can I ensure the parameters for this are tuned as well as possible?

I would like to be able to run through a set of steps which would ultimately allow me to say that my Logistic Regression classifier is running as well as it possibly can.

import numpy as np
import pandas as p
import sklearn.linear_model as lm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score

loadData = lambda f: np.genfromtxt(open(f, 'r'), delimiter=' ')

print("loading data..")
# Column 2 holds the text; the last column of train.tsv holds the label.
traindata = list(np.array(p.read_table('train.tsv'))[:, 2])
testdata = list(np.array(p.read_table('test.tsv'))[:, 2])
# Labels are assumed to be integer class ids.
y = np.array(p.read_table('train.tsv'))[:, -1].astype(int)

tfv = TfidfVectorizer(min_df=3, max_features=None, strip_accents='unicode',
                      analyzer='word', token_pattern=r'\w{1,}',
                      ngram_range=(1, 2), use_idf=True, smooth_idf=True,
                      sublinear_tf=True)

# dual=True is only supported by the liblinear solver.
rd = lm.LogisticRegression(penalty='l2', dual=True, tol=0.0001,
                           C=1, fit_intercept=True, intercept_scaling=1.0,
                           class_weight=None, random_state=None,
                           solver='liblinear')

X_all = traindata + testdata
lentrain = len(traindata)

print("fitting pipeline")
tfv.fit(X_all)
print("transforming data")
X_all = tfv.transform(X_all)

X = X_all[:lentrain]
X_test = X_all[lentrain:]

print("20 Fold CV Score: ", np.mean(cross_val_score(rd, X, y, cv=20, scoring='roc_auc')))

print("training on full data")
rd.fit(X, y)
pred = rd.predict_proba(X_test)[:, 1]
testfile = p.read_csv('test.tsv', sep="\t", na_values=['?'], index_col=1)
pred_df = p.DataFrame(pred, index=testfile.index, columns=['label'])
pred_df.to_csv('benchmark.csv')
print("submission file created..")

Could you please explain better what you are trying to achieve? What hyperparameters are you trying to tune? Logistic regression does not have any hyperparameters. – George, Commented Feb 16, 2014 at 20:58
@George Apologies for not being clear. I just want to ensure that the parameters I pass into my Logistic Regression are the best possible ones. I would like to be able to run through a set of steps which would ultimately allow me to say that my Logistic Regression classifier is running as well as it possibly can. – Simon Kiely, Commented Feb 16, 2014 at 21:04
@George scikit-learn's logistic regression takes several regularization parameters. – Fred Foo, Commented Feb 17, 2014 at 10:02
So it is not a logistic regression, but it's an L1 or L2 regularized version? – George, Commented Feb 17, 2014 at 10:44
@George Logistic regression in scikit-learn also has a C parameter that controls the sparsity of the model. – WestCoastProjects, Commented Nov 10, 2017 at 21:05

FatihAkici · Accepted Answer · 2018-06-08 23:53:58Z

35

You can use grid search to find the best C value for your data. Smaller values of C specify stronger regularization.

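>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.model_selection import GridSearchCV
>>> param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
>>> clf = GridSearchCV(LogisticRegression(penalty='l2'), param_grid)
>>> clf  # the exact repr varies by scikit-learn version
GridSearchCV(cv=None,
             estimator=LogisticRegression(C=1.0, intercept_scaling=1,
               dual=False, fit_intercept=True, penalty='l2', tol=0.0001),
             param_grid={'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]})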

See the GridSearchCV documentation for more details on how to apply it to your problem.
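
As a minimal sketch of how this could be applied to the question's setup, the vectorizer and the classifier can be wrapped in a Pipeline, so that tf-idf is refit inside each cross-validation fold and its settings can be searched together with C. The toy corpus, the labels, the step names (`tfidf`, `clf`) and the grid values below are illustrative assumptions, not part of the original answer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus standing in for the text column (illustrative only).
texts = [
    "cheap pills buy now", "limited offer buy cheap", "win money now",
    "cheap money offer", "buy pills win prize", "free prize claim now",
    "meeting agenda for monday", "project status report", "lunch at noon today",
    "quarterly report attached", "agenda for the project review", "see you at lunch",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Hypothetical step names; the "<step>__<param>" keys below refer to them.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer(sublinear_tf=True)),
    ("clf", LogisticRegression(max_iter=1000)),
])

param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.01, 0.1, 1, 10, 100],
}

# Every combination is scored by 3-fold cross-validated ROC AUC.
search = GridSearchCV(pipe, param_grid, cv=3, scoring="roc_auc")
search.fit(texts, labels)

print("Best parameters:", search.best_params_)
print("Best CV ROC AUC: %.3f" % search.best_score_)
```

After fitting, `search.best_estimator_` is the pipeline refit on all of the data with the best parameters, so it can be used directly for `predict_proba` on raw text.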

edited Jun 8, 2018 at 23:53

FatihAkici

answered Feb 17, 2014 at 0:34

lennon310

1

Many thanks for this. Can you please explain to me what the C value does? I can find no specific information on this anywhere.
– Simon Kiely
Commented Feb 17, 2014 at 0:57
1

courses.cs.washington.edu/courses/cse599c1/13wi/slides/… see the formula on page 3. There, lambda multiplies the regularization term. In sklearn the objective is instead divided through by lambda, so that the coefficient of the sum of squared weights is fixed and C, the inverse of lambda, scales the loss term. I guess they do this to be consistent with the cost function in SVM.
– lennon310
Commented Feb 17, 2014 at 1:11
1

I've been experimenting with Scikit-Optimize and it runs really quickly using a model based optimization.
– O.rka
Commented Jul 9, 2017 at 20:59
1

@lennon310 Can we use a C of 10 or 100 if my log loss keeps improving?
– Anjith
Commented Mar 27, 2019 at 21:48
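
To illustrate the role of C discussed in these comments, here is a small sketch that fits the same L2-penalized model with three values of C and prints the norm of the learned weight vector. The synthetic dataset from `make_classification` and the chosen C values are assumptions made only for illustration. A smaller C means a stronger penalty, so the weights are shrunk further towards zero.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (illustrative only).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

norms = []
for C in [0.01, 1, 100]:
    # The default penalty is L2; C is the inverse of the regularization strength.
    model = LogisticRegression(C=C, max_iter=5000).fit(X, y)
    norms.append(float(np.linalg.norm(model.coef_)))
    print("C=%g  ||w||=%.3f" % (C, norms[-1]))
```

A larger C is not wrong in itself. Whether 10 or 100 is appropriate is best judged by the cross-validated score rather than by the training loss, which keeps improving as the penalty is relaxed.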

Amin Khodamoradi · Accepted Answer · 2022-02-28 16:49:35Z

6

You can use the code below for a more general search:

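from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

LR = LogisticRegression()
LRparam_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
    'penalty': ['l1', 'l2'],
    # 'max_iter': list(range(100,800,100)),
    # Not every solver supports 'l1' (newton-cg, lbfgs and sag do not);
    # those combinations fail to fit and are scored as NaN by default.
    'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
}
LR_search = GridSearchCV(LR, param_grid=LRparam_grid, refit=True, verbose=3, cv=5)

# Fit the grid search (X_train and y_train are the training features and labels)
LR_search.fit(X_train, y_train)
# Summarize the best configuration
print('Mean Accuracy: %.3f' % LR_search.best_score_)
print('Config: %s' % LR_search.best_params_)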

edited Feb 28, 2022 at 16:49

answered Dec 8, 2020 at 17:58

Amin Khodamoradi

3

Please note that this code doesn't work, because there are illegal hyperparameter configurations. And for some baffling reason, sklearn silently fails instead of raising errors. Use error_score='raise' to view them.
– Shihab Shahriar Khan
Commented Feb 15, 2021 at 13:25
I believe your error is not caused directly by the code. Sometimes the data type and the data distribution can be the reason for problems in the computations.
– Amin Khodamoradi
Commented Feb 16, 2021 at 17:36
@ShihabShahriarKhan I think the problem may come from the "max_iter" hyperparameter. It works better now. Can you double-check?
– Amin Khodamoradi
Commented Feb 28, 2022 at 16:50
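
One way to avoid the illegal combinations raised in these comments is to pass GridSearchCV a list of grids, each pairing solvers only with the penalties they support. The sketch below assumes synthetic data from `make_classification` in place of `X_train` and `y_train`, and the grid values are illustrative. It also sets `error_score='raise'`, as suggested above, so that any remaining invalid configuration fails loudly.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data (illustrative only).
X_train, y_train = make_classification(n_samples=200, n_features=10, random_state=0)

C_values = [0.01, 0.1, 1, 10, 100]
param_grid = [
    # liblinear and saga support both L1 and L2 penalties.
    {"solver": ["liblinear", "saga"], "penalty": ["l1", "l2"], "C": C_values},
    # lbfgs, newton-cg and sag support L2 but not L1.
    {"solver": ["lbfgs", "newton-cg", "sag"], "penalty": ["l2"], "C": C_values},
]

search = GridSearchCV(
    LogisticRegression(max_iter=5000),
    param_grid=param_grid,
    cv=5,
    error_score="raise",  # raise instead of recording NaN for failed fits
)
search.fit(X_train, y_train)

print("Config:", search.best_params_)
print("Mean CV accuracy: %.3f" % search.best_score_)
```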

viplov · Accepted Answer · 2018-08-05 14:50:24Z

3

Grid search is a brute-force way of finding the optimal parameters because it trains and tests every possible combination. A better way is to use Bayesian optimization, which learns from past evaluation scores and takes less computation time.
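
The sketch below is not Bayesian optimization. It swaps in scikit-learn's `RandomizedSearchCV` as a simpler, related way to avoid evaluating a full grid: it samples a fixed number of C values from a log-uniform range instead of trying every combination. A model-based optimizer would go one step further and choose each new candidate from the scores seen so far. The synthetic data, the range for C and the budget of 20 samples are assumptions for illustration.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic binary classification data (illustrative only).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

search = RandomizedSearchCV(
    LogisticRegression(max_iter=5000),
    # Sample C on a log scale between 1e-3 and 1e3.
    param_distributions={"C": loguniform(1e-3, 1e3)},
    n_iter=20,          # number of sampled configurations
    cv=5,
    scoring="roc_auc",
    random_state=0,
)
search.fit(X, y)

print("Best C: %.4g" % search.best_params_["C"])
print("Best CV ROC AUC: %.3f" % search.best_score_)
```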

answered Aug 5, 2018 at 14:50

viplov

21

Would you please share some example source code for your solution? Otherwise, this post does not seem to provide a quality answer to the question and should be posted as a comment to the question.
– sɐunıɔןɐqɐp
Commented Aug 5, 2018 at 15:14
