
I have been given a few years of data for ozone, NO, NO2 and CO, and the task is to use this data to predict the ozone value. Suppose I have data for the years 2015, 2016, 2018 and 2019; I need to predict the ozone values for 2019 using the 2015, 2016 and 2018 data that I already have.

The data is recorded hourly and is organized month by month (an image showing this format was attached originally, but is not included here).

What I have done: first, I put all the years' data into one Excel file containing four columns (NO, NO2, CO, O3), adding the data month by month. This is the master file that has been used (an image of it was attached originally).

I have used Python. First the data has to be cleaned. Let me explain a bit: NO, NO2 and CO are precursors of ozone, which means that ozone formation depends on these gases, so the data had to be cleaned beforehand. The constraints were to remove any negative value and to drop the whole row including the other columns, so if any of the O3, NO, NO2 or CO values in a row is invalid, the entire row is removed and not counted. The data also contained some entries in string format which had to be removed. All of this was done. Then I applied the MLPRegressor from scikit-learn. Here is the code I have written.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import explained_variance_score
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_absolute_error
import pandas as pd

import matplotlib.pyplot as plt

# junk strings found in the raw Excel export; they are replaced with 0 below
# and the affected rows are then removed together with the other invalid values
bugs = ['NOx', '* 43.3', '* 312', '11/19', '11/28', '06:00', '09/30', '09/04', '14:00', '06/25', '07:00', '06/02',
        '17:00', '04/10', '04/17', '18:00', '02/26', '02/03', '01:00', '11/23', '15:00', '11/12', '24:00', '09/02',
        '16:00', '09/28', '* 16.8', '* 121', '12:00', '06/24', '13:00', '06/26', 'Span', 'NoData', 'ppb', 'Zero',
        'Samp<', 'RS232']
dataset = pd.read_excel("Testing.xlsx")

dataset = dataset.replace(bugs, 0)
dataset.dropna(subset=["O3"], inplace=True)
dataset.dropna(subset=["NO"], inplace=True)
dataset.dropna(subset=["NO2"], inplace=True)
dataset.dropna(subset=["CO"], inplace=True)

# keep only rows where every pollutant lies within a plausible range
# (anything below 1, i.e. zeros and negatives, or above the upper bound is dropped)
limits = {"O3": 160, "NO": 160, "NO2": 160, "CO": 4000}
for col, upper in limits.items():
    dataset.drop(dataset[(dataset[col] < 1) | (dataset[col] > upper)].index, inplace=True)
dataset = dataset.reset_index(drop=True)

X = dataset[["NO", "NO2", "CO"]].astype(int)
Y = dataset[["O3"]].astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.05, random_state=27)
sc_x = StandardScaler()
X_train = sc_x.fit_transform(X_train)
X_test = sc_x.fit_transform(X_test)
clf = MLPRegressor(hidden_layer_sizes=(100,100,100), max_iter=10000,verbose=True,random_state=8)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(explained_variance_score(y_test, y_pred))
print(mean_absolute_error(y_test, y_pred))
y_test = pd.DataFrame(y_test)
y_test = y_test.reset_index(0)
y_test = y_test.drop(['index'], axis=1)
# y_test = y_test.drop([19,20],axis=0)
y_pred = pd.DataFrame(y_pred)
y_pred = y_pred.shift(-1)
# y_pred = y_pred.drop([19,20],axis=0)
plt.figure(figsize=(10, 5))
plt.plot(y_pred, color='r', label='PredictedO3')
plt.plot(y_test, color='g', label='OriginalO3')
plt.legend()
plt.show()

Console:

     y = column_or_1d(y, warn=True)
Iteration 1, loss = 537.59597297
Iteration 2, loss = 185.33662023
Iteration 3, loss = 159.32122111
Iteration 4, loss = 156.71612690
Iteration 5, loss = 155.05307865
Iteration 6, loss = 154.59351630
Iteration 7, loss = 154.16687592
Iteration 8, loss = 153.69258698
Iteration 9, loss = 153.36140320
Iteration 10, loss = 152.94593665
Iteration 11, loss = 152.75124494
Iteration 12, loss = 152.73893578
Iteration 13, loss = 152.27131771
Iteration 14, loss = 152.08732297
Iteration 15, loss = 151.83197245
Iteration 16, loss = 151.29399626
Iteration 17, loss = 150.96425147
Iteration 18, loss = 150.47673257
Iteration 19, loss = 150.14353009
Iteration 20, loss = 149.74165931
Iteration 21, loss = 149.39158575
Iteration 22, loss = 149.28863163
Iteration 23, loss = 148.95356802
Iteration 24, loss = 148.82618770
Iteration 25, loss = 148.18070387
Iteration 26, loss = 147.79069739
Iteration 27, loss = 147.03057672
Iteration 28, loss = 146.77822749
Iteration 29, loss = 146.47159952
Iteration 30, loss = 145.77185465
Iteration 31, loss = 145.54493110
Iteration 32, loss = 145.58297196
Iteration 33, loss = 145.05848640
Iteration 34, loss = 144.73301133
Iteration 35, loss = 144.04886503
Iteration 36, loss = 143.82328142
Iteration 37, loss = 143.87060411
Iteration 38, loss = 143.84762507
Iteration 39, loss = 142.64434158
Iteration 40, loss = 142.63539287
Iteration 41, loss = 142.55569644
Iteration 42, loss = 142.33659309
Iteration 43, loss = 142.08105262
Iteration 44, loss = 141.84181483
Iteration 45, loss = 143.50650508
Iteration 46, loss = 141.34511656
Iteration 47, loss = 141.26444355
Iteration 48, loss = 140.37034198
Iteration 49, loss = 140.15212097
Iteration 50, loss = 140.21204360
Iteration 51, loss = 140.01652524
Iteration 52, loss = 139.55019562
Iteration 53, loss = 139.96862367
Iteration 54, loss = 139.18904418
Iteration 55, loss = 138.96940532
Iteration 56, loss = 138.74715169
Iteration 57, loss = 138.42219317
Iteration 58, loss = 138.87739582
Iteration 59, loss = 138.48879907
Iteration 60, loss = 138.32348064
Iteration 61, loss = 138.25489777
Iteration 62, loss = 137.35913024
Iteration 63, loss = 137.34553482
Iteration 64, loss = 137.81499126
Iteration 65, loss = 137.24418131
Iteration 66, loss = 138.22142987
Iteration 67, loss = 136.68683284
Iteration 68, loss = 136.80873025
Iteration 69, loss = 136.89557260
Iteration 70, loss = 137.78914828
Iteration 71, loss = 136.39181767
Iteration 72, loss = 136.90698714
Iteration 73, loss = 136.15180171
Iteration 74, loss = 136.29621913
Iteration 75, loss = 136.54671797
Iteration 76, loss = 136.17984691
Iteration 77, loss = 135.46193871
Iteration 78, loss = 135.72399747
Iteration 79, loss = 135.66833438
Iteration 80, loss = 135.59829106
Iteration 81, loss = 134.89759461
Iteration 82, loss = 135.13978950
Iteration 83, loss = 135.13023951
Iteration 84, loss = 134.74279949
Iteration 85, loss = 135.81422214
Iteration 86, loss = 134.91660517
Iteration 87, loss = 134.42552779
Iteration 88, loss = 134.69309963
Iteration 89, loss = 135.12116240
Iteration 90, loss = 134.58731261
Iteration 91, loss = 135.03610330
Iteration 92, loss = 135.49753508
Iteration 93, loss = 134.34645918
Iteration 94, loss = 133.73179994
Iteration 95, loss = 133.63077367
Iteration 96, loss = 133.77330604
Iteration 97, loss = 134.34313391
Iteration 98, loss = 133.89467176
Iteration 99, loss = 134.16270723
Iteration 100, loss = 133.69654234
Iteration 101, loss = 134.06460647
Iteration 102, loss = 133.67570066
Iteration 103, loss = 133.51941546
Iteration 104, loss = 134.44514524
Iteration 105, loss = 133.77755818
Iteration 106, loss = 133.45007788
Iteration 107, loss = 133.07441490
Iteration 108, loss = 134.99803516
Iteration 109, loss = 133.80158058
Iteration 110, loss = 132.86973595
Iteration 111, loss = 132.95281816
Iteration 112, loss = 132.55546679
Iteration 113, loss = 133.89665148
Iteration 114, loss = 132.92319206
Iteration 115, loss = 133.02169313
Iteration 116, loss = 133.23774543
Iteration 117, loss = 132.03027124
Iteration 118, loss = 133.18472212
Iteration 119, loss = 132.34502179
Iteration 120, loss = 132.55417269
Iteration 121, loss = 132.43373273
Iteration 122, loss = 132.26810570
Iteration 123, loss = 133.17705777
Iteration 124, loss = 133.58044956
Iteration 125, loss = 132.12074893
Iteration 126, loss = 131.93800952
Iteration 127, loss = 132.30641181
Iteration 128, loss = 131.81882504
Iteration 129, loss = 132.06413592
Iteration 130, loss = 132.24680375
Iteration 131, loss = 132.12261129
Iteration 132, loss = 132.35714616
Iteration 133, loss = 131.90862418
Iteration 134, loss = 131.73195382
Iteration 135, loss = 131.55302493
Iteration 136, loss = 131.41382323
Iteration 137, loss = 131.62962730
Iteration 138, loss = 132.49231086
Iteration 139, loss = 131.14651158
Iteration 140, loss = 131.46236192
Iteration 141, loss = 131.36319145
Iteration 142, loss = 131.87374996
Iteration 143, loss = 132.08955722
Iteration 144, loss = 131.28997320
Iteration 145, loss = 131.35961909
Iteration 146, loss = 131.20954288
Iteration 147, loss = 131.99304728
Iteration 148, loss = 130.76432171
Iteration 149, loss = 131.42775156
Iteration 150, loss = 131.05940000
Iteration 151, loss = 131.28351430
Iteration 152, loss = 130.74260322
Iteration 153, loss = 130.88466712
Iteration 154, loss = 131.03646775
Iteration 155, loss = 130.34557661
Iteration 156, loss = 130.83447199
Iteration 157, loss = 131.28845939
Iteration 158, loss = 130.65785044
Iteration 159, loss = 130.61223056
Iteration 160, loss = 131.07589679
Iteration 161, loss = 130.64325675
Iteration 162, loss = 129.70704922
Iteration 163, loss = 129.84506370
Iteration 164, loss = 130.61988464
Iteration 165, loss = 130.43265567
Iteration 166, loss = 130.88822404
Iteration 167, loss = 130.76778201
Iteration 168, loss = 130.64819084
Iteration 169, loss = 130.28019987
Iteration 170, loss = 129.95417212
Iteration 171, loss = 131.06510048
Iteration 172, loss = 131.21377407
Iteration 173, loss = 130.17368709
Training loss did not improve more than tol=0.000100 for 10 consecutive epochs. Stopping.
0.2442499851919634
12.796789671568312

Here is the final plot (plot image not shown). If I am doing something wrong, please correct me. Regards.


1 Answer


Such questions are actually difficult to answer exactly, since the answer depends crucially on the dataset used, which we don't have.

Nevertheless, since your target variable seems to have a rather high dynamic range, you should try scaling it using a separate scaler; you should take care to inverse-transform the predictions back to their original scale, before computing errors or plotting:

sc_y = StandardScaler()
y_train = sc_y.fit_transform(y_train.values.reshape(-1, 1))  # scale the target for training
# leave y_test on its original scale, so it can be compared directly with the
# inverse-transformed predictions below

# model definition and fitting...

y_pred_scaled = clf.predict(X_test)  # get predictions on the scaled target
y_pred = sc_y.inverse_transform(y_pred_scaled.reshape(-1, 1))  # transform back to the original scale

You should be able from this point on to continue further with y_pred as you do in your code.
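
For example, the evaluation step would then look like this (just a sketch; it assumes y_test was left on its original scale, as in the snippet above):

# y_test is still on the original O3 scale and y_pred has been inverse-transformed,
# so these metrics are directly comparable with your current numbers
print(explained_variance_score(y_test, y_pred))
print(mean_absolute_error(y_test, y_pred))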

Also, irrelevant to your issue, but you are scaling your features in the wrong way. We never use fit_transform on the test data; the correct way is:

sc_x = StandardScaler()
X_train = sc_x.fit_transform(X_train)
X_test = sc_x.transform(X_test) # transform here

As said, this is just a tip; the keyword here is experiment (with a different number of layers, different numbers of units per layer, different scalers, etc.).
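
If it helps, here is one minimal sketch of such an experiment (not part of the original setup): it wraps everything in a scikit-learn Pipeline so the feature scaler is re-fit inside every cross-validation fold, and uses TransformedTargetRegressor to scale O3 for training and inverse-transform the predictions automatically. The parameter grid is purely illustrative, and X and Y are the unscaled frames from your script:

from sklearn.compose import TransformedTargetRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.neural_network import MLPRegressor

# the pipeline re-fits the feature scaler on each CV training fold, and
# TransformedTargetRegressor handles scaling / inverse-scaling of the target
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("mlp", TransformedTargetRegressor(
        regressor=MLPRegressor(max_iter=10000, random_state=8),
        transformer=StandardScaler(),
    )),
])

# purely illustrative grid: two feature scalers and a few network sizes
param_grid = {
    "scaler": [StandardScaler(), MinMaxScaler()],
    "mlp__regressor__hidden_layer_sizes": [(50,), (100,), (100, 100), (100, 100, 100)],
}

search = GridSearchCV(pipe, param_grid, cv=3, scoring="neg_mean_absolute_error")
search.fit(X, Y.values.ravel())  # in practice, run this on the training split only
print(search.best_params_)
print(-search.best_score_)       # cross-validated MAE on the original O3 scale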
