Data Mining Techniques
ITE2006
NETWORK ABUSE DETECTION
PROJECT REPORT
SUBMITTED BY
15BIT0134 RUBAL NANDAL
15BIT0268 KEDAR KUMAR
Guided By:
Dr. Sudha M
CERTIFICATE
This is to certify that the project work entitled "NETWORK ABUSE DETECTION", submitted by KEDAR KUMAR (15BIT0268) and RUBAL NANDAL (15BIT0134), is a record of bonafide work carried out for Data Mining (ITE2006) under my supervision. The contents of this project work, in full or in part, have neither been taken from any other source nor been submitted for any other CAL course.
PLACE: VELLORE
DATE: 1/11/2017
KEDAR KUMAR (15BIT0268)
RUBAL NANDAL (15BIT0134)
Table of Contents
Acknowledgements
Problem Statement
Approach
Modules
Proposed Implementation
Implementation
Conclusion
References
ACKNOWLEDGEMENTS
We thank Dr. Sudha M for her guidance and help throughout the execution of this project. We also thank everyone else involved in its completion, and we are grateful to the University Management and the School Dean for giving us the opportunity to carry out this work at the university.
Problem Statement
Nowadays, many attacks with malicious intent are carried out over networks. Since most of these are network attacks, we attempt to build a network abuse detection (intrusion detection) system from the KDD-1999 dataset and to distinguish normal connections from attack connections.
Detecting network intrusions protects a computer network from unauthorized users, possibly including insiders. The intrusion-detector learning task is to build a predictive model (i.e. a classifier) capable of distinguishing between "bad" connections, called intrusions or attacks, and "good" normal connections.
A connection is a sequence of TCP packets starting and ending at well-defined times, between which data flows from a source IP address to a target IP address under a well-defined protocol. Each connection is labelled either as normal or as an attack, with exactly one specific attack type. Each connection record consists of about 100 bytes.
Attacks fall into four main categories:
• DOS: denial of service, e.g. SYN flood;
• R2L: unauthorized access from a remote machine, e.g. password guessing;
• U2R: unauthorized access to local superuser (root) privileges, e.g. various "buffer overflow" attacks;
• PROBING: surveillance and other probing, e.g. port scanning.
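The KDD-99 labels name individual attacks rather than these four categories. As a small illustrative helper (the dictionary and function names are ours, not part of the dataset), the attack names found in the 10% training subset can be collapsed into the broad groups like this:

```python
# Map raw KDD-99 label strings (note the trailing dot) to the four
# broad categories described above, plus 'normal'.
ATTACK_CATEGORY = {
    "normal.": "normal",
    # DOS
    "back.": "dos", "land.": "dos", "neptune.": "dos",
    "pod.": "dos", "smurf.": "dos", "teardrop.": "dos",
    # R2L
    "ftp_write.": "r2l", "guess_passwd.": "r2l", "imap.": "r2l",
    "multihop.": "r2l", "phf.": "r2l", "spy.": "r2l",
    "warezclient.": "r2l", "warezmaster.": "r2l",
    # U2R
    "buffer_overflow.": "u2r", "loadmodule.": "u2r",
    "perl.": "u2r", "rootkit.": "u2r",
    # PROBING
    "ipsweep.": "probe", "nmap.": "probe",
    "portsweep.": "probe", "satan.": "probe",
}

def categorize(label: str) -> str:
    """Return the broad category for a raw KDD label, or 'unknown'."""
    return ATTACK_CATEGORY.get(label, "unknown")
```

Unseen attack names fall through to "unknown", which matches the goal of handling new attack types later in the report.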
ABOUT DATASET
Our dataset contains the following features:
Table 1: Basic features of individual TCP connections
feature name description type
duration length (number of seconds) of the connection continuous
protocol_type type of the protocol, e.g. tcp, udp, etc. discrete
service network service on the destination, e.g., http, telnet, etc. discrete
src_bytes number of data bytes from source to destination continuous
dst_bytes number of data bytes from destination to source continuous
flag normal or error status of the connection discrete
land 1 if connection is from/to the same host/port; 0 otherwise discrete
wrong_fragment number of "wrong" fragments continuous
urgent number of urgent packets continuous
Table 2: Content features within a connection suggested by domain knowledge
feature name description type
hot number of "hot" indicators continuous
num_failed_logins number of failed login attempts continuous
logged_in 1 if successfully logged in; 0 otherwise discrete
num_compromised number of "compromised" conditions continuous
root_shell 1 if root shell is obtained; 0 otherwise discrete
su_attempted 1 if "su root" command attempted; 0 otherwise discrete
num_root number of "root" accesses continuous
num_file_creations number of file creation operations continuous
num_shells number of shell prompts continuous
num_access_files number of operations on access control files continuous
num_outbound_cmds number of outbound commands in an ftp session continuous
is_hot_login 1 if the login belongs to the "hot" list; 0 otherwise discrete
is_guest_login 1 if the login is a "guest" login; 0 otherwise discrete
Table 3: Traffic features computed using a two-second time window
feature name description type
count number of connections to the same host as the current connection in the past two seconds continuous
Note: The following features refer to these same-host connections.
serror_rate % of connections that have "SYN" errors continuous
rerror_rate % of connections that have "REJ" errors continuous
same_srv_rate % of connections to the same service continuous
diff_srv_rate % of connections to different services continuous
srv_count number of connections to the same service as the current connection in the past two seconds continuous
Note: The following features refer to these same-service connections.
srv_serror_rate % of connections that have "SYN" errors continuous
srv_rerror_rate % of connections that have "REJ" errors continuous
srv_diff_host_rate % of connections to different hosts continuous
Approach
1) First, we will do some exploratory data analysis using Pandas.
2) Then we will pre-process the data and remove unnecessary features (attributes) from the dataset.
3) Next, we will use clustering and anomaly detection. We want our model to work well with unknown attack types and also to give an approximation of the closest attack type. We will use K-means clustering.
4) Finally, we will build a classifier using Scikit-learn (a machine-learning library). The classifier will simply classify entries as normal or attack; by doing so, we can generalise the model to new attack types.
Modules
1) Data Pre-processing:
Initially, we take all the features. Not all of them are numerical, so we must encode the categorical variables, and we then perform feature selection to remove unwanted features and reduce the dimensionality of the data.
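As a sketch of the encoding step (column names follow the dataset tables above; the row values are made up for illustration), `pandas.get_dummies` can turn the three categorical columns into numeric indicator columns:

```python
import pandas as pd

# Toy frame with the three categorical features from the dataset plus
# one numeric feature; values are illustrative only.
df = pd.DataFrame({
    "protocol_type": ["tcp", "udp", "tcp"],
    "service": ["http", "domain_u", "telnet"],
    "flag": ["SF", "SF", "REJ"],
    "src_bytes": [181, 105, 0],
})

# One-hot encode the categorical columns; numeric columns pass through.
encoded = pd.get_dummies(df, columns=["protocol_type", "service", "flag"])
print(sorted(encoded.columns))
```

After this step every column is numeric, which is what the distance-based methods below require.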
2) KMeans clustering
We take an anomaly-detection approach on the reduced dataset, starting with k-means clustering. Once we have the cluster centres, we can use them to identify the attack or normal clusters in new data.
3) Classification
We train a classifier on our dataset and use it to predict the labels of another data file; we then evaluate the classifier's accuracy on a held-out test sample.
4) Predictions
Based on the assumption that new attack types will resemble old ones, we will be able to detect them. Moreover, anything that falls too far from every cluster will be considered anomalous and therefore a possible attack.
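A minimal sketch of this idea on synthetic 2-D data (the data and the 3.0 cut-off are assumptions for illustration, not tuned values): points whose distance to the nearest k-means centre exceeds a threshold are flagged as possible attacks.

```python
import numpy as np
from sklearn.cluster import KMeans

# Fit k-means on synthetic "known traffic" points.
rng = np.random.default_rng(42)
train = rng.normal(0.0, 1.0, size=(300, 2))   # stand-in for known connections
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(train)

new_points = np.array([[0.2, -0.1],   # resembles known traffic
                       [9.0, 9.0]])   # far from every centre
# km.transform gives the distance to every centre; take the nearest.
dist_to_nearest = km.transform(new_points).min(axis=1)
THRESHOLD = 3.0                       # illustrative cut-off
flagged = dist_to_nearest > THRESHOLD
print(flagged)
```

The same pattern applies to the KDD features once they are encoded and scaled; only the threshold would need tuning on held-out data.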
Proposed Implementation Framework
[Pipeline diagram: the KDD-1999 labelled raw dataset and the KDD-1999 corrected raw dataset each undergo feature selection and scaling; the resulting labelled dataset drives clustering and anomaly detection to produce an anomaly detection algorithm, which is then applied to the unlabelled corrected dataset to yield a labelled dataset and prediction results.]
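The scaling step in the framework matters because both k-means and k-NN are distance-based, so a wide-range feature such as src_bytes would otherwise swamp a narrow one such as duration. A sketch with made-up values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: duration (seconds), src_bytes. The raw ranges differ by
# orders of magnitude, so standardise each column to zero mean and
# unit variance before any distance computation.
X = np.array([[0.0,   181.0],
              [2.0,  5450.0],
              [1.0,   239.0]])
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```

The same fitted scaler should be reused on the corrected (test) dataset so both sets live in one coordinate system.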
Implementation
1) CLUSTERING
LOADING THE DATA
In [2]: import pandas
from time import time
col_names = ["duration","protocol_type","service","flag","src_bytes",
"dst_bytes","land","wrong_fragment","urgent","hot","num_failed_logins",
"logged_in","num_compromised","root_shell","su_attempted","num_root",
"num_file_creations","num_shells","num_access_files","num_outbound_cmds",
"is_host_login","is_guest_login","count","srv_count","serror_rate",
"srv_serror_rate","rerror_rate","srv_rerror_rate","same_srv_rate",
"diff_srv_rate","srv_diff_host_rate","dst_host_count","dst_host_srv_count",
"dst_host_same_srv_rate","dst_host_diff_srv_rate","dst_host_same_src_port_rate",
"dst_host_srv_diff_host_rate","dst_host_serror_rate","dst_host_srv_serror_rate",
"dst_host_rerror_rate","dst_host_srv_rerror_rate","label"]
kdd_data_10percent = pandas.read_csv(
    r"D:\study\sem5\datamining\project\dataset\data\kddcup.data_10_percent_corrected",
    header=None, names=col_names)
kdd_data_10percent.describe()
OUTPUT
VIEWING THE LABELS
In [3]: kdd_data_10percent['label'].value_counts()
OUTPUT
FEATURE SELECTION
In [4]: num_features = [
"duration","src_bytes",
"dst_bytes","land","wrong_fragment","urgent","hot","num_failed_logins",
"logged_in","num_compromised","root_shell","su_attempted","num_root",
"num_file_creations","num_shells","num_access_files","num_outbound_cmds",
"is_host_login","is_guest_login","count","srv_count","serror_rate",
"srv_serror_rate","rerror_rate","srv_rerror_rate","same_srv_rate",
"diff_srv_rate","srv_diff_host_rate","dst_host_count","dst_host_srv_count",
"dst_host_same_srv_rate","dst_host_diff_srv_rate","dst_host_same_src_port_rate",
"dst_host_srv_diff_host_rate","dst_host_serror_rate","dst_host_srv_serror_rate",
"dst_host_rerror_rate","dst_host_srv_rerror_rate"
]
features = kdd_data_10percent[num_features].astype(float)
features.describe()
OUTPUT
CLUSTERING
from sklearn.cluster import KMeans
k = 30
km = KMeans(n_clusters=k)
t0 = time()
km.fit(features)
tt = time() - t0
print("Clustered in", round(tt, 3), "seconds")
# visualising a sample of cluster assignments
for i in range(600, 620):
    print(km.labels_[i])
ASSIGNING LABELS
labels = kdd_data_10percent['label']
label_names = list(map(
    lambda x: pandas.Series([labels[i] for i in range(len(km.labels_)) if km.labels_[i] == x]),
    range(k)))
for i in range(k):
    print("Cluster", i, "labels:")
    print(label_names[i].value_counts(), "\n")
LOADING TESTING DATA
kdd_data_corrected = pandas.read_csv(
    r"D:\study\sem5\datamining\project\dataset\data\corrected",
    header=None, names=col_names)
ASSIGNING CLUSTERS
t0 = time()
pred = km.predict(kdd_data_corrected[num_features])
tt = time() - t0
print("Assigned clusters in", round(tt, 3), "seconds")
2) CLASSIFICATIONS
LOADING THE DATA
In [2]: import pandas
from time import time
col_names = ["duration","protocol_type","service","flag","src_bytes",
"dst_bytes","land","wrong_fragment","urgent","hot","num_failed_logins",
"logged_in","num_compromised","root_shell","su_attempted","num_root",
"num_file_creations","num_shells","num_access_files","num_outbound_cmds",
"is_host_login","is_guest_login","count","srv_count","serror_rate",
"srv_serror_rate","rerror_rate","srv_rerror_rate","same_srv_rate",
"diff_srv_rate","srv_diff_host_rate","dst_host_count","dst_host_srv_count",
"dst_host_same_srv_rate","dst_host_diff_srv_rate","dst_host_same_src_port_rate",
"dst_host_srv_diff_host_rate","dst_host_serror_rate","dst_host_srv_serror_rate",
"dst_host_rerror_rate","dst_host_srv_rerror_rate","label"]
kdd_data_10percent = pandas.read_csv(
    r"D:\study\sem5\datamining\project\dataset\data\kddcup.data_10_percent_corrected",
    header=None, names=col_names)
kdd_data_10percent.describe()
OUTPUT
VIEWING THE LABELS
In [3]: kdd_data_10percent['label'].value_counts()
OUTPUT
FEATURE SELECTION
In [4]: num_features = [
"duration","src_bytes",
"dst_bytes","land","wrong_fragment","urgent","hot","num_failed_logins",
"logged_in","num_compromised","root_shell","su_attempted","num_root",
"num_file_creations","num_shells","num_access_files","num_outbound_cmds",
"is_host_login","is_guest_login","count","srv_count","serror_rate",
"srv_serror_rate","rerror_rate","srv_rerror_rate","same_srv_rate",
"diff_srv_rate","srv_diff_host_rate","dst_host_count","dst_host_srv_count",
"dst_host_same_srv_rate","dst_host_diff_srv_rate","dst_host_same_src_port_rate",
"dst_host_srv_diff_host_rate","dst_host_serror_rate","dst_host_srv_serror_rate",
"dst_host_rerror_rate","dst_host_srv_rerror_rate"
]
features = kdd_data_10percent[num_features].astype(float)
features.describe()
OUTPUT
ADDING LABELS
from sklearn.neighbors import KNeighborsClassifier
labels = kdd_data_10percent['label'].copy()
labels[labels!='normal.'] = 'attack.'
labels.value_counts()
1) TRAINING CLASSIFIER WITH BALL TREE
# algorithm options: 'brute', 'ball_tree', 'kd_tree'
clf = KNeighborsClassifier(n_neighbors=5, algorithm='ball_tree', leaf_size=500)
t0 = time()
clf.fit(features,labels)
tt = time() - t0
print ("Classifier trained in",round(tt,3),"seconds")
LOADING TESTING DATA
kdd_data_corrected = pandas.read_csv(
    r"D:\study\sem5\datamining\project\dataset\data\corrected",
    header=None, names=col_names)
kdd_data_corrected['label'].value_counts()
CONVERTING LABELS
kdd_data_corrected.loc[kdd_data_corrected['label'] != 'normal.', 'label'] = 'attack.'
kdd_data_corrected['label'].value_counts()
CREATING TEST SAMPLE
from sklearn.model_selection import train_test_split
features_train, features_test, labels_train, labels_test = train_test_split(
kdd_data_corrected[num_features],
kdd_data_corrected['label'],
test_size=0.1,
random_state=42)
PREDICTING
t0 = time()
pred = clf.predict(features_test)
tt = time() - t0
print ("Predicted in",round(tt,3)," seconds")
CHECKING ACCURACY
from sklearn.metrics import accuracy_score
acc = accuracy_score(labels_test, pred)
print("Accuracy is", round(acc, 4))
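Because attack records heavily outnumber normal ones in KDD-99, accuracy alone can hide poor performance on the minority class. A sketch, on hypothetical label vectors (in practice one would pass labels_test and pred), of the per-class metrics worth reporting alongside accuracy:

```python
from sklearn.metrics import confusion_matrix, classification_report

# Hypothetical true/predicted labels in the same 'normal.'/'attack.'
# encoding used above.
y_true = ["attack.", "attack.", "normal.", "attack.", "normal."]
y_pred = ["attack.", "normal.", "normal.", "attack.", "normal."]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=["attack.", "normal."])
print(cm)
# Per-class precision, recall, and F1.
print(classification_report(y_true, y_pred))
```

A missed attack (a false negative on the attack class) is usually costlier than a false alarm, which is exactly what recall on the attack class measures.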
2) TRAINING CLASSIFIER WITH KD-TREE
# algorithm options: 'brute', 'ball_tree', 'kd_tree'
clf = KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree', leaf_size=500)
t0 = time()
clf.fit(features, labels)
tt = time() - t0
print("Classifier trained in", round(tt, 3), "seconds")
ACCURACY
pred = clf.predict(features_test)
from sklearn.metrics import accuracy_score
acc = accuracy_score(labels_test, pred)
print("Accuracy is", round(acc, 4))
3) TRAINING CLASSIFIER WITH BRUTEFORCE
# algorithm options: 'brute', 'ball_tree', 'kd_tree'
clf = KNeighborsClassifier(n_neighbors=5, algorithm='brute', leaf_size=500)
t0 = time()
clf.fit(features, labels)
tt = time() - t0
print("Classifier trained in", round(tt, 3), "seconds")
ACCURACY
pred = clf.predict(features_test)
from sklearn.metrics import accuracy_score
acc = accuracy_score(labels_test, pred)
print("Accuracy is", round(acc, 4))
CONCLUSION
We have formed clusters; these clusters can be used with real data to distinguish an attack from a normal connection. Anything falling far from every cluster can also be considered an attack.
From classification we obtained the results tabulated below:

ALGORITHM TIME FOR TRAINING ACCURACY
Ball-Tree Least 0.925 (near maximum)
KD-Tree Slightly higher than Ball-Tree 0.820 (least)
Bruteforce Highest 0.932 (maximum)

From our experiment we conclude that bruteforce is the most expensive algorithm but produced the maximum accuracy, while kd-tree obtained the worst result for our data. The ball-tree algorithm worked best overall, as it consumed almost the least time while achieving nearly the maximum accuracy.
References
Dataset
[1] http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
Software
[2] https://spark.apache.org/downloads.html
PySpark tutorials
[3] https://www.dezyre.com/apache-spark-tutorial/pyspark-tutorial
[4] https://www.datacamp.com/community/tutorials/apache-spark-python
Research article
[5] Tavallaee, M., Bagheri, E., Lu, W., & Ghorbani, A. A. (2009, July). A detailed analysis of the KDD CUP 99 data set. In IEEE Symposium on Computational Intelligence for Security and Defense Applications (CISDA 2009), pp. 1-6. IEEE.