Data Mining Techniques
ITE2006
NETWORK ABUSE DETECTION
PROJECT REPORT
SUBMITTED BY
15BIT0134 RUBAL NANDAL
15BIT0268 KEDAR KUMAR
Guided By:
Dr. Sudha M
CERTIFICATE
This is to certify that the project work entitled "NETWORK ABUSE DETECTION", submitted by KEDAR KUMAR (15BIT0268) and RUBAL NANDAL (15BIT0134), is a record of bonafide work carried out for Data Mining (ITE2006) under my supervision. The contents of this project work, in full or in part, have neither been taken from any other source nor been submitted for any other CAL course.
PLACE: VELLORE
DATE: 1/11/2017
KEDAR KUMAR (15BIT0268)
RUBAL NANDAL (15BIT0134)
Table of Contents
Acknowledgements
Problem Statement
Approach
Modules
Proposed Implementation
Implementation
Conclusion
References
ACKNOWLEDGEMENTS
We thank Dr. Sudha M for her guidance and help throughout the execution of this project. We also thank everyone else involved in its completion, and we are grateful to the University Management and the School Dean for giving us the opportunity to carry out this work at the university.
Problem Statement
Nowadays, many attacks with malicious intent are carried out over networks. Since most of these are network attacks, we attempt to build a network abuse detection (intrusion detection) system from the KDD-1999 dataset and to distinguish normal connections from attack connections.
Detecting network intrusions protects a computer network from unauthorized users, possibly including insiders. The intrusion-detector learning task is to build a predictive model (i.e. a classifier) capable of distinguishing between "bad" connections, called intrusions or attacks, and "good" normal connections.
A connection is a sequence of TCP packets starting and ending at well-defined times, between which data flows from a source IP address to a target IP address under a well-defined protocol. Each connection is labelled either as normal or as an attack, with exactly one specific attack type. Each connection record consists of about 100 bytes.
Attacks fall into four main categories:
• DOS: denial of service, e.g. SYN flood;
• R2L: unauthorized access from a remote machine, e.g. password guessing;
• U2R: unauthorized access to local superuser (root) privileges, e.g. various "buffer overflow" attacks;
• PROBING: surveillance and other probing, e.g. port scanning.
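The KDD-99 labels name individual attacks rather than these four categories. As a small illustrative helper (the dictionary and function names are ours, not part of the dataset), the attack names found in the 10% training subset can be collapsed into the broad groups like this:

```python
# Map raw KDD-99 label strings (note the trailing dot) to the four
# broad categories described above, plus 'normal'.
ATTACK_CATEGORY = {
    "normal.": "normal",
    # DOS
    "back.": "dos", "land.": "dos", "neptune.": "dos",
    "pod.": "dos", "smurf.": "dos", "teardrop.": "dos",
    # R2L
    "ftp_write.": "r2l", "guess_passwd.": "r2l", "imap.": "r2l",
    "multihop.": "r2l", "phf.": "r2l", "spy.": "r2l",
    "warezclient.": "r2l", "warezmaster.": "r2l",
    # U2R
    "buffer_overflow.": "u2r", "loadmodule.": "u2r",
    "perl.": "u2r", "rootkit.": "u2r",
    # PROBING
    "ipsweep.": "probe", "nmap.": "probe",
    "portsweep.": "probe", "satan.": "probe",
}

def categorize(label: str) -> str:
    """Return the broad category for a raw KDD label, or 'unknown'."""
    return ATTACK_CATEGORY.get(label, "unknown")
```

Unseen attack names fall through to "unknown", which matches the goal of handling new attack types later in the report.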
ABOUT DATASET
Our dataset contains the following features:
Table 1: Basic features of individual TCP connections
feature name description type
duration length (number of seconds) of the connection continuous
protocol_type type of the protocol, e.g. tcp, udp, etc. discrete
service network service on the destination, e.g., http, telnet, etc. discrete
src_bytes number of data bytes from source to destination continuous
dst_bytes number of data bytes from destination to source continuous
flag normal or error status of the connection discrete
land 1 if connection is from/to the same host/port; 0 otherwise discrete
wrong_fragment number of "wrong" fragments continuous
urgent number of urgent packets continuous
Table 2: Content features within a connection suggested by domain knowledge
feature name description type
hot number of "hot" indicators continuous
num_failed_logins number of failed login attempts continuous
logged_in 1 if successfully logged in; 0 otherwise discrete
num_compromised number of "compromised" conditions continuous
root_shell 1 if root shell is obtained; 0 otherwise discrete
su_attempted 1 if "su root" command attempted; 0 otherwise discrete
num_root number of "root" accesses continuous
num_file_creations number of file creation operations continuous
num_shells number of shell prompts continuous
num_access_files number of operations on access control files continuous
num_outbound_cmds number of outbound commands in an ftp session continuous
is_hot_login 1 if the login belongs to the "hot" list; 0 otherwise discrete
is_guest_login 1 if the login is a "guest" login; 0 otherwise discrete
Table 3: Traffic features computed using a two-second time window
feature name description type
count number of connections to the same host as the current connection in the past two seconds continuous
Note: The following features refer to these same-host connections.
serror_rate % of connections that have "SYN" errors continuous
rerror_rate % of connections that have "REJ" errors continuous
same_srv_rate % of connections to the same service continuous
diff_srv_rate % of connections to different services continuous
srv_count number of connections to the same service as the current connection in the past two seconds continuous
Note: The following features refer to these same-service connections.
srv_serror_rate % of connections that have "SYN" errors continuous
srv_rerror_rate % of connections that have "REJ" errors continuous
srv_diff_host_rate % of connections to different hosts continuous
Approach
1) First, we will do some exploratory data analysis using Pandas.
2) Then we will pre-process the data and remove unnecessary features (attributes) from the dataset.
3) Next, we will use clustering and anomaly detection. We want our model to work well with unknown attack types and also to give an approximation of the closest attack type. We will use K-means clustering.
4) Finally, we will build a classifier using Scikit-learn (a machine-learning library). The classifier will simply classify entries as normal or attack; by doing so, we can generalise the model to new attack types.
Modules
1) Data Pre-processing:
Initially, we take all the features. Not all of them are numerical, so we must encode the categorical variables, and we then perform feature selection to remove unwanted features and reduce the dimensionality of the data.
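As a sketch of the encoding step (column names follow the dataset tables above; the row values are made up for illustration), `pandas.get_dummies` can turn the three categorical columns into numeric indicator columns:

```python
import pandas as pd

# Toy frame with the three categorical features from the dataset plus
# one numeric feature; values are illustrative only.
df = pd.DataFrame({
    "protocol_type": ["tcp", "udp", "tcp"],
    "service": ["http", "domain_u", "telnet"],
    "flag": ["SF", "SF", "REJ"],
    "src_bytes": [181, 105, 0],
})

# One-hot encode the categorical columns; numeric columns pass through.
encoded = pd.get_dummies(df, columns=["protocol_type", "service", "flag"])
print(sorted(encoded.columns))
```

After this step every column is numeric, which is what the distance-based methods below require.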
2) KMeans clustering
We take an anomaly-detection approach on the reduced dataset, starting with k-means clustering. Once we have the cluster centres, we can use them to identify the attack or normal clusters in new data.
3) Classification
We train a classifier on our dataset and use it to predict the labels of another data file; we then evaluate the classifier's accuracy on a held-out test sample.
4) Predictions
Based on the assumption that new attack types will resemble old ones, we will be able to detect them. Moreover, anything that falls too far from every cluster will be considered anomalous and therefore a possible attack.
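A minimal sketch of this idea on synthetic 2-D data (the data and the 3.0 cut-off are assumptions for illustration, not tuned values): points whose distance to the nearest k-means centre exceeds a threshold are flagged as possible attacks.

```python
import numpy as np
from sklearn.cluster import KMeans

# Fit k-means on synthetic "known traffic" points.
rng = np.random.default_rng(42)
train = rng.normal(0.0, 1.0, size=(300, 2))   # stand-in for known connections
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(train)

new_points = np.array([[0.2, -0.1],   # resembles known traffic
                       [9.0, 9.0]])   # far from every centre
# km.transform gives the distance to every centre; take the nearest.
dist_to_nearest = km.transform(new_points).min(axis=1)
THRESHOLD = 3.0                       # illustrative cut-off
flagged = dist_to_nearest > THRESHOLD
print(flagged)
```

The same pattern applies to the KDD features once they are encoded and scaled; only the threshold would need tuning on held-out data.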
Proposed Implementation Framework
[Pipeline diagram: the KDD-1999 labelled raw dataset and the KDD-1999 corrected raw dataset each undergo feature selection and scaling; the resulting labelled dataset drives clustering and anomaly detection to produce an anomaly detection algorithm, which is then applied to the unlabelled corrected dataset to yield a labelled dataset and prediction results.]
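The scaling step in the framework matters because both k-means and k-NN are distance-based, so a wide-range feature such as src_bytes would otherwise swamp a narrow one such as duration. A sketch with made-up values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: duration (seconds), src_bytes. The raw ranges differ by
# orders of magnitude, so standardise each column to zero mean and
# unit variance before any distance computation.
X = np.array([[0.0,   181.0],
              [2.0,  5450.0],
              [1.0,   239.0]])
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```

The same fitted scaler should be reused on the corrected (test) dataset so both sets live in one coordinate system.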
Implementation
1) CLUSTERING
LOADING THE DATA
In [2]: import pandas
from time import time
col_names = ["duration","protocol_type","service","flag","src_bytes",
"dst_bytes","land","wrong_fragment","urgent","hot","num_failed_logins",
"logged_in","num_compromised","root_shell","su_attempted","num_root",
"num_file_creations","num_shells","num_access_files","num_outbound_cmds",
"is_host_login","is_guest_login","count","srv_count","serror_rate",
"srv_serror_rate","rerror_rate","srv_rerror_rate","same_srv_rate",
"diff_srv_rate","srv_diff_host_rate","dst_host_count","dst_host_srv_count",
"dst_host_same_srv_rate","dst_host_diff_srv_rate","dst_host_same_src_port_rate",
"dst_host_srv_diff_host_rate","dst_host_serror_rate","dst_host_srv_serror_rate",
"dst_host_rerror_rate","dst_host_srv_rerror_rate","label"]
kdd_data_10percent = pandas.read_csv(
    r"D:\study\sem5\datamining\project\dataset\data\kddcup.data_10_percent_corrected",
    header=None, names=col_names)
kdd_data_10percent.describe()
OUTPUT
VIEWING THE LABELS
In [3]: kdd_data_10percent['label'].value_counts()
OUTPUT
FEATURE SELECTION
In [4]: num_features = [
"duration","src_bytes",
"dst_bytes","land","wrong_fragment","urgent","hot","num_failed_logins",
"logged_in","num_compromised","root_shell","su_attempted","num_root",
"num_file_creations","num_shells","num_access_files","num_outbound_cmds",
"is_host_login","is_guest_login","count","srv_count","serror_rate",
"srv_serror_rate","rerror_rate","srv_rerror_rate","same_srv_rate",
"diff_srv_rate","srv_diff_host_rate","dst_host_count","dst_host_srv_count",
"dst_host_same_srv_rate","dst_host_diff_srv_rate","dst_host_same_src_port_rate",
"dst_host_srv_diff_host_rate","dst_host_serror_rate","dst_host_srv_serror_rate",
"dst_host_rerror_rate","dst_host_srv_rerror_rate"
]
features = kdd_data_10percent[num_features].astype(float)
features.describe()
OUTPUT
CLUSTERING
from sklearn.cluster import KMeans
k = 30
km = KMeans(n_clusters=k)
t0 = time()
km.fit(features)
tt = time() - t0
print("Clustered in", round(tt, 3), "seconds")
# visualising a sample of cluster assignments
for i in range(600, 620):
    print(km.labels_[i])
ASSIGNING LABELS
labels = kdd_data_10percent['label']
label_names = list(map(
    lambda x: pandas.Series([labels[i] for i in range(len(km.labels_)) if km.labels_[i] == x]),
    range(k)))
for i in range(k):
    print("Cluster", i, "labels:")
    print(label_names[i].value_counts(), "\n")
LOADING TESTING DATA
kdd_data_corrected = pandas.read_csv(
    r"D:\study\sem5\datamining\project\dataset\data\corrected",
    header=None, names=col_names)
ASSIGNING CLUSTERS
t0 = time()
pred = km.predict(kdd_data_corrected[num_features])
tt = time() - t0
print("Assigned clusters in", round(tt, 3), "seconds")
2) CLASSIFICATIONS
LOADING THE DATA
In [2]: import pandas
from time import time
col_names = ["duration","protocol_type","service","flag","src_bytes",
"dst_bytes","land","wrong_fragment","urgent","hot","num_failed_logins",
"logged_in","num_compromised","root_shell","su_attempted","num_root",
"num_file_creations","num_shells","num_access_files","num_outbound_cmds",
"is_host_login","is_guest_login","count","srv_count","serror_rate",
"srv_serror_rate","rerror_rate","srv_rerror_rate","same_srv_rate",
"diff_srv_rate","srv_diff_host_rate","dst_host_count","dst_host_srv_count",
"dst_host_same_srv_rate","dst_host_diff_srv_rate","dst_host_same_src_port_rate",
"dst_host_srv_diff_host_rate","dst_host_serror_rate","dst_host_srv_serror_rate",
"dst_host_rerror_rate","dst_host_srv_rerror_rate","label"]
kdd_data_10percent = pandas.read_csv(
    r"D:\study\sem5\datamining\project\dataset\data\kddcup.data_10_percent_corrected",
    header=None, names=col_names)
kdd_data_10percent.describe()
OUTPUT
VIEWING THE LABELS
In [3]: kdd_data_10percent['label'].value_counts()
OUTPUT
FEATURE SELECTION
In [4]: num_features = [
"duration","src_bytes",
"dst_bytes","land","wrong_fragment","urgent","hot","num_failed_logins",
"logged_in","num_compromised","root_shell","su_attempted","num_root",
"num_file_creations","num_shells","num_access_files","num_outbound_cmds",
"is_host_login","is_guest_login","count","srv_count","serror_rate",
"srv_serror_rate","rerror_rate","srv_rerror_rate","same_srv_rate",
"diff_srv_rate","srv_diff_host_rate","dst_host_count","dst_host_srv_count",
"dst_host_same_srv_rate","dst_host_diff_srv_rate","dst_host_same_src_port_rate",
"dst_host_srv_diff_host_rate","dst_host_serror_rate","dst_host_srv_serror_rate",
"dst_host_rerror_rate","dst_host_srv_rerror_rate"
]
features = kdd_data_10percent[num_features].astype(float)
features.describe()
OUTPUT
ADDING LABELS
from sklearn.neighbors import KNeighborsClassifier
labels = kdd_data_10percent['label'].copy()
labels[labels!='normal.'] = 'attack.'
labels.value_counts()
1) TRAINING CLASSIFIER WITH BALL TREE
# algorithm options: 'brute', 'ball_tree', 'kd_tree'
clf = KNeighborsClassifier(n_neighbors=5, algorithm='ball_tree', leaf_size=500)
t0 = time()
clf.fit(features,labels)
tt = time() - t0
print ("Classifier trained in",round(tt,3),"seconds")
LOADING TESTING DATA
kdd_data_corrected = pandas.read_csv(
    r"D:\study\sem5\datamining\project\dataset\data\corrected",
    header=None, names=col_names)
kdd_data_corrected['label'].value_counts()
CONVERTING LABELS
kdd_data_corrected.loc[kdd_data_corrected['label'] != 'normal.', 'label'] = 'attack.'
kdd_data_corrected['label'].value_counts()
CREATING TEST SAMPLE
from sklearn.model_selection import train_test_split
features_train, features_test, labels_train, labels_test = train_test_split(
kdd_data_corrected[num_features],
kdd_data_corrected['label'],
test_size=0.1,
random_state=42)
PREDICTING
t0 = time()
pred = clf.predict(features_test)
tt = time() - t0
print ("Predicted in",round(tt,3)," seconds")
CHECKING ACCURACY
from sklearn.metrics import accuracy_score
acc = accuracy_score(labels_test, pred)
print("Accuracy is", round(acc, 4))
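Because attack records heavily outnumber normal ones in KDD-99, accuracy alone can hide poor performance on the minority class. A sketch, on hypothetical label vectors (in practice one would pass labels_test and pred), of the per-class metrics worth reporting alongside accuracy:

```python
from sklearn.metrics import confusion_matrix, classification_report

# Hypothetical true/predicted labels in the same 'normal.'/'attack.'
# encoding used above.
y_true = ["attack.", "attack.", "normal.", "attack.", "normal."]
y_pred = ["attack.", "normal.", "normal.", "attack.", "normal."]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=["attack.", "normal."])
print(cm)
# Per-class precision, recall, and F1.
print(classification_report(y_true, y_pred))
```

A missed attack (a false negative on the attack class) is usually costlier than a false alarm, which is exactly what recall on the attack class measures.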
2) TRAINING CLASSIFIER WITH KD-TREE
# algorithm options: 'brute', 'ball_tree', 'kd_tree'
clf = KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree', leaf_size=500)
t0 = time()
clf.fit(features, labels)
tt = time() - t0
print("Classifier trained in", round(tt, 3), "seconds")
ACCURACY
pred = clf.predict(features_test)
from sklearn.metrics import accuracy_score
acc = accuracy_score(labels_test, pred)
print("Accuracy is", round(acc, 4))
3) TRAINING CLASSIFIER WITH BRUTEFORCE
# algorithm options: 'brute', 'ball_tree', 'kd_tree'
clf = KNeighborsClassifier(n_neighbors=5, algorithm='brute', leaf_size=500)
t0 = time()
clf.fit(features, labels)
tt = time() - t0
print("Classifier trained in", round(tt, 3), "seconds")
ACCURACY
pred = clf.predict(features_test)
from sklearn.metrics import accuracy_score
acc = accuracy_score(labels_test, pred)
print("Accuracy is", round(acc, 4))
CONCLUSION
We have formed clusters; these clusters can be used with real data to distinguish an attack from a normal connection. Anything falling far from every cluster can also be considered an attack.
From classification we obtained the results tabulated below:

ALGORITHM TIME FOR TRAINING ACCURACY
Ball-Tree Least 0.925 (near maximum)
KD-Tree Slightly higher than Ball-Tree 0.820 (least)
Bruteforce Highest 0.932 (maximum)

From our experiment we conclude that bruteforce is the most expensive algorithm but produced the maximum accuracy, while kd-tree obtained the worst result for our data. The ball-tree algorithm worked best overall, as it consumed almost the least time while achieving nearly the maximum accuracy.
References
Dataset
[1] http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
Software
[2] https://spark.apache.org/downloads.html
PySpark tutorials
[3] https://www.dezyre.com/apache-spark-tutorial/pyspark-tutorial
[4] https://www.datacamp.com/community/tutorials/apache-spark-python
Research article
[5] Tavallaee, M., Bagheri, E., Lu, W., & Ghorbani, A. A. (2009, July). A detailed analysis of the KDD CUP 99 data set. In IEEE Symposium on Computational Intelligence for Security and Defense Applications (CISDA 2009), pp. 1-6. IEEE.