Data science and visualization MODULE 3 FG&FS
Data Wrangling in R
1. dplyr - the fundamental data-munging R package and a premier tool for data framing; particularly useful for operating on categories of data.
2. purrr - good for list operations and error-checking.
3. splitstackshape - an oldie but a goldie; great for shaping complex data sets and making visualization easier.
4. jsonlite - a simple and fast JSON parser.
5. magrittr - good for wrangling scattered sets and piping them into a more cohesive shape.
8. FEATURE GENERATION
8.1 INTRODUCTION
The 2004 Text Retrieval Conference (TREC) Genomics Track was divided into two main tasks:
categorization and ad hoc retrieval. The categorization task consisted of a document triage subtask
and an annotation subtask to detect the presence of evidence in the document for each of the three
main Gene Ontology (GO) code hierarchies. Our work focused on the document triage subtask. We
also participated in the ad hoc retrieval task.
8.2 BACKGROUND
The classification of documents is a common problem in biomedicine. Training a support vector
machine (SVM) on vectors generated from stemmed and/or stoplisted document word counts has
proven to be a simple and generally effective method (Yeh et al., 2003).
However, we believed that the triage problem posed here had some distinctive features that would require a
modification of the standard approach. First, the number of true positives
in both the training and the test sets was low, about 6-7%. Second, the utility function chosen
as the scoring metric was heavily weighted to reward recall over precision.
This weighting was based on a review of the existing work processes of the annotators at Mouse
Genome Informatics (MGI) and an estimate of the relative costs they assign to false negative and false positive
classifications. The official utility function weights a false negative as 20 times more costly than a
false positive. Under this metric, MGI's existing work process, which reads all the documents
in the test set, has a utility of 0.25. Furthermore, the training and test samples were not randomly
drawn from the same pool, but were instead taken from documents published in two consecutive years.
While this is a more realistic simulation of the system as it would be applied at MGI, it raises
the question of how well the features derived from one year of literature reflect the literature of
subsequent years. In response to these issues, our approach included a rich collection of features,
statistically based feature selection, multiple classifiers, and an analysis of how well the
features extracted from the 2002 corpus reflected the documents in the 2003 corpus.
8.3 SYSTEM AND METHODS
We tackled the triage problem in four steps: feature generation, feature selection,
classifier selection and training and, finally, classification of the test documents.
Only the training corpus was used to carry out the first three steps. The final step was
applied to the test corpus to produce the submitted results. During system development, we used
ten-fold cross-validation on the training set to compare approaches and set system parameters.
Cross-validation included running the first two steps on the entire training set. Then 90% of the training
data was used to train the classifiers, which were then applied to the remaining 10% of the training
data. This was repeated ten times, so that all of the training data was classified once. The
results were then aggregated to compute cross-validation metrics for the training corpus. Figure 17
displays this process diagrammatically.
Figure 17: Step-wise approach to test classification
1. Feature generation
The full-text corpus with SGML mark-up offered an opportunity to explore the use of several types
of features. While many text classification methods view text as a "bag-of-words," we opted to
use the information contained in the SGML mark-up to generate distinct section-type features. Since
we combined features that could occur several times in a single document with features that could only
occur once, after some initial testing we decided to treat each feature as binary; that is, each feature
was either present in a document or absent. One type of feature we created consisted of pairs of
section names and words stemmed with the Porter stemming algorithm. After applying a stop list of
the 300 most common English words, the individual parts of the collected text were coded to include
abstract sections, body paragraphs, captions, and section titles. We created similar hybrid
section-name/stemmed-word features, using the stoplisted and stemmed section title in conjunction with the
stoplisted and stemmed words in the named section. In addition, we downloaded the corresponding
MEDLINE records from PubMed. For each article, the corresponding MeSH headings were
extracted. We included MeSH-based features based on the full MeSH headings, the MeSH main
headings, and the MeSH subheadings. Finally, we included features based on details in the reference
section of each text. The first author of each reference was taken as one feature type. We also
included a long form of the reference as a feature type, including the first author, journal name,
volume, year, and page number. Running the feature generation process on the full set of 5837 training
documents created over 100,000 potentially useful features, along with a count of the number of
documents containing each feature.
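As an illustration, binary section/stemmed-word features of the kind described above could be generated along these lines. This is a hypothetical Python sketch, not the authors' code: the naive suffix-stripping stemmer and tiny stop list stand in for the Porter stemmer and the 300-word stop list used in the actual system.

```python
from collections import Counter

STOPWORDS = {"the", "of", "and", "in", "to", "a"}  # stand-in for the 300-word stop list

def naive_stem(word):
    # Crude stand-in for the Porter stemmer used in the original system.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def document_features(sections):
    """Binary features: (section name, stemmed word) pairs, present or absent."""
    feats = set()
    for section, text in sections.items():
        for word in text.lower().split():
            if word not in STOPWORDS:
                feats.add((section, naive_stem(word)))
    return feats

def feature_document_counts(corpus):
    """For each feature, count how many documents contain it."""
    counts = Counter()
    for doc in corpus:
        counts.update(document_features(doc))
    return counts
```

Because each feature is binary, a document contributes at most one count per feature regardless of how often the word occurs in that section.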
2. Feature selection
We opted to use the chi-square selection method to pick the features that best differentiated
between positive and negative documents in the training corpus. The 2x2 chi-square contingency table is
constructed as shown in Table 1, using the document counts obtained in the previous stage.
During system tuning, an alpha value of 0.025 was found to produce the best results. Using this
value as a cut-off, 1885 features were selected as the most significant. The number and type of each
feature found significant and used in the following steps are shown in Table 2.
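For reference, the chi-square statistic for a 2x2 table can be computed directly with the standard shortcut formula. This is a minimal sketch, not the authors' code; with one degree of freedom, an alpha of 0.025 corresponds to a critical value of about 5.02, so a feature would be retained when its statistic exceeds that cut-off.

```python
CRITICAL_0_025 = 5.024  # chi-square critical value, df = 1, alpha = 0.025

def chi_square_2x2(a, b, c, d):
    """
    Chi-square statistic for a 2x2 contingency table, where
      a = positive documents containing the feature
      b = positive documents lacking it
      c = negative documents containing it
      d = negative documents lacking it
    """
    n = a + b + c + d
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / den if den else 0.0

def is_significant(a, b, c, d, critical=CRITICAL_0_025):
    """Keep a feature only if its statistic exceeds the cut-off."""
    return chi_square_2x2(a, b, c, d) > critical
```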
3. Classifier selection and training
Three specific classifiers were applied to the problem: Naive Bayes, SVM, and Voting Perceptron.
Although it is widely thought that the best classifiers are based on Vapnik's SVM method (Vapnik,
2000), the distinctive aspects of the classification problem discussed above led us to apply
three different classifiers. Using the same feature set with each of the classifiers allowed us to
compare how well each classification algorithm matched the particular requirements of the triage
task.
Neither Naive Bayes nor the SVM implementation we used, SVMLight (Joachims, 2004),
offered adequate means to compensate for the low frequency of positives and the high value of a true positive
relative to a false positive. We used our own implementation of Naive Bayes. Naive Bayes provides a
classification probability threshold that can be used to trade off precision and recall.
However, this is an indirect form of compensation, and in practice, for this classification task,
we found that adjusting the probability threshold did not have a meaningful impact.
We fully expected SVMLight to perform better than Naive Bayes, as it provides a cost-factor
parameter that can be adjusted to impose unequal penalties for false positives and false negatives.
Nevertheless, we found that the effect of this parameter was limited and insufficient to
account for the 20-fold difference between the costs of false negatives and false positives. Since neither Naive
Bayes nor one of the most common SVM implementations met our requirements, something
else was needed.
A survey of the classification literature reveals considerable progress in adapting the classic
Rosenblatt perceptron algorithm (Rosenblatt, 1958) to achieve performance at or near SVM levels on several
problems. One algorithm in particular, the Voting Perceptron algorithm (Freund and Schapire, 1999),
performs quite well and is very fast and easy to implement. Although the algorithm as published
does not provide a way to account for asymmetric false positive and false negative penalties, we
made a modification to the algorithm that does. A perceptron is essentially an equation for a linear
combination of the values of the set of features.
For every element in the feature set, there is one term in the perceptron, plus an optional bias term.
A document is classified by taking the dot product of the document's feature vector with the
perceptron and adding the bias term. If the result is greater than zero, the document is
classified as positive; if it is less than or equal to zero, the document is classified as negative.
Rosenblatt's original algorithm trained the perceptron by applying it to each sample in the
training data.
If a sample was incorrectly labeled, the perceptron was adjusted by adding or subtracting the sample
back into the perceptron: adding when the sample was actually positive, and subtracting when the
sample was actually negative. Over a large number of training samples, the perceptron converges on a
solution that approximates the distinction between positive and negative documents in the
training set. Freund and Schapire improved the performance of the perceptron by modifying the
algorithm to produce a series of perceptrons, each of which makes a prediction about the class of
each document and receives a number of "votes" based on how many documents that perceptron
correctly classified in the training set.
The class with the most votes is the class assigned to the document. Our extension to this algorithm is
based on separate modification of the perceptron learning rate for false negatives and false
positives. Whereas incorrectly classified samples are added or subtracted back into the
perceptron directly in the typical implementation, we first multiply the sample by a factor known as the
learning rate. In addition, we use separate learning rates for false positives and false negatives.
Given the shape of the utility function, we predicted that the optimal learning rate for false
negatives would be around 20 times that for false positives.
In practice, that is what we observed during training: we used a learning rate of 20.0 for false negatives and 1.0 for
false positives. Each of the three classifiers was trained on the training corpus, and ten-fold cross-
validation was used to optimize all free parameters. The Naive Bayes classifier had one free
parameter, the classification probability threshold, which was left at the default value of 0.50. The
selected SVMLight classifier settings used a linear kernel and a cost factor of 20.0. The Voting
Perceptron classifier used a linear kernel and the learning rates given above. For each of
the three approaches, a trained classifier model was produced.
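The modified training procedure can be sketched as follows. This is an illustrative Python implementation based on the description above, not the authors' code; the epoch count and toy data are arbitrary choices for the sketch.

```python
def train_voting_perceptron(samples, epochs=5, lr_fn=20.0, lr_fp=1.0):
    """
    Voting Perceptron (Freund & Schapire, 1999) with asymmetric learning
    rates: mistakes on positives (false negatives) are corrected lr_fn
    times more aggressively than mistakes on negatives (false positives).
    samples: list of (feature_vector, label) pairs with label in {+1, -1}.
    """
    dim = len(samples[0][0])
    w, bias = [0.0] * dim, 0.0
    perceptrons = []          # list of (weights, bias, votes)
    votes = 0
    for _ in range(epochs):
        for x, y in samples:
            score = sum(wi * xi for wi, xi in zip(w, x)) + bias
            pred = 1 if score > 0 else -1    # <= 0 classified as negative
            if pred == y:
                votes += 1                   # survived one more sample
            else:
                perceptrons.append((w[:], bias, votes))
                rate = lr_fn if y > 0 else lr_fp  # asymmetric penalty
                w = [wi + rate * y * xi for wi, xi in zip(w, x)]
                bias += rate * y
                votes = 1
    perceptrons.append((w[:], bias, votes))
    return perceptrons

def classify(perceptrons, x):
    """Each stored perceptron votes with its survival count."""
    total = 0.0
    for w, bias, votes in perceptrons:
        score = sum(wi * xi for wi, xi in zip(w, x)) + bias
        total += votes * (1 if score > 0 else -1)
    return 1 if total > 0 else -1
```

Setting `lr_fn` to 20.0 and `lr_fp` to 1.0 reproduces the 20:1 cost ratio of the utility function described above.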
4. Classification of test documents
Finally, the test corpus was run through the models produced by the Naive Bayes, SVM, and Voting
Perceptron classifiers. This was done in two steps. The documents in the test corpus were first examined
for the presence or absence of the significant features identified during the selection step. This
generated a feature vector for each test document. The documents were then categorized by applying
each of the three trained classifiers.
5. Evaluation of concept drift
One critical problem in applying text classification systems to documents of interest to curators and
annotators is how well the available training data reflect the documents to be categorized. When
classifying biomedical text, the available training documents must have been written before
the text to be categorized. However, by its very nature, the field of science shifts over
time, as does the vocabulary used to describe it.
How quickly the written scientific literature changes has a direct effect on the design of biomedical text
classification systems: it influences how features are generated and selected, how often the systems
need to be re-trained, how much training data is required, and the overall performance that can be
expected from such systems. We therefore decided to begin to study this significant issue of
concept drift in the biomedical literature.
In order to determine how well the features chosen from the training collection reflected the
information relevant to classifying the documents in the test collection, we performed additional
feature generation and selection steps on the test collection. The exact same method and
parameters were used for the test collection as for the training collection. We then measured
how well the training-collection feature set reflected the test-collection feature set by
computing similarity metrics between the two sets (Dunham, 2003).
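One simple set-similarity metric of the kind discussed by Dunham (2003) is the Jaccard coefficient over the two feature sets. A minimal sketch; the specific choice of metric here is an illustration, not necessarily the one used in the study:

```python
def jaccard_similarity(train_features, test_features):
    """Set overlap as a rough measure of drift between the two years:
    1.0 means identical feature sets, 0.0 means no features in common."""
    inter = len(train_features & test_features)
    union = len(train_features | test_features)
    return inter / union if union else 1.0
```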
9. FEATURE SELECTION ALGORITHMS
Feature selection is also called variable selection or attribute selection.
It is the automatic selection of the attributes in your data (such as columns in tabular data) that are most
relevant to the predictive modeling problem you are working on.
"Feature selection ... is the process of selecting a subset of relevant features for use in model
construction."
The selection of features is distinct from the reduction of dimensionality. Both methods aim to
minimize the number of attributes in the dataset, but the dimensional reduction approach does so by
introducing new combinations of attributes, while the feature selection methods include and remove
attributes present in the data without modifying them.
Examples of dimensionality reduction methods include Principal Component Analysis, Singular
Value Decomposition and Sammon's Mapping.
“Feature selection is itself useful, but it mostly acts as a filter, muting out features that aren’t useful
in addition to your existing features”.
9.1 The Problem That Feature Selection Solves
Feature selection methods help you build an effective predictive model for your task. They do so
by choosing features that will give you as good or better accuracy while requiring less data.
Feature selection approaches can be used to identify and remove unneeded, irrelevant, and redundant
attributes from data that do not contribute to the accuracy of the predictive model, or that may even
reduce the accuracy of the model.
Fewer attributes are preferable because they reduce the complexity of the model, and a simpler
model is easier to understand and describe.
The goal of variable selection is threefold:
1. to enhance predictor efficiency,
2. to provide quicker and more cost-effective predictors,
3. and to provide a better understanding of the underlying process that generated the data.
9.2 Feature Selection Algorithms
There are three general classes of feature selection algorithms:
1. Filter methods,
2. Wrapper methods,
3. Embedded methods.
Filter Methods
Filter feature selection approaches apply a statistical measure to assign a score to each feature. The features
are ranked by score and either kept or removed from the dataset. These methods are
often univariate and consider each feature independently, or with regard to the dependent variable.
Examples of filter methods include the chi-squared test, information gain, and correlation
coefficient scores.
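As an illustration of a filter method, numeric features can be scored by the absolute value of their Pearson correlation with the target and then ranked. A self-contained Python sketch; the function names and toy data are hypothetical:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between one feature and the target."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def rank_features(columns, target):
    """Rank feature columns by |correlation| with the target, best first."""
    scores = {name: abs(pearson(col, target)) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

The ranking is univariate: each feature is scored on its own, which is exactly the property (and the limitation) of filter methods noted above.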
Wrapper Methods
Wrapper approaches treat the selection of a set of features as a search problem, where different
combinations are prepared, evaluated, and compared to other combinations. A predictive model is
used to evaluate each combination of features and assign a score based on model accuracy.
The search process may be methodical, such as a best-first search; stochastic, such as a random hill-climbing
algorithm; or heuristic, using forward and backward passes to add and
remove features.
An example of the wrapper approach is the recursive feature elimination (RFE) algorithm.
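A wrapper search can be sketched as a greedy backward elimination loop in the spirit of RFE. The `evaluate` callback is an assumption standing in for a real model-evaluation step (e.g. cross-validated accuracy of a model trained on that subset), not a specific library API:

```python
def backward_elimination(features, evaluate, min_features=1):
    """
    Greedy wrapper search: repeatedly drop the feature whose removal
    hurts the evaluation score least, stopping when every removal
    makes the score worse.
    """
    selected = list(features)
    while len(selected) > min_features:
        best_subset, best_score = None, float("-inf")
        for f in selected:
            subset = [g for g in selected if g != f]
            score = evaluate(subset)       # train/score a model per subset
            if score > best_score:
                best_subset, best_score = subset, score
        if evaluate(selected) >= best_score:
            break                          # removing anything makes it worse
        selected = best_subset
    return selected
```

Because `evaluate` is called once per candidate subset in every round, the cost grows quickly with the number of features, which is the computational expense attributed to wrapper methods above.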
Embedded Methods
Embedded methods learn which features contribute most to the accuracy of the model while the
model is being built. Regularization methods are the most common type of embedded feature
selection method. Regularization methods, also called penalization methods, introduce
additional constraints into the optimization of a predictive algorithm (such as a regression algorithm) that
bias the model toward lower complexity (smaller coefficients). Examples of regularization
algorithms include LASSO, Elastic Net and Ridge Regression.
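The coefficient-shrinking effect of regularization can be seen in a toy one-variable ridge regression: the penalty term alpha * w^2 pulls the learned coefficient toward zero. A hypothetical sketch fit by gradient descent (the learning rate and step count are arbitrary choices for the demo):

```python
def ridge_fit(xs, ys, alpha, lr=0.01, steps=2000):
    """
    One-feature, no-intercept ridge regression fit by gradient descent:
    minimizes mean squared error plus alpha * w**2. A larger alpha
    shrinks the coefficient w toward zero.
    """
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        # d/dw of (1/n) * sum((w*x - y)^2) + alpha * w^2
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n + 2 * alpha * w
        w -= lr * grad
    return w
```

With alpha = 0 this recovers ordinary least squares; increasing alpha trades a little fit for a smaller coefficient, which is how the penalty performs embedded feature selection (near-zero coefficients effectively drop features).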
9.3 How to Choose a Feature Selection Method for Machine Learning
Feature selection is the process of reducing the number of input variables when developing a predictive
model. It is desirable to reduce the number of input variables, both to reduce the computational cost
of modeling and, in some cases, to improve the performance of the model.
Filter-based feature selection approaches evaluate the relationship between each input
variable and the target variable using statistics, and select those input variables that have the strongest
relationship with the target variable. These methods can be fast and effective, although the choice of
statistical measure depends on the data types of both the input and output variables. As such, it can be
challenging for a machine learning practitioner to select an appropriate statistical measure for a dataset
when performing filter-based feature selection.
1. Feature Selection Methods
Feature selection approaches are designed to reduce the number of input variables to those deemed
most useful for the model in order to predict the target variable.
Some predictive modeling problems have a large number of variables that can slow the
development and training of models and require a large amount of memory. In addition, the
performance of some models can be degraded by including input variables that are not relevant to the
target variable.
There are two major types of feature selection algorithms: the wrapper method and the filter method.
1. Wrapper Feature Selection Methods.
2. Filter Feature Selection Methods.
1. Wrapper feature selection approaches build several models with different subsets of input
features and select the features that yield the best-performing model according to a performance
metric. These methods are unconcerned with variable types, although they can be
computationally expensive. RFE is a good example of a wrapper feature selection method.
Wrapper methods evaluate multiple models using procedures that add and/or remove predictors to find
the combination that maximizes model performance.
2. Filter feature selection approaches use statistical techniques to score the relationship between
each input variable and the target variable, and these scores are used as the basis for choosing
(filtering) the input variables that will be used in the model. Filter methods evaluate the relevance of
predictors outside of predictive models and then model only the predictors that pass some criterion.
Correlation-type statistical measures between input and output variables are widely used as
the basis for filter feature selection. As such, the choice of statistical measure is highly
dependent on the variables' data types. Common data types include numerical (such as height) and
categorical (such as a label), although each can be further subdivided: integer and floating point
for numerical variables, and boolean, ordinal, or nominal for categorical variables.

Unblocking The Main Thread - Solving ANRs and Frozen FramesUnblocking The Main Thread - Solving ANRs and Frozen Frames
Unblocking The Main Thread - Solving ANRs and Frozen Frames
Sinan KOZAK
 
Development of Chatbot Using AI/ML Technologies
Development of  Chatbot Using AI/ML TechnologiesDevelopment of  Chatbot Using AI/ML Technologies
Development of Chatbot Using AI/ML Technologies
maisnampibarel
 

Recently uploaded (20)

SCADAmetrics Instrumentation for Sensus Water Meters - Core and Main Training...
SCADAmetrics Instrumentation for Sensus Water Meters - Core and Main Training...SCADAmetrics Instrumentation for Sensus Water Meters - Core and Main Training...
SCADAmetrics Instrumentation for Sensus Water Meters - Core and Main Training...
 
Social media management system project report.pdf
Social media management system project report.pdfSocial media management system project report.pdf
Social media management system project report.pdf
 
Software Engineering and Project Management - Introduction to Project Management
Software Engineering and Project Management - Introduction to Project ManagementSoftware Engineering and Project Management - Introduction to Project Management
Software Engineering and Project Management - Introduction to Project Management
 
OCS Training - Rig Equipment Inspection - Advanced 5 Days_IADC.pdf
OCS Training - Rig Equipment Inspection - Advanced 5 Days_IADC.pdfOCS Training - Rig Equipment Inspection - Advanced 5 Days_IADC.pdf
OCS Training - Rig Equipment Inspection - Advanced 5 Days_IADC.pdf
 
kiln burning and kiln burner system for clinker
kiln burning and kiln burner system for clinkerkiln burning and kiln burner system for clinker
kiln burning and kiln burner system for clinker
 
How to Manage Internal Notes in Odoo 17 POS
How to Manage Internal Notes in Odoo 17 POSHow to Manage Internal Notes in Odoo 17 POS
How to Manage Internal Notes in Odoo 17 POS
 
Exploring Deep Learning Models for Image Recognition: A Comparative Review
Exploring Deep Learning Models for Image Recognition: A Comparative ReviewExploring Deep Learning Models for Image Recognition: A Comparative Review
Exploring Deep Learning Models for Image Recognition: A Comparative Review
 
PMSM-Motor-Control : A research about FOC
PMSM-Motor-Control : A research about FOCPMSM-Motor-Control : A research about FOC
PMSM-Motor-Control : A research about FOC
 
Germany Offshore Wind 010724 RE (1) 2 test.pptx
Germany Offshore Wind 010724 RE (1) 2 test.pptxGermany Offshore Wind 010724 RE (1) 2 test.pptx
Germany Offshore Wind 010724 RE (1) 2 test.pptx
 
L-3536-Cost Benifit Analysis in ESIA.pptx
L-3536-Cost Benifit Analysis in ESIA.pptxL-3536-Cost Benifit Analysis in ESIA.pptx
L-3536-Cost Benifit Analysis in ESIA.pptx
 
Biology for computer science BBOC407 vtu
Biology for computer science BBOC407 vtuBiology for computer science BBOC407 vtu
Biology for computer science BBOC407 vtu
 
21CV61- Module 3 (CONSTRUCTION MANAGEMENT AND ENTREPRENEURSHIP.pptx
21CV61- Module 3 (CONSTRUCTION MANAGEMENT AND ENTREPRENEURSHIP.pptx21CV61- Module 3 (CONSTRUCTION MANAGEMENT AND ENTREPRENEURSHIP.pptx
21CV61- Module 3 (CONSTRUCTION MANAGEMENT AND ENTREPRENEURSHIP.pptx
 
21EC63_Module1B.pptx VLSI design 21ec63 MOS TRANSISTOR THEORY
21EC63_Module1B.pptx VLSI design 21ec63 MOS TRANSISTOR THEORY21EC63_Module1B.pptx VLSI design 21ec63 MOS TRANSISTOR THEORY
21EC63_Module1B.pptx VLSI design 21ec63 MOS TRANSISTOR THEORY
 
Quadcopter Dynamics, Stability and Control
Quadcopter Dynamics, Stability and ControlQuadcopter Dynamics, Stability and Control
Quadcopter Dynamics, Stability and Control
 
Profiling of Cafe Business in Talavera, Nueva Ecija: A Basis for Development ...
Profiling of Cafe Business in Talavera, Nueva Ecija: A Basis for Development ...Profiling of Cafe Business in Talavera, Nueva Ecija: A Basis for Development ...
Profiling of Cafe Business in Talavera, Nueva Ecija: A Basis for Development ...
 
Unit 1 Information Storage and Retrieval
Unit 1 Information Storage and RetrievalUnit 1 Information Storage and Retrieval
Unit 1 Information Storage and Retrieval
 
MSBTE K Scheme MSBTE K Scheme MSBTE K Scheme MSBTE K Scheme
MSBTE K Scheme MSBTE K Scheme MSBTE K Scheme MSBTE K SchemeMSBTE K Scheme MSBTE K Scheme MSBTE K Scheme MSBTE K Scheme
MSBTE K Scheme MSBTE K Scheme MSBTE K Scheme MSBTE K Scheme
 
Understanding Cybersecurity Breaches: Causes, Consequences, and Prevention
Understanding Cybersecurity Breaches: Causes, Consequences, and PreventionUnderstanding Cybersecurity Breaches: Causes, Consequences, and Prevention
Understanding Cybersecurity Breaches: Causes, Consequences, and Prevention
 
Unblocking The Main Thread - Solving ANRs and Frozen Frames
Unblocking The Main Thread - Solving ANRs and Frozen FramesUnblocking The Main Thread - Solving ANRs and Frozen Frames
Unblocking The Main Thread - Solving ANRs and Frozen Frames
 
Development of Chatbot Using AI/ML Technologies
Development of  Chatbot Using AI/ML TechnologiesDevelopment of  Chatbot Using AI/ML Technologies
Development of Chatbot Using AI/ML Technologies
 

Data science and visualization MODULE 3 FG&FS

Data Wrangling in R

1. dplyr – the fundamental data-munging R package and a supreme tool for data framing. Particularly useful for operating on grouped categories of data.
2. purrr – good for list manipulation and error-checked functions.
3. splitstackshape – an oldie but a goldie. Perfect for shaping complex data sets and making visualization easier.
4. JSOnline – a simple and quick JSON parsing tool.
5. magrittr – good for wrangling scattered sets and piping them into a more cohesive shape.

8. FEATURE GENERATION

8.1 INTRODUCTION

The 2004 Text Retrieval Conference (TREC) Genomics Track was divided into two main tasks: categorization and ad hoc retrieval. The categorization task consisted of a document triage subtask and an annotation subtask to detect the presence of evidence in the document for each of the three main Gene Ontology (GO) code hierarchies. Our work focused on the document triage subtask. We also participated in the ad hoc retrieval task.

8.2 BACKGROUND

Document classification is a common problem in biomedicine. Training a support vector machine (SVM) on vectors generated from stemmed and/or stopped document word counts has proven to be a simple and generally effective method (Yeh et al., 2003).

However, the triage problem posed here had some distinctive features that called for a modification of the standard approach. First, the number of true positive documents in both the training and the test set was low, about 6-7%. Second, the utility function chosen as the scoring metric was heavily weighted to reward recall over precision.

This weighting was based on a review of the existing working procedures of the annotators at the Mouse Genome Informatics (MGI) group and an estimate of how they actually value false negative and false positive classifications. The official utility function weights a false negative as 20 times more costly than a false positive.
Using this metric, the existing MGI work procedure, in which all the documents in the test set are read, has a utility of 0.25. In addition, the training and evaluation samples were not randomly drawn from the same pool, but were obtained from documents published in two consecutive years. While this is a more realistic simulation of how the system would be applied at MGI, it raises the question of how well the features derived from one year of literature reflect the literature of subsequent years. In response to these problems, our approach included a rich collection of features, statistically based feature selection, multiple classifiers, and an analysis of how well the features extracted from the 2002 corpus reflected the documents in the 2003 corpus.
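As a sketch of how such a utility behaves, the snippet below assumes a normalized utility of the form (w·TP − FP) / (w·AP), with w = 20 and AP the number of positive documents; the exact TREC formula and the corpus counts used here are illustrative assumptions, not values from the track.

```python
def normalized_utility(tp, fp, all_pos, fn_weight=20.0):
    """Normalized utility: raw utility divided by the best possible score.

    A false negative costs fn_weight times a false positive, so the raw
    utility is fn_weight * tp - fp and the maximum is fn_weight * all_pos.
    """
    return (fn_weight * tp - fp) / (fn_weight * all_pos)

# A triage procedure that "reads everything" marks every document positive.
# With 375 true positives among 6000 documents (hypothetical counts):
all_pos, all_neg = 375, 5625
print(normalized_utility(tp=all_pos, fp=all_neg, all_pos=all_pos))  # 0.25
```

With roughly 6% positives, reading everything already scores 0.25 under this weighting, which is why a classifier must favor recall heavily to beat the status quo.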
8.3 SYSTEM AND METHODS

We tackled the triage problem in four steps: feature generation, feature selection, classifier selection and training, and finally classification of the test documents. Only the training collection was used to perform the first three steps. The final step was applied to the test collection to produce the submitted results. During development, we used ten-fold cross-validation on the training set to compare approaches and set system parameters. This included running the first two steps on the entire training collection. Then 90% of the training data was used to train the classifiers, which were then applied to the remaining 10% of the training data. This was repeated until all of the training data had been classified once, and the results were aggregated to compute cross-validation metrics for the training corpus. Figure 17 displays these steps diagrammatically.

Figure 17: Step-wise approach to test classification

1. Feature generation

The full-text corpus with SGML mark-up offered an opportunity to explore several types of features. While many text classification methods view text as a "bag of words," we opted to use the information contained in the SGML mark-up to generate distinct section-specific feature types. Since we merged features that could occur several times in a single document with features that could occur only once, after some initial testing we decided to treat each feature as binary; that is, each feature was either present in a document or absent. One type of feature we created consisted of pairs of section names and words stemmed with the Porter stemming algorithm. After applying a stop list of the 300 most common English words, the individual parts of the text were encoded, including abstract sections, body paragraphs, captions, and section titles.
We created similar hybrid section/stemmed-word features using the stopped and stemmed section title in conjunction with the stopped and stemmed words in the named section. In addition, we downloaded the corresponding MEDLINE records from PubMed. For each record, the associated MeSH headings were extracted. We included MeSH-based features based on the full MeSH headings, the MeSH main headings, and the MeSH subheadings. Finally, we included features based on details in the reference section of each text. The first author of each reference was taken as one feature type. We also
included a longer form of reference as a feature type, combining the primary author, journal name, volume, year, and page number. Running the feature generation process on the full set of 5837 training documents created over 100,000 potentially useful features, along with a count of the number of documents containing each feature.

2. Feature selection

We opted to use the Chi-square selection method to pick the features that best differentiated between positive and negative documents in the training corpus. The 2x2 Chi-square table is constructed as shown in Table 1, using the document counts obtained in the previous stage. During system tuning, an alpha value of 0.025 was found to produce the best results. Using this value as a cut-off, 1885 features were selected as significant. The number and type of each feature found significant and used in the following steps are shown in Table 2.

3. Classifier selection and training

Three different classifiers were applied to the problem: Naive Bayes, SVM, and Voting Perceptron. Although it is widely held that the best-performing classifiers are based on Vapnik's SVM method (Vapnik, 2000), the distinctive aspects of the classification problem discussed above led us to apply three different classifiers. Using the same feature set with each classifier allowed us to compare the performance of the classifier algorithms against the particular requirements of the triage task.
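The 2x2 chi-square score used in the feature selection step can be computed directly from the four document counts. The counts below are hypothetical (chosen to sum to the 5837 training documents), and 5.02 is the approximate critical value for alpha = 0.025 with one degree of freedom.

```python
def chi_square_2x2(pos_with, pos_without, neg_with, neg_without):
    """Chi-square statistic for a 2x2 feature/class contingency table.

    Rows: feature present / absent; columns: positive / negative documents.
    Uses the standard closed form n * (ad - bc)^2 / ((a+b)(c+d)(a+c)(b+d)).
    """
    a, b, c, d = pos_with, neg_with, pos_without, neg_without
    n = a + b + c + d
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / den if den else 0.0

# Hypothetical counts: a feature appearing in 30 of 375 positive documents
# but only 50 of 5462 negative documents looks strongly class-associated.
stat = chi_square_2x2(pos_with=30, pos_without=345, neg_with=50, neg_without=5412)
print(stat > 5.02)  # True: the feature passes the alpha = 0.025 cut-off
```

Ranking all candidate features by this statistic and keeping those above the critical value is what reduced the 100,000+ candidates to the 1885 selected features.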
Neither Naive Bayes nor the SVM implementation we used, SVMLight (Joachims, 2004), offered adequate means to compensate for the low frequency of positives and the high cost of a false negative relative to a false positive. We used our own implementation of Naive Bayes. Naive Bayes provides a classification probability threshold that can be used to trade off precision and recall. However, this is an indirect form of compensation, and in practice, for this classification task, we found that raising the probability threshold did not have a meaningful impact. We fully expected SVMLight to perform better than Naive Bayes, as it includes a cost-factor parameter that can be adjusted to impose unequal penalties on false positives and false negatives. Nevertheless, we found that the effect of this parameter was limited and insufficient to account for the 20-fold difference between the cost of false negatives and false positives. Since neither Naive Bayes nor one of the most common SVM implementations met our requirements, something else was needed. A review of the classification literature reveals considerable progress in adapting the classical Rosenblatt Perceptron algorithm (Rosenblatt, 1958) to achieve performance at or near that of SVMs on several problems. One algorithm in particular, the Voting Perceptron algorithm (Freund and Schapire, 1999), has quite good performance and is very fast and easy to implement. Although the algorithm as published does not provide a way to account for asymmetric false negative and false positive penalties, we made a modification to the algorithm that does. A perceptron is essentially an equation for a linear combination of the values of the feature set. For every element in the feature set there is one term in the perceptron, plus an optional bias term. A document is classified by taking the dot product of the document's feature vector with the perceptron weights and adding the bias term.
If the result is greater than zero, the document is classified as positive; if it is less than or equal to zero, the document is classified as negative. Rosenblatt's original algorithm trained the perceptron by applying it to each sample in the training data.
If a sample was wrongly labeled, the perceptron was adjusted by adding or subtracting the sample's feature vector: adding when the sample was actually positive, and subtracting when it was actually negative. Over a large number of training samples, the perceptron converges on a solution that approximates the boundary between positive and negative documents in the training set. Freund and Schapire improved the performance of the perceptron by modifying the algorithm to produce a series of perceptrons, each of which makes a prediction about the class of each document and receives a number of "votes" depending on how many documents that perceptron correctly classified in the training set. The class with the most votes is the class assigned to the document. Our extension to this algorithm is a specific modification of the perceptron learning rate for false negatives and false positives. Whereas in the typical implementation incorrectly classified samples are added or subtracted directly back into the perceptron, we first multiply the sample by a factor known as the learning rate. In addition, we use separate learning rates for false positives and false negatives. Given the shape of the utility function, we predicted that the optimal learning rate for false negatives would be around 20 times that for false positives, and that is indeed what we observed during training. We used 20.0 for false negatives and 1.0 for false positives. Each of the three classifiers was trained on the training corpus. Ten-fold cross-validation was used to optimize all free parameters. The Naive Bayes classifier had one free parameter, the classification probability threshold, which was left at its default value of 0.50. The selected SVMLight settings used a linear kernel and a cost factor of 20.0. The Voting Perceptron classifier was used with a linear kernel and the learning rates given above.
For each of the three approaches, a trained classifier model was produced.

4. Classification of test documents

Finally, the test corpus was applied to the models developed by the Naive Bayes, SVM, and Voting Perceptron classifiers. This was done in two steps. The documents in the test collection were first examined for the presence or absence of the significant features identified during the selection process, which generated a feature vector for each test document. The documents were then categorized by applying each of the three trained classifiers.
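The asymmetric-learning-rate modification described above can be sketched as follows. This is a simplified single perceptron (not the full voting scheme) trained on made-up toy data, shown only to illustrate how false negatives are corrected twenty times more aggressively than false positives.

```python
def train_perceptron(samples, labels, epochs=20, lr_fn=20.0, lr_fp=1.0):
    """Perceptron training with asymmetric learning rates.

    A false negative (positive sample scored <= 0) is corrected with
    learning rate lr_fn; a false positive with lr_fp. With lr_fn = 20.0
    and lr_fp = 1.0 this mirrors the 20:1 utility weighting above.
    """
    w = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):   # y is +1 or -1
            score = sum(wi * xi for wi, xi in zip(w, x)) + bias
            if y > 0 and score <= 0:        # false negative: big correction
                w = [wi + lr_fn * xi for wi, xi in zip(w, x)]
                bias += lr_fn
            elif y < 0 and score > 0:       # false positive: small correction
                w = [wi - lr_fp * xi for wi, xi in zip(w, x)]
                bias -= lr_fp
    return w, bias

# Tiny hypothetical data set: binary feature-presence vectors.
X = [[1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 0]]
y = [1, 1, -1, -1]
w, b = train_perceptron(X, y)
preds = [1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1 for x in X]
print(preds)  # [1, 1, -1, -1]
```

In the full Voting Perceptron, a sequence of such perceptrons is kept and each votes on the final class in proportion to how many training documents it classified correctly.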
5. Evaluation of conceptual drift

One critical problem in applying text classification systems to documents of interest to curators and annotators is how well the available training data reflect the documents to be categorized. When classifying biomedical text, the available training documents must have been written in advance of the text to be categorized. However, by its very nature, the field of science shifts over time, as does the vocabulary used to describe it. How quickly the written scientific literature changes has a direct effect on the design of biomedical text classification systems: it affects how features are generated and selected, how often the systems need to be re-trained, how much training data is required, and the overall performance that can be expected from such systems. We decided to begin studying this significant issue of conceptual drift in the biomedical literature. In order to determine how well the features chosen from the training collection captured the information relevant to classifying the documents in the test collection, we took additional steps when generating features for the test collection. The exact same method and parameters were used for the test collection as for the training collection. We then measured how well the training collection feature set reflected the test collection feature set by computing similarity metrics between the two sets (Dunham, 2003).

9. FEATURE SELECTION ALGORITHMS

Feature selection is also called variable selection or attribute selection. It is the automated selection of attributes in your data (such as columns in tabular data) that are most relevant to the predictive modeling problem you are working on. "Feature selection ... is the process of selecting a subset of relevant features for use in model construction." Feature selection is distinct from dimensionality reduction.
Both methods aim to reduce the number of attributes in the dataset, but dimensionality reduction does so by creating new combinations of attributes, whereas feature selection methods include and exclude attributes present in the data without changing them. Examples of dimensionality reduction methods include Principal Component Analysis, Singular Value Decomposition, and Sammon's Mapping. "Feature selection is itself useful, but it mostly acts as a filter, muting out features that aren't useful in addition to your existing features."
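The distinction can be made concrete with a toy sketch. Selection keeps a subset of the original columns untouched, while reduction builds new columns as linear combinations; the projection weights below are invented for illustration, not fitted the way PCA would fit them.

```python
def select_columns(rows, keep):
    """Feature selection: keep a subset of the original columns unchanged."""
    return [[r[i] for i in keep] for r in rows]

def project(rows, components):
    """Dimensionality reduction: build new features as linear combinations
    of the original columns (as PCA or SVD would, with learned weights)."""
    return [[sum(x * w for x, w in zip(r, comp)) for comp in components]
            for r in rows]

data = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

# Selection leaves the original attribute values intact:
print(select_columns(data, keep=[0, 2]))            # [[1.0, 3.0], [4.0, 6.0]]

# Reduction produces new, combined attributes:
print(project(data, components=[[0.5, 0.5, 0.0]]))  # [[1.5], [4.5]]
```

The selected columns remain interpretable as the original attributes, which is a practical advantage of feature selection over projection-based reduction.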
9.1 The Problem That Feature Selection Solves

Feature selection methods help you build an effective predictive model. They do so by choosing features that give you as good or better accuracy while requiring less data. Feature selection methods can be used to identify and remove unneeded, irrelevant, and redundant attributes that do not contribute to the accuracy of the predictive model, or that may even reduce its accuracy. Fewer attributes are desirable because they reduce the complexity of the model, and a simpler model is easier to understand and explain.

The goal of variable selection is threefold:

1. to improve the prediction performance of the predictors,
2. to provide faster and more cost-effective predictors,
3. and to provide a better understanding of the underlying process that generated the data.

9.2 Feature Selection Algorithms

There are three general classes of feature selection algorithms:

1. Filter methods,
2. Wrapper methods,
3. Embedded methods.

Filter Methods

Filter feature selection methods apply a statistical measure to assign a score to each feature. The features are ranked by score and either selected to be kept or removed from the dataset. These methods are often univariate and consider each feature independently, or with regard to the dependent variable. Examples of filter methods include the Chi-square test, information gain, and correlation coefficient scores.

Wrapper Methods

Wrapper methods consider the selection of a set of features as a search problem, where different combinations are prepared, evaluated, and compared to other combinations. A predictive model is used to evaluate each combination of features and assign a score based on model accuracy.
The search process may be methodical, such as a best-first search; stochastic, such as a random hill-climbing algorithm; or heuristic, such as forward and backward passes to add and remove features. An example of a wrapper method is the recursive feature elimination algorithm.

Embedded Methods

Embedded methods learn which features best contribute to the accuracy of the model while the model is being built. The most common type of embedded feature selection method is regularization. Regularization methods, also called penalization methods, introduce additional constraints into the optimization of a predictive algorithm (such as a regression algorithm) that bias the model toward lower complexity (smaller coefficients). Examples of regularization algorithms include LASSO, Elastic Net, and Ridge Regression.

9.3 How to Choose a Feature Selection Method for Machine Learning

Feature selection is the process of reducing the number of input variables when developing a predictive model. It is desirable to reduce the number of input variables both to lower the computational cost of modeling and, in some cases, to improve the performance of the model. Filter-based feature selection methods evaluate the relationship between each input variable and the target variable using statistics, and choose the input variables that have the strongest relationship with the target. These methods can be fast and effective, although the choice of statistical measure depends on the data types of both the input and output variables. As such, it can be challenging for a machine learning practitioner to select an appropriate statistical measure for a dataset when performing filter-based feature selection.

1. Feature Selection Methods

Feature selection methods are intended to reduce the number of input variables to those believed to be most useful to a model in order to predict the target variable.
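The mechanism by which L1 regularization (LASSO) performs embedded selection can be sketched with the soft-thresholding operator used inside coordinate-descent LASSO solvers. The coefficients below are made up, and a real solver applies this operator repeatedly inside a least-squares fit rather than once.

```python
def soft_threshold(coef, penalty):
    """L1 (LASSO) shrinkage operator: coefficients smaller in magnitude
    than the penalty are driven exactly to zero, which is what makes
    regularization act as an embedded feature selector."""
    if coef > penalty:
        return coef - penalty
    if coef < -penalty:
        return coef + penalty
    return 0.0

raw = [2.5, -0.3, 0.1, -1.8]      # hypothetical unregularized coefficients
shrunk = [soft_threshold(c, 0.5) for c in raw]
print(shrunk)                     # [2.0, 0.0, 0.0, -1.3]
kept = [i for i, c in enumerate(shrunk) if c != 0.0]
print(kept)                       # [0, 3]: only two features survive
```

Features whose coefficients are zeroed out are effectively excluded from the model, so selection happens as a by-product of training rather than as a separate filtering step.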
Some predictive modeling problems have a large number of variables that can slow the development and training of models and demand a large amount of system memory. In addition, the performance of some models can be degraded by input variables that are not relevant to the target variable.
There are two main types of feature selection algorithms: wrapper methods and filter methods.

1. Wrapper Feature Selection Methods.
2. Filter Feature Selection Methods.

1. Wrapper feature selection methods create several models with different subsets of input features and select those features that result in the best-performing model according to a performance metric. These methods are unconcerned with the variable types, although they can be computationally expensive. RFE is a good example of a wrapper feature selection method. Wrapper methods evaluate multiple models using procedures that add and/or remove predictors to find the optimal combination that maximizes model performance.

2. Filter feature selection methods use statistical techniques to evaluate the relationship between each input variable and the target variable, and these scores are used as the basis for choosing (filtering) the input variables that will be used in the model. Filter methods evaluate the relevance of the predictors outside of the predictive models, and subsequently model only the predictors that pass some criterion. Correlation-type statistical measures between input and output variables are widely used as the basis for filter feature selection. As such, the choice of statistical measure depends strongly on the variable data types. Common data types include numerical (such as height) and categorical (such as a label), although each may be further subdivided, such as integer and floating point for numerical variables, and boolean, ordinal, or nominal for categorical variables.
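Matching a filter statistic to the variable types can be illustrated with a small lookup. This is a deliberately simplified mapping for intuition only; real-world choices also depend on distribution, cardinality, and whether the relationship is monotonic.

```python
def filter_statistic(input_type, output_type):
    """Suggest a common filter statistic for an input/output type pair.

    A simplified illustrative mapping, not an exhaustive recommendation.
    """
    table = {
        ("numerical", "numerical"): "Pearson correlation coefficient",
        ("numerical", "categorical"): "ANOVA F-statistic",
        ("categorical", "numerical"): "ANOVA F-statistic",
        ("categorical", "categorical"): "Chi-square test",
    }
    return table[(input_type, output_type)]

# Regression with numeric inputs vs. classification with categorical inputs:
print(filter_statistic("numerical", "numerical"))      # Pearson correlation coefficient
print(filter_statistic("categorical", "categorical"))  # Chi-square test
```

For example, a classification task with word-presence features (categorical input, categorical output) points to the Chi-square test, which is exactly the measure used in the feature selection step of Section 8.3.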