UNIT_5_Data Wrangling.pptx
• Wrangling Data:
• If you’ve gone through the previous chapters, by this point you’ve dealt with all the basic data loading and
manipulation methods offered by Python. Now it’s time to start using some more complex instruments for data
wrangling (or munging) and for machine learning.
• The final step of most data science projects is to build a data tool able to automatically summarize, predict,
and recommend directly from your data.
• Before taking that final step, you still have to process your data by enforcing transformations that are even
more radical.
• That’s the data wrangling or data munging part, where sophisticated transformations are followed by visual
and statistical explorations, and then again by further transformations.
• In the following sections, you learn how to handle huge streams of text, explore the basic characteristics of a
dataset, optimize the speed of your experiments, compress data and create new synthetic features, generate
new groups and classifications, and detect unexpected or exceptional cases that may cause your project to go
wrong.
2
• Sometimes the best way to discover how to use something is to spend time playing with it. The more
complex a tool, the more important play becomes.
• Given the complex math tasks you perform using Scikit-learn, playing becomes especially important.
• The following sections use the idea of playing with Scikit-learn to help you discover important
concepts in using Scikit-learn to perform amazing feats of data science work.
3
• Understanding classes in Scikit-learn
• Understanding how classes work is an important prerequisite for being able to use the Scikit-learn package
appropriately.
• Scikit-learn is the package for machine learning and data science experimentation favored by most data scientists.
• It contains a wide range of well-established learning algorithms, error functions, and testing procedures.
• At its core, Scikit-learn features some base classes on which all the algorithms are built. Apart from
BaseEstimator, the class from which all other classes inherit, there are four class types covering all the basic
machine-learning functionalities:
• Classifying
• Regressing
• Grouping by clusters
• Transforming data
4

• Understanding classes in Scikit-learn
• Even though each base class has specific methods and attributes, the core functionalities for data processing and
machine learning are guaranteed by one or more series of methods and attributes called interfaces.
• The interfaces provide a uniform Application Programming Interface (API) to enforce similarity of methods and
attributes between all the different algorithms present in the package. There are four Scikit-learn object-based
interfaces:
1. estimator: For fitting parameters, learning them from data, according to the algorithm
2. predictor: For generating predictions from the fitted parameters
3. transformer: For transforming data, implementing the fitted parameters
4. model: For reporting goodness of fit or other score measures
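• As a minimal sketch of how the four interfaces appear in practice (the estimator and the bundled Iris dataset used here are illustrative choices, not taken from the slides):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

scaler = StandardScaler().fit(X)        # estimator interface: fit()
X_scaled = scaler.transform(X)          # transformer interface: transform()

clf = LogisticRegression(max_iter=1000).fit(X_scaled, y)
print(clf.predict(X_scaled[:3]))        # predictor interface: predict()
print(clf.score(X_scaled, y))           # model interface: score()
```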
5
Defining applications for data science
• Figuring out ways to use data science to obtain constructive results is important. For example, you can apply the
estimator interface to a
1. Classification problem: Guessing that a new observation is from a certain group
2. Regression problem: Guessing the value of a new observation
• It works with the method fit(X, y), where X is the two-dimensional array of predictors (the set of observations to learn from) and y is the target outcome (another, one-dimensional array).
• By applying fit, the information in X is related to y, so that, given new information with the same characteristics as X, it's possible to guess y correctly.
• In the process, some parameters are estimated internally by the fit method. Using fit makes it possible to
distinguish between parameters, which are learned, and hyperparameters, which instead are fixed by you when
you instantiate the learner.
6
Defining applications for data science
• Instantiation involves assigning a Scikit-learn class to a Python variable.
• In addition to hyperparameters, you can also fix other working parameters, such as requiring normalization or
setting a random seed to reproduce the same results for each call, given the same input data.
7
Defining applications for data science
8
Here is an example with linear regression, a very basic and common machine learning algorithm. You upload some
data to use this example from the examples that Scikit-learn provides.
The Boston dataset, for instance, contains predictor variables that the example code can match against house
prices, which helps build a predictor that can calculate the value of a house given its characteristics.
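• Recent scikit-learn releases no longer ship the Boston dataset (load_boston was removed in version 1.2), so a hedged sketch rebuilds the same 506 x 13 arrays from the original CMU source, assuming that URL is still reachable:

```python
import numpy as np
import pandas as pd

# The raw Boston data stores each record across two physical lines
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
X = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])  # 13 predictors
y = raw_df.values[1::2, 2]                                       # house prices
print(X.shape, y.shape)
```

• Printing the shapes confirms 506 rows in both arrays and 13 columns in X, which is the check discussed next.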

Defining applications for data science
9
• The output specifies that both arrays have the same number of rows and that X has 13 features.
• The shape attribute of a NumPy array reports its dimensions, which is how you verify this.
Defining applications for data science
• After importing the LinearRegression class, you can instantiate a variable called hypothesis and set a parameter
indicating the algorithm to standardize (that is, to set mean zero and unit standard deviation for all the variables, a
statistical operation for having all the variables at a similar level) before estimating the parameters to learn.
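• The slides refer to a parameter that tells LinearRegression to standardize; that option (normalize=True in older releases) has been removed from recent scikit-learn versions, so a hedged equivalent builds the standardization into a pipeline:

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize the predictors, then fit the linear model on the rescaled data
hypothesis = make_pipeline(StandardScaler(), LinearRegression())
hypothesis.fit(X, y)
print(hypothesis[-1].coef_)   # one learned coefficient per feature
```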
10
Defining applications for data science
• After fitting, hypothesis holds the learned parameters, and you can inspect them using the coef_ attribute, which is typical of all the linear models (where the model output is a summation of variables weighted by coefficients).
You can also call this fitting activity training (as in, “training a machine learning algorithm”).
• hypothesis is a way to describe a learning algorithm trained with data. The hypothesis defines a possible
representation of y given X that you test for validity. Therefore, it’s a hypothesis in both scientific and machine
learning language.
11
Defining applications for data science
• Apart from the estimator class, the predictor and the model object classes are also important.
• The predictor class, which predicts the probability of a certain result, obtains the result of new observations using
the predict and predict_proba methods, as in this script:
12
Make sure that new observations have the same number and order of features as the training X; otherwise, the prediction will be incorrect.
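• A minimal sketch; because LinearRegression has no predict_proba, the second half switches to an illustrative classifier (this substitution is an assumption, not the slides' original script), and new_obs is a hypothetical new observation:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

new_obs = np.array([X[0, :]])            # reuse a training row as a stand-in
print(hypothesis.predict(new_obs))       # predicted house value

Xi, yi = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(Xi, yi)
print(clf.predict(Xi[:1]))               # predicted class label
print(clf.predict_proba(Xi[:1]))         # probability of each class
```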

Defining applications for data science
• The class model provides information about the quality of the fit using the score method, as shown here:
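• For the regression fitted above, a minimal sketch:

```python
print(hypothesis.score(X, y))   # R-squared of the fit on the training data
help(hypothesis.score)          # built-in help describes the scoring used
```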
• In this case, score returns the coefficient of determination R2 of the prediction. R2 is a measure ranging from 0 to
1, comparing our predictor to a simple mean. Higher values show that the predictor is working well.
• Different learning algorithms may use different scoring functions. Please consult the online documentation of each
algorithm or ask for help on the Python console:
13
Defining applications for data science
• The transform class applies transformations derived from the fitting phase to other data arrays.
• LinearRegression doesn’t have a transform method, but most preprocessing algorithms do.
• For example, MinMaxScaler, from the Scikit-learn preprocessing module, can transform values in a specific range
of minimum and maximum values, learning the transformation formula from an example array.
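• A minimal sketch, reusing the predictors loaded earlier:

```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))
scaler.fit(X)                      # learn each feature's minimum and maximum
X_scaled = scaler.transform(X)     # rescale every feature into [0, 1]
print(X_scaled.min(), X_scaled.max())
```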
14
• Scikit-learn provides you with most of the data structures and functionality you need to complete your data science
project.
• You can even find classes for the trickiest and most advanced problems.
• For instance, when dealing with text, one of the most useful solutions provided by the Scikit-learn package is the
hashing trick.
• You discover how to work with text by using the bag of words model (as shown in the “Using the Bag of Words Model and Beyond” section) and weighting the words with the Term Frequency times Inverse Document Frequency (TF-IDF) transformation.
• All these powerful transformations can operate properly only if all your text is known and available in the memory
of your computer.
15
• A more serious data science challenge is to analyze online-generated text flows, such as from social networks or
large, online text repositories.
• This scenario poses quite a challenge when trying to turn the text into a data matrix suitable for analysis. When
working through such problems, knowing the hashing trick can give you quite a few advantages by helping you
 Handle large data matrices based on text on the fly
 Fix unexpected values or variables in your textual data
 Build scalable algorithms for large collections of documents
16

• Hash functions can transform any input into an output whose characteristics are predictable.
• Usually they return a value bound within a specific interval, whose extremes range from negative to positive numbers or span only positive numbers.
• You can imagine them as enforcing a standard on your data — no matter what values you provide, they always
return a specific data product.
• Their most useful hash function characteristic is that, given a certain input, they always provide the same numeric
output value. Consequently, they’re called deterministic functions.
• For example, input a word like dog and the hashing function always returns the same number.
• In a certain sense, hash functions are like a secret code, transforming everything into numbers. Unlike secret
codes, however, you can’t convert the hashed code to its original value.
• In addition, in some rare cases, different words generate the same hashed result (also called a hash collision).
17
• There are many hash functions, with MD5 (often used to check file integrity, because you can hash entire files)
and SHA (used in cryptography) being the most popular.
• Python possesses a built-in hash function named hash that you can use to compare data objects before storing
them in dictionaries.
• For instance, you can test how Python hashes its name:
18
The command returns a large integer number:
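• For instance (the exact number differs across machines and sessions, because Python randomizes string hashing):

```python
print(hash('Python'))   # a large, platform-dependent integer
```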
• A Scikit-learn hash function can also return an index in a specific positive range.
• You can obtain something similar using a built-in hash by employing standard division and its remainder:
19
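• A one-line sketch of the idea:

```python
# The remainder of the division bounds the result to the range 0-999
print(abs(hash('Python')) % 1000)
```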
• When you ask for the remainder of the absolute number of the result from the hash function, you get a number
that never exceeds the value you used for the division.
• To see how this technique works, pretend that you want to transform a text string from the Internet into a numeric
vector (a feature vector) so that you can use it for starting a machine-learning project. A good strategy for
managing this data science task is to employ one-hot encoding, which produces a bag of words.
• Here are the steps for one-hot encoding a string (“Python for data science”) into a vector.
1. Assign an arbitrary number to each word, for instance, Python=0 for=1 data=2 science=3.
2. Initialize the vector, counting the number of unique words that you assigned a code in Step 1.
3. Use the codes assigned in Step 1 as indexes for populating the vector with values, assigning a 1 where there is a coincidence with a word existing in the phrase.
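• A minimal sketch of these three steps (the variable names are illustrative):

```python
codes = {'Python': 0, 'for': 1, 'data': 2, 'science': 3}   # Step 1
vector = [0] * len(codes)                                   # Step 2
for word in 'Python for data science'.split():
    vector[codes[word]] = 1                                 # Step 3
print(vector)   # [1, 1, 1, 1]
```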
20

• The resulting feature vector is expressed as the sequence [1,1,1,1] and made of exactly four elements.
• You have started the machine-learning process, telling the program to expect sequences of four text features,
when suddenly a new phrase arrives and you must convert the following text into a numeric vector as well:
“Python for machine learning”.
• Now you have two new words — “machine learning” — to work with. The following steps help you create the new
vectors:
1. Assign these new codes: machine=4 learning=5. This is called encoding.
2. Enlarge the previous vector to include the new words: [1,1,1,1,0,0].
3. Compute the vector for the new string: [1,1,0,0,1,1].
• One-hot encoding works well here because it creates efficient and ordered feature vectors.
21
• The command returns a dictionary containing the words and their encodings:
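• The command itself did not survive the slide export; a hedged reconstruction uses the vocabulary learned by Scikit-learn's CountVectorizer:

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ['Python for data science', 'Python for machine learning']
oh_encoder = CountVectorizer(binary=True)
oh_encoded = oh_encoder.fit_transform(texts)
print(oh_encoder.vocabulary_)   # maps each word to its column index
```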
22
• Unfortunately, one-hot encoding fails and becomes difficult to handle when your project experiences a lot of variability with regard to its inputs.
• This is a common situation in data science projects working with text or other symbolic features, where data flowing from the Internet or other online environments can suddenly add new values to your initial data.
• Using hash functions is a smarter way to handle unpredictability in your inputs:
1. Define a range for the hash function outputs. All your feature vectors will use that range. The example uses a
range of values from 0 to 24.
2. Compute an index for each word in your string using the hash function.
3. Assign a unit value to the vector's positions according to the word indexes.
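• A hedged sketch of such a function; the name hashing_trick and the vector_size parameter follow the convention referenced later in these slides:

```python
def hashing_trick(input_string, vector_size=25):
    """Turn a text string into a fixed-length binary feature vector."""
    feature_vector = [0] * vector_size             # Step 1: fix the output range
    for word in input_string.split():
        index = abs(hash(word)) % vector_size      # Step 2: hash each word to an index
        feature_vector[index] = 1                  # Step 3: mark that position
    return feature_vector

print(hashing_trick('Python for data science'))
print(hashing_trick('Python for machine learning'))
```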
23
24

• As before, your results may not precisely match those in the book because hashes may not match across
machines.
• The code now prints the second string encoded:
25
• When viewing the feature vectors, you should notice that:
• You don’t know where each word is located. When it’s important to be able to reverse the process of assigning
words to indexes, you must store the relationship between words and their hashed value separately (for example,
you can use a dictionary where the keys are the hashed values and the values are the words).
• For small values of the vector_size function parameter (for example, vector_size=10), many words overlap in the
same positions in the list representing the feature vector. To keep the overlap to a minimum, you must create
hash function boundaries that are greater than the number of elements you plan to index later.
• The feature vectors in this example are made mostly of zero entries, representing a waste of memory when
compared to the more memory-efficient one-hot-encoding.
• One of the ways in which you can solve this problem is to rely on sparse matrices, as described in the next
section.
26
• Sparse matrices are the answer when dealing with data that has few values, that is, when most of the matrix
values are zeroes.
• Sparse matrices store just the coordinates of the cells and their values, instead of storing the information for all
the cells in the matrix.
• When an application requests data from an empty cell, the sparse matrix will return a zero value after looking for
the coordinates and not finding them.
• Here’s an example vector:
• [1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,1,0]
27
• The following code turns it into a sparse matrix.
28
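• A hedged sketch using SciPy's compressed sparse row format (any of SciPy's sparse formats prints coordinates and values in a similar way):

```python
from scipy.sparse import csr_matrix

vector = [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0]
sparse_vector = csr_matrix(vector)
print(sparse_vector)   # prints (row, column) coordinates and the nonzero values
```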
Notice that the data representation is in coordinates (expressed in a tuple of row and column index) and
the cell value.

• As a data scientist, you don’t have to worry about programming your own version of the hashing trick unless you
would like some special implementation of the idea.
• Scikit-learn offers HashingVectorizer, a class that rapidly transforms any collection of text into a sparse data
matrix using the hashing trick.
• Here’s an example script that replicates the previous example:
29
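• A hedged sketch; the n_features=20 setting is inferred from the 2x20 size reported below:

```python
from sklearn.feature_extraction.text import HashingVectorizer

sk_hasher = HashingVectorizer(n_features=20, binary=True, norm=None)
hashed_text = sk_hasher.transform(['Python for data science',
                                   'Python for machine learning'])
hashed_text   # the notebook displays the matrix size and the stored-element count
```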
Python reports the size of the sparse matrix and a count of the stored elements present in it:
<2x20
• As soon as new text arrives, CountVectorizer transforms the text based on the previous encoding schema where
the new words weren’t present; hence, the result is simply an empty vector of zeros.
• You can check this by transforming the sparse matrix into a normal, dense one using todense:
30
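• A hedged illustration, assuming the vectorizer was fitted before the new words arrived and the incoming text contains only unseen words:

```python
from sklearn.feature_extraction.text import CountVectorizer

oh_encoder = CountVectorizer(binary=True)
oh_encoder.fit(['Python for data science'])           # the vocabulary is fixed here
new_vector = oh_encoder.transform(['machine learning'])
print(new_vector.todense())                           # all zeros: the words are unknown
```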
• Contrast the output from CountVectorizer with HashingVectorizer, which always provides a place for new words in
the data matrix:
31
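• The same unseen text passed through HashingVectorizer, as a hedged sketch:

```python
from sklearn.feature_extraction.text import HashingVectorizer

sk_hasher = HashingVectorizer(n_features=20, binary=True, norm=None)
print(sk_hasher.transform(['machine learning']).todense())   # nonzero entries appear
```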
At worst, a word settles in an already occupied position, causing two different words to be treated as the same one by the algorithm (which won't noticeably degrade the algorithm's performance).
HashingVectorizer is the perfect function to use when your data can’t fit into memory and its
features aren’t fixed. In the other cases, consider using the more intuitive CountVectorizer.
• Just as when testing your application code for performance (speed) characteristics, you can obtain analogous information about memory usage.
• Keeping track of memory consumption could tell you about possible problems in the way data is processed or
transmitted to the learning algorithms.
• The memory_profiler package implements the required functionality. This package is not provided as a default
Python package and it requires installation.
• Use the following command to install the package directly from a cell of your Jupyter notebook,
32
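• For example, from a notebook cell (the leading ! hands the line to the shell):

```python
!pip install memory_profiler
```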

• Use the following command for each Jupyter Notebook session you want to monitor:
33
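• For example:

```python
# Load the memory_profiler magics into the current notebook session
%load_ext memory_profiler
```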
• After performing these tasks, you can easily track how much memory a command consumes:
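• The statement measured in the original slide did not survive the export; a placeholder illustrates how the %memit magic is used:

```python
%memit sum(range(10_000_000))   # reports peak memory and increment for this statement
```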
The reported peak memory and increment tell you about memory usage:
peak memory: 90.42 MiB, increment: 0.09 MiB
• Obtaining a complete overview of memory
consumption is possible by saving a notebook cell to
disk and then profiling it using the line magic %mprun
on an externally imported function.
• The line magic works only by operating with external
Python scripts.
• Profiling produces a detailed report, command by
command, as shown in the following example:
34
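• A hedged sketch: the first cell saves a placeholder function to disk (the function body is illustrative, since the original example did not survive the export):

```python
%%writefile example_code.py
def comparison_test(n=1_000_000):
    as_list = list(range(n))           # first allocation to watch
    as_dict = dict.fromkeys(as_list)   # second, larger allocation
    return len(as_dict)
```

• Then, in a separate cell, import the function and profile it line by line:

```python
from example_code import comparison_test
%mprun -f comparison_test comparison_test()
```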
The resulting report details the memory usage from every line
in the function, pointing out the major increments.
• Most computers today are multicore (two or more processors in a single package), some with multiple physical
CPUs. One of the most important limitations of Python is that it uses a single core by default.
• Data science projects require quite a lot of computations. In particular, a part of the scientific aspect of data
science relies on repeated tests and experiments on different data matrices.
• Using more CPU cores accelerates a computation by a factor that almost matches the number of cores.
• For example, having four cores would mean working at best four times faster.
• You don’t receive a full fourfold increase because there is overhead when starting a parallel process — new
running Python instances have to be set up with the right in-memory information and launched; consequently, the
improvement will be less than potentially achievable but still significant.
• Knowing how to use more than one CPU is therefore an advanced but incredibly useful skill for increasing the
number of analyses completed, and for speeding up your operations both when setting up and when using your
data products.
35
• Performing multicore parallelism
• To perform multicore parallelism with Python, you integrate the Scikit-learn package with the joblib package for
time-consuming operations, such as replicating models for validating results or for looking for the best
hyperparameters. In particular, Scikit-learn allows multiprocessing when
• Cross-validating: Testing the results of a machine-learning hypothesis using different training and testing data
• Grid-searching: Systematically changing the hyperparameters of a machine-learning hypothesis and testing the
consequent results
• Multilabel prediction: Running an algorithm multiple times against multiple targets when there are many
different target outcomes to predict at the same time
• Ensemble machine-learning methods: Modeling a large host of classifiers, each one independent from the
other, such as when using RandomForest-based modeling
36

• Using Jupyter provides the advantage of using the %timeit magic command for timing execution. You start by loading a multiclass dataset, a complex machine learning algorithm
(the Support Vector Classifier, or SVC), and a cross-validation procedure for estimating reliable resulting scores
from all the procedures.
• The most important thing to know is that the procedures become quite large because the SVC produces 10
models, which it repeats 10 times each using cross-validation, for a total of 100 models
37
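• A hedged reconstruction; the digits dataset and cv=10 are assumptions inferred from the description of 10 models repeated 10 times:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
# Time the cross-validated SVC on a single core (n_jobs=1)
%timeit single_core = cross_val_score(SVC(), X, y, cv=10, n_jobs=1)
```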
As a result, you get the recorded average running time for a single core: 10.9 s
• After this test, you need to activate the multicore parallelism and time the results using the following command:
38
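• A hedged sketch of the multicore counterpart, reusing the objects from the previous cell; n_jobs=-1 asks joblib to use every available core:

```python
# Same experiment, distributed across all available cores
%timeit multi_core = cross_val_score(SVC(), X, y, cv=10, n_jobs=-1)
```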
As a result, you get the recorded average running time with multiple cores: 4.44 s
• Data science relies on complex algorithms for building predictions and spotting important signals in
data, and each algorithm presents different strong and weak points.
• In short, you select a range of algorithms, you have them run on the data, you optimize their
parameters as much as you can, and finally you decide which one will best help you build your data
product or generate insight into your problem.
• Exploratory Data Analysis (EDA) is a general approach to exploring datasets by means of simple
summary statistics and graphic visualizations in order to gain a deeper understanding of data.
• EDA helps you become more effective in the subsequent data analysis and modeling.
40

• EDA was developed at Bell Labs by John Tukey, a mathematician and statistician who wanted to
promote more questions and actions on data based on the data itself (the exploratory motif) in
contrast to the dominant confirmatory approach of the time.
• EDA goes further than IDA (Initial Data Analysis). It's moved by a different attitude: going beyond basic assumptions. With EDA, you can:
 Describe your data
Closely explore data distributions
Understand the relations between variables
Notice unusual or unexpected situations
Place the data into groups
Notice unexpected patterns within groups
Take note of group differences
41
• The first actions that you can take with the data are to produce some synthetic measures to help figure out what is going on in it.
• You acquire knowledge of measures such as maximum and minimum values, and you define which
intervals are the best place to start.
• During your exploration, you use a simple but useful dataset that is used in previous chapters, the
Fisher’s Iris dataset. You can load it from the Scikit-learn package by using the following code:
42
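• A hedged sketch; the column name group for the target labels is an assumption carried through the later examples:

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
iris_dataframe = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_dataframe['group'] = iris.target_names[iris.target]   # species names as labels
print(iris_dataframe.head())
```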
• Mean and median are the first measures to calculate for numeric variables when starting EDA.
• They can provide you with an estimate when the variables are centered and somehow symmetric.
• Using pandas, you can quickly compute both means and medians.
• Here is the command for getting the mean from the Iris DataFrame:
43
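• For the DataFrame built above:

```python
print(iris_dataframe.mean(numeric_only=True))
print(iris_dataframe.median(numeric_only=True))
```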
• When checking for central tendency measures, you should:
1. Verify whether means are zero
2. Check whether they are different from each other
3. Notice whether the median is different from the mean
• As a next step, you should check the variance by using its square root, the standard deviation.
• The standard deviation is as informative as the variance, but comparing to the mean is easier
because it’s expressed in the same unit of measure.
• The variance is a good indicator of whether a mean is a suitable indicator of the variable distribution
because it tells you how the values of a variable distribute around the mean.
• The higher the variance, the farther you can expect some values to appear from the mean.
44

• In addition, you also check the range, which is the difference between the maximum and minimum
value for each quantitative variable, and it is quite informative about the difference in scale among
variables.
45
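• A minimal sketch for both measures:

```python
numeric = iris_dataframe.select_dtypes(include='number')
print(numeric.std())                   # standard deviation of each variable
print(numeric.max() - numeric.min())   # range of each variable
```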
• Note the standard deviation and the range in relation to the mean and median.
• A standard deviation or range that’s too high with respect to the measures of centrality (mean and median) may
point to a possible problem, with extremely unusual values affecting the calculation or an unexpected distribution
of values around the mean.
• Because the median is the value in the central position of your distribution of values, you may need to
consider other notable positions.
• Apart from the minimum and maximum, the position at 25 percent of your values (the lower quartile)
and the position at 75 percent (the upper quartile) are useful for determining the data distribution, and
they are the basis of an illustrative graph called a boxplot.
46
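• For example:

```python
# Lower quartile, median, and upper quartile for every numeric variable
print(iris_dataframe.quantile([0.25, 0.50, 0.75], numeric_only=True))
```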
• The output provides a comparison that uses quartiles for rows and the different dataset variables as columns.
• So, the 25-percent quartile for sepal length (cm) is 5.1, which means that 25 percent of the dataset values for this
measure are less than 5.1.
• The last indicative measures of how the numeric variables used for these examples are structured are
skewness and kurtosis:
• Skewness defines the asymmetry of data with respect to the mean. If the skew is negative, the left tail is too long and the mass of the observations is on the right side of the distribution. If it is positive, it is exactly the opposite.
• Kurtosis shows whether the data distribution, especially the peak and the tails, is of the right shape. If the kurtosis is above zero, the distribution has a marked peak. If it is below zero, the distribution is instead too flat.
47
• When performing the skewness and kurtosis tests, you determine whether the p-value is less than or equal to 0.05.
• If so, you have to reject normality (the hypothesis that your variable is distributed as a Gaussian), which implies that you could obtain better results if you try to transform the variable into a normal one.
• The following code shows how to perform the required test:
48
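• A hedged sketch; the choice of petal length as the tested variable is an assumption:

```python
from scipy.stats import kurtosistest, skewtest

variable = iris_dataframe['petal length (cm)']
print(skewtest(variable))       # statistic and p-value for skewness
print(kurtosistest(variable))   # statistic and p-value for kurtosis
```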

• The test results tell you that the data is slightly skewed to the left, but not enough to make it unusable.
• The real problem is that the curve is much too flat to be bell shaped, so you should investigate the
matter further.
49
• The Iris dataset is made of four metric variables and a qualitative target outcome.
• Just as you use means and variance as descriptive measures for metric variables, so do frequencies
strictly relate to qualitative ones.
• Because the dataset is made up of metric measurements (width and lengths in centimeters), you must
render it qualitative by dividing it into bins according to specific intervals.
• The pandas package features two useful functions, cut and qcut, that can transform a metric variable
into a qualitative one:
• cut expects a series of edge values used to cut the measurements or an integer number of
groups used to cut the variables into equal-width bins.
• qcut expects a series of percentiles used to cut the variable.
50
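• A minimal sketch of both functions applied to the petal length column:

```python
import pandas as pd

petal_length = iris_dataframe['petal length (cm)']
binned = pd.cut(petal_length, bins=4)                           # four equal-width bins
quantile_binned = pd.qcut(petal_length, [0, .25, .5, .75, 1.])  # quartile-based bins
print(binned.value_counts().sort_index())
```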
51
52

• By matching different categorical frequency distributions, you can display the relationship between
qualitative variables.
• The pandas.crosstab function can match variables or groups of variables, helping to locate possible data
structures or relationships.
• In the following example, you check how the outcome variable is related to petal length and observe
how certain outcomes and petal binned classes never appear together.
• The output shows the various iris types along the left side, cross-tabulated against the binned petal lengths.
53
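• A hedged sketch, reusing the binned petal length and the group column assumed earlier:

```python
import pandas as pd

binned_petal = pd.cut(iris_dataframe['petal length (cm)'], bins=4)
print(pd.crosstab(iris_dataframe['group'], binned_petal))
```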
• The data is rich in information because it offers a perspective that goes beyond the single variable, presenting more
variables with their reciprocal variations.
• The way to use more of the data is to create a bivariate (seeing how couples of variables relate to each other)
exploration.
• This is also the basis for complex data analysis based on a multivariate (simultaneously considering all the existent
relations between variables) approach.
• If the univariate approach inspected a limited number of descriptive statistics, then matching different variables or
groups of variables increases the number of possibilities.
• Such exploration overloads the data scientist with different tests and bivariate analysis.
• Using visualization is a rapid way to limit test and analysis to only interesting traces and hints.
• Visualizations, using a few informative graphics, can convey the variety of statistical characteristics of the variables
and their reciprocal relationships with greater ease.
54
• Boxplots provide a way to represent distributions and their extreme ranges, signaling whether some observations
are too far from the core of the data — a problematic situation for some learning algorithms.
• The following code shows how to create a basic boxplot using the Iris dataset:
55
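• A minimal sketch:

```python
import matplotlib.pyplot as plt

iris_dataframe.boxplot(return_type='axes')   # one box per numeric variable
plt.show()
```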
• In the figure, you see the structure of each variable's distribution at its core, represented by the 25th and 75th percentiles (the sides of the box) and the median (at the center of the box).
• The lines, the so-called whiskers, extend 1.5 times the IQR from the box sides (or to the most extreme value, if it lies within 1.5 times the IQR).
• After you have spotted a possible group difference relative to a variable, a t-test (you use a t-test in situations in
which the sampled population has an exact normal distribution) or a one-way Analysis Of Variance (ANOVA) can
provide you with a statistical verification of the significance of the difference between the groups’ means.
56
• The t-test compares two groups at a time, and it requires that you define whether the groups have similar variance or not.
• You interpret the p-value as the probability that the calculated t statistic difference is just due to chance.
• Usually, when it is below 0.05, you can confirm that the groups' means are significantly different.
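• A hedged sketch comparing two of the Iris groups on petal length; the group names follow the DataFrame built earlier, and equal_var=False relaxes the equal-variance assumption:

```python
from scipy.stats import ttest_ind

setosa = iris_dataframe[iris_dataframe['group'] == 'setosa']['petal length (cm)']
versicolor = iris_dataframe[iris_dataframe['group'] == 'versicolor']['petal length (cm)']
t_stat, p_value = ttest_ind(setosa, versicolor, equal_var=False)
print(t_stat, p_value)
```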

• You can simultaneously check more than two groups using the one-way ANOVA test. In this case, the p-value has an interpretation similar to the t-test:
57
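• A hedged sketch using SciPy's one-way ANOVA on the same variable:

```python
from scipy.stats import f_oneway

groups = [data['petal length (cm)'] for _, data in iris_dataframe.groupby('group')]
f_stat, p_value = f_oneway(*groups)
print(f_stat, p_value)
```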
• Parallel coordinates can help spot which groups in the outcome variable you could easily separate from the others.
• It is a truly multivariate plot, because at a glance it represents all your data at the same time.
• The following example shows how to use parallel coordinates.
58
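• A minimal sketch:

```python
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

parallel_coordinates(iris_dataframe, 'group')   # one line per observation, colored by group
plt.show()
```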
• If the parallel lines of each group stream together along the visualization in a separate part of the graph, far from the other groups, the group is easily separable.
• The visualization also provides the means to assess the capability of certain features to separate the groups.
• You usually render the information that boxplot and descriptive statistics provide into a curve or a histogram, which shows an
overview of the complete distribution of values.
• The output shown in Figure represents all the distributions in the dataset.
• Different variable scales and shapes are immediately visible, such as the fact that petals’ features display two peaks.
59
• Histograms present another, more detailed, view over distributions:
60
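• A minimal sketch of both views:

```python
import matplotlib.pyplot as plt

numeric = iris_dataframe.select_dtypes(include='number')
numeric.plot(kind='density')   # one smoothed curve per variable (requires SciPy)
numeric.hist(bins=20)          # one histogram panel per variable
plt.show()
```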

• In scatterplots, the two compared variables provide the coordinates for plotting the observations as points on a plane.
• The result is usually a cloud of points. When the cloud is elongated and resembles a line, you can deduce that the variables are
correlated.
• The following example demonstrates this principle:
61
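• A hedged sketch plotting petal length against petal width, colored by group; the variable pair and the color choices are illustrative:

```python
import matplotlib.pyplot as plt

colors = {'setosa': 'red', 'versicolor': 'green', 'virginica': 'blue'}
plt.scatter(iris_dataframe['petal length (cm)'],
            iris_dataframe['petal width (cm)'],
            c=iris_dataframe['group'].map(colors))
plt.xlabel('petal length (cm)')
plt.ylabel('petal width (cm)')
plt.show()
```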
• The scatterplot highlights different groups using
different colors.
• The elongated shape described by the points hints
at a strong correlation between the two observed
variables, and the division of the cloud into groups
suggests a possible separability of the groups.
• Because the number of variables isn't too large, you can also generate all the scatterplots automatically from the combination of the variables.
• This representation is a matrix of scatterplots.
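• For example:

```python
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

numeric = iris_dataframe.select_dtypes(include='number')
scatter_matrix(numeric, figsize=(8, 8), diagonal='kde')
plt.show()
```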
• Just as the relationship between variables is graphically representable, it is also measurable by a statistical estimate.
• When working with numeric variables, the estimate is a correlation, and the Pearson’s correlation is the most famous.
• The Pearson’s correlation is the foundation for complex linear estimation models.
• When you work with categorical variables, the estimate is an association, and the chi-square statistic is the most frequently used
tool for measuring association between features.
• Using covariance and correlation
• Covariance is the first measure of the relationship of two variables.
• It determines whether both variables have a coincident behavior with respect to their mean. If the single values of two variables
are usually above or below their respective averages, the two variables have a positive association.
• It means that they tend to agree, and you can figure out the behavior of one of the two by looking at the other.
• In such a case, their covariance will be a positive number, and the higher the number, the higher the agreement.
62
• If, instead, one variable is usually above and the other variable usually below their respective averages,
the two variables are negatively associated.
• Even though the two disagree, it’s an interesting situation for making predictions, because by observing
the state of one of them, you can figure out the likely state of the other (albeit they’re opposite).
• In this case, their covariance will be a negative number.
• A third state is that the two variables don’t systematically agree or disagree with each other. In this case,
the covariance will tend to be zero, a sign that the variables don’t share much and have independent
behaviors.
63
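• A minimal sketch of both measures on the Iris variables:

```python
numeric = iris_dataframe.select_dtypes(include='number')
print(numeric.cov())    # covariances, in the variables' original units
print(numeric.corr())   # Pearson correlations, rescaled to the range -1 to 1
```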
