SlideShare a Scribd company logo
Data Science Process
1. Setting the research goal
1. Spend time understanding the goals and context of your research
2. Create a project charter
2. Retrieving Data
1. Data within company
a. This data can be stored in official data repositories such as databases, data marts, data
warehouses, and data lakes maintained by a team of IT professionals.
b. The primary goal of a database is data storage, while a data warehouse is designed for
reading and analyzing that data.
c. A data mart is a subset of the data warehouse and geared toward serving a specific business
unit. While data warehouses and data marts are home to preprocessed data, data lakes
contains data in its natural or raw format.
2. Open source data
3. Data Preparation
1. Cleansing

Recommended for you

Discretization and concept hierarchy(os)
Discretization and concept hierarchy(os)Discretization and concept hierarchy(os)
Discretization and concept hierarchy(os)

This document discusses different types of concept hierarchies that can be used in data warehousing including schema hierarchies, set grouping hierarchies, and operation-derived hierarchies. It also discusses techniques for data discretization such as binning methods and concept hierarchies that reduce data by replacing low-level concepts with higher-level concepts. Finally, it briefly mentions histogram analysis, clustering analysis, and different ways concept hierarchies can be specified including explicitly by users or generated automatically through analysis.

Data science applications and usecases
Data science applications and usecasesData science applications and usecases
Data science applications and usecases

The document provides an overview of data science applications and use cases. It defines data science as using computer science, statistics, machine learning and other techniques to analyze data and create data products to help businesses make better decisions. It discusses big data challenges, the differences between data science and software engineering, and key areas of data science competence including data analytics, engineering, domain expertise and data management. Finally, it outlines several common data science applications and use cases such as recommender systems, credit scoring, dynamic pricing, customer churn analysis and fraud detection with examples of how each works and real world cases.

data sciencedata analyticsapplications of data science
introduction to data science
introduction to data scienceintroduction to data science
introduction to data science

Defining Data Science • What Does a Data Science Professional Do? • Data Science in Business • Use Cases for Data Science • Installation of R and R studio

databig dataanalytics
Data Science Process.pptx
Why the errors should be corrected asap?
● Not everyone spots the data anomalies. Decision-makers may make costly
mistakes on information based on incorrect data from applications that fail to
correct for the faulty data.
● If errors are not corrected early on in the process, the cleansing will have to be
done for every project that uses that data.
● Data errors may point to defective equipment, such as broken transmission lines
and defective sensors.
● Data errors can point to bugs in software or in the integration of software that
may be critical to the company. While doing a small project at a bank we
discovered that two software applications used different local settings. This
caused problems with numbers greater than 1,000. For one app the number 1.000
meant one, and for the other it meant one thousand.
Combining Data
1. Joining Tables
2. Appending Tables
3. Creating views
Data Transformation

Recommended for you

A Statistician's View on Big Data and Data Science (Version 1)
A Statistician's View on Big Data and Data Science (Version 1)A Statistician's View on Big Data and Data Science (Version 1)
A Statistician's View on Big Data and Data Science (Version 1)

Presentation given by Dr. Diego Kuonen, CStat PStat CSci, on November 20, 2013, at the "IBM Developer Days 2013" in Zurich, Switzerland. ABSTRACT There is no question that big data has hit the business, government and scientific sectors. The demand for skills in data science is unprecedented in sectors where value, competitiveness and efficiency are driven by data. However, there is plenty of misleading hype around the terms big data and data science. This presentation gives a professional statistician's view on these terms and illustrates the connection between data science and statistics. The presentation is also available at http://www.statoo.com/BigDataDataScience/.

data minerdata scientiststatistics
Data Integration and Transformation in Data mining
Data Integration and Transformation in Data miningData Integration and Transformation in Data mining
Data Integration and Transformation in Data mining

This document summarizes key aspects of data integration and transformation in data mining. It discusses data integration as combining data from multiple sources to provide a unified view. Key issues in data integration include schema integration, redundancy, and resolving data conflicts. Data transformation prepares the data for mining and can include smoothing, aggregation, generalization, normalization, and attribute construction. Specific normalization techniques are also outlined.

Major issues in data mining
Major issues in data miningMajor issues in data mining
Major issues in data mining

The document discusses major issues in data mining including mining methodology, user interaction, performance, and data types. Specifically, it outlines challenges of mining different types of knowledge, interactive mining at multiple levels of abstraction, incorporating background knowledge, visualization of results, handling noisy data, evaluating pattern interestingness, efficiency and scalability of algorithms, parallel and distributed mining, and handling relational and complex data types from heterogeneous databases.

data miningissues
Data Science Process.pptx
4. EDA
5. Build the Model
Building a model is an iterative process. The way you build your model depends on
whether you go with classic statistics or the somewhat more recent machine
learning school, and the type of technique you want to use. Either way, most
models consist of the following main steps:
1 Selection of a modeling technique and variables to enter in the model
2 Execution of the model
3 Diagnosis and model comparison
Data Science Process.pptx

Recommended for you

Spatial databases
Spatial databasesSpatial databases
Spatial databases

The document discusses spatial data and spatial databases. It defines spatial data as data related to space, including location, shape, size and orientation of objects. It discusses the types of spatial data like points, lines, polygons and pixels. It also discusses non-spatial data and how spatial data is organized using coordinates. The key properties of spatial data are geometry, distribution of objects in space, temporal changes and data volume. Spatial databases allow for efficient storage and querying of spatial data through the use of spatial data types and indexes.

bsccsittribhuvan universityseventh semester
Data Science Project Lifecycle
Data Science Project LifecycleData Science Project Lifecycle
Data Science Project Lifecycle

The document outlines the typical lifecycle of a data science project, including business requirements, data acquisition, data preparation, hypothesis and modeling, evaluation and interpretation, and deployment. It discusses collecting data from various sources, cleaning and integrating data in the preparation stage, selecting and engineering features, building and validating models, and ultimately deploying results.

big datadata analysishadoop
Introduction of Data Science
Introduction of Data ScienceIntroduction of Data Science
Introduction of Data Science

This document provides an overview of data science including what is big data and data science, applications of data science, and system infrastructure. It then discusses recommendation systems in more detail, describing them as systems that predict user preferences for items. A case study on recommendation systems follows, outlining collaborative filtering and content-based recommendation algorithms, and diving deeper into collaborative filtering approaches of user-based and item-based filtering. Challenges with collaborative filtering are also noted.

big datadata sciencemachine learning
Data Science Process.pptx
6 . Presentation and automation—
Presenting your results to the stakeholders and industrializing your analysis
process for repetitive reuse and integration with other tools.
Working with data from files
 Working with different Data Types, Different Formats, Different Compression,
Different Parsing on Different Systems are very challenging task to prepare data.
 Dealing with different formats can become a Tedious Task.
 Thus, it is mandatory for any Data Scientist To Be Aware Of Different File Formats,
common challenges in handling them and the best / efficient ways to handle this data
in real life.
What is a file format?

Recommended for you

Analysis vs reporting
Analysis vs reportingAnalysis vs reporting
Analysis vs reporting

Reporting involves users selecting predefined reports to view results in standardized formats like tables and graphs. No person is involved beyond the user requesting the report. Reports have limited flexibility. Analysis provides answers to specific questions by taking any necessary steps through a guided process. It is customized and flexible to the questions being addressed. Modern data analytics tools use statistical modeling approaches like probability, sampling, inference, and machine learning algorithms like neural networks and support vector machines to gain insights from data.

Big data unit 2
Big data unit 2Big data unit 2
Big data unit 2

Big data analytics tools from vendors like IBM, Tableau, and SAS can help organizations process and analyze big data. For smaller organizations, Excel is often used, while larger organizations employ data mining, predictive analytics, and dashboards. Business intelligence applications include OLAP, data mining, and decision support systems. Big data comes from many sources like web logs, sensors, social networks, and scientific research. It is defined by the volume, variety, velocity, veracity, variability, and value of the data. Hadoop and MapReduce are common technologies for storing and analyzing big data across clusters of machines. Stream analytics is useful for real-time analysis of data like sensor data.

big dataanalyticshadoop
Data warehouse physical design
Data warehouse physical designData warehouse physical design
Data warehouse physical design

Data Warehouse Physical Design,Physical Data Model, Tablespaces, Integrity Constraints, ETL (Extract-Transform-Load) ,OLAP Server Architectures, MOLAP vs. ROLAP, Distributed Data Warehouse ,

data warehouse physical designphysical data modeltablespaces
Why should a data scientist understand different file formats?
 The files will depend on the application you are building.
 For example, in an image processing system, you need image files as input and output.
Therefore, we will mostly see files in jpeg, gif or png format.
 As a data scientist, we need to understand the underlying structure of various file
formats, their advantages and dis-advantages.
 Choosing the optimal file format for storing data can improve the performance of your
models in data processing.
Why should a data scientist understand different file formats?
Different File Formats.
 XLSX
 Comma-separated values (CSV)
 ZIP
 Plain Text (txt)
 JSON
 XML
 HTML
 Images
 Hierarchical Data Format
 PDF
 DOCX
 MP3
 MP4
Different file formats and how to read them in Python
Comma-separated values (CSV):
 Comma-Separated Values (CSV) file format falls under spreadsheet file format.
 In spreadsheet file format, data is stored in cells. Each cell has organized in rows and
columns.
 A column in the spreadsheet file can have different types. For example, a column can
be of string type, a date type or an integer type.
 Some of the most popular spreadsheet file formats are Comma Separated Values (CSV),
Microsoft Excel Spreadsheet (xls) and Microsoft Excel Open XML Spreadsheet (xlsx).
 Some files are separated using tab. This file format is known as TSV
(Tab Separated Values) file format.
Different file formats and how to read them in Python
The below image shows a CSV file which is opened in Notepad.

Recommended for you

Spatial databases
Spatial databasesSpatial databases
Spatial databases

This document discusses spatial databases and spatial data mining. It introduces spatial databases as databases that store large amounts of space-related data with special data types for spatial information. Spatial data mining extracts patterns and relationships from spatial data. The document also discusses spatial data warehousing with dimensions and measures for spatial and non-spatial data, mining spatial association patterns from spatial databases, techniques for spatial clustering, classification, and trend analysis.

databasesspatialdata mining
Introduction to Data Mining
Introduction to Data MiningIntroduction to Data Mining
Introduction to Data Mining

Data mining is the process of automatically discovering useful information from large data sets. It draws from machine learning, statistics, and database systems to analyze data and identify patterns. Common data mining tasks include classification, clustering, association rule mining, and sequential pattern mining. These tasks are used for applications like credit risk assessment, fraud detection, customer segmentation, and market basket analysis. Data mining aims to extract unknown and potentially useful patterns from large data sets.

parenttoolboxdata mining
Parallel Programming for Multi- Core and Cluster Systems - Performance Analysis
Parallel Programming for Multi- Core and Cluster Systems - Performance AnalysisParallel Programming for Multi- Core and Cluster Systems - Performance Analysis
Parallel Programming for Multi- Core and Cluster Systems - Performance Analysis

Outline § Performance scalability § Analytical performance measures § Amdahl’s law and Gustafson-Barsis’ law Performance Analysis

performance analysisparallelcomputing
Reading the data from CSV in Python
 For loading the data, you can use the “pandas” library in python.
import pandas as pd
pd.read_csv(r'F:IT DEPTWINTER 202210212IT105 - DATA SCIENCE IN PYTHON/addresses.csv')
Different file formats and how to read them in Python
Read Excel file:
 XLSX is a Microsoft Excel Open XML file format. It also comes under the
Spreadsheet file format.
 It is an XML-based file format created by Microsoft Excel.
 In XLSX data is organized under the cells and columns in a sheet.
 Each XLSX file may contain one or more sheets. Therefore, a workbook can contain
multiple sheets.
Different file formats and how to read them in Python
Excel File:
Different file formats and how to read them in Python
Read Excel file:
import pandas as pd
pd.read_excel(r'C:UsersNITHIDesktopMentees List.xlsx')

Recommended for you

Frequent itemset mining methods
Frequent itemset mining methodsFrequent itemset mining methods
Frequent itemset mining methods

The document discusses frequent itemset mining methods. It describes the Apriori algorithm which uses a candidate generation-and-test approach involving joining and pruning steps. It also describes the FP-Growth method which mines frequent itemsets without candidate generation by building a frequent-pattern tree. The advantages of each method are provided, such as Apriori being easily parallelized but requiring multiple database scans.

Data science - An Introduction
Data science - An IntroductionData science - An Introduction
Data science - An Introduction

This document provides an introduction and overview of data science. It discusses Ravishankar Rajagopalan's educational and professional background working in data science. It then covers various topics related to data science including common applications, required skills, the typical project lifecycle, team aspects, career progression, interviews, and resources for learning. Examples of unusual real-world applications are also summarized, such as using machine learning to optimize inventory levels for an oil and gas company and implementing speech recognition to predict customer intent for a call center.

data-scienceanalyticsartificial-intelligence
IRJET- Resume Information Extraction Framework
IRJET- Resume Information Extraction FrameworkIRJET- Resume Information Extraction Framework
IRJET- Resume Information Extraction Framework

The document discusses a framework for extracting information from resumes. Resumes are semi-structured documents that contain varying information like different fields, field names, and formats, making them difficult to parse. The proposed framework uses text mining and rule-based parsing to extract keywords from resumes, scores qualifications and skills, clusters the extracted information using DBSCAN, and classifies the resumes using gradient boosting machines. It aims to help recruiters filter and categorize large numbers of resumes more efficiently.

irjet
Different file formats and how to read them in Python
Read Excel file:
Different file formats and how to read them in Python
Read some particular columns:
import pandas as pd
pd.read_excel(r'C:UsersNITHIDesktopMentees List.xlsx',index_col=0, usecols="A:C")
Different file formats and how to read them in Python
Read some particular columns:
Different file formats and how to read them in Python
Read some particular columns:
import pandas as pd
pd.read_excel(r'C:UsersNITHIDesktopMentees List.xlsx',index_col=0, usecols=[3,5,6)

Recommended for you

Data Wrangling and Visualization Using Python
Data Wrangling and Visualization Using PythonData Wrangling and Visualization Using Python
Data Wrangling and Visualization Using Python

Python is open source and has so many libraries for data wrangling and visualization that makes life of data scientists easier. For data wrangling pandas is used as it represent tabular data and it has other function to parse data from different sources, data cleaning, handling missing values, merging data sets etc. To visualize data, low level matplotlib can be used. But it is a base package for other high level packages such as seaborn, that draw well customized plot in just one line of code. Python has dash framework that is used to make interactive web application using python code without javascript and html. These dash application can be published on any server as well as on clouds like google cloud but freely on heroku cloud.

pandas numpy matpotlib seaborn dash python
Bba203 unit 2data processing concepts
Bba203   unit 2data processing conceptsBba203   unit 2data processing concepts
Bba203 unit 2data processing concepts

This document outlines key concepts related to data processing including: - Data refers to facts and observations represented by symbols. Data processing manipulates data to transform it into useful information. - Data processing activities include tools to convert data into information, from manual to electronic tools. - The data processing cycle includes input, processing, output, and storage steps. - Data hierarchy shows the arrangement of data from fields to records to files to databases.

computer
chapter 1 HARDWARE AND NETWORKING SERVICE.pptx
chapter 1 HARDWARE AND NETWORKING SERVICE.pptxchapter 1 HARDWARE AND NETWORKING SERVICE.pptx
chapter 1 HARDWARE AND NETWORKING SERVICE.pptx

Hardware and networking service is very important in ICT.

hardwarecomputernetwork
Different file formats and how to read them in Python
Read some particular columns:
Different file formats and how to read them in Python
Read a particular Sheet:
import pandas as pd
pd.read_excel(r'C:UsersNITHIDesktopMentees List.xlsx',index_col=0, sheet_name=0)
Different file formats and how to read them in Python
Read a particular Sheet:
Different file formats and how to read them in Python
Read a particular Sheet:
import pandas as pd
pd.read_excel(r'C:UsersNITHIDesktopMentees List.xlsx',index_col=0, sheet_name=“Second Year”)

Recommended for you

Disclosing Private Information from Metadata, hidden info and lost data
Disclosing Private Information from  Metadata, hidden info and lost data Disclosing Private Information from  Metadata, hidden info and lost data
Disclosing Private Information from Metadata, hidden info and lost data

Documents contain metadata and hidden information that can be used to disclose private data and to fingerprint an organisation and its network computers. This document shows what kinds of data can be found, how to extract them and proposes some solutions to the problem stated here.

metadatadata lost preventionfingerprinting
Database Systems Concepts, 5th Ed
Database Systems Concepts, 5th EdDatabase Systems Concepts, 5th Ed
Database Systems Concepts, 5th Ed

This document provides an overview of the topics that will be covered in a database systems textbook. It introduces the major parts of the book, including relational databases, database design, data storage and querying, transaction management, and database architectures. Each chapter is briefly described to give the reader an understanding of what concepts will be discussed in more depth throughout the textbook.

D I T211 Chapter 1
D I T211    Chapter 1D I T211    Chapter 1
D I T211 Chapter 1

The document provides an introduction to databases including: - The structure of databases and the hierarchy of data types - The differences between file-based and database approaches - The components of a database system including the database, DBMS, applications, users, and tools - The purposes of using a database including storing, finding, and analyzing information

Different file formats and how to read them in Python
Read a particular Sheet:
Different file formats and how to read them in Python
Read Microsoft Word file:
 XLSX is a Microsoft Word Open
file format with extension .docx
Different file formats and how to read them in Python
Read Microsoft Word file:
pip install python-docx
Different file formats and how to read them in Python
Read Microsoft Word file:
from doc import Document
document = Document(r'F:IT DEPTWINTER 202210212IT105 - DATA SCIENCE IN
PYTHONtest.docx')
type(document)

Recommended for you

Chapter 2 - Intro to Data Sciences[2].pptx
Chapter 2 - Intro to Data Sciences[2].pptxChapter 2 - Intro to Data Sciences[2].pptx
Chapter 2 - Intro to Data Sciences[2].pptx

This document provides an overview of data science and key concepts in data. It defines data science and describes the data value chain, which identifies the main activities in generating value from data: data acquisition, analysis, curation, storage, and usage. It also defines different data types such as structured, unstructured, and semi-structured data. The document discusses characteristics of big data, including the 3Vs of volume, velocity, and variety as well as other characteristics like veracity and variability. Finally, it outlines the typical big data lifecycle of ingesting, persisting, computing/analyzing, and visualizing data.

Introduction to Data Science With R Notes
Introduction to Data Science With R NotesIntroduction to Data Science With R Notes
Introduction to Data Science With R Notes

Data science involves extracting knowledge from data to solve business problems. The data science life cycle includes defining the problem, collecting and preparing data, exploring the data, building models, and communicating results. Data preparation is an essential step that can consume 60% of a project's time. It involves cleaning, transforming, handling outliers, integrating, and reducing data. Models are built using machine learning algorithms like regression for continuous variables and classification for discrete variables. Results are visualized and communicated effectively to clients.

introduction to data science with r notes
D I T211 Chapter 1 1
D I T211    Chapter 1 1D I T211    Chapter 1 1
D I T211 Chapter 1 1

The document introduces databases and their components. It defines key terms like data, information, database, DBMS, and discusses the evolution from file-based systems to database systems. The main advantages of database systems are minimal data redundancy, sharing of data across systems, improved data consistency when values are stored and updated in one place.

Different file formats and how to read them in Python
Read Microsoft Word file:
document.paragraphs
Different file formats and how to read them in Python
Read Microsoft Word file:
type(document.paragraphs)
Different file formats and how to read them in Python
Read Microsoft Word file:
document.paragraphs[1]
document.paragraphs[0]
Different file formats and how to read them in Python
Read Microsoft Word file:
document.paragraphs[0].text
document.paragraphs[1].text

Recommended for you

Database Management System, Lecture-1
Database Management System, Lecture-1Database Management System, Lecture-1
Database Management System, Lecture-1

This is prepared from the book- Database System Concepts, written by ©Silberschatz, Korth and Sudarshan.

dbmsdatabase management systemdatabase management
2016 Chapter 2 - Intro. to Data Sciences.pptx
2016  Chapter 2 - Intro. to Data Sciences.pptx2016  Chapter 2 - Intro. to Data Sciences.pptx
2016 Chapter 2 - Intro. to Data Sciences.pptx

Description

Qiagram
QiagramQiagram
Qiagram

The document describes a business intelligence software called Qiagram that allows non-technical domain experts to easily explore and query complex datasets through a visual drag-and-drop interface without SQL or programming knowledge. It provides centralized data management, integration with various data sources, and self-service visual querying capabilities to help researchers gain insights from their data.

data explorationvisual sqlcollaboration
Different file formats and how to read them in Python
Read Microsoft Word file:
document.paragraphs[2].text
Data Science Process.pptx
Exploratory Data Analysis
Exploratory Data Analysis
 A method used to analyze and summarize data sets.
 Data scientists to analyze and investigate data sets and summarize their main
characteristics use exploratory Data Analysis (EDA), often employing data
visualization methods.
 It helps the data scientists to discover patterns, spot anomalies, test a hypothesis, or
check assumptions.
 EDA is primarily used to provide a better understanding of data set variables and the
relationships between them.
 It can helps to determine if the statistical techniques you are
considering for data analysis are appropriate.

Recommended for you

Database
DatabaseDatabase
Database

The document provides an overview of fundamentals of database design including definitions of key concepts like data, information, and databases. It discusses the purpose of databases and database management systems. It also covers topics like selecting a database system, database development best practices, and data entry considerations.

Data Life Cycle
Data Life CycleData Life Cycle
Data Life Cycle

This document discusses best practices for data organization, documentation, and metadata. It recommends using open standard file formats that will remain readable over time, consistent file naming conventions with descriptive names, and version control for files. Metadata should include descriptive, technical, and administrative information to document the data and ensure it can be understood and managed. Good documentation involves information on the data collection process and dataset structure.

Data Science Course.pdf
Data Science Course.pdfData Science Course.pdf
Data Science Course.pdf

Techedo Technologies provides the best Data Science Course in Chandigarh. Both offline/online classes are available here. Data science is a term used to explain a whole branch of Scouring | extraction | Retrieving | Processing | Taking output of data using different tolls, programming, platforms etc. Data science is all about taking out, use and process data or any one among mentioned tasks irrespective to the software or platform you are using. You may use MS-excel, SQL, R-Language, Python or any other medium for it. Anything you will to work on data comes under data science. Data science is one of the most processing career today. For More Details: Visit: https://www.techedo.com/data-Science-course-chandigarh.php Call: 7837505001, 7717255001, 8198055001, 0172-5275001, 0172-5265001

#data_science#data_science_course#data_science_program
Exploratory Data Analysis
Why is exploratory data analysis important in data science?
 Identify obvious errors, understand patterns, detect outliers or anomalous events,
interesting relations among the variables.
 To ensure the results that the data scientist produce are valid and applicable to any desired
business outcomes and goals.
 EDA helps stakeholders by confirming they are asking the right questions.
 EDA can help to answer questions about standard deviations, categorical variables, and
confidence intervals.
 Once EDA is complete and insights are drawn, its features can then
be used for more sophisticated data analysis or modeling,
including machine learning.
Exploratory Data Analysis
Exploratory Data Analysis Tools:
 Anyone spends a lot of time doing EDA to get a better understanding of data.
 EDA can be minimized by using auto visualizations tools such as –
1.Pandas-profiling,
2.Sweetviz,
3.Autoviz
4.D-Tale
Exploratory Data Analysis
Exploratory Data Analysis Tools:
 EDA involves a lot of steps including some statistical tests, visualization of data using
different kinds of plots
1. Data Quality Check: Can be done using pandas library functions like describe(), info(),
dtypes(), etc. It is used to find several features like its datatypes, duplicate values, missing
value, etc.
2. Statistical Test: Some statistical tests like Pearson correlation, Spearman correlation,
Kendall test, etc are done to get a correlation between the features. It can be
implemented in python using the “stats” library.
Exploratory Data Analysis
Exploratory Data Analysis Tools:
3. Quantitative Test: Find the spread of numerical features, count of categorical features. It
can be implemented in python using the functions of the “pandas” library.
4. Visualization: To get an understanding of the data. Graphical techniques like bar plots,
pie charts are used to get an understanding of categorical features, whereas scatter plots,
histograms are used for numerical features.

Recommended for you

How Data Virtualization Adds Value to Your Data Science Stack
How Data Virtualization Adds Value to Your Data Science StackHow Data Virtualization Adds Value to Your Data Science Stack
How Data Virtualization Adds Value to Your Data Science Stack

Watch here: https://bit.ly/3cZGCxr For their machine learning and data science projects to be successful, data scientists need access to all of the enterprise data delivered through their myriad of data models. However, gaining access to all data, integrated into a central repository has been a challenge. Often 80% of the project time is spent on these tasks. But, a virtual layer can help the data scientist speed up some of the most tedious tasks, like data exploration and analysis. At the same time, it also integrates well with the data science ecosystem. There is no need to change tools and learn new languages. The data virtualization platform helps data scientists offload these data integration tasks, allowing them to focus on advanced analytics. In this session, you will learn how data virtualization: - Provides all of the enterprise data, in real-time, and without replication - Enables data scientists to create and share multiple logical models using simple drag and drop - Provides a catalog of all business definitions, lineage, and relationships

data analyticsdata virtualization
Evaluation of Research Tools
Evaluation of Research ToolsEvaluation of Research Tools
Evaluation of Research Tools

EndNote is a reference management software that allows users to: - Search online databases and import references into a personal library - Insert formatted citations from the library directly into a research paper - Automatically generate bibliographies in the desired style While its interface is easy to use, it takes time to learn to fully utilize EndNote's features for organizing, sorting, and formatting references. Only EndNote can open compressed EndNote libraries for sharing full libraries.

 
by HATS
Utilizing the natural langauage toolkit for keyword research
Utilizing the natural langauage toolkit for keyword researchUtilizing the natural langauage toolkit for keyword research
Utilizing the natural langauage toolkit for keyword research

This document discusses using the Natural Language Toolkit (NLTK) for keyword research and analysis. It provides instructions on installing NLTK and other Python libraries, preparing keyword data, and running scripts to classify and cluster keywords to identify trends and topics. The document demonstrates how to automate aspects of keyword research using NLTK to help analyze large datasets.

seo automation
Exploratory Data Analysis Tools
Pandas-Profiling:
 Pandas profiling is an open-source python library that automates the EDA process and
creates a detailed report.
 Pandas Profiling can be used easily for large datasets as it is blazingly fast and creates
reports in a few seconds.
 Installation:
pip install pandas-profiling
Exploratory Data Analysis Tools
Pandas-Profiling:
#Install the below libraries before importing
import pandas as pd
from pandas_profiling import ProfileReport
#EDA using pandas-profiling
profile = ProfileReport(pd.read_excel('Mentees List.xlsx'), explorative=True)
#Saving results to a HTML file
profile.to_file("output.html")
Exploratory Data Analysis Tools
Pandas-Profiling Report:
The pandas-profiling library generates a report having:
 An overview of the dataset
 Variable properties
 Interaction of variables
 Correlation of variables
 Missing values
 Sample data
Exploratory Data Analysis Tools
Pandas-Profiling:
Report: file:///C:/Users/NITHI/output.html

Recommended for you

Page 18Goal Implement a complete search engine. Milestones.docx
Page 18Goal Implement a complete search engine. Milestones.docxPage 18Goal Implement a complete search engine. Milestones.docx
Page 18Goal Implement a complete search engine. Milestones.docx

Page 1/8 Goal: Implement a complete search engine. Milestones Overview Milestone Goal #1 Produce an initial index for the corpus and a basic retrieval component #2 Complete Search System Page 2/8 PROJECT: SEARCH ENGINE Corpus: all ICS web pages We will provide you with the crawled data as a zip file (webpages_raw.zip). This contains the downloaded content of the ICS web pages that were crawled by a previous quarter. You are expected to build your search engine index off of this data. Main challenges: Full HTML parsing, File/DB handling, handling user input (either using command line or desktop GUI application or web interface) COMPONENT 1 - INDEX: Create an inverted index for all the corpus given to you. You can either use a database to store your index (MongoDB, Redis, memcached are some examples) or you can store the index in a file. You are free to choose an approach here. The index should store more than just a simple list of documents where the token occurs. At the very least, your index should store the TF-IDF of every term/document. Sample Index: Note: This is a simplistic example provided for your understanding. Please do not consider this as the expected index format. A good inverted index will store more information than this. Index Structure: token – docId1, tf-idf1 ; docId2, tf-idf2 Example: informatics – doc_1, 5 ; doc_2, 10 ; doc_3, 7 You are encouraged to come up with heuristics that make sense and will help in retrieving relevant search results. For e.g. - words in bold and in heading (h1, h2, h3) could be treated as more important than the other words. These are useful metadata that could be added to your inverted index data. Optional (1 point for each meta data item up to 2 points max):: Extra credit will be given for ideas that improve the quality of the retrieval, so you may add more metadata to your index, if you think it will help improve the quality of the retrieval. For this, instead of storing a simple TF-IDF count for every page, you can store more information related to the page (e.g. position of the words in the page). To store this information, you need to design your index in such a way that it can store and retrieve all this metadata efficiently. Your index lookup during search should not be horribly slow, so pay attention to the structure of your index COMPONENT 2 – SEARCH AND RETRIEVE: Your program should prompt the user for a query. This doesn’t need to be a Web interface, it can be a console prompt. At the time of the query, your program will look up your index, perform some calculations (see ranking below) and give out the ranked list of pages that are relevant for the query.   COMPONENT 3 - RANKING: At the very least, your ranking formula should include tf-idf scoring, but you should feel free to add additional components to this formula if you think they improve the retrieval. Optional (1 point for each parameter up to 2 points max): Extra credit will be given if your ranking formula includes par.

LeetCode Database problems solved using PySpark.pdf
LeetCode Database problems solved using PySpark.pdfLeetCode Database problems solved using PySpark.pdf
LeetCode Database problems solved using PySpark.pdf

Pyspark

Best Practices of Clothing Businesses in Talavera, Nueva Ecija, A Foundation ...
Best Practices of Clothing Businesses in Talavera, Nueva Ecija, A Foundation ...Best Practices of Clothing Businesses in Talavera, Nueva Ecija, A Foundation ...
Best Practices of Clothing Businesses in Talavera, Nueva Ecija, A Foundation ...

This study primarily aimed to determine the best practices of clothing businesses to use it as a foundation of strategic business advancements. Moreover, the frequency with which the business's best practices are tracked, which best practices are the most targeted of the apparel firms to be retained, and how does best practices can be used as strategic business advancement. The respondents of the study is the owners of clothing businesses in Talavera, Nueva Ecija. Data were collected and analyzed using a quantitative approach and utilizing a descriptive research design. Unveiling best practices of clothing businesses as a foundation for strategic business advancement through statistical analysis: frequency and percentage, and weighted means analyzing the data in terms of identifying the most to the least important performance indicators of the businesses among all of the variables. Based on the survey conducted on clothing businesses in Talavera, Nueva Ecija, several best practices emerge across different areas of business operations. These practices are categorized into three main sections, section one being the Business Profile and Legal Requirements, followed by the tracking of indicators in terms of Product, Place, Promotion, and Price, and Key Performance Indicators (KPIs) covering finance, marketing, production, technical, and distribution aspects. The research study delved into identifying the core best practices of clothing businesses, serving as a strategic guide for their advancement. Through meticulous analysis, several key findings emerged. Firstly, prioritizing product factors, such as maintaining optimal stock levels and maximizing customer satisfaction, was deemed essential for driving sales and fostering loyalty. Additionally, selecting the right store location was crucial for visibility and accessibility, directly impacting footfall and sales. Vigilance towards competitors and demographic shifts was highlighted as essential for maintaining relevance. Understanding the relationship between marketing spend and customer acquisition proved pivotal for optimizing budgets and achieving a higher ROI. Strategic analysis of profit margins across clothing items emerged as crucial for maximizing profitability and revenue. Creating a positive customer experience, investing in employee training, and implementing effective inventory management practices were also identified as critical success factors. In essence, these findings underscored the holistic approach needed for sustainable growth in the clothing business, emphasizing the importance of product management, marketing strategies, customer experience, and operational efficiency.

best practicesclothing businesskey performance indicators
Exploratory Data Analysis Tools
Sweetviz:
 Sweetviz is an open-source python auto-visualization library that generates a report,
exploring the data with the help of high-density plots.
 It not only automates the EDA but is also used for comparing datasets and drawing
inferences from it.
 A comparison of two datasets can be done by treating one as training and the other as
testing.
Installation:
pip install sweetviz
Exploratory Data Analysis Tools
Sweetviz:
 Sweetviz is an open-source python auto-visualization library that generates a report,
exploring the data with the help of high-density plots.
 It not only automates the EDA but is also used for comparing datasets and drawing
inferences from it.
 A comparison of two datasets can be done by treating one as training and the other as
testing.
Installation:
pip install sweetviz
Exploratory Data Analysis Tools
Sweetviz:
#Install the below libraries before importing
import pandas as pd
import sweetviz as sv
#EDA using Sweetviz
sweet_report = sv.analyze(pd.read_excel(r'C:UsersNITHIDesktopMentees List.xlsx'))
#Saving results to HTML file
sweet_report.show_html('sweet_report.html')
Exploratory Data Analysis Tools
Sweetviz Report:
The Sweetviz library generates a report having:
 An overview of the dataset
 Variable properties
 Categorical associations
 Numerical associations
 Most frequent, smallest, largest values for numerical features

Recommended for you

Advances in Detect and Avoid for Unmanned Aircraft Systems and Advanced Air M...
Advances in Detect and Avoid for Unmanned Aircraft Systems and Advanced Air M...Advances in Detect and Avoid for Unmanned Aircraft Systems and Advanced Air M...
Advances in Detect and Avoid for Unmanned Aircraft Systems and Advanced Air M...

Advances in Detect and Avoid for Unmanned Aircraft Systems and Advanced Air Mobility

22519 - Client-Side Scripting Language (CSS) chapter 1 notes .pdf
22519 - Client-Side Scripting Language (CSS) chapter 1 notes .pdf22519 - Client-Side Scripting Language (CSS) chapter 1 notes .pdf
22519 - Client-Side Scripting Language (CSS) chapter 1 notes .pdf

CSS chapter 1 notes

Response & Safe AI at Summer School of AI at IIITH
Response & Safe AI at Summer School of AI at IIITHResponse & Safe AI at Summer School of AI at IIITH
Response & Safe AI at Summer School of AI at IIITH

Talk covering Guardrails , Jailbreak, What is an alignment problem? RLHF, EU AI Act, Machine & Graph unlearning, Bias, Inconsistency, Probing, Interpretability, Bias

machine learningchatgptaisafety
Exploratory Data Analysis Tools
Sweetviz Report:
file:///C:/Users/NITHI/sweet_report.html
Exploratory Data Analysis Tools
Autoviz:
 Autoviz is an open-source python auto visualization library that mainly focuses on
visualizing the relationship of the data by generating different types of plot.
 Installation:
pip install autoviz
Exploratory Data Analysis Tools
Autoviz:
#Install the below libraries before importing
import pandas as pd
from autoviz.AutoViz_Class import AutoViz_Class
#EDA using Autoviz
autoviz = AutoViz_Class().AutoViz(r'C:UsersNITHIDesktopMentees List.xlsx')
Exploratory Data Analysis Tools
Autoviz Report:
The Autoviz library generates a report having:
 An overview of the dataset
 Pairwise scatter plot of continuous variables
 Distribution of categorical variables
 Heatmaps of continuous variables
 Average numerical variable by each categorical variable

Recommended for you

21EC63_Module1B.pptx VLSI design 21ec63 MOS TRANSISTOR THEORY
21EC63_Module1B.pptx VLSI design 21ec63 MOS TRANSISTOR THEORY21EC63_Module1B.pptx VLSI design 21ec63 MOS TRANSISTOR THEORY
21EC63_Module1B.pptx VLSI design 21ec63 MOS TRANSISTOR THEORY

VLSI design 21ec63 MOS TRANSISTOR THEORY

CCS367-STORAGE TECHNOLOGIES QUESTION BANK.doc
CCS367-STORAGE TECHNOLOGIES QUESTION BANK.docCCS367-STORAGE TECHNOLOGIES QUESTION BANK.doc
CCS367-STORAGE TECHNOLOGIES QUESTION BANK.doc

CCS367-STORAGE TECHNOLOGIES QUESTION BANK

 
by Dss
ccs367-storage technologies qb
Development of Chatbot Using AI/ML Technologies
Development of  Chatbot Using AI/ML TechnologiesDevelopment of  Chatbot Using AI/ML Technologies
Development of Chatbot Using AI/ML Technologies

The rapid advancements in artificial intelligence and natural language processing have significantly transformed human-computer interactions. This thesis presents the design, development, and evaluation of an intelligent chatbot capable of engaging in natural and meaningful conversations with users. The chatbot leverages state-of-the-art deep learning techniques, including transformer-based architectures, to understand and generate human-like responses. Key contributions of this research include the implementation of a context- aware conversational model that can maintain coherent dialogue over extended interactions. The chatbot's performance is evaluated through both automated metrics and user studies, demonstrating its effectiveness in various applications such as customer service, mental health support, and educational assistance. Additionally, ethical considerations and potential biases in chatbot responses are examined to ensure the responsible deployment of this technology. The findings of this thesis highlight the potential of intelligent chatbots to enhance user experience and provide valuable insights for future developments in conversational AI.

thesischatbotai/ml
Exploratory Data Analysis Tools
Autoviz Report:
Exploratory Data Analysis Tools
D-Tale:
 D-Tale is an open-source python auto-visualization library. It is one of the best auto
data-visualization libraries.
 D-Tale helps you to get a detailed EDA of the data. It also has a feature of code export for
every plot or analysis in the report.
 Installation:
pip install dtale
Exploratory Data Analysis Tools
D-Tale:
 D-Tale is an open-source python auto-visualization library. It is one of the best auto
data-visualization libraries.
 D-Tale helps you to get a detailed EDA of the data. It also has a feature of code export for
every plot or analysis in the report.
 Installation:
pip install dtale
Exploratory Data Analysis Tools
D-Tale:
import dtale
import pandas as pd
dtale.show(pd.read_excel(r'C:UsersNITHIDesktopMentees List.xlsx'))

Recommended for you

Natural Is The Best: Model-Agnostic Code Simplification for Pre-trained Large...
Natural Is The Best: Model-Agnostic Code Simplification for Pre-trained Large...Natural Is The Best: Model-Agnostic Code Simplification for Pre-trained Large...
Natural Is The Best: Model-Agnostic Code Simplification for Pre-trained Large...

Pre-trained Large Language Models (LLM) have achieved remarkable successes in several domains. However, code-oriented LLMs are often heavy in computational complexity, and quadratically with the length of the input code sequence. Toward simplifying the input program of an LLM, the state-of-the-art approach has the strategies to filter the input code tokens based on the attention scores given by the LLM. The decision to simplify the input program should not rely on the attention patterns of an LLM, as these patterns are influenced by both the model architecture and the pre-training dataset. Since the model and dataset are part of the solution domain, not the problem domain where the input program belongs, the outcome may differ when the model is trained on a different dataset. We propose SlimCode, a model-agnostic code simplification solution for LLMs that depends on the nature of input code tokens. As an empirical study on the LLMs including CodeBERT, CodeT5, and GPT-4 for two main tasks: code search and summarization. We reported that 1) the reduction ratio of code has a linear-like relation with the saving ratio on training time, 2) the impact of categorized tokens on code simplification can vary significantly, 3) the impact of categorized tokens on code simplification is task-specific but model-agnostic, and 4) the above findings hold for the paradigm–prompt engineering and interactive in-context learning and this study can save reduce the cost of invoking GPT-4 by 24%per API query. Importantly, SlimCode simplifies the input code with its greedy strategy and can obtain at most 133 times faster than the state-of-the-art technique with a significant improvement. This paper calls for a new direction on code-based, model-agnostic code simplification solutions to further empower LLMs.

code simplification
kiln burning and kiln burner system for clinker
kiln burning and kiln burner system for clinkerkiln burning and kiln burner system for clinker
kiln burning and kiln burner system for clinker

Kiln

L-3536-Cost Benifit Analysis in ESIA.pptx
L-3536-Cost Benifit Analysis in ESIA.pptxL-3536-Cost Benifit Analysis in ESIA.pptx
L-3536-Cost Benifit Analysis in ESIA.pptx

..

Exploratory Data Analysis Tools
D-Tale Report:
The dtale library generates a report having:
 An overview of the dataset
 Custom filters
 Correlation, Charts, and Heatmaps
 Highlight datatypes, missing values, ranges
 Code export
Exploratory Data Analysis Tools
D-Tale Report:
Data Management
Data Management
What is Data Management?
 Data management is the practice of collecting, organizing, protecting, and storing an
organization’s data so it can be analyzed for business decisions.
 As organizations create and consume data at unprecedented rates, data management
solutions become essential for making sense of the vast quantities of data.
 Today’s leading data management software ensures that reliable, up-to-date data is
always used to drive decisions.

Recommended for you

Rotary Intersection in traffic engineering.pptx
Rotary Intersection in traffic engineering.pptxRotary Intersection in traffic engineering.pptx
Rotary Intersection in traffic engineering.pptx

Very Important design

Lecture 3 Biomass energy...............ppt
Lecture 3 Biomass energy...............pptLecture 3 Biomass energy...............ppt
Lecture 3 Biomass energy...............ppt

Biomass energy

rujan timsina
21CV61- Module 3 (CONSTRUCTION MANAGEMENT AND ENTREPRENEURSHIP.pptx
21CV61- Module 3 (CONSTRUCTION MANAGEMENT AND ENTREPRENEURSHIP.pptx21CV61- Module 3 (CONSTRUCTION MANAGEMENT AND ENTREPRENEURSHIP.pptx
21CV61- Module 3 (CONSTRUCTION MANAGEMENT AND ENTREPRENEURSHIP.pptx

Notes of Construction management and entrepreneurship

construction management
Data Management
Types of Data Management
Data management plays several roles in an organization’s data environment, making
essential functions easier and less time-intensive.
1. Data preparation is used to clean and transform raw data into the right shape and format
for analysis, including making corrections and combining data sets.
2. Data Pipelines enable the automated transfer of data from one system to another.
3. ETLs (Extract, Transform, Load) are built to take the data from one system,
transform it, and load it into the organization’s data warehouse.
Data Management
Types of Data Management (cont...)
4. Data Catalogs - help manage metadata to create a complete picture of the data,
providing a summary of its changes, locations, and quality while also making the data
easy to find.
5. Data Warehouses are places to consolidate various data sources, contend with the
many data types businesses store, and provide a clear route for data analysis.
6. Data Governance defines standards, processes, and policies to maintain data security
and integrity.
Data Management
Types of Data Management (cont...)
7. Data Architecture provides a formal approach for creating and managing data flow.
8. Data Security protects data from unauthorized access and corruption.
9. Data Modeling documents the flow of data through an application or organization.
Data Management
Why data management is important?
 Data management is a crucial first step that leads to add value to our customers and
improve our business bottom line.
 The effective data management, people across an organization can find and access
trusted data for their queries.
 Some benefits of an effective data management solution includes:
1) Visibility
2) Reliability
3) Security
4) Scalability

Recommended for you

Exploring Deep Learning Models for Image Recognition: A Comparative Review
Exploring Deep Learning Models for Image Recognition: A Comparative ReviewExploring Deep Learning Models for Image Recognition: A Comparative Review
Exploring Deep Learning Models for Image Recognition: A Comparative Review

Image recognition, which comes under Artificial Intelligence (AI) is a critical aspect of computer vision, enabling computers or other computing devices to identify and categorize objects within images. Among numerous fields of life, food processing is an important area, in which image processing plays a vital role, both for producers and consumers. This study focuses on the binary classification of strawberries, where images are sorted into one of two categories. We Utilized a dataset of strawberry images for this study; we aim to determine the effectiveness of different models in identifying whether an image contains strawberries. This research has practical applications in fields such as agriculture and quality control. We compared various popular deep learning models, including MobileNetV2, Convolutional Neural Networks (CNN), and DenseNet121, for binary classification of strawberry images. The accuracy achieved by MobileNetV2 is 96.7%, CNN is 99.8%, and DenseNet121 is 93.6%. Through rigorous testing and analysis, our results demonstrate that CNN outperforms the other models in this task. In the future, the deep learning models can be evaluated on a richer and larger number of images (datasets) for better/improved results.

image recognitiondeep learning
SCADAmetrics Instrumentation for Sensus Water Meters - Core and Main Training...
SCADAmetrics Instrumentation for Sensus Water Meters - Core and Main Training...SCADAmetrics Instrumentation for Sensus Water Meters - Core and Main Training...
SCADAmetrics Instrumentation for Sensus Water Meters - Core and Main Training...

SCADAmetrics Instrumentation for Sensus Water Meters - Core and Main Training 2024 July 09

Unit 1 Information Storage and Retrieval
Unit 1 Information Storage and RetrievalUnit 1 Information Storage and Retrieval
Unit 1 Information Storage and Retrieval

Introduction to information retrieval, Major challenges in IR

Data Management
Important of Data Management
1) Visibility –
 Increase the visibility of your organization’s data assets.
 Easier for people to quickly and confidently find the right data for their analysis.
1) Reliability –
 By establishing processes and policies to build the trust in the data being used to make decisions
across your organization.
1) Security –
 Protects your organization and its employees from data losses, thefts, and breaches with
authentication and encryption tools.
1) Scalability –
 Allows organizations to effectively scale data and usage occasions with
repeatable processes to keep data and metadata up to date.
Data Management
Data Management Challenges:
 Traditional Data Management processes make it difficult to scale capabilities without compromising
governance or security.
 Modern Data Management software must address several challenges to ensure trusted data can be found.
Challenge 1:
Increased Data Volumes - Organization to become unaware of what data it has, where the data is, and how
to use it.
Challenge 2:
New Roles for Analytics - Understanding naming conventions, complex data structures, and databases can
be a challenge.
Challenge 3:
Compliance Requirements - Constantly changing compliance requirements make
it a challenge to ensure people are using the right data.
Data Management
Establish Best Data Management:
An effective data management strategy.
1. Clearly Identify Your Business Goals
2. Focus on the Quality of Data
3. Allow the Right People to Access the Data
4. Prioritize Data Security
Data Cleaning
Data Cleaning –
 Process of identifying the incorrect, incomplete, inaccurate, irrelevant or missing part
of the data.
 Modifying, Replacing or Deleting them according to the necessity.
 Data Cleaning is considered a foundational element of the basic data science.

Recommended for you

Understanding Cybersecurity Breaches: Causes, Consequences, and Prevention
Understanding Cybersecurity Breaches: Causes, Consequences, and PreventionUnderstanding Cybersecurity Breaches: Causes, Consequences, and Prevention
Understanding Cybersecurity Breaches: Causes, Consequences, and Prevention

Cybersecurity breaches are a growing threat in today’s interconnected digital landscape, affecting individuals, businesses, and governments alike. These breaches compromise sensitive information and erode trust in online services and systems. Understanding the causes, consequences, and prevention strategies of cybersecurity breaches is crucial to protect against these pervasive risks. Cybersecurity breaches refer to unauthorized access, manipulation, or destruction of digital information or systems. They can occur through various means such as malware, phishing attacks, insider threats, and vulnerabilities in software or hardware. Once a breach happens, cybercriminals can exploit the compromised data for financial gain, espionage, or sabotage. Causes of breaches include software and hardware vulnerabilities, phishing attacks, insider threats, weak passwords, and a lack of security awareness. The consequences of cybersecurity breaches are severe. Financial loss is a significant impact, as organizations face theft of funds, legal fees, and repair costs. Breaches also damage reputations, leading to a loss of trust among customers, partners, and stakeholders. Regulatory penalties are another consequence, with hefty fines imposed for non-compliance with data protection regulations. Intellectual property theft undermines innovation and competitiveness, while disruptions of critical services like healthcare and utilities impact public safety and well-being.

cybersecurity breachesdata securityphishing attacks
Trends in Computer Aided Design and MFG.
Trends in Computer Aided Design and MFG.Trends in Computer Aided Design and MFG.
Trends in Computer Aided Design and MFG.

Trends in CAD CAM

MSBTE K Scheme MSBTE K Scheme MSBTE K Scheme MSBTE K Scheme
MSBTE K Scheme MSBTE K Scheme MSBTE K Scheme MSBTE K SchemeMSBTE K Scheme MSBTE K Scheme MSBTE K Scheme MSBTE K Scheme
MSBTE K Scheme MSBTE K Scheme MSBTE K Scheme MSBTE K Scheme

Syllabus

msbte syllabusxcxzcxzcxzxccxzczc
Data Cleaning
Data Cleaning –
 Data is the most valuable thing for Analytics and Machine learning.
 In computing or Business, data is needed everywhere. When it comes to the real world
data, it is not improbable that data may contain incomplete, inconsistent or missing
values.
 If the data is corrupted then it may hinder the process or provide inaccurate results.
Data Cleaning
Data Cleaning –
 Data is the most valuable thing for Analytics and Machine learning.
 In computing or Business, data is needed everywhere. When it comes to the real world
data, it is not improbable that data may contain incomplete, inconsistent or missing
values.
 If the data is corrupted then it may hinder the process or provide inaccurate results.
Data Cleaning

More Related Content

What's hot

Data preprocessing in Data Mining
Data preprocessing in Data MiningData preprocessing in Data Mining
Data preprocessing in Data Mining
DHIVYADEVAKI
 
Data preprocessing ng
Data preprocessing   ngData preprocessing   ng
Data preprocessing ng
datapreprocessing
 
Data mining and data warehouse lab manual updated
Data mining and data warehouse lab manual updatedData mining and data warehouse lab manual updated
Data mining and data warehouse lab manual updated
Yugal Kumar
 
Discretization and concept hierarchy(os)
Discretization and concept hierarchy(os)Discretization and concept hierarchy(os)
Discretization and concept hierarchy(os)
snegacmr
 
Data science applications and usecases
Data science applications and usecasesData science applications and usecases
Data science applications and usecases
Sreenatha Reddy K R
 
introduction to data science
introduction to data scienceintroduction to data science
introduction to data science
bhavesh lande
 
A Statistician's View on Big Data and Data Science (Version 1)
A Statistician's View on Big Data and Data Science (Version 1)A Statistician's View on Big Data and Data Science (Version 1)
A Statistician's View on Big Data and Data Science (Version 1)
Prof. Dr. Diego Kuonen
 
Data Integration and Transformation in Data mining
Data Integration and Transformation in Data miningData Integration and Transformation in Data mining
Data Integration and Transformation in Data mining
kavitha muneeshwaran
 
Major issues in data mining
Major issues in data miningMajor issues in data mining
Major issues in data mining
Slideshare
 
Spatial databases
Spatial databasesSpatial databases
Spatial databases
Dabbal Singh Mahara
 
Data Science Project Lifecycle
Data Science Project LifecycleData Science Project Lifecycle
Data Science Project Lifecycle
Jason Geng
 
Introduction of Data Science
Introduction of Data ScienceIntroduction of Data Science
Introduction of Data Science
Jason Geng
 
Analysis vs reporting
Analysis vs reportingAnalysis vs reporting
Analysis vs reporting
Rajashree Thirupathi
 
Big data unit 2
Big data unit 2Big data unit 2
Big data unit 2
RojaT4
 
Data warehouse physical design
Data warehouse physical designData warehouse physical design
Data warehouse physical design
Er. Nawaraj Bhandari
 
Spatial databases
Spatial databasesSpatial databases
Spatial databases
Seraphic Nazir
 
Introduction to Data Mining
Introduction to Data MiningIntroduction to Data Mining
Introduction to Data Mining
DataminingTools Inc
 
Parallel Programming for Multi- Core and Cluster Systems - Performance Analysis
Parallel Programming for Multi- Core and Cluster Systems - Performance AnalysisParallel Programming for Multi- Core and Cluster Systems - Performance Analysis
Parallel Programming for Multi- Core and Cluster Systems - Performance Analysis
Shah Zaib
 
Frequent itemset mining methods
Frequent itemset mining methodsFrequent itemset mining methods
Frequent itemset mining methods
Prof.Nilesh Magar
 
Data science - An Introduction
Data science - An IntroductionData science - An Introduction
Data science - An Introduction
Ravishankar Rajagopalan
 

What's hot (20)

Data preprocessing in Data Mining
Data preprocessing in Data MiningData preprocessing in Data Mining
Data preprocessing in Data Mining
 
Data preprocessing ng
Data preprocessing   ngData preprocessing   ng
Data preprocessing ng
 
Data mining and data warehouse lab manual updated
Data mining and data warehouse lab manual updatedData mining and data warehouse lab manual updated
Data mining and data warehouse lab manual updated
 
Discretization and concept hierarchy(os)
Discretization and concept hierarchy(os)Discretization and concept hierarchy(os)
Discretization and concept hierarchy(os)
 
Data science applications and usecases
Data science applications and usecasesData science applications and usecases
Data science applications and usecases
 
introduction to data science
introduction to data scienceintroduction to data science
introduction to data science
 
A Statistician's View on Big Data and Data Science (Version 1)
A Statistician's View on Big Data and Data Science (Version 1)A Statistician's View on Big Data and Data Science (Version 1)
A Statistician's View on Big Data and Data Science (Version 1)
 
Data Integration and Transformation in Data mining
Data Integration and Transformation in Data miningData Integration and Transformation in Data mining
Data Integration and Transformation in Data mining
 
Major issues in data mining
Major issues in data miningMajor issues in data mining
Major issues in data mining
 
Spatial databases
Spatial databasesSpatial databases
Spatial databases
 
Data Science Project Lifecycle
Data Science Project LifecycleData Science Project Lifecycle
Data Science Project Lifecycle
 
Introduction of Data Science
Introduction of Data ScienceIntroduction of Data Science
Introduction of Data Science
 
Analysis vs reporting
Analysis vs reportingAnalysis vs reporting
Analysis vs reporting
 
Big data unit 2
Big data unit 2Big data unit 2
Big data unit 2
 
Data warehouse physical design
Data warehouse physical designData warehouse physical design
Data warehouse physical design
 
Spatial databases
Spatial databasesSpatial databases
Spatial databases
 
Introduction to Data Mining
Introduction to Data MiningIntroduction to Data Mining
Introduction to Data Mining
 
Parallel Programming for Multi- Core and Cluster Systems - Performance Analysis
Parallel Programming for Multi- Core and Cluster Systems - Performance AnalysisParallel Programming for Multi- Core and Cluster Systems - Performance Analysis
Parallel Programming for Multi- Core and Cluster Systems - Performance Analysis
 
Frequent itemset mining methods
Frequent itemset mining methodsFrequent itemset mining methods
Frequent itemset mining methods
 
Data science - An Introduction
Data science - An IntroductionData science - An Introduction
Data science - An Introduction
 

Similar to Data Science Process.pptx

IRJET- Resume Information Extraction Framework
IRJET- Resume Information Extraction FrameworkIRJET- Resume Information Extraction Framework
IRJET- Resume Information Extraction Framework
IRJET Journal
 
Data Wrangling and Visualization Using Python
Data Wrangling and Visualization Using PythonData Wrangling and Visualization Using Python
Data Wrangling and Visualization Using Python
MOHITKUMAR1379
 
Bba203 unit 2data processing concepts
Bba203   unit 2data processing conceptsBba203   unit 2data processing concepts
Bba203 unit 2data processing concepts
kinjal patel
 
chapter 1 HARDWARE AND NETWORKING SERVICE.pptx
chapter 1 HARDWARE AND NETWORKING SERVICE.pptxchapter 1 HARDWARE AND NETWORKING SERVICE.pptx
chapter 1 HARDWARE AND NETWORKING SERVICE.pptx
sufiyanhussein798
 
Disclosing Private Information from Metadata, hidden info and lost data
Disclosing Private Information from  Metadata, hidden info and lost data Disclosing Private Information from  Metadata, hidden info and lost data
Disclosing Private Information from Metadata, hidden info and lost data
Chema Alonso
 
Database Systems Concepts, 5th Ed
Database Systems Concepts, 5th EdDatabase Systems Concepts, 5th Ed
Database Systems Concepts, 5th Ed
Daniel Francisco Tamayo
 
D I T211 Chapter 1
D I T211    Chapter 1D I T211    Chapter 1
D I T211 Chapter 1
askme
 
Chapter 2 - Intro to Data Sciences[2].pptx
Chapter 2 - Intro to Data Sciences[2].pptxChapter 2 - Intro to Data Sciences[2].pptx
Chapter 2 - Intro to Data Sciences[2].pptx
JethroDignadice2
 
Introduction to Data Science With R Notes
Introduction to Data Science With R NotesIntroduction to Data Science With R Notes
Introduction to Data Science With R Notes
LakshmiSarvani6
 
D I T211 Chapter 1 1
D I T211    Chapter 1 1D I T211    Chapter 1 1
D I T211 Chapter 1 1
askme
 
Database Management System, Lecture-1
Database Management System, Lecture-1Database Management System, Lecture-1
Database Management System, Lecture-1
Sonia Mim
 
2016 Chapter 2 - Intro. to Data Sciences.pptx
2016  Chapter 2 - Intro. to Data Sciences.pptx2016  Chapter 2 - Intro. to Data Sciences.pptx
2016 Chapter 2 - Intro. to Data Sciences.pptx
mussie143tadesse
 
Qiagram
QiagramQiagram
Qiagram
jwppz
 
Database
DatabaseDatabase
Database
sumit621
 
Data Life Cycle
Data Life CycleData Life Cycle
Data Life Cycle
Jason Henderson
 
Data Science Course.pdf
Data Science Course.pdfData Science Course.pdf
Data Science Course.pdf
AnjaliSharma655429
 
How Data Virtualization Adds Value to Your Data Science Stack
How Data Virtualization Adds Value to Your Data Science StackHow Data Virtualization Adds Value to Your Data Science Stack
How Data Virtualization Adds Value to Your Data Science Stack
Denodo
 
Evaluation of Research Tools
Evaluation of Research ToolsEvaluation of Research Tools
Evaluation of Research Tools
HATS
 
Utilizing the natural langauage toolkit for keyword research
Utilizing the natural langauage toolkit for keyword researchUtilizing the natural langauage toolkit for keyword research
Utilizing the natural langauage toolkit for keyword research
Erudite
 
Page 18Goal Implement a complete search engine. Milestones.docx
Page 18Goal Implement a complete search engine. Milestones.docxPage 18Goal Implement a complete search engine. Milestones.docx
Page 18Goal Implement a complete search engine. Milestones.docx
smile790243
 

Similar to Data Science Process.pptx (20)

IRJET- Resume Information Extraction Framework
IRJET- Resume Information Extraction FrameworkIRJET- Resume Information Extraction Framework
IRJET- Resume Information Extraction Framework
 
Data Wrangling and Visualization Using Python
Data Wrangling and Visualization Using PythonData Wrangling and Visualization Using Python
Data Wrangling and Visualization Using Python
 
Bba203 unit 2data processing concepts
Bba203   unit 2data processing conceptsBba203   unit 2data processing concepts
Bba203 unit 2data processing concepts
 
chapter 1 HARDWARE AND NETWORKING SERVICE.pptx
chapter 1 HARDWARE AND NETWORKING SERVICE.pptxchapter 1 HARDWARE AND NETWORKING SERVICE.pptx
chapter 1 HARDWARE AND NETWORKING SERVICE.pptx
 
Disclosing Private Information from Metadata, hidden info and lost data
Disclosing Private Information from  Metadata, hidden info and lost data Disclosing Private Information from  Metadata, hidden info and lost data
Disclosing Private Information from Metadata, hidden info and lost data
 
Database Systems Concepts, 5th Ed
Database Systems Concepts, 5th EdDatabase Systems Concepts, 5th Ed
Database Systems Concepts, 5th Ed
 
D I T211 Chapter 1
D I T211    Chapter 1D I T211    Chapter 1
D I T211 Chapter 1
 
Chapter 2 - Intro to Data Sciences[2].pptx
Chapter 2 - Intro to Data Sciences[2].pptxChapter 2 - Intro to Data Sciences[2].pptx
Chapter 2 - Intro to Data Sciences[2].pptx
 
Introduction to Data Science With R Notes
Introduction to Data Science With R NotesIntroduction to Data Science With R Notes
Introduction to Data Science With R Notes
 
D I T211 Chapter 1 1
D I T211    Chapter 1 1D I T211    Chapter 1 1
D I T211 Chapter 1 1
 
Database Management System, Lecture-1
Database Management System, Lecture-1Database Management System, Lecture-1
Database Management System, Lecture-1
 
2016 Chapter 2 - Intro. to Data Sciences.pptx
2016  Chapter 2 - Intro. to Data Sciences.pptx2016  Chapter 2 - Intro. to Data Sciences.pptx
2016 Chapter 2 - Intro. to Data Sciences.pptx
 
Qiagram
QiagramQiagram
Qiagram
 
Database
DatabaseDatabase
Database
 
Data Life Cycle
Data Life CycleData Life Cycle
Data Life Cycle
 
Data Science Course.pdf
Data Science Course.pdfData Science Course.pdf
Data Science Course.pdf
 
How Data Virtualization Adds Value to Your Data Science Stack
How Data Virtualization Adds Value to Your Data Science StackHow Data Virtualization Adds Value to Your Data Science Stack
How Data Virtualization Adds Value to Your Data Science Stack
 
Evaluation of Research Tools
Evaluation of Research ToolsEvaluation of Research Tools
Evaluation of Research Tools
 
Utilizing the natural langauage toolkit for keyword research
Utilizing the natural langauage toolkit for keyword researchUtilizing the natural langauage toolkit for keyword research
Utilizing the natural langauage toolkit for keyword research
 
Page 18Goal Implement a complete search engine. Milestones.docx
Page 18Goal Implement a complete search engine. Milestones.docxPage 18Goal Implement a complete search engine. Milestones.docx
Page 18Goal Implement a complete search engine. Milestones.docx
 

Recently uploaded

LeetCode Database problems solved using PySpark.pdf
LeetCode Database problems solved using PySpark.pdfLeetCode Database problems solved using PySpark.pdf
LeetCode Database problems solved using PySpark.pdf
pavanaroshni1977
 
Best Practices of Clothing Businesses in Talavera, Nueva Ecija, A Foundation ...
Best Practices of Clothing Businesses in Talavera, Nueva Ecija, A Foundation ...Best Practices of Clothing Businesses in Talavera, Nueva Ecija, A Foundation ...
Best Practices of Clothing Businesses in Talavera, Nueva Ecija, A Foundation ...
IJAEMSJORNAL
 
Advances in Detect and Avoid for Unmanned Aircraft Systems and Advanced Air M...
Advances in Detect and Avoid for Unmanned Aircraft Systems and Advanced Air M...Advances in Detect and Avoid for Unmanned Aircraft Systems and Advanced Air M...
Advances in Detect and Avoid for Unmanned Aircraft Systems and Advanced Air M...
VICTOR MAESTRE RAMIREZ
 
22519 - Client-Side Scripting Language (CSS) chapter 1 notes .pdf
22519 - Client-Side Scripting Language (CSS) chapter 1 notes .pdf22519 - Client-Side Scripting Language (CSS) chapter 1 notes .pdf
22519 - Client-Side Scripting Language (CSS) chapter 1 notes .pdf
sharvaridhokte
 
Response & Safe AI at Summer School of AI at IIITH
Response & Safe AI at Summer School of AI at IIITHResponse & Safe AI at Summer School of AI at IIITH
Response & Safe AI at Summer School of AI at IIITH
IIIT Hyderabad
 
21EC63_Module1B.pptx VLSI design 21ec63 MOS TRANSISTOR THEORY
21EC63_Module1B.pptx VLSI design 21ec63 MOS TRANSISTOR THEORY21EC63_Module1B.pptx VLSI design 21ec63 MOS TRANSISTOR THEORY
21EC63_Module1B.pptx VLSI design 21ec63 MOS TRANSISTOR THEORY
PradeepKumarSK3
 
CCS367-STORAGE TECHNOLOGIES QUESTION BANK.doc
CCS367-STORAGE TECHNOLOGIES QUESTION BANK.docCCS367-STORAGE TECHNOLOGIES QUESTION BANK.doc
CCS367-STORAGE TECHNOLOGIES QUESTION BANK.doc
Dss
 
Development of Chatbot Using AI/ML Technologies
Development of  Chatbot Using AI/ML TechnologiesDevelopment of  Chatbot Using AI/ML Technologies
Development of Chatbot Using AI/ML Technologies
maisnampibarel
 
Natural Is The Best: Model-Agnostic Code Simplification for Pre-trained Large...
Natural Is The Best: Model-Agnostic Code Simplification for Pre-trained Large...Natural Is The Best: Model-Agnostic Code Simplification for Pre-trained Large...
Natural Is The Best: Model-Agnostic Code Simplification for Pre-trained Large...
YanKing2
 
kiln burning and kiln burner system for clinker
kiln burning and kiln burner system for clinkerkiln burning and kiln burner system for clinker
kiln burning and kiln burner system for clinker
hamedmustafa094
 
L-3536-Cost Benifit Analysis in ESIA.pptx
L-3536-Cost Benifit Analysis in ESIA.pptxL-3536-Cost Benifit Analysis in ESIA.pptx
L-3536-Cost Benifit Analysis in ESIA.pptx
naseki5964
 
Rotary Intersection in traffic engineering.pptx
Rotary Intersection in traffic engineering.pptxRotary Intersection in traffic engineering.pptx
Rotary Intersection in traffic engineering.pptx
surekha1287
 
Lecture 3 Biomass energy...............ppt
Lecture 3 Biomass energy...............pptLecture 3 Biomass energy...............ppt
Lecture 3 Biomass energy...............ppt
RujanTimsina1
 
21CV61- Module 3 (CONSTRUCTION MANAGEMENT AND ENTREPRENEURSHIP.pptx
21CV61- Module 3 (CONSTRUCTION MANAGEMENT AND ENTREPRENEURSHIP.pptx21CV61- Module 3 (CONSTRUCTION MANAGEMENT AND ENTREPRENEURSHIP.pptx
21CV61- Module 3 (CONSTRUCTION MANAGEMENT AND ENTREPRENEURSHIP.pptx
sanabts249
 
Exploring Deep Learning Models for Image Recognition: A Comparative Review
Exploring Deep Learning Models for Image Recognition: A Comparative ReviewExploring Deep Learning Models for Image Recognition: A Comparative Review
Exploring Deep Learning Models for Image Recognition: A Comparative Review
sipij
 
SCADAmetrics Instrumentation for Sensus Water Meters - Core and Main Training...
SCADAmetrics Instrumentation for Sensus Water Meters - Core and Main Training...SCADAmetrics Instrumentation for Sensus Water Meters - Core and Main Training...
SCADAmetrics Instrumentation for Sensus Water Meters - Core and Main Training...
Jim Mimlitz, P.E.
 
Unit 1 Information Storage and Retrieval
Unit 1 Information Storage and RetrievalUnit 1 Information Storage and Retrieval
Unit 1 Information Storage and Retrieval
KishorMahale5
 
Understanding Cybersecurity Breaches: Causes, Consequences, and Prevention
Understanding Cybersecurity Breaches: Causes, Consequences, and PreventionUnderstanding Cybersecurity Breaches: Causes, Consequences, and Prevention
Understanding Cybersecurity Breaches: Causes, Consequences, and Prevention
Bert Blevins
 
Trends in Computer Aided Design and MFG.
Trends in Computer Aided Design and MFG.Trends in Computer Aided Design and MFG.
Trends in Computer Aided Design and MFG.
Tool and Die Tech
 
MSBTE K Scheme MSBTE K Scheme MSBTE K Scheme MSBTE K Scheme
MSBTE K Scheme MSBTE K Scheme MSBTE K Scheme MSBTE K SchemeMSBTE K Scheme MSBTE K Scheme MSBTE K Scheme MSBTE K Scheme
MSBTE K Scheme MSBTE K Scheme MSBTE K Scheme MSBTE K Scheme
Anwar Patel
 

Recently uploaded (20)

LeetCode Database problems solved using PySpark.pdf
LeetCode Database problems solved using PySpark.pdfLeetCode Database problems solved using PySpark.pdf
LeetCode Database problems solved using PySpark.pdf
 
Best Practices of Clothing Businesses in Talavera, Nueva Ecija, A Foundation ...
Best Practices of Clothing Businesses in Talavera, Nueva Ecija, A Foundation ...Best Practices of Clothing Businesses in Talavera, Nueva Ecija, A Foundation ...
Best Practices of Clothing Businesses in Talavera, Nueva Ecija, A Foundation ...
 
Advances in Detect and Avoid for Unmanned Aircraft Systems and Advanced Air M...
Advances in Detect and Avoid for Unmanned Aircraft Systems and Advanced Air M...Advances in Detect and Avoid for Unmanned Aircraft Systems and Advanced Air M...
Advances in Detect and Avoid for Unmanned Aircraft Systems and Advanced Air M...
 
22519 - Client-Side Scripting Language (CSS) chapter 1 notes .pdf
22519 - Client-Side Scripting Language (CSS) chapter 1 notes .pdf22519 - Client-Side Scripting Language (CSS) chapter 1 notes .pdf
22519 - Client-Side Scripting Language (CSS) chapter 1 notes .pdf
 
Response & Safe AI at Summer School of AI at IIITH
Response & Safe AI at Summer School of AI at IIITHResponse & Safe AI at Summer School of AI at IIITH
Response & Safe AI at Summer School of AI at IIITH
 
21EC63_Module1B.pptx VLSI design 21ec63 MOS TRANSISTOR THEORY
21EC63_Module1B.pptx VLSI design 21ec63 MOS TRANSISTOR THEORY21EC63_Module1B.pptx VLSI design 21ec63 MOS TRANSISTOR THEORY
21EC63_Module1B.pptx VLSI design 21ec63 MOS TRANSISTOR THEORY
 
CCS367-STORAGE TECHNOLOGIES QUESTION BANK.doc
CCS367-STORAGE TECHNOLOGIES QUESTION BANK.docCCS367-STORAGE TECHNOLOGIES QUESTION BANK.doc
CCS367-STORAGE TECHNOLOGIES QUESTION BANK.doc
 
Development of Chatbot Using AI/ML Technologies
Development of  Chatbot Using AI/ML TechnologiesDevelopment of  Chatbot Using AI/ML Technologies
Development of Chatbot Using AI/ML Technologies
 
Natural Is The Best: Model-Agnostic Code Simplification for Pre-trained Large...
Natural Is The Best: Model-Agnostic Code Simplification for Pre-trained Large...Natural Is The Best: Model-Agnostic Code Simplification for Pre-trained Large...
Natural Is The Best: Model-Agnostic Code Simplification for Pre-trained Large...
 
kiln burning and kiln burner system for clinker
kiln burning and kiln burner system for clinkerkiln burning and kiln burner system for clinker
kiln burning and kiln burner system for clinker
 
L-3536-Cost Benifit Analysis in ESIA.pptx
L-3536-Cost Benifit Analysis in ESIA.pptxL-3536-Cost Benifit Analysis in ESIA.pptx
L-3536-Cost Benifit Analysis in ESIA.pptx
 
Rotary Intersection in traffic engineering.pptx
Rotary Intersection in traffic engineering.pptxRotary Intersection in traffic engineering.pptx
Rotary Intersection in traffic engineering.pptx
 
Lecture 3 Biomass energy...............ppt
Lecture 3 Biomass energy...............pptLecture 3 Biomass energy...............ppt
Lecture 3 Biomass energy...............ppt
 
21CV61- Module 3 (CONSTRUCTION MANAGEMENT AND ENTREPRENEURSHIP.pptx
21CV61- Module 3 (CONSTRUCTION MANAGEMENT AND ENTREPRENEURSHIP.pptx21CV61- Module 3 (CONSTRUCTION MANAGEMENT AND ENTREPRENEURSHIP.pptx
21CV61- Module 3 (CONSTRUCTION MANAGEMENT AND ENTREPRENEURSHIP.pptx
 
Exploring Deep Learning Models for Image Recognition: A Comparative Review
Exploring Deep Learning Models for Image Recognition: A Comparative ReviewExploring Deep Learning Models for Image Recognition: A Comparative Review
Exploring Deep Learning Models for Image Recognition: A Comparative Review
 
SCADAmetrics Instrumentation for Sensus Water Meters - Core and Main Training...
SCADAmetrics Instrumentation for Sensus Water Meters - Core and Main Training...SCADAmetrics Instrumentation for Sensus Water Meters - Core and Main Training...
SCADAmetrics Instrumentation for Sensus Water Meters - Core and Main Training...
 
Unit 1 Information Storage and Retrieval
Unit 1 Information Storage and RetrievalUnit 1 Information Storage and Retrieval
Unit 1 Information Storage and Retrieval
 
Understanding Cybersecurity Breaches: Causes, Consequences, and Prevention
Understanding Cybersecurity Breaches: Causes, Consequences, and PreventionUnderstanding Cybersecurity Breaches: Causes, Consequences, and Prevention
Understanding Cybersecurity Breaches: Causes, Consequences, and Prevention
 
Trends in Computer Aided Design and MFG.
Trends in Computer Aided Design and MFG.Trends in Computer Aided Design and MFG.
Trends in Computer Aided Design and MFG.
 
MSBTE K Scheme MSBTE K Scheme MSBTE K Scheme MSBTE K Scheme
MSBTE K Scheme MSBTE K Scheme MSBTE K Scheme MSBTE K SchemeMSBTE K Scheme MSBTE K Scheme MSBTE K Scheme MSBTE K Scheme
MSBTE K Scheme MSBTE K Scheme MSBTE K Scheme MSBTE K Scheme
 

Data Science Process.pptx

  • 2. 1. Setting the research goal 1. Spend time understanding the goals and context of your research 2. Create a project charter
  • 3. 2. Retrieving Data 1. Data within company a. This data can be stored in official data repositories such as databases, data marts, data warehouses, and data lakes maintained by a team of IT professionals. b. The primary goal of a database is data storage, while a data warehouse is designed for reading and analyzing that data. c. A data mart is a subset of the data warehouse and geared toward serving a specific business unit. While data warehouses and data marts are home to preprocessed data, data lakes contains data in its natural or raw format. 2. Open source data
  • 6. Why the errors should be corrected asap? ● Not everyone spots the data anomalies. Decision-makers may make costly mistakes on information based on incorrect data from applications that fail to correct for the faulty data. ● If errors are not corrected early on in the process, the cleansing will have to be done for every project that uses that data. ● Data errors may point to defective equipment, such as broken transmission lines and defective sensors. ● Data errors can point to bugs in software or in the integration of software that may be critical to the company. While doing a small project at a bank we discovered that two software applications used different local settings. This caused problems with numbers greater than 1,000. For one app the number 1.000 meant one, and for the other it meant one thousand.
  • 7. Combining Data 1. Joining Tables 2. Appending Tables 3. Creating views
  • 11. 5. Build the Model Building a model is an iterative process. The way you build your model depends on whether you go with classic statistics or the somewhat more recent machine learning school, and the type of technique you want to use. Either way, most models consist of the following main steps: 1 Selection of a modeling technique and variables to enter in the model 2 Execution of the model 3 Diagnosis and model comparison
  • 14. 6 . Presentation and automation— Presenting your results to the stakeholders and industrializing your analysis process for repetitive reuse and integration with other tools.
  • 15. Working with data from files  Working with different Data Types, Different Formats, Different Compression, Different Parsing on Different Systems are very challenging task to prepare data.  Dealing with different formats can become a Tedious Task.  Thus, it is mandatory for any Data Scientist To Be Aware Of Different File Formats, common challenges in handling them and the best / efficient ways to handle this data in real life.
  • 16. What is a file format?
  • 17. Why should a data scientist understand different file formats?  The files will depend on the application you are building.  For example, in an image processing system, you need image files as input and output. Therefore, we will mostly see files in jpeg, gif or png format.  As a data scientist, we need to understand the underlying structure of various file formats, their advantages and dis-advantages.  Choosing the optimal file format for storing data can improve the performance of your models in data processing.
  • 18. Why should a data scientist understand different file formats? Different File Formats.  XLSX  Comma-separated values (CSV)  ZIP  Plain Text (txt)  JSON  XML  HTML  Images  Hierarchical Data Format  PDF  DOCX  MP3  MP4
  • 19. Different file formats and how to read them in Python Comma-separated values (CSV):  Comma-Separated Values (CSV) file format falls under spreadsheet file format.  In spreadsheet file format, data is stored in cells. Each cell has organized in rows and columns.  A column in the spreadsheet file can have different types. For example, a column can be of string type, a date type or an integer type.  Some of the most popular spreadsheet file formats are Comma Separated Values (CSV), Microsoft Excel Spreadsheet (xls) and Microsoft Excel Open XML Spreadsheet (xlsx).  Some files are separated using tab. This file format is known as TSV (Tab Separated Values) file format.
  • 20. Different file formats and how to read them in Python The below image shows a CSV file which is opened in Notepad.
  • 21. Reading the data from CSV in Python  For loading the data, you can use the “pandas” library in python. import pandas as pd pd.read_csv(r'F:IT DEPTWINTER 202210212IT105 - DATA SCIENCE IN PYTHON/addresses.csv')
  • 22. Different file formats and how to read them in Python Read Excel file:  XLSX is a Microsoft Excel Open XML file format. It also comes under the Spreadsheet file format.  It is an XML-based file format created by Microsoft Excel.  In XLSX data is organized under the cells and columns in a sheet.  Each XLSX file may contain one or more sheets. Therefore, a workbook can contain multiple sheets.
  • 23. Different file formats and how to read them in Python Excel File:
  • 24. Different file formats and how to read them in Python Read Excel file: import pandas as pd pd.read_excel(r'C:UsersNITHIDesktopMentees List.xlsx')
  • 25. Different file formats and how to read them in Python Read Excel file:
  • 26. Different file formats and how to read them in Python Read some particular columns: import pandas as pd pd.read_excel(r'C:UsersNITHIDesktopMentees List.xlsx',index_col=0, usecols="A:C")
  • 27. Different file formats and how to read them in Python Read some particular columns:
  • 28. Different file formats and how to read them in Python Read some particular columns: import pandas as pd pd.read_excel(r'C:UsersNITHIDesktopMentees List.xlsx',index_col=0, usecols=[3,5,6)
  • 29. Different file formats and how to read them in Python Read some particular columns:
  • 30. Different file formats and how to read them in Python Read a particular Sheet: import pandas as pd pd.read_excel(r'C:UsersNITHIDesktopMentees List.xlsx',index_col=0, sheet_name=0)
  • 31. Different file formats and how to read them in Python Read a particular Sheet:
  • 32. Different file formats and how to read them in Python Read a particular Sheet: import pandas as pd pd.read_excel(r'C:UsersNITHIDesktopMentees List.xlsx',index_col=0, sheet_name=“Second Year”)
  • 33. Different file formats and how to read them in Python Read a particular Sheet:
  • 34. Different file formats and how to read them in Python Read Microsoft Word file:  XLSX is a Microsoft Word Open file format with extension .docx
  • 35. Different file formats and how to read them in Python Read Microsoft Word file: pip install python-docx
  • 36. Different file formats and how to read them in Python Read Microsoft Word file: from doc import Document document = Document(r'F:IT DEPTWINTER 202210212IT105 - DATA SCIENCE IN PYTHONtest.docx') type(document)
  • 37. Different file formats and how to read them in Python Read Microsoft Word file: document.paragraphs
  • 38. Different file formats and how to read them in Python Read Microsoft Word file: type(document.paragraphs)
  • 39. Different file formats and how to read them in Python Read Microsoft Word file: document.paragraphs[1] document.paragraphs[0]
  • 40. Different file formats and how to read them in Python Read Microsoft Word file: document.paragraphs[0].text document.paragraphs[1].text
  • 41. Different file formats and how to read them in Python Read Microsoft Word file: document.paragraphs[2].text
  • 44. Exploratory Data Analysis  A method used to analyze and summarize data sets.  Data scientists to analyze and investigate data sets and summarize their main characteristics use exploratory Data Analysis (EDA), often employing data visualization methods.  It helps the data scientists to discover patterns, spot anomalies, test a hypothesis, or check assumptions.  EDA is primarily used to provide a better understanding of data set variables and the relationships between them.  It can helps to determine if the statistical techniques you are considering for data analysis are appropriate.
  • 45. Exploratory Data Analysis Why is exploratory data analysis important in data science?  Identify obvious errors, understand patterns, detect outliers or anomalous events, interesting relations among the variables.  To ensure the results that the data scientist produce are valid and applicable to any desired business outcomes and goals.  EDA helps stakeholders by confirming they are asking the right questions.  EDA can help to answer questions about standard deviations, categorical variables, and confidence intervals.  Once EDA is complete and insights are drawn, its features can then be used for more sophisticated data analysis or modeling, including machine learning.
  • 46. Exploratory Data Analysis Exploratory Data Analysis Tools:  Anyone spends a lot of time doing EDA to get a better understanding of data.  EDA can be minimized by using auto visualizations tools such as – 1.Pandas-profiling, 2.Sweetviz, 3.Autoviz 4.D-Tale
  • 47. Exploratory Data Analysis Exploratory Data Analysis Tools:  EDA involves a lot of steps including some statistical tests, visualization of data using different kinds of plots 1. Data Quality Check: Can be done using pandas library functions like describe(), info(), dtypes(), etc. It is used to find several features like its datatypes, duplicate values, missing value, etc. 2. Statistical Test: Some statistical tests like Pearson correlation, Spearman correlation, Kendall test, etc are done to get a correlation between the features. It can be implemented in python using the “stats” library.
  • 48. Exploratory Data Analysis Exploratory Data Analysis Tools: 3. Quantitative Test: Find the spread of numerical features, count of categorical features. It can be implemented in python using the functions of the “pandas” library. 4. Visualization: To get an understanding of the data. Graphical techniques like bar plots, pie charts are used to get an understanding of categorical features, whereas scatter plots, histograms are used for numerical features.
  • 49. Exploratory Data Analysis Tools Pandas-Profiling:  Pandas profiling is an open-source python library that automates the EDA process and creates a detailed report.  Pandas Profiling can be used easily for large datasets as it is blazingly fast and creates reports in a few seconds.  Installation: pip install pandas-profiling
  • 50. Exploratory Data Analysis Tools Pandas-Profiling: #Install the below libraries before importing import pandas as pd from pandas_profiling import ProfileReport #EDA using pandas-profiling profile = ProfileReport(pd.read_excel('Mentees List.xlsx'), explorative=True) #Saving results to a HTML file profile.to_file("output.html")
  • 51. Exploratory Data Analysis Tools Pandas-Profiling Report: The pandas-profiling library generates a report having:  An overview of the dataset  Variable properties  Interaction of variables  Correlation of variables  Missing values  Sample data
  • 52. Exploratory Data Analysis Tools Pandas-Profiling: Report: file:///C:/Users/NITHI/output.html
  • 53. Exploratory Data Analysis Tools Sweetviz:  Sweetviz is an open-source python auto-visualization library that generates a report, exploring the data with the help of high-density plots.  It not only automates the EDA but is also used for comparing datasets and drawing inferences from it.  A comparison of two datasets can be done by treating one as training and the other as testing. Installation: pip install sweetviz
  • 54. Exploratory Data Analysis Tools Sweetviz:  Sweetviz is an open-source python auto-visualization library that generates a report, exploring the data with the help of high-density plots.  It not only automates the EDA but is also used for comparing datasets and drawing inferences from it.  A comparison of two datasets can be done by treating one as training and the other as testing. Installation: pip install sweetviz
  • 55. Exploratory Data Analysis Tools Sweetviz: #Install the below libraries before importing import pandas as pd import sweetviz as sv #EDA using Sweetviz sweet_report = sv.analyze(pd.read_excel(r'C:UsersNITHIDesktopMentees List.xlsx')) #Saving results to HTML file sweet_report.show_html('sweet_report.html')
  • 56. Exploratory Data Analysis Tools Sweetviz Report: The Sweetviz library generates a report having:  An overview of the dataset  Variable properties  Categorical associations  Numerical associations  Most frequent, smallest, largest values for numerical features
  • 57. Exploratory Data Analysis Tools Sweetviz Report: file:///C:/Users/NITHI/sweet_report.html
  • 58. Exploratory Data Analysis Tools Autoviz:  Autoviz is an open-source python auto visualization library that mainly focuses on visualizing the relationship of the data by generating different types of plot.  Installation: pip install autoviz
  • 59. Exploratory Data Analysis Tools Autoviz: #Install the below libraries before importing import pandas as pd from autoviz.AutoViz_Class import AutoViz_Class #EDA using Autoviz autoviz = AutoViz_Class().AutoViz(r'C:UsersNITHIDesktopMentees List.xlsx')
  • 60. Exploratory Data Analysis Tools Autoviz Report: The Autoviz library generates a report having:  An overview of the dataset  Pairwise scatter plot of continuous variables  Distribution of categorical variables  Heatmaps of continuous variables  Average numerical variable by each categorical variable
  • 61. Exploratory Data Analysis Tools Autoviz Report:
  • 62. Exploratory Data Analysis Tools D-Tale:  D-Tale is an open-source python auto-visualization library. It is one of the best auto data-visualization libraries.  D-Tale helps you to get a detailed EDA of the data. It also has a feature of code export for every plot or analysis in the report.  Installation: pip install dtale
  • 63. Exploratory Data Analysis Tools D-Tale:  D-Tale is an open-source python auto-visualization library. It is one of the best auto data-visualization libraries.  D-Tale helps you to get a detailed EDA of the data. It also has a feature of code export for every plot or analysis in the report.  Installation: pip install dtale
  • 64. Exploratory Data Analysis Tools D-Tale: import dtale import pandas as pd dtale.show(pd.read_excel(r'C:UsersNITHIDesktopMentees List.xlsx'))
  • 65. Exploratory Data Analysis Tools D-Tale Report: The dtale library generates a report having:  An overview of the dataset  Custom filters  Correlation, Charts, and Heatmaps  Highlight datatypes, missing values, ranges  Code export
  • 66. Exploratory Data Analysis Tools D-Tale Report:
  • 68. Data Management What is Data Management?  Data management is the practice of collecting, organizing, protecting, and storing an organization’s data so it can be analyzed for business decisions.  As organizations create and consume data at unprecedented rates, data management solutions become essential for making sense of the vast quantities of data.  Today’s leading data management software ensures that reliable, up-to-date data is always used to drive decisions.
  • 69. Data Management Types of Data Management Data management plays several roles in an organization’s data environment, making essential functions easier and less time-intensive. 1. Data preparation is used to clean and transform raw data into the right shape and format for analysis, including making corrections and combining data sets. 2. Data Pipelines enable the automated transfer of data from one system to another. 3. ETLs (Extract, Transform, Load) are built to take the data from one system, transform it, and load it into the organization’s data warehouse.
  • 70. Data Management Types of Data Management (cont...) 4. Data Catalogs - help manage metadata to create a complete picture of the data, providing a summary of its changes, locations, and quality while also making the data easy to find. 5. Data Warehouses are places to consolidate various data sources, contend with the many data types businesses store, and provide a clear route for data analysis. 6. Data Governance defines standards, processes, and policies to maintain data security and integrity.
  • 71. Data Management Types of Data Management (cont...) 7. Data Architecture provides a formal approach for creating and managing data flow. 8. Data Security protects data from unauthorized access and corruption. 9. Data Modeling documents the flow of data through an application or organization.
  • 72. Data Management Why data management is important?  Data management is a crucial first step that leads to add value to our customers and improve our business bottom line.  The effective data management, people across an organization can find and access trusted data for their queries.  Some benefits of an effective data management solution includes: 1) Visibility 2) Reliability 3) Security 4) Scalability
  • 73. Data Management Important of Data Management 1) Visibility –  Increase the visibility of your organization’s data assets.  Easier for people to quickly and confidently find the right data for their analysis. 1) Reliability –  By establishing processes and policies to build the trust in the data being used to make decisions across your organization. 1) Security –  Protects your organization and its employees from data losses, thefts, and breaches with authentication and encryption tools. 1) Scalability –  Allows organizations to effectively scale data and usage occasions with repeatable processes to keep data and metadata up to date.
  • 74. Data Management Data Management Challenges:  Traditional Data Management processes make it difficult to scale capabilities without compromising governance or security.  Modern Data Management software must address several challenges to ensure trusted data can be found. Challenge 1: Increased Data Volumes - Organization to become unaware of what data it has, where the data is, and how to use it. Challenge 2: New Roles for Analytics - Understanding naming conventions, complex data structures, and databases can be a challenge. Challenge 3: Compliance Requirements - Constantly changing compliance requirements make it a challenge to ensure people are using the right data.
  • 75. Data Management Establish Best Data Management: An effective data management strategy. 1. Clearly Identify Your Business Goals 2. Focus on the Quality of Data 3. Allow the Right People to Access the Data 4. Prioritize Data Security
  • 76. Data Cleaning Data Cleaning –  Process of identifying the incorrect, incomplete, inaccurate, irrelevant or missing part of the data.  Modifying, Replacing or Deleting them according to the necessity.  Data Cleaning is considered a foundational element of the basic data science.
  • 77. Data Cleaning Data Cleaning –  Data is the most valuable thing for Analytics and Machine learning.  In computing or Business, data is needed everywhere. When it comes to the real world data, it is not improbable that data may contain incomplete, inconsistent or missing values.  If the data is corrupted then it may hinder the process or provide inaccurate results.
  • 78. Data Cleaning Data Cleaning –  Data is the most valuable thing for Analytics and Machine learning.  In computing or Business, data is needed everywhere. When it comes to the real world data, it is not improbable that data may contain incomplete, inconsistent or missing values.  If the data is corrupted then it may hinder the process or provide inaccurate results.