data set

What is a data set?

A data set, sometimes spelled dataset, is a collection of related data that's usually organized in a standardized format. Data sets are used for analytics, business intelligence, artificial intelligence (AI) model training and a variety of other use cases. Data sets can vary significantly in both size and type of data. For example, a data set might contain information about tree species, ocean temperatures, regional sales totals, fruit prices, lottery winners, diseases or just about any other type of data.

Although formats differ from one data set to another, their underlying organization can often be conceptualized as columns and rows, such as those found in spreadsheets or database tables. Each column represents a variable that describes the data, and each row represents a record that contains a related set of variable values. A single value within a data set is sometimes referred to as a datum or data point.

Many data sets are freely available online. They can be used to develop and test applications, train AI models, perform analytics or carry out other projects. For example, the figure below shows the air quality data set from Data.gov, which offers a wide range of free data sets. The air quality data set contains air quality surveillance data for New York City.

Example of a data set: Air quality surveillance data in New York City displayed in Microsoft Excel.

In the figure, the air quality data set is displayed in a Microsoft Excel spreadsheet. However, the data originated as a comma-separated values (CSV) file downloaded from Data.gov. The data set includes columns such as Unique ID, Geo Place Name and Time Period, which are three of the data set's variables.

The data set also includes rows for each air quality measurement, specific to a place and time. That is, each row is a record of a specific air quality measurement. The record is made up of a set of related values, with each value corresponding to a column, i.e., variable. For example, the value in the Start_Date column for the first record is 12/1/2010.
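The column-and-row model described above can be sketched in a few lines of Python. This is a minimal illustration, not the Data.gov tooling: the header names are an assumption (borrowed from the tag names in the XML sample later in this article), and the single data row is the record with unique ID 172653 quoted later in the article.

```python
import csv
import io

# Header names are assumed for illustration; the real CSV header may differ.
# The data row is the record with unique ID 172653 quoted in this article.
CSV_TEXT = (
    "unique_id,indicator_id,name,measure,measure_info,geo_type_name,"
    "geo_join_id,geo_place_name,time_period,start_date,data_value\n"
    "172653,375,Nitrogen dioxide (NO2),Mean,ppb,UHF34,203,"
    "Bedford Stuyvesant – Crown Heights,Annual Average 2011,12/01/2010,25.3\n"
)


def load_records(text):
    """Return one dict per row, keyed by column (variable) name."""
    return list(csv.DictReader(io.StringIO(text)))


records = load_records(CSV_TEXT)
variables = list(records[0].keys())  # the columns, i.e. the variables

print(variables)
print(records[0]["geo_place_name"], records[0]["data_value"])
```

Each dictionary plays the role of a record, and each key plays the role of a variable. Note that `csv.DictReader` returns every value as a string, so numeric columns would need to be converted before any analysis.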

Data set vs. database

The term data set is sometimes confused with the term database, but the two have different meanings. A database is used to store and manage data. It is typically run by a database management system (DBMS) that includes features for securing, accessing, updating and otherwise working with and protecting data. A data set is simply a file or other structure that contains the data values in a specific format. A database might contain the data from one or more data sets, but the two are not the same.
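The distinction can be illustrated with a small sketch that loads a data set into a database. It uses Python's built-in `sqlite3` module with an in-memory database; the table name `air_quality`, the three-column layout and the second row are hypothetical and exist only for illustration. The first row reuses values from the record with unique ID 172653 quoted later in this article.

```python
import sqlite3

# A tiny data set: plain values with no management features of their own.
# The first row reuses values from the article; the second is an invented placeholder.
rows = [
    (172653, "Bedford Stuyvesant – Crown Heights", 25.3),
    (999999, "Example Place", 10.0),
]

# The database adds storage, querying and updating on top of those values.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE air_quality ("
    "unique_id INTEGER PRIMARY KEY, geo_place_name TEXT, data_value REAL)"
)
conn.executemany("INSERT INTO air_quality VALUES (?, ?, ?)", rows)
conn.commit()

# Update one record in place, then query with a filter.
conn.execute(
    "UPDATE air_quality SET data_value = ? WHERE unique_id = ?", (11.0, 999999)
)
result = conn.execute(
    "SELECT unique_id, data_value FROM air_quality "
    "WHERE data_value > ? ORDER BY unique_id",
    (20,),
).fetchall()
conn.close()

print(result)
```

The list of tuples is the data set; the SQLite table, with its schema, primary key and query engine, is the database that holds it.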

Data set formats

Data sets are available in a variety of formats, such as JavaScript Object Notation (JSON) and Extensible Markup Language (XML). Such formats provide a standardized structure for sharing data across multiple platforms and applications. The data itself is usually written in plain text, so it can be easily filtered, updated and otherwise transformed to meet specific requirements.

Some data sets are available in more than one format. For example, the air quality data set shown above can be downloaded from Data.gov as a CSV, JSON, XML or Resource Description Framework (RDF) file. When a data set is available in multiple formats, the expectation is that each file contains the same set of records, with each record formatted according to the applicable standard.

A good way to demonstrate how this works is to look at the same air quality record in each of the four formats. For instance, one of the records has a unique ID value of 172653, which distinguishes that record from all other records. The following four samples show the record in each format:

CSV record:

172653,375,Nitrogen dioxide (NO2),Mean,ppb,UHF34,203,Bedford Stuyvesant – Crown Heights,Annual Average 2011,12/01/2010,25.3

JSON record:

[ "row-frzi_7bar_4cbg", "00000000-0000-0000-AF08-C339B5581012", 0, 1698955938, null, 1698955938, null, "{ }", "172653", "375", "Nitrogen dioxide (NO2)", "Mean", "ppb", "UHF34", "203", "Bedford Stuyvesant – Crown Heights", "Annual Average 2011", "2010-12-01T00:00:00", "25.30", null ]

XML record:

<row _id="row-frzi_7bar_4cbg" _uuid="00000000-0000-0000-AF08-C339B5581012" _position="0" _address="https://data.cityofnewyork.us/resource/c3uy-2p5r/172653"><unique_id>172653</unique_id><indicator_id>375</indicator_id><name>Nitrogen dioxide (NO2)</name><measure>Mean</measure><measure_info>ppb</measure_info><geo_type_name>UHF34</geo_type_name><geo_join_id>203</geo_join_id><geo_place_name>Bedford Stuyvesant – Crown Heights</geo_place_name><time_period>Annual Average 2011</time_period><start_date>2010-12-01T00:00:00</start_date><data_value>25.30</data_value></row>

RDF record:

<rdf:Description rdf:about="https://data.cityofnewyork.us/resource/c3uy-2p5r/172653">
    <socrata:rowID>row-frzi_7bar_4cbg</socrata:rowID>
    <rdfs:member rdf:resource="https://data.cityofnewyork.us/resource/c3uy-2p5r"/>
    <ds:unique_id>172653</ds:unique_id>
    <ds:indicator_id>375</ds:indicator_id>
    <ds:name>Nitrogen dioxide (NO2)</ds:name>
    <ds:measure>Mean</ds:measure>
    <ds:measure_info>ppb</ds:measure_info>
    <ds:geo_type_name>UHF34</ds:geo_type_name>
    <ds:geo_join_id>203</ds:geo_join_id>
    <ds:geo_place_name>Bedford Stuyvesant – Crown Heights</ds:geo_place_name>
    <ds:time_period>Annual Average 2011</ds:time_period>
    <ds:start_date>2010-12-01T00:00:00</ds:start_date>
    <ds:data_value>25.30</ds:data_value>
</rdf:Description>

Each format provides the same core information but does so in a way different from the others. When a data set is available in multiple formats, data scientists and other users can choose whichever format best meets their needs for a specific project or environment. Because the formats are standardized, users can load the data into a system that supports the format, making it relatively simple to view, modify and manipulate data from multiple sources.
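One way to check that two formats carry the same core information is to parse each one and compare the results after normalizing the values. The sketch below does this for the CSV and XML samples using only the Python standard library. It rests on two assumptions: the CSV columns appear in the same order as the XML elements, and the XML row is reproduced here without its attributes for brevity. The function names are illustrative.

```python
import csv
import io
import xml.etree.ElementTree as ET
from datetime import datetime

# Assumed column order for the CSV sample, taken from the XML element order.
FIELDS = [
    "unique_id", "indicator_id", "name", "measure", "measure_info",
    "geo_type_name", "geo_join_id", "geo_place_name", "time_period",
    "start_date", "data_value",
]

CSV_LINE = (
    "172653,375,Nitrogen dioxide (NO2),Mean,ppb,UHF34,203,"
    "Bedford Stuyvesant – Crown Heights,Annual Average 2011,12/01/2010,25.3"
)

# The XML sample from the article, with the <row> attributes omitted.
XML_ROW = (
    "<row><unique_id>172653</unique_id><indicator_id>375</indicator_id>"
    "<name>Nitrogen dioxide (NO2)</name><measure>Mean</measure>"
    "<measure_info>ppb</measure_info><geo_type_name>UHF34</geo_type_name>"
    "<geo_join_id>203</geo_join_id>"
    "<geo_place_name>Bedford Stuyvesant – Crown Heights</geo_place_name>"
    "<time_period>Annual Average 2011</time_period>"
    "<start_date>2010-12-01T00:00:00</start_date>"
    "<data_value>25.30</data_value></row>"
)


def normalize(raw, date_format=None):
    """Convert the date and measurement strings to comparable Python types."""
    record = dict(raw)
    record["data_value"] = float(record["data_value"])
    if date_format is None:
        record["start_date"] = datetime.fromisoformat(record["start_date"])
    else:
        record["start_date"] = datetime.strptime(record["start_date"], date_format)
    return record


def from_csv(line):
    values = next(csv.reader(io.StringIO(line)))
    return normalize(dict(zip(FIELDS, values)), date_format="%m/%d/%Y")


def from_xml(text):
    row = ET.fromstring(text)
    return normalize({child.tag: child.text for child in row})


csv_record = from_csv(CSV_LINE)
xml_record = from_xml(XML_ROW)
print(csv_record == xml_record)
```

The raw strings differ between formats (for example, 12/01/2010 versus 2010-12-01T00:00:00, and 25.3 versus 25.30), which is why the comparison is made only after the date and the measurement are converted to typed values.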

Types of data sets

Data sets can be categorized in different ways. One common approach, which is often used in statistics, is to break them down into the following categories:

  • Numerical. All the values within the data set are numerical. Numerical data sets are used for a variety of analytics, ranging from customer sales to weather station readings. This type of data set is also called quantitative.
  • Bivariate. The data set contains two variables that express a relationship between the data. For example, a data set might include a temperature variable and a time variable. Together the variables provide insight into how temperature fluctuations are related to the time of day.
  • Multivariate. This type of data set contains three or more variables that are somehow related. For example, a data set might include variables that describe a product's color, size, weight and other characteristics. Multivariate data sets often define complex relationships between the data.
  • Categorical. A categorical data set divides the data into distinct groups based on the specific qualities of people or objects. There are two types of categorical data: dichotomous and polytomous. Dichotomous data contains only two values, such as true and false. Polytomous data can contain more than two values, although still a limited number, such as hair colors or shirt sizes.
  • Correlation. This data set contains variables that are statistically related to one another. For instance, the variables in a data set related to ice cream sales might show a correlation between the outside temperature and amount of sales. Correlations can be positive (variables move in the same direction), negative (variables move in opposite directions) or zero (variables show no linear relationship).
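The three kinds of correlation can be illustrated with the Pearson correlation coefficient, a common measure of linear correlation that ranges from -1 to 1. The sketch below implements it directly so that each step is visible. All of the sample values are invented for illustration and do not come from a real data set.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return covariance / math.sqrt(spread_x * spread_y)


# Hypothetical bivariate samples (invented values).
temperature_c = [18, 21, 24, 27, 30]
ice_cream_sales = [120, 150, 180, 210, 240]  # rises with temperature
heater_sales = [240, 210, 180, 150, 120]     # falls as temperature rises

positive = pearson_r(temperature_c, ice_cream_sales)
negative = pearson_r(temperature_c, heater_sales)

# y = x squared: clearly related to x, yet the linear correlation is zero.
zero = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])

print(positive, negative, zero)
```

The last sample is a reminder that a correlation of zero rules out only a linear relationship; the two variables can still depend on each other in a nonlinear way.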

The term data set originated with IBM, where its meaning was similar to that of file. In an IBM mainframe operating system, a data set is a named group of records that contains individual data units formatted in an IBM-prescribed way and accessed by a specific access method based on the data set format. Format types include sequential, relative sequential, indexed sequential and partitioned. Access methods include the Virtual Storage Access Method (VSAM) and the Indexed Sequential Access Method (ISAM).

A data set is also an older and now deprecated term for a modem.

Working with numerical data

Numerical data within a data set is often characterized by specific measures that are used in statistics and analytics to describe the properties of a statistical distribution. Such a distribution reflects the set of possible values within the target data. The most common measures include the following:

  • Mean. The mean is the average of all the values in the data set, determined by adding the values together and then dividing by the total number of values.
  • Median. This is the data set's middle value, based on the values being sorted in ascending or descending order. If the data set contains an even number of values, the median value is determined by finding the mean of the two middle numbers.
  • Mode. The mode is the value that occurs most often. A data set can contain multiple modes if two or more values tie for the highest frequency, such as a data set that includes three instances of 5 and three instances of 6 and no value that occurs more often. If there is only one instance of each value, the data set is said to include no modes.
  • Range. The difference between the minimum value and maximum value in the data set is the range.
  • Minimum. This represents the lowest value in the data set.
  • Maximum. This is the highest value in the data set.
  • Sum. The total of all values in the data set is the sum.
  • Count. Count represents the number of values in the data set.

To better understand how these measures work, consider the following numerical data set:

{2,4,4,6,8,10,13,14,16,18,20,22}

This is a very small numerical data set that contains 12 values, with only one value repeated. All of the values are integers. When the measures are applied to the data, they return the following properties:

  • Mean = 11.417.
  • Median = 11.5.
  • Mode = 4 (two instances).
  • Range = 20.
  • Minimum = 2.
  • Maximum = 22.
  • Sum = 137.
  • Count = 12.

If the data set had contained another pair of duplicate numbers, such as two instances of 10, there would have been two modes: 4 and 10. However, if there had been three instances of 4 and only two instances of 10, 4 would have been the only mode.
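The measures above can be reproduced with a short Python sketch that uses the standard `statistics` and `collections` modules. The `modes` and `describe` helpers are illustrative names, not part of any library, and the sketch assumes a non-empty list of numbers. The `modes` helper follows the convention used in this article: it returns an empty list when no value repeats.

```python
import statistics
from collections import Counter

values = [2, 4, 4, 6, 8, 10, 13, 14, 16, 18, 20, 22]


def modes(data):
    """Return all values tied for the highest frequency, or [] if none repeats."""
    counts = Counter(data)
    top = max(counts.values())
    if top == 1:
        return []
    return sorted(value for value, count in counts.items() if count == top)


def describe(data):
    """Compute the summary measures discussed above for a non-empty list."""
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "modes": modes(data),
        "range": max(data) - min(data),
        "minimum": min(data),
        "maximum": max(data),
        "sum": sum(data),
        "count": len(data),
    }


summary = describe(values)
print(round(summary["mean"], 3))     # 11.417
print(summary["median"])             # 11.5
print(summary["modes"])              # [4]
print(summary["range"], summary["sum"], summary["count"])  # 20 137 12

# The two scenarios from the paragraph above.
print(modes(values + [10]))          # [4, 10]
print(modes(values + [4, 10]))       # [4]
```

Adding a second 10 produces two modes, while adding both a third 4 and a second 10 leaves 4 as the only mode, which matches the reasoning in the text.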

This was last updated in April 2024