+
Basics in Statistics
Dr Maher Alaraj
+
Data Mining/Analytics Process
+
Statistics
 Definition
 The practice or science of collecting and analyzing numerical data in large
quantities, especially for the purpose of inferring proportions in a whole from
those in a representative sample.
 Descriptive statistics
 Describe the basic features of data in a study
 Provide summaries about the sample and measures
 Inferential statistics
 Investigate questions, models, and hypotheses
 Infer population characteristics based on sample
 Make judgments about what we observe
+
Some Concepts
Variable - any characteristic of an individual or entity. A variable can take
different values for different individuals. Variables can be categorical or
quantitative.
• Nominal - Categorical variables with no inherent order or ranking sequence such as
names or classes (e.g., gender). Values may be written as numerals but carry no numerical meaning (e.g.,
I, II, III). The only operation that can be applied to Nominal variables is enumeration (counting).
• Ordinal - Variables with an inherent rank or order, e.g. mild, moderate, severe. Can be
compared for equality, or greater or less, but not how much greater or less.
• Interval - Values of the variable are ordered as in Ordinal, and additionally, differences
between values are meaningful; however, the scale has no absolute zero point. Calendar
dates and temperatures on the Fahrenheit scale are examples. Addition and subtraction, but
not multiplication and division, are meaningful operations.
• Ratio - Variables with all properties of Interval plus an absolute, non-arbitrary zero
point, e.g. age, weight, temperature (Kelvin). Addition, subtraction, multiplication, and division
are all meaningful operations.
+
Sampling
 What is your population of interest?
 To whom do you want to generalize your results?
 All students (18 and over)
 Undergraduates only
 Arts students
 Athletes
 Other
 Can you sample the entire population?
 A sample is “a smaller (but hopefully representative) collection of
units from a population used to determine truths about that
population” (Field, 2005)
 Why sample?
 Resources (time, money) and workload
 Gives results with known accuracy that can be calculated mathematically
+
Descriptive Statistics
 Descriptive statistics are used to summarize or condense a group of
scores
 They include measures of central tendency and measures of
variability
 Mode
 Median
 Mean
 Range
 Variance
 Standard Deviation
+
Central Tendency
 Measures of central tendency describe the average or common score
of a group of scores
 Common measures of central tendency include the mean, median,
and mode
+
Mean
 The mean is the arithmetic average of the scores
 The calculation of the mean considers both the number of scores and
their value
 The formula for the mean of the variable X, where n is the number of scores, is:
$$\bar{X} = \frac{\sum X}{n}$$
+
Mean
 Six men with high serum cholesterol participated in a study to examine
the effects of diet on cholesterol
 At the beginning of the study, their serum cholesterol levels (mg/dL)
were:
366, 327, 274, 292, 274, 230
 What is the mean?
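As a quick check on the question above, the mean can be computed straight from its definition (sum of the scores divided by the number of scores). This is a minimal Python sketch; the variable names are illustrative and not part of the original slides.

```python
# Serum cholesterol levels (mg/dL) listed on the slide.
cholesterol = [366, 327, 274, 292, 274, 230]

# Mean = sum of the scores divided by the number of scores.
mean_cholesterol = sum(cholesterol) / len(cholesterol)

print(round(mean_cholesterol, 1))  # 293.8
```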
+
Median
 The median is the middle point in an ordered distribution at which an
equal number of scores lie on each side of it
 It is also known as the 50th percentile (P50), or 2nd quartile (Q2)
 The position of the median (Mdn) in the ordered scores, where n is the number of scores, can be calculated as follows:
$$\text{Position of Mdn} = \frac{n + 1}{2}$$
+
Median
 Example: Calculate the median for the following measurements for height:
71”, 73”, 74”, 75”, 72”
 Step One: Arrange the scores in order: 71”, 72”, 73”, 74”, 75”
 Step Two: Calculate the position of the median using the following formula:
$$\frac{n + 1}{2} = \frac{5 + 1}{2} = 3$$
 Step Three: Determine the value of the median by counting from either the
highest or the lowest score until the desired score is reached (in this case the
3rd score, 73”)
+
Median
 Suppose that in our previous distribution we had a sixth score as
follows:
71”, 72”, 73”, 74”, 74”, 75”
 What are the position and value of the median?
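The two height examples can be checked with a short Python sketch. It assumes the usual position rule (n + 1) / 2 for the median of n ordered scores; the helper `median_position` and the variable names are hypothetical and only serve this illustration.

```python
import statistics


def median_position(n):
    """Position of the median in an ordered list of n scores: (n + 1) / 2."""
    return (n + 1) / 2


heights_five = [71, 73, 74, 75, 72]      # odd number of scores
heights_six = [71, 72, 73, 74, 74, 75]   # even number of scores

# statistics.median sorts the data itself and, for an even number of
# scores, averages the two middle values.
print(median_position(len(heights_five)), statistics.median(heights_five))  # 3.0 73
print(median_position(len(heights_six)), statistics.median(heights_six))    # 3.5 73.5
```

A position of 3.5 means the median lies halfway between the 3rd and 4th ordered scores.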
+
Median
 Consider the following example: Nine people each perform 40 sit-ups,
and one does 1,000
 The median score for the group is 40, and the mean (arithmetic average)
is 136
 The median would still be 40 even if the highest score were 2,000
instead of 1,000
 What can you learn from this?
The Median is Unaffected by Extreme Scores
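The sit-up example can be reproduced in a few lines of Python. This is only a sketch with illustrative variable names; it compares the original group with one where the extreme score is doubled.

```python
import statistics

# Nine people perform 40 sit-ups each and one performs 1,000.
situps = [40] * 9 + [1000]
# Same group, but the extreme score is 2,000 instead of 1,000.
situps_extreme = [40] * 9 + [2000]

median_a, mean_a = statistics.median(situps), statistics.mean(situps)
median_b, mean_b = statistics.median(situps_extreme), statistics.mean(situps_extreme)

print(median_a, mean_a)
print(median_b, mean_b)
```

The mean moves from 136 to 236, while the median stays at 40 in both cases.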
+
Mode
 The mode is the most frequently occurring score
 Which of the following scores is the mode?
3, 7, 3, 9, 9, 3, 5, 1, 8, 5
 Similarly, for another data set (2, 4, 9, 6, 4, 6, 6, 2, 8, 2), there are two
modes. What are they?
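Both questions can be checked with Python's standard library. The sketch below uses `statistics.multimode`, which returns every value tied for the highest frequency; the variable names are illustrative.

```python
import statistics

scores_a = [3, 7, 3, 9, 9, 3, 5, 1, 8, 5]
scores_b = [2, 4, 9, 6, 4, 6, 6, 2, 8, 2]

# multimode lists all most-frequent values in order of first appearance.
modes_a = statistics.multimode(scores_a)
modes_b = statistics.multimode(scores_b)

print(modes_a)  # [3]
print(modes_b)  # [2, 6]
```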
+
Mode
 A distribution with a single mode is said to be unimodal
 A distribution with more than one mode is said to be bimodal,
trimodal, etc., or in general, multimodal
+
Variability
 Measures of variability describe the extent of similarity or difference in
a set of scores
 These measures include the range, standard deviation, and
variance
+
Standard Deviation (SD)
 Standard Deviation – a measure of the variability, or spread, of a set
of scores around the mean
 Intuitively, the sum of the differences between each score and the mean
(known as deviation scores) appears to be a good approach for
measuring variability around the mean
+
SD
 Symbolically, we can write this as
$$\sum (X - M)$$
 Let’s use the scores 1, 2, 6, 6, and 15, where
M = 6
+
SD
Now let’s calculate the sum of the deviation scores:
= (1-6) + (2-6) + (6-6) + (6-6) + (15-6)
= (-5) + (-4) + (0) + (0) + (9)
= -9 + 9 = 0
+
SD
 We can avoid this problem (deviation scores sum to 0) by
squaring each deviation score before summing them
 This would be written symbolically as
$$\sum (X - M)^2$$
 Substituting our X scores again,
= (1-6)^2 + (2-6)^2 + (6-6)^2 + (6-6)^2 + (15-6)^2
= (-5)^2 + (-4)^2 + (0)^2 + (0)^2 + (9)^2
= 25 + 16 + 0 + 0 + 81
= 122
+
SD
 We then divide this value by n-1 to arrive at the mean squared
deviation
122/4 = 30.5
 We then take the square root of this value to bring the units
back to the raw score units
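The whole calculation for the scores 1, 2, 6, 6, and 15 can be traced step by step in Python. This is a minimal sketch with illustrative variable names; the final comparison with `statistics.stdev` is a cross-check, since that function also divides by n - 1.

```python
import math
import statistics

scores = [1, 2, 6, 6, 15]
m = sum(scores) / len(scores)                      # mean, M = 6.0

deviations = [x - m for x in scores]               # deviation scores
sum_of_deviations = sum(deviations)                # sums to 0
sum_of_squares = sum(d ** 2 for d in deviations)   # 122.0

variance = sum_of_squares / (len(scores) - 1)      # 122 / 4 = 30.5
sd = math.sqrt(variance)                           # back to raw score units

print(sum_of_deviations, sum_of_squares, variance, round(sd, 2))
print(math.isclose(sd, statistics.stdev(scores)))
```

The square root of 30.5 is about 5.52, so the standard deviation of these five scores is roughly 5.52.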
+
Variance
 The variance is the square of the standard deviation
 It is used most commonly with more advanced statistical procedures
such as regression analysis, analysis of variance (ANOVA), and the
determination of the reliability of a test
 The variance shows how far, on average, the values in the data set lie from
the mean. Here is how it is defined:
$$s^2 = \frac{\sum (x - \bar{X})^2}{n - 1}$$
+
Example calculation of variance and standard deviation on strength scores.
Subj    Score (x)    Deviation (x - X̄)    Squared deviation (x - X̄)^2
1       216           22.7                  515.29
2       144          -49.3                 2430.49
3       183          -10.3                  106.09
4       138          -55.3                 3058.09
5       212           18.7                  349.69
6       180          -13.3                  176.89
7       200            6.7                   44.89
8       264           70.7                 4998.49
9       203            9.7                   94.09
Sum     1740           0                  11774.01
$$\bar{X} = \frac{\sum x}{n} = \frac{1740}{9} = 193.3$$
$$s^2 = \frac{\sum (x - \bar{X})^2}{n - 1} = \frac{11774.01}{8} = 1471.75$$
$$s = \sqrt{\frac{\sum (x - \bar{X})^2}{n - 1}} = \sqrt{1471.75} = 38.4$$
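The strength-score example can be verified with Python's `statistics` module, whose `variance` and `stdev` functions use the same n - 1 divisor. This is a sketch with illustrative variable names. Because it works with the unrounded mean, the sum of squares comes out as 11774 rather than the slide's 11774.01, which was built from deviations rounded to one decimal.

```python
import statistics

# Strength scores for the nine subjects in the table.
strength = [216, 144, 183, 138, 212, 180, 200, 264, 203]

mean_strength = statistics.mean(strength)
variance_strength = statistics.variance(strength)   # sample variance (n - 1)
sd_strength = statistics.stdev(strength)            # sample standard deviation

print(round(mean_strength, 1), round(variance_strength, 2), round(sd_strength, 1))
```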
+
Range
 The range of a set of data is the difference between the highest
and lowest values in the set. To find the range, first order the
data from least to greatest. Then subtract the smallest value
from the largest value in the set.
Example: In {4, 6, 9, 3, 7} the lowest value is 3, and the highest is 9.
So the range is 9-3 = 6.
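The same example in Python, as a minimal sketch (variable names are illustrative). Sorting is not strictly needed in code, because `max` and `min` find the extremes directly.

```python
data = [4, 6, 9, 3, 7]

# Range = highest value minus lowest value.
data_range = max(data) - min(data)

print(data_range)  # 6
```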
+
Quantiles
 A quantile is one of a class of values of a variable that divide
the total frequency of a sample or population into a
given number of equal proportions
 Examples:
 Percentile
 Decile
 Quartile
 Quintile
+
Quantiles
 The 100-quantiles are called percentiles.
 The 10-quantiles are called deciles.
 The 5-quantiles are called quintiles.
 The 4-quantiles are called quartiles.
+
Percentiles
 The kth percentile is the scale value of a data series equal to the
k/100 quantile
 The 1st percentile cuts off the lowest 1% of data
 The 98th percentile cuts off the lowest 98% of data
 The 25th percentile is the first quartile
 The 50th percentile is the median
+
Deciles
 Each of ten equal groups into which a population can be
divided according to the distribution of values of a particular
variable.
 Each group represents 1/10 of the total population
 The 1st decile cuts off the lowest 10% of data
 The 9th decile cuts off the lowest 90% of data
+
Quartiles
 The quartiles divide the distribution into four
equal parts
 called fourths
 The total of 100% is broken into four equal parts, with cut points at 25%,
50%, and 75%.
 Lower Quartile (Q1) is the 25th percentile. (0.25)
 Median (second quartile, Q2) is the 50th percentile. (0.50)
 Upper Quartile is the 75th percentile. (0.75)
+
Quintiles
 Any of five equal groups into which a population can be divided
according to the distribution of values of a particular variable.
 Each group represents 20% or 1/5 of the population
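Quartiles, quintiles, and deciles are all cut points of the same kind, which a short Python sketch can show. It reuses the nine strength scores from the variance example and calls `statistics.quantiles`, where `n` is the number of equal groups. The variable names are illustrative. Note that the function's default interpolation rule is only one of several conventions, so other tools may return slightly different cut points for small samples.

```python
import statistics

scores = [216, 144, 183, 138, 212, 180, 200, 264, 203]

# n equal groups are separated by n - 1 cut points.
quartiles = statistics.quantiles(scores, n=4)    # 3 cut points
quintiles = statistics.quantiles(scores, n=5)    # 4 cut points
deciles = statistics.quantiles(scores, n=10)     # 9 cut points

print(quartiles)
print(quintiles)
print(deciles)

# The second quartile is the median (50th percentile).
print(quartiles[1] == statistics.median(scores))
```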
+
Box Plot
 A visual tool that illustrates the distribution of a
univariate dataset.
 It illustrates the median, the upper and lower
quartiles, whiskers (drawn at the upper and lower deciles
in some variants), and any outliers.
 Using R
 boxplot(dataset)
 quantile(dataset)
Box Plot
Practice
 One hundred randomly selected students were asked the number of movies they watched
the previous week. The results are as follows:
# of Movies    Frequency
0              20
1              36
2              24
3              16
4              4
 Find the sample mean, median, and range of the sample.
 Find the standard deviation and the variance.
 Construct a bar plot of the data.
 Find the first quartile.
 Find the second quartile. To which value does it correspond?
 Find the third quartile.
 Construct a box plot of the data.
 What percent of the students saw fewer than three movies?
 Find the 40th percentile.
 Find the 90th percentile.
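One way to check answers to the numerical parts of this exercise is to expand the frequency table into 100 individual observations and apply Python's `statistics` module. This is a sketch only: the variable names are illustrative, the variance and standard deviation use the sample (n - 1) definition, and quantile values can differ slightly between interpolation conventions, although for this data set the cut points fall between equal values.

```python
import statistics

# Frequency table: number of movies -> number of students.
frequencies = {0: 20, 1: 36, 2: 24, 3: 16, 4: 4}

# One observation per student (100 values, already in ascending order).
movies = [value for value, count in frequencies.items() for _ in range(count)]

mean_movies = statistics.mean(movies)
median_movies = statistics.median(movies)
range_movies = max(movies) - min(movies)
variance_movies = statistics.variance(movies)   # sample variance (n - 1)
sd_movies = statistics.stdev(movies)

q1, q2, q3 = statistics.quantiles(movies, n=4)
percentiles = statistics.quantiles(movies, n=100)  # percentiles[k - 1] is the kth percentile
p40, p90 = percentiles[39], percentiles[89]

# Percentage of students who saw fewer than three movies.
share_fewer_than_three = 100 * sum(1 for m in movies if m < 3) / len(movies)

print(mean_movies, median_movies, range_movies)
print(round(variance_movies, 3), round(sd_movies, 3))
print(q1, q2, q3, p40, p90, share_fewer_than_three)
```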
[Figure: Histogram for films frequencies. x-axis: # of movies (0 to 4); y-axis: Frequency (0 to 40); bar heights 20, 36, 24, 16, 4]
