Exploring Statistics

Describing and defining important concepts in statistics.

Posted by Aitor Gutierrez Valero on Thursday, September 1, 2022

Foundational Concepts


Population, Sample, and Sample Size

A population is a defined set of individuals, and a sample is a randomly selected subset of a population. For example: As of 2022, Spain has about 47M people. A sample of the Spanish population would be a random selection of anywhere from 1 to just under 47M people in Spain.

Sample size determines how accurately statistics calculated from a sample reflect the population. This accuracy is described by two concepts:

Confidence Level -

How sure we are that the true population value lies within the error margin

Error Margin -

How far the sample statistic may lie from the true population value

To elaborate: Suppose a sample gives an average weight for Spaniards of 70 kg with an error margin of 10% and a confidence level of 95%. The error margin means that the true average is estimated to lie between 63 and 77 kg (70 ± 7 kg), and the confidence level means that we are 95% sure that the true average falls within that range.
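The arithmetic behind that range can be sketched in a few lines. This is only an illustration; the function name `margin_interval` and its parameters are made up for this example.

```python
def margin_interval(estimate, margin_percent):
    """Return the (low, high) range implied by a percentage error margin."""
    half_width = estimate * margin_percent / 100
    return estimate - half_width, estimate + half_width

# Average weight of 70 kg with a 10% error margin
low, high = margin_interval(70, 10)
print(low, high)  # 63.0 77.0
```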

Below is a table comparing the error margin and confidence level for different sample sizes of a population of 1 million people.

| Sample Size | Error Margin (%) | Confidence Level (%) |
| --- | --- | --- |
| 31 | 15 | 90 |
| 97 | 10 | 95 |
| 664 | 5 | 99 |
| 16317 | 1 | 99 |
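The table does not state how its values were calculated, but the figures are consistent with Cochran's sample-size formula for a proportion combined with a finite population correction. Below is a minimal sketch assuming that formula and the most conservative proportion of 0.5. The function name `sample_size` and its parameters are illustrative, not taken from any particular library.

```python
import math
from statistics import NormalDist

def sample_size(population, margin, confidence, proportion=0.5):
    """Cochran's formula with a finite population correction.

    margin and confidence are fractions (0.05 for 5%, 0.95 for 95%).
    proportion=0.5 is an assumption: it gives the largest sample size.
    """
    # Two-sided z-score for the requested confidence level
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Sample size needed for an infinitely large population
    n0 = z**2 * proportion * (1 - proportion) / margin**2
    # Shrink it for a finite population, then round up to whole people
    return math.ceil(n0 / (1 + (n0 - 1) / population))

for margin, confidence in [(0.15, 0.90), (0.10, 0.95), (0.05, 0.99), (0.01, 0.99)]:
    print(sample_size(1_000_000, margin, confidence), margin, confidence)
```

Note how tightening the error margin from 5% to 1% at the same confidence level multiplies the required sample size roughly 25 times, since the margin is squared in the denominator.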

Statistics vs. Parameters

Statistics -

The calculated values from a sample

Parameters -

The true values from a population (calculable only if you measure the entire population)
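To make the distinction concrete, here is a small sketch using a made-up population. The weights are randomly generated and hypothetical, not real data, and the seed is fixed only so the run is repeatable.

```python
import random
from statistics import mean

# Hypothetical population: 1,000 generated weights in kg (not real data)
rng = random.Random(42)
population = [rng.gauss(70, 12) for _ in range(1000)]

# Parameter: the true mean, which needs every individual in the population
parameter = mean(population)

# Statistic: the mean of a random sample, used to estimate the parameter
sample = rng.sample(population, 50)
statistic = mean(sample)

print(round(parameter, 1), round(statistic, 1))
```

The two printed values should be close but not identical; the gap between them is what the error margin describes.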

Variable Types used in Statistics (Types of data)

Qualitative (Categorical)

Ordinal -

A variable whose categories have a natural order: Always, frequently, sometimes, etc.

Nominal -

A variable whose categories have no inherent order: Gender, Colour, Country, etc.

Quantitative

Continuous -

Numerical: Values that can take any number within a range, like height and weight

Discrete -

Numerical: Countable values like number of people or flowers

Three Types of Statistical Analysis

Statistical analysis is used to interpret, represent, and extrapolate from data and its various trends.

1. Bias

Defines how error influences the final results.

Random Error -

Uncontrollable error in measurement that varies unpredictably from one reading to the next
For example: A cup's length falls between two of the smallest marks on your ruler, so you round up or down

Systematic Error -

Error in measurement that skews values consistently in one direction
For example: A scale that consistently reads 1 kg too high.
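A quick simulation can show why the two kinds of error behave differently. This is a sketch with made-up numbers: the true weight, the noise level, and the 1 kg offset are all assumptions chosen for illustration.

```python
import random
from statistics import mean

rng = random.Random(0)
true_weight = 70.0  # kg, hypothetical true value

# Random error: each reading is off by a small, unpredictable amount
random_readings = [true_weight + rng.gauss(0, 0.5) for _ in range(10_000)]

# Systematic error: the same noisy readings on a scale that reads 1 kg too high
biased_readings = [reading + 1.0 for reading in random_readings]

print(round(mean(random_readings), 2))  # close to the true weight
print(round(mean(biased_readings), 2))  # close to the true weight plus 1 kg
```

Averaging many readings cancels out most of the random error, but no amount of averaging removes the systematic offset.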

2. Descriptive Statistics

Summarizes a data set by describing its averages (central tendency) and its spread, which gives an idea of how similar or different the values are.

Below is a data set of numbers for which we will calculate descriptive statistics. First are the averages, next is the spread.

-7, 1, 2, 2, 3, 4, 5, 5, 6, 22

| Type | Mean | Median | Mode |
| --- | --- | --- | --- |
| Description | Sum of all of the numbers divided by the number of numbers | Value separating the upper and lower halves | Most frequent value in a dataset |
| Example | (-7 + 1 + 2 + … + 22)/10 | Mean of 3 and 4 | Bimodal |
| Result | 4.3 | 3.5 | 2 and 5 |

| Type | Range | Standard Deviation | Interquartile Range |
| --- | --- | --- | --- |
| Description | Difference between the highest and lowest numbers | Amount of dispersion from the mean | Difference between the 25th (Q1) and 75th (Q3) percentiles |
| Example | 22 - (-7) | Square root of the mean squared deviation from 4.3 | Q1 = 2, Q3 = 5 |
| Result | 29 | 6.84 (population formula) | 3 |
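These values can be reproduced with Python's standard `statistics` module. Two choices here are assumptions made to match the table: `pstdev` (the population standard deviation, which divides by 10 rather than 9) and the `"inclusive"` quantile method. Other quartile conventions give slightly different Q1 and Q3 values for the same data.

```python
from statistics import mean, median, multimode, pstdev, quantiles

data = [-7, 1, 2, 2, 3, 4, 5, 5, 6, 22]

# Quartiles; the "inclusive" method treats the data as the whole population
q1, _, q3 = quantiles(data, n=4, method="inclusive")

summary = {
    "mean": mean(data),
    "median": median(data),
    "modes": multimode(data),          # returns every most-frequent value
    "range": max(data) - min(data),
    "std_dev": round(pstdev(data), 2), # population standard deviation
    "iqr": q3 - q1,
}
print(summary)
```

The outliers -7 and 22 pull the range and standard deviation up sharply, while the median and interquartile range barely notice them.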

3. Inferential Statistics

After exploring data using descriptive statistics, the relationships between variables (e.g. height vs. weight) and their consistency can be determined using inferential statistics. To understand how these relationships are tested, the null hypothesis and alternative hypothesis are defined first, and then statistical tests are performed.

Null Hypothesis

The default assumption based on current knowledge, usually that there is no difference: Population 1 is equivalent to Population 2.

Alternative Hypothesis

The competing claim that would replace the null hypothesis if the data contradict it: Population 1 is different from Population 2.

Statistical Tests

Every statistical test produces:

A test statistic that indicates how closely the data match the null hypothesis

and

A P value: the probability of obtaining a result at least as extreme as the one observed if the null hypothesis is true
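To show what a test statistic and a P value look like in practice, here is a sketch of an exact permutation test on the difference in means. Note that this is a nonparametric test, chosen only because it needs nothing beyond the standard library; it is not one of the parametric tests discussed in this post. The function name `permutation_test` and the two groups of numbers are made up for illustration.

```python
from itertools import combinations
from statistics import mean

def permutation_test(group_a, group_b):
    """Exact two-sided permutation test on the difference in means."""
    pooled = group_a + group_b
    # Test statistic: absolute difference between the two group means
    observed = abs(mean(group_a) - mean(group_b))
    extreme = total = 0
    # Under the null hypothesis the group labels are interchangeable,
    # so every way of relabelling the pooled values is equally likely.
    for idx in combinations(range(len(pooled)), len(group_a)):
        chosen = set(idx)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        if abs(mean(a) - mean(b)) >= observed:
            extreme += 1
        total += 1
    # P value: share of relabellings at least as extreme as the observed one
    return observed, extreme / total

statistic, p_value = permutation_test([1, 2, 3, 4, 5], [6, 7, 8, 9, 10])
print(statistic, round(p_value, 4))
```

With these two clearly separated groups, only 2 of the 252 possible relabellings are as extreme as the observed split, so the P value is small and the null hypothesis looks implausible.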

Every parametric statistical test assumes:

1. Independence of observations:
   Many measurements of one test subject are not independent,
   while measurements of many different test subjects are independent
   
2. Homogeneity of variance:
   That variance within each group is similar

3. Normality of data:
   That quantitative data follows a normal distribution

Choosing a test for parametric data can be done by following the steps in the image below. Credit to Scribbr.