Foundational Concepts

Population, Sample, and Sample Size

A population is a defined set of individuals, and a sample is a randomly selected subset of that population. For example, as of 2022 Spain has about 47 million people. A sample of the Spanish population would be a random selection of anywhere from one person to just under 47 million people in Spain.

Sample size determines how precise any value calculated from a sample is. This precision is described by two concepts:

Confidence Level -

How certain we are that the true population value lies within the error margin of our estimate

Error Margin -

How far the value calculated from the sample may lie from the true population value

To elaborate: suppose a sample tells us that the average weight of all Spaniards is 70 kg, with an error margin of 10% and a confidence level of 95%. The error margin means that the true average is estimated to lie somewhere between 63 and 77 kg, and the confidence level means that we are 95% sure that it does.
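The arithmetic behind the 63 to 77 kg range can be sketched in a few lines. The function name interval_from_margin is hypothetical and only illustrates the calculation; it treats the error margin as a percentage of the estimate, as the example above does.

```python
def interval_from_margin(estimate, margin_pct):
    """Return the (low, high) range implied by a percentage error margin."""
    half_width = estimate * margin_pct / 100  # 10% of 70 kg is 7 kg
    return estimate - half_width, estimate + half_width


low, high = interval_from_margin(70, 10)
print(low, high)  # 63.0 77.0
```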

Below is a table comparing the error margin and confidence level for different sample sizes of a population of 1 million people.

Sample Size	Error Margin (%)	Confidence Level (%)
31	15	90
97	10	95
664	5	99
16317	1	99
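The notes do not say how the table was produced. As a sketch, one common approach is Cochran's formula with a finite population correction, assuming the most conservative proportion of 0.5; under those assumptions it gives the same sample sizes as the table. The function name sample_size and its parameters are illustrative only.

```python
import math
from statistics import NormalDist


def sample_size(population, margin_pct, confidence_pct, proportion=0.5):
    """Estimate the sample size needed for a given error margin and confidence level.

    Assumption: Cochran's formula with a finite population correction,
    using proportion=0.5 as the worst case.
    """
    # Two-sided z-score, e.g. about 1.96 for a 95% confidence level
    z = NormalDist().inv_cdf(0.5 + confidence_pct / 200)
    e = margin_pct / 100
    # Sample size for an effectively infinite population
    n0 = z ** 2 * proportion * (1 - proportion) / e ** 2
    # Correct for the finite population, then round up to a whole person
    return math.ceil(n0 / (1 + (n0 - 1) / population))


for margin, confidence in [(15, 90), (10, 95), (5, 99), (1, 99)]:
    print(sample_size(1_000_000, margin, confidence), margin, confidence)
```

Smaller error margins and higher confidence levels both push the required sample size up, which is why the last row needs far more people than the first.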

Statistics vs. Parameters

Statistics -

The calculated values from a sample

Parameters -

The true values from a population (Calculable if your sample is the population)

Variable Types used in Statistics (Types of data)

Qualitative (Categorical)

Ordinal -

Categories with a natural order: always, frequently, sometimes, etc.

Nominal -

Categories with no natural order: gender, colour, country, etc.

Quantitative

Continuous -

Numerical: can take any value within a range, like height and weight

Discrete -

Numerical: Countable values like number of people or flowers

Three Types of Statistical Analysis

Statistical analysis is used to interpret, represent, and extrapolate data and its trends.

1. Bias

Describes how measurement error influences the final results.

Random Error -

Uncontrollable error in measurement that varies randomly
For example: A cup's length falls between two of the smallest markings on your ruler, so you round up or down.

Systematic Error -

Error in measurement that skews values consistently
For example: A scale consistently reads 1 kg too high.
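The difference between the two kinds of error shows up when many measurements are averaged. The simulation below is a sketch with made-up numbers: a true weight of 70 kg, random noise with a standard deviation of 0.5 kg, and a scale offset of 1 kg, all of which are assumptions chosen for illustration.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the run is repeatable
true_weight = 70.0      # hypothetical true value in kg

# Random error only: each reading is off by a different, unpredictable amount
random_only = [true_weight + rng.gauss(0, 0.5) for _ in range(10_000)]

# Systematic error added: every reading is also shifted up by the same 1 kg
with_offset = [true_weight + 1.0 + rng.gauss(0, 0.5) for _ in range(10_000)]

random_bias = mean(random_only) - true_weight      # close to 0 kg
systematic_bias = mean(with_offset) - true_weight  # close to 1 kg
print(round(random_bias, 2), round(systematic_bias, 2))
```

Averaging many readings cancels out most of the random error, but the systematic error survives no matter how many readings are taken.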

2. Descriptive Statistics

Summarises a data set by describing its averages (central tendency) and its spread.

Below is a data set for which we will calculate descriptive statistics: first the averages, then the spread.

-7, 1, 2, 2, 3, 4, 5, 5, 6, 22

Type	Mean	Median	Mode
Description	Sum of all of the numbers divided by the number of numbers	Value separating the upper and lower halves	Most frequent value in a dataset
Example	(-7 + 1 + 2 + … + 22)/10	Mean of 3 and 4	Bimodal
Result	4.3	3.5	2 and 5

Type	Range	Standard Deviation	Interquartile Range
Description	Difference between the highest and lowest numbers	Amount of dispersion from the mean (population form, dividing by the number of values)	Difference between the 25th (Q1) and 75th (Q3) percentiles
Example	22 - (-7)	√(Σ(x - 4.3)² / 10)	Q3 - Q1, with Q1 = 2 and Q3 = 5
Result	29	6.84	3
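The values in both tables can be checked with Python's standard statistics module. This is a sketch rather than the method used to build the tables: it assumes the population standard deviation (pstdev) and takes Q1 and Q3 as the medians of the lower and upper halves of the sorted data, which are the conventions that match the results shown.

```python
import statistics

data = [-7, 1, 2, 2, 3, 4, 5, 5, 6, 22]

# Averages
mean_value = statistics.mean(data)         # 4.3
median_value = statistics.median(data)     # 3.5, the mean of 3 and 4
mode_values = statistics.multimode(data)   # [2, 5], so the data is bimodal

# Spread
value_range = max(data) - min(data)        # 29
std_dev = statistics.pstdev(data)          # about 6.84 (population form)

# Quartiles as the medians of the lower and upper halves of the sorted data
ordered = sorted(data)
half = len(ordered) // 2
q1 = statistics.median(ordered[:half])     # 2
q3 = statistics.median(ordered[-half:])    # 5
iqr = q3 - q1                              # 3

print(mean_value, median_value, mode_values)
print(value_range, round(std_dev, 2), iqr)
```

Other quartile conventions exist and can give slightly different values for Q1 and Q3 on small data sets.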

3. Inferential Statistics

After exploring data using descriptive statistics, the relationships between variables (e.g. height vs. weight) and their consistency can be determined using inferential statistics. To test these relationships, a null hypothesis and an alternative hypothesis are defined first, and statistical tests are then performed.

Null Hypothesis

The default assumption based on current knowledge, usually that there is no difference or effect: Population 1 is equivalent to Population 2.

Alternative Hypothesis

A new idea that contradicts the null hypothesis: Population 1 is different from Population 2.

Statistical Tests

Every statistical test produces:

A test statistic that indicates how closely the data match the null hypothesis

and

A P value: the probability of obtaining a result at least as extreme as the one observed if the null hypothesis is true
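To make the two outputs concrete, here is a sketch of a permutation test. It is a non-parametric test rather than one of the parametric tests discussed in these notes, chosen because it needs only the standard library and computes the P value directly from its definition. The two groups of weights are made-up data, and the function name permutation_test is hypothetical.

```python
import random


def permutation_test(group_a, group_b, n_resamples=10_000, seed=0):
    """Return (test statistic, P value) for a difference in means.

    Null hypothesis: both groups come from the same population, so the
    group labels are interchangeable.
    """
    rng = random.Random(seed)
    size_a = len(group_a)

    def mean_difference(values):
        first, second = values[:size_a], values[size_a:]
        return sum(second) / len(second) - sum(first) / len(first)

    pooled = list(group_a) + list(group_b)
    observed = mean_difference(pooled)  # the test statistic

    # Count how often randomly relabelled data is at least as extreme
    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        if abs(mean_difference(pooled)) >= abs(observed):
            extreme += 1

    # Add one to both counts so the estimated P value is never exactly zero
    p_value = (extreme + 1) / (n_resamples + 1)
    return observed, p_value


# Hypothetical weights in kg for two groups
group_a = [68, 70, 72, 71, 69, 73, 70, 72]
group_b = [75, 78, 77, 80, 76, 79, 81, 77]

statistic, p_value = permutation_test(group_a, group_b)
print(statistic, p_value)
```

A small P value means that a difference this large would rarely appear if the two groups really came from the same population, which counts as evidence against the null hypothesis.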

Every parametric statistical test assumes:

1. Independence of observations:
   Many measurements of one test subject are not independent,
   while measurements of many different test subjects are independent
   
2. Homogeneity of variance:
   That variance within each group is similar

3. Normality of data:
   That quantitative data follows a normal distribution

Choosing a test for parametric data can be done by following the steps in the image below. Credit to Scribbr.

Exploring Statistics

Describing and defining important concepts in statistics.