Population and Sampling
- Population → Everyone. This is the entire group you want to learn about.
- Sample → Subset of the population. It should be representative of the entire population → You will use the sample to make inferences about the population!
- Statistic → Numerical value derived from a sample of data
Sampling Overview and Caveats
- To pick a representative sample, we have to be careful; otherwise we cannot make valid inferences about our population.
- For example, if I pick only male NBA players to find out the average height of an American… I’d get an average height of around 6.5 ft! (Whereas realistically it’s around 5’8”.)
- So what are some things we need to keep in mind? Sampling techniques (Independent vs Dependent Sampling, SRS, Stratified Sampling, Cluster Sampling, etc.)
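A minimal sketch of the NBA example above, using only Python's standard library. The heights are synthetic (drawn from a made-up normal distribution, not real survey data), and taking the tallest 500 people is just a stand-in for "only NBA players".

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Synthetic population of heights in inches (assumed numbers, not real data).
population = [rng.gauss(68, 4) for _ in range(100_000)]

# Biased sample: only the 500 tallest people, a stand-in for "NBA players only".
biased_sample = sorted(population)[-500:]

# Simple random sample: every person has the same chance of being picked.
random_sample = rng.sample(population, 500)

population_mean = statistics.mean(population)
biased_mean = statistics.mean(biased_sample)
random_mean = statistics.mean(random_sample)

print(f"population mean:    {population_mean:.1f} in")
print(f"biased sample mean: {biased_mean:.1f} in")
print(f"random sample mean: {random_mean:.1f} in")
```

The random sample's mean should land close to the population mean, while the biased sample's mean sits far above it, even though both samples have the same size.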
Independent vs Dependent Sampling
- Independent → Selecting one unit DOES NOT affect the selection of any other unit.
- Independent → Foundation of Simple Random Sampling, where every unit in the population has an equal chance of being selected and this probability remains constant.
- Independent → Choose this if your groups are distinct and not related. For instance, comparing performance between two separate groups: one trained with method A and another with method B.
- Dependent → Sampling where members are related in some way, or where one selection affects the next → For example, picking 1 person without replacement means you cannot pick them again, so the probability of picking someone else is now different. It’s different PER observation you pull out!
- Dependent → Useful for use cases such as pre/post test scores of the same individuals, or measurements from matched pairs like siblings. Pairing helps us account for variability between individuals (less randomness).
- Dependent → Choose this if you're examining changes within the same group or comparing groups with matched pairs. For instance, studying the impact of a training program by measuring a group's performance before and after the program.
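A small sketch of the two ideas in this list, assuming Python's standard library only. The names and test scores are hypothetical, made up purely for illustration. The first part contrasts draws that leave the pool unchanged with draws that shrink it; the second part shows a paired (before/after) comparison on the same individuals.

```python
import random
import statistics
from fractions import Fraction

rng = random.Random(1)  # fixed seed so the sketch is repeatable
people = ["Ana", "Ben", "Cy", "Dee", "Eli"]  # hypothetical names

# Independent draws (with replacement): the pool never changes,
# so each person's chance stays 1/5 on every draw and repeats are possible.
with_replacement = rng.choices(people, k=3)

# Dependent draws (without replacement): a picked person leaves the pool,
# so nobody can appear twice.
without_replacement = rng.sample(people, k=3)

# Chance that one specific remaining person is picked on draws 1, 2, 3
# when sampling without replacement: the pool shrinks by one each time.
chances = [Fraction(1, len(people) - i) for i in range(3)]

print("with replacement:   ", with_replacement)
print("without replacement:", without_replacement)
print("per-draw chances:   ", [str(c) for c in chances])

# Paired (dependent) design: the same five people measured twice.
before = [62, 70, 55, 81, 67]  # made-up scores before training
after = [68, 74, 59, 83, 75]   # made-up scores after training

# Work with per-person differences so person-to-person variability cancels out.
differences = [a - b for a, b in zip(after, before)]
mean_change = statistics.mean(differences)

print("per-person change:", differences)
print("mean change:", mean_change)
```

For two unrelated groups (method A vs method B), there is no natural pairing, so you would compare the two group means directly instead of computing per-person differences.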
Identically Distributed Samples
- Make sure the samples are IID → Whatever rule you used to pick the first observation should also be used to pick the second one, and so on.
- If you only go to a town that has taller or shorter people than average, then your sample won’t be a good representation of the population.
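A rough sketch of the town example, again with synthetic heights (the two towns and their averages are assumptions for illustration). One sample uses the same rule for every draw, picking uniformly from the whole population; the other draws every observation from a single unusually tall town.

```python
import random
import statistics

rng = random.Random(2)  # fixed seed so the sketch is repeatable

# Two hypothetical towns with different typical heights (inches, made up).
tall_town = [rng.gauss(71, 3) for _ in range(50_000)]
short_town = [rng.gauss(65, 3) for _ in range(50_000)]
everyone = tall_town + short_town

# Same rule for every draw: pick uniformly from the whole population.
iid_sample = rng.choices(everyone, k=1_000)

# Every draw comes from the tall town only, so the sample follows
# that town's distribution rather than the population's.
one_town_sample = rng.choices(tall_town, k=1_000)

overall_mean = statistics.mean(everyone)
iid_mean = statistics.mean(iid_sample)
one_town_mean = statistics.mean(one_town_sample)

print(f"population mean:      {overall_mean:.1f} in")
print(f"whole-population IID: {iid_mean:.1f} in")
print(f"tall town only:       {one_town_mean:.1f} in")
```

The single-town sample is internally consistent but describes the wrong distribution, which is why its mean overshoots the population mean.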
When we say that samples are "independent and identically distributed" (often abbreviated as i.i.d.), we're making two specific assumptions: