Sampling of data in python

A ready-to-run code with different data sampling techniques to create a random and representative sample in Python

Sampling of data in python

Image Source:Pexels/christian-diokno

Data Sampling forms the essential part of the majority of research, scientific and data experiments. It is one of the most important factors which determines the accuracy of your research or survey result. If your sample has not been accurately sampled then this might impact significantly the final results and conclusions. There are many sampling techniques that can be used to gather a data sample depending upon the need and situation. In this blog post, I will cover the following data sampling techniques:

- Terminology: Population and Sampling 
- Random Sampling
- Systematic Sampling
- Cluster Sampling
- Weighted Sampling
- Stratified Sampling

Introduction to Population and Sample

To start with, let’s have a look at some basic terminology. It is important to learn the concepts of population and sample. Thepopulation is the set of all observations (individuals, objects, events, or procedures) and is usually very large and diverse, whereas a sample is a subset of observations from the population that ideally is a true representation of the population.

Sampling of data in python

Image Source: The Author

Given that experimenting with an entire population is either impossible or simply too expensive, researchers or analysts use samples rather than the entire population in their experiments or trials. To make sure that the experimental results are reliable and hold for the entire population, the sample needs to be a true representation of the population. That is, the sample needs to be unbiased.

Random Sampling

The simplest data sampling technique that creates a random sample from the original population is Random Sampling. In this approach, every sampled observation has the same probability of getting selected during the sample generation process. Random Sampling is usually used when we don’t have any kind of prior information about the target population.

For example random selection of 3 individuals from a population of 10 individuals. Here, each individual has an equal chance of getting selected to the sample with a probability of selection of 1/10.

Sampling of data in python

Image Source: The Author

Random Sampling: Python Implementation

First, we generate random data that will serve as population data. We will, therefore, randomly sample 10K data points from Normal distribution with mean mu = 10 and standard deviation std = 2. After this, we create a Python function called random_sampling() that takes population data and desired sample size and produces as output a random sample.

Systematic Sampling

Systematic sampling is defined as a probability sampling approach where the elements from a target population are selected from a random starting point and after a fixed sampling interval.

Stated differently, systematic sampling is an extended version of probability sampling techniques in which each member of the group is selected at regular periods to form a sample. We calculate the sampling interval by dividing the entire population size by the desired sample size.

Note that, Systematic Sampling usually produces a random sample but is not addressing the bias in the created sample.

Sampling of data in python

Image Source: The Author

Systematic Sampling: Python Implementation

We generate data that serve as population data as in the previous case. We then create a Python function called systematic_sample() that takes population data and interval for the sampling and produces as output a systematic sample.

Note that, Systematic Sampling usually produces a random sample but is not addressing the bias in the created sample.

Cluster Sampling

Cluster sampling is a probability sampling technique where we divide the population into multiple clusters(groups) based on certain clustering criteria. Then we select a random cluster(s) with simple random or systematic sampling techniques. So, in cluster sampling, the entire population is divided into clusters or segments and then cluster(s) are randomly selected.

For example, if you want to conduct an experience evaluating the performance of sophomores in business education across Europe. It is impossible to conduct an experiment that involves a student in every university across the EU. Instead, by using Cluster Sampling, we can group the universities from each country into one cluster. These clusters then define all the sophomore student population in the EU. Next, you can use simple random sampling or systematic sampling and randomly select cluster(s) for the purposes of your research study.

Note that, Systematic Sampling usually produces a random sample but is not addressing the bias in the created sample.

Sampling of data in python

Image Source: The Author

Cluster Sampling: Python Implementation

First, we generate data that will serve as population data with 10K observations, and this data consists of the following 4 variables:

  • Price: generated using Uniform distribution,
  • Id
  • event_type: which is a categorical variable with 3 possible values {type1, type2, type3}
  • click: binary variable taking values {0: no click, 1: click}
         id     price event_type  click
0 0 1.767794 type2 0
1 1 2.974360 type2 0
2 2 2.903518 type2 0
3 3 3.699454 type2 1
4 4 1.416739 type1 0
... ... ... ... ...
9995 9995 3.689656 type2 1
9996 9996 1.929186 type3 0
9997 9997 2.393509 type3 1
9998 9998 1.276473 type2 1
9999 9999 3.959585 type2 1
[10000 rows x 4 columns]

Then the function get_clustered_Sample() takes as inputs the original data, the amount of observations per cluster, and a number of clusters you want to select, and produces as output a clustered sample.

        id     price   event_type  click  cluster
4847 4847 3.813680 type3 0 17
567 567 1.642347 type2 0 17
8982 8982 3.741744 type3 1 17
2321 2321 2.192724 type3 0 17
5045 5045 3.645671 type2 0 17
... ... ... ... ... ...
5681 5681 3.175308 type1 0 90
882 882 2.676477 type2 1 90
2090 2090 3.861775 type3 1 90
907 907 1.947100 type3 0 90
2723 2723 2.557626 type1 0 90
[200 rows x 5 columns]

Note that, Cluster Sampling usually produces a random sample but is not addressing the bias in the created sample.

Weighted Sampling

In some experiments, you might need items sampling probabilities to be according to weights associated with each item, that’s when the proportions of the type of observations should be taken into account. For example, you might need a sample of queries in a search engine with weight as a number of times these queries have been performed so that the sample can be analyzed for overall impact on the user experience. In this case, Weighted Sampling is much more preferred compared to Random Sampling or Systematic Sampling.

Weighted Sampling is a data sampling method with weights, that intends to compensate for the selection of specific observations with unequal probabilities (oversampling), non-coverage, non-responses, and other types of bias. If a biased data set is not adjusted and a simple random sampling type of approach is used instead, then the population descriptors (e.g., mean, median) will be skewed and they will fail to correctly represent the population’s proportion to the population.

Weighted Sampling addresses the bias in the sample, by creating a sample that takes into account the proportions of the type of observations in the population. Hence, Weighted Sampling usually produces a random and unbiased sample.

Sampling of data in python

Image Source: The Author

Then the function get_clustered_Sample() takes as inputs the original data, the amount of observations per cluster, and a number of clusters you want to select, and produces as output a clustered sample.

Weighted Sampling: Python Implementation

The function get_weighted_sample() takes as inputs the original data, and the desired sample size, and produces as output a weighted sample. Note that, the proportions, in this case, are defined based on the click event. That is, we compute the proportion of data points that had click events of 1 (let’s say X%) and 0 (Y%, where Y% = 100-X%), then we generate a random sample such that, the sample will also contain X% observations with click = 1 and Y% observations with click = 0.

                id     price    event_type  click
event_type
type1 0 6780 1.200188 type1 1
1 8830 2.990630 type1 1
2 8997 3.483728 type1 0
3 7541 2.402993 type1 1
4 4460 2.959203 type1 0
... ... ... ... ...
type3 29 5058 3.426289 type3 1
30 5855 3.852197 type3 0
31 6295 2.679898 type3 0
32 8978 1.115072 type3 1
33 7730 1.208441 type3 1
[100 rows x 4 columns]

Weighted Sampling usually produces a random and unbiased sample.

Stratified Sampling

Stratified Sampling is a data sampling approach, where we divide a population into homogeneous subpopulations called strata based on specific characteristics (e.g., age, race, gender identity, location, event type etc.).

Every member of the population studied should be in exactly one stratum. Each stratum is then sampled using Cluster Sampling, allowing data scientists to estimate statistical measures for each sub-population. We rely on Stratified Sampling when the populations’ characteristics are diverse and we want to ensure that every characteristic is properly represented in the sample.

So, Stratified Sampling, is simply, the combination of Clustered Sampling and Weighted Sampling.

Sampling of data in python

Image Source: The Author

Stratified Sampling: Python Implementation

The function get_stratified_sample() takes as inputs the original data, the desired sample size, the number of clusters needed, and it produces as output a stratified sample. Note that, this function, firstly performs weighted sampling using the click event. Secondly, it performs clustered sampling using the event_type.

       id     price event_type  click  cluster
0 5131 2.707995 type1 0 45
1 5102 1.677190 type1 0 45
2 7370 1.893061 type1 0 45
3 4207 2.491246 type1 0 45
4 8909 3.252655 type1 1 45
.. ... ... ... ... ...
96 3254 2.637625 type3 0 85
97 1555 1.196040 type3 1 85
98 7627 3.240507 type3 1 85
99 6405 1.607379 type3 0 85
100 1075 2.471806 type3 0 85
[202 rows x 5 columns]

Stratified Sampling, is basically, the combination of Clustered Sampling and Weighted Sampling.

If you liked this article, here are some other articles you may enjoy:

Thanks for the read

I encourage you to join Medium today to havecomplete access to all of the great locked content published across Medium and on my feed where I publish about various Data Science, Machine Learning, and Deep Learning topics.

Follow me up on Mediumto read more articles about various Data Science and Data Analytics topics. For more hands-on applications of Machine Learning, Mathematical and Statistical concepts check out my Githubaccount.
I welcome feedback and can be reached out on LinkedIn.

Happy learning!

How do you use sampling in Python?

Stratified Sampling: Python Implementation The function get_stratified_sample() takes as inputs the original data, the desired sample size, the number of clusters needed, and it produces as output a stratified sample. Note that, this function, firstly performs weighted sampling using the click event.

What is sampling the data?

In data analysis, sampling is the practice of analyzing a subset of all data in order to uncover the meaningful information in the larger data set.

How do you create a sample dataset in Python?

Enter Data Manually in Editor Window. The first step is to load pandas package and use DataFrame function. ... .
Read Data from Clipboard. ... .
Entering Data into Python like SAS. ... .
Prepare Data using sequence of numeric and character values. ... .
Generate Random Data. ... .
Create Categorical Variables. ... .
Import CSV or Excel File..

How do you take a sample of a DataFrame in Python?

4 Ways to Randomly Select Rows from Pandas DataFrame.
(1) Randomly select a single row: df = df.sample().
(2) Randomly select a specified number of rows. ... .
(3) Allow a random selection of the same row more than once (by setting replace=True): df = df.sample(n=3,replace=True).