Sampling of data in python

Nội dung chính Show

A ready-to-run code with different data sampling techniques to create a random and representative sample in Python
Introduction to Population and Sample
Random Sampling
Random Sampling: Python Implementation
Systematic Sampling
Systematic Sampling: Python Implementation
Cluster Sampling
Cluster Sampling: Python Implementation
Weighted Sampling
Weighted Sampling: Python Implementation
Stratified Sampling
Stratified Sampling: Python Implementation
If you liked this article, here are some other articles you may enjoy:
How do you use sampling in Python?
What is sampling the data?
How do you create a sample dataset in Python?
How do you take a sample of a DataFrame in Python?

A ready-to-run code with different data sampling techniques to create a random and representative sample in Python

Image Source:Pexels/christian-diokno

Data Sampling forms the essential part of the majority of research, scientific and data experiments. It is one of the most important factors which determines the accuracy of your research or survey result. If your sample has not been accurately sampled then this might impact significantly the final results and conclusions. There are many sampling techniques that can be used to gather a data sample depending upon the need and situation. In this blog post, I will cover the following data sampling techniques:

- Terminology: Population and Sampling 
- Random Sampling
- Systematic Sampling
- Cluster Sampling
- Weighted Sampling
- Stratified Sampling

Introduction to Population and Sample

To start with, let’s have a look at some basic terminology. It is important to learn the concepts of population and sample. Thepopulation is the set of all observations (individuals, objects, events, or procedures) and is usually very large and diverse, whereas a sample is a subset of observations from the population that ideally is a true representation of the population.

Image Source: The Author

Given that experimenting with an entire population is either impossible or simply too expensive, researchers or analysts use samples rather than the entire population in their experiments or trials. To make sure that the experimental results are reliable and hold for the entire population, the sample needs to be a true representation of the population. That is, the sample needs to be unbiased.

Random Sampling

The simplest data sampling technique that creates a random sample from the original population is Random Sampling. In this approach, every sampled observation has the same probability of getting selected during the sample generation process. Random Sampling is usually used when we don’t have any kind of prior information about the target population.

For example random selection of 3 individuals from a population of 10 individuals. Here, each individual has an equal chance of getting selected to the sample with a probability of selection of 1/10.

Image Source: The Author

Random Sampling: Python Implementation

First, we generate random data that will serve as population data. We will, therefore, randomly sample 10K data points from Normal distribution with mean mu = 10 and standard deviation std = 2. After this, we create a Python function called random_sampling() that takes population data and desired sample size and produces as output a random sample.

Systematic Sampling

Systematic sampling is defined as a probability sampling approach where the elements from a target population are selected from a random starting point and after a fixed sampling interval.

Stated differently, systematic sampling is an extended version of probability sampling techniques in which each member of the group is selected at regular periods to form a sample. We calculate the sampling interval by dividing the entire population size by the desired sample size.

Note that, Systematic Sampling usually produces a random sample but is not addressing the bias in the created sample.

Image Source: The Author

Systematic Sampling: Python Implementation

We generate data that serve as population data as in the previous case. We then create a Python function called systematic_sample() that takes population data and interval for the sampling and produces as output a systematic sample.

Note that, Systematic Sampling usually produces a random sample but is not addressing the bias in the created sample.

Cluster Sampling

Cluster sampling is a probability sampling technique where we divide the population into multiple clusters(groups) based on certain clustering criteria. Then we select a random cluster(s) with simple random or systematic sampling techniques. So, in cluster sampling, the entire population is divided into clusters or segments and then cluster(s) are randomly selected.

For example, if you want to conduct an experience evaluating the performance of sophomores in business education across Europe. It is impossible to conduct an experiment that involves a student in every university across the EU. Instead, by using Cluster Sampling, we can group the universities from each country into one cluster. These clusters then define all the sophomore student population in the EU. Next, you can use simple random sampling or systematic sampling and randomly select cluster(s) for the purposes of your research study.

Note that, Systematic Sampling usually produces a random sample but is not addressing the bias in the created sample.

Image Source: The Author

Cluster Sampling: Python Implementation

First, we generate data that will serve as population data with 10K observations, and this data consists of the following 4 variables:

Price: generated using Uniform distribution,
Id
event_type: which is a categorical variable with 3 possible values {type1, type2, type3}
click: binary variable taking values {0: no click, 1: click}

         id     price event_type  click
0        0  1.767794      type2      0
1        1  2.974360      type2      0
2        2  2.903518      type2      0
3        3  3.699454      type2      1
4        4  1.416739      type1      0
...    ...       ...        ...    ...
9995  9995  3.689656      type2      1
9996  9996  1.929186      type3      0
9997  9997  2.393509      type3      1
9998  9998  1.276473      type2      1
9999  9999  3.959585      type2      1[10000 rows x 4 columns]

Then the function get_clustered_Sample() takes as inputs the original data, the amount of observations per cluster, and a number of clusters you want to select, and produces as output a clustered sample.

        id     price   event_type  click  cluster
4847  4847  3.813680      type3      0       17
567    567  1.642347      type2      0       17
8982  8982  3.741744      type3      1       17
2321  2321  2.192724      type3      0       17
5045  5045  3.645671      type2      0       17
...    ...       ...        ...    ...      ...
5681  5681  3.175308      type1      0       90
882    882  2.676477      type2      1       90
2090  2090  3.861775      type3      1       90
907    907  1.947100      type3      0       90
2723  2723  2.557626      type1      0       90[200 rows x 5 columns]

Note that, Cluster Sampling usually produces a random sample but is not addressing the bias in the created sample.

Weighted Sampling

In some experiments, you might need items sampling probabilities to be according to weights associated with each item, that’s when the proportions of the type of observations should be taken into account. For example, you might need a sample of queries in a search engine with weight as a number of times these queries have been performed so that the sample can be analyzed for overall impact on the user experience. In this case, Weighted Sampling is much more preferred compared to Random Sampling or Systematic Sampling.

Weighted Sampling is a data sampling method with weights, that intends to compensate for the selection of specific observations with unequal probabilities (oversampling), non-coverage, non-responses, and other types of bias. If a biased data set is not adjusted and a simple random sampling type of approach is used instead, then the population descriptors (e.g., mean, median) will be skewed and they will fail to correctly represent the population’s proportion to the population.

Weighted Sampling addresses the bias in the sample, by creating a sample that takes into account the proportions of the type of observations in the population. Hence, Weighted Sampling usually produces a random and unbiased sample.

Image Source: The Author

Weighted Sampling: Python Implementation

The function get_weighted_sample() takes as inputs the original data, and the desired sample size, and produces as output a weighted sample. Note that, the proportions, in this case, are defined based on the click event. That is, we compute the proportion of data points that had click events of 1 (let’s say X%) and 0 (Y%, where Y% = 100-X%), then we generate a random sample such that, the sample will also contain X% observations with click = 1 and Y% observations with click = 0.

                id     price    event_type  click
event_type                                     
type1      0   6780  1.200188      type1      1
           1   8830  2.990630      type1      1
           2   8997  3.483728      type1      0
           3   7541  2.402993      type1      1
           4   4460  2.959203      type1      0
...             ...       ...        ...    ...
type3      29  5058  3.426289      type3      1
           30  5855  3.852197      type3      0
           31  6295  2.679898      type3      0
           32  8978  1.115072      type3      1
           33  7730  1.208441      type3      1[100 rows x 4 columns]

Weighted Sampling usually produces a random and unbiased sample.

Stratified Sampling

Stratified Sampling is a data sampling approach, where we divide a population into homogeneous subpopulations called strata based on specific characteristics (e.g., age, race, gender identity, location, event type etc.).

Every member of the population studied should be in exactly one stratum. Each stratum is then sampled using Cluster Sampling, allowing data scientists to estimate statistical measures for each sub-population. We rely on Stratified Sampling when the populations’ characteristics are diverse and we want to ensure that every characteristic is properly represented in the sample.

So, Stratified Sampling, is simply, the combination of Clustered Sampling and Weighted Sampling.

Image Source: The Author

Stratified Sampling: Python Implementation

The function get_stratified_sample() takes as inputs the original data, the desired sample size, the number of clusters needed, and it produces as output a stratified sample. Note that, this function, firstly performs weighted sampling using the click event. Secondly, it performs clustered sampling using the event_type.

       id     price event_type  click  cluster
0    5131  2.707995      type1      0       45
1    5102  1.677190      type1      0       45
2    7370  1.893061      type1      0       45
3    4207  2.491246      type1      0       45
4    8909  3.252655      type1      1       45
..    ...       ...        ...    ...      ...
96   3254  2.637625      type3      0       85
97   1555  1.196040      type3      1       85
98   7627  3.240507      type3      1       85
99   6405  1.607379      type3      0       85
100  1075  2.471806      type3      0       85[202 rows x 5 columns]

Stratified Sampling, is basically, the combination of Clustered Sampling and Weighted Sampling.

If you liked this article, here are some other articles you may enjoy:

Thanks for the read

I encourage you to join Medium today to havecomplete access to all of the great locked content published across Medium and on my feed where I publish about various Data Science, Machine Learning, and Deep Learning topics.

Follow me up on Mediumto read more articles about various Data Science and Data Analytics topics. For more hands-on applications of Machine Learning, Mathematical and Statistical concepts check out my Githubaccount.
I welcome feedback and can be reached out on LinkedIn.

Happy learning!

How do you use sampling in Python?

Stratified Sampling: Python Implementation The function get_stratified_sample() takes as inputs the original data, the desired sample size, the number of clusters needed, and it produces as output a stratified sample. Note that, this function, firstly performs weighted sampling using the click event.

What is sampling the data?

In data analysis, sampling is the practice of analyzing a subset of all data in order to uncover the meaningful information in the larger data set.

How do you create a sample dataset in Python?

Enter Data Manually in Editor Window. The first step is to load pandas package and use DataFrame function. ... .

Read Data from Clipboard. ... .

Entering Data into Python like SAS. ... .

Prepare Data using sequence of numeric and character values. ... .

Generate Random Data. ... .

Create Categorical Variables. ... .

Import CSV or Excel File..

How do you take a sample of a DataFrame in Python?

4 Ways to Randomly Select Rows from Pandas DataFrame.

(1) Randomly select a single row: df = df.sample().

(2) Randomly select a specified number of rows. ... .

(3) Allow a random selection of the same row more than once (by setting replace=True): df = df.sample(n=3,replace=True).

Sampling of data in python

A ready-to-run code with different data sampling techniques to create a random and representative sample in Python

Introduction to Population and Sample

Random Sampling

Random Sampling: Python Implementation

Systematic Sampling

Systematic Sampling: Python Implementation

Cluster Sampling

Cluster Sampling: Python Implementation

Weighted Sampling

Weighted Sampling: Python Implementation

Stratified Sampling

Stratified Sampling: Python Implementation

If you liked this article, here are some other articles you may enjoy:

How do you use sampling in Python?

What is sampling the data?

How do you create a sample dataset in Python?

How do you take a sample of a DataFrame in Python?

Bài Viết Liên Quan

Quảng Cáo

Có thể bạn quan tâm

Toplist được quan tâm

Quảng cáo

Xem Nhiều

Quảng cáo

Chúng tôi

Điều khoản

Trợ giúp

Mạng xã hội