Sampling of data in Python
Data sampling is an essential part of most research, scientific, and data experiments. It is one of the most important factors determining the accuracy of a research or survey result: if the sample has not been drawn properly, the final results and conclusions can be significantly affected. There are many sampling techniques that can be used to gather a data sample, depending on the need and the situation. In this blog post, I will cover the following data sampling techniques:

- Terminology: Population and Sampling
- Random Sampling
- Systematic Sampling
- Cluster Sampling
- Weighted Sampling
- Stratified Sampling

Introduction to Population and Sample

To start with, let's have a look at some basic terminology. It is important to learn the concepts of population and sample. The population is the set of all observations (individuals, objects, events, or procedures) and is usually very large and diverse, whereas a sample is a subset of observations from the population that ideally is a true representation of the population.

[Figure omitted. Image source: the author]

Given that experimenting with an entire population is either impossible or simply too expensive, researchers and analysts use samples rather than the entire population in their experiments or trials. To make sure that the experimental results are reliable and hold for the entire population, the sample needs to be a true representation of the population. That is, the sample needs to be unbiased.

Random Sampling

The simplest data sampling technique that creates a random sample from the original population is Random Sampling. In this approach, every observation in the population has the same probability of being selected during the sample generation process. Random Sampling is usually used when we don't have any prior information about the target population.

For example, consider the random selection of 3 individuals from a population of 10 individuals. Each individual has an equal chance of being selected: a probability of 1/10 on any single draw, and a probability of 3/10 of ending up in the sample of 3 when drawing without replacement.
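The 3-out-of-10 example can be sketched in a few lines with NumPy. This is only an illustration: the labels 1..10, the seed, and the number of trials are assumptions made for the sketch. It draws one sample without replacement and then checks empirically how often a given individual ends up in a sample of 3 (1/10 per single draw, which gives 3/10 for the whole sample).

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # seed chosen only for reproducibility

# Population of 10 individuals, labelled 1..10 (hypothetical labels)
population = np.arange(1, 11)

# Simple random sample of 3 individuals, drawn without replacement
sample = rng.choice(population, size=3, replace=False)
print(sample)

# Empirical check: how often does individual 1 end up in a sample of 3?
trials = 20_000
hits = sum(1 in rng.choice(population, size=3, replace=False) for _ in range(trials))
inclusion_rate = hits / trials
print(inclusion_rate)  # should be close to 3/10
```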
Random Sampling: Python Implementation

First, we generate random data that will serve as the population data. We draw 10K data points from a Normal distribution with mean mu = 10 and standard deviation std = 2. After this, we create a Python function called random_sampling() that takes the population data and the desired sample size and returns a random sample.

Systematic Sampling

Systematic sampling is a probability sampling approach in which the elements of a target population are selected starting from a random point and then at a fixed sampling interval. Stated differently, after the random start, every k-th member of the population is selected to form the sample, where k is the sampling interval. We calculate the sampling interval by dividing the population size by the desired sample size. Note that Systematic Sampling usually produces a random sample but does not address bias in the created sample.

[Figure omitted. Image source: the author]

Systematic Sampling: Python Implementation

We generate the population data as in the previous case. We then create a Python function called systematic_sample() that takes the population data and the sampling interval and returns a systematic sample.
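The original code for these two functions is not reproduced here, so the following is a minimal sketch of what they might look like, not the author's exact implementation. The function names, the population size, mu, and std come from the text; the seed, the sample size of 500, the use of NumPy's Generator API, and the random start offset inside one interval are assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seed is an assumption, for reproducibility

# Population: 10K draws from a Normal distribution with mu = 10 and std = 2
N = 10_000
mu, std = 10, 2
population = rng.normal(loc=mu, scale=std, size=N)


def random_sampling(population, n, rng=rng):
    """Simple random sample of size n, drawn without replacement."""
    return rng.choice(population, size=n, replace=False)


def systematic_sample(population, step, rng=rng):
    """Take every `step`-th element, starting from a random offset in [0, step)."""
    start = int(rng.integers(0, step))
    return population[start::step]


n = 500  # desired sample size (assumed for this example)
random_sample = random_sampling(population, n)

step = N // n  # sampling interval = population size / desired sample size
systematic = systematic_sample(population, step)

print(random_sample.mean(), systematic.mean())
```

Both sample means should land near the population mean of 10. Keep in mind that the systematic version depends on the order of the population array: if the data has a periodic pattern that lines up with the interval, the sample can be biased.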
Cluster Sampling

Cluster sampling is a probability sampling technique in which we divide the population into multiple clusters (groups) based on certain clustering criteria. We then select one or more clusters at random, using simple random or systematic sampling.

For example, suppose you want to conduct an experiment evaluating the performance of sophomores in business education across Europe. It is impossible to conduct an experiment that involves a student from every university across the EU. Instead, by using Cluster Sampling, we can group the universities of each country into one cluster. Together, these clusters cover the entire sophomore student population in the EU. Next, you can use simple random sampling or systematic sampling to randomly select cluster(s) for your research study. Note that, as with Systematic Sampling, the result is usually a random sample, but bias in the created sample is not addressed.

Cluster Sampling: Python Implementation

First, we generate data that will serve as the population data, with 10K observations. The data consists of the following 4 variables:
- id
- price
- event_type
- click

Then the function get_clustered_Sample() takes as inputs the original data, the number of observations per cluster, and the number of clusters you want to select, and returns a clustered sample. The sample has the columns id, price, event_type, click, and cluster.
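As a rough sketch of this step (not the author's original code), the population and the function could look as follows. The column names and the function signature follow the text; the value ranges, the three event types, the click probability, the seed, and the choice to form clusters by randomly partitioning the rows are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)  # seed is an assumption
N = 10_000

# Hypothetical population with the four variables named in the text;
# the distributions and category labels below are illustrative assumptions.
population_df = pd.DataFrame({
    "id": np.arange(N),
    "price": rng.normal(loc=10, scale=2, size=N).round(2),
    "event_type": rng.choice(["type1", "type2", "type3"], size=N),
    "click": rng.choice([0, 1], size=N, p=[0.4, 0.6]),
})


def get_clustered_Sample(df, n_per_cluster, num_select_clusters, rng=rng):
    """Split df into random clusters of n_per_cluster rows, then keep a random subset of clusters."""
    num_clusters = len(df) // n_per_cluster
    # Shuffle the rows, then label consecutive blocks of n_per_cluster rows as one cluster
    shuffled = df.iloc[rng.permutation(len(df))].reset_index(drop=True)
    shuffled = shuffled.iloc[: num_clusters * n_per_cluster].copy()
    shuffled["cluster"] = np.repeat(np.arange(num_clusters), n_per_cluster)
    # Pick distinct clusters at random and return all of their rows
    chosen = rng.choice(num_clusters, size=num_select_clusters, replace=False)
    return shuffled[shuffled["cluster"].isin(chosen)]


cluster_sample = get_clustered_Sample(population_df, n_per_cluster=100, num_select_clusters=2)
print(cluster_sample.head())
```

In a real study the clusters would normally be natural groups (such as countries in the university example) rather than random blocks of rows; the random partition is used here only because the generated data has no natural grouping.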
Weighted Sampling

In some experiments, you might need each item's sampling probability to follow a weight associated with that item; that is, the proportions of the different types of observations should be taken into account. For example, you might need a sample of search engine queries weighted by the number of times each query has been performed, so that the sample can be analyzed for overall impact on the user experience. In this case, Weighted Sampling is preferable to Random Sampling or Systematic Sampling.

Weighted Sampling is a data sampling method with weights that is intended to compensate for the selection of specific observations with unequal probabilities (oversampling), non-coverage, non-response, and other types of bias. If a biased data set is not adjusted and a simple random sampling approach is used instead, the population descriptors (e.g., mean, median) will be skewed and will fail to represent the population correctly. Weighted Sampling addresses the bias in the sample by creating a sample that takes into account the proportions of the types of observations in the population. Hence, Weighted Sampling usually produces a random and unbiased sample.

[Figure omitted. Image source: the author]

Weighted Sampling: Python Implementation

The function get_weighted_sample() takes as inputs the original data and the desired sample size and returns a weighted sample. Note that the proportions, in this case, are defined based on the click event. That is, we compute the proportion of data points with click = 1 (say X%) and with click = 0 (Y%, where Y% = 100 - X%), and then generate a random sample that also contains X% observations with click = 1 and Y% observations with click = 0. The sample has the columns id, price, event_type, and click.
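A minimal sketch of this idea, under stated assumptions, is shown below. The function name and its two inputs come from the text; the generated population (value ranges, event types, click probability), the seed, and the rounding of per-class counts are assumptions, so the returned sample can differ from the requested size by a row.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=2)  # seed is an assumption
N = 10_000

# Hypothetical population; distributions and labels are illustrative assumptions
population_df = pd.DataFrame({
    "id": np.arange(N),
    "price": rng.normal(loc=10, scale=2, size=N).round(2),
    "event_type": rng.choice(["type1", "type2", "type3"], size=N),
    "click": rng.choice([0, 1], size=N, p=[0.4, 0.6]),
})


def get_weighted_sample(df, n, rng=rng):
    """Sample about n rows so that the click = 0 / click = 1 proportions match those in df."""
    parts = []
    for _, group in df.groupby("click"):
        # Number of rows to take from this class, proportional to its share of the population
        n_group = int(round(n * len(group) / len(df)))
        idx = rng.choice(group.index.to_numpy(), size=n_group, replace=False)
        parts.append(df.loc[idx])
    return pd.concat(parts)


weighted_sample = get_weighted_sample(population_df, n=1000)

# The share of click = 1 should be nearly identical in the population and the sample
print(population_df["click"].mean(), weighted_sample["click"].mean())
```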
Stratified Sampling

Stratified Sampling is a data sampling approach in which we divide a population into homogeneous subpopulations called strata, based on specific characteristics (e.g., age, race, gender identity, location, event type). Every member of the population studied should be in exactly one stratum. Each stratum is then sampled with another probability sampling method (in this post, Cluster Sampling), allowing data scientists to estimate statistical measures for each subpopulation. We rely on Stratified Sampling when the population's characteristics are diverse and we want to ensure that every characteristic is properly represented in the sample. In the implementation here, Stratified Sampling is therefore simply a combination of Clustered Sampling and Weighted Sampling.

[Figure omitted. Image source: the author]

Stratified Sampling: Python Implementation

The function get_stratified_sample() takes as inputs the original data, the desired sample size, and the number of clusters needed, and returns a stratified sample. Note that this function first performs weighted sampling using the click event and then performs clustered sampling using event_type. The sample has the columns id, price, event_type, click, and cluster.
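The original get_stratified_sample() code is not reproduced here. As a simpler point of comparison, the sketch below shows textbook proportional stratified sampling, where the strata are defined jointly by event_type and click and each stratum contributes rows in proportion to its size. This is not the author's function: the name stratified_sample_sketch, the generated population, and the seed are hypothetical, and the sketch relies on pandas' GroupBy.sample (available in pandas 1.1 and later).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=3)  # seed is an assumption
N = 10_000

# Hypothetical population; distributions and labels are illustrative assumptions
population_df = pd.DataFrame({
    "id": np.arange(N),
    "price": rng.normal(loc=10, scale=2, size=N).round(2),
    "event_type": rng.choice(["type1", "type2", "type3"], size=N),
    "click": rng.choice([0, 1], size=N, p=[0.4, 0.6]),
})


def stratified_sample_sketch(df, n, strata_cols=("event_type", "click"), seed=0):
    """Proportional stratified sample: each stratum contributes about n * (stratum share) rows."""
    frac = n / len(df)
    # Sample the same fraction of rows inside every stratum
    return df.groupby(list(strata_cols)).sample(frac=frac, random_state=seed)


stratified = stratified_sample_sketch(population_df, n=1000)

# Stratum shares in the population and in the sample should be nearly equal
pop_share = population_df.groupby(["event_type", "click"]).size() / len(population_df)
sample_share = stratified.groupby(["event_type", "click"]).size() / len(stratified)
print((pop_share - sample_share).abs().max())
```

Because the per-stratum counts are rounded, the total sample size can differ from n by a few rows.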
What is sampling the data?
In data analysis, sampling is the practice of analyzing a subset of all data in order to uncover the meaningful information in the larger data set.

How do you create a sample dataset in Python?
- Enter data manually in the editor window. The first step is to load the pandas package and use the DataFrame function. ...
- Read data from the clipboard. ...
- Enter data into Python like SAS. ...
- Prepare data using a sequence of numeric and character values. ...
- Generate random data. ...
- Create categorical variables. ...
- Import a CSV or Excel file.

How do you take a sample of a DataFrame in Python?
4 ways to randomly select rows from a pandas DataFrame:
(1) Randomly select a single row: df = df.sample()
(2) Randomly select a specified number of rows. ...
(3) Allow a random selection of the same row more than once (by setting replace=True): df = df.sample(n=3,replace=True)
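The DataFrame.sample() calls listed above can be tried on a small example. The DataFrame contents and the random_state values below are made up for illustration; only the sample() calls themselves come from the text. The results are stored under separate names so that the original df is not overwritten.

```python
import pandas as pd

# Small hypothetical DataFrame to sample from
df = pd.DataFrame({
    "id": range(1, 11),
    "price": [9.5, 10.1, 11.3, 8.7, 10.0, 12.2, 9.9, 10.4, 7.8, 11.0],
})

# (1) Randomly select a single row
one_row = df.sample(random_state=0)

# (2) Randomly select a specified number of rows (without replacement by default)
three_rows = df.sample(n=3, random_state=0)

# (3) Allow the same row to be selected more than once
three_rows_with_replacement = df.sample(n=3, replace=True, random_state=0)

print(one_row, three_rows, three_rows_with_replacement, sep="\n\n")
```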