104.3.1 Data Sampling in Python

Often the dataset we are working with is too large to handle comfortably in Python. A workaround is to draw a random sample from the dataset and work on that instead.
Sampling is appropriate in many situations because a random sample gives a close approximation of the underlying population.

Sampling in Python

  • Use the pandas DataFrame.sample() function
In [1]:
import pandas as pd 

Online_Retail=pd.read_csv("datasets\\Online Retail Sales Data\\Online Retail.csv", encoding = "ISO-8859-1")
Online_Retail.shape
Out[1]:
(541909, 8)
In [2]:
# replace must be the boolean False (the default) to sample without replacement
sample_data=Online_Retail.sample(n=1000,replace=False)
sample_data.shape
Out[2]:
(1000, 8)

Using the .sample() function on our dataset, we have taken a random sample of 1,000 rows out of the 541,909 rows in the full data.
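A random sample changes on every run unless the seed is fixed. The sketch below shows two other options of sample(): random_state, which makes the draw repeatable, and frac, which draws a fraction of the rows instead of a fixed count. The small DataFrame and its row_id column are hypothetical stand-ins, used so the example runs without the Online Retail file.

```python
import pandas as pd

# Hypothetical stand-in data; any DataFrame works the same way
df = pd.DataFrame({"row_id": range(10_000)})

# random_state fixes the seed, so both calls draw the same rows
sample_a = df.sample(n=1000, random_state=42)
sample_b = df.sample(n=1000, random_state=42)

# frac draws a fraction of the rows instead of a fixed count
sample_frac = df.sample(frac=0.1, random_state=42)

print(sample_a.equals(sample_b))   # same seed, same rows
print(sample_frac.shape)           # 10% of 10,000 rows
print(sample_a.index.is_unique)    # no row drawn twice without replacement
```

Fixing random_state is useful when results need to be reproduced later; leave it out when a fresh sample is wanted on each run.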

Practice: Sampling in Python

  • Import “Census Income Data/Income_data.csv”
  • Create a new dataset by taking a random sample of 5000 records
In [3]:
Income_Data=pd.read_csv("datasets\\Census Income Data\\Income_data.csv", encoding = "ISO-8859-1")
Income_Data.shape
Out[3]:
(32561, 15)
In [4]:
# replace must be the boolean False (the default) to sample without replacement
sample_Income_Data=Income_Data.sample(n=5000,replace=False)
sample_Income_Data.shape
Out[4]:
(5000, 15)
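One way to check that a sample is a close approximation of the full data is to compare a summary statistic on both. The following is a minimal sketch on synthetic data: the population DataFrame and its age column are assumptions made up for illustration, not the actual census file.

```python
import numpy as np
import pandas as pd

# Synthetic population with the same row count as the census data (illustrative only)
rng = np.random.default_rng(0)
population = pd.DataFrame({"age": rng.integers(17, 91, size=32_561)})

# Draw 5,000 rows without replacement, with a fixed seed for repeatability
sample = population.sample(n=5000, random_state=0)

# Compare the mean of the sample with the mean of the full data
pop_mean = population["age"].mean()
sample_mean = sample["age"].mean()

print(f"population mean: {pop_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
print(f"difference:      {abs(pop_mean - sample_mean):.2f}")
```

If the two values differ widely on real data, the sample may be too small or the column may be heavily skewed, and a larger sample is worth trying.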
