104.2.1 Importing data in Python

In this post we will see how to import datasets into Python using pandas.

Data import from CSV files

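  • Use the pandas function read_csv (called as pd.read_csv)
  • Write the path with “/” or “\\” as the separator. A single Windows-style “\” does not work reliably in an ordinary Python string, because Python treats the backslash as an escape character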
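The path rule above can be checked without any data file. The following is a minimal sketch using only the standard library; the file name new_sales.csv is hypothetical and chosen only because it starts with the letter n, which makes the escape problem visible.

```python
from pathlib import PureWindowsPath

# Three ways to spell the path used in this post.
forward = "datasets/Superstore Sales Data/Sales_sample.csv"
escaped = "datasets\\Superstore Sales Data\\Sales_sample.csv"
raw = r"datasets\Superstore Sales Data\Sales_sample.csv"   # raw string: backslashes kept as-is

# The escaped and raw spellings produce exactly the same string.
same_string = escaped == raw

# On Windows, "/" and "\" are both accepted as separators, so the
# forward-slash spelling refers to the same path.
same_path = PureWindowsPath(forward) == PureWindowsPath(raw)

# Pitfall: in an ordinary string, "\n" is a newline, not a backslash plus "n".
# (new_sales.csv is a hypothetical file name used only for illustration.)
broken = "datasets\new_sales.csv"

print(same_string, same_path)
print("\n" in broken)
```

Forward slashes are usually the simplest choice, since the same string then works on Windows, Linux and macOS.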

Importing from CSV files

In [2]:
import pandas as pd     # importing library pandas

Sales = pd.read_csv("datasets\\Superstore Sales Data\\Sales_sample.csv")

print(Sales)
    custId            custName                   custCountry productSold  \
0    23262        Candice Levy                         Congo     SUPA101   
1    23263        Xerxes Smith                        Panama     DETA200   
2    23264        Levi Douglas  Tanzania, United Republic of     DETA800   
3    23265        Uriel Benton                  South Africa     SUPA104   
4    23266        Celeste Pugh                         Gabon     PURA200   
5    23267        Vance Campos          Syrian Arab Republic     PURA100   
6    23268        Latifah Wall                    Guadeloupe     DETA100   
7    23269      Jane Hernandez                     Macedonia     PURA100   
8    23270         Wanda Garza                    Kyrgyzstan     SUPA103   
9    23271  Athena Fitzpatrick                       Reunion     SUPA103   
10   23272       Anjolie Hicks      Turks and Caicos Islands     DETA200   

   salesChannel  unitsSold   dateSold  
0        Retail        117   8/9/2012  
1        Online         73   7/6/2012  
2        Online        205  8/18/2012  
3        Online         14   8/5/2012  
4        Retail        170  8/11/2012  
5        Retail        129  7/11/2012  
6        Retail         82  7/12/2012  
7        Retail        116   6/3/2012  
8        Online         67   6/7/2012  
9        Retail        125  7/27/2012  
10       Retail         71  7/31/2012  

Data import from Excel files

  • Use pandas again, this time with the function read_excel and the name of the sheet to read
In [2]:
import pandas as pd

# sheet_name selects the worksheet; cells containing 'NA' are read as missing values
wb_data = pd.read_excel("datasets\\World Bank Data\\World Bank Indicators.xlsx", sheet_name="Data by country", index_col=None, na_values=['NA'])

wb_data.head(5)
Out[2]:
  | Country Name | Date | Transit: Railways, (million passenger-km) | Transit: Passenger cars (per 1,000 people) | Business: Mobile phone subscribers | Business: Internet users (per 100 people) | Health: Mortality, under-5 (per 1,000 live births) | Health: Health expenditure per capita (current US$) | Health: Health expenditure, total (% GDP) | Population: Total (count) | Population: Urban (count) | Population:: Birth rate, crude (per 1,000) | Health: Life expectancy at birth, female (years) | Health: Life expectancy at birth, male (years) | Health: Life expectancy at birth, total (years) | Population: Ages 0-14 (% of total) | Population: Ages 15-64 (% of total) | Population: Ages 65+ (% of total) | Finance: GDP (current US$) | Finance: GDP per capita (current US$)
0 | Afghanistan | 2000-07-01 | 0.0 | NaN | 0.0 | NaN | 151.0 | 11.0 | 8.0 | 25950816 | 5527524.0 | 51.0 | 45.0 | 45.0 | 45.0 | 48.0 | 50.0 | 2.0 | NaN | NaN
1 | Afghanistan | 2001-07-01 | 0.0 | NaN | 0.0 | 0.0 | 150.0 | 11.0 | 9.0 | 26697430 | 5771984.0 | 50.0 | 46.0 | 45.0 | 46.0 | 48.0 | 50.0 | 2.0 | 2.461666e+09 | 92.0
2 | Afghanistan | 2002-07-01 | 0.0 | NaN | 25000.0 | 0.0 | 150.0 | 22.0 | 7.0 | 27465525 | 6025936.0 | 49.0 | 46.0 | 46.0 | 46.0 | 48.0 | 50.0 | 2.0 | 4.338908e+09 | 158.0
3 | Afghanistan | 2003-07-01 | 0.0 | NaN | 200000.0 | 0.0 | 151.0 | 25.0 | 8.0 | 28255719 | 6289723.0 | 48.0 | 46.0 | 46.0 | 46.0 | 48.0 | 50.0 | 2.0 | 4.766127e+09 | 169.0
4 | Afghanistan | 2004-07-01 | 0.0 | NaN | 600000.0 | 0.0 | 150.0 | 30.0 | 9.0 | 29068646 | 6563700.0 | 47.0 | 46.0 | 46.0 | 46.0 | 48.0 | 50.0 | 2.0 | 5.704203e+09 | 196.0

Basic Commands on Datasets

  • Was the data imported correctly? Were the variables imported in the right format? Were all the rows imported?
  • Once the dataset is inside Python, we should run some basic checks to get an idea of what it contains.
  • Printing the whole dataset is not always a good option, especially when it is large.
  • It is good practice to check the number of rows and columns, take a quick look at the variable types, and view a summary and a snapshot of the data.

Checklist after import

Data: Superstore Sales Data\Sales_sample.csv

Code | Description
Sales.shape | Number of rows and columns
Sales.columns.values | Column names; sometimes the import does not pick up the header row as column names
Sales.head(10) | First 10 observations of the data
Sales.tail(10) | Last 10 observations of the data
Sales.dtypes | Data types of all variables
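The checklist can be tried without the original file. Below is a minimal sketch that assumes a four-row extract copied from the Sales_sample.csv output shown earlier; io.StringIO stands in for the file on disk, so the numbers describe this extract only, not the full dataset.

```python
import io
import pandas as pd

# Four rows copied from the printed Sales data above (an extract, not the full file).
csv_text = """custId,custName,custCountry,productSold,salesChannel,unitsSold,dateSold
23262,Candice Levy,Congo,SUPA101,Retail,117,8/9/2012
23263,Xerxes Smith,Panama,DETA200,Online,73,7/6/2012
23264,Levi Douglas,"Tanzania, United Republic of",DETA800,Online,205,8/18/2012
23265,Uriel Benton,South Africa,SUPA104,Online,14,8/5/2012
"""

# io.StringIO lets read_csv treat the text as if it were a file.
Sales = pd.read_csv(io.StringIO(csv_text))

print(Sales.shape)            # (rows, columns) -> (4, 7) for this extract
print(Sales.columns.values)   # were the header names picked up?
print(Sales.head(2))          # first two observations
print(Sales.tail(2))          # last two observations
print(Sales.dtypes)           # one data type per column
```

In the dtypes output, custId and unitsSold come through as integers, while dateSold is read as text rather than as a date. That is exactly the kind of format issue these checks are meant to surface.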

Quick Summary

Code | Description
Sales.describe() | Summary of all numeric variables
Sales['custId'].describe() | Summary of a single variable
Sales.salesChannel.value_counts() | Frequency table for a given variable
Sales.custCountry.value_counts() | Frequency table for a categorical variable
Sales.custId.isnull().sum() | Missing value count in a variable
Sales.sample(n=10) | Random sample of 10 rows
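The summary commands can be sketched the same way. This example assumes a five-row extract copied from the Sales output shown earlier; the random_state argument is an addition made here so that the random sample is repeatable, and the sample size is reduced to 3 because the extract has only five rows.

```python
import io
import pandas as pd

# Five rows copied from the printed Sales data above (an extract, not the full file).
csv_text = """custId,custName,custCountry,productSold,salesChannel,unitsSold,dateSold
23262,Candice Levy,Congo,SUPA101,Retail,117,8/9/2012
23263,Xerxes Smith,Panama,DETA200,Online,73,7/6/2012
23264,Levi Douglas,"Tanzania, United Republic of",DETA800,Online,205,8/18/2012
23265,Uriel Benton,South Africa,SUPA104,Online,14,8/5/2012
23266,Celeste Pugh,Gabon,PURA200,Retail,170,8/11/2012
"""
Sales = pd.read_csv(io.StringIO(csv_text))

# Summary statistics: all numeric columns, then a single column.
print(Sales.describe())
print(Sales["unitsSold"].describe())

# Frequency table for a categorical variable.
channel_counts = Sales["salesChannel"].value_counts()
print(channel_counts)

# Number of missing values in one column.
missing_ids = Sales["custId"].isnull().sum()
print(missing_ids)

# Random sample of rows; random_state (an assumption added here) fixes the draw.
sample_rows = Sales.sample(n=3, random_state=0)
print(sample_rows)
```

Note that describe() on the whole frame reports only the numeric columns by default; text columns such as custCountry are better summarised with value_counts().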
