Memory Optimization Using the Pandas Library
The performance of a Python program that works with large datasets can be improved considerably by using the Pandas library.
Pandas' native techniques can be applied to utilize memory appropriately and optimize performance.
In this blog we will learn a few basic practices that allow us to handle large datasets in an efficient way.
When working with large datasets, one of the simplest approaches is to identify the specific variables that are of interest and load only those variables for processing, rather than importing the entire dataset into memory.
Passing memory_usage='deep' to the info() function reports the real memory usage. The calculation comes at the cost of extra computational resources. Without the deep option, the reported memory usage is only an estimate based on the column dtypes and the number of rows, assuming that all values of a given dtype consume the same amount of memory.
The memory consumption for the sample dataset is 120.2 MB.
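For instance, on a small frame standing in for the appointment data (the column names and values here are illustrative, not the real dataset), the difference between the shallow estimate and the deep inspection is easy to see:

```python
import pandas as pd

# A tiny sample frame standing in for the appointment dataset
df = pd.DataFrame({
    "Age": [23, 45, 31, 60],
    "Status": ["Show-Up", "No-Show", "Show-Up", "Show-Up"],
})

# Shallow estimate: object columns are counted as pointer-sized slots only
shallow = df.memory_usage().sum()

# Deep inspection walks each Python string and reports the true bytes used
deep = df.memory_usage(deep=True).sum()

print(shallow, deep)  # deep is larger because it counts the string payloads
```

The same deep accounting is what `df.info(memory_usage='deep')` prints at the bottom of its summary.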
The question we need to ask ourselves is:
Q. Do we require the entire data for processing?
Optimize the Memory Usage
Assume our interest is only in the columns Age and Status.
Why not import only these two columns?
This approach will reduce the memory consumption drastically.
The usecols parameter of the read_csv() function filters out all other columns and imports only the required fields.
This allows us to utilize memory in an optimized manner, which can enhance the performance of our code.
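A minimal sketch of the usecols approach, assuming the column names used in this blog (the in-memory CSV text below is only a stand-in for the real file):

```python
import io
import pandas as pd

# Stand-in for the CSV file on disk
csv_data = io.StringIO(
    "Age,Gender,Status,DayOfTheWeek\n"
    "23,F,Show-Up,Monday\n"
    "45,M,No-Show,Tuesday\n"
)

# Load only the two columns we need; the rest are never materialised
df = pd.read_csv(csv_data, usecols=["Age", "Status"])

print(df.columns.tolist())  # ['Age', 'Status']
```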
Choose the Appropriate Data Types
When data is imported, Pandas automatically infers a data type for every column based on the values it holds.
Each of these standard data types has a predefined structure and storage size.
Let's discuss the numerical data types and their memory consumption.
Importing the entire sample data consumes 120.2 MB of memory. To optimize the memory utilization we can alter the data types from int64 to int32, int16, or int8 as appropriate.
For example, the Age column consists of positive two-digit numbers, so we can typecast Age to int8 or uint8.
Observe the difference in the size of the data post typecasting it to int8.
From 2.28 MB it has reduced to 0.28 MB.
Typecasting all the numeric columns will reduce the overall memory consumption for the dataframe.
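A quick illustration with synthetic ages (the values are made up; only the dtype change matters):

```python
import numpy as np
import pandas as pd

# 100,000 two-digit ages stored as 8-byte integers, as Pandas infers by default
df = pd.DataFrame({"Age": np.random.randint(10, 100, size=100_000).astype("int64")})
before = df["Age"].memory_usage(deep=True)

# uint8 covers 0..255, which is plenty for an age column
df["Age"] = df["Age"].astype("uint8")
after = df["Age"].memory_usage(deep=True)

print(before, after)  # the column values shrink from 8 bytes to 1 byte each
```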
Applying this strategy has brought the size of the data down from 120.2 MB to 100.2 MB.
Further, we can also convert the values into boolean type if the data is binary in nature. Use the unique() or value_counts() functions to inspect the object columns.
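A sketch of the check-then-convert pattern; the 'Show-Up'/'No-Show' labels and the ShowedUp column name are assumptions for illustration:

```python
import pandas as pd

df = pd.DataFrame({"Status": ["Show-Up", "No-Show", "Show-Up", "No-Show"]})

# Verify the column really is binary before converting
print(df["Status"].unique())  # ['Show-Up' 'No-Show']

# Map the two labels to a 1-byte boolean column (hypothetical label meanings)
df["ShowedUp"] = df["Status"].map({"Show-Up": True, "No-Show": False}).astype("bool")
```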
Reduce the Memory Consumption of Categorical Values
The 'DayOfTheWeek' column occupies 18.38 MB (19,276,698 bytes) because each weekday is stored as its full name. The size can be reduced by replacing the full form with a short representation, or better, by using a data type called category: values that repeat in a non-numerical column are stored in a comparatively compact representation.
The memory consumption has reduced from 18.38 MB to 0.28 MB. This is a huge compression, especially when the data contains a large number of rows.
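A sketch of the category conversion on synthetic weekday data (the row count is arbitrary):

```python
import pandas as pd

# Repeated weekday names, as in the DayOfTheWeek column
days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"] * 20_000
df = pd.DataFrame({"DayOfTheWeek": days})

before = df["DayOfTheWeek"].memory_usage(deep=True)

# category stores each distinct string once, plus a small integer code per row
df["DayOfTheWeek"] = df["DayOfTheWeek"].astype("category")
after = df["DayOfTheWeek"].memory_usage(deep=True)

print(before, after)
```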
Converting date columns to datetime also reduces memory usage very effectively. By default, the Pandas library infers such values as string/object type.
The AppointmentRegistration column is inferred as object type by default. Changing this column to datetime brings the memory consumption down drastically. The current memory usage is roughly 22 MB. Let's convert this data to datetime and measure the consumption.
Post the conversion to datetime the memory consumption has come down to 2.28 MB from 22 MB.
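A sketch of the conversion, using made-up timestamps in place of the real AppointmentRegistration values:

```python
import pandas as pd

# Timestamp strings, as the column would look when inferred as object dtype
df = pd.DataFrame({
    "AppointmentRegistration": ["2015-12-14 14:32:10", "2015-12-15 09:01:55"] * 50_000
})

before = df["AppointmentRegistration"].memory_usage(deep=True)

# Parse the strings into a fixed-width 8-byte datetime64 column
df["AppointmentRegistration"] = pd.to_datetime(df["AppointmentRegistration"])
after = df["AppointmentRegistration"].memory_usage(deep=True)

print(before, after)
```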
Let us implement all the techniques that we have discussed in this blog on the sample data and see the overall effect on memory utilization.
Typecasting to 'int8' has reduced the memory usage by 20 MB.
Let's target the date columns now.
The typecasting technique can help reduce the memory usage to some extent.
We have successfully shrunk the memory usage from 120.2 MB to 42.6 MB.
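All of these techniques can also be applied in a single read_csv() call at load time. A minimal sketch, again with stand-in CSV text and the column names assumed throughout this blog:

```python
import io
import pandas as pd

# Stand-in for the CSV file on disk
csv_data = io.StringIO(
    "Age,Status,DayOfTheWeek,AppointmentRegistration\n"
    "23,Show-Up,Monday,2015-12-14 14:32:10\n"
    "45,No-Show,Tuesday,2015-12-15 09:01:55\n"
)

# Narrow integer dtype, category encoding, and datetime parsing, all at import
df = pd.read_csv(
    csv_data,
    dtype={"Age": "uint8", "Status": "category", "DayOfTheWeek": "category"},
    parse_dates=["AppointmentRegistration"],
)

print(df.dtypes)
```

Doing the conversions at load time avoids ever holding the unoptimized int64/object columns in memory.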
There are certain limitations with the Pandas library, though, especially when the dataset is very large compared to the machine's RAM capacity.
Pandas on its own is not an ideal library for handling such datasets. There are supportive packages that help Pandas scale up to deal with very large datasets.