Memory Optimization using Pandas Library

  • February 20, 2023

Meet the Author : Mr. Bharani Kumar

Bharani Kumar Depuru is a well-known IT personality from Hyderabad. He is the Founder and Director of Innodatatics Pvt Ltd and 360DigiTMG. Bharani Kumar is an IIT and ISB alumnus with more than 17 years of experience, and he has held prominent positions at IT firms such as HSBC, ITC Infotech, Infosys, and Deloitte. He is an IT consultant specializing in Industrial Revolution 4.0 implementation, Data Analytics practice setup, Artificial Intelligence, Big Data Analytics, Industrial IoT, Business Intelligence and Business Management. Bharani Kumar is also the chief trainer at 360DigiTMG, with more than ten years of experience, and has been making the IT transition journey easy for his students. 360DigiTMG is at the forefront of delivering quality education, thereby bridging the gap between academia and industry.


The Pandas library offers native techniques that improve the performance of Python programs working with large datasets, chiefly by using memory more efficiently.

In this blog we shall learn a few basic practices that allow us to handle large datasets in an efficient way.


Tip 1

When working with large datasets, one of the simplest approaches is to identify the specific variables that are of interest and load only those variables for processing, rather than importing the entire dataset into memory.


Passing memory_usage='deep' to the info() function reports the real memory usage by inspecting the actual contents of each column. This calculation comes at the cost of extra computational resources. If we do not use the deep option, the memory usage is estimated from each column's dtype and the number of rows, on the assumption that every value of a given dtype consumes the same amount of memory.
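The difference between the two estimates can be seen in a minimal sketch like the one below. The DataFrame is a small hypothetical stand-in for the sample dataset (only the column names Age and Status come from this post), and the Status column is forced to the object dtype so that the deep measurement has strings to inspect.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from this post.
df = pd.DataFrame({
    "Age": [25, 40, 63] * 1000,
    "Status": pd.Series(["Show-Up", "No-Show", "Show-Up"] * 1000, dtype="object"),
})

# Shallow estimate: object columns are counted as pointers only.
shallow_bytes = df.memory_usage(deep=False).sum()

# Deep estimate: also measures the Python strings the pointers refer to.
deep_bytes = df.memory_usage(deep=True).sum()

print(f"shallow: {shallow_bytes} bytes, deep: {deep_bytes} bytes")

# The same deep measurement, reported as part of the info() summary.
df.info(memory_usage="deep")
```

The deep figure is the one to trust when the data contains many string columns, since the shallow figure ignores the strings themselves.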

The memory consumption for the sample dataset is 120.2 MB.

The question we need to ask ourselves is:
Q. Do we require the entire data for processing?

Optimize the Memory Usage

Assume we are interested only in the columns Age and Status.

Why not import only these two columns?

This approach will reduce the memory consumption drastically.


The usecols parameter of read_csv() function can filter out all other columns and import only the required fields.

This lets us use memory more efficiently, which can improve the performance of our code.
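As a rough sketch of the usecols approach, the example below reads the same data twice, once in full and once with only Age and Status. The CSV text is a hypothetical in-memory stand-in for the file on disk, and the extra column names are assumptions made for illustration.

```python
import io
import pandas as pd

# Hypothetical stand-in for the CSV file on disk.
csv_text = (
    "Age,Gender,Status,DayOfTheWeek\n"
    "25,M,Show-Up,Monday\n"
    "40,F,No-Show,Tuesday\n"
    "63,F,Show-Up,Friday\n"
)

# Load every column.
full_df = pd.read_csv(io.StringIO(csv_text))

# Load only the two columns of interest; the others are never materialised.
subset_df = pd.read_csv(io.StringIO(csv_text), usecols=["Age", "Status"])

full_bytes = full_df.memory_usage(deep=True).sum()
subset_bytes = subset_df.memory_usage(deep=True).sum()

print(list(subset_df.columns))
print(f"full: {full_bytes} bytes, subset: {subset_bytes} bytes")
```

With a real file, the io.StringIO object would simply be replaced by the file path.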


Tip 2

Choose the Appropriate Data types.

Pandas stores data using standard data types. The data type of every column is inferred automatically from the data it holds.

Each of these standard data types has a predefined structure and storage size.

Let's discuss the numerical data types and their memory consumption.
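The storage size and value range of each integer type can be looked up with NumPy, which pandas uses for its numeric columns. The snippet below is a small illustration, and the selection of types listed is our own choice.

```python
import numpy as np

# Collect (bytes per value, minimum, maximum) for a few integer dtypes.
ranges = {}
for name in ("int8", "uint8", "int16", "int32", "int64"):
    dtype = np.dtype(name)
    info = np.iinfo(dtype)
    ranges[name] = (dtype.itemsize, int(info.min), int(info.max))
    print(f"{name}: {dtype.itemsize} byte(s) per value, range {info.min} to {info.max}")
```

A column can safely use the smallest type whose range covers all of its values.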


Importing the entire sample data consumes 120.2 MB of memory. To optimize the memory utilization, we can change the data types from int64 to int32, int16, or int8 as appropriate.

For example, the Age column consists of positive two-digit numbers, which fit within the range of int8 (-128 to 127) and uint8 (0 to 255), so we can typecast Age to either type.


Observe the difference in the size of the data after typecasting it to int8.

It has reduced from 2.28 MB to 0.28 MB, because int8 stores each value in 1 byte instead of the 8 bytes used by int64.
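A minimal sketch of this typecast is shown below. The Age values are randomly generated two-digit numbers and the row count is an assumption, so the byte counts will not match the sample dataset exactly. The sketch checks the value range first, since astype() does not warn when values fall outside the target type.

```python
import numpy as np
import pandas as pd

# Hypothetical Age column: random two-digit values stored as int64.
rng = np.random.default_rng(0)
age = pd.Series(rng.integers(10, 100, size=300_000), dtype="int64", name="Age")

# Confirm that every value fits in int8 before downcasting.
limits = np.iinfo(np.int8)
fits_int8 = bool(age.min() >= limits.min and age.max() <= limits.max)

age_small = age.astype("int8") if fits_int8 else age

before = age.memory_usage(index=False)
after = age_small.memory_usage(index=False)
print(f"int64: {before} bytes, int8: {after} bytes")

# Alternatively, let pandas pick the smallest signed integer type itself.
age_auto = pd.to_numeric(age, downcast="integer")
print(age_auto.dtype)
```

The to_numeric() variant is convenient when many columns need the same treatment, because it chooses the type from the data.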

Typecasting all the numeric columns will reduce the overall memory consumption for the dataframe.


Applying this strategy has brought the size of the data down from 120.2 MB to 100.2 MB.

Further, we can convert a column to the boolean type if its data is binary in nature. Use the unique() or value_counts() functions to check the distinct values in the object columns.
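The boolean conversion could look something like the sketch below. The Status labels "Show-Up" and "No-Show" are assumed values chosen for illustration, and the new name showed_up is hypothetical.

```python
import pandas as pd

# Hypothetical binary column stored as Python strings.
status = pd.Series(
    ["Show-Up", "No-Show", "Show-Up", "Show-Up"] * 1000,
    dtype="object",
    name="Status",
)

# Inspect the distinct values before deciding to convert.
print(status.value_counts())

# Convert only if there are exactly two distinct values and none are missing.
is_binary = len(status.unique()) == 2 and bool(status.notna().all())
showed_up = (status == "Show-Up") if is_binary else status

before = status.memory_usage(index=False, deep=True)
after = showed_up.memory_usage(index=False, deep=True)
print(f"object: {before} bytes, bool: {after} bytes")
```

A boolean column uses one byte per row, regardless of how long the original labels were.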

Tip 3

Reduce the memory consumption of categorical values by replacing them with a compact representation.


The 'DayOfTheWeek' column takes up 18.38 MB (19276698 bytes) because every row stores the full name of the weekday. The size can be reduced by replacing the full names with a short representation, which can be achieved with a data type called category. Each distinct value is stored only once, and the rows hold small integer codes that refer to it, so repeated data in a non-numerical column is stored far more compactly.


The memory consumption has reduced from 18.38 MB to 0.28 MB. This is a huge reduction, especially when the data has a large number of rows.
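A minimal sketch of the category conversion follows. The weekday values and the row count are made up for illustration, so the sizes printed will differ from the figures quoted for the sample dataset.

```python
import pandas as pd

weekdays = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Hypothetical DayOfTheWeek column holding full weekday names as strings.
days = pd.Series(weekdays * 10_000, dtype="object", name="DayOfTheWeek")

# Store each distinct name once and keep small integer codes per row.
days_cat = days.astype("category")

before = days.memory_usage(index=False, deep=True)
after = days_cat.memory_usage(index=False, deep=True)
print(f"object: {before} bytes, category: {after} bytes")

# The per-row codes are what make the column compact.
print(days_cat.cat.codes.dtype, len(days_cat.cat.categories))
```

The saving is largest when a column has few distinct values relative to its number of rows. A column in which almost every value is unique gains little from this conversion.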

Tip 4

Converting date columns to the datetime type reduces memory usage very effectively. By default, the pandas library infers such values as the string/object type.


The column AppointmentRegistration is inferred as the object type by default. Changing this column to datetime will bring the memory consumption down drastically. The current memory usage is about 22 MB. Let's convert this data to datetime and measure the consumption.


After the conversion to datetime, the memory consumption has come down from 22 MB to 2.28 MB.
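The conversion can be sketched as follows. The timestamp strings and their format are assumptions, since the actual layout of the AppointmentRegistration values is not shown here.

```python
import pandas as pd

# Hypothetical registration timestamps stored as Python strings.
registered = pd.Series(
    ["2014-12-16 14:46:25", "2015-08-18 07:01:26", "2014-02-17 12:53:46"] * 1000,
    dtype="object",
    name="AppointmentRegistration",
)

# Parse the strings into a datetime column (8 bytes per value).
registered_dt = pd.to_datetime(registered, format="%Y-%m-%d %H:%M:%S")

before = registered.memory_usage(index=False, deep=True)
after = registered_dt.memory_usage(index=False, deep=True)
print(f"object: {before} bytes, datetime: {after} bytes")
print(registered_dt.dtype)
```

Besides saving memory, a datetime column gives access to the .dt accessor for extracting the year, month, weekday and so on.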

Conclusion:

Let us apply all the techniques that we have discussed in this blog to the sample data and see the overall effect on memory utilization.
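One way these techniques might be combined is to apply them while the file is being read, so that the unoptimized DataFrame is never built. The sketch below is not the exact code used for the sample dataset: the CSV text and the Gender column are hypothetical, and the dtype choices assume the value ranges discussed in the tips.

```python
import io
import pandas as pd

# Hypothetical stand-in for the CSV file on disk.
csv_text = (
    "Age,Gender,AppointmentRegistration,DayOfTheWeek,Status\n"
    "25,M,2014-12-16 14:46:25,Monday,Show-Up\n"
    "40,F,2015-08-18 07:01:26,Tuesday,No-Show\n"
    "63,F,2014-02-17 12:53:46,Monday,Show-Up\n"
)

df = pd.read_csv(
    io.StringIO(csv_text),
    # Tip 1: load only the columns we need.
    usecols=["Age", "AppointmentRegistration", "DayOfTheWeek", "Status"],
    # Tips 2 and 3: compact numeric and categorical types.
    dtype={"Age": "int8", "DayOfTheWeek": "category"},
    # Tip 4: parse dates instead of keeping them as strings.
    parse_dates=["AppointmentRegistration"],
)

# Tip 2 (continued): a binary column can be stored as booleans.
df["Status"] = df["Status"] == "Show-Up"

df.info(memory_usage="deep")
```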


Typecasting to 'int8' has reduced the memory usage by 20 MB.

Let's target the date columns now.


The typecasting technique can help reduce the memory usage to some extent.

We have successfully shrunk the memory usage from 120.2 MB to 42.6 MB.

There are certain limitations with the Pandas library, though, especially if the dataset is very large compared to the machine's RAM capacity.

Pandas on its own is not an ideal library for handling very large datasets. There are supporting packages that can help Pandas scale up to deal with them.
