Memory Optimization using Pandas Library
Sharat Chandra is the Head of Analytics at 360DigiTMG and one of the founders and directors of AiSPRY. With more than 17 years of experience in the IT sector, including 14+ years as a data scientist across several industry domains, he has a wide range of expertise in areas like retail, manufacturing, and medical care. With over ten years as the head trainer at 360DigiTMG, he has been helping his students make a smooth transition into the IT industry. Along with the Oncology team, he contributed to the field of LSHC, particularly cancer therapy, with work published in a British cancer research journal.
The performance of a Python program working on a large dataset can be improved using the Pandas library. Pandas' native techniques can be applied to use memory appropriately and optimize performance. In this blog we will cover a few basic practices that allow us to handle large datasets efficiently.
When working with large datasets, one of the simplest approaches is to identify the specific variables of interest and load only those variables for processing, rather than importing the entire dataset into memory.
The memory_usage='deep' option of the info() function reports real memory usage. The calculation comes at the cost of extra computational resources. If we do not use the deep option, the reported memory usage is estimated from the column dtype and the number of rows, assuming values consume the same amount of memory for their corresponding dtypes.
The memory consumption for the sample dataset is 120.2 MB.
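As a sketch, here is how the deep measurement differs from the default estimate on a small synthetic frame (the column names are stand-ins for the sample dataset's fields):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the sample dataset
df = pd.DataFrame({
    "Age": np.random.randint(10, 99, size=100_000),
    "Status": np.random.choice(["Show-Up", "No-Show"], size=100_000),
})

# Without deep=True, object columns report only pointer sizes;
# deep=True walks each string and counts the actual bytes.
shallow = df.memory_usage(index=False).sum()
deep = df.memory_usage(index=False, deep=True).sum()
print(f"shallow estimate: {shallow:,} bytes, true usage: {deep:,} bytes")
```

`df.info(memory_usage="deep")` prints the same deep figure as part of a summary.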
The question we need to ask ourselves is: do we require the entire dataset for processing?
Assume our interest is only in two columns: Age and Status.
Why not import only these two columns?
This approach will reduce memory consumption drastically.
The usecols parameter of the read_csv() function can filter out all other columns and import only the required fields. This lets us use memory in an optimized manner, which can enhance the performance of our code.
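A minimal sketch, using a hypothetical in-memory CSV in place of the sample file:

```python
import io
import pandas as pd

# Hypothetical CSV standing in for the appointments file
raw = io.StringIO(
    "PatientId,Age,Status,DayOfTheWeek\n"
    "1001,34,Show-Up,Monday\n"
    "1002,57,No-Show,Friday\n"
)

# Parse only the two columns of interest; the rest are never materialized
df = pd.read_csv(raw, usecols=["Age", "Status"])
print(df.columns.tolist())  # ['Age', 'Status']
```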
When data is imported, each column's data type is automatically inferred from the data it holds. Each of these standard data types has a predefined structure and storage size.
Let's discuss the numerical data types and their memory consumption.
Importing the entire sample data consumes 120.2 MB of memory. To optimize the memory utilization we can alter the data types from int64 to int32, int16, or int8 as appropriate.
For example, the Age column consists of positive two-digit numbers, so we can typecast Age to int8 or uint8.
Observe the difference in the size of the column after typecasting it to int8.
From 2.28 MB it has reduced to 0.28 MB.
Typecasting all the numeric columns will reduce the overall memory consumption for the dataframe.
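A sketch of the downcast, on a synthetic Age column (on a 64-bit platform where the default integer dtype is int64, the reduction is roughly 8x):

```python
import numpy as np
import pandas as pd

ages = pd.Series(np.random.randint(10, 99, size=300_000), name="Age")
before = ages.memory_usage(index=False)

# Two-digit positive values fit comfortably in uint8 (0..255)
ages = ages.astype("uint8")
after = ages.memory_usage(index=False)
print(f"{before:,} bytes -> {after:,} bytes")
```

Alternatively, `pd.to_numeric(col, downcast="unsigned")` picks the smallest fitting unsigned dtype automatically.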
Further, we can also convert values into boolean type if the data is binary in nature. Use the unique() or value_counts() functions to inspect the object columns.
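For instance, assuming a hypothetical binary column named Smokes:

```python
import pandas as pd

col = pd.Series(["Yes", "No", "No", "Yes"] * 50_000, name="Smokes")
print(col.unique())     # verify the column really is binary

# A boolean column stores one byte per value instead of a Python string
flags = col == "Yes"
print(flags.dtype)
```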
The memory consumption of categorical values can be reduced by re-encoding the values.
The 'DayOfTheWeek' column occupies 18.38 MB (19,276,698 bytes), because each row stores the full name of the weekday. The size can be reduced by replacing the full form with a short-form representation. This can be achieved by using a datatype called category: repeated values in a non-numerical column are stored in a comparatively compact representation.
The memory consumption has dropped from 18.38 MB to 0.28 MB, which is a huge compression, especially when the data has a large number of rows.
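As a sketch on synthetic weekday data:

```python
import numpy as np
import pandas as pd

days = pd.Series(
    np.random.choice(["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"],
                     size=200_000),
    name="DayOfTheWeek",
)
before = days.memory_usage(index=False, deep=True)

# Each distinct string is stored once; rows hold small integer codes
days = days.astype("category")
after = days.memory_usage(index=False, deep=True)
print(f"{before:,} bytes -> {after:,} bytes")
```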
Converting date columns to datetime reduces memory usage very effectively. By default, the Pandas library infers such values as string/object type.
The column AppointmentRegistration is inferred as object type by default. Changing this column to datetime will bring the memory consumption drastically down. The current memory usage is roughly 22 MB. Let's convert this data to datetime and measure the consumption.
After the conversion to datetime, the memory consumption has come down from 22 MB to 2.28 MB.
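A sketch with synthetic timestamp strings:

```python
import pandas as pd

ts = pd.Series(["2016-04-29 18:38:08", "2016-04-25 16:08:27"] * 100_000,
               name="AppointmentRegistration")
before = ts.memory_usage(index=False, deep=True)

# datetime64[ns] stores each timestamp as a fixed 8-byte integer
# instead of a variable-length Python string
ts = pd.to_datetime(ts)
after = ts.memory_usage(index=False, deep=True)
print(f"{before:,} bytes -> {after:,} bytes")
```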
Let us apply all the techniques discussed in this blog to the sample data and observe the overall effect on memory utilization.
Typecasting to 'int8' has reduced the memory usage by 20 MB.
Let's target the date columns now.
The typecasting technique can help reduce the memory usage to some extent.
We have successfully shrunk the memory usage from 120.2 MB to 42.6 MB.
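All of the above can also be applied in a single read_csv() call, so the unoptimized frame is never built. A sketch, with an in-memory CSV standing in for the sample file:

```python
import io
import pandas as pd

raw = io.StringIO(
    "Age,Status,DayOfTheWeek,AppointmentRegistration\n"
    "34,Show-Up,Monday,2016-04-29 18:38:08\n"
    "57,No-Show,Friday,2016-04-25 16:08:27\n"
)

# Column selection, downcasting, category encoding, and date
# parsing all happen at load time
df = pd.read_csv(
    raw,
    usecols=["Age", "DayOfTheWeek", "AppointmentRegistration"],
    dtype={"Age": "uint8", "DayOfTheWeek": "category"},
    parse_dates=["AppointmentRegistration"],
)
print(df.dtypes)
```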
There are certain limitations with the Pandas library, though, especially when the dataset is very large compared to the machine's RAM capacity.
Pandas on its own is not an ideal library for handling such large datasets. There are supporting packages that can help Pandas scale up to deal with them.