Thinning your data out with pandas

When compiling large datasets, you may end up with values in some columns that make up only a small contribution of the total. These values may be important and should not just be discarded without careful consideration. Equally, if they van be discarded, the below helper function can assist - given a dataframe, a specified column (of categorical data) and a threshold cutoff, any values within the column that make up less than the threshold contribution to the total will result in that row of the dataframe being dropped.

This is especially useful when using the input dataframe for plotting for further analysis.

def thin_data(data, variable_to_consider, percentage_cutoff=10, verbose=True):
	"""
	Takes in a dataframe and removes rows where the contribution of categorical variables in a specified column make up 
	less than a defined percentage of the total 

	data=pandas dataframe
	variable_to_consider=name of column
	percentage_cutoff=where entries in the variable_to_consider column make up less than this percentage, they will be dropped

	RETURNS dataframe 
	"""

	if verbose:
		print("*****")
		print("Trimming dataframe")
		print("Column: %s " %variable_to_consider)
		print("Dropping rows where %s variables make up less than %i %% of the total column composition." %(variable_to_consider,percentage_cutoff))
		print(" ")
		print("%s : %% of total" %variable_to_consider)
		print("----------------")
		print((data[variable_to_consider].value_counts()/data[variable_to_consider].count())*100)
	
	class_percents=(data[variable_to_consider].value_counts()/data[variable_to_consider].count())*100
	class_percents_lessThan_CUTOFF=class_percents[class_percents<percentage_cutoff]
	class_percents_greaterThan_CUTOFF=class_percents.drop(class_percents_lessThan_CUTOFF.index)

	# drop rows where CLASS is > specified percentage 
	#help: https://stackoverflow.com/questions/12065885/filter-dataframe-rows-if-value-in-column-is-in-a-set-list-of-values
	
	mask_gtCUTOFF=data[variable_to_consider].isin(list(class_percents_greaterThan_CUTOFF.index))
	data_trimmed=data[mask_gtCUTOFF]

	return(data_trimmed)

For a full example, check out this jupyter notebook on Github - feel free to adapt it as required.

Written on January 9, 2020