
How To Serialize Pyspark Groupeddata Object?

I am running a groupBy() on a dataset with several million records and want to save the resulting output (a PySpark GroupedData object) so that I can deserialize it later and resume the computation from that point.

Solution 1:

There is none, because GroupedData is not really a thing: it performs no operations on the data at all. It only describes how the actual aggregation should proceed once you call agg and then execute an action on the result.
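A minimal sketch illustrating this laziness. The toy DataFrame and its key/value columns are hypothetical stand-ins for the multi-million-row dataset; nothing is computed until the action at the end:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("groupeddata-demo").getOrCreate()

# Hypothetical toy data standing in for the real dataset.
df = spark.createDataFrame(
    [("a", 1), ("a", 2), ("b", 3)],
    ["key", "value"],
)

grouped = df.groupBy("key")   # GroupedData: no job runs here
print(type(grouped))          # <class 'pyspark.sql.group.GroupedData'>

# Only agg() turns it back into a DataFrame, and only an
# action (show, count, write, ...) actually triggers work.
result = grouped.agg({"value": "sum"})
result.show()
```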

You could probably serialize the underlying JVM object and restore it later, but that would be a waste of time. Since groupBy only describes what has to be done, the cost of recreating a GroupedData object from scratch is negligible.
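In practice that means persisting the aggregated DataFrame instead of the GroupedData object, and simply rebuilding the groupBy when needed. A sketch under the same toy-data assumptions as above; the /tmp output path is a hypothetical example location:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("persist-agg-demo").getOrCreate()
df = spark.createDataFrame(
    [("a", 1), ("a", 2), ("b", 3)],
    ["key", "value"],
)

# Materialize the *result* of the aggregation, not the GroupedData.
result = df.groupBy("key").agg({"value": "sum"})
result.write.mode("overwrite").parquet("/tmp/agg_result.parquet")

# Later (even in a new session): read the saved result back.
restored = spark.read.parquet("/tmp/agg_result.parquet")
restored.show()

# If the GroupedData itself is needed again, recreating it is
# essentially free, since it only records the grouping columns.
grouped_again = df.groupBy("key")
```

If computing the source DataFrame itself is expensive, caching it (df.cache()) or writing it out once and re-reading it is the usual way to avoid repeating that work, since the groupBy step adds no cost of its own.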
