How to Serialize a PySpark GroupedData Object?
I am running a groupBy() on a dataset with several million records and want to save the resulting output (a PySpark GroupedData object) so that I can de-serialize it later and resume from that point.
Solution 1:
There is none, because GroupedData is not really a thing. It doesn't perform any operations on data at all. It only describes how the actual aggregation should proceed when you execute an action on the result of a subsequent agg.
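To see this laziness in action, here is a minimal sketch (the data and column names are hypothetical, not from the question) showing that groupBy returns a GroupedData object and no computation happens until an action runs on the result of agg:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical example data
df = spark.createDataFrame(
    [("a", 1), ("a", 2), ("b", 3)],
    ["key", "value"],
)

# groupBy returns a GroupedData object; no work is done at this point
grouped = df.groupBy("key")
print(type(grouped))  # <class 'pyspark.sql.group.GroupedData'>

# Computation only happens when an action runs on the result of agg
result = grouped.agg(F.sum("value").alias("total"))
result.show()
```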
You could probably serialize the underlying JVM object and restore it later, but it is a waste of time. Since groupBy only describes what has to be done, the cost of recreating the GroupedData object from scratch should be negligible.
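In practice, then, a common pattern is to persist the aggregated result rather than the GroupedData itself, and simply re-run the cheap groupBy if a different aggregation is needed later. A sketch under those assumptions (the output path and column names are hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 1), ("a", 2), ("b", 3)], ["key", "value"])

# Persist the *result* of the aggregation, not the GroupedData object
agg_df = df.groupBy("key").agg(F.sum("value").alias("total"))
agg_df.write.mode("overwrite").parquet("/tmp/agg_output")  # hypothetical path

# Later, possibly in a new session: reload the materialized result...
restored = spark.read.parquet("/tmp/agg_output")
restored.show()

# ...or, if a different aggregation is needed, just rebuild the GroupedData;
# groupBy only describes the work, so recreating it is essentially free
regrouped = df.groupBy("key")
```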