
How To Serialize Pyspark Groupeddata Object?

I am running a groupBy() on a dataset with several million records and want to save the resulting output (a PySpark GroupedData object) so that I can deserialize it later and resume the computation from that point.

Solution 1:

There is none, because GroupedData is not really a thing: it performs no operations on the data at all. It only describes how the actual aggregation should proceed once you call agg and then execute an action on the result.
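A minimal sketch illustrating this laziness. The toy DataFrame and its key/value columns are hypothetical stand-ins for the multi-million-row dataset; nothing is computed until the action at the end:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("groupeddata-demo").getOrCreate()

# Hypothetical toy data standing in for the real dataset.
df = spark.createDataFrame(
    [("a", 1), ("a", 2), ("b", 3)],
    ["key", "value"],
)

grouped = df.groupBy("key")   # GroupedData: no job runs here
print(type(grouped))          # <class 'pyspark.sql.group.GroupedData'>

# Only agg() turns it back into a DataFrame, and only an
# action (show, count, write, ...) actually triggers work.
result = grouped.agg({"value": "sum"})
result.show()
```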

You could probably serialize the underlying JVM object and restore it later, but that would be a waste of time. Since groupBy only describes what has to be done, the cost of recreating a GroupedData object from scratch is negligible.
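In practice that means persisting the aggregated DataFrame instead of the GroupedData object, and simply rebuilding the groupBy when needed. A sketch under the same toy-data assumptions as above; the /tmp output path is a hypothetical example location:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("persist-agg-demo").getOrCreate()
df = spark.createDataFrame(
    [("a", 1), ("a", 2), ("b", 3)],
    ["key", "value"],
)

# Materialize the *result* of the aggregation, not the GroupedData.
result = df.groupBy("key").agg({"value": "sum"})
result.write.mode("overwrite").parquet("/tmp/agg_result.parquet")

# Later (even in a new session): read the saved result back.
restored = spark.read.parquet("/tmp/agg_result.parquet")
restored.show()

# If the GroupedData itself is needed again, recreating it is
# essentially free, since it only records the grouping columns.
grouped_again = df.groupBy("key")
```

If computing the source DataFrame itself is expensive, caching it (df.cache()) or writing it out once and re-reading it is the usual way to avoid repeating that work, since the groupBy step adds no cost of its own.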
