Update The Nested Json With Another Nested Json Using Python

I have one full nested JSON document and need to update it with the latest values from another nested JSON document. Can anyone help me with this? I want to implement this

Solution 1:

You can load the two JSON files into Spark DataFrames and do a left join on id to pick up the updates from the latest JSON data:

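from pyspark.sql import functions as F

# Assumes an active SparkSession named `spark`; full_json_path and
# latest_json_path point to the full and the latest JSON files.
full_json_df = spark.read.json(full_json_path, multiLine=True)
latest_json_df = spark.read.json(latest_json_path, multiLine=True)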

updated_df = full_json_df.alias("full").join(
    latest_json_df.alias("latest"),
    F.col("full.id") == F.col("latest.id"),
    "left"
).select(
    F.col("full.id"),
    *[
        F.when(F.col("latest.id").isNotNull(), F.col(f"latest.{c}")).otherwise(F.col(f"full.{c}")).alias(c)
        for c in full_json_df.columns if c != 'id'
    ]
)

updated_df.show(truncate=False)

#+----+------------+---------+-----------------------------------------------------------------------------------------------------+--------+
#|id  |email       |firstName|layer01                                                                                              |surname |
#+----+------------+---------+-----------------------------------------------------------------------------------------------------+--------+
#|6304|test@xxx.com|name01   |[value1, value2, value3, value4, [value1_changedData, value2], [[inner value01,], [, inner_value02]]]|Optional|
#+----+------------+---------+-----------------------------------------------------------------------------------------------------+--------+
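To make the join logic easier to follow, here is a minimal plain-Python sketch of the same idea on lists of dicts. It is only an illustration, not part of the Spark answer: the function name update_rows_by_id and the sample rows are hypothetical, and it assumes every row has an "id" key.

```python
def update_rows_by_id(full_rows, latest_rows):
    """Left join on "id": a full row with a match in latest takes latest's values."""
    latest_by_id = {row["id"]: row for row in latest_rows}
    updated = []
    for row in full_rows:
        match = latest_by_id.get(row["id"])
        if match is None:
            # No update for this id: keep the full row unchanged.
            updated.append(dict(row))
        else:
            # Same role as the when/otherwise above: every non-id column is
            # taken from the latest row. In this sketch a key that is absent
            # from the latest row becomes None.
            new_row = {"id": row["id"]}
            for column in row:
                if column != "id":
                    new_row[column] = match.get(column)
            updated.append(new_row)
    return updated


# Hypothetical sample data, for illustration only.
full_rows = [
    {"id": 1, "email": "a@xxx.com", "firstName": "old_name"},
    {"id": 2, "email": "b@xxx.com", "firstName": "unchanged"},
]
latest_rows = [{"id": 1, "email": "a@xxx.com", "firstName": "new_name"}]

updated_rows = update_rows_by_id(full_rows, latest_rows)
```

Note that, as in the Spark version, a matching row is overwritten column by column, so a nested column is replaced as a whole rather than merged field by field.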

Update:

If the schema differs between the full and latest JSON files, you can load both files into the same DataFrame (this way the schemas are merged) and then deduplicate per id, giving priority to the rows that come from the latest file:

from pyspark.sql import Window
from pyspark.sql import functions as F

merged_json_df = spark.read.json("/path/to/{full_json.json,latest_json.json}", multiLine=True)

# order priority: latest file, then full
w = Window.partitionBy(F.col("id")).orderBy(F.when(F.input_file_name().like('%latest%'), 0).otherwise(1))

updated_df = merged_json_df.withColumn("rn", F.row_number().over(w))\
    .filter("rn = 1")\
    .drop("rn")

updated_df.show(truncate=False)
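Both Spark snippets replace whole top-level columns (or whole rows) for a matching id. If the files are small enough to fit in memory and you want nested keys merged field by field, a recursive update in plain Python may be enough. The following is a sketch under those assumptions, not the method from the answer above: deep_update is a hypothetical helper name, the sample dicts are made up, and lists are replaced as a whole rather than merged element by element.

```python
import copy


def deep_update(full, latest):
    """Return a copy of `full` with the values from `latest` merged in recursively."""
    merged = copy.deepcopy(full)
    for key, value in latest.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are objects: merge them key by key.
            merged[key] = deep_update(merged[key], value)
        else:
            # Scalars, lists, and new keys: the latest value wins.
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical sample documents, for illustration only.
full_doc = {
    "id": 6304,
    "firstName": "name01",
    "layer01": {"key1": "value1", "layer02": {"key1": "value1", "key2": "value2"}},
}
latest_doc = {"id": 6304, "layer01": {"layer02": {"key1": "value1_changedData"}}}

merged_doc = deep_update(full_doc, latest_doc)
print(merged_doc["layer01"]["layer02"])
# {'key1': 'value1_changedData', 'key2': 'value2'}
```

With real files, the two dicts would come from json.load on each file, and the result could be written back with json.dump.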
