Converting a PySpark DataFrame into a nested JSON structure

Jan 08 2021

I am trying to convert the DataFrame below into nested JSON (as a string).

Input:

+---+---+-------+------+
| id|age|   name|number|
+---+---+-------+------+
|  1| 12|  smith|  uber|
|  2| 13|    jon| lunch|
|  3| 15|jocelyn|rental|
|  3| 15|  megan|   sds|
+---+---+-------+------+

Output:

+---+---+-----------------------------------------------------------------------------+
|id |age|values                                                                       |
+---+---+-----------------------------------------------------------------------------+
|1  |12 |[{"number": "uber", "name": "smith"}]                                        |
|2  |13 |[{"number": "lunch", "name": "jon"}]                                         |
|3  |15 |[{"number": "rental", "name": "megan"}, {"number": "sds", "name": "jocelyn"}]|
+---+---+-----------------------------------------------------------------------------+

My code:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, StructType, StringType, IntegerType

# Create (or reuse) a Spark session
spark = SparkSession.builder.getOrCreate()

# List of rows
data = [
    (1, 12, "smith", "uber"),
    (2, 13, "jon", "lunch"),
    (3, 15, "jocelyn", "rental"),
    (3, 15, "megan", "sds"),
]

# Create a schema for the dataframe.
# Note: 'number' is declared before 'name', so the third element of each
# tuple (e.g. "smith") is loaded into the 'number' column.
schema = StructType([
    StructField('id', IntegerType(), True),
    StructField('age', IntegerType(), True),
    StructField('number', StringType(), True),
    StructField('name', StringType(), True),
])

# Convert list to RDD
rdd = spark.sparkContext.parallelize(data)

# Create data frame
df = spark.createDataFrame(rdd, schema)

I tried using collect_list and collect_set, but could not get the desired output.

Answer

2 mck Jan 08 2021 at 14:24

You can use collect_list and to_json to collect an array of JSON strings for each group:

import pyspark.sql.functions as F

df2 = df.groupBy(
    'id', 'age'
).agg(
    F.collect_list(
        F.to_json(
            F.struct('number', 'name')
        )
    ).alias('values')
).orderBy(
    'id', 'age'
)

df2.show(truncate=False)
+---+---+-----------------------------------------------------------------------+
|id |age|values                                                                 |
+---+---+-----------------------------------------------------------------------+
|1  |12 |[{"number":"smith","name":"uber"}]                                     |
|2  |13 |[{"number":"jon","name":"lunch"}]                                      |
|3  |15 |[{"number":"jocelyn","name":"rental"}, {"number":"megan","name":"sds"}]|
+---+---+-----------------------------------------------------------------------+
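
To see what this aggregation computes without starting a Spark session, here is a minimal plain-Python sketch of the same grouping. It is only an illustration, not how Spark executes the query: the function name group_to_json is hypothetical, and the rows are keyed the way the question's schema loads them (third tuple element as "number", fourth as "name"). Also keep in mind that Spark does not guarantee the order of elements inside collect_list after a shuffle, whereas this sketch simply keeps input order.

```python
import json
from collections import defaultdict

# Same rows as in the question.
rows = [
    (1, 12, "smith", "uber"),
    (2, 13, "jon", "lunch"),
    (3, 15, "jocelyn", "rental"),
    (3, 15, "megan", "sds"),
]


def group_to_json(rows):
    """Group rows by (id, age) and collect one JSON string per row.

    Hypothetical helper mirroring groupBy + collect_list(to_json(struct)).
    """
    groups = defaultdict(list)
    for id_, age, number, name in rows:
        # Compact separators give the same spacing as the output above.
        groups[(id_, age)].append(
            json.dumps({"number": number, "name": name}, separators=(",", ":"))
        )
    # Sort by the grouping key, like orderBy('id', 'age').
    return [(id_, age, values) for (id_, age), values in sorted(groups.items())]


result = group_to_json(rows)
for row in result:
    print(row)
```

Each element of the values list is its own JSON string, which matches the array-of-strings column produced by the answer rather than one single JSON document per group.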