Sometimes the DataFrame we get from reading or parsing JSON cannot be used as-is for our processing or analysis.
The explode function comes to the rescue.
When df.printSchema() shows a column as an array of structs, using explode is a little trickier than with an array of simple elements: each exploded row holds a struct, so its fields still have to be selected by name.
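To make that concrete without starting a Spark session, here is a plain-Python sketch of what exploding an array of structs amounts to. The sample records are hypothetical, and the field names (id, names, first_name, middle_name, last_name) are simply the ones used in this post. It is an analogy for the behaviour, not Spark code.

```python
# Hypothetical records shaped like parsed JSON: an "id" plus a "names"
# array whose elements are structs (dicts here).
records = [
    {"id": 1, "names": [
        {"first_name": "Ann", "middle_name": "B", "last_name": "Cole"},
        {"first_name": "Annie", "middle_name": None, "last_name": "Cole"},
    ]},
    {"id": 2, "names": [
        {"first_name": "Raj", "middle_name": "K", "last_name": "Rao"},
    ]},
    {"id": 3, "names": []},  # empty array: no output row, as with explode
]


def explode_names(rows):
    """Yield one flat dict per element of each row's "names" array."""
    for row in rows:
        for name in row["names"]:
            # Pulling the struct's fields up next to "id" mirrors selecting
            # "col.first_name", "col.middle_name", "col.last_name" in Spark.
            yield {"id": row["id"], **name}


flat_rows = list(explode_names(records))
for flat_row in flat_rows:
    print(flat_row)
```

Record 1 contributes two rows, record 2 contributes one, and record 3 disappears because its array is empty.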
Here is a sample script that worked for me to explode an array of structs:
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode

spark = SparkSession.builder.appName('test-explode').getOrCreate()
# SparkSession reads JSON directly, so a separate SQLContext is not needed
df = spark.read.json("<json file name>")
# explode() emits one row per array element, in a column named "col" by default
exploded_df = (
    df.select("id", explode("names"))
      .select("id", "col.first_name", "col.middle_name", "col.last_name")
)
exploded_df.show()
```
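To filter on a condition (this assumes the DataFrame being filtered still has a GENDER column and a string names column, which the select above would otherwise drop):
```python
# Pull plain strings out of the collected Row objects so they can go into a set
male_names_list = [r["names"] for r in exploded_df.filter(exploded_df.GENDER == 'M').select("names").collect()]
female_names_list = [r["names"] for r in exploded_df.filter(exploded_df.GENDER == 'F').select("names").collect()]
# to get names that are common to both males and females:
compare_names = return_matches(male_names_list, female_names_list)
```
The helper that compares the two lists and returns the matches: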
```python
def return_matches(list1, list2):
    return list(set(list1) & set(list2))  # set intersection; order is not preserved
```
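As a quick sanity check, the helper can be tried on plain Python lists. The names below are made-up stand-ins for the collected results, and the output is sorted only because a set has no guaranteed order.

```python
def return_matches(list1, list2):
    # Set intersection keeps items present in both lists and drops duplicates
    return list(set(list1) & set(list2))


# Hypothetical name lists standing in for the collected male/female names
male_names = ["Alex", "Sam", "Jordan", "Sam"]
female_names = ["Sam", "Taylor", "Alex"]

common_names = sorted(return_matches(male_names, female_names))
print(common_names)  # ['Alex', 'Sam']
```

Note that the repeated "Sam" in the first list appears only once in the result.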