Tuesday, January 21, 2020

Explode function using PySpark

Sometimes, the data frame we get by reading/parsing JSON cannot be used as-is for our processing or analysis.

Explode function to the rescue.

When df.printSchema() shows a column that is an array of structs, using the explode function is a little trickier than exploding an array of plain elements.
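For context, here is a minimal, runnable sketch of the kind of schema this post works with. The record layout and field names (id, names, first_name, middle_name, last_name, GENDER) are hypothetical placeholders for whatever your JSON actually contains:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('test-explode').getOrCreate()

# One hypothetical JSON record where "names" is an array of structs
sample_json = ['{"id": 1, "names": [{"first_name": "John", "middle_name": "A", '
               '"last_name": "Doe", "GENDER": "M"}]}']
df = spark.read.json(spark.sparkContext.parallelize(sample_json))

df.printSchema()
# root
#  |-- id: long (nullable = true)
#  |-- names: array (nullable = true)
#  |    |-- element: struct (containsNull = true)
#  |    |    |-- GENDER: string (nullable = true)
#  |    |    |-- first_name: string (nullable = true)
#  |    |    |-- last_name: string (nullable = true)
#  |    |    |-- middle_name: string (nullable = true)
```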

Here is a sample script that worked for me to explode an array of structs:


"""python

from pyspark.sql import SQLContext, SparkSession
from pyspark.sql.functions import explode

spark = SparkSession.builder.appName('test-explode').getOrCreate()

sqlContext = SQLContext(spark)
df = sqlContext.read.json("<json file name>")


exploded_df = df.select("id", explode("names")).select("id", "col.first_name", "col.middle_name", "col.last_name")

exploded_df.show()

"""


To filter the exploded rows based on a condition, for example by gender:

```python
# Compare the two lists and return the values they have in common
def return_matches(list1, list2):
    return list(set(list1) & set(list2))

# Collect the first names for each gender
# (GENDER was pulled out of the struct in the script above)
male_names_list = [row.first_name for row in
                   exploded_df.filter(exploded_df.GENDER == 'M').select("first_name").collect()]
female_names_list = [row.first_name for row in
                     exploded_df.filter(exploded_df.GENDER == 'F').select("first_name").collect()]

# First names that appear for both males and females
compare_names = return_matches(male_names_list, female_names_list)
```
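If the goal is only to find the overlap, the comparison can also stay inside Spark instead of collecting both lists to the driver. A minimal alternative sketch using DataFrame.intersect(), assuming the exploded_df built above:

```python
# Distinct first names per gender, compared entirely on the Spark side
male_df = exploded_df.filter(exploded_df.GENDER == 'M').select("first_name")
female_df = exploded_df.filter(exploded_df.GENDER == 'F').select("first_name")

# intersect() returns the distinct rows present in both DataFrames
common_names_df = male_df.intersect(female_df)
common_names_df.show()
```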