Python Pandas | SQL with Manoj

Python – Delete/remove unwanted rows from a DataFrame

April 15, 2021

Once you start using Python you will likely fall in love with it, as it lets you express complex logic in a simple, short, and quick way. Here we will see how to remove rows from a DataFrame based on a list of invalid items.

Let’s create a sample Pandas DataFrame for our demo purpose:

import pandas as pd

sampleData = {
  'CustId': list(range(101, 111)),
  'CustomerName': ['Cust'+str(x) for x in range(101, 111)]}

cdf = pd.DataFrame(sampleData)

invalidList = [102, 103, 104]

The 'CustId' and 'CustomerName' entries of the dictionary above create 10 records with CustId ranging from 101 to 110 and matching CustomerName values such as Cust101, Cust102, etc.

In the code below we use the isin() function to get only the records whose CustId is present in invalidList. This fetches just the 3 invalid records from the DataFrame:

df = cdf[cdf.CustId.isin(invalidList)]

df

To get the records not present in invalidList, we just put the “~” operator in front of the isin() condition to invert it. This fetches the other 7 valid records from the DataFrame:

df = cdf[~cdf.CustId.isin(invalidList)]

df
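To check both filters side by side, here is a minimal self-contained sketch that rebuilds the same sample data. The names mask, invalid_df and valid_df are only illustrative and are not part of the code above.

```python
import pandas as pd

sampleData = {
    'CustId': list(range(101, 111)),
    'CustomerName': ['Cust' + str(x) for x in range(101, 111)],
}
cdf = pd.DataFrame(sampleData)
invalidList = [102, 103, 104]

# Boolean Series: True where CustId appears in invalidList
mask = cdf['CustId'].isin(invalidList)

invalid_df = cdf[mask]    # rows whose CustId is in the list
valid_df = cdf[~mask]     # "~" flips the mask, keeping the remaining rows

print(len(invalid_df), len(valid_df))
print(valid_df['CustId'].tolist())
```

Filtering keeps the original index labels, so if you need them renumbered from 0 you can call reset_index(drop=True) on the result.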


Python error: while converting Pandas Dataframe or Python List to Spark Dataframe (Can not merge type)

April 8, 2021

Data type casting errors are common when you work with DataFrames across different languages and libraries. In this case I got a type-mixing error while converting a Pandas DataFrame to a Spark DataFrame:

import pandas as pd
pd_df = pd.DataFrame([(101, 'abc'), 
                      ('def', 201), 
                      ('xyz', 'pqr')], 
                     columns=['col1', 'col2'])

df = spark.createDataFrame(pd_df)
display(df)

TypeError:
field col1: Can not merge type <class 'pyspark.sql.types.LongType'> and 
<class 'pyspark.sql.types.StringType'>

The conversion from Pandas DataFrame to Spark DataFrame throws this error because Spark cannot infer a single data type for the columns, as each column holds a mix of integer and string values.
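One quick way to spot such columns before handing the data to Spark is to look at the Python types stored in each pandas column. This is a small sketch that uses only pandas; types_per_column and mixed_columns are illustrative names, not part of any library.

```python
import pandas as pd

pd_df = pd.DataFrame([(101, 'abc'),
                      ('def', 201),
                      ('xyz', 'pqr')],
                     columns=['col1', 'col2'])

# Collect the names of the Python types found in each column
types_per_column = {
    col: sorted({type(v).__name__ for v in pd_df[col]})
    for col in pd_df.columns
}
print(types_per_column)

# Columns holding more than one Python type are the ones that trip up inference
mixed_columns = [col for col, names in types_per_column.items() if len(names) > 1]
print(mixed_columns)
```

Here both columns are reported as mixed, since each one contains an int and at least one str.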

In this case you just need to explicitly tell Spark which data type to use by creating a schema and passing it to createDataFrame(), as shown below:

import pandas as pd
pd_df = pd.DataFrame([(101, 'abc'), 
                      ('def', 201), 
                      ('xyz', 'pqr')], 
                     columns=['col1', 'col2'])

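from pyspark.sql.types import StructType, StructField, StringType
# Declare both columns as nullable strings so Spark does not have to infer the types
df_schema = StructType([StructField("col1", StringType(), True),
                        StructField("col2", StringType(), True)])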

df = spark.createDataFrame(pd_df, schema=df_schema)
display(df)
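An alternative, offered here as an assumption rather than as part of the fix above, is to make the columns uniform on the pandas side first by casting everything to strings with astype(str). The Spark call is left as a comment because it needs an active spark session like the one in the notebook; pd_str_df is an illustrative name.

```python
import pandas as pd

pd_df = pd.DataFrame([(101, 'abc'),
                      ('def', 201),
                      ('xyz', 'pqr')],
                     columns=['col1', 'col2'])

# Cast every value to str so each column holds a single Python type
pd_str_df = pd_df.astype(str)

print(pd_str_df['col1'].tolist())
print(pd_str_df['col2'].tolist())

# With a live SparkSession named `spark` (not created in this sketch):
# df = spark.createDataFrame(pd_str_df)
```

This trades the explicit schema for a one-line cast, which may be convenient when every column should end up as a string anyway.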