Data profiling in Pandas using Python - GeeksforGeeks
Data profiling is a process in which you analyze and summarize a dataset to gain a better understanding of its structure, quality, and patterns. It helps in identifying potential issues, inconsistencies, and outliers in the data before performing any in-depth analysis or modeling. Pandas, a popular data manipulation library in Python, provides various functions to facilitate data profiling tasks. Let's go through the steps to perform data profiling using Pandas:
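The key steps are: import the required libraries; load the dataset with pd.read_csv() or another reader function suited to your data format; inspect the basic structure (first rows, dimensions, column names, data types, and summary statistics); count missing values; plot the distribution of the numeric columns; list unique values and value counts; compute correlations; and check for outliers.
These are some of the key steps in data profiling using Pandas. Depending on your dataset's complexity and specific requirements, you may need to perform additional analysis and data preprocessing steps. Data profiling is an iterative process, and you can combine various Pandas functions with visualization libraries like Matplotlib and Seaborn to gain deeper insights into your data.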
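The loading step depends on the data format, so here is a minimal sketch of reading the same records from CSV and from JSON. The in-memory strings and the column names (id, city, temp) are made up for illustration and stand in for real files; pd.read_csv() and pd.read_json() accept file paths as well as file-like objects.

```python
import io

import pandas as pd

# Hypothetical in-memory data standing in for real files on disk.
csv_text = "id,city,temp\n1,Oslo,3.5\n2,Lima,19.0\n"
json_text = (
    '[{"id": 1, "city": "Oslo", "temp": 3.5},'
    ' {"id": 2, "city": "Lima", "temp": 19.0}]'
)

# Both readers accept a path or a file-like object.
df_csv = pd.read_csv(io.StringIO(csv_text))
df_json = pd.read_json(io.StringIO(json_text))

print(df_csv.shape)
print(df_json.dtypes)
```

Whichever reader is used, the result is an ordinary DataFrame, so the remaining profiling steps stay the same.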
Full Code:
# Step 1: Import the required libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Step 2: Load the dataset
# Replace 'data.csv' with the actual filename and provide the correct path if needed
df = pd.read_csv('data.csv')
# Step 3: Get basic information about the dataset
print("First few rows of the DataFrame:")
print(df.head())
print("\nDimensions of the DataFrame (rows, columns):")
print(df.shape)
print("\nColumn names:")
print(df.columns)
print("\nData type information of each column:")
print(df.dtypes)
print("\nSummary statistics of the numeric columns:")
print(df.describe())
print("\nConcise summary of the DataFrame:")
# df.info() prints its report itself and returns None, so it is not wrapped in print()
df.info()
# Step 4: Handling missing values
print("\nMissing values in each column:")
print(df.isnull().sum())
# Visualize missing values using a heatmap
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.show()
# Step 5: Data Distribution
print("\nData Distribution (Histograms for numeric columns):")
df.hist(figsize=(10, 8))
plt.show()
# Step 6: Unique values and value counts
# Replace 'column_name' with the name of a column in your dataset
print("\nUnique values of a specific column:")
print(df['column_name'].unique())
print("\nValue counts of a specific column:")
print(df['column_name'].value_counts())
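# Step 7: Correlation Analysis
# numeric_only=True skips non-numeric columns, which otherwise raise an error in pandas 2.0 and later
correlation_matrix = df.corr(numeric_only=True)
print("\nCorrelation Matrix:")
print(correlation_matrix)
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.show()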
# Step 8: Outlier detection
# Draw one box plot per numeric column; points beyond the whiskers are potential outliers
for column in df.select_dtypes(include=np.number).columns:
    sns.boxplot(x=df[column])
    plt.show()
Replace 'data.csv' with the filename and path of your dataset, and 'column_name' with a column that exists in it. Make sure you have installed the required libraries (Pandas, NumPy, Seaborn, and Matplotlib) before running this code. The code gives you an overview of the dataset, reports and visualizes missing values, plots data distributions, lists unique values and value counts, analyzes correlations, and draws box plots that help you spot outliers. Remember that data profiling is an iterative process, and you can customize and expand this code based on your specific dataset and analysis needs.
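As one possible extension, the box plots from Step 8 can be complemented with a numeric count of outliers. The sketch below uses the common 1.5 × IQR rule, which is the same rule box plot whiskers follow by default; the function name count_iqr_outliers, the parameter k, and the sample data are hypothetical and not part of the original code.

```python
import numpy as np
import pandas as pd


def count_iqr_outliers(df: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] in each numeric column."""
    numeric = df.select_dtypes(include=np.number)
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparing a DataFrame with a Series aligns the Series index on the columns.
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum()


# Hypothetical sample data: one extreme price, no extreme quantities.
sample = pd.DataFrame({
    "price": [10, 11, 12, 13, 500],
    "qty": [1, 2, 3, 4, 5],
    "label": list("abcde"),
})
print(count_iqr_outliers(sample))
```

A value flagged this way is only a candidate outlier; whether it is an error or a legitimate extreme still depends on the dataset.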
Data profiling is a process of examining and analyzing datasets to identify various properties and attributes of the data, such as data types, null values, unique values, and outliers. It is a crucial step in data analysis and helps in improving the data quality and accuracy.
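The properties listed above (data types, null values, and unique values) can also be gathered into a single per-column table. This is a minimal sketch under assumed names: profile_columns and the sample DataFrame are illustrative, and only standard Pandas calls are used.

```python
import pandas as pd


def profile_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its dtype, missing count/percent, and unique count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isnull().sum(),
        "missing_pct": (df.isnull().mean() * 100).round(1),
        "unique": df.nunique(),  # nunique() ignores missing values by default
    })


# Hypothetical sample data with one missing value per column.
sample = pd.DataFrame({
    "age": [25, None, 31, 25],
    "city": ["Oslo", "Lima", None, "Oslo"],
})
print(profile_columns(sample))
```

Sorting this table by missing_pct or unique is a quick way to decide which columns need attention first.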