How to do a diff between 2 dataframes using Pandas and DataComPy
Pandas is a popular open-source Python library used for data manipulation, analysis, and visualization. It provides data structures and functions to efficiently handle structured data, including tabular data such as spreadsheets and SQL tables.
Key Features:
- Data Structures:
- Series (1-dimensional labeled array)
- DataFrame (2-dimensional labeled data structure with columns of potentially different types)
- Data Operations:
- Filtering, sorting, and grouping data
- Handling missing data
- Merging and joining datasets
- Data Input/Output:
- Reading from/writing to various file formats (e.g., CSV, Excel, JSON)
- Interacting with databases
Use Cases:
- Data analysis and science
- Data cleaning and preprocessing
- Data visualization
- Data mining and machine learning
DataCompy is a Python library used for comparing two datasets, typically two pandas DataFrames, and identifying the differences between them. It provides a simple and efficient way to detect discrepancies, making it a valuable tool for data analysis, data migration, and data quality assurance.
Key Features:
- Comparison of two datasets: DataCompy compares two datasets and highlights the differences.
- Flexible comparison options: Users can choose to compare datasets based on specific columns or rows.
- Detailed difference reports: DataCompy generates detailed reports showing the differences between the two datasets.
- Integration with pandas: DataCompy is designed to work seamlessly with pandas DataFrames.
Use Cases:
- Data migration: Verify data integrity after migration.
- Data quality assurance: Identify discrepancies in datasets.
- Data analysis: Compare datasets to identify trends and patterns.
Code Example
Sample dataset: https://github.com/datablist/sample-csv-files
import pandas as pd
# Load the CSV file into a DataFrame
df1 = pd.read_csv("customers.csv")
# Display the first few rows of the DataFrame
df1.head()
# Load the CSV file into a DataFrame
df2 = pd.read_csv("customers2.csv")
# Display the first few rows of the DataFrame
df2.head()
pip install datacompy
import datacompy
compare = datacompy.Compare(df1, df2, join_columns=['Customer Id', 'Customer Id'])
print(compare.report())
Output
DataComPy Comparison
--------------------
DataFrame Summary
-----------------
DataFrame Columns Rows
0 df1 12 100
1 df2 12 100
Column Summary
--------------
Number of columns in common: 12
Number of columns in df1 but not in df2: 0
Number of columns in df2 but not in df1: 0
Row Summary
-----------
Matched on: customer id, customer id
Any duplicates on match values: No
Absolute Tolerance: 0
Relative Tolerance: 0
Number of rows in common: 97
Number of rows in df1 but not in df2: 3
Number of rows in df2 but not in df1: 3
Number of rows with some compared columns unequal: 2
Number of rows with all compared columns equal: 95
Column Comparison
-----------------
Number of columns compared with some values unequal: 1
Number of columns compared with all values equal: 11
Total number of values which compare unequal: 2
Columns with Unequal Values or Types
------------------------------------
Column df1 dtype df2 dtype # Unequal Max Diff # Null Diff
0 last name object object 2 0.0 0
Sample Rows with Unequal Values
-------------------------------
customer id customer id last name (df1) last name (df2)
0 053d585Ab6b3159 053d585Ab6b3159 Bender Jones
1 17aD8e2dB3df03D 17aD8e2dB3df03D Garza Robinson
Sample Rows Only in df1 (First 10 Columns)
------------------------------------------
index customer id first name last name company city country phone 1 phone 2 email
0 2.0 1Ef7b82A4CAAD10 Preston Lozano Vega-Gentry East Jimmychester Djibouti 5153435776 686-620-1820x944 [email protected]
1 35.0 aA9BAFfBc3710fe Faith Moon Waters, Chase and Aguilar West Marthaburgh Bahamas +1-586-217-0359x6317 +1-818-199-1403 [email protected]
2 20.0 0F60FF3DdCd7aB0 Joanna Kirk Mays-Mccormick Jamesshire French Polynesia (266)131-7001x711 (283)312-5579x11543 [email protected]
Sample Rows Only in df2 (First 10 Columns)
------------------------------------------
index customer id first name last name company city country phone 1 phone 2 email
0 103.0 5354a0E336A91A4 Barry Allen Le, Nash and Cross Judymouth Honduras (753)813-6941 783.639.1472 [email protected]
1 102.0 4354a0E336A91A3 Clark Kent Le, Nash and Cross Judymouth Honduras (753)813-6941 783.639.1472 [email protected]
2 101.0 3354a0E336A91A2 Bruce Wayne Le, Nash and Cross Judymouth Honduras (753)813-6941 783.639.1472 [email protected]
Sample Rows with Unequal Values = Rows that were modified
Sample Rows Only in df1 = Rows that were deleted
Sample Rows Only in df2 (First 10 Columns) = Rows that were added