Generate comprehensive profile reports for pandas DataFrames with automated exploratory data analysis.
Compare multiple data profiling reports to identify differences, changes over time, or variations between datasets. This functionality enables tracking data drift, validating data transformations, and understanding how datasets evolve.
Primary function for comparing multiple ProfileReport objects or BaseDescription objects to generate comparative analysis.
def compare(
    reports: Union[List[ProfileReport], List[BaseDescription]],
    config: Optional[Settings] = None,
    compute: bool = False
) -> ProfileReport:
    """
    Compare multiple profiling reports and generate comparison analysis.

    Parameters:
    - reports: list of ProfileReport or BaseDescription objects to compare
    - config: optional Settings object for comparison configuration
    - compute: whether to immediately compute comparison results

    Returns:
    ProfileReport containing comparison results and visualizations

    Raises:
    ValueError: if reports list is empty or contains incompatible types
    """

Usage Example:
from ydata_profiling import ProfileReport, compare
import pandas as pd
# Create datasets
df1 = pd.read_csv('dataset_v1.csv')
df2 = pd.read_csv('dataset_v2.csv')
# Generate individual reports
report1 = ProfileReport(df1, title="Dataset Version 1")
report2 = ProfileReport(df2, title="Dataset Version 2")
# Compare reports
comparison_report = compare([report1, report2])
# Export comparison
comparison_report.to_file("comparison_analysis.html")
# Compare with custom configuration
from ydata_profiling.config import Settings
config = Settings()
config.title = "Dataset Evolution Analysis"
detailed_comparison = compare([report1, report2], config=config)

Compare one ProfileReport directly with another using the instance method.
def compare(self, other: 'ProfileReport', config: Optional[Settings] = None) -> 'ProfileReport':
    """
    Compare this report with another ProfileReport instance.

    Parameters:
    - other: another ProfileReport to compare against
    - config: optional configuration for comparison analysis

    Returns:
    New ProfileReport containing comparison results
    """

Usage Example:
# Create two reports
baseline_report = ProfileReport(baseline_df, title="Baseline Dataset")
current_report = ProfileReport(current_df, title="Current Dataset")
# Compare using instance method
drift_analysis = baseline_report.compare(current_report)
# Generate drift report
drift_analysis.to_file("data_drift_report.html")
# Access comparison results
comparison_description = drift_analysis.get_description()

Configuration options for customizing comparison analysis behavior and output.
class Settings:
    """
    Configuration class with comparison-specific settings.
    """
    # Comparison-specific configuration attributes
    title: str = "Report Comparison"
    pool_size: int = 0
    correlations: dict = {}
    missing_diagrams: dict = {}

Usage Example:
from ydata_profiling.config import Settings
# Create custom comparison configuration
comparison_config = Settings()
comparison_config.title = "Monthly Data Quality Comparison"
comparison_config.pool_size = 4
# Apply configuration to comparison
reports = [jan_report, feb_report, mar_report]
quarterly_comparison = compare(
    reports,
    config=comparison_config,
    compute=True
)
quarterly_comparison.to_file("quarterly_data_quality.html")

Compare multiple reports simultaneously to identify trends and patterns across multiple datasets or time periods.
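Because `compare` is documented above as raising `ValueError` for an empty list or incompatible types, a batch of reports can be checked before the comparison is started. The following is a minimal sketch under that assumption; `check_reports` is a hypothetical helper, not part of ydata-profiling, and it only inspects the list length and the Python types of its items.

```python
from typing import List, Sequence


def check_reports(reports: Sequence[object]) -> List[object]:
    """Hypothetical pre-flight check; not part of ydata-profiling.

    Mirrors the two failure modes named in the compare() docstring:
    an empty list, or a list whose items are not all of one type.
    """
    checked = list(reports)
    if not checked:
        raise ValueError("reports list is empty")

    # Collect the distinct Python types present in the batch.
    kinds = {type(report) for report in checked}
    if len(kinds) > 1:
        names = sorted(kind.__name__ for kind in kinds)
        raise ValueError(f"reports mix incompatible types: {names}")

    return checked
```

The returned list can then be passed to `compare` in the same way as in the other examples in this section.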
Usage Example:
# Time series of datasets
monthly_reports = []
for month in ['jan', 'feb', 'mar', 'apr', 'may']:
    df = pd.read_csv(f'data_{month}.csv')
    report = ProfileReport(df, title=f"Data - {month.title()}")
    monthly_reports.append(report)
# Compare all monthly reports
trend_analysis = compare(monthly_reports)
trend_analysis.to_file("monthly_trends.html")
# Access trend data
trends = trend_analysis.get_description()

Different comparison scenarios and their specific use cases.
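The scenarios that follow share one shape: build a report for each of two DataFrames, compare them, and write the result to an HTML file. A minimal sketch of that pattern as a reusable function is shown here. `compare_frames` and its `report_factory` parameter are hypothetical names introduced for illustration; the sketch relies only on the `ProfileReport(df, title=...)`, `.compare(...)` and `.to_file(...)` calls used elsewhere in this document.

```python
from typing import Any, Callable, Optional


def compare_frames(
    baseline_df: Any,
    current_df: Any,
    baseline_title: str,
    current_title: str,
    output_path: str,
    report_factory: Optional[Callable[..., Any]] = None,
) -> Any:
    """Hypothetical wrapper around the two-report comparison pattern."""
    if report_factory is None:
        # Imported lazily so the helper can be defined (and tested with a
        # stand-in factory) without ydata-profiling installed.
        from ydata_profiling import ProfileReport

        report_factory = ProfileReport

    # Build one report per DataFrame, as in the scenario examples.
    baseline = report_factory(baseline_df, title=baseline_title)
    current = report_factory(current_df, title=current_title)

    # Compare via the instance method and write the HTML output.
    comparison = baseline.compare(current)
    comparison.to_file(output_path)
    return comparison
```

With this helper, a drift check would reduce to a single call such as `compare_frames(baseline_df, current_df, "Production Baseline", "Current Production", "drift_monitoring.html")`.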
Data Drift Detection:
# Monitor data drift over time
production_baseline = ProfileReport(baseline_df, title="Production Baseline")
current_production = ProfileReport(current_df, title="Current Production")
drift_report = production_baseline.compare(current_production)
drift_report.to_file("drift_monitoring.html")

A/B Testing Analysis:
# Compare control vs treatment datasets
control_report = ProfileReport(control_df, title="Control Group")
treatment_report = ProfileReport(treatment_df, title="Treatment Group")
ab_comparison = compare([control_report, treatment_report])
ab_comparison.to_file("ab_test_analysis.html")

Data Pipeline Validation:
# Compare before and after data transformations
raw_data_report = ProfileReport(raw_df, title="Raw Data")
processed_data_report = ProfileReport(processed_df, title="Processed Data")
pipeline_validation = raw_data_report.compare(processed_data_report)
pipeline_validation.to_file("pipeline_validation.html")

Key features available in comparison reports: