Generate comprehensive profile reports for pandas DataFrames with automated exploratory data analysis
Command-line interface for generating profiling reports directly from CSV files without writing Python code. The console interface provides a convenient way to generate reports in automated workflows, CI/CD pipelines, and for users who prefer command-line tools.
The ydata_profiling command-line tool allows direct profiling of CSV files with customizable options.
ydata_profiling [OPTIONS] INPUT_FILE OUTPUT_FILE

Parameters:
INPUT_FILE: Path to the CSV file to profile
OUTPUT_FILE: Path where the HTML report is saved
--title TEXT: Report title (default: "YData Profiling Report")
--config_file PATH: Path to a YAML configuration file
--minimal: Use minimal configuration for faster processing
--explorative: Use explorative configuration for comprehensive analysis
--sensitive: Use sensitive configuration for privacy-aware profiling
--pool_size INTEGER: Number of worker processes (default: 0, auto-detect CPU count)
--progress_bar / --no_progress_bar: Show/hide progress bar (default: show)
--help: Show help message and exit

Usage Examples:
# Basic usage
ydata_profiling data.csv report.html
# With custom title
ydata_profiling --title "Sales Data Analysis" sales.csv sales_report.html
# Using minimal mode for faster processing
ydata_profiling --minimal large_dataset.csv quick_report.html
# Using explorative mode for comprehensive analysis
ydata_profiling --explorative --title "Detailed Analysis" data.csv detailed_report.html
# With custom configuration file
ydata_profiling --config_file custom_config.yaml data.csv custom_report.html
# For sensitive data with privacy controls
ydata_profiling --sensitive customer_data.csv privacy_report.html
# With custom worker processes
ydata_profiling --pool_size 8 --title "Multi-threaded Analysis" data.csv report.html

For backward compatibility, the deprecated pandas_profiling command is also available:
pandas_profiling [OPTIONS] INPUT_FILE OUTPUT_FILE

Note: The pandas_profiling command has identical functionality to ydata_profiling but is deprecated. Use ydata_profiling for new projects.
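When scripting around the rename, the choice between the two entry points can be resolved at runtime. The sketch below assembles an invocation from the options documented above; the helper name is hypothetical and this is a stdlib-only sketch, not part of the package:

```python
import shutil

def build_profiling_cmd(input_file, output_file, title=None, minimal=False,
                        explorative=False, config_file=None, pool_size=None):
    """Assemble a profiling command, preferring the current entry point
    over the deprecated pandas_profiling alias when both are installed."""
    exe = (shutil.which("ydata_profiling")
           or shutil.which("pandas_profiling")
           or "ydata_profiling")
    cmd = [exe]
    if title:
        cmd += ["--title", title]
    if minimal:
        cmd.append("--minimal")
    if explorative:
        cmd.append("--explorative")
    if config_file:
        cmd += ["--config_file", config_file]
    if pool_size:
        cmd += ["--pool_size", str(pool_size)]
    cmd += [input_file, output_file]
    return cmd
```

The resulting list can be passed straight to subprocess.run, which avoids shell-quoting issues with titles that contain spaces.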
Use YAML configuration files with the console interface for complex customization:
Example config.yaml:
title: "Production Data Report"
pool_size: 4
progress_bar: true
dataset:
  description: "Customer transaction dataset"
  creator: "Data Engineering Team"
vars:
  num:
    low_categorical_threshold: 10
  cat:
    cardinality_threshold: 50
    redact: false
correlations:
  pearson:
    calculate: true
  spearman:
    calculate: true
plot:
  dpi: 300
  image_format: "png"
html:
  minify_html: true
  full_width: false

Usage:
ydata_profiling --config_file config.yaml data.csv production_report.html

Integrate the console interface into shell scripts and automation workflows:
Batch Processing Script:
#!/bin/bash
# Process multiple CSV files
for file in data/*.csv; do
  base_name=$(basename "$file" .csv)
  echo "Processing $file..."
  ydata_profiling \
    --title "Analysis of $base_name" \
    --explorative \
    --pool_size 4 \
    "$file" \
    "reports/${base_name}_report.html"
  echo "Report saved: reports/${base_name}_report.html"
done
echo "All files processed!"

CI/CD Pipeline Integration:
# In your CI/CD pipeline
ydata_profiling \
  --title "Data Quality Check - Build $BUILD_NUMBER" \
  --config_file .ydata_profiling_config.yaml \
  data/input.csv \
  artifacts/data_quality_report.html

# Check if report was generated successfully
if [ -f "artifacts/data_quality_report.html" ]; then
  echo "Data quality report generated successfully"
else
  echo "Failed to generate data quality report"
  exit 1
fi

The console interface provides meaningful exit codes for automation:
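The same exit-code handling can be scripted in Python via subprocess. This sketch mirrors the code-to-message table used in the shell example below; the wrapper function name is illustrative:

```python
import subprocess

# Exit-code meanings as documented for the console interface.
EXIT_MESSAGES = {
    0: "Success: Report generated",
    1: "Error: General processing failure",
    2: "Error: Cannot read input file",
    3: "Error: Cannot write output file",
    4: "Error: Invalid configuration",
}

def run_profiling(cmd):
    """Run a profiling command and translate its exit code to a message."""
    result = subprocess.run(cmd)
    code = result.returncode
    return code, EXIT_MESSAGES.get(code, f"Error: Unknown error (code: {code})")
```

A caller can branch on the returned code while logging the human-readable message.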
Example Error Handling:
#!/bin/bash
ydata_profiling data.csv report.html
exit_code=$?
case $exit_code in
  0)
    echo "Success: Report generated"
    ;;
  1)
    echo "Error: General processing failure"
    exit 1
    ;;
  2)
    echo "Error: Cannot read input file"
    exit 1
    ;;
  3)
    echo "Error: Cannot write output file"
    exit 1
    ;;
  4)
    echo "Error: Invalid configuration"
    exit 1
    ;;
  *)
    echo "Error: Unknown error (code: $exit_code)"
    exit 1
    ;;
esac

For optimal performance with the console interface:
Large Files:
# Use minimal mode for files > 1GB
ydata_profiling --minimal --pool_size 8 large_file.csv quick_report.html
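When even --minimal is slow, another option is to profile a random row sample as a quick preview before a full run. This stdlib-only sketch is illustrative and not a feature of the CLI; file names and the sampling fraction are arbitrary:

```python
import csv
import random

def sample_csv(src, dst, fraction=0.1, seed=42):
    """Copy the header plus a random ~fraction of data rows to dst."""
    rng = random.Random(seed)
    with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
        reader = csv.reader(fin)
        writer = csv.writer(fout)
        writer.writerow(next(reader))  # always keep the header row
        for row in reader:
            if rng.random() < fraction:
                writer.writerow(row)

# The sampled file can then be profiled quickly, e.g.:
#   ydata_profiling --minimal sample.csv preview_report.html
```

Sampling keeps column types and distributions roughly intact, but rare categories and exact counts will differ from the full file.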
# Custom configuration for memory optimization
echo "
pool_size: 8
infer_dtypes: false
correlations:
  auto:
    calculate: false
missing_diagrams:
  matrix: false
  dendrogram: false
" > minimal_config.yaml
ydata_profiling --config_file minimal_config.yaml large_file.csv report.html

Multiple Files:
# Process files in parallel using background processes
for file in data/*.csv; do
  ydata_profiling --minimal "$file" "reports/$(basename "$file" .csv).html" &
done
wait  # Wait for all background processes to complete

Common integration patterns with data processing tools:
Apache Airflow DAG:
from airflow import DAG
from airflow.operators.bash import BashOperator
profiling_task = BashOperator(
    task_id='generate_data_profile',
    bash_command='''
    ydata_profiling \
        --title "Daily Data Quality Report - {{ ds }}" \
        --config_file /opt/airflow/configs/profiling_config.yaml \
        /data/daily_data_{{ ds }}.csv \
        /reports/daily_profile_{{ ds }}.html
    ''',
    dag=dag,
)

Make Integration:
# Makefile for data profiling
.PHONY: profile-data
profile-data: data/processed.csv
	ydata_profiling \
		--title "Data Processing Report" \
		--explorative \
		data/processed.csv \
		reports/data_profile.html

reports/data_profile.html: data/processed.csv
	@mkdir -p reports
	ydata_profiling \
		--config_file config/profiling.yaml \
		$< $@

Install with Tessl CLI
npx tessl i tessl/pypi-ydata-profiling