Generate comprehensive profile reports for pandas DataFrames with automated exploratory data analysis
Command-line interface for generating profiling reports directly from CSV files without writing Python code. The console interface provides a convenient way to generate reports in automated workflows, CI/CD pipelines, and for users who prefer command-line tools.
The ydata_profiling command-line tool allows direct profiling of CSV files with customizable options.
ydata_profiling [OPTIONS] INPUT_FILE OUTPUT_FILE

Parameters:
INPUT_FILE: Path to the CSV file to profile
OUTPUT_FILE: Path where the HTML report is saved
--title TEXT: Report title (default: "YData Profiling Report")
--config_file PATH: Path to a YAML configuration file
--minimal: Use minimal configuration for faster processing
--explorative: Use explorative configuration for comprehensive analysis
--sensitive: Use sensitive configuration for privacy-aware profiling
--pool_size INTEGER: Number of worker processes (default: 0, auto-detect CPU count)
--progress_bar / --no_progress_bar: Show/hide progress bar (default: show)
--help: Show help message and exit

Usage Examples:
# Basic usage
ydata_profiling data.csv report.html
# With custom title
ydata_profiling --title "Sales Data Analysis" sales.csv sales_report.html
# Using minimal mode for faster processing
ydata_profiling --minimal large_dataset.csv quick_report.html
# Using explorative mode for comprehensive analysis
ydata_profiling --explorative --title "Detailed Analysis" data.csv detailed_report.html
# With custom configuration file
ydata_profiling --config_file custom_config.yaml data.csv custom_report.html
# For sensitive data with privacy controls
ydata_profiling --sensitive customer_data.csv privacy_report.html
# With custom worker processes
ydata_profiling --pool_size 8 --title "Multi-threaded Analysis" data.csv report.html

For backward compatibility, the deprecated pandas_profiling command is also available:
pandas_profiling [OPTIONS] INPUT_FILE OUTPUT_FILE

Note: The pandas_profiling command has identical functionality to ydata_profiling but is deprecated. Use ydata_profiling for new projects.
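When scripting around the rename, the choice between the two entry points can be resolved at runtime. The sketch below assembles an invocation from the options documented above; the helper name is hypothetical and this is a stdlib-only sketch, not part of the package:

```python
import shutil

def build_profiling_cmd(input_file, output_file, title=None, minimal=False,
                        explorative=False, config_file=None, pool_size=None):
    """Assemble a profiling command, preferring the current entry point
    over the deprecated pandas_profiling alias when both are installed."""
    exe = (shutil.which("ydata_profiling")
           or shutil.which("pandas_profiling")
           or "ydata_profiling")
    cmd = [exe]
    if title:
        cmd += ["--title", title]
    if minimal:
        cmd.append("--minimal")
    if explorative:
        cmd.append("--explorative")
    if config_file:
        cmd += ["--config_file", config_file]
    if pool_size:
        cmd += ["--pool_size", str(pool_size)]
    cmd += [input_file, output_file]
    return cmd
```

The resulting list can be passed straight to subprocess.run, which avoids shell-quoting issues with titles that contain spaces.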
Use YAML configuration files with the console interface for complex customization:
Example config.yaml:
title: "Production Data Report"
pool_size: 4
progress_bar: true
dataset:
  description: "Customer transaction dataset"
  creator: "Data Engineering Team"
vars:
  num:
    low_categorical_threshold: 10
  cat:
    cardinality_threshold: 50
    redact: false
correlations:
  pearson:
    calculate: true
  spearman:
    calculate: true
plot:
  dpi: 300
  image_format: "png"
html:
  minify_html: true
  full_width: false

Usage:
ydata_profiling --config_file config.yaml data.csv production_report.html

Integrate the console interface into shell scripts and automation workflows:
Batch Processing Script:
#!/bin/bash
# Process multiple CSV files
for file in data/*.csv; do
  base_name=$(basename "$file" .csv)
  echo "Processing $file..."
  ydata_profiling \
    --title "Analysis of $base_name" \
    --explorative \
    --pool_size 4 \
    "$file" \
    "reports/${base_name}_report.html"
  echo "Report saved: reports/${base_name}_report.html"
done
echo "All files processed!"

CI/CD Pipeline Integration:
# In your CI/CD pipeline
ydata_profiling \
  --title "Data Quality Check - Build $BUILD_NUMBER" \
  --config_file .ydata_profiling_config.yaml \
  data/input.csv \
  artifacts/data_quality_report.html

# Check if report was generated successfully
if [ -f "artifacts/data_quality_report.html" ]; then
  echo "Data quality report generated successfully"
else
  echo "Failed to generate data quality report"
  exit 1
fi

The console interface provides meaningful exit codes for automation:
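The same exit-code handling can be scripted in Python via subprocess. This sketch mirrors the code-to-message table used in the shell example below; the wrapper function name is illustrative:

```python
import subprocess

# Exit-code meanings as documented for the console interface.
EXIT_MESSAGES = {
    0: "Success: Report generated",
    1: "Error: General processing failure",
    2: "Error: Cannot read input file",
    3: "Error: Cannot write output file",
    4: "Error: Invalid configuration",
}

def run_profiling(cmd):
    """Run a profiling command and translate its exit code to a message."""
    result = subprocess.run(cmd)
    code = result.returncode
    return code, EXIT_MESSAGES.get(code, f"Error: Unknown error (code: {code})")
```

A caller can branch on the returned code while logging the human-readable message.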
Example Error Handling:
#!/bin/bash
ydata_profiling data.csv report.html
exit_code=$?
case $exit_code in
  0)
    echo "Success: Report generated"
    ;;
  1)
    echo "Error: General processing failure"
    exit 1
    ;;
  2)
    echo "Error: Cannot read input file"
    exit 1
    ;;
  3)
    echo "Error: Cannot write output file"
    exit 1
    ;;
  4)
    echo "Error: Invalid configuration"
    exit 1
    ;;
  *)
    echo "Error: Unknown error (code: $exit_code)"
    exit 1
    ;;
esac

For optimal performance with the console interface:
Large Files:
# Use minimal mode for files > 1GB
ydata_profiling --minimal --pool_size 8 large_file.csv quick_report.html
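When even --minimal is slow, another option is to profile a random row sample as a quick preview before a full run. This stdlib-only sketch is illustrative and not a feature of the CLI; file names and the sampling fraction are arbitrary:

```python
import csv
import random

def sample_csv(src, dst, fraction=0.1, seed=42):
    """Copy the header plus a random ~fraction of data rows to dst."""
    rng = random.Random(seed)
    with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
        reader = csv.reader(fin)
        writer = csv.writer(fout)
        writer.writerow(next(reader))  # always keep the header row
        for row in reader:
            if rng.random() < fraction:
                writer.writerow(row)

# The sampled file can then be profiled quickly, e.g.:
#   ydata_profiling --minimal sample.csv preview_report.html
```

Sampling keeps column types and distributions roughly intact, but rare categories and exact counts will differ from the full file.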
# Custom configuration for memory optimization
echo "
pool_size: 8
infer_dtypes: false
correlations:
  auto:
    calculate: false
missing_diagrams:
  matrix: false
  dendrogram: false
" > minimal_config.yaml
ydata_profiling --config_file minimal_config.yaml large_file.csv report.html

Multiple Files:
# Process files in parallel using background processes
for file in data/*.csv; do
  ydata_profiling --minimal "$file" "reports/$(basename "$file" .csv).html" &
done
wait  # Wait for all background processes to complete

Common integration patterns with data processing tools:
Apache Airflow DAG:
from airflow import DAG
from airflow.operators.bash import BashOperator
profiling_task = BashOperator(
    task_id='generate_data_profile',
    bash_command='''
    ydata_profiling \
        --title "Daily Data Quality Report - {{ ds }}" \
        --config_file /opt/airflow/configs/profiling_config.yaml \
        /data/daily_data_{{ ds }}.csv \
        /reports/daily_profile_{{ ds }}.html
    ''',
    dag=dag,
)

Make Integration:
# Makefile for data profiling
.PHONY: profile-data
profile-data: data/processed.csv
	ydata_profiling \
		--title "Data Processing Report" \
		--explorative \
		data/processed.csv \
		reports/data_profile.html

reports/data_profile.html: data/processed.csv
	@mkdir -p reports
	ydata_profiling \
		--config_file config/profiling.yaml \
		$< $@

Install with Tessl CLI
npx tessl i tessl/pypi-ydata-profiling