# Data Cleaning Example

Shows how to clean messy raw data before visualization.
## Problem

Raw CSV data has:
- Missing values
- Inconsistent column names
- Unnecessary columns
- Outliers or invalid values
## Solution Pipeline

```yaml
transformations:
  # Step 1: Remove records with missing required fields
  # (the raw columns are firstName, lastName, and emailAddress;
  # they are renamed in Step 3)
  - type: filter
    configuration:
      where:
        emailAddress: { ne: null }
        firstName: { ne: null }
        lastName: { ne: null }
  # Step 2: Keep only needed columns
  - type: select
    configuration:
      columns: ["id", "firstName", "lastName", "emailAddress", "salary"]
  # Step 3: Standardize column names
  - type: rename
    configuration:
      mapping:
        firstName: "first_name"
        lastName: "last_name"
        emailAddress: "email"
  # Step 4: Filter out invalid salaries
  - type: filter
    configuration:
      where:
        salary: { gt: 0 }
  # Step 5: Add computed display name
  - type: derive
    configuration:
      columns:
        display_name: "`${first_name} ${last_name}`"
```

## Step-by-Step Transformation
### Input Data (8 records with issues)
| id | firstName | lastName | emailAddress | internalID | salary |
|----|------------|----------|------------------------|------------|--------|
| 1 | Alice | Smith | alice@example.com | INT-001 | 80000 |
| 2 | Bob | Jones | null | INT-002 | 75000 |
| 3 | Charlie | null | charlie@example.com | INT-003 | 0 |
| 4 | Diana | Prince | diana@example.com | INT-004 | 90000 |
| 5 | null | Brown | ebrown@example.com | INT-005 | 70000 |
| 6 | Frank | Garcia | fgarcia@example.com | INT-006 | -5000 |
| 7 | Grace | Wilson | grace@example.com | INT-007 | 85000 |
| 8  | Henry      | Moore    | henry@example.com      | INT-008    | 95000  |

Issues: a missing email (row 2), missing names (rows 3 and 5), an unnecessary internalID column, and invalid salary values (0 and -5000).
### After Step 1: Filter Missing Values
Removes rows missing email or name:
- Removes: Row 2 (missing email), Row 3 (missing last name), Row 5 (missing first name)
- Keeps: Rows 1, 4, 6, 7, 8 (5 records)
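In plain Python, this step amounts to a null check across the required raw columns. A minimal sketch (hypothetical helper code, not the actual transformation engine), using a subset of the input rows:

```python
# Sketch: Step 1 as a null check on the raw column names.
records = [
    {"id": 1, "firstName": "Alice",   "lastName": "Smith", "emailAddress": "alice@example.com"},
    {"id": 2, "firstName": "Bob",     "lastName": "Jones", "emailAddress": None},
    {"id": 3, "firstName": "Charlie", "lastName": None,    "emailAddress": "charlie@example.com"},
    {"id": 5, "firstName": None,      "lastName": "Brown", "emailAddress": "ebrown@example.com"},
]

REQUIRED = ("firstName", "lastName", "emailAddress")

# Keep only records where every required field is present.
kept = [r for r in records if all(r[c] is not None for c in REQUIRED)]
print([r["id"] for r in kept])  # [1]
```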
### After Step 2: Select Columns
Keeps only: id, firstName, lastName, emailAddress, salary
- Removes: internalID column
- Result: 5 records with 5 columns
### After Step 3: Rename Columns
Standardizes names for consistency:
- firstName → first_name
- lastName → last_name
- emailAddress → email
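Steps 2 and 3 together are a projection followed by a re-keying of each record. A plain-Python sketch (illustrative only; the engine's internals are not shown in this document):

```python
# Sketch: select a subset of columns, then rename them to snake_case.
KEEP = ["id", "firstName", "lastName", "emailAddress", "salary"]
RENAME = {"firstName": "first_name", "lastName": "last_name", "emailAddress": "email"}

def select_and_rename(record):
    # Project to the kept columns (missing columns are silently skipped).
    projected = {k: record[k] for k in KEEP if k in record}
    # Re-key using the rename mapping; unmapped columns keep their names.
    return {RENAME.get(k, k): v for k, v in projected.items()}

row = {"id": 1, "firstName": "Alice", "lastName": "Smith",
       "emailAddress": "alice@example.com", "internalID": "INT-001", "salary": 80000}
print(select_and_rename(row))
# {'id': 1, 'first_name': 'Alice', 'last_name': 'Smith',
#  'email': 'alice@example.com', 'salary': 80000}
```

Note that the internalID column disappears in the projection before the rename runs.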
### After Step 4: Filter Invalid Salaries
Removes salary ≤ 0:
- Removes: Row 6 (salary = -5000)
- Keeps: Rows 1, 4, 7, 8 (4 records)
### After Step 5: Derive Display Name
Adds computed column:
| id | first_name | last_name | email | salary | display_name |
|----|------------|-----------|---------------------|--------|-----------------|
| 1 | Alice | Smith | alice@example.com | 80000 | Alice Smith |
| 4 | Diana | Prince | diana@example.com | 90000 | Diana Prince |
| 7 | Grace | Wilson | grace@example.com | 85000 | Grace Wilson |
| 8  | Henry      | Moore     | henry@example.com   | 95000  | Henry Moore     |

Final Result: 4 clean, validated records ready for visualization.
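The full five-step pipeline can be reproduced end to end over the example input with a short plain-Python sketch (hypothetical helper code, not the actual engine):

```python
# Sketch: all five cleaning steps over the 8-row example input.
raw = [
    {"id": 1, "firstName": "Alice",   "lastName": "Smith",  "emailAddress": "alice@example.com",   "internalID": "INT-001", "salary": 80000},
    {"id": 2, "firstName": "Bob",     "lastName": "Jones",  "emailAddress": None,                  "internalID": "INT-002", "salary": 75000},
    {"id": 3, "firstName": "Charlie", "lastName": None,     "emailAddress": "charlie@example.com", "internalID": "INT-003", "salary": 0},
    {"id": 4, "firstName": "Diana",   "lastName": "Prince", "emailAddress": "diana@example.com",   "internalID": "INT-004", "salary": 90000},
    {"id": 5, "firstName": None,      "lastName": "Brown",  "emailAddress": "ebrown@example.com",  "internalID": "INT-005", "salary": 70000},
    {"id": 6, "firstName": "Frank",   "lastName": "Garcia", "emailAddress": "fgarcia@example.com", "internalID": "INT-006", "salary": -5000},
    {"id": 7, "firstName": "Grace",   "lastName": "Wilson", "emailAddress": "grace@example.com",   "internalID": "INT-007", "salary": 85000},
    {"id": 8, "firstName": "Henry",   "lastName": "Moore",  "emailAddress": "henry@example.com",   "internalID": "INT-008", "salary": 95000},
]

REQUIRED = ("firstName", "lastName", "emailAddress")
KEEP = ("id", "firstName", "lastName", "emailAddress", "salary")
RENAME = {"firstName": "first_name", "lastName": "last_name", "emailAddress": "email"}

rows = [r for r in raw if all(r[c] is not None for c in REQUIRED)]   # Step 1: drop missing
rows = [{k: r[k] for k in KEEP} for r in rows]                       # Step 2: select columns
rows = [{RENAME.get(k, k): v for k, v in r.items()} for r in rows]   # Step 3: rename
rows = [r for r in rows if r["salary"] > 0]                          # Step 4: salary > 0
for r in rows:                                                       # Step 5: derive
    r["display_name"] = f"{r['first_name']} {r['last_name']}"

print([r["id"] for r in rows])    # [1, 4, 7, 8]
print(rows[0]["display_name"])    # Alice Smith
```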
## Data Quality Summary
| Metric | Before | After | % Change |
|---|---|---|---|
| Total Records | 8 | 4 | -50% |
| Complete Records | 5 | 4 | -20% |
| Missing Values | 3 | 0 | -100% |
| Invalid Values | 1 | 0 | -100% |
| Extra Columns | 1 | 0 | -100% |
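The "Missing Values" and "Invalid Values" rows can be verified by counting directly over the raw input. A quick sketch (the column subset below omits `id` and `internalID`, which contain no nulls):

```python
# Sketch: count missing (null) values and invalid (negative) salaries
# in the raw input. A salary of 0 is caught by the `gt: 0` filter rather
# than counted as missing here.
rows = [
    {"firstName": "Alice",   "lastName": "Smith",  "emailAddress": "alice@example.com",   "salary": 80000},
    {"firstName": "Bob",     "lastName": "Jones",  "emailAddress": None,                  "salary": 75000},
    {"firstName": "Charlie", "lastName": None,     "emailAddress": "charlie@example.com", "salary": 0},
    {"firstName": "Diana",   "lastName": "Prince", "emailAddress": "diana@example.com",   "salary": 90000},
    {"firstName": None,      "lastName": "Brown",  "emailAddress": "ebrown@example.com",  "salary": 70000},
    {"firstName": "Frank",   "lastName": "Garcia", "emailAddress": "fgarcia@example.com", "salary": -5000},
    {"firstName": "Grace",   "lastName": "Wilson", "emailAddress": "grace@example.com",   "salary": 85000},
    {"firstName": "Henry",   "lastName": "Moore",  "emailAddress": "henry@example.com",   "salary": 95000},
]

missing = sum(v is None for r in rows for v in r.values())
invalid_salaries = sum(r["salary"] is not None and r["salary"] < 0 for r in rows)
print(missing, invalid_salaries)  # 3 1
```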
## Best Practices
- Filter Early: Remove bad records first
- Keep Essentials: Remove unnecessary columns early
- Standardize Names: Consistent naming for readability
- Validate Values: Check ranges and constraints
- Add Computed Fields: Last step after validation
## Common Cleaning Scenarios
### Scenario 1: Remove Outliers

```yaml
- type: filter
  configuration:
    where:
      # Keep salaries between $30k and $300k
      salary: { between: [30000, 300000] }
```

### Scenario 2: Handle Empty Strings
```yaml
- type: filter
  configuration:
    where:
      # Remove records whose email is an empty string
      email: { ne: "" }
```

### Scenario 3: Normalize Text
```yaml
- type: derive
  configuration:
    columns:
      # Uppercase all categories
      category_normalized: "category.toUpperCase()"
```

### Scenario 4: Date Validation
```yaml
- type: filter
  configuration:
    where:
      # Keep dates after Jan 1, 2024
      created_date: { gt: "2024-01-01" }
```

## Error Handling
| Situation | Behavior |
|---|---|
| Filter matches nothing | Returns empty dataset (0 rows) |
| Column doesn't exist | Silently skipped in select |
| Invalid derive expression | Results in null value, doesn't break |
| Rename non-existent column | Silently skipped |
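This forgiving behavior can be mimicked in a short sketch: unknown columns are skipped, and a failing derive expression yields null instead of aborting the pipeline (assumed semantics, mirroring the table above):

```python
# Sketch: tolerant select and derive, mirroring the error-handling table.
def tolerant_select(record, columns):
    # Columns that don't exist are silently skipped.
    return {c: record[c] for c in columns if c in record}

def tolerant_derive(record, name, fn):
    # A failing expression yields None instead of breaking the pipeline.
    out = dict(record)
    try:
        out[name] = fn(record)
    except Exception:
        out[name] = None
    return out

row = {"id": 1, "first_name": "Alice"}
print(tolerant_select(row, ["id", "no_such_column"]))  # {'id': 1}

# last_name is missing, so the expression fails and display_name becomes None.
derived = tolerant_derive(row, "display_name",
                          lambda r: r["first_name"] + " " + r["last_name"])
print(derived["display_name"])  # None
```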
## Performance Tips
- Filter to small dataset first - Reduces rows for downstream operations
- Select early - Fewer columns = faster processing
- Avoid complex derives on large datasets - Do filtering first
- Order matters - Filter → Select → Rename → Derive
## Before/After Comparison
**Before Cleaning:**
- 8 total records
- Multiple missing values
- Inconsistent naming
- Invalid data included
- Extra columns cluttering dataset

**After Cleaning:**
- 4 validated records
- No missing values
- Consistent naming (snake_case)
- All values valid
- Only needed columns
- Ready for visualization