# Data Cleaning Example

Shows how to clean messy raw data before visualization.
## Problem

Raw CSV data has:
- Missing values
- Inconsistent column names
- Unnecessary columns
- Outliers or invalid values
## Solution Pipeline

```yaml
transformations:
  # Step 1: Remove records with missing required fields
  # (the raw columns are firstName, lastName, and emailAddress;
  # they are renamed in Step 3)
  - type: filter
    configuration:
      where:
        emailAddress: { ne: null }
        firstName: { ne: null }
        lastName: { ne: null }
  # Step 2: Keep only needed columns
  - type: select
    configuration:
      columns: ["id", "firstName", "lastName", "emailAddress", "salary"]
  # Step 3: Standardize column names
  - type: rename
    configuration:
      mapping:
        firstName: "first_name"
        lastName: "last_name"
        emailAddress: "email"
  # Step 4: Filter out invalid salaries
  - type: filter
    configuration:
      where:
        salary: { gt: 0 }
  # Step 5: Add computed display name
  - type: derive
    configuration:
      columns:
        display_name: "`${first_name} ${last_name}`"
```

## Step-by-Step Transformation
### Input Data (8 records with issues)
| id | firstName | lastName | emailAddress | internalID | salary |
|----|------------|----------|------------------------|------------|--------|
| 1 | Alice | Smith | alice@example.com | INT-001 | 80000 |
| 2 | Bob | Jones | null | INT-002 | 75000 |
| 3 | Charlie | null | charlie@example.com | INT-003 | 0 |
| 4 | Diana | Prince | diana@example.com | INT-004 | 90000 |
| 5 | null | Brown | ebrown@example.com | INT-005 | 70000 |
| 6 | Frank | Garcia | fgarcia@example.com | INT-006 | -5000 |
| 7 | Grace | Wilson | grace@example.com | INT-007 | 85000 |
| 8  | Henry      | Moore    | henry@example.com      | INT-008    | 95000  |

Issues: a missing email (row 2), missing names (rows 3 and 5), an unnecessary internalID column, and invalid salary values (0 and -5000).
### After Step 1: Filter Missing Values
Removes rows missing email or name:
- Removes: Row 2 (missing email), Row 3 (missing last name), Row 5 (missing first name)
- Keeps: Rows 1, 4, 6, 7, 8 (5 records)
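In plain Python, this step amounts to a null check across the required raw columns. A minimal sketch (hypothetical helper code, not the actual transformation engine), using a subset of the input rows:

```python
# Sketch: Step 1 as a null check on the raw column names.
records = [
    {"id": 1, "firstName": "Alice",   "lastName": "Smith", "emailAddress": "alice@example.com"},
    {"id": 2, "firstName": "Bob",     "lastName": "Jones", "emailAddress": None},
    {"id": 3, "firstName": "Charlie", "lastName": None,    "emailAddress": "charlie@example.com"},
    {"id": 5, "firstName": None,      "lastName": "Brown", "emailAddress": "ebrown@example.com"},
]

REQUIRED = ("firstName", "lastName", "emailAddress")

# Keep only records where every required field is present.
kept = [r for r in records if all(r[c] is not None for c in REQUIRED)]
print([r["id"] for r in kept])  # [1]
```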
### After Step 2: Select Columns
Keeps only: id, firstName, lastName, emailAddress, salary
- Removes: internalID column
- Result: 5 records with 5 columns
### After Step 3: Rename Columns
Standardizes names for consistency:
- firstName → first_name
- lastName → last_name
- emailAddress → email
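Steps 2 and 3 together are a projection followed by a re-keying of each record. A plain-Python sketch (illustrative only; the engine's internals are not shown in this document):

```python
# Sketch: select a subset of columns, then rename them to snake_case.
KEEP = ["id", "firstName", "lastName", "emailAddress", "salary"]
RENAME = {"firstName": "first_name", "lastName": "last_name", "emailAddress": "email"}

def select_and_rename(record):
    # Project to the kept columns (missing columns are silently skipped).
    projected = {k: record[k] for k in KEEP if k in record}
    # Re-key using the rename mapping; unmapped columns keep their names.
    return {RENAME.get(k, k): v for k, v in projected.items()}

row = {"id": 1, "firstName": "Alice", "lastName": "Smith",
       "emailAddress": "alice@example.com", "internalID": "INT-001", "salary": 80000}
print(select_and_rename(row))
# {'id': 1, 'first_name': 'Alice', 'last_name': 'Smith',
#  'email': 'alice@example.com', 'salary': 80000}
```

Note that the internalID column disappears in the projection before the rename runs.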
### After Step 4: Filter Invalid Salaries
Removes salary ≤ 0:
- Removes: Row 6 (salary = -5000)
- Keeps: Rows 1, 4, 7, 8 (4 records)
### After Step 5: Derive Display Name
Adds computed column:
| id | first_name | last_name | email | salary | display_name |
|----|------------|-----------|---------------------|--------|-----------------|
| 1 | Alice | Smith | alice@example.com | 80000 | Alice Smith |
| 4 | Diana | Prince | diana@example.com | 90000 | Diana Prince |
| 7 | Grace | Wilson | grace@example.com | 85000 | Grace Wilson |
| 8  | Henry      | Moore     | henry@example.com   | 95000  | Henry Moore     |

Final Result: 4 clean, validated records ready for visualization.
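The full five-step pipeline can be reproduced end to end over the example input with a short plain-Python sketch (hypothetical helper code, not the actual engine):

```python
# Sketch: all five cleaning steps over the 8-row example input.
raw = [
    {"id": 1, "firstName": "Alice",   "lastName": "Smith",  "emailAddress": "alice@example.com",   "internalID": "INT-001", "salary": 80000},
    {"id": 2, "firstName": "Bob",     "lastName": "Jones",  "emailAddress": None,                  "internalID": "INT-002", "salary": 75000},
    {"id": 3, "firstName": "Charlie", "lastName": None,     "emailAddress": "charlie@example.com", "internalID": "INT-003", "salary": 0},
    {"id": 4, "firstName": "Diana",   "lastName": "Prince", "emailAddress": "diana@example.com",   "internalID": "INT-004", "salary": 90000},
    {"id": 5, "firstName": None,      "lastName": "Brown",  "emailAddress": "ebrown@example.com",  "internalID": "INT-005", "salary": 70000},
    {"id": 6, "firstName": "Frank",   "lastName": "Garcia", "emailAddress": "fgarcia@example.com", "internalID": "INT-006", "salary": -5000},
    {"id": 7, "firstName": "Grace",   "lastName": "Wilson", "emailAddress": "grace@example.com",   "internalID": "INT-007", "salary": 85000},
    {"id": 8, "firstName": "Henry",   "lastName": "Moore",  "emailAddress": "henry@example.com",   "internalID": "INT-008", "salary": 95000},
]

REQUIRED = ("firstName", "lastName", "emailAddress")
KEEP = ("id", "firstName", "lastName", "emailAddress", "salary")
RENAME = {"firstName": "first_name", "lastName": "last_name", "emailAddress": "email"}

rows = [r for r in raw if all(r[c] is not None for c in REQUIRED)]   # Step 1: drop missing
rows = [{k: r[k] for k in KEEP} for r in rows]                       # Step 2: select columns
rows = [{RENAME.get(k, k): v for k, v in r.items()} for r in rows]   # Step 3: rename
rows = [r for r in rows if r["salary"] > 0]                          # Step 4: salary > 0
for r in rows:                                                       # Step 5: derive
    r["display_name"] = f"{r['first_name']} {r['last_name']}"

print([r["id"] for r in rows])    # [1, 4, 7, 8]
print(rows[0]["display_name"])    # Alice Smith
```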
## Data Quality Summary
| Metric | Before | After | % Change |
|---|---|---|---|
| Total Records | 8 | 4 | -50% |
| Complete Records | 5 | 4 | -20% |
| Missing Values | 3 | 0 | -100% |
| Invalid Values | 1 | 0 | -100% |
| Extra Columns | 1 | 0 | -100% |
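The "Missing Values" and "Invalid Values" rows can be verified by counting directly over the raw input. A quick sketch (the column subset below omits `id` and `internalID`, which contain no nulls):

```python
# Sketch: count missing (null) values and invalid (negative) salaries
# in the raw input. A salary of 0 is caught by the `gt: 0` filter rather
# than counted as missing here.
rows = [
    {"firstName": "Alice",   "lastName": "Smith",  "emailAddress": "alice@example.com",   "salary": 80000},
    {"firstName": "Bob",     "lastName": "Jones",  "emailAddress": None,                  "salary": 75000},
    {"firstName": "Charlie", "lastName": None,     "emailAddress": "charlie@example.com", "salary": 0},
    {"firstName": "Diana",   "lastName": "Prince", "emailAddress": "diana@example.com",   "salary": 90000},
    {"firstName": None,      "lastName": "Brown",  "emailAddress": "ebrown@example.com",  "salary": 70000},
    {"firstName": "Frank",   "lastName": "Garcia", "emailAddress": "fgarcia@example.com", "salary": -5000},
    {"firstName": "Grace",   "lastName": "Wilson", "emailAddress": "grace@example.com",   "salary": 85000},
    {"firstName": "Henry",   "lastName": "Moore",  "emailAddress": "henry@example.com",   "salary": 95000},
]

missing = sum(v is None for r in rows for v in r.values())
invalid_salaries = sum(r["salary"] is not None and r["salary"] < 0 for r in rows)
print(missing, invalid_salaries)  # 3 1
```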
## Best Practices
- Filter Early: Remove bad records first
- Keep Essentials: Remove unnecessary columns early
- Standardize Names: Consistent naming for readability
- Validate Values: Check ranges and constraints
- Add Computed Fields: Last step after validation
## Common Cleaning Scenarios
### Scenario 1: Remove Outliers

```yaml
- type: filter
  configuration:
    where:
      # Keep salaries between $30k and $300k
      salary: { between: [30000, 300000] }
```

### Scenario 2: Handle Empty Strings
```yaml
- type: filter
  configuration:
    where:
      # Remove records whose email is an empty string
      email: { ne: "" }
```

### Scenario 3: Normalize Text
```yaml
- type: derive
  configuration:
    columns:
      # Uppercase all categories
      category_normalized: "category.toUpperCase()"
```

### Scenario 4: Date Validation
```yaml
- type: filter
  configuration:
    where:
      # Keep dates after Jan 1, 2024
      created_date: { gt: "2024-01-01" }
```

## Error Handling
| Situation | Behavior |
|---|---|
| Filter matches nothing | Returns empty dataset (0 rows) |
| Column doesn't exist | Silently skipped in select |
| Invalid derive expression | Results in null value, doesn't break |
| Rename non-existent column | Silently skipped |
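This forgiving behavior can be mimicked in a short sketch: unknown columns are skipped, and a failing derive expression yields null instead of aborting the pipeline (assumed semantics, mirroring the table above):

```python
# Sketch: tolerant select and derive, mirroring the error-handling table.
def tolerant_select(record, columns):
    # Columns that don't exist are silently skipped.
    return {c: record[c] for c in columns if c in record}

def tolerant_derive(record, name, fn):
    # A failing expression yields None instead of breaking the pipeline.
    out = dict(record)
    try:
        out[name] = fn(record)
    except Exception:
        out[name] = None
    return out

row = {"id": 1, "first_name": "Alice"}
print(tolerant_select(row, ["id", "no_such_column"]))  # {'id': 1}

# last_name is missing, so the expression fails and display_name becomes None.
derived = tolerant_derive(row, "display_name",
                          lambda r: r["first_name"] + " " + r["last_name"])
print(derived["display_name"])  # None
```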
## Performance Tips
- Filter to small dataset first - Reduces rows for downstream operations
- Select early - Fewer columns = faster processing
- Avoid complex derives on large datasets - Do filtering first
- Order matters - Filter → Select → Rename → Derive
## Before/After Comparison
**Before Cleaning:**
- 8 total records
- Multiple missing values
- Inconsistent naming
- Invalid data included
- Extra columns cluttering dataset

**After Cleaning:**
- 4 validated records
- No missing values
- Consistent naming (snake_case)
- All values valid
- Only needed columns
- Ready for visualization