
Data Cleaning Example

Shows how to clean messy raw data before visualization.

Problem

Raw CSV data has:

  • Missing values
  • Inconsistent column names
  • Unnecessary columns
  • Outliers or invalid values

Solution Pipeline

```yaml
transformations:
  # Step 1: Remove records with missing required fields
  # (filter runs before the rename, so it must use the raw column names)
  - type: filter
    configuration:
      where:
        emailAddress: { ne: null }
        firstName: { ne: null }
        lastName: { ne: null }

  # Step 2: Keep only needed columns
  - type: select
    configuration:
      columns: ["id", "firstName", "lastName", "emailAddress", "salary"]

  # Step 3: Standardize column names
  - type: rename
    configuration:
      mapping:
        firstName: "first_name"
        lastName: "last_name"
        emailAddress: "email"

  # Step 4: Filter out invalid salaries
  - type: filter
    configuration:
      where:
        salary: { gt: 0 }

  # Step 5: Add computed display name
  - type: derive
    configuration:
      columns:
        display_name: "`${first_name} ${last_name}`"
```
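The same five steps can be sketched in plain pandas. This is an illustration of what the pipeline does, not the tool's own implementation; the sample values mirror the input table in the walkthrough below.

```python
import pandas as pd

# Sample raw data matching the walkthrough's input table
raw = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7, 8],
    "firstName": ["Alice", "Bob", "Charlie", "Diana", None, "Frank", "Grace", "Henry"],
    "lastName": ["Smith", "Jones", None, "Prince", "Brown", "Garcia", "Wilson", "Moore"],
    "emailAddress": ["alice@example.com", None, "charlie@example.com",
                     "diana@example.com", "ebrown@example.com", "fgarcia@example.com",
                     "grace@example.com", "henry@example.com"],
    "internalID": [f"INT-{i:03d}" for i in range(1, 9)],
    "salary": [80000, 75000, 0, 90000, 70000, -5000, 85000, 95000],
})

# Step 1: drop rows missing any required field
df = raw.dropna(subset=["emailAddress", "firstName", "lastName"])
# Step 2: keep only the needed columns (drops internalID)
df = df[["id", "firstName", "lastName", "emailAddress", "salary"]]
# Step 3: standardize column names to snake_case
df = df.rename(columns={"firstName": "first_name",
                        "lastName": "last_name",
                        "emailAddress": "email"})
# Step 4: drop invalid (non-positive) salaries
df = df[df["salary"] > 0]
# Step 5: derive the display name
df["display_name"] = df["first_name"] + " " + df["last_name"]
```

Running this yields the same four records as the walkthrough: ids 1, 4, 7, and 8.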

Step-by-Step Transformation

Input Data (8 records with issues)

| id | firstName  | lastName | emailAddress           | internalID | salary |
|----|------------|----------|------------------------|------------|--------|
| 1  | Alice      | Smith    | alice@example.com      | INT-001    | 80000  |
| 2  | Bob        | Jones    | null                   | INT-002    | 75000  |
| 3  | Charlie    | null     | charlie@example.com    | INT-003    | 0      |
| 4  | Diana      | Prince   | diana@example.com      | INT-004    | 90000  |
| 5  | null       | Brown    | ebrown@example.com     | INT-005    | 70000  |
| 6  | Frank      | Garcia   | fgarcia@example.com    | INT-006    | -5000  |
| 7  | Grace      | Wilson   | grace@example.com      | INT-007    | 85000  |
| 8  | Henry      | Moore    | henry@example.com      | INT-008    | 95000  |

Issues: missing emails and names, an extra internalID column, and invalid salary values (0 and -5000)

After Step 1: Filter Missing Values

Removes rows missing email or name:

  • Removes: Row 2 (no email), Row 3 (no name), Row 5 (no name)
  • Keeps: Rows 1, 4, 6, 7, 8 (5 records)

After Step 2: Select Columns

Keeps only: id, firstName, lastName, emailAddress, salary

  • Removes: internalID column
  • Result: 5 records with 5 columns

After Step 3: Rename Columns

Standardizes names for consistency:

  • firstName → first_name
  • lastName → last_name
  • emailAddress → email

After Step 4: Filter Invalid Salaries

Removes salary ≤ 0:

  • Removes: Row 6 (salary = -5000)
  • Keeps: Rows 1, 4, 7, 8 (4 records)

After Step 5: Derive Display Name

Adds computed column:

| id | first_name | last_name | email               | salary | display_name    |
|----|------------|-----------|---------------------|--------|-----------------|
| 1  | Alice      | Smith     | alice@example.com   | 80000  | Alice Smith     |
| 4  | Diana      | Prince    | diana@example.com   | 90000  | Diana Prince    |
| 7  | Grace      | Wilson    | grace@example.com   | 85000  | Grace Wilson    |
| 8  | Henry      | Moore     | henry@example.com   | 95000  | Henry Moore     |

Final Result: 4 clean, validated records ready for visualization

Data Quality Summary

| Metric           | Before | After | Change |
|------------------|--------|-------|--------|
| Total Records    | 8      | 4     | -50%   |
| Complete Records | 3      | 4     | +33%   |
| Missing Values   | 3      | 0     | -100%  |
| Invalid Values   | 1      | 0     | -100%  |
| Extra Columns    | 1      | 0     | -100%  |

Best Practices

  1. Filter Early: Remove bad records first
  2. Keep Essentials: Remove unnecessary columns early
  3. Standardize Names: Consistent naming for readability
  4. Validate Values: Check ranges and constraints
  5. Add Computed Fields: Last step after validation

Common Cleaning Scenarios

Scenario 1: Remove Outliers

```yaml
- type: filter
  configuration:
    where:
      # Keep salaries between $30k-$300k
      salary: { between: [30000, 300000] }
```
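An equivalent range filter in pandas, as an illustration: `Series.between` is inclusive at both bounds, which matches the $30k–$300k comment above.

```python
import pandas as pd

salaries = pd.Series([25000, 30000, 150000, 300000, 500000])
# between() keeps values in [30000, 300000], endpoints included
kept = salaries[salaries.between(30000, 300000)]
```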

Scenario 2: Handle Empty Strings

```yaml
- type: filter
  configuration:
    where:
      # Remove rows where email is an empty string
      # (nulls need a separate ne: null check, as in Step 1)
      email: { ne: "" }
```
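Because `ne: ""` catches empty strings but not nulls, a robust cleaner treats both as missing. A pandas sketch with hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({"email": ["a@example.com", "", None, "b@example.com"]})
# Drop rows where email is null OR an empty string
clean = df[df["email"].notna() & (df["email"] != "")]
```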

Scenario 3: Normalize Text

```yaml
- type: derive
  configuration:
    columns:
      # Uppercase all categories
      category_normalized: "category.toUpperCase()"
```
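The same normalization in pandas (illustrative, with made-up category values) uses the vectorized `.str` accessor:

```python
import pandas as pd

df = pd.DataFrame({"category": ["food", "Travel", "FOOD"]})
# Vectorized uppercase via the .str accessor
df["category_normalized"] = df["category"].str.upper()
```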

Scenario 4: Date Validation

```yaml
- type: filter
  configuration:
    where:
      # Keep dates after Jan 1, 2024
      created_date: { gt: "2024-01-01" }
```
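A pandas sketch of the same check (hypothetical dates): parsing the strings first makes the comparison chronological rather than lexical, and `>` keeps only dates strictly after the cutoff.

```python
import pandas as pd

df = pd.DataFrame({"created_date": ["2023-12-31", "2024-01-01", "2024-06-15"]})
# Parse to datetimes so the comparison is chronological, not lexical
dates = pd.to_datetime(df["created_date"])
recent = df[dates > pd.Timestamp("2024-01-01")]
```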

Error Handling

| Situation                  | Behavior                                     |
|----------------------------|----------------------------------------------|
| Filter matches nothing     | Returns empty dataset (0 rows)               |
| Column doesn't exist       | Silently skipped in `select`                 |
| Invalid derive expression  | Results in `null` value, doesn't break       |
| Rename non-existent column | Silently skipped                             |
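The silent-skip behavior for missing columns can be reproduced defensively in pandas (illustrative; a plain `df[wanted]` would raise a `KeyError` instead):

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "name": ["Alice", "Bob"]})
wanted = ["id", "name", "missing_col"]
# Keep only the requested columns that actually exist
safe = df[[c for c in wanted if c in df.columns]]
```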

Performance Tips

  1. Filter to small dataset first - Reduces rows for downstream operations
  2. Select early - Fewer columns = faster processing
  3. Avoid complex derives on large datasets - Do filtering first
  4. Order matters - Filter → Select → Rename → Derive

Before/After Comparison

Before Cleaning:

- 8 total records
- Multiple missing values
- Inconsistent naming
- Invalid data included
- Extra columns cluttering dataset

After Cleaning:

- 4 validated records
- No missing values
- Consistent naming (snake_case)
- All values valid
- Only needed columns
- Ready for visualization

Released under the MIT License. Built by Boundary Lab.