Concepts
Before you dive into the specific techniques, it’s important to know when data cleansing should be incorporated into your workflow:
- Data Acquisition: When you initially access or receive data from various sources.
- Data Transformation and Loading: During ETL (Extract, Transform, Load) processes, especially before loading data into your data warehouse or analytical store.
- Data Analysis Preparation: Before running data analytics or machine learning models, since the quality of your input data can greatly affect the outcomes.
- Data Maintenance: As part of a scheduled maintenance plan to continually improve the data’s accuracy.
How to Apply Cleansing Techniques
Data Profiling
Start with data profiling to get an overview of the quality of your data. AWS Glue crawlers, for instance, can be used to perform this task.
- Example: Use AWS Glue to crawl your datasets to understand their schema and detect anomalies.
import boto3

glue = boto3.client('glue', region_name='us-west-2')

# Kick off the crawler that profiles the dataset and registers its schema in the Data Catalog
response = glue.start_crawler(
    Name='your-crawler-name'
)
print(response)
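After the crawler run completes, you can inspect the tables it added to the Data Catalog to review the inferred schema. A minimal sketch, continuing the example above and assuming a hypothetical catalog database named your-database:

import time

# Wait for the crawler to return to the READY state, then list the tables it cataloged
while glue.get_crawler(Name='your-crawler-name')['Crawler']['State'] != 'READY':
    time.sleep(30)

tables = glue.get_tables(DatabaseName='your-database')
for table in tables['TableList']:
    print(table['Name'], [col['Name'] for col in table['StorageDescriptor']['Columns']])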
Null Value Handling
You can use AWS Glue to transform and clean your data by handling null values: depending on your requirements, nulls can either be replaced with a default value or dropped.
- Example: Transform a dataset using AWS Glue to substitute null values with a predefined default.
import boto3

glue = boto3.client('glue')

# Define a transformation job
job = glue.create_job(
    Name='my-transform-job',
    Role='GlueServiceRole',
    Command={
        'Name': 'glueetl',
        'ScriptLocation': 's3://path-to-your-script'
    },
    DefaultArguments={
        # Custom argument read by your ETL script; replace NULL with your default
        '--null-value-replace': 'DEFAULT_VALUE'
    }
)

# Start the job run
response = glue.start_job_run(
    JobName='my-transform-job'
)
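The job above only registers and launches the ETL script; the replacement itself happens inside the script stored at the ScriptLocation. A minimal sketch of what that script might contain, assuming the custom --null-value-replace argument defined above and hypothetical S3 source and target paths:

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from pyspark.context import SparkContext

# Read the custom argument supplied via DefaultArguments (Glue strips the leading "--")
args = getResolvedOptions(sys.argv, ['null-value-replace'])

glueContext = GlueContext(SparkContext())
spark = glueContext.spark_session

# Load the raw data, replace nulls with the default, and write the cleaned result back
# Note: fillna with a string value fills string columns; pass a dict for per-column defaults
df = spark.read.parquet("s3://your-bucket/raw-data/")  # hypothetical source path
cleaned_df = df.fillna(args['null-value-replace'])
cleaned_df.write.mode("overwrite").parquet("s3://your-bucket/cleaned-data/")  # hypothetical target path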
Duplicate Removal
Duplicate records can skew analysis and should be removed or consolidated. Running a deduplication step inside an AWS Glue job is a handy way to do this.
- Example: Write a PySpark script, usable in an AWS Glue job, that deduplicates records.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Load the raw data, drop exact duplicate rows, and write the cleaned result back to S3
df = spark.read.format("your-data-source").load("s3://your-bucket/your-data")
deduped_df = df.dropDuplicates()
deduped_df.write.format("your-target-format").save("s3://your-bucket/your-cleaned-data")
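Note that dropDuplicates() with no arguments treats two rows as duplicates only when every column matches. If duplicates should instead be identified by business keys, pass those columns explicitly, for example deduped_df = df.dropDuplicates(["customer_id", "order_date"]) (hypothetical column names).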
Data Formatting and Normalization
Data often arrives in inconsistent formats, which can be standardized using AWS services such as AWS Glue, or Amazon S3 Select for simple transformations (see the S3 Select sketch after the example below).
- Example: Standardize a date format across your datasets using AWS Glue.
from pyspark.sql.functions import to_date

# Parse the raw string column (e.g. "12312023" in MMddyyyy form) into a proper date type
df = spark.read.table("your_table")
df = df.withColumn("standardized_date", to_date(df.date_column, "MMddyyyy"))
df.write.mode("overwrite").saveAsTable("your_table_standardized")
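For lighter-weight fixes, the Amazon S3 Select option mentioned above can filter malformed records straight out of a single object without running a Spark job. A minimal sketch, assuming a hypothetical headerless CSV object whose first column must never be empty (substitute your own bucket, key, and SQL expression):

import boto3

s3 = boto3.client('s3')

# Stream back only the rows whose first column is non-empty
response = s3.select_object_content(
    Bucket='your-bucket',
    Key='your-data.csv',
    ExpressionType='SQL',
    Expression="SELECT * FROM S3Object s WHERE s._1 <> ''",
    InputSerialization={'CSV': {'FileHeaderInfo': 'NONE'}},
    OutputSerialization={'CSV': {}}
)

for event in response['Payload']:
    if 'Records' in event:
        print(event['Records']['Payload'].decode('utf-8'))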
Quality Checks
Continuously implement quality checks as part of your data pipeline. Amazon QuickSight offers ML Insights that can detect anomalies in your data that may indicate data quality issues.
- Example: Use QuickSight to periodically review your data for anomalies and outliers.
Validation Rules
Validation rules ensure that incoming data conforms to certain standards or patterns. You can use AWS Lambda functions to implement custom validation rules.
- Example: An AWS Lambda function, triggered when new data lands in an S3 bucket, validates the incoming object before it moves downstream (see the sketch below).
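A minimal sketch of such a function, assuming CSV objects and a hypothetical rule that the first column (an id) must never be empty; the bucket, key, and rules would come from your own pipeline:

import csv
import boto3

s3 = boto3.client('s3')

def lambda_handler(event, context):
    # The S3 put event identifies the object that was just loaded
    record = event['Records'][0]['s3']
    bucket = record['bucket']['name']
    key = record['object']['key']

    body = s3.get_object(Bucket=bucket, Key=key)['Body'].read().decode('utf-8')
    rows = list(csv.reader(body.splitlines()))

    # Hypothetical rule: the first column must never be empty
    bad_rows = [i for i, row in enumerate(rows) if not row or not row[0].strip()]

    if bad_rows:
        # Flag the object for remediation; you might also move it to a quarantine prefix or notify via SNS
        print(f"Validation failed for s3://{bucket}/{key}: empty ids at rows {bad_rows}")
        return {'valid': False, 'badRows': bad_rows}

    return {'valid': True}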
Conclusion
By applying these cleansing techniques, you’re ensuring that the datasets you work with are reliable, accurate, and prepared for meaningful analysis. Always remember that data quality is a continuous process, and as a data engineer, emphasizing best practices in data cleansing should be an integral part of your ETL pipelines. These practices are instrumental in acing the AWS Certified Data Engineer – Associate (DEA-C01) exam and paving the way for a successful data engineering career on AWS.
Answer the Questions in Comment Section
True or False: Data cleansing should be performed before data is ingested into a data lake or data warehouse.
- (A) True
- (B) False
Answer: A) True
Explanation: Data cleansing should generally be performed before data ingestion to ensure that the data stored is of high quality and ready for analysis.
When dealing with large datasets in AWS, which service is primarily used to cleanse and transform data?
- (A) AWS Glue
- (B) Amazon S3
- (C) AWS Lambda
- (D) Amazon Kinesis
Answer: A) AWS Glue
Explanation: AWS Glue is a managed ETL (Extract, Transform, Load) service that’s used to cleanse, transform, and prepare large datasets for analysis.
Which AWS service is NOT suitable for real-time data cleansing?
- (A) AWS Glue
- (B) Amazon Kinesis Data Analytics
- (C) Amazon Kinesis Data Firehose
- (D) Amazon QuickSight
Answer: D) Amazon QuickSight
Explanation: Amazon QuickSight is an analytics service used for creating visualizations and is not designed for real-time data cleansing.
True or False: AWS Lake Formation helps in cleansing and deduplicating data in a data lake.
- (A) True
- (B) False
Answer: A) True
Explanation: AWS Lake Formation has capabilities for cleaning and deduplicating data as part of the data lake management process.
In AWS, which service would you use for interactive data cleansing and preparation?
- (A) AWS Data Pipeline
- (B) AWS Glue DataBrew
- (C) Amazon Redshift
- (D) Amazon Athena
Answer: B) AWS Glue DataBrew
Explanation: AWS Glue DataBrew allows data analysts and data scientists to clean and normalize data using an interactive, visual interface.
When cleansing data, which AWS service would you use to automatically discover and classify personally identifiable information (PII) so it can be removed or masked?
- (A) AWS KMS (Key Management Service)
- (B) AWS Secrets Manager
- (C) AWS Identity and Access Management (IAM)
- (D) Amazon Macie
Answer: D) Amazon Macie
Explanation: Amazon Macie is a data security service that uses machine learning to automatically discover, classify, and protect sensitive data, including PII, in AWS.
True or False: AWS DataSync can be used to cleanse data during transfer between on-premises storage and AWS storage services.
- (A) True
- (B) False
Answer: B) False
Explanation: AWS DataSync is used to transfer data at high speed between on-premises storage and AWS storage services, but it’s not used for data cleansing.
Which of the following is an important consideration when applying data cleansing techniques?
- (A) Data security and compliance
- (B) Data storage costs
- (C) Performance impact
- (D) All of the above
Answer: D) All of the above
Explanation: Data security and compliance, storage costs, and the potential performance impact are all important considerations when applying data cleansing techniques.
True or False: You should perform data cleansing after data analysis to confirm findings.
- (A) True
- (B) False
Answer: B) False
Explanation: Data cleansing is typically performed before data analysis to ensure that the analysis is based on accurate and clean data.
Which AWS service enables you to transform streaming data before loading it into data stores?
- (A) Amazon Kinesis Data Firehose
- (B) AWS Glue
- (C) Amazon EMR
- (D) AWS Batch
Answer: A) Amazon Kinesis Data Firehose
Explanation: Amazon Kinesis Data Firehose can apply simple transformations to streaming data on the fly (for example, by invoking an AWS Lambda function) before loading it into destinations such as Amazon S3, Amazon Redshift, or Amazon OpenSearch Service.
Which AWS service is NOT directly involved in data cleansing activities?
- (A) Amazon S3
- (B) AWS Step Functions
- (C) AWS Glue DataBrew
- (D) Amazon SageMaker
Answer: B) AWS Step Functions
Explanation: AWS Step Functions is a service to coordinate multiple AWS services into serverless workflows, not for cleansing data. Data cleansing is performed in services like AWS Glue DataBrew.
True or False: Data cleansing is only necessary for structured data and not for unstructured data.
- (A) True
- (B) False
Answer: B) False
Explanation: Data cleansing is necessary for both structured and unstructured data, as both types can contain inaccuracies or inconsistencies which need to be addressed.
Great insights on when and how to apply the cleansing techniques during the data preparation phase for the AWS Certified Data Engineer – Associate exam!
Very helpful post. I’ve always found data cleansing a bit challenging, but this clarified a lot for me.
Can anyone explain the difference between data normalization and data cleansing?
Sure! Data normalization is about organizing data to reduce redundancy and improve integrity, usually in a database context. Data cleansing, on the other hand, involves detecting and correcting errors and inconsistencies in the data to improve its quality.
How crucial is data cleansing for the DEA-C01 exam specifically?
Very crucial. Data quality is a major aspect of the exam. You need to understand various cleansing techniques and when to apply them to ensure accurate data analysis.
I found that using AWS Glue for data cleansing works really well. Anyone else?
Yes, AWS Glue is incredibly powerful for data cleansing, especially because it integrates so well with other AWS services.
Agree. AWS Glue’s ETL capabilities make it easier to automate the data cleansing process.
Thank you for the blog post. This was very enlightening.
Appreciate the detailed explanation in this post!
Has anyone faced performance issues while using AWS Glue for large datasets?
Yes, AWS Glue can sometimes be slow with very large datasets. Tuning the job parameters and optimizing the S3 data storage can help mitigate this.