Data transformation in IT 7123
About Jack Zheng
Faculty of IT at Kennesaw.edu
Data Cleaning and Transformation
Jack G. Zheng
Spring 2024
https://idi.kennesaw.edu/it7123
IT 7123 BI

Overview
• Data problems and issues when preparing data
• Data cleansing – techniques and approaches
• Data transformation – transformation tasks, tactics, and sequence
• Common cases and best practices

Data Cleansing and Scrubbing
• Data cleansing (or just data cleaning)
– The process of identifying and correcting data issues to improve data quality (accuracy, completeness, consistency, uniformity, and validity) – see next slide.
– Main reading: https://towardsdatascience.com/the-ultimate-guide-to-data-cleaning-3969843991d4
• Data scrubbing
– “…the procedure of modifying or removing incomplete, incorrect, inaccurately formatted, or repeated data in a data set.”
– Implies a more intense level of cleaning
– Considered a sub-area of data cleaning
– https://www.simplilearn.com/what-is-data-scrubbing-article
• Data is never clean. Insights and analysis are only as good as the data you are using.

Data Quality Means
• Validity – the degree to which the data conform to defined business rules or constraints.
– Data-type constraints: values in a particular column must be of a particular data type, e.g., boolean, numeric, date, etc.
– Range constraints: typically, numbers or dates should fall within a certain range.
– Mandatory constraints: certain columns cannot be empty.
– Unique constraints: a field, or a combination of fields, must be unique across a dataset.
– Set-membership constraints: values of a column come from a set of discrete values, e.g., enum values. For example, a person’s gender may be male or female.
– Foreign-key constraints: as in relational databases, a foreign key column cannot have a value that does not exist in the referenced primary key.
– Regular expression patterns: text fields that have to match a certain pattern. For example, phone numbers may be required to have the pattern (999) 999–9999.
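Most of these constraint types can be checked mechanically. A minimal sketch in Python (field names, allowed values, and the phone pattern are hypothetical, for illustration only):

```python
import re

def check_record(rec):
    """Return a list of validity violations for one record (illustrative rules)."""
    errors = []
    # Mandatory constraint: certain columns cannot be empty
    if not rec.get("name"):
        errors.append("name is mandatory")
    # Data-type and range constraints: age must be an int in a plausible range
    if not isinstance(rec.get("age"), int) or not 0 <= rec["age"] <= 120:
        errors.append("age out of range or wrong type")
    # Set-membership constraint: status must come from a discrete set
    if rec.get("status") not in {"active", "inactive"}:
        errors.append("status not in allowed set")
    # Regular expression pattern: US-style phone number (999) 999-9999
    if not re.fullmatch(r"\(\d{3}\) \d{3}-\d{4}", rec.get("phone", "")):
        errors.append("phone format invalid")
    return errors

print(check_record({"name": "Ann", "age": 30, "status": "active",
                    "phone": "(404) 555-1234"}))  # []
print(check_record({"name": "", "age": 1000, "status": "??", "phone": "8-Jun"}))
```

Each rule maps directly to one constraint type above; in practice such checks run during the inspection step, before any values are changed.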
– Cross-field validation: certain conditions that span multiple fields must hold. For example, a patient’s date of discharge from the hospital cannot be earlier than the date of admission.
• Accuracy – the degree to which the data is close to the true values and associated with the correct entity.
• Completeness – the degree to which all required data is known.
– Missing data happens for various reasons. One can mitigate the problem by questioning the original source if possible, say re-interviewing the subject.
– Chances are, the subject will either give a different answer or be hard to reach again.
• Consistency – the degree to which the data is consistent, within the same data set or across multiple data sets.
– Inconsistency occurs when two values in the data set contradict each other.
– A valid age, say 10, might not match the marital status, say divorced. A customer is recorded in two different tables with two different addresses.
• Uniformity – the degree to which the data is specified using the same unit of measure.
– Measurement units: currency/money, length, weight, time, etc.
(Source: https://towardsdatascience.com/the-ultimate-guide-to-data-cleaning-3969843991d4)

Most Common Data Quality Issues
• Redundancy and duplicates
– Affect aggregations such as sums (totals) and record counts
• Incomplete data; missing values; null values
• Inconsistency
– Data type inconsistency
– Row conflicts
– Value inconsistency: e.g., SPSU, Southern Poly
– Name inconsistency
– Name/value inconsistency: age in the address column, etc.
– Text data in numeric fields
• Errors
– Out-of-range values (e.g., age = 1000)
– Impossible data combinations: e.g., gender = MALE but pregnant = TRUE
• Spelling and encoding issues
– Typos
– System characters, line breaks, unwanted symbols, etc.
– Leading/trailing spaces
• Format issues in composite values, like dates, addresses, and currency
• Lack of identifiers
• Noise (unwanted, undesired, irrelevant, or interfering data or text: outliers, empty space, irregular characters, etc.)

Exercise: What are the issues?
• Some (anonymized) data obtained from a client to build a dashboard

What Caused the Bad Data?
• Extraction errors
– Especially in webpage scraping
• Source errors
– Quality issues inherited from the source
– Lack of data input control
– Mistakes or incorrect settings when exporting
• A typical data format error from an Excel source – some values are automatically recognized as dates by Excel: “8-Jun” is really “6-8”.

Cleanse Process
• Inspection: detect unexpected, incorrect, and inconsistent data.
• Cleaning: fix or remove the anomalies discovered.
• Verifying: after cleaning, inspect the results to verify correctness. Document or log the data cleaning process and activities for later tracing and auditing.
• Reporting: record a report about the changes made and the quality of the currently stored data.

Inspection
• Data examination is a necessary step to know the data and find problems
– Examine data types, models, formats, size, and condition (quality)
– Make sure the data is clean, of good quality, and meaningful.
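The examination step can be partly automated. A minimal column-profiling sketch (the list-of-dicts record format and field names are assumptions for illustration):

```python
def profile(records):
    """Per-column null count, distinct-value count, and observed value types."""
    summary = {}
    columns = {k for rec in records for k in rec}
    for col in sorted(columns):
        values = [rec.get(col) for rec in records]
        summary[col] = {
            "nulls": sum(v is None or v == "" for v in values),
            "distinct": len({v for v in values if v is not None}),
            "types": sorted({type(v).__name__ for v in values if v is not None}),
        }
    return summary

rows = [
    {"school": "SPSU", "age": 21},
    {"school": "Southern Poly", "age": None},
    {"school": "SPSU", "age": "22"},  # text data in a numeric field
]
print(profile(rows))
```

Mixed entries in a column's "types" list (here, int and str for age) flag exactly the kind of type inconsistency listed among the common quality issues.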
• Approaches
– Visual scan with the help of basic data operations: sorting, filtering, querying
– Data profiling or statistical profiling
– Visualization
• Inspection checklist
– It is helpful to prepare a checklist (similar to a vehicle multi-point inspection)

Data Profiling or Statistical Profiling
• Profiling is an automatic way to scan and report data issues
– It is not a comprehensive check, so other methods are still needed
• Most tools, including Power BI, provide this function
– https://learn.microsoft.com/en-us/power-query/data-profiling-tools

Cleaning: Dealing with Missing Data
• Drop observations that have missing values – but this loses information, so be mindful before removing them.
• Replace missing values based on other observations – again, there is a risk of losing data integrity, because you may be operating from assumptions rather than actual observations.
• Alter the way the data is used so that null values are handled explicitly.

Cleaning: Dealing with Duplicates
• What is a duplicate?
– The whole record is the same – perhaps only the identifier differs
– Duplicated values in unique columns
• It often happens when
– Data are combined from different sources
– A user hits the submit button twice, thinking the form wasn’t actually submitted
– A request to an online booking system is submitted twice to correct wrong information entered the first time
• Detection
– Usually run a quick “group by” query or transformation to count rows
– It may not be a duplicate even if all fields are the same – check definitions and contexts
• We had a case where two POS transaction records were completely identical. It turned out they came from two registers, but the register information was omitted entirely from the exported data.
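The “group by and count” detection idea can be sketched in a few lines of Python (the sales records and column names are hypothetical):

```python
from collections import Counter

def find_duplicates(records, columns):
    """Group rows by the given columns and return groups appearing more than once.
    Exclude auto-generated identifiers from `columns` so records that differ
    only by ID still group together."""
    counts = Counter(tuple(rec[c] for c in columns) for rec in records)
    return {key: n for key, n in counts.items() if n > 1}

sales = [
    {"id": 1, "store": "A", "amount": 9.99},
    {"id": 2, "store": "A", "amount": 9.99},  # same sale, different id
    {"id": 3, "store": "B", "amount": 5.00},
]
print(find_duplicates(sales, ["store", "amount"]))  # {('A', 9.99): 2}
```

As the POS example warns, a count above 1 is only a candidate duplicate: the register column was missing there, so apparently identical rows were in fact distinct transactions.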
– In Power Query:
• https://www.techrepublic.com/article/how-to-use-power-query-to-display-a-list-of-duplicate-values-or-records-in-excel/
• https://goodly.co.in/find-duplicate-values-dimension-table-powerbi/

Data Cleansing through Transformation
• Some cleaning is done manually in Excel before the extraction phase – especially when the source is a flat file.
• Data cleansing can be done during extraction or after loading
– During extraction: using Power Query or the M language
– After loading: using DAX
• Data cleansing also often utilizes transformation techniques

Data Transformation
• The transform step applies a set of rules to transform the data from the source to the target.
• Transformation
– Changes the extracted data to a format and structure that conform to the target data model.
– Some additional data cleansing can also be part of the transformation process.
– Also supports analysis through calculations based on data columns and rows.

Purposes of Transformation
• Extract desired raw data from a complex text structure
• Transform to a clean data set
• Transform to integrate/consolidate data sources
• Transform to the desired data model
• Transform to analyze
– Some analytical methods or models require data input in a specific structure.
• Transform to visualize
– Certain visualization forms/tools require data in a specific structure.

Transform
• The transform task applies a set of rules to transform the data from the source to the target.
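As a toy illustration of such a rule set, the sketch below normalizes two classic value-level mismatches: different representations of the same value ('M', 'Male', '1') and American-format dates. The mapping table and field names are hypothetical:

```python
from datetime import datetime

# Rule 1: map differing representations to one target form (illustrative mapping)
GENDER_MAP = {"m": "Male", "male": "Male", "1": "Male",
              "f": "Female", "female": "Female", "2": "Female"}

def transform_row(row):
    """Apply value-level transformation rules to a single source row."""
    out = dict(row)
    out["gender"] = GENDER_MAP.get(str(row["gender"]).strip().lower(), "Unknown")
    # Rule 2: normalize American 'mm/dd/yy' dates to ISO 'yyyy-mm-dd'
    out["date"] = datetime.strptime(row["date"], "%m/%d/%y").strftime("%Y-%m-%d")
    return out

print(transform_row({"gender": "M", "date": "06/08/24"}))
# {'gender': 'Male', 'date': '2024-06-08'}
```

Unknown representations fall back to a sentinel value ("Unknown") rather than being dropped, so they can be inspected and the rule table extended later.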
• Common problems and issues can be categorized as follows (Panos Vassiliadis and Alkis Simitsis, “Extraction, Transformation, and Loading,” in Encyclopedia of Database Systems, pp. 1095–1101):

Schema-level problems
• Naming conflicts, where the same name is used for different objects (homonyms) or different names are used for the same object (synonyms)
• Structural conflicts: different representations of the same object in different sources, or converting data types between sources and the warehouse

Record-level problems
• Duplicated or contradicting records
• Consistency problems concerning the granularity or timeliness of data (e.g., sales per day vs. sales per year) or references to different points in time (e.g., current sales as of yesterday for one source vs. as of last month for another)
• Missing identifier/primary key

Value-level problems
• Naming inconsistency: SPSU, Southern Poly, Southern Poly Technic
• Format masks, e.g., different value representations (for sex: ‘Male’, ‘M’, ‘1’) or different interpretations of the values (date/time formats: American ‘mm/dd/yy’ vs. European ‘dd/mm/yy’)
• Missing values (no ZIP code), truncated values
• Other value-level problems include surrogate key assignment, substituting constants, setting values to NULL or DEFAULT based on a condition, or using common SQL functions like UPPER, TRUNC, and SUBSTR

Common problems
• Source data format and structure have changed
• No documentation of data definitions; conflicting rules

Common Data Problems and Issues
Common data problems and issues during data preparation involve structural and semantic differences between data sources and the target, as well as data quality and integrity issues.
They can be categorized as follows:

Schema-level problems
• Structural and semantic differences: mainly structural conflicts, e.g., different representations of the same object in different sources; data type mismatches between sources and the target; naming conflicts, where the same name is used for different objects (homonyms) or different names are used for the same object (synonyms).
• Transformation technique: column transformation or table transformation.

Record-level problems
• Structural and semantic differences: consistency problems concerning the granularity or timeliness of data (e.g., sales per day vs. sales per year) or references to different points in time (e.g., current sales as of yesterday for one source vs. as of last month for another).
• Data quality and integrity issues: duplicated or contradicting records; missing identifier/primary key.
• Transformation technique: row transformation.

Value-level problems
• Structural and semantic differences: naming inconsistency (SPSU, Southern Poly, Southern Poly Technic); format differences (e.g., date/time formats: American ‘mm/dd/yy’ vs. European ‘dd/mm/yy’); different value representations or interpretations (e.g., for gender: ‘Male’, ‘M’, ‘1’); other value-level problems such as surrogate key assignment, substituting constants, setting values to NULL or DEFAULT based on a condition, or using common SQL functions like UPPER, TRUNC, and SUBSTR.
• Data quality and integrity issues: missing values (no ZIP code) or truncated values; misspellings and typos; out-of-range values; vague empty values; incorrect values; values not matching their column; data range inconsistency (e.g., GPA).
• Transformation technique: value transformation.

General problems
• Source data format and structure have changed; no documentation of data definitions; conflicting rules.
• Data cleansing and transformation may solve part of these problems.

Common Transform Operations and Functions
• Column (attribute) based
– Column mapping
– Splitting a column
– Deriving column(s)
– Combining columns and column formulas
– Projection (filtering of columns)
– Data type/format conversion
– Column-based calculations
• Row based
– Sorting
– Filtering
– Removing duplicates
– Removing blanks
– Aggregation
– Grouping
• Value based
– Translating values
– Value encoding
– Strings: removing spaces, changing case
• Table based
– Assigning identifiers or establishing relationships/foreign keys
– Transposing and pivoting
– Joining
– Disaggregation or normalization
– Lookup and validation
• Key reading: common transformation operations in Power BI – “Power Query Tips: 10 Most Useful Commands” https://www.goskills.com/Excel/Resources/Power-Query-tips

Transformation Design
• Design a strategy for extraction/transformation through analysis, sampling, and testing
• Ensure the integration/transformation process is correct and error-free; plan and validate transformation logic
• Transformation design
– Data analysis
– Source/target mapping
• Techniques
– Column-level analysis and planning table
– Identify patterns and cases
• Transformation patterns/cases
– Get a list of unique values
– Split a flat table into joined tables
– Create a unique identifier
– See the next few slides

Where to Transform?
https://www.youtube.com/watch?v=OAlys79j81Q

Transform to a Better Model
• Cases
– Big flat table
– Multiple small tables
– Consolidated fact table (join)
– Aggregated rows – summarize to a certain level
– Condensed time dimension
• https://learn.microsoft.com/en-us/power-bi/guidance/import-modeling-data-reduction

Tools to Use
• Power BI or other self-service BI tools with transformation capabilities (including Excel with Power Query)
– You need to do the job quickly and easily but with some kind of logging and tracing
– The data does not need too much manual cleaning or transformation
– You have more than 100K records, or more than 10 tables
• Excel (formulas)
– Smaller data sets: fewer than tens of columns and thousands of records
– You need to do the job quickly and easily
– There is a logical pattern to cleaning the data, and it is easy enough to implement with Excel functions
– Or the logical pattern is hard to define, and you need to clean the data manually
• Python or another scripting language
– You need to document your process
– You plan to do the job repeatedly (e.g., more than 10 files to process)
– There is a logical pattern to cleaning the data, but it is hard to implement with Excel functions

Key Readings
• Data cleaning
– The Ultimate Guide to Data Cleaning https://towardsdatascience.com/the-ultimate-guide-to-data-cleaning-3969843991d4
– What is Data Scrubbing: A Beginner’s Guide to Cleaning Data the Right Way https://www.simplilearn.com/what-is-data-scrubbing-article
– Data cleaning in Excel https://support.microsoft.com/en-us/office/top-ten-ways-to-clean-your-data-2844b620-677c-47a7-ac3e-c2e157d1db19
• Transformation with Power Query (including Power Query in Excel, which shares the same engine as Power BI Desktop). Find more Power Query information in the lab 3 guide.
– Exploring Excel’s Get & Transform Functionality https://www.toptal.com/finance/excel-experts/excel-get-and-transform
– Excel Power Query Tutorial (Get & Transform) + Examples https://excelchamps.com/power-query/
– Power Query 10 Most Useful Commands https://www.goskills.com/Excel/Resources/Power-Query-tips
• Power Query Formula Language (M)
– Basics of M: Power Query Formula Language https://radacad.com/basics-of-m-power-query-formula-language
– Differences between the M Language and DAX in Power BI https://www.sqlshack.com/differences-between-the-m-language-and-dax-in-power-bi/

Additional Resources
• More cleaning and transformation techniques:
– https://www.xplenty.com/blog/data-transformation-explained/ (focus on the types of data)
– https://www.myexcelonline.com/blog/50-things-you-can-do-with-excel-power-query/
– https://www.softwareadvice.com/resources/excel-data-cleaning-integration-techniques/
– https://www.digitalvidya.com/blog/data-cleaning-techniques/
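When a cleaning job must repeat (the Python case under “Tools to Use”), it helps to express each fix as a small, composable step that can run over many files. A minimal sketch; the steps and field names are hypothetical:

```python
def strip_spaces(row):
    """Remove leading/trailing spaces from every string value in a row."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}

def drop_blank_name(rows):
    """Drop rows whose mandatory 'name' field is missing or empty."""
    return [r for r in rows if r.get("name")]

def clean(rows):
    """Pipeline: run each cleaning step in a documented, repeatable order."""
    rows = [strip_spaces(r) for r in rows]
    rows = drop_blank_name(rows)
    return rows

raw = [{"name": "  Ann "}, {"name": ""}, {"name": "Bo"}]
print(clean(raw))  # [{'name': 'Ann'}, {'name': 'Bo'}]
```

Because each step is a named function, the script doubles as documentation of the cleaning process, which supports the tracing and reporting steps of the cleanse process.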