Data transformation in IT 7123
About Jack Zheng
Faculty of IT at Kennesaw.edu
Data Cleaning and Transformation
Jack G. Zheng
Spring 2024
https://idi.kennesaw.edu/it7123
IT 7123 BI

Overview
• Data problems and issues when preparing data
• Data cleansing – techniques and approaches
• Data transformation – transformation tasks, tactics, and sequence
• Common cases and best practices

Data Cleansing and Scrubbing
• Data cleansing (or just data cleaning)
– The process of identifying and correcting data issues to improve data quality (accuracy, completeness, consistency, uniformity, and validity) – see next slide.
– Main reading: https://towardsdatascience.com/the-ultimate-guide-to-data-cleaning-3969843991d4
• Data scrubbing
– “…the procedure of modifying or removing incomplete, incorrect, inaccurately formatted, or repeated data in a data set.”
– Implies a more intense level of cleaning
– Considered a sub-area of data cleaning
– https://www.simplilearn.com/what-is-data-scrubbing-article
• Data is never clean. Insights and analysis are only as good as the data you are using.

Data Quality Means
• Validity – the degree to which the data conform to defined business rules or constraints.
– Data-type constraints: values in a particular column must be of a particular data type, e.g., boolean, numeric, date, etc.
– Range constraints: typically, numbers or dates should fall within a certain range.
– Mandatory constraints: certain columns cannot be empty.
– Unique constraints: a field, or a combination of fields, must be unique across a dataset.
– Set-membership constraints: values of a column come from a set of discrete values, e.g., enum values. For example, a person’s gender may be male or female.
– Foreign-key constraints: as in relational databases, a foreign key column cannot have a value that does not exist in the referenced primary key.
– Regular expression patterns: text fields that have to match a certain pattern. For example, phone numbers may be required to have the pattern (999) 999–9999.
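Most of these constraint types can be checked mechanically. A minimal sketch in Python (field names, allowed values, and the phone pattern are hypothetical, for illustration only):

```python
import re

def check_record(rec):
    """Return a list of validity violations for one record (illustrative rules)."""
    errors = []
    # Mandatory constraint: certain columns cannot be empty
    if not rec.get("name"):
        errors.append("name is mandatory")
    # Data-type and range constraints: age must be an int in a plausible range
    if not isinstance(rec.get("age"), int) or not 0 <= rec["age"] <= 120:
        errors.append("age out of range or wrong type")
    # Set-membership constraint: status must come from a discrete set
    if rec.get("status") not in {"active", "inactive"}:
        errors.append("status not in allowed set")
    # Regular expression pattern: US-style phone number (999) 999-9999
    if not re.fullmatch(r"\(\d{3}\) \d{3}-\d{4}", rec.get("phone", "")):
        errors.append("phone format invalid")
    return errors

print(check_record({"name": "Ann", "age": 30, "status": "active",
                    "phone": "(404) 555-1234"}))  # []
print(check_record({"name": "", "age": 1000, "status": "??", "phone": "8-Jun"}))
```

Each rule maps directly to one constraint type above; in practice such checks run during the inspection step, before any values are changed.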
– Cross-field validation: certain conditions that span multiple fields must hold. For example, a patient’s date of discharge from the hospital cannot be earlier than the date of admission.
• Accuracy – the degree to which the data is close to the true values and associated with the correct entity.
• Completeness – the degree to which all required data is known.
– Missing data happens for various reasons. One can mitigate the problem by questioning the original source if possible, say re-interviewing the subject.
– Chances are, the subject will either give a different answer or be hard to reach again.
• Consistency – the degree to which the data is consistent, within the same data set or across multiple data sets.
– Inconsistency occurs when two values in the data set contradict each other.
– A valid age, say 10, might not match the marital status, say divorced. A customer is recorded in two different tables with two different addresses.
• Uniformity – the degree to which the data is specified using the same unit of measure.
– Measurement units: currency/money, length, weight, time, etc.
(Source: https://towardsdatascience.com/the-ultimate-guide-to-data-cleaning-3969843991d4)

Most Common Data Quality Issues
• Redundancy and duplicates
– Affect aggregations such as sums (totals) and record counts
• Incomplete data; missing values; null values
• Inconsistency
– Data type inconsistency
– Row conflicts
– Value inconsistency: e.g., SPSU, Southern Poly
– Name inconsistency
– Name/value inconsistency: age in the address column, etc.
– Text data in numeric fields
• Errors
– Out-of-range values (e.g., age = 1000)
– Impossible data combinations: e.g., gender = MALE but pregnant = TRUE
• Spelling and encoding issues
– Typos
– System characters, line breaks, unwanted symbols, etc.
– Leading/trailing spaces
• Format issues in composite values, like dates, addresses, and currency
• Lack of identifiers
• Noise (unwanted, undesired, irrelevant, or interfering data or text: outliers, empty space, irregular characters, etc.)

Exercise: What are the issues?
• Some (anonymized) data obtained from a client to build a dashboard

What Caused the Bad Data?
• Extraction errors
– Especially in webpage scraping
• Source errors
– Quality issues inherited from the source
– Lack of data input control
– Mistakes or incorrect settings when exporting
• A typical data format error from an Excel source – some values are automatically recognized as dates by Excel: “8-Jun” is really “6-8”.

Cleanse Process
• Inspection: detect unexpected, incorrect, and inconsistent data.
• Cleaning: fix or remove the anomalies discovered.
• Verifying: after cleaning, inspect the results to verify correctness. Document or log the data cleaning process and activities for later tracing and auditing.
• Reporting: record a report about the changes made and the quality of the currently stored data.

Inspection
• Data examination is a necessary step to know the data and find problems
– Examine data types, models, formats, size, and condition (quality)
– Make sure the data is clean, of good quality, and meaningful.
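The examination step can be partly automated. A minimal column-profiling sketch (the list-of-dicts record format and field names are assumptions for illustration):

```python
def profile(records):
    """Per-column null count, distinct-value count, and observed value types."""
    summary = {}
    columns = {k for rec in records for k in rec}
    for col in sorted(columns):
        values = [rec.get(col) for rec in records]
        summary[col] = {
            "nulls": sum(v is None or v == "" for v in values),
            "distinct": len({v for v in values if v is not None}),
            "types": sorted({type(v).__name__ for v in values if v is not None}),
        }
    return summary

rows = [
    {"school": "SPSU", "age": 21},
    {"school": "Southern Poly", "age": None},
    {"school": "SPSU", "age": "22"},  # text data in a numeric field
]
print(profile(rows))
```

Mixed entries in a column's "types" list (here, int and str for age) flag exactly the kind of type inconsistency listed among the common quality issues.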
• Approaches
– Visual scan with the help of basic data operations: sorting, filtering, querying
– Data profiling or statistical profiling
– Visualization
• Inspection checklist
– It is helpful to prepare a checklist (similar to a vehicle multi-point inspection)

Data Profiling or Statistical Profiling
• Profiling is an automatic way to scan and report data issues
– It is not a comprehensive check, so other methods are still needed
• Most tools, including Power BI, provide this function
– https://learn.microsoft.com/en-us/power-query/data-profiling-tools

Cleaning: Dealing with Missing Data
• Drop observations that have missing values – but this loses information, so be mindful before removing them.
• Replace missing values based on other observations – again, there is a risk of losing data integrity, because you may be operating from assumptions rather than actual observations.
• Alter the way the data is used so that null values are handled explicitly.

Cleaning: Dealing with Duplicates
• What is a duplicate?
– The whole record is the same – perhaps only the identifier differs
– Duplicated values in unique columns
• It often happens when
– Data are combined from different sources
– A user hits the submit button twice, thinking the form wasn’t actually submitted
– A request to an online booking system is submitted twice to correct wrong information entered the first time
• Detection
– Usually run a quick “group by” query or transformation to count rows
– It may not be a duplicate even if all fields are the same – check definitions and contexts
• We had a case where two POS transaction records were completely identical. It turned out they came from two registers, but the register information was omitted entirely from the exported data.
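The “group by and count” detection idea can be sketched in a few lines of Python (the sales records and column names are hypothetical):

```python
from collections import Counter

def find_duplicates(records, columns):
    """Group rows by the given columns and return groups appearing more than once.
    Exclude auto-generated identifiers from `columns` so records that differ
    only by ID still group together."""
    counts = Counter(tuple(rec[c] for c in columns) for rec in records)
    return {key: n for key, n in counts.items() if n > 1}

sales = [
    {"id": 1, "store": "A", "amount": 9.99},
    {"id": 2, "store": "A", "amount": 9.99},  # same sale, different id
    {"id": 3, "store": "B", "amount": 5.00},
]
print(find_duplicates(sales, ["store", "amount"]))  # {('A', 9.99): 2}
```

As the POS example warns, a count above 1 is only a candidate duplicate: the register column was missing there, so apparently identical rows were in fact distinct transactions.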
– In Power Query:
• https://www.techrepublic.com/article/how-to-use-power-query-to-display-a-list-of-duplicate-values-or-records-in-excel/
• https://goodly.co.in/find-duplicate-values-dimension-table-powerbi/

Data Cleansing through Transformation
• Some cleaning is done manually in Excel before the extraction phase – especially when the source is a flat file.
• Data cleansing can be done during extraction or after loading
– During extraction: using Power Query or the M language
– After loading: using DAX
• Data cleansing also often utilizes transformation techniques

Data Transformation
• The transform step applies a set of rules to transform the data from the source to the target.
• Transformation
– Changes the extracted data to a format and structure that conform to the target data model.
– Some additional data cleansing can also be part of the transformation process.
– Also supports analysis through calculations based on data columns and rows.

Purposes of Transformation
• Extract desired raw data from a complex text structure
• Transform to a clean data set
• Transform to integrate/consolidate data sources
• Transform to the desired data model
• Transform to analyze
– Some analytical methods or models require data input in a specific structure.
• Transform to visualize
– Certain visualization forms/tools require data in a specific structure.

Transform
• The transform task applies a set of rules to transform the data from the source to the target.
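As a toy illustration of such a rule set, the sketch below normalizes two classic value-level mismatches: different representations of the same value ('M', 'Male', '1') and American-format dates. The mapping table and field names are hypothetical:

```python
from datetime import datetime

# Rule 1: map differing representations to one target form (illustrative mapping)
GENDER_MAP = {"m": "Male", "male": "Male", "1": "Male",
              "f": "Female", "female": "Female", "2": "Female"}

def transform_row(row):
    """Apply value-level transformation rules to a single source row."""
    out = dict(row)
    out["gender"] = GENDER_MAP.get(str(row["gender"]).strip().lower(), "Unknown")
    # Rule 2: normalize American 'mm/dd/yy' dates to ISO 'yyyy-mm-dd'
    out["date"] = datetime.strptime(row["date"], "%m/%d/%y").strftime("%Y-%m-%d")
    return out

print(transform_row({"gender": "M", "date": "06/08/24"}))
# {'gender': 'Male', 'date': '2024-06-08'}
```

Unknown representations fall back to a sentinel value ("Unknown") rather than being dropped, so they can be inspected and the rule table extended later.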
• Common problems and issues can be categorized as follows (Panos Vassiliadis and Alkis Simitsis, “Extraction, Transformation, and Loading,” in Encyclopedia of Database Systems, pp. 1095–1101):

Schema-level problems
• Naming conflicts, where the same name is used for different objects (homonyms) or different names are used for the same object (synonyms)
• Structural conflicts: different representations of the same object in different sources, or converting data types between sources and the warehouse

Record-level problems
• Duplicated or contradicting records
• Consistency problems concerning the granularity or timeliness of data (e.g., sales per day vs. sales per year) or references to different points in time (e.g., current sales as of yesterday for one source vs. as of last month for another)
• Missing identifier/primary key

Value-level problems
• Naming inconsistency: SPSU, Southern Poly, Southern Poly Technic
• Format masks, e.g., different value representations (for sex: ‘Male’, ‘M’, ‘1’) or different interpretations of the values (date/time formats: American ‘mm/dd/yy’ vs. European ‘dd/mm/yy’)
• Missing values (no ZIP code), truncated values
• Other value-level problems include surrogate key assignment, substituting constants, setting values to NULL or DEFAULT based on a condition, or using common SQL functions like UPPER, TRUNC, and SUBSTR

Common problems
• Source data format and structure have changed
• No documentation of data definitions; conflicting rules

Common Data Problems and Issues
Common data problems and issues during data preparation involve structural and semantic differences between data sources and the target, as well as data quality and integrity issues.
They can be categorized as follows:

Schema-level problems
• Structural and semantic differences: mainly structural conflicts, e.g., different representations of the same object in different sources; data type mismatches between sources and the target; naming conflicts, where the same name is used for different objects (homonyms) or different names are used for the same object (synonyms).
• Transformation technique: column transformation or table transformation.

Record-level problems
• Structural and semantic differences: consistency problems concerning the granularity or timeliness of data (e.g., sales per day vs. sales per year) or references to different points in time (e.g., current sales as of yesterday for one source vs. as of last month for another).
• Data quality and integrity issues: duplicated or contradicting records; missing identifier/primary key.
• Transformation technique: row transformation.

Value-level problems
• Structural and semantic differences: naming inconsistency (SPSU, Southern Poly, Southern Poly Technic); format differences (e.g., date/time formats: American ‘mm/dd/yy’ vs. European ‘dd/mm/yy’); different value representations or interpretations (e.g., for gender: ‘Male’, ‘M’, ‘1’); other value-level problems such as surrogate key assignment, substituting constants, setting values to NULL or DEFAULT based on a condition, or using common SQL functions like UPPER, TRUNC, and SUBSTR.
• Data quality and integrity issues: missing values (no ZIP code) or truncated values; misspellings and typos; out-of-range values; vague empty values; incorrect values; values not matching their column; data range inconsistency (e.g., GPA).
• Transformation technique: value transformation.

General problems
• Source data format and structure have changed; no documentation of data definitions; conflicting rules.
• Data cleansing and transformation may solve part of these problems.

Common Transform Operations and Functions
• Column (attribute) based
– Column mapping
– Splitting a column
– Deriving column(s)
– Combining columns and column formulas
– Projection (filtering of columns)
– Data type/format conversion
– Column-based calculations
• Row based
– Sorting
– Filtering
– Removing duplicates
– Removing blanks
– Aggregation
– Grouping
• Value based
– Translating values
– Value encoding
– Strings: removing spaces, changing case
• Table based
– Assigning identifiers or establishing relationships/foreign keys
– Transposing and pivoting
– Joining
– Disaggregation or normalization
– Lookup and validation
• Key reading: common transformation operations in Power BI – “Power Query Tips: 10 Most Useful Commands” https://www.goskills.com/Excel/Resources/Power-Query-tips

Transformation Design
• Design a strategy for extraction/transformation through analysis, sampling, and testing
• Ensure the integration/transformation process is correct and error-free; plan and validate transformation logic
• Transformation design
– Data analysis
– Source/target mapping
• Techniques
– Column-level analysis and planning table
– Identify patterns and cases
• Transformation patterns/cases
– Get a list of unique values
– Split a flat table into joined tables
– Create a unique identifier
– See the next few slides

Where to Transform?
https://www.youtube.com/watch?v=OAlys79j81Q

Transform to a Better Model
• Cases
– Big flat table
– Multiple small tables
– Consolidated fact table (join)
– Aggregated rows – summarize to a certain level
– Condensed time dimension
• https://learn.microsoft.com/en-us/power-bi/guidance/import-modeling-data-reduction

Tools to Use
• Power BI or other self-service BI tools with transformation capabilities (including Excel with Power Query)
– You need to do the job quickly and easily but with some kind of logging and tracing
– The data does not need too much manual cleaning or transformation
– You have more than 100K records, or more than 10 tables
• Excel (formulas)
– Smaller data sets: fewer than tens of columns and thousands of records
– You need to do the job quickly and easily
– There is a logical pattern to cleaning the data, and it is easy enough to implement with Excel functions
– Or the logical pattern is hard to define, and you need to clean the data manually
• Python or another scripting language
– You need to document your process
– You plan to do the job repeatedly (e.g., more than 10 files to process)
– There is a logical pattern to cleaning the data, but it is hard to implement with Excel functions

Key Readings
• Data cleaning
– The Ultimate Guide to Data Cleaning https://towardsdatascience.com/the-ultimate-guide-to-data-cleaning-3969843991d4
– What is Data Scrubbing: A Beginner’s Guide to Cleaning Data the Right Way https://www.simplilearn.com/what-is-data-scrubbing-article
– Data cleaning in Excel https://support.microsoft.com/en-us/office/top-ten-ways-to-clean-your-data-2844b620-677c-47a7-ac3e-c2e157d1db19
• Transformation with Power Query (including Power Query in Excel, which shares the same engine as Power BI Desktop). Find more Power Query information in the lab 3 guide.
– Exploring Excel’s Get & Transform Functionality https://www.toptal.com/finance/excel-experts/excel-get-and-transform
– Excel Power Query Tutorial (Get & Transform) + Examples https://excelchamps.com/power-query/
– Power Query 10 Most Useful Commands https://www.goskills.com/Excel/Resources/Power-Query-tips
• Power Query Formula Language (M)
– Basics of M: Power Query Formula Language https://radacad.com/basics-of-m-power-query-formula-language
– Differences between the M Language and DAX in Power BI https://www.sqlshack.com/differences-between-the-m-language-and-dax-in-power-bi/

Additional Resources
• More cleaning and transformation techniques:
– https://www.xplenty.com/blog/data-transformation-explained/ (focus on the types of data)
– https://www.myexcelonline.com/blog/50-things-you-can-do-with-excel-power-query/
– https://www.softwareadvice.com/resources/excel-data-cleaning-integration-techniques/
– https://www.digitalvidya.com/blog/data-cleaning-techniques/
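When a cleaning job must repeat (the Python case under “Tools to Use”), it helps to express each fix as a small, composable step that can run over many files. A minimal sketch; the steps and field names are hypothetical:

```python
def strip_spaces(row):
    """Remove leading/trailing spaces from every string value in a row."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}

def drop_blank_name(rows):
    """Drop rows whose mandatory 'name' field is missing or empty."""
    return [r for r in rows if r.get("name")]

def clean(rows):
    """Pipeline: run each cleaning step in a documented, repeatable order."""
    rows = [strip_spaces(r) for r in rows]
    rows = drop_blank_name(rows)
    return rows

raw = [{"name": "  Ann "}, {"name": ""}, {"name": "Bo"}]
print(clean(raw))  # [{'name': 'Ann'}, {'name': 'Bo'}]
```

Because each step is a named function, the script doubles as documentation of the cleaning process, which supports the tracing and reporting steps of the cleanse process.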