How to transform data
Data transformation can increase the efficiency of analytic and business processes and enable better data-driven decision-making. The first phase of data transformations should include things like data type conversion and flattening of hierarchical data.
These operations shape data to increase compatibility with analytics systems. Data analysts and data scientists can implement further transformations additively as necessary as individual layers of processing.
Each layer of processing should be designed to perform a specific set of tasks that meet a known business or technical requirement. Data transformation serves many functions within the data analytics stack.
Explain Data Transformation Methods with appropriate example and sample calculations?
1. Extraction and parsing
In the modern ELT process, data ingestion begins with extracting information from a data source, followed by copying the data to its destination. Initial transformations are focused on shaping the format and structure of data to ensure its compatibility with both the destination system and the data already there. Parsing fields out of comma-delimited log data for loading to a relational database is an example of this type of data transformation.
2. Translation and mapping
Some of the most basic data transformations involve the mapping and translation of data. For example, a column containing integers representing error codes can be mapped to the relevant error descriptions, making that column easier to understand and more useful for display in a customer-facing application. Translation converts data from formats used in one system to formats appropriate for a different system. Even after parsing, web data might arrive in the form of hierarchical JSON or XML files but need to be translated into the row and column data for inclusion in a relational database.
3. Filtering, aggregation, and summarization
Data transformation is often concerned with whittling data down and making it more manageable. Data may be consolidated by filtering out unnecessary fields, columns, and records. Omitted data might include numerical indexes in data intended for graphs and dashboards or records from business regions that aren't of interest in a particular study. Data might also be aggregated or summarized. by, for instance, transforming a time series of customer transactions to hourly or daily sales counts. BI tools can do this filtering and aggregation, but it can be more efficient to do the transformations before a reporting tool accesses the data.
4. Enrichment and imputation
Data from different sources can be merged to create denormalized, enriched information. A customers transactions can be rolled up into a grand total and added into a customer information table for quicker reference or for use by customer analytics systems. Long or freeform fields may be split into multiple columns, and missing values can be imputed or corrupted data replaced as a result of these kinds of transformations.
5. Indexing and ordering
Data can be transformed so that its ordered logically or to suit a data storage scheme. In relational database management systems, for example, creating indexes can improve performance or improve the management of relationships between different tables.
6. Anonymization and encryption
Data containing personally identifiable information, or other information that could compromise privacy or security, should be anonymized before propagation. Encryption of private data is a requirement in many industries, and systems can perform encryption at multiple levels, from individual database cells to entire records or fields.
7. Modeling, typecasting, formatting, and renaming
Finally, a whole set of transformations can reshape data without changing content. This includes casting and converting data types for compatibility, adjusting dates and times with offsets and format localization, and renaming schemas, tables, and columns for clarity.