Category: Data Wrangling

SQL Tips: How to generate INSERT statements from the output of a SELECT statement in Toad

If you need to build INSERT statements for many rows of data, whether to insert that data into another table or into the same table in a different environment, Toad offers a convenient way to do it. This is a common need when moving data around.

In Toad, execute the appropriate select statement on the source table …

Select * from [my_schema].[my_simple_table];

I used a simple “select *” above, but your SQL statement can be any valid statement that returns the data you want to insert into the other table. You may select specific columns, apply filters, add joins, or use any other valid SQL operation.

Let’s say you want to insert the output into another table in a different schema.

Right-click on the output result data, and click “Export Dataset…”

From the “Export format” drop-down menu, choose “Insert Statements”.

In the Output section of the Export Dataset dialog box, enter the location and name of the Insert Script file that will be generated.

There are several other options you can set, but we won’t cover them all here.

If you only want to generate inserts for some of the rows, select “Export only selected rows”.

If you need to specify the schema of the target table, select “Include schema name”.

In the Table section, enter the name of the target schema and table.

Note that there are also data masking options available, which can be very useful if, for example, you are moving data from a Production environment to a Non-Production environment and do not want to expose sensitive values there.

After you have set the parameters relevant to your scenario, click “OK”.

The insert script containing all the data will be generated with the filename and location you specified, and the INSERT statements will reference the schema and table you entered.
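
For illustration, suppose the hypothetical my_simple_table has id and name columns and the target schema is target_schema. The generated script would then contain one statement per row, along these lines (the exact formatting depends on your Toad version and export options):

INSERT INTO target_schema.my_simple_table (id, name) VALUES (1, 'Alice');
INSERT INTO target_schema.my_simple_table (id, name) VALUES (2, 'Bob');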

Thanks for reading!

Data Cleaning Methods

Data cleaning is an essential step in the data preprocessing pipeline when preparing data for analytics or data science. It involves identifying and correcting or removing errors, inconsistencies, and inaccuracies in a dataset to improve its quality and reliability, and it should be done before the data is used in analyses, reporting, development, or integration. Here are some common data cleaning methods:

Handling missing values:

  • Delete rows or columns with a high percentage of missing values if they don’t contribute significantly to the analysis.
  • Impute missing values by replacing them with a statistical measure such as mean, median, mode, or using more advanced techniques like regression imputation or k-nearest neighbors imputation.
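
A minimal pandas sketch of both approaches (the column names, values, and the 50% threshold are invented for illustration):

import pandas as pd

df = pd.DataFrame({
    "age": [25, None, 31, None, 40],
    "city": ["NY", "LA", None, "NY", "LA"],
    "notes": [None, None, None, None, "call back"],
})

# Drop columns where more than half of the values are missing (drops "notes")
df = df.loc[:, df.isna().mean() <= 0.5]

# Impute the remaining gaps: median for numeric, mode for categorical
df["age"] = df["age"].fillna(df["age"].median())
df["city"] = df["city"].fillna(df["city"].mode()[0])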

Handling categorical variables:

  • Encode categorical variables into numerical representations using techniques like one-hot encoding, label encoding, or target encoding.
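
A quick pandas sketch of two of these encodings (the column and category names are hypothetical):

import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red", "green"]})

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(df, columns=["color"], prefix="color")

# Label encoding: map each category to an integer code
df["color_code"] = df["color"].astype("category").cat.codes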

Removing duplicates:

  • Identify and remove duplicate records based on one or more key variables.
  • Be cautious when removing duplicates, as sometimes duplicated entries may be valid and intentional.
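
In pandas this might look like the following (the key columns are made up):

import pandas as pd

df = pd.DataFrame({
    "customer_id": [101, 101, 102],
    "order_date": ["2024-01-05", "2024-01-05", "2024-01-06"],
})

# Review the duplicates before deleting anything; some may be intentional
print(df[df.duplicated(subset=["customer_id", "order_date"], keep=False)])

# Keep the first occurrence of each key combination
df = df.drop_duplicates(subset=["customer_id", "order_date"], keep="first")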

Handling outliers:

  • Identify outliers using statistical methods like z-scores, box plots, or domain knowledge.
  • Decide whether to remove outliers or transform them based on the nature of the data and the analysis goals.
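
For example, flagging outliers by z-score in pandas (the values and the 3-standard-deviation cutoff are illustrative):

import pandas as pd

s = pd.Series(list(range(10, 30)) + [300])

# z-score: each value’s distance from the mean, in standard deviations
z = (s - s.mean()) / s.std()

# Flag rather than silently delete; only the value 300 exceeds the cutoff here
outliers = s[z.abs() > 3]
print(outliers)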

Correcting inconsistent data:

  • Standardize data formats: Convert data into a consistent format (e.g., converting dates to a specific format).
  • Resolve inconsistencies: Identify and correct inconsistent values (e.g., correcting misspelled words, merging similar categories).
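
Both fixes sketched in pandas (the values are made up, and format="mixed" requires pandas 2.0 or later):

import pandas as pd

df = pd.DataFrame({
    "signup": ["2024-01-05", "January 6, 2024", "2024/01/07"],
    "state": ["ny", "N.Y.", "New York"],
})

# Standardize mixed date strings into a single datetime type
df["signup"] = pd.to_datetime(df["signup"], format="mixed")

# Merge variant spellings into one canonical category
df["state"] = df["state"].str.lower().replace(
    {"ny": "New York", "n.y.": "New York", "new york": "New York"})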

Dealing with irrelevant or redundant features:

  • Remove irrelevant features that do not contribute to the analysis or prediction task.
  • Identify and handle redundant features that provide similar information to avoid multicollinearity issues.
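
For instance, dropping an irrelevant column and spotting a redundant one via correlation (the columns are hypothetical):

import pandas as pd

df = pd.DataFrame({
    "row_id": [1, 2, 3, 4],
    "height_cm": [170, 165, 180, 175],
    "height_in": [66.9, 65.0, 70.9, 68.9],
    "weight_kg": [70, 60, 80, 75],
})

# Drop a column known to carry no signal for the task
df = df.drop(columns=["row_id"])

# Near-perfectly correlated pairs are candidates for removal;
# here height_cm and height_in are the same measurement in different units
print(df.corr().abs())
df = df.drop(columns=["height_in"])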

Data normalization or scaling:

  • Normalize numerical features to a common scale (e.g., min-max scaling or z-score normalization) to prevent certain features from dominating the analysis due to their larger magnitudes.
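
Both approaches in pandas, on made-up values:

import pandas as pd

s = pd.Series([5.0, 20.0, 50.0, 100.0])

# Min-max scaling maps the values onto the [0, 1] range
minmax = (s - s.min()) / (s.max() - s.min())

# Z-score normalization centers the values on 0 with unit standard deviation
zscore = (s - s.mean()) / s.std()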

Data integrity issues:

  • Check for data integrity problems such as inconsistent data types, incorrect data ranges, or violations of business rules.
  • Resolve integrity issues by correcting or removing problematic data.
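
A sketch of simple integrity checks in pandas (the rules and columns are invented for illustration):

import pandas as pd

df = pd.DataFrame({"age": ["34", "29", "-5"], "discount": [0.10, 0.25, 1.70]})

# Enforce the expected data type; unparseable entries become NaN for review
df["age"] = pd.to_numeric(df["age"], errors="coerce")

# Flag rows that violate simple business rules
bad_age = df[df["age"] < 0]              # age must be non-negative
bad_discount = df[df["discount"] > 1]    # discount is a fraction between 0 and 1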

It’s important to note that the specific data cleaning methods that need to be applied to a dataset will vary depending on the nature of the dataset, the analysis goals, and domain knowledge. It’s recommended to thoroughly understand the data and consult with domain experts when preparing to perform data cleaning tasks.

Transpose vs. Pivot vs. Unpivot in Power BI

In this post, I will quickly show how to perform 3 key transformations in Power BI – Transpose, Pivot, and Unpivot – and what impact each has on the data it is applied to. These transformations are often needed to reshape the original data into the form required for the Power BI model and reports. We will use a simple example so that you can clearly see the effect of each transformation.

This is the original dataset – Student Grades:

Transpose

Transpose the data by going to the Transform ribbon and clicking “Transpose”.

This is what the data looks like after the Transpose transformation is applied. The rows are converted into columns, and the columns are converted into rows. Default headers / column names are applied.
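
For readers who think in pandas, here is a rough Python equivalent of the same reshaping (the Student Grades values are made up here, assuming a layout with Subject, Name, and Score columns, which matches the pivot step below):

import pandas as pd

# Hypothetical stand-in for the Student Grades table
grades = pd.DataFrame({
    "Subject": ["Math", "Math", "Science", "Science"],
    "Name": ["Ana", "Ben", "Ana", "Ben"],
    "Score": [90, 85, 88, 92],
})

# Transpose: rows become columns and columns become rows,
# with default integer column labels
transposed = grades.T
print(transposed)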

Pivot Column

Starting with the original dataset, we will now Pivot it. Go to the Transform ribbon and click “Pivot Column”.

The Pivot Column dialog opens. As the dialog indicates, it will use the values in the Name column to create new columns, and will use the Score column as the Values Column.

This is what the data looks like after Pivot is applied as described above. Pivot Column converts the selected Name column values into column headers.
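
A pandas sketch of the same operation, using the same made-up grades table as in the transpose example (redefined here so the snippet stands alone):

import pandas as pd

grades = pd.DataFrame({
    "Subject": ["Math", "Math", "Science", "Science"],
    "Name": ["Ana", "Ben", "Ana", "Ben"],
    "Score": [90, 85, 88, 92],
})

# Pivot: each distinct Name value becomes a new column,
# filled with the matching Score values
pivoted = grades.pivot(index="Subject", columns="Name", values="Score")
print(pivoted)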

Unpivot Columns

Starting with the pivoted dataset from the previous step, we will now Unpivot it. First, select all 4 name columns using “Ctrl + Click”.

Then, go to the Transform ribbon and click “Unpivot Columns”. This is what the data looks like after the Unpivot Columns transformation is applied. Unpivot Columns converts the selected columns (the name columns in this case) into attribute–value pairs: the former column names go into the Attribute column, and their values go into the Value column.
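
And a pandas sketch of the unpivot, starting from a small wide table shaped like the pivoted result (only two name columns here, versus four in the post’s screenshots):

import pandas as pd

# Wide table with one column per student name
pivoted = pd.DataFrame({
    "Subject": ["Math", "Science"],
    "Ana": [90, 88],
    "Ben": [85, 92],
})

# Unpivot (melt): the name columns collapse into attribute–value pairs,
# matching Power BI’s default Attribute and Value column names
unpivoted = pivoted.melt(id_vars="Subject", var_name="Attribute", value_name="Value")
print(unpivoted)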

Thanks for reading!