Data Profiling views in Power Query (Excel and Power BI)

avatar
(Edited)

Data profiling views in Power query were first introduced in Power BI and has since been rolled out to the Power query editor in Excel 365.  It offers a quick way of exploring data to identify bad data. Data profiling comes in useful when you are working with large sets of data and you want to quickly visualize the composition of that data.  And even add quick fixes and filters before you load the data for further use!!

Column Quality


When you load data into Power Query, you will note at the top of the column a turquoise bar.  By hovering over this bar, you will see a summary of the quality of the data contained in that column.  You can see how much of the data is valid and how many errors there are, if any. In addition to this you will also see details of empty cells contained within the column.

ata profiling views in power query

By selecting the 3 dots at the bottom of the profile information you can quickly make changes to your data, like removing errors or duplicates. 

If the column contains errors, this turquoise line will show broken with the error % in red.

You can get further Column quality details from the View tab on the ribbons and selecting Column quality.  This will give you an expanded view of the column quality allowing you to quickly spot potential problems in the column.

ata profiling views in power query

When you are working with a large set of data, the benefit of this is that it allows you quickly find errors, empty or duplicates. These would be difficult to spot by scroll though the data.  You can make the fixes to your data before you load it into your model and encounter problems.

In Power query, the load preview by default is 1000 row.  By default, the column quality also only looks at the first 1000 row.  You can verify this by the status bar at the bottom of the Power query window.

ata profiling views in power query

To change the profiling so it analyses the entire column of data, select the profiling status in the status bar. Then select Column profiling based on the entire data set.

ata profiling views in power query

Column Distribution


Column Distribution can be found on the View tab of the ribbons.

Data profiling views in power query

It is used to add an additional layer of information to the column quality.  Column Distribution shows you the number of distinct and unique values in a column.  You are also given a visualization showing the distribution of these values.

Data profiling views in power query

This Name column contains 5 distinct values, each value is represented by a bar in the chart.  From this chart we can see 4 of the distinct values have approximately the same frequency whereas the final value appears less often.

If a column contains an error, the column distribution will not show.  You must first correct any errors before the profiling will show up.

Column Distribution use cases


Although this provides a nice user experience, unless you know what data you are expecting and the expected distribution, it is hard to see what actions can be taken from this feature.  However, if you are expecting to see 5 unique values and you only have 4, then you know you are missing some data. 

Suppose each month the area sales managers add a file to a shared folder.  Its your role to consolidate and report on this data and you are expecting 100 files each month.  As long as each file has a unique identifier when merged in Power query, you can quickly check to ensure you have the correct number of unique values.  Without the need of visiting the folder.

Another use case would be when merging tables and you want to ensure you do not end up with duplicates in your final data set.  Column distribution and profiling can really help keep track of this at a quick glance and you can even remove duplicates from within the profiling.

Column Profile


The Column Profile returns the most detailed information on data within a column.  It can be found on the View tab of the ribbons.

Data profiling views in power query

Once selected, a new pane will open giving actionable details on a selected column.

Data profiling views in power query

On the left are column statistics.  Values such as Errors, Empty Distinct and Unique we were able to get from the column distribution so there is not much exciting here.  Except you can now also see the Min and the Max values in a column.  For me, I have found this useful when I am working with a Date column and I quickly want to establish the earliest date.

Data profiling views in power query

 On the right we have an interactive chart for Value distribution.  The default is to show by value, however you can change how the values of the distribution chart are grouped.

Data profiling views in power query

Depending on the nature of the column, the grouping options will differ.  The distribution chart above for dates looks considerably different when grouped by year.

Data profiling views in power query

In addition to just viewing different grouping distribution, by hovering over a column of the chart you can add a full filter to your data set.

Data profiling views in power query

By selecting Equals, all data that does not equal the value of that column will be filtered from your data.  By selecting Does Not Equal, all data that does not equal the values of the column will be filtered from your data set.

These filters are added as an applied step in the transformations pane and can be removed by deleting the step.

Data profiling views in power query

Unfortunately, there is a limitation with the Value Distribution.  You can not select more than one column at a time from the chart which means you can are limited on what you are filtering.  However good use of the grouping can overcome some of this.

Conclusion - Data profiling views in Power Query


Data profiling views are for sure a great user experience. The ability to quickly identify columns with errors and blanks and fix these before you import the data to a model for further analysis is a great benefit and increases efficiency when working with larger data sets.

For a bird’s eye view of the data, the distribution information is useful, but its not very actionable unless you know what results you are expecting. If you do know what results you are expecting, then the distribution comes in very handy.

The idea of Data profiling is to ensure data loaded to a model is as clean as possible. It helps you identify and resolved the issues from bad data before they get into your model and become a problem.  These new data profiling views in Power Query do their bit to make sure this process is as efficient as possible.

SIGN UP FOR OUR NEWSLETTER TODAY – GET EXCEL TIPS TRICKS AND LEARN AND EARN ACTIVITIES TO YOUR INBOX

SIGN UP

Cross posted from my blog with SteemPress : https://theexcelclub.com/data-profiling-views-in-power-query-excel-and-power-bi/



0
0
0.000
7 comments
avatar

This is awesome, I wish I knew about the first point for finding errors. I had to fix that on my own but using it that way would’ve been so much nicer and faster! I will check that out with one of my versions of my data before the final.

0
0
0.000
avatar

Have you been using Power query @complexty?

0
0
0.000
avatar
(Edited)

I use it routinely yes! I mainly use it for transforming large datasets which I guess is the intended purpose but I use it to add additional columns that weren't included in the data extract. I often use conditional column.

Having this little nugget in my pocket will definitely help make things easier! Often times I don't realize I have an error value until I go to put the information into a pivot table.

0
0
0.000
avatar

Conditional columns rock, power query in general rocks. Im so glad you found the article useful :-)

0
0
0.000
avatar

This looks pretty powerful. I've been using KNIME as an analytical platform for my day job lately, and one of its most valuable features (to me) is the. "Possible Value" lists it can generate for columns. It doesn't give you a quick snapshot or tally of distribution in the same way this does.

Sadly, Excel doesn't play all that well with 40,000 rows of data (and about 80 columns) for the data sets I've been playing with in my day job. At least, not on a machine with a lowly 8GB of RAM :D

0
0
0.000
avatar

40K row of data can be handled with ease if you are using Powe query to connect to the data source. I have used over 15m rows problem-free on an 8g

You should check out power query, I bet you would love it

0
0
0.000
avatar

Congratulations @theexcelclub! You have completed the following achievement on the Hive blockchain and have been rewarded with new badge(s) :

You distributed more than 30000 upvotes. Your next target is to reach 31000 upvotes.

You can view your badges on your board And compare to others on the Ranking
If you no longer want to receive notifications, reply to this comment with the word STOP

Do not miss the last post from @hivebuzz:

The Hive community is in mourning. Farewell @lizziesworld!
Support the HiveBuzz project. Vote for our proposal!
0
0
0.000