In this set of posts I will walk through a process for consuming SMAT output files into a Power BI report.
- Part 1 – importing the data and then querying / modeling the data
- Part 2 (not yet published) – designing the Power BI Report (and also provide sample files to use)
In a previous post I talked about the SharePoint Migration Assessment Tool for SharePoint 2013 being released to web. Please read that post for context on working with SMAT.
Disclaimer: I am by no means a Power BI expert. In fact this report is the first time I’ve spent any significant time developing a report. Most of the technical credit goes to my coworkers Ken Kilty and Gene Livshin.
Note: Power BI Desktop is being updated regularly. For the purposes of this article I am using the January 2017 build (version 2.42.4611.701 64-bit).
The SMAT tool outputs a summary CSV file along with 1 detailed CSV per risk being analyzed. Reading and interpreting hundred or thousands or tens of thousands of rows in a CSV file is not optimal. Instead it will be helpful to model and report on that data in an easier to consume fashion. At a high level our process will involve the following:
- Import Data – Import SMAT output files into Power BI
- Query Data – Create multiple queries on SMAT output files to filter, aggregate, and enhance into different perspectives
- Design Reports – Design dashboard pages to consume the query outputs
Importing data into Power BI is a fairly simple process. After launching Power BI Desktop click the Get Data link from the welcome screen (or click the Get Data button in top ribbon menu).
Select the CSV data type.
Navigate to the folder where the SMAT output files are located. In this walkthrough use the SiteAssessmentReport.csv file.
A preview of the data should appear. I have run a PowerShell script to clean up the data from any customer sensitive information (ex. server names, database names, URLs, users, etc.)
Note: At a later date I may blog out that PowerShell script and link it back to this article. If this is data scrubbing script is of interest please let me know in comments or contact me to prioritize this.
If everything was successful a list of fields should appear on the right hand of the screen. These fields will correspond to the comma separated fields in the CSV file.
As mentioned earlier the raw CSV data by itself is not easily readable to gain insights about the farm. The data needs to be filtered and queried differently to support the report that will be built. I can not stress enough that having a set of questions you are trying to answer is extremely important. It is not sufficient to start designing reports without some initial planning. Think of this step as modeling the data in different ways to get different perspectives on the data. Initially my personal goals with this process were to identify the following items:
- Content database statistics
- Number of sites per database?
- Size (storage) of databases?
- Web application statistics
- Number of site collections and subsites?
- Size (storage) of site collections?
- Web application risks
- Breakdown view of risks per web application?
- Slicer by web application and risk type
- Site collection risks
- Number of risks per site collection?
- Slicer by risk type
Having these goals / questions in mind I diagramed how to split the data into different queries that could then answer those questions.
Power BI has a nice set of features that allows taking a set of data, making some changes, and then creating a reference (shallow copy) or duplicate (deep copy) that altered set of data. As such I loaded the CSV data and then referenced that data into 3 separate queries, 1 of which was further split into multiple queries for risk data by web application and by site collection. The following are example steps to accomplish this.
After the data has been loaded from previous step click on Edit Queries from the ribbon menu.
Notice that the first query for the loaded CSV file includes a few modifications to promote the header row and change datatype of columns that should be numbers. Leave those steps in place.
Create three references from the source data load query (SiteAssessmentSummary) and rename the referenced queries appropriately. In this example I named them RiskFilteredSource, DatabaseMetrics, and WebApplicationMetrics (respectively the 3 boxes in the 2nd row on my hierarchy diagram above).
The next step is to start adding modifications to the queries to shape the data into the different results which will be reported on. Start with the RiskFilteredSource. Only the SiteURL and individual risk columns (Alerts, Apps, etc.) are necessary. Rather than removing the columns that aren’t needed it is recommended to select the columns that are needed (see reference for more information on why). Click Choose Columns from the ribbon menu. On the following dialog box select the columns for SiteUrl and all of the individual risk columns (Alerts, Apps, etc.) ending with WorkflowRunning2013.
Note: the set of individual risks in the CSV will be dependent on which ones were enabled when the SMAT tool was run. This examples uses the out of the box settings as of SMAT v1.1 release.
The query result should look like the following with only the columns selected still showing. That is all for the RiskFilteredSource query.
Next will be modifications to the DatabaseMetrics query. Similar to the previous query start by selecting only the columns relating to databases (e.g. ContentDbName, ContentDbServerName, and ContentDBSizeInMB) by using the Select Columns button from the ribbon menu.
There will likely be duplicate rows in the result because each row in the original CSV corresponds to a single site collection and each content database can contain 1 or more site collections. In order to clean up the duplicate rows select all three columns and then (with columns still selected) click Remove Rows –> Remove Duplicates from the ribbon menu.
The DatabaseMetrics query result will contain only a single record per content databases with the associated server name and database size.
The last of the original four queries, WebApplicationMetrics, will take extra steps to shape the data as intended. The SiteUrl column is not in the format needed since it contains full site collection URLs. For the purposes of this example assume that the authority (ex. subdomain.domain.com) portion of URL represents an individual web application. Continuing with that assumption the SiteUrl column can be manipulated to show data per web application. Click Choose Columns from the ribbon menu and select only the columns for SiteUrl, SiteCollectionSizeInMB, SiteCollectionCount, and SubWebCount.
Note: if the farm contains host named site collections (HNSCs) then the assumption regarding “URL authority = web application” may be invalid.
Next remove the scheme (ex. http:// or https:// most likely) from the SiteUrl column by clicking the Replace Values button from the ribbon menu. For Replace With specify an empty string (“”). In my example all sites are using https:// but a farm may have a mix of both http:// and https:// so an additional Replace Values step may be needed. I’m not aware of a way to combine both into one operation but if a reader knows please do share in the comments.
In order to separate out the authority portion of URL from the rest of the site collection (e.g. the path) click the Split Column –> By Delimiter button from the ribbon menu. Choose a custom delimiter of forward slash (/) and choose to only split at the left most delimiter.
To make it easier to read these newly split columns click “Rename column” from the ribbon menu and rename the first column to SiteUrlDomain and the second column to SiteUrlPath. Alternately the Rename column functionality can be found by right-clicking the column and choosing Rename from the list of actions.
The final modification for the WebApplicationMetrics query will be to group the data by multiple aggregations. Select the SiteUrlDomain column and click Group By in the ribbon menu. Make sure that SiteUrlDomain is selected for the top Group By box. Proceed to add additional aggregations for three columns:
|New Column Name||Operation||Column|
The results for the WebApplicationMetrics query should look similar to the following.
Referring back to the original planning hierarchy there are two final queries to be created: “Risks by Web Application” and “Risks by Site Collection”. These queries are children from the “Filter for Risk Related Columns”. This translates to creating 2 references from RiskFilteredSource query. Rename these referenced queries to RiskByWebApp and RiskBySiteCollection.
Starting with RiskByWebApp perform the similar steps used with WebApplicationMetrics.
- Replace the https:// or http:// scheme
- Split SiteUrl column on leftmost “/” delimiter
Since this query is focused on web application based information the “path” portion of URL (ex. “/sites/HRMarketing”) is not needed. Click Choose Columns and select all columns except “SiteUrl.2”.
After the columns have been chosen rename the “SiteUrl.1” column to “SiteUrlDomain”. The data should look similar to the following at this point.
Now is where some of the magic happens. Credit for this specifically goes to Ken Kilty and Gene Livshin. I want to aggregate the individual risk counts per web application while also maintaining the individual risk name. Welcome to Unpivot Columns (<insert fireworks and fanfare>). Think of the unpivot columns operation as transpose for a key-value pairing. Select all of the columns for the individual risks (Alerts, Apps, etc.) but be sure to not select SiteUrlDomain. Then click Unpivot Columns.
The resulting query will have SiteUrlDomain, Attribute, and Value for columns. Rename Attribute and Value to RiskName and RiskCount respectively.
Notice that RiskName values will repeat for if there are multiple site collections in that web application. Click Group By from the ribbon menu. Specify SiteUrlDomain and RiskName for the top grouping dropdowns (add a new grouping for RiskName since the default is one grouping.) For the New Column Name, Operation, and Column specify RiskCountAggregated, Sum, and RiskCount respectively.
The RiskByWebApp query result is finally in a usable format now. See the following for an example of the output.
The RiskBySiteCollection query will mirror the RiskByWebApp query in every step except there will not be a split column or the clean up of columns immediately afterwards. The order of steps will be as follows:
- Replaced value – remove scheme from SiteUrl
- Unpivoted Columns – select all columns except for SiteUrl for the unpivot
- Rename Columns – Attribute –> RiskName, Value –> RiskCount
- Grouped Rows – group by SiteUrl and RiskName, sum RiskCount into RiskCountAggregate.
The RiskBySiteCollection query should end up looking like the following:
Quite a bit of work goes into transforming and modeling the initial CSV data into more usable formats. Hopefully this post has been helpful in illustrating steps that can be taken to plan out the end goal as well as model the queries of the SMAT output data. In Part 2 of this post I will walk through creating a number of Power BI report pages that leverage what has been created thus far. I’ll also attempt to share out sample source CSV data and a Power BI PBIX file but no guarantees. Please share any feedback or questions in the comments.