Why is a content overview important for privacy information?

Okay, I’m going to clear my search and go back to my catalog home. Now we’re going to take a look at what we actually get back when our assets are cataloged. I’m going to flip back to my standard search and look for customer tables again, so I’m going to type in “customer,” and we’re going to start by looking at the customer focus group table. Notice that when I search, I get a lot of information back even just in the search window. I get the table name. This star indicates whether the table is one of my favorites or not, and I can add a favorite right here by clicking on the star next to the table. We can see the status, which we’ll set in just a minute; for example, the bank customers table is in review. We can also see the library, the date modified, the size, the columns and rows, who it was modified by, and the date created.

Notice also that we have many, many tables from many caslibs and SAS libraries. I have the SAS Information Governance license, which allows me to catalog both SAS libraries and caslibs. If you just have the SAS Information Catalog license, you can only catalog caslibs. In either case, you can catalog any caslib or SAS library you have access to. For example, notice that one of my libraries here is a caslib connecting to Oracle. Even though it’s an Oracle caslib, we can catalog it, and we’re going to get metrics back for those tables just like we would for a SAS library or a path-based caslib. So as long as you have access to a library or a caslib, you can catalog it. But like I said, we’re going to focus on the customer focus group table, so I’ll click on that, and we get a lot of information even just on the overview window.

Let’s take a look at the top here, under our content overview. The very first thing we see is the information privacy, which here is marked as private. Now, how’s that determined? I’ll click on the “i” next to that, and we can see it says “analyzed using discovery locale United States English.” Then it has two columns marked as private, age and gender, and two columns marked as candidate, state/province and geographical point. What’s going on here is that behind the scenes there’s something called the SAS Quality Knowledge Base, which has a bunch of different, really smart algorithms that are doing a lot of work to classify the data in our libraries.

The SAS Quality Knowledge Base is a collection of files that are organized into these algorithms that we call definitions; that’s what you’ll see in the documentation. Those definitions can do cool things like identifying the semantic type of data: they know what an age looks like, they know what a name looks like. They can also break a larger data value down into smaller pieces, like pulling the city out of an address. So these definitions, behind the scenes, are looking at all of our data and classifying the type of data. For example, here they found age and gender, and these definitions are smart enough to know that age and gender are personally identifiable information, meaning something that can identify me as an individual. They also know that state/province and geographical point might be private information. If it’s my address, then that probably is private information; if it’s something like the location of an earthquake, that’s not private information.
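To make the private versus candidate distinction concrete, here is a minimal sketch in Python. It is not how the SAS Quality Knowledge Base works internally; it only assumes that each column has already been given a semantic type and then maps that type to a privacy label. The column names, the `classify_table` helper, and the rule that one private column makes the whole table private are all assumptions made for illustration.

```python
# Illustrative only: semantic types taken from the demo; the mapping and
# the table-level rollup rule are assumptions, not the QKB's actual logic.
PRIVATE_TYPES = {"age", "gender"}
CANDIDATE_TYPES = {"state/province", "geographic point"}


def classify_column(semantic_type):
    """Map a semantic type to a privacy label."""
    if semantic_type in PRIVATE_TYPES:
        return "private"
    if semantic_type in CANDIDATE_TYPES:
        return "candidate"
    return "none"


def classify_table(column_types):
    """Label every column, then roll the labels up to one table-level label."""
    labels = {col: classify_column(t) for col, t in column_types.items()}
    if "private" in labels.values():
        overall = "private"
    elif "candidate" in labels.values():
        overall = "candidate"
    else:
        overall = "none"
    return overall, labels


# Hypothetical column names for a table like the customer focus group table.
column_types = {
    "age": "age",
    "gender": "gender",
    "region": "state/province",
    "location": "geographic point",
    "group_id": "unknown",
}
overall, labels = classify_table(column_types)
print(overall)
print(labels)
```

A candidate label is a prompt for a person to decide: the same geographic type can be an address in one table and an earthquake location in another.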

Columns like state/province and geographical point are surfaced to us as candidates. So these definitions run behind the scenes and tell us what sorts of data they find in our environment. The United States English locale means that we are analyzing this using definitions that are made for English data from the United States. When we see how to catalog our libraries, we’re going to want to pick the locale that corresponds to the region and language in our data to get the most accurate results. Okay, so we can see that we do have some private data here, and that’s useful information because we can use it to go and mask that data a little later. We can also see the period covered: this data ranges from January 1st, 2020 to September 30th, 2020. That’s nice because you can just see the range of dates in your data, and you can see whether it’s recent or not. That’s good stuff. We can also see the area covered.

Here we just have regions: South, Mid-Atlantic, Great Lakes, Pacific, and Greater Texas. If I click on the “i” next to that, we can see the most common geographic values. The definitions in the Quality Knowledge Base are surfacing this for us as well. They were able to say, “I think the South is a region, I think Mid-Atlantic is a region; those look like geographic values,” and those were our most common geographic values in this data set. They’re under state/province because that was the closest match for the type of geographic value that we had here. But we’ll see in a later table that if you have different levels of geographic data, they’ll be classified as such in this information window. So if you have cities, they’ll show up under city; if you have states, they’ll show up under state, and so on. We’ll see that a little later. Then we have our business description, and I’m going to add a description.

Let’s see: “This is tracking customers in our focus groups.” At the bottom of the screen we have our history, and we can see the history for our row count and our completeness. For row count, we can see this was analyzed at 6:37 PM and it had 88,000 rows. Over time, as rows are added to or removed from the table, this history graph will change to reflect that, so you’ll get a line graph that shows you the row count over time. We also have completeness. Completeness measures how many nulls, blanks, or missing values are in the data. Here we have about 100% completeness; we have 99.6% completeness, which is good. But what if I add a bunch of data and it has a bunch of missing values? That completeness might fall, and this graph is nice because I can see how that’s changed over time. I can see whether my completeness has gone up or down, and I can find data quality issues easily that way. Okay, so this table is looking pretty good so far.
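To make the completeness number concrete, here is a minimal sketch in Python of one way such a percentage could be computed. It assumes completeness is the share of cells that are not null, blank, or NaN; the exact formula SAS Information Catalog uses is not stated in this walkthrough, and the sample rows and the `completeness` helper are made up for illustration.

```python
import math


def is_missing(value):
    """Treat None, NaN, and empty or whitespace-only strings as missing."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    if isinstance(value, str) and value.strip() == "":
        return True
    return False


def completeness(rows):
    """Return the percentage of non-missing cells across a list of row dicts."""
    total = 0
    missing = 0
    for row in rows:
        for value in row.values():
            total += 1
            if is_missing(value):
                missing += 1
    if total == 0:
        return 100.0  # assumption: an empty table counts as fully complete
    return 100.0 * (total - missing) / total


# Made-up sample rows, not data from the demo table.
rows = [
    {"age": 34, "gender": "F", "region": "South"},
    {"age": None, "gender": "M", "region": "Pacific"},
    {"age": 51, "gender": "", "region": "Great Lakes"},
    {"age": 29, "gender": "F", "region": "Mid-Atlantic"},
]
print(round(completeness(rows), 1))  # 10 of 12 cells present -> 83.3
```

Recording this value each time the table is analyzed gives the kind of history line described above: a load full of missing values shows up as a drop.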

I’m going to set a status for this table. Status is something you can filter on in your search results, and I’m going to say it is in review because we’re still working with it. It’s looking pretty good, so I don’t need to set a status of warning or anything, but I’ll make that in review and also favorite this table so that I have access to it later. Then I’ll click my column analysis tab, and let’s take a closer look at all of our columns here. On our column analysis tab, the first thing we can see is our descriptive measures. Here we can see all of our columns and all of the different statistics and metrics we get back for this table: distinct values, mean, median, minimum, maximum, standard deviation, missing, blank, outliers, mismatched, skewness, kurtosis, and then a bunch of different percentiles. So there are a lot of metrics here, and we have them for every single column in the table. It’s nice to see them all side by side.
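As a rough illustration of what a few of these descriptive measures mean for a single numeric column, here is a minimal sketch in Python using only the standard library. It assumes population (not sample-adjusted) formulas for standard deviation, skewness, and excess kurtosis, and at least one non-missing value; the catalog may use different estimators, and the `describe_column` helper and sample values are hypothetical.

```python
import statistics


def describe_column(values):
    """Compute a few descriptive measures for one numeric column.

    None marks a missing value. Population formulas are an assumption.
    """
    present = [v for v in values if v is not None]
    n = len(present)
    mean = statistics.fmean(present)
    variance = statistics.pvariance(present)
    # Third and fourth central moments for skewness and kurtosis.
    m3 = sum((v - mean) ** 3 for v in present) / n
    m4 = sum((v - mean) ** 4 for v in present) / n
    skewness = m3 / variance ** 1.5 if variance > 0 else 0.0
    kurtosis = m4 / variance ** 2 - 3.0 if variance > 0 else 0.0
    return {
        "distinct": len(set(present)),
        "missing": len(values) - n,
        "mean": mean,
        "median": statistics.median(present),
        "minimum": min(present),
        "maximum": max(present),
        "std_dev": variance ** 0.5,
        "skewness": skewness,
        "kurtosis": kurtosis,  # excess kurtosis: 0 for a normal distribution
    }


# Made-up values, not data from the demo table.
sample = [1, 2, 3, 4, 5, None]
summary = describe_column(sample)
for name, value in summary.items():
    print(name, value)
```

Running the same function over every column and laying the results out in one grid gives the side-by-side view described above.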
