I can see actual data values in my table: my customer ID, my longitude and latitude, my region, my gender, my age, all of that good stuff. You can increase the sample rows to 5,000 or decrease them to 10. Okay, so like I said, this looks good. I don't see a lot of missing values; these look like good data values, and they make sense. I'm going to approve this table, and then I'm going to go back to my search results and take a look at a different table. I'll click on my bank customers table, and we can see similar metadata and similar metrics for this table, but because we have different data, we're going to get different values back.
Let's take a look at a few of the differences here. For example, my information privacy is also private, but there's a lot more private information in this table, because this is cataloging bank customers' personal information. We've got individual, we've got gender, family name, email, phone; these are all private pieces of data because they can identify a specific person. We've also got some candidate data: delivery address, city, postal code, county, state province. In this case, I would say that does classify as private, because that's a person's address. We can see the period covered, and this is a much wider range: we've got 1985 to 2012. And then we can see the areas covered, and here we have states.
If I click on that, we have state provinces, we have counties, and we have cities, so these definitions are smart enough to break down the different geographic pieces by the type of geography they are. It was able to identify counties, cities, and states. That's nice. Okay, I'm going to click on my column analysis, and let's take a look at our state column. I'll click on my state column. Notice we have 58 distinct values for the state, which already bodes poorly, because we only have 50 states in the United States. So let me click on the state. We can see here we have a semantic type of state province; that is good, that's correct. It's a candidate for information privacy. It's not a primary key candidate, because there are only 58 distinct values for 483 rows. It's a string, and we have 100% matching data, so that looks good. But notice that we do have some data quality issues here: we have our states written in a few different ways. For example, in our frequency distribution:
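The reasoning the column profiler applies here — comparing the distinct value count to the row count to rule out a primary key — can be sketched in plain Python. The data and variable names below are illustrative stand-ins, not the catalog's internals:

```python
# Illustrative sketch of distinct-value profiling on a state column.
# A real profiler would pull these from the cataloged table.
states = ["CA", "California", "TX", "NY", "ca", "Texas", "NY", "TX"]

n_rows = len(states)
n_distinct = len(set(states))

# A column can only be a primary key candidate if every value is unique,
# i.e. the distinct count equals the row count.
is_pk_candidate = n_distinct == n_rows

print(f"{n_distinct} distinct values across {n_rows} rows")
print("primary key candidate:", is_pk_candidate)
```

Note that "CA" and "ca" count as two distinct values here, which is exactly the kind of inconsistency the 58-distinct-states figure hints at.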
We can see we have 53 rows with CA and we have 17 with California. Well, we know that CA and California are the same state; CA is the abbreviation for California. But the computer doesn't, because the text strings don't exactly match. We can see that again in our pattern frequency: 436 of our values are in the capital-letter format, but we have a bunch of values that are not, so we want to standardize that and put them all in the same format. I'm going to mark this as flagged, because we want to do some more work on this before we use it in a report. And now, if I want to update the values, if I want to standardize that state, I can go straight from SAS Information Catalog to other SAS Viya applications. In the top right corner I have my Actions menu.
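A pattern frequency like the one described above is usually built by mapping each character of a value to a character class and counting the resulting patterns. Here is a minimal, hand-rolled sketch of that idea (not SAS's actual implementation):

```python
from collections import Counter

def char_pattern(value):
    # Map each character to a class: 'A' for upper, 'a' for lower,
    # '9' for digit; anything else passes through unchanged.
    out = []
    for ch in value:
        if ch.isupper():
            out.append("A")
        elif ch.islower():
            out.append("a")
        elif ch.isdigit():
            out.append("9")
        else:
            out.append(ch)
    return "".join(out)

# Illustrative sample: two-letter codes, a full name, a lowercase code.
states = ["CA", "CA", "California", "TX", "ny"]
pattern_freq = Counter(char_pattern(s) for s in states)
print(pattern_freq)
```

Values like "CA" collapse to the pattern "AA" while "California" becomes "Aaaaaaaaaa", so a glance at the pattern counts shows how many values deviate from the dominant format.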
I'm going to click on that, and we can go to many SAS Viya applications: I can go build models, I can go explore my lineage, I can go to SAS Visual Analytics. This isn't ready for a report, though, so I'm actually going to go to prepare data, which is SAS Data Studio. So let me click on that. The bank customers table is selected as my source for my plan, and with the data preparation plan we can clean up this table and fix those data quality issues we found. I'm going to do a very simple little thing here, which is to standardize that state column. I'm going to add a data quality transform called Standardize, and I'm going to pick the state column. I'm going to keep the locale English, United States; this is using those same definitions from the SAS Quality Knowledge Base behind the scenes, and this is going to put all of my states in the same format, in the same casing, all of that. So I'm going to click on state province abbreviation, and I'll just replace my source column, and I'll run this. Notice that at a glance, with my sample data, all of my states are in that two-letter state abbreviation. What you'd want to do is save this table, and then when the table's re-cataloged, when it's updated, this new pattern will be reflected in the catalog. But it's really easy to go from SAS Information Catalog to other applications and fix those data quality issues.
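Conceptually, the Standardize transform maps every known spelling of a state to one canonical two-letter form. The real transform draws on the SAS Quality Knowledge Base definitions; the lookup table below is an illustrative, hand-rolled stand-in covering just two states:

```python
# Assumed, illustrative lookup: every known spelling -> canonical code.
# The SAS Quality Knowledge Base covers far more variants than this.
STATE_LOOKUP = {
    "ca": "CA", "calif": "CA", "california": "CA",
    "tx": "TX", "texas": "TX",
}

def standardize_state(value):
    key = value.strip().lower()
    # Fall back to the original value when no rule matches, so
    # unrecognized inputs are preserved rather than dropped.
    return STATE_LOOKUP.get(key, value)

print([standardize_state(v) for v in ["CA", "California", " texas ", "NY"]])
```

Running this over the sample, "CA" and "California" both land on "CA", which is exactly why the 53-row and 17-row frequency buckets would merge after standardization.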
If you have a table like customer focus group that's already approved, you can go straight to building a model or building a report; it's super, super easy. I'm not going to save this table. I'm going to go back to SAS Information Catalog, and I'm going to show you how you actually can catalog a library. So I'm going to go back to my applications menu, and then I'm going to go to discover information assets. Okay, so those are our user tasks. If you're a user, you probably won't do the next things we're going to talk about, but your administrator will, and this will help you know what to ask for. So I'm going to go back to catalog home. Notice that when we got into our environment we had 317 assets cataloged; that's because our administrator had run data discovery agents on several caslibs and libraries before we got into the environment. So let's see how we can set those up. In my toolbar on the left, because I'm logged in as an administrator, I have another icon called discovery agents. I'll click that, and I have four discovery agents set up.
Again, if you're a user, you probably won't have this option, but your administrator will. So we have four data discovery agents, and each one of these crawls a different caslib or SAS library. Each discovery agent has a name, the job status, the server that the library they're cataloging is on, the library name, the physical region the data resides in, the date modified, and a description. So let's click on one. Let me click on ORALIB; this catalogs the contents of the Oracle caslib ORALIB. The library name is here, that's just ORALIB, and the physical region is something your admin can enter; here it's US_WEST. We also have our discovery locale. Remember, when we were looking at our assets, we had all of those pop-ups saying, hey, we used the English, United States discovery locale. What this means is that we were using English, United States definitions to analyze our data. So if I had French data from France, I would want to change my locale to French, France.
If I had Canadian English data, I'd want to change my locale to English, Canada. We want to make sure our discovery locale matches the data that we're using, because the definitions are very specific to each region. For example, names are written differently in English than in Spanish, and they're written differently in Spain and in Mexico. So even if there's a shared language between two regions, we want to be really specific to let our algorithms do the best job possible. We're going to keep the discovery locale on English, United States, because we're looking at English data from the United States, but you want that to match your data, because that's where you're going to get the best results. So this is already created; I can run this now, and that will re-catalog all of the different assets in my ORALIB library. Okay, but I'm going to close this and go back to my discovery agents, and we're going to make a new discovery agent. I'll click on the new discovery agent icon in the top right corner. Here we can see all of our caslibs and our SAS libraries in this environment.
I see the SAS libraries here because, again, I have the SAS Information Governance license; if you just have SAS Information Catalog, you'll only see caslibs. We're going to actually catalog a caslib called bank data, so I'll check the box next to bank data. We could catalog multiple caslibs with one discovery agent. I'm not going to do that here, but for example, if you had an Oracle database with multiple libraries pointing to multiple schemas on the database, and you wanted to catalog the whole thing with one data discovery agent, you could set that up. We're just going to catalog bank data here, and I'm going to hit new discovery agent. I'm going to leave the name as bank data, which just matches the caslib, and I'm going to give it a description, so let's say: catalogs data about banking customers. The physical region is where the data resides, so let's say US_EAST; this is a field that we can filter on later, when we've actually cataloged the assets. And we want to keep our discovery locale on English, United States. I'm going to hit run now, and it says:
if you continue, then we're going to save these changes and analyze and index the library bank data. I'm going to hit continue, and this is running, so I can see it running up in the top right corner. When this is finished, I'm going to get a purple pop-up in the middle bottom of the screen that tells me it's finished. Okay, great, we get that pop-up at the bottom of the screen that says the discovery agent bank data has finished running. I'm going to go back to all of my agents, so I'll click on back to list, and we see bank data here; we can see the date modified. And now, if I go back to my catalog, so let me click catalog in that left-hand toolbar, now we have 321 assets cataloged; there were four things in that new caslib. If I search bank data, we can see that now we have some things coming back with the library of bank data: bank customers, investment customers, and loan customers, as well as home equity loans.
Now, what happens when everything in our environment changes? Right, we're making changes all the time, and we want our metrics to be up to date. Instead of sitting here and re-running our discovery agents manually, we can schedule these to run at a specific time, so let's take a look at how we can do that. I'm going to navigate to a new application: let's go to our applications menu in the top left corner, and I'm going to go down to manage environment. This, again, is something your administrator will do, but it's really useful, because then when you get to work, all of your assets will be cataloged and up to date, and all your metrics will be correct.