How can you compare the metrics based on columns in SAS?

You can compare metrics across the columns but let’s take a look at one column, in particular, we’ll take a look at a few columns to let’s start with a numeric column called t since last purchase, so this is the time since last purchase in months for people in our focus group so click on that column and we get this nice view with a bunch of really helpful charts so in the top left corner we can see our distinct values our completeness and our uniqueness we only have 10 distinct values for all 88 000 rows so while we have 100 completeness.

No nulls we have zero percent uniqueness because we only have 10 values for 88 000 rows so at the top right we can see that this is not a primary key candidate which makes sense right primary keys have to uniquely identify the road that they’re in and this is not going to do that so we don’t have a primary key candidate right now we don’t see the semantic type or information privacy here because this is a numeric column in the bottom left we have a pleasant chart that charts the Kurtis versus the skewness so we can see the skewness the Kurtis and the standard deviation and if i click on my little dot on my chart i can see my skewness in my Kurtis in the top right we have a frequency distribution and this is nice because at a glance you can see that a lot of people here have between four and seven months.

Since their last purchase so right away i can say hey you know maybe that’s when we should be sending people mailers that’s when we should be sending people coupons that is a really common amount of time to wait between purchases so for example if i hover over my five here so five months twenty thousand of my customers have had five months since their last purchase for the frequency distribution we can see the top frequencies here if i click on the bottom it’s a little silly here because we only have three customers who have waited 12 months so that’s kind of just one bar but you can’t see the top or bottom frequencies you can also choose to hide outliers so if i click on that, for example, it will hide that 12 months because only three customers have 12 months since their last purchase and then finally this is maybe my favorite is the full distribution in the bottom right.

This is a box and whisker plot and we can see the minimum quartiles the mean the median standard deviation maximum and again it kind of reinforces the fact that most of our customers fall into that four to the seven-month range since their last purchase so we can see that’s our middle 50 percent easily and we can see that you know the maximum of 12 is a pretty far outlier, okay so that’s what a numeric variable looks like let’s take a look at a few more variables including some character variables so let’s take a look at region i have this left-hand toolbar here with all my columns so i can click on the region and i can see some similar information but it’s going to be slightly different because this is a character column so just like before i can see my distinct values my completeness my uniqueness.

Again we only have nine distinct values for 88 000 rows, so this is not going to be unique it’s also not going to be a primary key candidate but here we do have some more information like semantic type and information privacy, so the semantic type here is state province that’s not quite true but it’s close enough the algorithms did a pretty good job of guessing what that was to help us out and then we have a candidate for information privacy so like i said if this is part of my address that could be private information but if it’s part of you know the businesses location or an earthquake’s location or something like that it might not be private, so the definition said i don’t know I think might be private but I’m not sure then we have our mismatch area here, so this is taking a look at the actual type of the value and seeing how many values in each column are mismatched to the type of the column.

Here we have 100 matching but if we had a bunch of numbers in here if we had different data in here that didn’t fit the string type then we would see that we had some percent mismatched in the bottom left corner we have our frequency distribution and just like before we can see the top frequency and the bottom frequency and this is a nice little chart it shows you the frequency for the different bars here if you hover over them and also on the chart itself so south is the most common with 24 000 rows, okay we also have the most common and least common value listed in the bottom left and then this is my favorite part of the character column we have a pattern frequency which is really nice so here what’s going on is it’s saying regardless of the actual letter any capital letter is going to be represented by a capital any lowercase letter is going to be represented by a lowercase a and I’ll show you the patterns in your data any number will be represented by 9.

We don’t have any numbers here but you can take a look at those as well okay so 2764 rows have the pattern of capital letter and then four lowercase letters and so on this isn’t that helpful here because our values don’t have to fit the same pattern but for things like a social security number for things like a phone number for things like a zip code or a state with a two-letter code we want those all to have the same pattern and if they don’t, it’s really easy to see with the pattern frequency chart okay let’s take a look at a few more values here I’m going to click on long lat long lat is really cool because this is a geographical point data type, so this has a longitude comma a latitude in it and if you have a longitude and latitude in your data set the frequency distribution for that longitude and latitude will show up as a map which I think is just so neat so for example it looks like we have a focus group maybe in Atlanta and we had 24 000 folks attend that focus group okay and I can see that on my map which is really nice um and the bubbles correspond to the size of the focus group so the bigger the bubble the higher the frequency.

We can also see the pattern frequency again this is a little easier to see the value because we have two different ways of writing longitudes and latitudes so we have the double-digit longitude and latitude and then we have the double-digit longitude and the triple-digit latitude we have these two different ways of writing longitudes and latitudes and they’re both valid but for example, if you had a longitude with a different pattern that was not a valid longitude and latitude it would be really easy to see here right if you had letters in there it’d be easy to see here so we can see with our pattern frequency that these all conform to actual longitudes latitudes and not false values there was no truncation all that good stuff okay.

Now that we’ve seen some metrics in our columns I’m going to go back to my descriptive view let me click on that and let’s take a look at the other measures that we have here so let me click on metadata measures this is going to give you actual column metadata, so things like the type of the column the actual type the format the length you’re also going to get information like the primary key candidate in information privacy so we saw that in the actual column itself but here you can see for the entire table which of your information is private which of your primary key candidates exist.

I can also sort on that so, for example, i can sort information privacy and I can see right away i have only four things that are either private or candidate which is nice I’m going to click on data quality measures, and here again we can see some of the different information we saw in the column but with all of the columns in the table so i can compare between columns my completeness my uniqueness my most common least common value pattern count semantic type in information privacy so for example if my pattern can’t look too high in this table view i could click on the column and go take a look at that, okay, so this table is looking pretty good but i want to do one more thing before i mark it as approved.

Leave a Comment