Data Governance – Everything starts here. Well, not exactly; everything starts with first determining what needs to be measured. However, we can all agree that, as the saying goes, if we let garbage in, we’ll only churn garbage out. This is why how we collect and manipulate data has such a fundamental impact on the entire analysis building we are trying to erect. Fortunately, this is one piece of Analytics Governance at which many companies have become really efficient.
It is very important to first identify all sources of data used in your analysis framework. Technically, this is something you should have done at the KPI definition stage, when you validated each metric’s DNA (what data it is made of) and documented which systems generated it. Do you have one source of information that documents all data sources? You should, all the more so since several new sources are added every year. Someone somewhere in your organization knows about this; just validate with them that all is properly documented (when I say “documented” in this article and the next ones, I mean whatever means of formalizing knowledge you have).
With the variety of sources come differences in definitions. In short, do things mean the same in each system? Often they don’t. Typical examples are customers and sales, which can differ in which data point is used to identify them in each system (for example an ID in one system, and a credit card number in another). Do you have a centralized data definition for your variables? Is a “sale” the same thing in your campaign management system as in your CRM?
I was once involved in a project where it took a month for Marketing and Accounting to agree on what a sale was; yes, this was at the definition level, but each had a system in which the numbers differed quite significantly. We ended up using Accounting’s definition (no surprise).
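To make the idea of a centralized data definition a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the system names (`crm`, `campaign_management`), the field names, and the wording of the definitions are hypothetical, not taken from any real product. The point is only that each variable has one agreed description, plus a documented identifier for each system that holds it.

```python
from dataclasses import dataclass, field


@dataclass
class VariableDefinition:
    """One agreed business definition, plus how each system identifies the variable."""
    name: str
    description: str
    identifiers: dict = field(default_factory=dict)  # system name -> identifying field


# Hypothetical data dictionary; systems, fields, and wording are illustrative only.
DATA_DICTIONARY = {
    "customer": VariableDefinition(
        name="customer",
        description="A person who has completed at least one purchase.",
        identifiers={
            "crm": "customer_id",
            "campaign_management": "credit_card_number",
        },
    ),
    "sale": VariableDefinition(
        name="sale",
        description="An order as recognized by Accounting.",
        identifiers={
            "crm": "order_id",
            "campaign_management": "conversion_id",
        },
    ),
}


def identifier_for(variable: str, system: str) -> str:
    """Return the documented identifying field, or fail loudly if it is undocumented."""
    definition = DATA_DICTIONARY[variable]
    if system not in definition.identifiers:
        raise KeyError(f"No documented identifier for {variable!r} in system {system!r}")
    return definition.identifiers[system]
```

Calling `identifier_for("customer", "crm")` returns the field documented for that system, and asking about a system nobody has documented raises an error instead of silently guessing, which is usually what you want from a governance point of view.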
I remember a few years ago spending an entire day listening to an amazingly bright PhD from Hungary discuss best practices in data cleansing. Yes, seven hours. In Digital Analytics, or at least in traditional Web Analytics, this is not something we are really accustomed to. OK, yes, we spend endless hours tagging and tweaking the said tags so they will correctly pass the information we want, but we usually stay within the application data model. Rarely, if ever, do we have to reconcile records, for example. And, let’s be frank, the data can get pretty dirty, and we usually have no other means of making it clean than making sure the tag does its job properly. This is one reason why I have always liked being able to get to the original data set (logs, meaning what is actually generated by the tags); on many occasions, we cleaned, corrected, and massaged the data before the web analytics application processed it.
Needless to say, any data cleansing or data modification process must be well documented; changes at this level tend to be permanent, in the sense that no subsequent filtering or analytics application configuration will revert what was changed at the data level.
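As a rough illustration of what “well documented” can look like when the cleansing is done in code, the sketch below records every cleansing step in an audit log as it is applied. The function names, the record layout, and the `page` field are all hypothetical; this is one possible way of keeping a trace, not a standard.

```python
from datetime import datetime, timezone


def apply_cleansing_step(records, step_name, reason, transform, audit_log):
    """Apply `transform` to a copy of each record and log what the step did."""
    cleaned = [transform(dict(record)) for record in records]
    changed = sum(1 for before, after in zip(records, cleaned) if before != after)
    audit_log.append({
        "step": step_name,
        "reason": reason,
        "applied_at": datetime.now(timezone.utc).isoformat(),
        "records_seen": len(records),
        "records_changed": changed,
    })
    return cleaned


def lowercase_page(record):
    """Hypothetical cleansing rule: treat /Home and /home as the same page."""
    record["page"] = record["page"].lower()
    return record


audit_log = []
raw_hits = [{"page": "/Home"}, {"page": "/about"}]
clean_hits = apply_cleansing_step(
    raw_hits,
    step_name="lowercase_page",
    reason="Tags report page paths with inconsistent capitalization",
    transform=lowercase_page,
    audit_log=audit_log,
)
```

The original records are left untouched and the log keeps the reason for each step, so anyone looking at the numbers later can tell what was changed before the analytics application ever saw the data.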
What I just discussed naturally pertains to what is known as the Extract, Transform, Load process, namely ETL, which addresses where you get the data, how you transform it, if necessary, for integrity and business rules, and where you end up loading it for analysis purposes. This is something most Digital Analytics people are really not accustomed to, and in which they will have to develop competency in the near future (at least those who handle data in their team). Aggregated traffic numbers coming from a black box whose mechanisms we don’t fully understand will be less and less attractive, and less and less acceptable. This will become an even more important process with the current shift (finally!) from the visit-based paradigm to the visitor-based one, which will force us to do tons of record reconciliation (unless Google Universal ID proves to be miraculous!).
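For readers who have never touched ETL, here is a deliberately tiny sketch of the three steps, including a naive form of the visit-to-visitor reconciliation mentioned above. The log layout, the column names, and the matching rule (a cookie that was ever seen with a login email belongs to that email) are assumptions chosen for the example; real reconciliation rules are a business decision and are usually much messier.

```python
import csv
from collections import defaultdict
from io import StringIO

# Hypothetical visit-level log; the layout and values are made up for illustration.
RAW_LOG = """visit_id,cookie_id,login_email,pageviews
v1,c-100,,3
v2,c-100,Ann@Example.com,5
v3,c-200,ann@example.com,2
v4,c-300,,1
"""


def extract(raw_text):
    """Extract: read visit-level rows from the source."""
    return list(csv.DictReader(StringIO(raw_text)))


def transform(visits):
    """Transform: normalize emails, then roll visits up to visitors."""
    # First pass: remember which cookies were ever seen with a login email.
    cookie_owner = {}
    for visit in visits:
        email = visit["login_email"].strip().lower()
        if email:
            cookie_owner[visit["cookie_id"]] = email

    # Second pass: assign each visit to a visitor key and aggregate.
    visitors = defaultdict(lambda: {"visits": 0, "pageviews": 0})
    for visit in visits:
        email = visit["login_email"].strip().lower()
        key = email or cookie_owner.get(visit["cookie_id"]) or "cookie:" + visit["cookie_id"]
        visitors[key]["visits"] += 1
        visitors[key]["pageviews"] += int(visit["pageviews"])
    return dict(visitors)


def load(visitors, warehouse):
    """Load: write visitor-level rows to the destination (here, a plain dict)."""
    warehouse.update(visitors)
    return warehouse


warehouse = load(transform(extract(RAW_LOG)), {})
```

With this sample log, the three visits tied to the same email (two cookies, two spellings of the address) collapse into one visitor, while the anonymous cookie stays on its own. The matching rule is where the governance questions live: who decided it, where is it documented, and what happens to the numbers when it changes?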
Whatever way you organize your Data Governance activities (Lean Six Sigma, TQM, Kaizen, etc.), formalizing processes is key here. Well, the whole concept of governance is not far from that of formalization, as we’ll see throughout this series, but I believe it to be of particularly high importance with data.
As I said in the first installment (see Issue 16), most companies I know do fairly good work here, and pretty much control that part of Analytics Governance. However, I often see that the same discipline is, very unfortunately, not applied to Digital Analytics data as it is to other data in the business. I don’t know why. Maybe this is because people are more willing to accept, as is, the data the most popular products give them. Lack of access is often the main hurdle, I must say.
This is obviously a vast topic (remember my full day on data cleansing alone?), and one I will try to address more this year. In the next installment, I’ll discuss Application Governance.
Don’t forget to join the WAO/FACTOR LinkedIn group for further discussions.