That day was a particularly enlightening one for me. It was my first true exposure to hardcore data cleansing. Boy! That scraped the eyelids off my face! This is the fifth instalment of the Big Integration series about the TDWI World Conference back in 2008. The sixth and last one will be published later this week.
I guess I was naive. I thought that Web analysts were the ones who had to live in an approximate world, because of the nature of our data. Boy, can data get dirty in the BI world! Today I attended the full-day course on Data Conversion, Consolidation, and Cleansing – Practical Skills, given by the passionate Arkady Maydanchik. Arkady packs more information into 10 seconds than anyone I have ever seen! He actually made what many would consider a rather dry topic into something exciting. I am not kidding; data cleansing is very complicated to execute well. I’m sure it can be very absorbing.
So, yes, data can get really, really bad in that world, but at least they have the means to work on it and make it better. I realized again today how little control we have over our data, since we basically have to trust the vendors on that. What is a bad “record” in Web Analytics? Is going back to analyzing IIS logs a solution? Well, I wouldn’t go that far; I am a true believer in tagging. But I wonder if we could get rid of proprietary data formats and work as a community towards standardized structures for how the data is collected. This would mean that logs collected from tagging could be analyzed by whatever Web Analytics product you purchased (or didn’t). I guess this is similar to the metadata situation with BI application vendors. It was clear at yesterday’s night course that none of them wanted to make their metadata readable by other vendors. I don’t know anything about that field, but it seems that metadata schemes are part of what they base their competitiveness on.
I am not sure that a call for a standardized log structure (in tagging, I mean; I know IIS and Apache logs are already vendor-neutral) would be such a terrible thing for vendors in our field. While we are at it, I guess we could also think about how to restructure that data so it is easier to integrate with other systems in the business. Anyway, food for thought.
Tomorrow is the last day of the conference. I will attend another full-day class, this time on technology architecture for BI.