Just recently, I started myself on a regiment of SSIS self-education. I have a strong T-SQL background, so this seems a logical tool for me to learn. So far, I'm having lots of fun and can't quite understand why, at my work place, we are using Informatica PowerCenter to do what we could easily do with SSIS.
For the most part, our ETL is really just EL (meaning Extract and Load, not the ancient Hebrew word for God). In SQL Server 2008, creating a simple extract and load package is particularly easy. In fact, it is almost identical to importing data into Excel or Access. Microsoft has provided a very easy to use Export/Import wizard that generates a complete SSIS package. This can be run immediately or saved of and run later. As SSIS ships free with SQL Server, and as creating an import package this way takes only a couple of minutes, SSIS becomes cheaper to use for this kind of simple data movement even if another ETL tool has already been purchased!
Tuesday, December 1, 2009
Wednesday, July 15, 2009
Iterative Design using Kalido
One of Kalido's great strengths is its ability to rapidly prototype new marts*. Using file definitions to directly import a subset of data, you can very quickly spin off an Excel shop**. This means you don't do any ETL development or BI work until after you have done enough iterations of the design to know exactly what data you need to answer the questions the mart is intended answer, where source(s) for that that data is, and how to slice and dice it.
In an ideal world, each iteration of your design also allows you to take the Excel shop you've produced back to your customer and validate it with them. In practice, this creates a natural dialogue on the finer details of the business questions they are trying to answer, allowing you to document and address those requirement nuances before you dive into any development of code-intensive ETL. It also means your customer can confirm whether the numbers from your shop are actually correct.
Another benefit of this approach is that data needing to be brought under the scrutiny of your governance group is identified early on. Far better to have this in the light prior to building your ETL. That way you avoid costly re-work.
*In Kalido speak, a mart is basically a data set.
Best of all, as these iterations all happen within the scope of your design phase, the approach may be used (at least for small projects) in a Waterfall process. The value of chunking you Data Warehouse development into small projects focuse on specific business questions is a discussion I think I'll hold for a later posting.
*In Kalido speak, a mart is basically a data set.
*An Excel shop is a result set presented in a fomated, Excel workbook, typically including a cross-tab chart.
Wednesday, June 17, 2009
Bill ROLAP Vs. Ralph MOLAP: Bill Waterfall Vs. Ralph Agile
Perhaps this is old news, or maybe I’m simply misunderstanding something basic about the Inmon vs. Kimball debate. However, as I was reading a very brief description of the two this morning, it seemed to me that a large part of the difference in the two philosophies is the type of OLAP (On Line Analytical Processing) model preferred. Inmon, it seems, tends toward a ROLAP (Relational OLAP) model, while Kimball is all for a MOLAP (Multi-dimensional OLAP) model.
Despite the similarity in names, ROLAP and MOLAP are fundamentally different approaches to providing answers for business questions. In ROLAP, the data is maintained at the most atomic level in third normal form and is queried to provide the reports the business needs. In MOLAP, the data is preaggregated to provided summarized data to directly feed reports. The great strength of ROLAP is its flexibility. The great strength or MOLAP is its speed of delivery.
Now, it would not be true to say that Inmon’s model is a ROLAP model. After all, marts could be maintained in a MOLAP fashion while the warehouse itself is ROLAP (creating a distributed Hybrid-OLAP, or HOLAP, environment). But, on the face of it, (at least based on the brief detail provided in the descriptions I read this morning) Kimball’s approach is strictly MOLAP.
If this is the case, the big advantage of Inmon’s approach is its ability to easily answer any business question without significant development effort. On the other hand, Kimball’s model has the advantage of being intentionally “chunked.” That is to say, it appeals to an agile development approach because it consists of multiple small projects. Because Inmon’s model relies on one huge, central warehouse, it would appeal to a waterfall methodology.
What I am wondering is the extent to which the flavor of the two DW philosophies makes then more or less desirable to those two basic approaches to development. Do agile developers prefer Kimball because it looks more agile? Antithetically, do developers and corporate IT cultures that prefer a waterfall methodology (if, indeed, there are developers who prefer waterfall) tend toward the Inmon camp? Is the core of the issue, in fact, more to do with development philosophy than warehouse design philosophy?
Despite the similarity in names, ROLAP and MOLAP are fundamentally different approaches to providing answers for business questions. In ROLAP, the data is maintained at the most atomic level in third normal form and is queried to provide the reports the business needs. In MOLAP, the data is preaggregated to provided summarized data to directly feed reports. The great strength of ROLAP is its flexibility. The great strength or MOLAP is its speed of delivery.
Now, it would not be true to say that Inmon’s model is a ROLAP model. After all, marts could be maintained in a MOLAP fashion while the warehouse itself is ROLAP (creating a distributed Hybrid-OLAP, or HOLAP, environment). But, on the face of it, (at least based on the brief detail provided in the descriptions I read this morning) Kimball’s approach is strictly MOLAP.If this is the case, the big advantage of Inmon’s approach is its ability to easily answer any business question without significant development effort. On the other hand, Kimball’s model has the advantage of being intentionally “chunked.” That is to say, it appeals to an agile development approach because it consists of multiple small projects. Because Inmon’s model relies on one huge, central warehouse, it would appeal to a waterfall methodology.
What I am wondering is the extent to which the flavor of the two DW philosophies makes then more or less desirable to those two basic approaches to development. Do agile developers prefer Kimball because it looks more agile? Antithetically, do developers and corporate IT cultures that prefer a waterfall methodology (if, indeed, there are developers who prefer waterfall) tend toward the Inmon camp? Is the core of the issue, in fact, more to do with development philosophy than warehouse design philosophy?
Tuesday, June 16, 2009
How Complete is Your Data Warehouse?
Must-have and nice-to-have lists of business requirements for the DW provide a comparison point by which to measure the relative completion or fullness of the DW. Rather than calculating the "percent done" based on the how much of source system is represented in the warehouse, base it on the percentage of the must-have and nice-to-have items present there. After all, it would be possible to have 90% of all source system entities represented in a DW but have only a tiny percent of the data that people actually care about.
Wednesday, June 10, 2009
Agile Data Warehouse Development
Recently, I went to the TDWI chapter meeting in Denver where the topic being presented was "Agile Data Warehousing." This morning I read a TDWI publication titled Ten Mistakes to Avoid In Agile Data Warehouse Projects. The two seem to be in sharp disagreement.
Let me interject at this point that this does not betray a lack of clarity on the part of TDWI. Rather, it reflects the diversity of opinions within the pool of DW/BI experts.
As far as the 10 Mistakes publication goes, the author does not really seem to see much benefit in using an agile methodology. Her reasoning is that the bulk of the work in a DW project is actually data governance. Here is what the author says:
Working with Kalido, taking an agile development approach make great sense. Kalido makes rapid development with user interaction and testing along the way a comparatively simple task. Furthermore, Kalido's development model has the business heavily involved in designing the model and then prioritizing which parts get developed first.
At Compassion, corporate scorecards with executive sponsors are likely to be the first major focus for our DW/BI team. I am excited about this focus as a key item will most likely be data quality. Combining that DQ focus with the other scorecard's foci will mean that our DQ emphasis will be on the most significant data. There are two great advantages to this.
First, as Kalido's Lovan Chetty says, if you try to perfect the quality of all your data you inevitably end up spending the most time on the least important data. That's because no one has really been looking at it and that has allowed significant entropy. Spending time focused on poor quality and unimportant data is a poor use of resources. Focusing on key data first makes much more sense from a DQ stand point.
Secondly, the alternative to building a warehouse a piece at a time is building everything in one massive project. This delays the ROI by a massive amount. As most DW projects fail because they never gain acceptance from the business, that approach is akin to planning to fail.
So, from my perspective, an agile approach to DW development is the only one that makes sense. A waterfall approach almost seals the fate of the project at the outset.
Now, our mantra has been "Let's go build the data warehouse," and our IT culture has been heavily waterfall. Neither of those things is good news for our chances of succeeding in our jobs. Furthermore, as a DW reflects the business, it will never be "finished" (unless the business is static), so how can we give a clear definition of success with a goal like that? As long as a business is growing and developing, the DW will continue to change, which is another good reason to have a tool like Kalido rather than a traditional ETL centric DW.
Let me interject at this point that this does not betray a lack of clarity on the part of TDWI. Rather, it reflects the diversity of opinions within the pool of DW/BI experts.
As far as the 10 Mistakes publication goes, the author does not really seem to see much benefit in using an agile methodology. Her reasoning is that the bulk of the work in a DW project is actually data governance. Here is what the author says:
While this is true, it does not necessarily preclude using an agile approach. In fact, a focus on data quality and governance might be considered to be a significant argument for using agile practices.In contrast, most of the work done on a DW project is focused on the data, not on writing software. The source data must be profiled to determine its quality. The data has to be modeled from an enterprise perspective, and much of the business metadata is collected manually. Data must be standardized, and business people from different departments must agree on the data names, definitions, and business rules. The extracted source data is often dirty, and cleansing algorithms must be determined.
Working with Kalido, taking an agile development approach make great sense. Kalido makes rapid development with user interaction and testing along the way a comparatively simple task. Furthermore, Kalido's development model has the business heavily involved in designing the model and then prioritizing which parts get developed first.
At Compassion, corporate scorecards with executive sponsors are likely to be the first major focus for our DW/BI team. I am excited about this focus as a key item will most likely be data quality. Combining that DQ focus with the other scorecard's foci will mean that our DQ emphasis will be on the most significant data. There are two great advantages to this.
First, as Kalido's Lovan Chetty says, if you try to perfect the quality of all your data you inevitably end up spending the most time on the least important data. That's because no one has really been looking at it and that has allowed significant entropy. Spending time focused on poor quality and unimportant data is a poor use of resources. Focusing on key data first makes much more sense from a DQ stand point.
Secondly, the alternative to building a warehouse a piece at a time is building everything in one massive project. This delays the ROI by a massive amount. As most DW projects fail because they never gain acceptance from the business, that approach is akin to planning to fail.
So, from my perspective, an agile approach to DW development is the only one that makes sense. A waterfall approach almost seals the fate of the project at the outset.
Now, our mantra has been "Let's go build the data warehouse," and our IT culture has been heavily waterfall. Neither of those things is good news for our chances of succeeding in our jobs. Furthermore, as a DW reflects the business, it will never be "finished" (unless the business is static), so how can we give a clear definition of success with a goal like that? As long as a business is growing and developing, the DW will continue to change, which is another good reason to have a tool like Kalido rather than a traditional ETL centric DW.
Thursday, June 4, 2009
Clarabridge Webcast
I just got off a Clarabridge webcast. Clarabridge specializes in text analytics. As with many such best-in-class text analytics tools, the focus is on the voice of the customer (VOC) both in internal and external text sources.
At Compassion, we have a new Facebook like website for our customers. However, we have no text analytics tools, so while we are providing customers with a place to congregate and share their experiences, we have no realistic plan to leverage that information.
As many text analytics tools are provided as "on demand" software services, it is not only large corporations that can afford digging into the wealth of information hidden in unstructured text sources. At least, that's the theory. However, I cannot help but think there would be a need for a lot of interaction between your business subject matter experts (SMEs) and the tool providers professional services group while you work through the process of identifying and defining the entities within the text data. Entities, in this context, are typically the nouns from your business language, e.g., the names of specific products and services. Now, text mining can create a great list to start from, but I doubt it will be complete or semantically related. Those tasks will require heavy interaction, and probably heavy cost.
All that said, the cost is mostly up front, so ongoing expenses might be more manageable. Plus, any time you add a new product/service, these new nouns need to be incorporated into the text analytics vocabulary and place correctly in your taxonomy. And that will require more interaction with the tool provider. But then, being able to report on the emotional reactions your customers record in their blogs and on your call centers text notes may result in significant ROI.
Interestingly, one of the things noted in the webinar was the necessity of mandating the application of resulting actionable data to business decisions. The unspoken reality here is that simply having good VOC info does not mean adoption of the info will be a natural progression. As with all BI, you have to have executive backing -- active backing, not just financial backing.
So, text analytics: Is it worth the expense? What do you think?
Monday, June 1, 2009
Oh yeah. Duh!
Friday was frustrating, and I was irritated about work all weekend.
We had put into production a new Kalido mart that sourced an SSRS report on Internet payments to various funds. Our testing had been thorough, and we were pretty sure we had nailed it. So when we moved the new report to production and found that it was not giving correct results, we were not really thrilled. Neither were QA or the release team. Consequently, Joel and I spent much of Friday troubleshooting the issue, but all to no avail.
Come Monday morning, I began working on the issue early (5:30 AM). At 8:00 AM, I had gone into the office and was still working on identifying the issue. At 9:00 AM a few key players in our team (the Product Manager, the acting Tech Lead, myself -- the Data Architect -- and Joel, our lead Kalido developer) had a meeting to try an home in on the root cause of the problem. At 10:00 AM, the Web data SME (Subject Matter Expert) joined me, and we dove into the guts of the source data.
But the data was fine. The Kalido TDS (Transactional Data Set, i.e., fact table) load logic was fine. The CBE (Class of Business Entity, i.e., dimension table) loads' SQL queries were fine. And the Kalido query definitions were fine, having all the right filters, aggregation, and measures.
So at 11:45 AM, I re-ran the query definition to build the mart that was giving us so much grief. Over the weekend, Joel had re-loaded a CBE that had cause a large number of the rows from our TDS to go into suspense. Now, after I created the mart, I ran the dependant report again.
It worked!
All along, it had been a sequencing issue. The one CBE load had been inadvertently skipped prior to our loading the TDS. Consequently, the TDS was only partially loaded, so the mart was created sans some fairly significant reference data.
This was very much an, "Oh yeah. Duh!" experience for us. However, it does highlight the extent to which even a highly automated product like Kalido still requires a considerable effort on the human side of the equation.
We had put into production a new Kalido mart that sourced an SSRS report on Internet payments to various funds. Our testing had been thorough, and we were pretty sure we had nailed it. So when we moved the new report to production and found that it was not giving correct results, we were not really thrilled. Neither were QA or the release team. Consequently, Joel and I spent much of Friday troubleshooting the issue, but all to no avail.
Come Monday morning, I began working on the issue early (5:30 AM). At 8:00 AM, I had gone into the office and was still working on identifying the issue. At 9:00 AM a few key players in our team (the Product Manager, the acting Tech Lead, myself -- the Data Architect -- and Joel, our lead Kalido developer) had a meeting to try an home in on the root cause of the problem. At 10:00 AM, the Web data SME (Subject Matter Expert) joined me, and we dove into the guts of the source data.
But the data was fine. The Kalido TDS (Transactional Data Set, i.e., fact table) load logic was fine. The CBE (Class of Business Entity, i.e., dimension table) loads' SQL queries were fine. And the Kalido query definitions were fine, having all the right filters, aggregation, and measures.
So at 11:45 AM, I re-ran the query definition to build the mart that was giving us so much grief. Over the weekend, Joel had re-loaded a CBE that had cause a large number of the rows from our TDS to go into suspense. Now, after I created the mart, I ran the dependant report again.
It worked!
All along, it had been a sequencing issue. The one CBE load had been inadvertently skipped prior to our loading the TDS. Consequently, the TDS was only partially loaded, so the mart was created sans some fairly significant reference data.
This was very much an, "Oh yeah. Duh!" experience for us. However, it does highlight the extent to which even a highly automated product like Kalido still requires a considerable effort on the human side of the equation.
Subscribe to:
Posts (Atom)