After just a few years of excitement and enthusiasm for data lakes, it is already quite clear that there are a few areas where many organizations are struggling to make progress. One of the recent exasperations I noted was the tendency of vendors in this space (whatever this ‘space’ is) to reinvent what they knew and loved, and what clients hated and wanted to ignore, about building a classic data warehouse. See my blog: When is a Data Lake an Enterprise Data Lake?
Bypassing the hype and the obvious relationship to the proverbial silver bullet, here are three areas where it seems like potentially new next-practices are emerging. I am not sure these are baked, or fully thought through, but we would love to get your feedback. In no particular order of priority:
Discovery versus Delivery
We have noted a pattern that seems reasonable enough to capture the challenges with data lakes and their attendant design and technology, while at the same time providing a means to compartmentalize them so that they can be overcome. The idea goes like this:
Design one set of capability around broad-based, flexible discovery methods (mode 2, using Bimodal terminology)
Design one set of capability around efficient and highly optimized delivery (mode 1, using Bimodal terminology)
The former roughly equates to, but should not be treated as synonymous with, a data lake or Hadoop; the latter roughly equates to, but should not be treated as synonymous with, a relational data warehouse. The actual technology used by either mode is a different topic.
The former is where flexibility is needed and where information governance and restrictions should be limited. Heavy restrictions would undermine the very purpose of the capability. Thus information governance policy should be light: not implemented in depth in the technology, and perhaps more effectively applied as blankets or layers of process as needed, based on context.
Once an analysis or insight is ‘discovered’ and a business client wants to standardize the delivery of that analytic or insight regularly, it (and its data) is ‘moved over’, baked or automated into the delivery system. This will likely require different technology; maybe sometimes a different organization too. Here the more traditional use of business rules, workflow, and exception management can be explored to help assure compliance and enforcement of information governance policy.
Too many firms have tried, in the last year or two, to implement MDM or a more rigid data quality routine on a data lake. It has not gone that well. Sometimes partitioning the data in the lake works; more often splitting the lake into two distinct modes of operation seems to work well, so far.
What kind of conceptual models are you setting up? How are you getting along?
Information Stewardship: operational versus analytical
This is a relatively new issue.
Only about five years ago we collectively figured out the real need for the role of an information steward. It is best served and instantiated as a business (as in line-function) role. Most application ‘power users’ are great candidates. They are business-centric chief problem solvers. They are not all the users of your applications; nor are they IT roles. These roles are not easy to set up or sell, and many firms are struggling, having first set them up in IT as an expedient first step.
But things are changing again. Many clients now report that their ‘BI centers of excellence’ are being supplanted by a ‘data science lab’ with wider powers and more tools, yet with the same governance headache, only worse. So the question is, what role related to information governance should persist or be supported by the data science lab?
In a nutshell such roles (data science) are not line-function; your data scientist does not as a rule talk with the factory or customers day to day. They tend to provide IP and skills to the business users that do. So clearly:
Data science should follow and at most enforce policy related to information governance
Data science should not set policy
Thus data science may need to operate not as regular users of data in applications, but as users of data in an analytical system. They minimally operate as analytic users. They follow the rules set by others.
At most these data science roles may need to understand policy and, when violations of targets are triggered, enforce those policies. So it is quite possible that data science labs will include the role of steward. But it is not yet a totally done deal. It just seems to be a good thing to try out.
How are you getting along with this challenge? Do tell us.
(Big) Data catalogs – enterprise-wide or organic?
This is another classic from yesteryear, only with a big data spin.
We all tried to build enterprise-wide data catalogs and data models. Remember the fun and games? Not three years ago I even sat in a vendor’s customer case study presentation where I heard the following consecutive responses from architects at a very large bank:
“It took us about 6 years to design and develop our ideal, future-state data model for our business,” and
“It is too early to tell if this has generated any business value for our company yet. We have not yet (figured out how to) used it.”
But it seems that big data catalogs are all the rage again. It’s as if a new bunch of technology vendors have just rediscovered semantic discovery and classification tools. Yes, they are faster now. Yes, maybe they even learn a little bit. But the end-state, some logged, maybe self-described, catalog of data assets, is the in-thing. I suspect that the drive for compliance, especially in markets driven by regulation (think the EU’s GDPR), is a keen trigger: how can you comply with regulations if you don’t know where your assets are?
But what is the value of a data catalog beyond compliance? Is there business value knowing about what assets you have? Yes, clearly there is value. But what form is that value? To whom does that value accrue? How do you extract that value?
It is quite possible that tomorrow’s (big) data catalog will not be like the exhaustive data catalog originally conceived of old. It might be that the new data catalogs need to operate as follows:
Grow in size and scope until a given size is sustained that encompasses the necessary data needed to assure compliance
Contract in size in order to respect the trade off in terms of cost to sustain versus usage and value extracted
Thus the data catalog of tomorrow might be more dynamic in size and may consist of different layers:
(largest) Data catalog for compliance
(smaller) Information catalog for asset value purposes
(smallest) Analytics catalog for Performance management and process/outcome improvement
What sort of uses for (big) data catalogs has your organization developed recently?
(About the author: Andrew White is a research vice president and agenda manager for MDM at Gartner Group. This post originally appeared on his Gartner blog, which can be viewed here)