Data Catalog: Enabling Self-Service Analytics

A Chinese proverb says, “The best time to plant a tree was 20 years ago; the second-best time is now.” Let’s assume you’re already up and running with a big data Hadoop platform for advanced analytics use-cases. Perhaps you’ve ingested multi-structured data from disparate sources and are developing and delivering products for proofs-of-concept. To organize the data associated with your products, make it easy to search and find, and provision it rapidly to your end-users, you need a Data Catalog. As with the tree, the best time to implement a Data Catalog (DC) is during the early planning stages; however, the second-best time is today.

Several use-cases illustrate why a DC is necessary:

  • Fill in the gaps: You’re deep in the midst of a new failure analysis and find that you’re missing 60% of maintenance start dates, due to an error in the archive job; what other data sets might help fill in that missing data?
  • Explore what’s possible: You’re on the hunt for data keys that will let you hop-scotch from application to application; with a multi-step cross-reference, can you finally unlock those measurement logs from a one-time sensor study?
  • Test out hypotheses: Your team talks in anecdotes and examples; can you prove that there IS a seasonal correlation between new customers and off-season items?
  • Streamline or rationalize: You’re starting an application rationalization and want to trace data lineage from the system of record; how many different versions of “the truth” are there?
  • Learn from the traffic: You’re responsible for enterprise data governance, so you want the metadata about the DC; who is looking for what data, and how can you better meet their needs?
  • Find fresher data: The team’s monitoring report runs off of quarterly inventory losses that are allocated monthly to different organizations; can you track down the raw, weekly data so that the team isn’t surprised at month end?

These use-cases boil down to three questions: what data do I have (and therefore not have) in the lake, how can I provision data effectively to enable self-service analytics, and how do I classify data so that it is most useful?

What’s in the Lake?

A DC should provide functionality and a user experience similar to those of a brick-and-mortar superstore. Imagine your consumers need to find proppant levels for the past six months for an unconventional well. Like the signposts hanging from the ceiling in your local Costco, the DC should lead them to the right aisle; for example, Upstream → Production → Unconventional → Region → Well → Proppant → Time-Frame. Spending some time brainstorming the structure and multiple paths to discovery will benefit end-users and make them more likely to keep using the service.
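To make the "aisle" idea concrete, here is a minimal sketch of a catalog that supports two paths to discovery: browsing down the hierarchy and searching by tag. The class names, data set names, and tags are all hypothetical and chosen only for illustration; a real DC product will have its own data model.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One data set registered in the catalog (hypothetical structure)."""
    name: str
    path: tuple[str, ...]                      # the "aisle" hierarchy, broadest first
    tags: set[str] = field(default_factory=set)


class DataCatalog:
    """Toy in-memory catalog; an assumption for illustration, not a product API."""

    def __init__(self) -> None:
        self._entries: list[CatalogEntry] = []

    def register(self, entry: CatalogEntry) -> None:
        self._entries.append(entry)

    def browse(self, *prefix: str) -> list[str]:
        """Names of entries whose hierarchy path starts with the given prefix."""
        return [e.name for e in self._entries if e.path[:len(prefix)] == prefix]

    def search(self, *tags: str) -> list[str]:
        """Names of entries that carry every requested tag."""
        wanted = set(tags)
        return [e.name for e in self._entries if wanted <= e.tags]


catalog = DataCatalog()
catalog.register(CatalogEntry(
    name="proppant_volumes_daily",             # hypothetical data set
    path=("Upstream", "Production", "Unconventional"),
    tags={"proppant", "well", "daily"},
))
catalog.register(CatalogEntry(
    name="maintenance_work_orders",            # hypothetical data set
    path=("Upstream", "Maintenance"),
    tags={"maintenance", "well"},
))

# Two routes to the same data: walk the aisles, or ask by tag.
by_aisle = catalog.browse("Upstream", "Production")
by_tag = catalog.search("well", "proppant")
print(by_aisle, by_tag)
```

Both calls lead to the same proppant data set, which is the point: a user who thinks in terms of the business hierarchy and a user who thinks in terms of keywords should each be able to find it.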

Provisioning Best Practices

Once those users have found the right data, how do you get it into their hands? First, a good relationship with your data source stewards is important; they need to feel secure enough to approve data consumption quickly across many requests, they need line of sight on lineage so that derived data can be tracked through transformations, and they should help tag the data coming from their respective system(s).

Second, there should be a quick turnaround between request and provisioning; otherwise, end-users’ ability to leverage data for business decisions is limited. As such, the DC should have built-in processes for automating provisioning wherever possible. A DevOps culture and its processes can go a long way toward meeting the organization’s need for rapid provisioning. Change managers are also essential for training those stewards on the tools.


Upon ingestion into the lake, metadata needs to be gathered and the data should be tagged, ideally by a representative (data custodian) with significant business knowledge who can differentiate and assign tags effectively. As demonstrated in Figure 1, not all data is created equal, and various levels of rigor can be used for tagging, based on the data’s intended use.
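One way to apply different levels of rigor is to require more tags as the intended use becomes more demanding. The sketch below assumes three made-up levels ("exploratory", "shared", "governed") and made-up tag names; they are placeholders for illustration and are not taken from Figure 1.

```python
from __future__ import annotations

# Hypothetical rigor levels and the tags each one requires (assumed, not from the article).
REQUIRED_TAGS: dict[str, set[str]] = {
    "exploratory": {"source_system"},
    "shared": {"source_system", "business_domain", "custodian"},
    "governed": {"source_system", "business_domain", "custodian",
                 "sensitivity", "lineage"},
}


def missing_tags(tags: dict[str, str], intended_use: str) -> set[str]:
    """Return the required tags that are absent or empty for the intended use."""
    provided = {key for key, value in tags.items() if value}
    return REQUIRED_TAGS[intended_use] - provided


# A data set tagged at ingestion by its custodian (values are illustrative).
ingested = {
    "source_system": "maintenance_db",
    "business_domain": "Upstream",
    "custodian": "j.doe",
}

print(sorted(missing_tags(ingested, "shared")))
print(sorted(missing_tags(ingested, "governed")))
```

With these assumed levels, the same data set is ready for shared use but would still need sensitivity and lineage tags before it could be treated as governed data.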


If you’re up and running with your Big Data engine, perhaps you’re comfortable procuring data piecemeal for pilots and the like. That can work during inception and the early stages, but eventually you will have new ideas coming down the pike and would-be product owners approaching you to understand what’s already in the lake and what they’ll need to source. Being able to provide that information, as well as provision and classify the data effectively, will buy credibility and can facilitate data gravity (the idea that the more data in a lake, the more data it will attract), which can be a key differentiator in the Enterprise Hub game.

By Tommy Ogden, Senior Manager at Enaxis Consulting
