Setting sail across the data lake
When businesses start out, data management is easy – there’s not much of it, and it typically lives in only one or two repositories – a local fileshare, say, and a remote, cloud-based business tool delivered as-a-service.
There is a host of tools out there that make data collation from specific sources easier, such as the open source Apache Mahout and Python libraries like Orange. These, as many readers will know, help those poor souls without the technical chops to negotiate a world where bon mots like the following make sense:
SELECT customers.custom_id, orders.order_id, orders.order_date
FROM customers
INNER JOIN orders
ON customers.custom_id = orders.custom_id
ORDER BY customers.custom_id;
However, as organizations grow, so do their data collections, until a situation is reached where a newly-appointed Chief Data Officer (CDO) will refer to the ‘data lake’ that needs negotiation and management. ‘Data lake’ is, for want of a better definition, a collective term for the multiple repositories (sometimes called silos) that form the totality of a company’s data assets.
Business intelligence (BI) systems vary from supplier to supplier in how they mine data and, importantly in this context, in the types of sources from which they can mine it. Pure BI solutions are designed to satisfy specific purposes, and while powerful enough in themselves, traditional BI routines may not be suited to the context of data lakes.
Once an enterprise reaches the stage of having a data lake, users and analysts deploying simple BI solutions simply can’t trust or prepare data efficiently. In fact, they may not even know whether whole swathes of data are trustworthy, where their data might be, or whether it exists at all!
New breeds of software are coming to market to address the issue of data lakes – some are business intelligence platforms that claim to be able to negotiate multiple data sources, like cloud, in-house, external SaaS and remote ERP, for example.
Others, however, are acting as a kind of middleware, abstracting the location and types of repositories and presenting a single source of data to legacy BI routines. This is no mean accomplishment. The different data sources can be in many flavors: SQL, Avro, Parquet, raw and compressed, myriad archive formats, and so forth.
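The middleware idea can be sketched in a few lines: hide each repository’s format behind a single read call, so downstream routines never care where the data came from. A minimal, illustrative sketch follows – the source names, sample data and dispatch-by-extension rule are assumptions for demonstration, not any vendor’s API.

```python
import csv
import io
import json

def read_records(source_name, payload):
    """Return rows as a list of dicts, whatever the underlying format.

    Dispatching on file extension stands in for the real format
    detection a middleware product would perform.
    """
    if source_name.endswith(".csv"):
        return list(csv.DictReader(io.StringIO(payload)))
    if source_name.endswith(".json"):
        return json.loads(payload)
    raise ValueError("No reader registered for " + source_name)

# Two 'repositories' holding similar data in different formats.
csv_data = "custom_id,region\n1,EMEA\n2,APAC\n"
json_data = '[{"custom_id": "3", "region": "AMER"}]'

# One uniform result, regardless of source format.
rows = read_records("crm_export.csv", csv_data) + \
       read_records("erp_dump.json", json_data)
```

A production system would register dozens of readers (Avro, Parquet, archives) behind the same interface; the point is the single, format-agnostic entry point.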
Analysts and information stewards need to search, query and collaborate. And to achieve this they need data library solutions which can capture data from disparate contexts and create relationships between data sets. It’s the relationships, as any DBA will tell you, that create the basis for meaningful business insights.
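To make the point about relationships concrete, here is a small sketch using an in-memory SQLite database. The table and column names mirror the earlier SQL example but the data is invented for illustration: two isolated tables become useful only once the relationship between them is declared in the join.

```python
import sqlite3

# Two data sets that might originate in different silos.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (custom_id INTEGER, name TEXT)")
con.execute("CREATE TABLE orders (order_id INTEGER, custom_id INTEGER, "
            "order_date TEXT)")
con.executemany("INSERT INTO customers VALUES (?, ?)",
                [(1, "Acme"), (2, "Globex")])
con.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(100, 1, "2018-11-01"), (101, 1, "2018-11-05")])

# The relationship (customers.custom_id = orders.custom_id) turns two
# unrelated tables into a business insight: order volume per customer.
rows = con.execute("""
    SELECT customers.name, COUNT(orders.order_id)
    FROM customers
    LEFT JOIN orders ON customers.custom_id = orders.custom_id
    GROUP BY customers.custom_id
    ORDER BY customers.custom_id
""").fetchall()
# rows -> [('Acme', 2), ('Globex', 0)]
```

Note the LEFT JOIN: a customer with no orders still appears, with a count of zero – exactly the kind of relationship-aware view a data library solution has to provide across silos.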
With data storage increasingly cheap and plentiful, administrators have adopted a ‘keep it just in case’ mentality. Therefore it’s usually necessary to assess data for its intrinsic validity, as well as assessing its business ‘weight’, or potential for use strategically. De-duping, arranging in chronologies, prioritizing and discovering the ‘best’ or latest version of data – therein lies the challenge.
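Two of the chores just listed – de-duping and surfacing the latest version of a record – can be sketched with nothing more than a sort and a keyed overwrite. The field names and timestamps below are illustrative assumptions.

```python
# Records accumulated 'just in case', with duplicates and stale versions.
records = [
    {"custom_id": 1, "email": "a@example.com", "updated": "2018-03-01"},
    {"custom_id": 1, "email": "a@example.com", "updated": "2018-03-01"},  # exact duplicate
    {"custom_id": 1, "email": "a@new.com",     "updated": "2018-10-15"},  # later version
    {"custom_id": 2, "email": "b@example.com", "updated": "2018-06-20"},
]

# Arrange in chronological order (ISO-8601 dates sort lexically),
# then keep only the last-seen record per key: duplicates collapse
# and the newest version wins.
latest = {}
for rec in sorted(records, key=lambda r: r["updated"]):
    latest[rec["custom_id"]] = rec

deduped = list(latest.values())
# deduped -> one record per custom_id, each the most recent version
```

Real data lakes complicate this with conflicting keys and unreliable timestamps, which is why assessing intrinsic validity comes before any mechanical de-duping.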
Douglas B. Laney, of Gartner’s CDO Research team, expounds on this notion in his book, “Infonomics: How to Monetize, Manage, and Measure Information as an Asset for Competitive Advantage”. He explains that calculating the value of each information asset can bring clarity to data accessibility, governance, and management.
Of course, assessing, preparing and cataloging data is only a starting point in the broader sea of information management or data lake navigation. But software needs to be able to work consistently across all information sources and then track information’s impact and return over time. That’s critical to the success of today’s Chief Data Officers.
16 November 2018