Data lakes are the place where you gather, store and analyse your organisation’s raw data, with a view to turning that data into insights and action.
They give you a single, consistent view of the various data held across your organisation, on one platform.
In this sense, data lakes are the ‘Big Yellow Storage’ of the data world and they act as the catalyst which brings the whole issue of data management within your organisation into focus.
Having a data lake reaffirms an organisation’s need to invest more effort in data governance and data quality, helping you to understand the content that resides within it.
However, both in our experience and in Gartner's report on data lake failures, when data lakes are misunderstood in the context of a wider data strategy, many initiatives flounder. Or, to return to Big Yellow Storage, you can never find that one item you're looking for that you put in the box.
Indeed, over time we've seen a negative trend emerge: business and IT leaders overestimating the effectiveness of data lakes in their data strategies.
As a result, a number of myths have evolved which we’re keen to bust. These include:
“We don’t need governance, we have a data lake.”
“We have all our data in one place, so we know what data we have.”
“We have new shiny tools, we don’t need to think about data engineers.”
“We do agile, so we will get results.”
“We have a data lake, so we now have a data strategy.”
The mere fact that you have decided to deploy a data lake, with fantastic data, analytics and AI on top, doesn't mean that you have a full-blown data strategy.
Nor will it solve all your data management ills.
Yes, data lakes form an essential part of the strategy, but they are by no means the strategy in itself.
You need the whole data road map, and it's essential to bring the rest of your organisation along with you on this journey towards effective data management.
Because data lakes don't yet represent a well-trodden road for many organisations, there remains a lot to learn when compared with more established implementations such as data warehouses and data marts.
In this sense we’re yet to create the myths and the legends and, to a certain degree, failure does need to play a role here.
Whilst failure isn’t fun, it is an opportunity to learn and regroup. But as a tech community, we ought to help each other to reflect and learn too, by sharing our own experiences and failures so that we can each learn best practice.
Not only this, but when we experience a data lake project failure, we need rigorous policies in place for managing it, understanding what it means for our stakeholders and the wider data lifecycle.
With this in mind, here are three of my own personal data lake learnings.
I once failed to build traction with a data lake I was working on because I was unable to semantically describe what was in it. Without a consistent catalogue, registry and understanding of the contents, people won't know what's in the lake, so they won't come and use it. Or, as pointed out earlier, you end up with the Big Yellow Storage problem.
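To make that concrete, here is a minimal sketch of the kind of registry such a lake needs. It is plain Scala, and every name in it (DatasetEntry, Catalogue and so on) is illustrative rather than any real product's API; the point is simply that each dataset carries a semantic description and searchable tags alongside its physical location.

```scala
// Minimal sketch of a data lake catalogue (illustrative names, not a real API).
case class DatasetEntry(
  name: String,        // stable identifier users search for
  description: String, // what the data means, in business terms
  owner: String,       // who to ask about it
  location: String,    // physical path in the lake
  tags: Set[String]    // searchable keywords
)

class Catalogue {
  private var entries = Map.empty[String, DatasetEntry]

  def register(e: DatasetEntry): Unit = entries += (e.name -> e)

  // Discovery: find datasets whose description or tags mention a term.
  def search(term: String): Seq[DatasetEntry] =
    entries.values.filter { e =>
      e.description.toLowerCase.contains(term.toLowerCase) ||
        e.tags.exists(_.equalsIgnoreCase(term))
    }.toSeq
}

object CatalogueDemo extends App {
  val cat = new Catalogue
  cat.register(DatasetEntry(
    "orders_raw", "Customer orders as received from the web shop",
    "sales-engineering", "s3://lake/raw/orders/", Set("orders", "sales")))
  cat.search("orders").foreach(e => println(s"${e.name}: ${e.description}"))
}
```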
The ability to govern and control the data is vital. This is especially important where data is used to support machine learning and deep learning: people need to understand what data was used to generate the insights coming out of the data lake. Without that governance in place, you can't rely upon the decisions that are being made, so governance needs to play a central role in the adoption of data lakes in the enterprise environment.
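What might that look like in practice? One hedged sketch, with entirely hypothetical names: persist a lineage record for every model run, tying the output back to the exact dataset versions and code revision that produced it.

```scala
// Illustrative lineage record for governance of ML outputs.
// All names and values here are hypothetical, not a real API.
import java.time.Instant

case class DatasetVersion(name: String, version: String)

case class LineageRecord(
  modelId: String,             // the model or insight being governed
  inputs: Seq[DatasetVersion], // exactly which data fed it
  codeRevision: String,        // the transformation code that ran
  generatedAt: Instant
)

object LineageDemo extends App {
  val record = LineageRecord(
    modelId = "churn-model-v3",
    inputs = Seq(
      DatasetVersion("orders_raw", "2021-06-01"),
      DatasetVersion("customer_profiles", "2021-05-28")),
    codeRevision = "git:abc1234",
    generatedAt = Instant.now())

  // With records like this persisted, "what produced this decision?"
  // becomes answerable rather than a matter of trust.
  println(record)
}
```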
The performance experience of a data lake will either make it very successful, or it will kill it within your organisation. If you can't create a highly performant data lake in terms of how it ingests and egresses data, it won't perform well enough for your stakeholders and you will lose their buy-in to using the tool.
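Much of that ingest and egress performance is decided by layout choices at write time. As a sketch, assuming a Spark-based lake writing Parquet (a common stack, but an assumption here; paths and column names are illustrative):

```scala
// Performance-conscious ingest with Apache Spark (assumed stack).
// Paths and column names are illustrative.
import org.apache.spark.sql.SparkSession

object OrdersIngest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("orders-ingest")
      .getOrCreate()

    val raw = spark.read
      .option("header", "true")
      .csv("s3://lake/landing/orders/")

    // Write columnar and partitioned: readers filtering on order_date
    // only touch the partitions they need, which keeps egress fast.
    raw.write
      .partitionBy("order_date")
      .mode("overwrite")
      .parquet("s3://lake/raw/orders/")

    spark.stop()
  }
}
```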
What else should you consider in order to make your data lake a success?
Having been called in to save many a data lake project, we've compiled some of 6point6's top tips on how you can ensure a successful data lake experience as part of your wider data management strategy.
One issue that kills data lakes is cost.
As you start to take on more and more data and begin to generate increasing insight from the data lake, the cost will increase: from running the engineering and data science functions to storing and computing on the data.
So, make sure that you allocate enough budget to support and resource the project, including any additional costs to train and develop your team.
Data lakes will degrade fast, so keeping your data well curated is important.
In the world of data warehousing, a lot of time was spent designing the data model, which made it possible to curate, and act as custodian of, the data that flowed into the warehouse.
However, we’ve moved very quickly into schema on read with data lakes. This means that they can very rapidly become a data dumping ground rather than a core insight tool.
To combat this and to assign a value to your data, you need to clearly communicate what quality data looks like before it enters the data lake. Setting service level agreements (SLAs) and tolerances for what you are feeding in can help here.
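What such a tolerance might look like as an executable check, in a plain-Scala sketch where the thresholds and field names are made up for the example: a feed is only admitted to the lake if it meets the agreed SLA.

```scala
// Illustrative quality gate: a feed is admitted only if it meets agreed
// tolerances. Thresholds and field names are made up for the example.
object QualityGate extends App {
  case class Tolerance(minRows: Int, maxNullRate: Double)

  def passesGate(rows: Seq[Map[String, Option[String]]],
                 requiredField: String,
                 tol: Tolerance): Either[String, Unit] =
    if (rows.size < tol.minRows)
      Left(s"Only ${rows.size} rows; SLA requires at least ${tol.minRows}")
    else {
      val nullRate =
        rows.count(_.getOrElse(requiredField, None).isEmpty).toDouble / rows.size
      if (nullRate > tol.maxNullRate)
        Left(f"$requiredField missing in ${nullRate * 100}%.1f%% of rows")
      else Right(())
    }

  val feed: Seq[Map[String, Option[String]]] = Seq(
    Map("order_id" -> Some("A1"), "amount" -> Some("10.0")),
    Map("order_id" -> None, "amount" -> Some("5.0")))

  // Left(...) explains the rejection; Right(()) admits the feed.
  println(passesGate(feed, "order_id", Tolerance(minRows = 1, maxNullRate = 0.25)))
}
```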
Once the data is in your data lake, the content needs to be registered, and this is what enables it to gain a value.
With that value comes stakeholder buy-in and you will soon find that the data starts to be treated as an actual asset.
When it comes to creating a more productive data lake environment, it's imperative to think about the types of roles you need in order to make the data lake successful.
Data lakes have moved things on from extract-transform-load (ETL) tooling to data engineering, so it's important for your teams, especially those with lots of SQL-orientated ETL engineers, to become fully conversant with functional programming and software engineering, using languages like Scala and Java.
Immerse your teams in project-orientated programming and get them rapidly up the software development maturity curve in order to create practitioners who are productive in a data lake environment.
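To illustrate the shift in mindset (nothing here assumes your actual stack), the aggregation a SQL-orientated engineer would write as a GROUP BY becomes, in functional Scala, a pure function:

```scala
// The aggregation a SQL engineer would write as
//   SELECT customer, SUM(amount) FROM orders GROUP BY customer
// expressed as a pure, unit-testable function. Types are illustrative.
object FunctionalEtl extends App {
  case class Order(customer: String, amount: BigDecimal)

  def revenueByCustomer(orders: Seq[Order]): Map[String, BigDecimal] =
    orders
      .groupBy(_.customer)
      .map { case (customer, os) => customer -> os.map(_.amount).sum }

  val orders = Seq(
    Order("alice", BigDecimal("10.00")),
    Order("bob", BigDecimal("4.50")),
    Order("alice", BigDecimal("2.25")))

  println(revenueByCustomer(orders)) // alice -> 12.25, bob -> 4.50
}
```

Because it's an ordinary function over ordinary types, it can be unit tested and composed, which is exactly the step up the maturity curve described above.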
Too often there is little focus on building up that core software engineering capability, despite the fact that this enables you to get the most value out of your data lake.
A lot of data lake projects fail because teams try to take on too much in too short a timeframe.
Just because it's called a data lake doesn't mean you have to fill it from the off.
If you take a small, incremental, value-driven approach, you can slowly but surely create a groundswell of support and activity.
There’s no point building a data lake without any value, so think about what types of value you can derive when positioning your data lake.
Consider how you are going to increase the value of the information you derive from the data flowing into the data lake, and how you can add external, or other, data sources to enrich it and derive more value.
What type of processes improve with the use of the data coming out of the data lake?
How can you add more value to the business transaction? How does this value relate to improving the customer experience, so that your customers are better served?
How can you use the data to your advantage to drive more business?
How can you improve competitiveness where you are operating in a highly competitive marketplace? And how can you comply better in a highly regulated space?
How are you going to derive monetary value from your data lake? This is often the holy grail for organisations – is it possible for you to create a business model by selling or utilising the data across your partners or ecosystem?
Finally, how can consumers benchmark the cost and value they derive from your organisation against the other avenues where they do business?
These are all important considerations around the value proposition of your data lake.
Leveraging data lakes isn’t a technical challenge, although do be wary of cloud-based data lake platforms becoming unwieldy within your organisation.
The real issues that prevent data exploitation are the non-technical drivers we’ve discussed – from skills development and budget, to communication and curation.
Once you address these points and start delivering value within your data lake, you will start to reap the business and financial benefits.
6point6 has a wealth of knowledge and experience in developing data lakes and bringing back failing data lake projects from the brink, drawing on our team of cloud control experts as required.
For more information about how we can help, contact:
GARY RICHARDSON
Managing Director, Emerging Technology