TL;DR – Data lakes sound cool, but you probably need something you’ve already heard of instead.
I’m a sucker for true crime podcasts. I used to listen to them on my commute, but these days I turn them on while I get stuff done around the house. Not long ago, I listened to an episode about Lake Lanier. Since 1994, more than 160 deaths have been reported on the lake, according to Department of Natural Resources statistics. The podcast wanted me to find it deadly, scary even.
When I started researching how deadly Lake Lanier actually is, I discovered that there have been 434 drownings in Lake Michigan since 2010. That’s way more deadly than Lanier, but I didn’t find a public outcry about Lake Michigan being a killer. It’s generally accepted that Lake Michigan can be quite difficult to navigate, even for the strongest of swimmers.
Lakes in general seem like safe places to swim, but they are actually full of hazards. Between the foliage and rocks lurking beneath the surface and the cold temperatures that can turn you hypothermic, it’s easy to be overwhelmed and drown.
Data lakes, while never having contributed to physical death that I’m aware of, unfortunately also suffer from significant problems.
What is a Data Lake?
A data lake is an aggregate of all of your data, including unstructured information like emails and document contents. Whatever the format or source, a data lake brings everything into a single repository.
It sounds amazing.
But it’s likely so far away from what you really need.
Look out for firms that tell you a data lake is what you really need to build right now. Given that 85% of data lake projects fail, you’ll want to understand exactly what they’re suggesting you create. They are likely asking you to swim out to the deep part without making sure it’s safe. They sell you on big ideas that come with even bigger price tags, without explaining how many years it will take before you see any ROI.
When it comes to data, what does safety look like?
Most big data companies will tell you that 80% of your data is sitting in unstructured formats where you can’t use it. That number is likely accurate, even in the wealth management industry, but that 80% isn’t the place to start.
The other 20% of your data sits in structures you can access but haven’t been able to fully integrate. Think of all the databases, tables and fields you have full of information that you wish would inform other databases, tables and fields.
Data lakes are a trendy idea, but the truth is that you don’t really need one if you don’t already have a data warehouse with a bidirectional API layer on top of it. There is no need to start with the hardest part. Instead, identify the processes you want to streamline and the information that could be shared across your systems if your unique identifiers matched up properly. Then figure out the small steps that move you toward your big goal.
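To make the unique identifier point concrete, here is a minimal Python sketch of what one of those small steps might look like. Everything in it is hypothetical: the two systems (a CRM and a portfolio system), the field names client_id, name and balance, and the formatting quirks are assumptions made up for illustration, not a description of any real product or of your data.

```python
def normalize_id(raw: str) -> str:
    """Remove surrounding whitespace, dashes, and case differences
    so the same client ID compares equal across systems."""
    return raw.strip().replace("-", "").upper()


def link_records(crm_rows, portfolio_rows):
    """Match CRM rows to portfolio rows on a normalized client_id.

    Returns (linked, unmatched): combined records for IDs found in
    both systems, and CRM rows with no portfolio counterpart.
    """
    portfolio_by_id = {normalize_id(r["client_id"]): r for r in portfolio_rows}
    linked, unmatched = [], []
    for row in crm_rows:
        key = normalize_id(row["client_id"])
        match = portfolio_by_id.get(key)
        if match is None:
            unmatched.append(row)
        else:
            linked.append(
                {"client_id": key, "name": row["name"], "balance": match["balance"]}
            )
    return linked, unmatched


# Hypothetical sample data: the same clients, keyed slightly differently.
crm_rows = [
    {"client_id": "ab-1001", "name": "Pat Doe"},
    {"client_id": " AB1002 ", "name": "Sam Roe"},
    {"client_id": "AB-1003", "name": "Lee Poe"},
]
portfolio_rows = [
    {"client_id": "AB1001", "balance": 250000},
    {"client_id": "ab-1002", "balance": 90000},
]

linked, unmatched = link_records(crm_rows, portfolio_rows)
print(f"{len(linked)} linked, {len(unmatched)} unmatched")
```

The unmatched list is the useful part. It gives you a short, specific cleanup task (why does this client exist in one system but not the other?), which is a much smaller swim than ingesting every email and document you own.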
Once you’ve fully warehoused all of your structured data and streamlined every part of your business, it might be time to head to the lake. Of course, you might just decide that your warehouse and API layer brought you to the blue ocean instead and there’s no need to go to the lake.