Avoid Drowning in Your Data Lake

Blog Post created by Advocate on Aug 21, 2017

In the age of Hadoop and the Big Data hype, millions (or billions) of dollars are being spent to create the biggest data repositories: collections of unstructured data files stored horizontally across server farms. Over the last few years of investment, organizations large and small have built out Big Data collections and, instead of strategically planning for the what and the why of their data, have started to collect everything in the hope of collecting something (anything) of value. These increasingly large data repositories even have a new name: Data Lakes.


Whether they begin as a formal investment in a supported distribution such as Cloudera, Hortonworks or MapR, or simply as a side project initiated by downloading an open source distribution, Big Data projects are now so numerous that they are likely not (yet) dwarfed by the countless words marketers and industry analysts have written hyping the new frontier of rich data insights promised by the commoditization of distributed file systems and NoSQL databases. Spurred on by the promise of real-time engagement and streaming data feeds, IT organizations have leapt into Big Data infrastructure with both feet, but with no planned landing.


Not at all ironically, four or five years after Hadoop-mania started, the new “big” thing seems to be artificial intelligence and machine learning, technologies that, while they continue to evolve quickly, have been around for decades. Why, all of a sudden, have AI and ML become trendy areas of investment? Could it be because, with data science experience hard to find and expensive to buy, businesses stuck with their large Data Lakes are under tremendous and increasing pressure to demonstrate value? What better way to do that than to invest in technologies that promise to automate data insights? Check out “Can Machine Learning Save Big Data?” for a deeper discussion on this.


The big challenge, however, remains: how do businesses collect all that data, sift and wrangle it to identify value, and then disseminate or leverage that value in real time throughout the business in a manner that’s actionable? Many Big Data investors have discovered an interesting Data Paradox: the more data you have, the less value it provides.


Drowning in data is becoming a figurative reality.


A more fundamental challenge with the growing repositories of data is that, in many cases, businesses don’t even know what they’re collecting. The regulatory compliance required when harvesting credit card, personal health, or benefits information is onerous for the unprepared. Just as important is the question of whether any of that data is even relevant to the questions that need to be answered. Whether data is predictive and relevant is a critical question, and it should be asked before a Data Lake is even considered. Large volumes of irrelevant data will only obfuscate and confuse inquiries and insights.


To glean even some value from their Big Data infrastructure, many businesses have increased their investments in business intelligence and “data visualization” technologies, that is, the practice of visualizing data in charts and graphs to help identify trends. While these tools are great for reporting on sales numbers, they do not really provide the value add that you’d expect from massive investments in Big Data and real-time Data Lakes. After all, BI and data visualization technologies have been around for almost twenty years. They largely haven’t been architected to provide advanced analytic insights or to leverage real-time predictive algorithms.


New data wrangling techniques are now coming to the forefront to address data obfuscation and relevance. However, there are no artificial intelligence or machine learning algorithms to pre-determine relevance. This is the value add of a true data scientist. Programming initial models to carefully consider relevance and applicability of data to problems is a science, so it should not be left to chance. Adding new data to an existing model or data pool is one thing; you can then measure and adjust based on predictive performance. The process of adding new data is certainly one that can be automated. Machine learning techniques exist to refine, train and improve existing models. However, initiating a new model requires training, perspective, and experience.
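The "measure and adjust" step described above can be sketched in a few lines. This is only an illustration under stated assumptions, not a recommended modeling approach: the data values, the single-feature least-squares fit, and the helper names (`fit_line`, `mse`, `holdout_gain`) are all hypothetical. The idea is to keep a candidate data feed only if it predicts held-out outcomes better than a baseline that ignores it.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x with a single feature."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        return my, 0.0  # a constant feature carries no signal
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b

def mse(preds, ys):
    """Mean squared error between predictions and actual values."""
    return mean((p - y) ** 2 for p, y in zip(preds, ys))

def holdout_gain(xs, ys, split):
    """Fit on rows before `split`; return (baseline_mse, model_mse) on the rest."""
    train_x, train_y = xs[:split], ys[:split]
    test_x, test_y = xs[split:], ys[split:]
    base = mean(train_y)  # baseline: always predict the training average
    a, b = fit_line(train_x, train_y)
    baseline_mse = mse([base] * len(test_y), test_y)
    model_mse = mse([a + b * x for x in test_x], test_y)
    return baseline_mse, model_mse

# Hypothetical data: one feature that tracks the target, one that does not.
target = [3, 5, 7, 9, 11, 13, 15, 17]
relevant = [1, 2, 3, 4, 5, 6, 7, 8]
noise = [1, 0, 0, 1, 1, 0, 0, 1]

for name, feature in (("relevant", relevant), ("noise", noise)):
    baseline_mse, model_mse = holdout_gain(feature, target, split=6)
    verdict = "keep" if model_mse < baseline_mse else "drop"
    print(f"{name}: baseline={baseline_mse:.2f} model={model_mse:.2f} -> {verdict}")
```

A check like this automates the measuring, but deciding which candidate feeds are worth testing in the first place, and what counts as a meaningful improvement, remains the judgment call the paragraph above assigns to the data scientist.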


Unfortunately for many businesses, investments have been made and data collection has been done (or is being done) already. The heavy lifting required to glean value from these in-place repositories will require the human touch.


How can you avoid drowning in all that data?


For those not yet invested, there is still time to consider what problems need to be solved, what insights need to be gleaned, what processes can be enhanced, and what data is predictive. Still, the question remains: how can businesses make use of their ever-filling data lakes? There is in fact a way to solve this problem. Hundreds of companies across every industry imaginable are doing it. Check out how Vantiv is streamlining merchant onboarding, Southwest Airlines is optimizing flight and crew schedules, and Bank of Communications is fast-tracking its credit card business. These companies are using a combination of predictive data science, business rules, and optimization not only to dramatically improve business performance, but also to automate the process from data ingestion to insight to action.


I will continue this discussion with my colleague in a complimentary webinar on Wednesday, August 30, 2017. Join us to discuss Prescriptive Analytics and Decision Management: The Secret Ingredients for Digital Disruption.