Avoid these big data blunders in your business

AI and ML aren't just fad technology — not investing now could be a costly mistake down the line.
9 November 2020
  • Leveraging big data is a challenge for businesses, but it helps to be aware of some common mistakes

Businesses are constantly reminded of the importance of their data. And make no mistake, the ability to leverage it effectively is becoming crucial. From providing insights into what makes customers tick and predicting their behavior, to being fed into the development of machine learning models, using data well gives organizations a significant strategic advantage.

But challenges remain, from which specialists to hire and how much to invest in tools and solutions, to which data, within the vast and growing troves, holds the most value for the business’s own needs.

With much of the business press focused on what organizations should do with data, here are some common data blunders that businesses should avoid, according to computer scientist, database research pioneer, and MIT adjunct professor Michael Stonebraker.

Not moving to the cloud

If your organization isn’t planning to become cloud-exclusive, you could be backing a losing technology. The cloud is more elastic than your in-house solution and more cost-effective in the long run. Large cloud providers like Azure and AWS offer storage at a fraction of the cost, with better infrastructure, often tighter security, and staff who specialize in cloud management for a living.

“They’re deploying servers by the millions; you’re deploying them by the tens of thousands,” he said. “They’re just way further up the cost curve and are offering huge economies of scale.”

Shying away from AI and ML

AI will undoubtedly disrupt operations. It will displace some of the workforce and has the potential to upend how an organization handles its operations. That being said, you can choose to be a disruptor, or keep your head in the sand and get disrupted.

The solution? The willingness to pay for talent. Machine learning experts aren’t going to come cheap, and chances are high that HR will balk, but spending money now on experts nets you a much greater return later. “There’s going to be an arms race,” he said of the competition to hire talent. “Get going on it as quickly as you can.”

Overlooking real data science problems

It may not be glamorous, but data scientists spend 90% of their time on data discovery, data integration, and data cleaning. Without clean data, your big data initiatives mean nothing. Your machine learning is worthless. Don’t miss this step.

Companies should get a system in place and stick to it, because your data scientists, the talent you’ve spent money on and fought with HR to hire, can help lead the way. But your organization needs to solve its real data problem: the quality of its data. The best way to address this, he said, is to have a clear strategy for dealing with data cleaning and integration, and to have a chief data officer on staff.
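To make the cleaning and integration step concrete, here is a minimal Python sketch of what that work can look like at its simplest. It is an illustration only, not a method described by Stonebraker: the record layout, the `normalize` and `integrate` helpers, and the choice of email as the matching key are all assumptions made for the example.

```python
def normalize(record):
    """Clean the fields used for matching (hypothetical 'name' and 'email' fields)."""
    return {
        # Collapse repeated whitespace and standardize capitalization.
        "name": " ".join(record.get("name", "").split()).title(),
        # Emails are compared case-insensitively, without stray spaces.
        "email": record.get("email", "").strip().lower(),
    }


def integrate(*sources):
    """Merge records from several sources, keeping one record per email."""
    merged = {}
    for source in sources:
        for raw in source:
            rec = normalize(raw)
            if not rec["email"]:
                continue  # no usable key: skip rather than guess
            # The first source to mention an email wins (an assumed policy).
            merged.setdefault(rec["email"], rec)
    return list(merged.values())


# Two made-up sources describing overlapping customers.
crm = [{"name": "  ada LOVELACE ", "email": "Ada@Example.com"}]
billing = [
    {"name": "Ada Lovelace", "email": "ada@example.com "},
    {"name": "alan turing", "email": "alan@example.com"},
    {"name": "No Email", "email": ""},
]

customers = integrate(crm, billing)
for customer in customers:
    print(customer)
```

Even in this toy case, the two sources disagree on spacing and capitalization, and one record cannot be matched at all. Real data adds misspellings, conflicting values, and many more sources, which is why this work takes up so much of a data scientist’s time.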

Belief in ‘traditional’ techniques

Traditional data integration isn’t going to cut it in the world of big data. The two most common approaches, extract-transform-load (ETL) and master data management (MDM), are too old to work properly and won’t scale.

Believing data warehouses will solve all your problems

Data warehouses can solve some big data problems, but not all of them. Warehouses don’t work for things like text, images, and video, Stonebraker said. Instead, use data warehouses for what they’re good at, such as customer-facing, structured data from a few data sources.

“Get rid of the high-price spread and just remember, always, that your warehouse is going to move to the cloud,” he said.

Succumbing to the “Innovator’s Dilemma”

Often, legacy systems have to be abandoned, even if it results in drastic changes or potentially losing customers. It’s a road of constant bets on the future and being able to reinvent the organization. “You simply have to be willing to do that in any high tech field,” Stonebraker said.

Outsourcing new stuff to big data analytics services firms

New tools shouldn’t be outsourced, Stonebraker said. Other things should, like maintenance — and while you’re at it, don’t run your own email system, he said.

Assuming that data lakes will solve all problems

Stonebraker suggests companies clean their lake data with a data curation system. “This problem has been around since I’ve been an adult and it’s getting easier by applying machine learning and modern techniques,” he said, but it’s still not easy and companies should put their best staff on the problem.

“Don’t use your homebrew system,” he said of in-house technology, which is often outdated. Usually, the best data curation systems come from startups, he said.

No, Hadoop/Spark will not solve all your problems

Hadoop, the open-source software collection from Apache, and Spark, Apache’s analytics engine for big data processing, shouldn’t be the answer for everything or everyone, Stonebraker said. But many companies have invested in them.

“In my opinion, you should be looking at best-of-breed technologies, not the lowest common denominator,” Stonebraker said. “Spark and Hadoop are useless on data integration,” he added, which is where data scientists spend a lot of their time.