Why you need to manage your unstructured data

If you haven't yet - why not?
9 December 2022

Unstructured data has become one of the bigger problems for businesses in recent years due to its exponential growth, and 2022 was the year in which solutions to it became both available and cost-effective for enterprise-level companies. We sat down with Krishna Subramanian, co-founder of Komprise, which deals with unstructured data for enterprise-level concerns on a daily basis, to find out why it’s in companies’ interests to manage their unstructured data sooner, rather than later.

THQ:

Just for any newcomers to the problem – what’s the fundamental issue with unstructured data? Come to that, what are we talking about when we say “unstructured data”?

KS:

Unstructured data is any data that doesn’t fit neatly into a database, and isn’t really structured in rows and columns. So every photo on your phone, every X-ray, every MRI scan, every genome sequence, all the data generated by self-driving cars – all of that is unstructured data. And perhaps more relevant to more businesses, artificial intelligence (AI) and machine learning (ML) – they depend on, and usually output, unstructured data too.

Unstructured data is growing every day at a truly astonishing rate. Today, 85% of the world’s data is unstructured data.

And it’s more than doubling, every two years.

THQ:

Are we just old, or is this a fairly new thing?

KS:

It’s definitely fairly new. Go back a decade and almost nobody would know what we were talking about. Unstructured data back then was very small compared to structured data.

THQ:

Because to be considered “data” back then, you had to be the kind of data that fit into databases?

KS:

Partly that, but also the increasing ubiquity of things like smartphones, the development of radical new technologies like AI and ML, and the digitizing of data that was previously undigitized – think census data, medical records, land registration, all that stuff.

THQ:

So since the arrival of the cloud, essentially, we’ve gone crazy for unstructured data. And there’s more and more of it every day. This is us playing Devil’s Advocate, but…why should businesses care? Let alone develop an unstructured data strategy to deal with the problem?

A whole lot of laptops.

KS:

A couple of reasons. When we say there’s a lot of unstructured data being generated, we’re not just talking about “Your hard drive is getting full” levels of data. Just as an example, we work with Pfizer, the pharmaceutical company. Pfizer scientists generate around 10TB every day.

THQ:

We’re talking to you right now via a 2TB laptop.

KS:

Got another four handy? That’s what we’re talking about – a five-laptop-a-day habit.

You wake up tomorrow, that’s another five laptops. And the next day. And the next.

THQ:

OK, so that’s a lot of data. Got it.

KS:

And most companies that are generating that volume of data have to keep the data for at least 25 years.

THQ:

That’s…that’s 45,500 laptops’ worth.

KS:

So when we say it’s a problem that’s getting too big to ignore, now you have an idea of the scale of the problem.
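
For readers who want to check the scale, the figures quoted in the conversation can be run through a quick calculation. This is only a back-of-the-envelope sketch: it assumes 365-day years, decimal units (1PB = 1,000TB) and a daily volume that stays constant for the whole retention period, none of which the interview states. On those assumptions the total lands at roughly 45,600 laptops, in line with the round figure quoted in the conversation.

```python
# Back-of-the-envelope check of the figures quoted in the conversation.
DAILY_TB = 10          # data generated per day (quoted in the interview)
LAPTOP_TB = 2          # capacity of the interviewer's laptop (quoted in the interview)
RETENTION_YEARS = 25   # minimum retention period (quoted in the interview)
DAYS_PER_YEAR = 365    # assumption: leap days ignored

laptops_per_day = DAILY_TB / LAPTOP_TB
total_tb = DAILY_TB * DAYS_PER_YEAR * RETENTION_YEARS
total_laptops = total_tb / LAPTOP_TB

print(f"{laptops_per_day:.0f} laptops per day")
print(f"{total_tb:,} TB over {RETENTION_YEARS} years (about {total_tb / 1000:.0f} PB)")
print(f"{total_laptops:,.0f} laptops' worth")
```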

THQ:

Many, many laptops of problem.

KS:

Right. But that’s just one part of the problem. After all, it’s the 2020s, you wouldn’t actually store all that data on an endless series of laptops. There are lots of more efficient storage solutions.

First, catch your data.

THQ:

Of course.

KS:

Which data needs which solution? Because they all have different cost implications. Which data needs expensive permanent storage, and which needs less permanent? Which is hot data? Which is cold?

THQ:

How are we supposed to know?

KS:

Exactly. Also, where is all the data? Most businesses aren’t sure what data they have, let alone which silos it’s in.

THQ:

So – that’ll be us getting an unstructured data strategy, then.

KS:

Exactly. The problem’s too big to ignore, and it’s also potentially losing businesses money, day after day, week after week, if they’re paying top dollar to store all their data as though it’s the same.

Want to know the best bit, though?

THQ:

Always.

KS:

If you get an unstructured data strategy, you can not only find all the data, you can not only save on the storage of the data, but once you’ve put some structure around your unstructured data, you can mine the data. And if you can do that, you can turn the data into something that generates income.

Money for old data.

THQ:

So, it’s essential in terms of not bleeding cash to wrongly store the data, and useful inasmuch as it repays you for the cost of doing it?

So where are we in terms of businesses, unstructured data, awareness and uptake of strategies and solutions?

KS:

Well, the point is, it’s too big to ignore now. It’s only going to get more and more impossible to ignore next year. And the technology and the expertise exist to solve this problem today, so every day you wait, you’re just feeding the problem.

THQ:

So essentially, the argument isn’t so much why companies should do this, it’s why wouldn’t they do it as soon as possible, to get the storage monkey off their back and start making it pay its way?

How the solution works.

KS:

Exactly. Companies these days are drowning in unstructured data. They don’t know what’s valuable, what’s not valuable, ransomware is popping up that can steal their data. They have to store this data and protect it, they have to be compliant with laws and regulations.

THQ:

So how does the problem get solved?

KS:

Companies need some technology to help them, because the scale of this problem is too big.

The first thing they want is some automation, so that they can understand what data they have, how fast it’s growing, who’s using it, where it’s sitting, how much it’s costing them, and what the security posture is on their data.
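
To make the idea of an automated inventory concrete, here is a minimal sketch of the kind of scan such a tool might start with. It is not a description of any vendor's product: the function names `audit` and `age_bucket`, the 30-day and one-year thresholds, and the use of last-access time as a stand-in for "who's using it" are all assumptions made for illustration. Last-access times can also be unreliable on file systems that do not update them.

```python
import os
import time
from collections import defaultdict

SECONDS_PER_DAY = 86400


def age_bucket(age_days):
    """Map days since last access to a coarse label (thresholds are hypothetical)."""
    if age_days < 30:
        return "under 30 days"
    if age_days < 365:
        return "30 days to 1 year"
    return "over 1 year"


def audit(root, now=None):
    """Total the bytes under `root` by file extension and by time since last access."""
    now = time.time() if now is None else now
    by_ext = defaultdict(int)
    by_age = defaultdict(int)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # skip files that vanished or cannot be read
            ext = os.path.splitext(name)[1].lower() or "(none)"
            by_ext[ext] += st.st_size
            age_days = (now - st.st_atime) / SECONDS_PER_DAY
            by_age[age_bucket(age_days)] += st.st_size
    return dict(by_ext), dict(by_age)
```

Calling `audit` on a directory returns two dictionaries of byte counts, one keyed by file extension and one keyed by age bucket. Running it at intervals and comparing the totals would give a rough growth rate.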

THQ:

So, a full data audit?

The interview as example.

KS:

Right. And the second thing they want is data mobilization. How can you use these analytics to move the data to the right place at the right time? The reason that’s important is because data actually has different value at different points of its life. Take this conversation. We’re recording it as we speak. Once we have the recording, maybe you’ll listen to the recording a few times between now and when you write the article, to make sure you’ve got the quotes right. It has high relevance-value up until you use it to write your article.

What then? The day after this goes live on the site, you have a new version of the data, in the form of the story on the website. How likely are you to go back and listen to the recording again?

Maybe you might keep it for a while, in case issues come up and an interviewee says “I didn’t say that” or “You interpreted this wrong.” If you keep the recording, you can double-check for accuracy, and either amend as necessary, or say “You said exactly that – I have it on the recording.”

How about a year from now? What’s the relevance-value of the recording then? Probably significantly less, right? In this case, you might even be able to delete the data safely, because you’ve established another version of it in the article. Most businesses actually keep most data around just in case they may need it later. But most of the data is cold, it’s never actively used.

But if you keep cold data on expensive storage, and you keep maybe three or four backups of it, and a ransomware protection copy somewhere, that’s a lot of added cost and infrastructure that you probably don’t need, because the data is cold. It’s not that you don’t protect cold data – if you didn’t, it would be easy pickings for data thieves – but if it’s cold, you can afford to take a more passive data management approach to it.

Assuming you know it’s there, and what it is, and its level of heat. If you know all that, and you know it’s cold data, you can put it on cheaper storage, or durable storage like the cloud, where the cloud itself would keep two or three copies of the data. But you don’t know all those details until you begin the process of data management.

So there’s active data management and passive data management, and the costs of the two are very different. And if you understand your data, and you can move the right data to the right place at the right time, you could save 70 to 80% of the infrastructure costs of unstructured data. And that’s the first thing our customers want to do.
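
The effect of moving cold data to cheaper storage can be illustrated with a small calculation. Everything in this sketch is hypothetical: the prices per TB, the one-year threshold for "cold", and the example datasets are invented for illustration. It leaves out backup and protection copies, so it is not an attempt to reproduce the 70 to 80% figure quoted above.

```python
from dataclasses import dataclass

# Hypothetical monthly prices per TB; real pricing varies by vendor and contract.
HOT_PRICE_PER_TB = 100.0
COLD_PRICE_PER_TB = 20.0
COLD_AFTER_DAYS = 365  # hypothetical threshold for treating data as cold


@dataclass
class Dataset:
    name: str
    size_tb: float
    days_since_access: int


def tier(ds):
    """Label a dataset hot or cold by how long ago it was last accessed."""
    return "cold" if ds.days_since_access >= COLD_AFTER_DAYS else "hot"


def monthly_cost(datasets, tiered):
    """Monthly storage cost, with or without moving cold data to the cheaper tier."""
    total = 0.0
    for ds in datasets:
        use_cold = tiered and tier(ds) == "cold"
        total += ds.size_tb * (COLD_PRICE_PER_TB if use_cold else HOT_PRICE_PER_TB)
    return total


# Invented example: 20TB in active use, 80TB untouched for over two years.
datasets = [
    Dataset("active-projects", 20, 5),
    Dataset("old-recordings", 80, 800),
]
flat = monthly_cost(datasets, tiered=False)
tiered = monthly_cost(datasets, tiered=True)
print(f"Everything on hot storage: {flat:,.0f} per month")
print(f"Cold data moved:           {tiered:,.0f} per month")
print(f"Saving:                    {1 - tiered / flat:.0%}")
```

With these invented numbers the saving comes to 64%. The point is the shape of the calculation rather than the figure: the larger the share of data that is cold, and the wider the price gap between tiers, the more there is to save.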

 

In Part 2 of this article, we’ll dive deeper into the process of unstructured data management, how it works, and how company-choking unstructured data can become a source of economic rocket fuel to the company that owns it.