SPLOG – the data management issues facing generative AI

Know your data - it'll help define your results. Of course, if you can't know your data...
30 May 2023

Collaborative, helpful – but the devil may be in the data.

• Five main data issues exist with generative AI in its current state.
• How reliable is the training data of major generative AIs?
• What governance exists for generative AI?

Generative AI is everywhere right now – underpinning whole new businesses, and the way old businesses are doing new things. But Krishna Subramanian, COO at data management specialist Komprise, has questions about the data safety of companies that have rushed to integrate the likes of ChatGPT into their systems.

In Part 1 of this article, Krishna took us through some of the general issues with using a system whose foundational data you can’t be entirely sure of – and which no current regulations give you the right to inspect.

While we had her in the chair, we asked Krishna what the specific data dangers might be of using generative AI as freely as we’ve started to.

SPLOG – a breakdown.

KS:
There are five areas where data management plays a role in the potential future of companies using generative AI right now.

THQ:

OK, what are those areas?

KS:

SPLOG.

THQ:

O…K.

KS:

SPLOG – security, privacy, lineage, ownership and governance. SPLOG.

THQ:
Ahhh, SPLOG. Naturally. OK, so how are those areas relevant?

KS:
Let’s talk about security. A lot of people don’t appreciate this, but when you ask a generative AI bot a question and you give context as part of that question, the data you supply may belong to the company that’s training the bot, depending on the legal agreement you’ve accepted.

The Samsung incident.

This has already happened to Samsung. Samsung told its engineers “Go ahead and try generative AI,” because they wanted to see what impact it could have. And one of their engineers gave ChatGPT some code and said “Find me the problems in this code.”

THQ:

Ah.

KS:

They didn’t realize that by doing that, their code was leaked into the public domain, and now Samsung has shut down all use of AI – and that’s a danger, right? Because you can’t go black or white on this. You have to really know the domain.

If I employ a consultant in my company, they sign an agreement and they know that they can learn using our company’s data while contributing to our company, but they cannot take that data somewhere else.

So that notion of a data domain – what belongs to which domain, and the security of that domain – doesn’t exist as a construct right now, but it needs to. That’s why security is the first area of concern.

THQ:

That’s quite the minefield.

The perils of PII.

KS:

Stick with me, there’s more.
Again, people generally don’t know this, but these learning models are pre-trained on opaque data. They’re not transparent. They don’t tell you where that data came from. So there may be personally identifiable information (PII) in the data they were trained on.

THQ:

Danger, Will Robinson!

KS:

And if you as a company now use that model with its undisclosed PII, you may be liable for that potentially illegal use. Because you used PII – and you may not even know it.

THQ:
Warning lights flashing everywhere right now.

KS:
Even worse, you may be using backup software that uses AI to detect anomalies, and you don’t realize it may be drawing on PII. And now who’s liable? Are you liable because you used it?

Data privacy and managing data privacy is extremely important – you’re going to need to know your risk on data privacy – but how can you know it when the data’s not transparent?

THQ:

All across the country, lawyers just punched the air.

Falsehood, Schmalsehood…

KS:

Then there’s lineage. There’s no easy way to say this, but I don’t think there’s any good way to track data lineage right now.

THQ:
OK. Helpful.

KS:
Especially with these pre-trained models, you need to know where your data came from. The Washington Post recently ran an article revealing that Google – to its credit – had actually published a list of the data Bard is trained on.

THQ:

This may be the only time in our lives we say this, but yay Google!

KS:

What the Post discovered was that 45% of Bard’s training data came from unverified sources. Some of it is from what would be considered… falsehood sites.

THQ:

Fake news?

KS:

Blogs that promote falsehoods. Now you’re in a strange situation: you wanted to know the lineage of the training data, and now you do – and almost half of it comes from unverified sources, including falsehood blogs. So your AI model may be using fake data and you’re not even aware of it. So… that’s a concern.

The Flat Earth AI.

THQ:
Ouch. That reminds us of a conversation that went around the THQ offices when China said it wouldn’t allow the technology in the country unless it was trained on solidly Socialist principles.

Given that companies aren’t required to a) tell anybody how much AI they’re using or b) disclose much about the model they’re training it on, does that mean you could get, for instance, a generative AI model trained on the Flat Earth hypothesis, which would mold its results around the truth of that hypothesis, or a Young Earth creationist model, which would, for instance, never return any inconveniently geological results?

KS:
Yeah, that’s entirely possible.

THQ:

Well… so much for sleeping at night.

KS:

It’s analogous to the Amazon case. Amazon tried using AI in a very innocuous way, to screen resumes for management positions at the company.

THQ:

Ah yes – and it started learning to weed out and discard women and people of color, right? Because they didn’t match its training data of what Amazon executives had looked like up to that point.

KS:

Exactly. That’s why the lineage of the data is important. Your results will depend on the data on which you train your AI.

THQ:

As we recall, Amazon put a team together to correct the problem… but the fix failed, and the company eventually had to disband the team and go back to human resume-sifting.

KS:

Right.

Where were we?

The ownership conundrum.

THQ:

SPL-

KS:
Right. So – ownership.

You remember you thought lawyers were punching the air before?

THQ:

Oh boy…

KS:

Here’s the riddle. If you create something with the assistance of AI, which provides you with derivative data, who owns the derivative intellectual property in the created thing? Is that your IP? If you used a third-party API service, ownership of the data and the IP in derivative works is extremely important – both for accountability and for protecting your proprietary IP.

THQ:

That’s the question about AI-generated art that samples pieces of other people’s work – artists who get neither credit nor remuneration for any part of the new piece – right? Only turned into a potentially new kind of IP law?

KS:

Exactly. Who owns what, who’s accountable for what, and who, above all, is liable for what?

The governance vacuum.

And lastly, there’s governance. Can I tell you something?

THQ:

Anything. You’re not a generative AI, we implicitly trust the data you provide.

KS:

I don’t even know how you would comply with HIPAA if you used an AI solution right now. So how would you comply with a lot of governance regulations?

You need some framework to do that. And as we said in Part 1, firstly, the framework isn’t currently there, and secondly, regulation usually comes in the wake of bad things happening – or the likelihood of bad things happening being so obvious that it’s equally obvious that Something Must Be Done.

There’s power in being able to see the data beneath the Matrix.

THQ:

That’s… that’s quite the list of data dilemmas. Don’t go away. We’re going to grab a fresh pot of coffee and come back for Part 3 of this article, because there are questions that still need asking…