Generative AI and its deep data issues

Just how reliable is the generative AI we have? And will we only discover the consequences of its data issues when we hit them like a pinball?
31 May 2023

Sam Altman of OpenAI in Paris recently, trying to negotiate regulations on generative AI. Source: JOEL SAGET / AFP

• No regulation currently in place to make generative AI companies show their data.
• Who sues who if your generative AI was trained on PII data?
• Up to 45% of Bard data could be from unreliable sources.

The huge upsurge in generative AI technology since OpenAI launched ChatGPT in November 2022 has raised a whole bouquet of issues that we’re only just starting to get to grips with.

After two big calls from the great and the good in the tech industry to slow the development and adoption of the technology down, appearances at the Senate and the White House, edicts from both the Chinese and (briefly) the Italian government, and fervent lawmaking efforts from the European Union, we’re starting to get a handle on the complexities surrounding a technology that exploded onto the market, and was instantly adopted by companies everywhere.

Krishna Subramanian, COO at data management specialist, Komprise, has questions about the data safety of companies that have rushed to integrate the likes of ChatGPT into their systems.

In Part 1 of this article, she laid the groundwork of her concerns, showing us the general reasons why there’s a need to address the data use of generative AI systems. In Part 2, she explained the five specific data dangers which need addressing before much more time has passed, as generative AI becomes a fundamental underpinning of many business models.

While we had her in the chair, we wanted to probe more deeply into some of the issues she had raised.

We started with data privacy.


This potentially opens up a whole new world of data privacy issues, doesn’t it? If you’re using something like ChatGPT or Bard in an assumption of good faith, and for instance, the training data of the generative AI of your choice unwittingly includes PII, and that comes to light – who sues who? Are you liable for having used private data without having realized you were doing it? And if so, wouldn’t everybody who uses the same generative AI be potentially liable?

Yeah. Who’s responsible, right? At some point, there has to be a corporation or an entity or a person in a commercial setting that is responsible, that takes ownership and accountability.

So even if I used a machine to do something, I have to certify what that machine did if I’m selling that as a company. That kind of accountability needs to be there. And how do you even track it in these situations?


Exactly. We’re assuming the answer to that one hasn’t been encountered yet, because in a sense, we’re all just hoping the question never comes up. But if it does, it’s going to rock the world.


Yes. But like I say, because it’s not beyond our thinking about it, there has to be an answer. It can’t be a question we just ignore. That kind of responsibility ultimately has to live with someone.


Even though, potentially, there may never have been a conscious decision to scoop up PII data made by a single human being or a corporation. This is the difficulty of the world we’re moving into, isn’t it? Presumably there may be a need for more stringent filtering in future iterations of generative AI. But until then…


Yeah. Until then…

And likewise with data lineage. One of the big issues with generative AI as it stands is that it has no objective truth model, so it can be both staggeringly wrong and enormously persuasive.

At one and the same time.

Yeah, it can “hallucinate,” and the wrongness has every opportunity to perpetuate itself. It’s not deterministic and it is not necessarily objective, either.

And up to 45% of the data it’s trained on could be potentially non-factual, or actively false, given the Bard data investigation. And yet there’s as yet no way of compelling generative AI foundation model creators to reveal the data they’re using to train their systems. So the whole thing could be, let’s say, up to 45% actively false, and yet it’s being used as an answer-engine.

Exactly – there’s no regulation or requirement right now for the makers of pre-trained models to share their data sources. Google does share, which is how we know that up to 45% of the Bard training data comes from unverified sources, but I don’t think ChatGPT does. They keep it private. So you actually don’t know what it was trained on.

And some people are saying, “Well, we don’t need regulation. Companies can create standards of their own…” We’ve seen this happen in so many industries. As somebody in a corporate entity, I understand it’s a lot easier to compete if you don’t have regulation breathing down your neck. But you have to look at the broader human impact.


That was more or less what the EU told Sam Altman recently, wasn’t it, when he said that the developing EU AI Act was an example of “over-regulation.” Something like “self-regulating codes of conduct are not the EU way.”

And that’s the point, surely. It’s a technology that can be persuasively wrong, that could have been trained on up to 45% actively false data. And so many companies have taken it on and nailed it under the hood of their business model.

Technically, we don’t know how many systems are returning results that are potentially factually wrong, or skewed. Like that example we played with – if you do what China wants to do, and train your model only on the basis of Socialist truths, you’re only going to get Socialistic answers because only those answers will be possible.

By which reasoning, you could build models of any restricted truth you need. Young Earth generative AI that doesn’t include geology or evolution in its training data. Conspiracy theory truth models that only return answers that, for instance, deny we ever traveled to the Moon.

Yeah. And once something’s happened, we’re gonna know all the pitfalls. And unfortunately, we may have to learn that way, but currently, there’s no requirement for any vendor to disclose how much AI they’re using in their product.

So if you’re using somebody else’s solution in a commercial setting, there’s actually a danger for the customers, because they don’t even know whether they’re using a product that may be using AI in some way that could implicate them.

Just thinking about that becomes mind-boggling in terms of the potential complexities and how we navigate that world without falling into every pothole there is.


When the latest release of ChatGPT, GPT-4, was announced, they said “These are pre-trained models.” They trained it on some data, then they verified how well it was performing and reinforced its learning. It’s like telling a dog “Good dog,” or “Bad dog.” You know if it does something right or wrong. So you reinforce good behavior, right?

But there’s no guarantee that as you reinforce it, it’s going to follow what you said.

They were very proud of the fact that GPT-4, when you reinforce it, listens to you 20% more than it doesn’t listen to you. But that means 80% of the time it’s not taking notice of your reinforcement.

You can pat your good dog as much as you like – it’ll still bite your leg off if it wants to.

So – having identified the potential data-based weaknesses and vulnerabilities of generative AI, what should companies do to minimize those risks while still embracing generative AI as a technology?

Well, first of all, if you’re a company that’s trying to consume generative AI, I think it’s extremely important to put some governing principles of your own in place.

So for example, the Samsung situation, where a developer gave away proprietary code to ChatGPT, could, in hindsight, have been avoided with better governance saying, “Hey, look, these general purpose AI applications – you actually cannot give any proprietary data to them, because they could take it and there’s legally nothing we could do about it.”

Putting rules in place so that, for instance, if you give your proprietary data through an API, they won’t use it in the general domain, would be good.

You actually want to build on the knowledge you have, and use generative AI to do that – but you don’t want the tool to take your data away from your environment.

So, think about how you ensure it doesn’t do that, when legally, it has a pretty perfect right to do so if you volunteer the data.

How do you enforce that with these models? There’s no mechanism right now. So knowing that limitation is important.
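Since there’s no built-in mechanism, some companies approximate that governance at their own boundary. Here is a minimal sketch of an outbound-prompt filter that blocks requests containing proprietary markers before they ever reach an external generative AI API. The patterns shown are purely illustrative assumptions – a real deployment would use its own organization’s classification markers and secrets detection.

```python
import re

# Hypothetical list of markers a company treats as proprietary.
# These patterns are illustrative assumptions, not an official standard.
BLOCKED_PATTERNS = [
    re.compile(r"\bCONFIDENTIAL\b", re.IGNORECASE),        # classification label
    re.compile(r"\binternal\.example\.com\b"),             # internal hostname
    re.compile(r"-----BEGIN (RSA |EC )?PRIVATE KEY-----"), # leaked key material
]

def outbound_prompt_allowed(prompt: str) -> bool:
    """Return True only if the prompt matches none of the blocked patterns."""
    return not any(p.search(prompt) for p in BLOCKED_PATTERNS)

# Usage: gate every call to an external generative AI service.
prompt = "Summarize this public press release for our newsletter."
if outbound_prompt_allowed(prompt):
    pass  # safe to forward to the external API
```

A filter like this can’t prove data never leaves, but it turns the “don’t paste proprietary data into ChatGPT” policy into something enforceable at a single choke point rather than a memo.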

It’s also always worth keeping in mind that AI isn’t just generative AI.

There is logic-based AI as well that’s not fuzzy. So for any task where you want some outcome relative to an objective that’s predictable, don’t use a learning model for that task. Use a different kind of AI. There are AI models out there that are completely logic-based, and their outputs are entirely deterministic.

There are ways those models can be used alongside a natural language solution like ChatGPT. So you could ask ChatGPT a question, and ChatGPT would be limited to relaying the logic-based model’s answers, returning those rather than trying to pattern-match an answer of its own.
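The hybrid pattern described here can be sketched very simply: the deterministic component owns the facts, and the language model is only allowed to rephrase them. Everything below – the rule table, the intent names, the refusal message – is a hypothetical stand-in for a real rules engine, not any vendor’s API.

```python
# Verified facts the business controls; the assumed "logic-based" side.
# Keys and values here are hypothetical examples.
RULES = {
    "vat_rate_uk": 20.0,
    "return_window_days": 30,
}

def deterministic_answer(intent: str):
    """Look up the answer in the verified rule table; never guess."""
    # Returning None forces a refusal instead of a pattern-matched guess.
    return RULES.get(intent)

def answer(intent: str) -> str:
    value = deterministic_answer(intent)
    if value is None:
        return "I don't have a verified answer for that."
    # In a real system, the LLM would only rephrase this verified value,
    # never invent one of its own.
    return f"Verified answer: {value}"
```

The design choice is the refusal path: where a pure generative model would produce a fluent, possibly hallucinated reply, the deterministic layer returns nothing, and the system says so.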

Which depends on companies doing extra work, as opposed to the easy application they currently have.

I do believe that some kind of governance is required before you ask all your employees to go and use AI in their job setting.

That’s more or less what we were asking in the long-ago simpler days at the start of Part 1 – whether generative AI had been rushed out into the market. Because as soon as you announced “We have this amazing thing!” this level of uptake was more or less guaranteed, because it’s –

So tempting to use it, right? Because let’s remember, it’s actually remarkably accurate in what it produces in seconds or minutes. It’s very tempting to take that and use it in your product, because that’s a competitive advantage that you get right away.

We’ve spoken to a long-time professional coder, and he said “If I use this, this is fine, this works well. It gets me close to where I need to be, and I know where it’s not as good as it should be, or where it’s gone wrong, and I can correct it.”

The danger that he saw was that it also allows people who have no knowledge at all to get close to where they should be – but they can’t know where it’s gone wrong, or correct it.

Yeah, I think that is the problem. I think that unfortunately, a lot of oversight is required right now to use these solutions, and I come back to data.

It’s not just about using it with oversight. As a commercial entity, you have to really watch the data privacy and security and the implications of the data – because if you don’t, you could very quickly expose yourself to a lot of data risk.

Spoiler alert – the accuracy of predictions – and query results – depends on the data on which the model is trained.