Web scraping and digital peace of mind
Web scraping has a bad reputation. It’s a practice with a surgical-sounding name, and a history studded with headlines about data being used without expressed or understood consent, data used for political chicanery and corporate gain, and digital humanity commoditized with little or no regard for the flesh and blood human beings behind the data.
But web scraping is also the bread and butter of a thousand platforms, and if done within ethical boundaries, should not necessarily make the skins of human beings crawl, or cause shudders to slide down their spines.
The point is that the headlines that have been first and loudest to land on the public’s consciousness have been more or less universally negative, so there’s a battle that needs to be won for the hearts and minds of web users over the practice of ethical web scraping. That’s why a group called the Internet Infrastructure Coalition has developed what it’s calling the Ethical Web Data Collection Initiative (EWDCI), to strengthen public trust in ethical web scraping as a real thing, and establish guidelines for what actually constitutes the lines of ethics in a business model that has such a reputation to overcome.
We sat down with Christian Dawson, co-founder of the EWDCI, to chew the ethical fat and find out more.
OK, let’s get the existential crisis out of the way at the start. Why is the EWDCI necessary? What’s the thinking behind its creation?
The Internet Infrastructure Coalition has been around for about 10 years, and we spend a lot of time talking with legislators and regulators about how the internet works, in the hope that they make smart laws, not laws that are detrimental to people’s freedom online or companies’ operations on the internet.
It’s often really hard to educate them about how the internet works. But one of the things that we’ve noticed over the years is that politicians don’t tend to come up with negative laws in a vacuum. People come to them with problems – there are real problems on the internet, right? And so as an advocacy organization that’s trying to fight for smart internet laws, we can’t just rest on our laurels and focus attention on getting legislators to not make dumb laws. They have to solve problems if the problems exist.
But we as an industry can take more responsibility for trying to build a better internet ourselves. And we’ve done that – we’ve gotten together with different parts of the internet ecosystem to figure out how we can ensure that the internet is better and safer and more responsible in different segments. We’ve worked with the domain name companies on initiatives working towards a safer, better DNS. We work with the VPN companies on getting their commitments in certain areas. And data scraping is certainly an area that’s come more into focus in our lives, and become more a part of how the internet works.
It’s certainly an aspect that legislators are taking a look at. So it’s something that we should be taking a look at. Our goal was to figure out exactly what ethical web data gathering should look like.
Ethical web data gathering – certainly sounds better than “web scraping.” Especially since web scraping has come to public attention mostly with a negative connotation, because the positive aspects of it aren’t usually advocated for, aren’t usually seen, and aren’t usually understood. So what does ethical web scraping look like as far as you’re concerned?
The point of Principles.
Well, we have put out what we’re calling Principles 1.0 as a starting point. But the real answer is that it’s still an open dialogue that needs to be had between consumers and legislators and the types of companies that want to step up to try to be above a certain bar when it comes to ethical practices. We want to make sure that those companies are only engaging in activities that are lawful and that are producing things that are beneficial to the world and the internet.
As far as the reputation goes, I think that at its best, with the more responsible companies, web scraping practices are no longer aggressive or secretive. I think that there are entire business models that rely on the practice, and that provide a lot of value to consumers and businesses.
For instance, a lot of us use travel fare aggregators. They’re price comparison tools that we’re using for all sorts of things. And even search engines, one of the most used things in the world, can only be created through a form of what is unfortunately still called “web scraping.” I agree – that name is problematic. They may not download entire datasets for webpages, but they have to find URLs and index them, which is largely the same practice, right?
The way of the online world.
So the fact is that the internet as we know it involves this practice. And in its best form, it’s something that we like and want. But not everybody uses it that way. And so if any organization, be it ours or any other, is going to be in a position to advocate for web data aggregation, for web data collection, you can’t advocate for all of it. You have to set a bar, you have to figure out what the responsibility level is at which you can say these practices need to be well regulated and clear and lawful, but they don’t need to be dismissed as bad out of hand. The technology is not, by definition, good or evil. It can be used for good or evil.
So on the legal and ethical side of web scraping, that’s exactly why EWDCI was established, so that we can try to set that bar and determine how to encourage people within that industry – web scraping providers and users – to adhere to data protection legislation like GDPR, CCPA, and others.
It’s not just about adherence to the law, either. There are ways that people do web scraping that negatively impact user experiences on websites, and that don’t necessarily take into account the intentions of the websites that they’re collecting data from. There is a desire to make sure that when you’re setting principles, they’re not all about staying within the law, but that they’re also about ensuring the process of collecting data is ethical.
You want people to be able to live with themselves and sleep at night, as well as not to fear a knock on the door from the cops?
Consent, first, last and always.
Yeah, ideally! You’ve got organizations that you’re going to be collecting data from in a consenting, knowledgeable and potentially better, mutually beneficial way.
Right. And EWDCI makes a point of talking about the “digital peace of mind” that everyone deserves. Is that what it looks like? Ethical data collection in a consenting, mutually beneficial way?
I don’t think it’s a cop-out to say that that’s what we’re talking about in our first draft – in Principles 1.0. I think that as we get down to it, the devil will be in the details, right? So yes, that is what digital peace of mind looks like. And that’s what’s reflected in the Principles that we’ve put out.
We are also running a public comment period on those Principles, to go out there to people and get their perspectives and maybe even their case studies that challenge some of the assumptions that we’ve got there. That will make us go further or get more granular in our detail about what digital peace of mind looks like, because I think the goal is to have a conversation about these principles, and to head into a cycle of continuous improvement with them.
Because it’s moving; we will adapt as we learn more about the needs of the entire ecosystem. The people within the group only have the lens of being within that system, be they scrapers themselves or proxies or some other piece of the ecosystem. And they’re producing these initial Principles by looking through that lens. When they go out there to collect information from other parts of the ecosystem, they’re probably going to revise it and have a slightly different view into what digital peace of mind looks like.
In Part 2 of this article, we’ll dive deeper into EWDCI’s Principles and explore the longer-term aims of the Initiative.
28 November 2023