Web scraping and the evolution of principles

Web scraping is looking to establish a bar of behavioral standards for practitioners.
24 April 2023

In Part 1 of this article, we sat down with Christian Dawson, co-founder of the Ethical Web Data Collection Initiative (EWDCI), to get an idea of what ethical web scraping might look like – and why it was increasingly necessary.

We discovered a tension between the necessity of web scraping, on which a good deal of the modern internet is based (think comparison websites, booking websites, and much, much more), and the ways in which some web scrapers have customarily behaved. That behavior has given web scraping a bad reputation in the public eye and created consumer resistance to web scraping for any purpose when, done responsibly, it needn’t be the digital Devil incarnate.

EWDCI has created a set of initial principles, Principles 1.0, and put them out into the world for public discussion and feedback.

Crucially, the initial Principles have been created by people who live and breathe the internet ecosystem, and who work entirely within it. We asked Christian whether there was a chance that eventually, the Principles would be in the care and guardianship of people outside the ecosystem – whether, in fact, the public will own the Principles on which ethical web scraping will be allowed.

CD:

Good question. At the moment, the EWDCI is a young group, and a relatively small group. At this point, we’re very happy that the groups that have stepped up to participate in the original work have done so. I’m not certain that there have been decisions made yet as to who the future members of the group will be. But as long as it’s trying to properly set that bar and identify what the industry deems ethical behavior in this space, I think the membership could grow to encompass whoever is going to aid it in delivering that mission.

The importance of consent.

THQ:

You mentioned the idea of web scraping with consent. That seems to be one of the reasons why web scraping has something of its negative reputation – there didn’t previously seem to even be the idea that consent was necessary in some of the companies that did it. How fundamental is the idea of consent in Principles 1.0?

CD:

The fundamental areas regarding our set of Principles are: guiding focus; legality; ethics; social responsibility; and ecosystem management. There’s always a problem with definitions when you get into trying to decide what things are, so I believe that the Principles have been very careful to spell out and define the areas that they’ve got there.

While we are very proud of the work that we’ve done so far, it is purposefully created to facilitate an ongoing dialogue. That starts now, with the release of these initial principles. So rather than focusing on today’s definitions, I’d encourage people to go ahead and read the ones that have been chosen there, and possibly give feedback in the public comment period. We’re using Principles 1.0 as an opportunity to start the conversation.

THQ:

What’s the mantra? Always progress, not perfection, right?

Making better commitments.

CD:

Exactly. What we have right now shows what we’re actively working on. And what we’re actively working on is trying to use this as an opportunity to say “What did we get wrong? What do you want to challenge us on? How can we, as a group, improve the work that we’ve done here, and make better public commitments?” We can use this as a tool to build more consumer trust in the work that we’re doing, and potentially use that as a lever to grow the group, to get more people to make those same commitments in the industry.

THQ:

So those are the milestones so far? Publish Principles 1.0, have the consultation period… Do you have a fully realized agenda or timeline for these steps?

CD:

I think so. I think that the Principles, in their various versions, are the important milestones. It’s not just about what our own ecosystem thinks. The next round of Principles, from my perspective, the one that comes out of the public comment period, that’s the important one.

At that point, what we’re looking to do is to have some accountability mechanisms to try and hold people within the group accountable for adhering to those Principles.

There’ll be some accreditation criteria for people to attest that they are meeting the criteria demanded of them through the Web Data Principles. And I believe that, once we have this more community-centered version of the Principles in place, and we’ve got a way that you can publicly attest that your organization is abiding by them, we can cast a wider net than we have right now with our small group.

We can go out there and say we’ve done a lot of work to try and identify what we believe is a collection of ethical principles for web data collection, and we want anybody who’s in this space to join us “above the bar.” That’s the ultimate vision for the EWDCI.

A mark of quality.

THQ:

So the aim is a kind of self-regulation by having standards – so to join you “above the bar,” companies would have to be able to attest that they follow the Principles?

CD:

I think so, though it’s technically more about demonstrating good industry behavior than self-regulation.

THQ:

Indeed – we just meant if companies want to be part of the EWDCI, they’ll sign up to the agreed (and probably still evolving) Principles, and have their membership as a mark of the quality of their ethical web scraping.

CD:

Absolutely right. The goal is to try and provide a mark of quality. I do believe that as this industry grows, it will continue to get a good deal of legislative scrutiny. And I think that those who advocate on behalf of this industry, no matter who that is, even if it’s the individual companies in the space, will have an easier time of doing that if they can assure legislators that they’re advocating for the good the industry does. That they’re talking about actions that are above the bar, as set out by this program.