I wanted to see how many news websites have blocked AI bots so far using Disallow rules in their robots.txt.
To get some useful website data, I searched GitHub and found this repository: news-hub by Pr0gramWizard. It includes an SQL file with a lot of websites and their respective categories.
I used the data in that SQL file and filtered for "news" websites.
After requesting the robots.txt from every domain and filtering out the bad results (SSL certificate errors, unavailable sites, or sites simply blocking my request), I was left with a total of 1889 results.
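A check like this could be sketched in Python roughly as follows. This is only an illustrative sketch under stated assumptions, not the script behind the numbers in this post: the function names `fetch_robots_txt` and `is_bot_blocked` are made up for the example, and counting a bot as "blocked" when it may not fetch `/` is an assumption. Note that with this definition, a site that disallows everything for `User-agent: *` also counts as blocking the bot.

```python
import urllib.request
from urllib.robotparser import RobotFileParser


def fetch_robots_txt(domain, timeout=10):
    """Return the robots.txt body for a domain, or None if the request fails.

    Hypothetical helper: failures such as SSL certificate errors, timeouts,
    or HTTP error statuses all surface as OSError subclasses and are treated
    as a "bad result".
    """
    url = f"https://{domain}/robots.txt"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except OSError:
        return None


def is_bot_blocked(robots_txt, user_agent):
    """Return True if user_agent is not allowed to fetch the site root.

    Assumption: "blocked" means the rules disallow "/" for this user agent,
    either in its own group or via the "*" fallback group.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, "/")


# Usage with an inline robots.txt, so no network access is needed.
example = "User-agent: GPTBot\nDisallow: /\n\nUser-agent: *\nAllow: /\n"
print(is_bot_blocked(example, "GPTBot"))           # True
print(is_bot_blocked(example, "Google-Extended"))  # False
```

Domains for which `fetch_robots_txt` returns `None` would simply be dropped from the results.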
The results basically look like this:
Site,Country,GPTBot Banned,Google-Extended Banned,lang
xinhuanet.com,ZH,False,False,zh
timeslive.co.za,ZA,False,False,en
wz.de,DE,False,False,de
welt.de,DE,True,True,de
taz.de,DE,True,False,de
stuttgarter-zeitung.de,DE,False,False,de
stern.de,DE,True,True,de
spiegel.de,DE,True,True,de
rp-online.de,DE,True,True,de
mdr.de,DE,False,False,de
kn-online.de,DE,True,False,de
focus.de,DE,False,False,de
...
noticierodigital.com,VE,False,False,es
elpitazo.net,VE,False,False,es
washingtontimes.com,US,True,False,en
usatoday.com,US,False,False,en
thegrio.com,US,False,False,en
News Websites blocking AI bots by country
Since, for this first plot, we are only evaluating entries that have a country identifier attached (taken from the original data source), the number of results is really low.
Among the sites with a country identifier, it is mostly German and American publishers that are blocking AI bots. (The purple bar shows the number of sites blocking both GPTBot and Google-Extended.)
News Websites blocking AI bots by language
Looking at the publishers by language gives us a lot more data. Many websites in the dataset have language data but no country set, so here we are looking at all data entries.
There are quite a few languages that have fewer than 20 websites associated with them. So for this plot, a language is only included if it has at least 20 websites and at least one of those websites blocks at least one bot.
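The threshold described above could be applied to the results CSV along these lines. Again, this is a hedged sketch rather than the code used for the plot: the function name `summarize_by_language` is hypothetical, the column names are taken from the CSV header shown earlier, and the sample rows are a handful of entries from that excerpt with a lowered threshold so the effect is visible.

```python
import csv
import io
from collections import defaultdict


def summarize_by_language(csv_text, min_sites=20):
    """Map each qualifying language to (total sites, sites blocking a bot).

    A language qualifies if it has at least min_sites websites and at least
    one of them blocks GPTBot or Google-Extended.
    """
    totals = defaultdict(int)
    blocking = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        lang = row["lang"]
        totals[lang] += 1
        # The CSV stores booleans as the strings "True" / "False".
        if row["GPTBot Banned"] == "True" or row["Google-Extended Banned"] == "True":
            blocking[lang] += 1
    return {
        lang: (totals[lang], blocking[lang])
        for lang in totals
        if totals[lang] >= min_sites and blocking[lang] >= 1
    }


sample = """Site,Country,GPTBot Banned,Google-Extended Banned,lang
welt.de,DE,True,True,de
taz.de,DE,True,False,de
mdr.de,DE,False,False,de
usatoday.com,US,False,False,en
thegrio.com,US,False,False,en
xinhuanet.com,ZH,False,False,zh
"""

# With the threshold lowered to 2 for this tiny sample, only "de" remains:
# "en" has no blocking site and "zh" has too few sites.
print(summarize_by_language(sample, min_sites=2))  # {'de': (3, 2)}
```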
I will do another one of these checks in the near future, to be able to compare the results. Until then, do with this information what you like 🙂
If you want the raw results data: go right ahead and download the CSV.