[webconnectivity]: Support user-agent strings in request headers to bypass scraper security #1741

Description

@DecFox

We observed that several e-commerce sites in Portugal had failing measurements:
https://explorer.ooni.org/domain/www.asics.com
https://explorer.ooni.org/domain/www.elcorteingles.pt

even though the websites are accessible through a browser. On further inspection, we found that this is also the case with other HTTP clients such as curl: a minimal set of user-agent-related request headers is required to access the website. For example:

curl 'https://www.elcorteingles.pt/' \
  -H 'sec-ch-ua: "Not;A=Brand";v="99", "Brave";v="139", "Chromium";v="139"' \
  -H 'sec-ch-ua-platform: "Android"' \
  -H 'user-agent: Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/139.0.0.0 Mobile Safari/537.36'

is successful, while:

curl -v 'https://www.elcorteingles.pt/' \
  -H 'sec-ch-ua: "Not;A=Brand";v="99", "Brave";v="139", "Chromium";v="139"' \
  -H 'user-agent: Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/139.0.0.0 Mobile Safari/537.36'

errors out with:

...
* Request completely sent off
* HTTP/2 stream 1 was not closed cleanly: INTERNAL_ERROR (err 2)
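
For reference, the same behavior can be reproduced in Go (a minimal sketch using net/http; the URL and header values are copied from the curl commands above, and dropping the sec-ch-ua-platform line reproduces the failure):

package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	req, err := http.NewRequest("GET", "https://www.elcorteingles.pt/", nil)
	if err != nil {
		panic(err)
	}
	// The same header set that makes the curl request above succeed;
	// removing sec-ch-ua-platform reproduces the HTTP/2 stream error.
	req.Header.Set("sec-ch-ua", `"Not;A=Brand";v="99", "Brave";v="139", "Chromium";v="139"`)
	req.Header.Set("sec-ch-ua-platform", `"Android"`)
	req.Header.Set("User-Agent", "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/139.0.0.0 Mobile Safari/537.36")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, len(body), "bytes")
}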

To this end, we should support setting user-agent strings in request headers so that we can measure such domains with webconnectivity more confidently.
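
One possible shape for the change (a hypothetical sketch, not the existing probe-cli API; the wrapper type and default values below are illustrative) is a RoundTripper that fills in browser-like headers on requests that do not already set them:

package main

import (
	"fmt"
	"net/http"
)

// browserHeadersTransport is a hypothetical wrapper that attaches a
// browser-like header set to outgoing measurement requests; the type
// name is illustrative, not part of the current probe-cli codebase.
type browserHeadersTransport struct {
	underlying http.RoundTripper
}

// defaultHeaders mirrors the header set from the successful curl command.
var defaultHeaders = map[string]string{
	"User-Agent":         "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/139.0.0.0 Mobile Safari/537.36",
	"sec-ch-ua":          `"Not;A=Brand";v="99", "Brave";v="139", "Chromium";v="139"`,
	"sec-ch-ua-platform": `"Android"`,
}

func (t *browserHeadersTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	clone := req.Clone(req.Context())
	for name, value := range defaultHeaders {
		// Only fill in headers the caller did not set explicitly.
		if clone.Header.Get(name) == "" {
			clone.Header.Set(name, value)
		}
	}
	return t.underlying.RoundTrip(clone)
}

func main() {
	client := &http.Client{Transport: &browserHeadersTransport{underlying: http.DefaultTransport}}
	resp, err := client.Get("https://www.elcorteingles.pt/")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status)
}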

Labels

enhancement (New feature request or improvement to existing functionality)
