Shucker - URL stripping with licensing fun

31 Dec 2024

Python  Rust  Web 

I started Shucker with a simple dream - take an incoming URL and rip out all the tracking markers, so I could compare it to other incoming URLs and be fairly sure that two of them only differed if they had actual differences, vs. having been found via different sites that wanted to track users. I did find one existing project doing this, but it was in Scala and mostly just embedded the JS code, whereas I wanted to actually generate Rust code to do this instead. Now, I've managed to do that, but it turns out the licensing on all of that is a lot more interesting than one might expect.

We'll come back to that in a minute, but as a TL;DR - Shucker is available for use as both a Rust library and a very simple Python wrapper library right now. It's GPL licensed, which will probably cause various corporate overlords to have shrieks of terror if you try to use it in a business context, so be warned there. There's a good reason for this though, not just my own preferences.

I've been an avid user of various adblockers for some years (uBlock Origin being my current favourite, along with a side of NoScript), but recently I ran into a problem with a side project where I wanted to be able to receive arbitrary URLs server-side and find out whether we'd already seen them. The problem is that most interesting URLs have come via some sort of social media or aggregator site, and they tend to be infested with various tracking params, which makes it hard to compare the same URL from different sources. We can't just strip all the query params off a URL, as some sites use them to differentiate the actual articles from one another, so we need to do some smart stripping here. Some of this is easy (utm_source, etc.) but there's a long tail to this problem, and figuring it all out by hand myself was fairly infeasible.
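As a rough illustration of the easy end of this (not Shucker's actual code), here's a minimal Python sketch using only the standard library. The `TRACKING_PARAMS` set and the `strip_tracking` name are made up for the example, and the hand-picked list is exactly the thing that doesn't scale to the long tail.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical hand-picked list for illustration only; the real long tail
# of tracking params is far bigger than this.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid", "gclid"}


def strip_tracking(url: str) -> str:
    """Drop known tracking params, keep every other query param in order."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_PARAMS
    ]
    # urlencode re-encodes the surviving params; urlunsplit omits the "?"
    # entirely when nothing is left.
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


# The same article arriving via two different sources...
via_feed = strip_tracking("https://example.com/article?id=42&utm_source=feed")
via_social = strip_tracking("https://example.com/article?id=42&fbclid=abc123")
# ...and a genuinely different article.
other = strip_tracking("https://example.com/article?id=43&utm_source=feed")
```

With this, `via_feed` and `via_social` come out identical while `other` stays distinct, because the `id` param that actually identifies the article is left alone.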

Some perusing of the licensing on these adblockers, however, turned up something a bit more restrictive than I'd expected. The two big names, AdGuard and uBlock, are both GPL licensed (I did also find another option with a much weirder license, which is a hard no from me). This is problematic for library usage in commercial code, but makes some sense given the nature of the projects. Shucker is therefore currently also similarly licensed, because it does build-time transformations of the GPLed data to generate its work.

Something I haven't implemented yet, but am considering, is taking the build-time bits of Shucker and making them available at runtime instead. This would enable a setup where the actual Shucker source code contains no GPLed or GPL-derived code (as I can re-license my own code however I like), but at install time or runtime it needs to do an initial setup to download the ad filters and do some parsing. The filters themselves would be GPLed, but I don't think (I am very much not a lawyer, FYI) that this actually makes Shucker GPLed, as it's just using the data at runtime and the data isn't part of the distributed library.
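To give a feel for what that runtime parsing step might look like, here's a hedged Python sketch. It assumes the downloaded filter text uses the `$removeparam=` rule syntax that AdGuard and uBlock Origin lists use for query param stripping, and it only handles the simplest case: generic rules with a literal param name. The function name and the sample text are invented for the example, the download itself is left out, and this isn't how Shucker does things today.

```python
def parse_removeparam_rules(filter_text: str) -> set:
    """Collect literal param names from generic `$removeparam=` rules.

    Deliberately simplified: rules with a URL pattern, extra options,
    regex values (/.../), negated values (~...) or exceptions (@@) are
    all skipped rather than handled.
    """
    names = set()
    for raw_line in filter_text.splitlines():
        line = raw_line.strip()
        # Skip blanks, comments ("!"), list headers ("[") and exceptions ("@@").
        if not line or line.startswith(("!", "[", "@@")):
            continue
        # Generic rules have no URL pattern, so they start with "$".
        if not line.startswith("$"):
            continue
        options = line[1:].split(",")
        if len(options) != 1:
            continue  # extra options such as domain=... are out of scope here
        option = options[0]
        if not option.startswith("removeparam="):
            continue
        value = option[len("removeparam="):]
        if value and not value.startswith(("/", "~")):
            names.add(value)
    return names


# Invented sample standing in for filter text fetched at install or runtime.
SAMPLE_FILTERS = """\
! Title: sample list
$removeparam=utm_source
$removeparam=fbclid
$removeparam=/^utm_/
||example.com^$removeparam=ref
@@||example.org^$removeparam
$removeparam=gclid,domain=example.net
"""

params = parse_removeparam_rules(SAMPLE_FILTERS)
```

Only the first two rules in the sample survive this filter; a real implementation would need to deal with the site-specific, regex and exception rules as well, which is where most of the long tail lives.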

