~/spook/todo.html

Todo list

I've been putting off changes to redrain for entirely too long. The fact is, it needs some work. When I first wrote it back in 2009, it replaced a wonky Perl script that I'd been using for the previous two years. What I'd learned in the meantime allowed me to both improve on the original and learn more day-to-day Python usage. Of course, there are still problems that are bugging me.

The most prominent issue, of course, is that the poor thing gets slow, and that's just not okay. Clearly I wasn't thinking far enough ahead when I wrote it, and I made a foolish mistake. I was keeping all data for old shows in a single file, loading it into a set (it's like a dictionary or hash with only keys) and querying the set to see if something was new or not. Since checking a set for a key takes O(1) time on average, I thought I'd be fine. However, large values of n are always eager to kick you in the face. At the time of this writing, my personal oldshows file is about 7,000 lines long (it was several times larger until I did a little cleanup), and every show was being checked against a set containing all old shows. This becomes especially ugly when a show has nearly 500 entries in its feed!

So obviously, that's got to be changed. Each show needs its own "oldshows" file, which should dramatically improve scraping performance.
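
A minimal sketch of the per-show idea, not redrain's actual code. The function names, the ".oldshows" suffix and the one-identifier-per-line layout are all assumptions made for illustration.

```python
import os
import re
import tempfile


def seen_file_for(show_name, data_dir):
    """Map a show name to its own file (hypothetical naming scheme)."""
    safe = re.sub(r"[^A-Za-z0-9_.-]+", "_", show_name).strip("_") or "show"
    return os.path.join(data_dir, safe + ".oldshows")


def load_seen(path):
    """Return the set of episode identifiers already recorded for one show."""
    if not os.path.exists(path):
        return set()
    with open(path) as handle:
        return set(line.strip() for line in handle if line.strip())


def record_new(path, episode_ids):
    """Append unseen identifiers to the show's file; return them in order."""
    seen = load_seen(path)
    fresh = []
    for episode_id in episode_ids:
        if episode_id not in seen:
            seen.add(episode_id)
            fresh.append(episode_id)
    with open(path, "a") as handle:
        for episode_id in fresh:
            handle.write(episode_id + "\n")
    return fresh


if __name__ == "__main__":
    # Throwaway directory so the demo never touches real data.
    demo_dir = tempfile.mkdtemp()
    demo_path = seen_file_for("Example Show", demo_dir)
    print(record_new(demo_path, ["ep1.mp3", "ep2.mp3"]))
    print(record_new(demo_path, ["ep2.mp3", "ep3.mp3"]))
```

With this layout, each show only loads and searches its own history, so one show with a huge back catalogue no longer slows down the rest.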

Migration to Python 3 might happen in the future, but a dependency of feedparser has prompted me to not do this yet. I may update things to allow 2to3.py to work, but it'll remain experimental for now. I don't want installation to be any more work than needed.

I'm also dropping my own file format for config stuff and replacing it with JSON. I feel that there isn't any reason to maintain a baroque config file format when there are bound to be better ways to do these things.
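
For illustration, here is roughly what a JSON config and its loader could look like. The key names ("download_dir", "feeds", "name", "url") and the defaults are invented for this sketch and are not taken from redrain.

```python
import copy
import json

# Hypothetical defaults; the real program may want different keys entirely.
DEFAULTS = {"download_dir": "~/podcasts", "feeds": []}

EXAMPLE = """
{
  "download_dir": "/tmp/shows",
  "feeds": [
    {"name": "Example Show", "url": "http://example.com/feed.xml"}
  ]
}
"""


def load_config(text):
    """Parse JSON config text, filling in defaults for any missing keys."""
    config = copy.deepcopy(DEFAULTS)
    config.update(json.loads(text))
    for feed in config["feeds"]:
        if "url" not in feed:
            raise ValueError("feed entry is missing 'url': %r" % (feed,))
    return config


if __name__ == "__main__":
    print(load_config(EXAMPLE)["feeds"][0]["url"])
```

The appeal is that the standard library's json module does all the parsing, so there is no hand-rolled format left to maintain.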

It's also time for a whitelist/blacklist feature, something I've not seen in any other podcatching software. The idea is to create a simple mechanism by which certain episodes are simply not downloaded. For example, a few shows run "preview" episodes which I never listen to in the first place. Others do "year in review" episodes which are basically just clip shows. So, why not add a way to keep me from ever downloading them in the first place?
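
One possible shape for such a filter, sketched under some assumptions: episodes are matched by title, the patterns are case-insensitive regular expressions, and a whitelist match overrides a blacklist match. The function name and that precedence rule are illustrative choices, not a settled design.

```python
import re


def should_download(title, blacklist=(), whitelist=()):
    """Return True if an episode title passes the (hypothetical) filters.

    A whitelist match always wins; otherwise any blacklist match
    rejects the episode. Titles matching neither list are downloaded.
    """
    if any(re.search(p, title, re.IGNORECASE) for p in whitelist):
        return True
    if any(re.search(p, title, re.IGNORECASE) for p in blacklist):
        return False
    return True


if __name__ == "__main__":
    rules = [r"\bpreview\b", r"year in review"]
    for name in ["Episode 12", "Preview: next season", "2014 Year in Review"]:
        print("%s -> %s" % (name, should_download(name, blacklist=rules)))
```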

Seen in the wild lately: podcasts with .torrent files in their media enclosures. Neat! This is a great way to distribute shows, and I'm going to try to encourage it by adding BitTorrent support as an option to my program. It seems easy enough to add a simple downloader, but I'm a little iffy about how to handle seeding afterwards. Since Python isn't known for multithreaded anything, it seems that I might not be able to keep a seed going.

Also: testing features are needed. Most importantly, a cache to store feeds so that when something goes awry, I can at least diff the thing and see what went wrong. This will make debugging other things a bit simpler, too.
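
A rough sketch of the cache-and-diff idea using only the standard library. The cache layout (one ".xml" file per show) and the function name are assumptions; it keeps just the most recent copy of each feed and reports what changed since the last fetch.

```python
import difflib
import os
import tempfile


def cache_feed(cache_dir, show_name, feed_text):
    """Store the latest feed text; return a unified diff against the old copy."""
    path = os.path.join(cache_dir, show_name + ".xml")
    previous = ""
    if os.path.exists(path):
        with open(path) as handle:
            previous = handle.read()
    with open(path, "w") as handle:
        handle.write(feed_text)
    diff = difflib.unified_diff(
        previous.splitlines(), feed_text.splitlines(),
        fromfile="previous", tofile="current", lineterm="")
    return list(diff)


if __name__ == "__main__":
    # Throwaway directory so the demo leaves no real cache behind.
    demo_dir = tempfile.mkdtemp()
    cache_feed(demo_dir, "example", "<item>ep1</item>\n")
    changes = cache_feed(demo_dir, "example", "<item>ep1</item>\n<item>ep2</item>\n")
    print("\n".join(changes))
```

An empty list means the feed did not change between fetches, which is itself useful to know when hunting for a scraping bug.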

Anyway, there's my personal todo list for an old project. I'm pretty sure that exactly zero other people use it.

-spook

Written: 2015-Jan-01 1910 CST (GMT-6)