Building A YouTube Archive System

2023-01-01

For the past few years I have been slowly building up a system that downloads and catalogs YouTube videos for my own personal viewing. As a result I have my own little library of internet content that I control and can keep for myself, something that is important to me.

Projects like this are never quite finished, but it’s at a point now where I’m fairly satisfied with it and I think what I’ve got could benefit someone else, or at least make for a mildly interesting article. My goal here is to go through how my system works and why you might want to build one yourself.

And so with that…

Why even do this?

One of my favorite things on the internet is the YouTube channel SiIvaGunner (Wikipedia), a music collective who create meme-filled remixes and mashups of mostly video game music, pretending the whole time to be uploading completely normal, high-quality rips of said music. While this should be a fully legal thing to do (both as parody and as non-commercial remixes), they often run into trouble with YouTube’s copyright system. Most notably, for a brief time in 2019 YouTube completely removed the channel, throwing into question whether they would be able to continue at all and blocking access to their entire back catalog.

When this happened, it occurred to me that if I wanted to make sure I always had access to their work, I would need to own and control my own copy of it. Thus the archive project was born.

If you aren’t a fan of niche remix cultures, maybe the fear of imminent copyright takedowns isn’t enough of a reason to set up a system like this. There are plenty of other reasons though:

YouTubers (or YouTube themselves!) delete, edit, or unlist videos all the time for a variety of reasons. Saving the videos yourself is the only way to make sure you’ll always have access to your favorite creator’s stuff
If your internet connection is unstable but you still want to watch videos in the highest quality, downloading them ahead of time can prevent annoying interruptions for buffering
Maybe you want to watch videos when you’re away from an internet connection and don’t want to use YouTube Premium¹ (or you want to save them on your laptop, which YouTube Premium doesn’t even support)
Honestly, if you’re the sort of person who is thinking of setting up something like this, you probably, like me, just like the idea of owning and controlling your own media

So if you’re ready to take the plunge, read on for how the system works as of today.

Software Tools and Dependencies

Not all of these are strictly required, but I’ve included everything I’m using for completeness’ sake.

My system runs on macOS, so some of these apply only to that platform. Most of the core stuff is fairly platform agnostic though.

My video-archiver script
- Python with the yt-dlp, click, and requests packages
- ffmpeg (installed via Homebrew)
Lingon X for automation (macOS only)
Pushover for notifications
Hazel for organization (macOS only)
IINA for playing videos (macOS only, I like VLC on other platforms)
Plex for browsing and viewing
Play for saving video links (macOS/iOS only)

Hardware

There’s no specialized hardware needed to make this work, but obviously it will take up a fair amount of disk space. I’m running it on an M1 iMac² with zero issues, but a previous iteration ran just fine on a 2011 iMac.

For storage, I have an OWC ThunderBay 4 with four 4TB Seagate IronWolf NAS drives in RAID 5, giving a total of 12TB of usable storage. This space is shared with backups, Blu-ray rips³, and other long-term large file storage, but provides plenty of room to grow. The ThunderBay hardware has been rock solid for the two years I’ve had it, though the SoftRAID software that’s required to use it has been a bit more hit-or-miss. I haven’t lost any data though, and most of the issues seem to have been resolved through updates (especially the update to macOS Ventura).

The System

Downloading Channels

The core of the system is periodically checking a list of channels and downloading any new content from them. The tool that makes this whole thing possible is yt-dlp, a more actively developed fork of youtube-dl. These little command line utilities are super powerful and the best way to download nearly any internet video (besides DRM protected stuff). When I first started the project, I was just using the batch downloading functionality of youtube-dl, which is surprisingly configurable. Eventually though I outgrew that and wrote my own Python script that hooks into yt-dlp’s Python interface.

My script is on GitHub and is decently well documented both in the readme and by running it with the --help flag. For the purpose of downloading a set of whole channels, the best way is to run the script with a list of channels saved in a text file.

From my channels.txt file, the script gets two things for each channel: the URL to download from (typically https://youtube.com/channel/<some random characters>/videos) and the number of recent videos to download.⁴ Some channels have additional filters on what I want to download. For instance, if a channel has both livestream VODs and short-form videos, I might add {"match_filter":"duration < 1800"} to only download the regular videos.
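To make the idea concrete, here is a minimal sketch of reading one entry from a channels file. The exact layout of channels.txt isn’t described above, so this assumes a hypothetical format of one channel per line: the URL, the number of recent videos, and an optional JSON object of extra filters, separated by whitespace. The `ChannelEntry` and `parse_channel_line` names are made up for illustration and are not taken from the actual script.

```python
import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ChannelEntry:
    url: str
    count: int
    options: dict = field(default_factory=dict)


def parse_channel_line(line: str) -> Optional[ChannelEntry]:
    """Parse one line of an ASSUMED 'URL COUNT [JSON]' channels file."""
    line = line.strip()
    # Skip blank lines and (assumed) '#' comment lines.
    if not line or line.startswith("#"):
        return None
    # Split into at most three fields so spaces inside the JSON survive.
    parts = line.split(maxsplit=2)
    if len(parts) < 2:
        raise ValueError(f"expected 'URL COUNT [JSON]', got: {line!r}")
    options = json.loads(parts[2]) if len(parts) == 3 else {}
    return ChannelEntry(url=parts[0], count=int(parts[1]), options=options)
```

With this layout, a line ending in `{"match_filter":"duration < 1800"}` would come back with that filter in `options`, ready to be handed to the downloader.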

When run, the script goes through the channels file and downloads the specified number of recent videos for each channel. It is pre-configured to download at the highest quality, embed the thumbnail and subtitles if they exist, and name the file with the title, channel, and upload date. All of its progress is logged to a file, which is rotated so logs of the last five runs are kept.
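For a rough idea of what that configuration could look like through yt-dlp’s Python interface, here is a sketch. It is not the script’s actual code: the output template, folder name, and function names are assumptions, and only the option keys themselves (`format`, `outtmpl`, `writethumbnail`, `writesubtitles`, `playlistend`, `postprocessors`, `match_filter`) are real yt-dlp options.

```python
def build_ydl_options(count, extra=None):
    """Build a yt-dlp options dict for the 'N most recent videos' case."""
    opts = {
        # Best video and audio streams, merged by ffmpeg.
        "format": "bestvideo*+bestaudio/best",
        # ASSUMED naming scheme: title, channel, and upload date.
        "outtmpl": "downloads/%(title)s - %(uploader)s - %(upload_date)s.%(ext)s",
        # Fetch the thumbnail and subtitles so they can be embedded.
        "writethumbnail": True,
        "writesubtitles": True,
        # Only look at the first `count` entries of the channel listing.
        "playlistend": count,
        "postprocessors": [
            {"key": "FFmpegEmbedSubtitle"},
            {"key": "EmbedThumbnail"},
        ],
    }
    if extra:
        # Pass through anything else except match_filter, handled below.
        opts.update({k: v for k, v in extra.items() if k != "match_filter"})
    return opts


def download_recent(url, count, extra=None):
    """Download the most recent `count` videos from `url`."""
    import yt_dlp  # third-party; imported here so the builder has no dependencies

    opts = build_ydl_options(count, extra)
    if extra and "match_filter" in extra:
        # Turn a filter string such as "duration < 1800" into a callable.
        opts["match_filter"] = yt_dlp.utils.match_filter_func(extra["match_filter"])
    with yt_dlp.YoutubeDL(opts) as ydl:
        return ydl.download([url])
```

Calling `download_recent(url, 5, {"match_filter": "duration < 1800"})` would then fetch up to five recent videos shorter than 30 minutes.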

Once the videos are downloaded, the script moves them from its temporary downloads folder to the permanent Archives folder. From here, Hazel takes over. For each channel that I have saved there is a corresponding folder with the exact name of that channel as it appears in all of the files downloaded from it. Hazel collects all the new videos and sorts them into their proper folder. If there is no folder for the video’s channel, it sorts into the -One-offs-⁵ folder. The Hazel rules I use for this are included along with the script.
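Hazel does this sorting in my setup, but the logic is simple enough to sketch in Python for anyone not on macOS. This is only an illustration, not a translation of the actual Hazel rules: it assumes a hypothetical file naming scheme of "Title - Channel - YYYYMMDD.ext", and the function names are made up.

```python
import shutil
from pathlib import Path

ONE_OFFS = "-One-offs-"


def channel_from_filename(path):
    """Pull the channel out of an ASSUMED 'Title - Channel - Date' name."""
    # Split from the right so titles containing ' - ' still work.
    parts = path.stem.rsplit(" - ", 2)
    return parts[1] if len(parts) == 3 else None


def sort_archive(archive_dir):
    """Move loose files into their channel folder, or the one-offs folder."""
    archive_dir = Path(archive_dir)
    moved = {}
    # sorted() takes a snapshot, so moving files during the loop is safe.
    for item in sorted(archive_dir.iterdir()):
        if not item.is_file() or item.name.startswith("."):
            continue
        channel = channel_from_filename(item)
        if channel and (archive_dir / channel).is_dir():
            target = archive_dir / channel
        else:
            # No matching channel folder: treat it as a one-off.
            target = archive_dir / ONE_OFFS
            target.mkdir(exist_ok=True)
        shutil.move(str(item), str(target / item.name))
        moved[item.name] = target.name
    return moved
```

The important property is the same as with the Hazel rules: a folder only receives videos if its name matches the channel name in the file exactly.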

The key to keeping this efficient is the ‘download archive’ feature of yt-dlp. It records every downloaded video in a file, and that file is checked before each download to see whether the video has been downloaded before. Since the downloads folder is separate from the Archives folder, the script can’t tell what it already has just by looking at the files on disk, so this is essential to ensure videos don’t get downloaded more than once.
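yt-dlp handles all of this itself once its `download_archive` option points at a file, so nothing like the following needs to be written in practice. Still, the idea is small enough to sketch: keep a set of "extractor id" entries in a text file and check it before each download. The `DownloadArchive` class is purely illustrative.

```python
from pathlib import Path


class DownloadArchive:
    """A tiny stand-in for the idea behind yt-dlp's download archive."""

    def __init__(self, path):
        self.path = Path(path)
        self.seen = set()
        if self.path.exists():
            # One entry per line, e.g. "youtube <video id>".
            lines = self.path.read_text().splitlines()
            self.seen = {line.strip() for line in lines if line.strip()}

    def __contains__(self, entry):
        return entry in self.seen

    def record(self, entry):
        """Remember an entry, appending to the file only if it is new."""
        if entry in self.seen:
            return
        with self.path.open("a") as handle:
            handle.write(entry + "\n")
        self.seen.add(entry)
```

Because the check is against this file rather than the contents of the downloads folder, it keeps working after the videos have been moved elsewhere.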

Notifications

To make sure everything is going smoothly, my script can optionally send notifications upon completion or when any errors are encountered. To achieve this, I use Pushover, which provides a dead-simple way to send notifications using a REST API. All that’s required to configure this is to provide a user token and an app token to the script using the PUSHOVER_USER and PUSHOVER_TOKEN environment variables, respectively. I tried a few other services that do this, but none were as easy, configurable, and fairly priced as Pushover.
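As an illustration of how little is involved, here is a sketch of sending a Pushover message using only the standard library (my script uses the requests package, so this is not its actual code). The endpoint and the `token`, `user`, `message`, and `title` fields are part of Pushover’s message API; the function names are assumptions.

```python
import os
import urllib.parse
import urllib.request

PUSHOVER_URL = "https://api.pushover.net/1/messages.json"


def build_payload(message, title=None, env=os.environ):
    """Assemble the form fields Pushover expects from the environment."""
    payload = {
        "token": env["PUSHOVER_TOKEN"],  # the application token
        "user": env["PUSHOVER_USER"],    # the user key
        "message": message,
    }
    if title:
        payload["title"] = title
    return payload


def send_notification(message, title=None):
    """POST the message to Pushover and return the HTTP status code."""
    data = urllib.parse.urlencode(build_payload(message, title)).encode()
    request = urllib.request.Request(PUSHOVER_URL, data=data)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

A call such as `send_notification("Archive run finished", title="video-archiver")` at the end of a run is all it takes, provided both environment variables are set.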

Automating

Now we have the script working great and notifying us of its results, but having to manually run it would be tiresome. To solve this problem I use Lingon X to create a launchd job⁶ to run the script at set intervals.

My job is configured with these parameters:

Run at startup and when saving
Run every day at 11 PM
The Working Directory is set to the directory that my script is in (this is critical to make sure that the videos are downloaded in the correct place)
The PATH environment variable is set to include the Homebrew bin directory
The PUSHOVER_USER and PUSHOVER_TOKEN environment variables are set to their proper values
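For anyone who would rather not use Lingon X, the settings above map fairly directly onto launchd property list keys. The sketch below builds such a plist with Python’s plistlib. The label, paths, script name, and token values are all placeholders, and treating "Run at startup and when saving" as `RunAtLoad` is my assumption about what Lingon X writes.

```python
import plistlib


def build_launchd_job(label, program_args, working_dir, env):
    """Return a dict describing a launchd job that runs daily at 11 PM."""
    return {
        "Label": label,
        "ProgramArguments": list(program_args),
        # Critical: downloads land relative to this directory.
        "WorkingDirectory": working_dir,
        "EnvironmentVariables": dict(env),
        # Run when the job is loaded (assumed equivalent of the first setting).
        "RunAtLoad": True,
        # Every day at 23:00.
        "StartCalendarInterval": {"Hour": 23, "Minute": 0},
    }


job = build_launchd_job(
    label="com.example.video-archiver",                 # placeholder label
    program_args=["python3", "archiver.py"],            # placeholder command
    working_dir="/Volumes/Archive/video-archiver",      # placeholder path
    env={
        # Homebrew's bin directory on Apple Silicon, so ffmpeg is found.
        "PATH": "/opt/homebrew/bin:/usr/bin:/bin",
        "PUSHOVER_USER": "your-user-key",
        "PUSHOVER_TOKEN": "your-app-token",
    },
)

# The XML that would be saved as a .plist file in ~/Library/LaunchAgents.
plist_xml = plistlib.dumps(job).decode()
```

Writing `plist_xml` to a file and loading it with launchctl should give roughly the same job that Lingon X creates, minus the friendly interface.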

Downloading One-off Videos

There are some videos that I want to download and save, but I don’t need to keep the entire channel. For those, I have the one-offs.sh script. The basic workflow goes like this:

Wherever I’m watching YouTube, I add the desired video to Play and tag it with the Download tag.
The one-offs.sh script runs every 30 minutes. It runs a Shortcut which grabs any videos with the Download tag and saves their IDs to a file.
That file is passed to the archive script, and the videos are downloaded.
Another Shortcut is run which removes the videos from Play.
The file of IDs is removed.
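The real one-offs.sh is a shell script, but the flow of those steps can be sketched in Python. Everything specific here is hypothetical: the two Shortcut names, the `archive_cmd` used to invoke the archive script, and the assumption that the first Shortcut writes its result to the file given by `--output-path`. The only real interface is the macOS `shortcuts run` command.

```python
import subprocess
from pathlib import Path


def run_one_offs(ids_file, archive_cmd, run=subprocess.run):
    """Run one pass of the one-off workflow; return True if anything ran."""
    ids_file = Path(ids_file)
    # Ask a (hypothetically named) Shortcut to dump tagged video IDs.
    run(["shortcuts", "run", "Get Download Videos",
         "--output-path", str(ids_file)], check=True)
    # Nothing tagged: clean up and stop without touching the archiver.
    if not ids_file.exists() or not ids_file.read_text().strip():
        ids_file.unlink(missing_ok=True)
        return False
    # Hand the file of IDs to the archive script (command is a placeholder).
    run([*archive_cmd, str(ids_file)], check=True)
    # Second (hypothetically named) Shortcut removes the videos from Play.
    run(["shortcuts", "run", "Remove Downloaded Videos"], check=True)
    # Finally remove the temporary file of IDs.
    ids_file.unlink()
    return True
```

Because `check=True` raises on failure, a failed download stops the run before the videos are removed from Play, so they get picked up again on the next pass.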

Viewing

I have two ways of accessing the videos once they’re downloaded.

The simpler but less user-friendly way is to connect to my host computer using SMB filesharing (or SFTP if I’m away from home) and use a video player to play the files directly. My preferred player is IINA, but VLC and nPlayer also have their time.

The second, more complicated way is using Plex. I already have a Plex server running on my host machine for my movie and television show collection, so it was natural to include the YouTube archive in there as well. However, Plex isn’t really set up to handle this sort of media, so I don’t really recommend setting it up just for this.

Areas for Improvement

Of course there are always ways things could be better.

Better Viewing

As I mentioned earlier, Plex isn’t really meant for this sort of thing. While it excels at managing a library of known movies and television shows, throwing arbitrary internet content at it has a tendency to confuse it a bit, and it’s missing some sorting and browsing features that an app designed to handle YouTube content would have. In my ideal world, I would write my own custom front end to this, but I haven’t gotten around to that yet.

Remotely Triggering Refreshes

I currently have no way to trigger a refresh or force a one-off download when I’m away from my computer. SSH is an obvious solution to this, but the way things are implemented currently requires the SSH session to remain active while the downloads complete. Some combination of screen or forking the process might be able to solve this, but it hasn’t been pressing enough for me to invest the time yet.
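One way the forking approach might look, as an untested idea rather than something I’m running: start the download in its own session with its output sent to a log file, so that closing the SSH connection doesn’t take the process down with it. The `launch_detached` name and the log path are assumptions; `start_new_session=True` is a standard `subprocess.Popen` option on POSIX systems.

```python
import subprocess
import sys


def launch_detached(cmd, log_path):
    """Start `cmd` in a new session so it outlives the launching shell."""
    log = open(log_path, "ab")
    try:
        process = subprocess.Popen(
            cmd,
            stdin=subprocess.DEVNULL,      # no terminal input to wait on
            stdout=log,                    # keep output out of the SSH session
            stderr=subprocess.STDOUT,
            start_new_session=True,        # detach from the controlling terminal
        )
    finally:
        # The child keeps its own copy of the file handle.
        log.close()
    return process


if __name__ == "__main__":
    # Example: python3 detach.py python3 archiver.py channels.txt
    launch_detached(sys.argv[1:], "remote-run.log")
```

After launching, the SSH session can be closed right away and the log file checked later (or a Pushover notification awaited) to see how the run went.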

Better One-off Downloading

one-offs.sh works, but it’s pretty clunky. There are two main roadblocks that I’ve run into with it:

First is the remote triggering issue. The script currently has to be run on a schedule, which means most of the time it’s running and not doing anything at all.

Secondly, Play’s only scripting interface is Shortcuts. This would be fine, except that I’ve had major problems getting Shortcuts to behave nicely with Python scripts, especially when they are located on an external disk. Ideally the whole process would be implemented in Shortcuts, but for now I’m stuck writing temporary files to disk to pass information between Play and my Python script.⁷

Into the Future

Overall I’m pretty pleased with the system I’ve built. It’s serving my needs well: already I have some videos that are no longer available on YouTube saved safely on my hard drive(s). The most likely next step is to build an app for viewing the videos in a more pleasant way, but that’s a lot of work for something that probably only I would use.

I hope you’ve found this helpful, informative, or at least interesting! If you’re inspired to build something like this yourself, please let me know how it goes.

¹ It’s worth noting here that I do pay for, use, and enjoy YouTube Premium. It really isn’t a bad value if you use YouTube on your phone a lot, even if you have a system like this.
² Full specs: 2021 M1 iMac, 8-Core CPU, 8-Core GPU, 16GB memory, 1TB SSD. It’s purple and I love it.
³ Based on this article, it will not surprise you that I also like to own DRM-free copies of my favorite movies and shows. For those, I buy Blu-rays and rip them myself using MakeMKV and a flashed ASUS BW-16D1HT Blu-ray drive that I got from a guy named Alex Coluzzi on the MakeMKV forums. Jason Snell has a great write-up of his process which is pretty much what I do.
⁴ For a typical channel I’ll usually use 5 for this value, but a more active channel (or one that tends to upload in bursts, such as a musical artist) might have a higher value.
⁵ This could be named anything, but I like the dashes surrounding the name so it sorts to the top and stands out in the list of all the channels.
⁶ Yes, cron can do this same thing, but launchd is the more powerful and more macOS way to do automated jobs. And yes, I could write the launchd jobs myself, but Lingon X makes it easier to avoid errors and doesn’t require manually editing XML files.
⁷ The command-line interface for Shortcuts is also super buggy for trying to output results from running Shortcuts. I’d love to just use pipes instead of a temp file. Maybe I’m just holding it wrong, but I could not for the life of me get that to work.