Today I decided to catch up with NSScreencast. NSScreencast currently has over 100 episodes, and I had stopped somewhere around episode 40.
I like to watch them offline when I have time, but downloading the missing ~60 episodes by hand would take too long. Having recently installed GNU Parallel, I decided to give it a spin.
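If you want to follow along and don’t have GNU Parallel yet, it’s available through the usual package managers; with Homebrew on macOS (just an assumption about your setup) this should do:
# assuming macOS with Homebrew - use your platform's package manager otherwise
brew install parallel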
My goal was to generate a PDF of every episode page as well as to download every episode’s mp4 file.
This blog post is a short write-up of how one might scrape NSScreencast without fancy tools. If you are an NSScreencast subscriber and like to watch the episodes offline, this might help you too.
Generate PDFs of every episode (timing parallel)
Note: the episode description pages can be accessed without authentication.
I wanted to use wkhtmltopdf to generate the PDFs, so I first checked whether it worked as desired:
wkhtmltopdf http://www.nsscreencast.com/episodes/1 1.pdf
The PDF was generated without problems and looked good.
To get a baseline, I ran the whole thing once with a regular bash loop and timed it:
time for i in {1..102}; do wkhtmltopdf http://www.nsscreencast.com/episodes/$i $i.pdf; done
real 7m47.588s
user 1m46.571s
sys 0m17.431s
This generated all 102 PDFs in the current working directory. But it was quite slow.
For comparison, here’s the same PDF generation with parallel (YMMV):
time parallel wkhtmltopdf http://www.nsscreencast.com/episodes/{} {}.pdf ::: {1..102}
real 1m7.893s
user 1m58.293s
sys 0m19.502s
That’s incredibly fast - and it was enough to make me give up on regular bash loops for this use case, since the gap would only widen when downloading huge movie files.
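The speedup comes from parallel running several jobs at once (by default roughly one per CPU core) while the bash loop handles one episode at a time. Each argument after ::: is substituted for the {} placeholder, and the -j option sets how many jobs run concurrently if the default doesn’t fit - a sketch, with 8 chosen arbitrarily:
parallel -j 8 wkhtmltopdf http://www.nsscreencast.com/episodes/{} {}.pdf ::: {1..102}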
Authenticate with NSScreencast
Note: the episodes themselves are behind a paywall.
Since I’m a paying subscriber, I needed to log in first to retrieve a valid session.
Looking at Chrome’s developer tools, I copied the sign-in request as cURL to extract the Set-Cookie header:
EMAIL=you@example.com
PASSWORD=awesomeSecretPassword
# sign in, write the response headers to a file, then pull the session cookie out of Set-Cookie
COOKIE=$(curl -D header -d "email=$EMAIL&password=$PASSWORD" \
'https://www.nsscreencast.com/user_sessions' > /dev/null && \
cat header | grep Set-Cookie | sed 's/Set-Cookie: \(.*\); d.*/\1/' && \
rm header)
I had to use sed to trim the Set-Cookie header down to just the key=value pair so that I could later feed it back into cURL.
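To see what that sed expression does, here’s a made-up Set-Cookie header (the real cookie name and value will differ) being reduced to the part cURL needs:
echo 'Set-Cookie: _nsscreencast_session=abc123; domain=.nsscreencast.com; path=/; HttpOnly' \
| sed 's/Set-Cookie: \(.*\); d.*/\1/'
# prints: _nsscreencast_session=abc123
Alternatively, cURL can keep track of cookies itself with a cookie jar (-c cookies.txt on the login request, -b cookies.txt on later requests), which avoids the header parsing entirely.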
Scraping NSScreencast
Having a valid session cookie stored in $COOKIE, I again looked at Chrome to get the URL of an episode. As it turned out, I could reuse the episode URL and just append “.mp4” to it, which redirected me to the episode’s video.
Running cURL once validated this:
curl -O -LJ -b "$COOKIE" http://www.nsscreencast.com/episodes/1.mp4
Seconds later the first episode finished downloading. Sweet.
Using parallel, I sped up the whole process and downloaded all of them:
parallel curl -O -LJ -b "$COOKIE" http://www.nsscreencast.com/episodes/{}.mp4 ::: {1..102}
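As a quick sanity check that everything arrived, you can count the downloaded files - with -J curl may use the server-supplied filename rather than the episode number, but all of them should end in .mp4:
ls *.mp4 | wc -l
# should print 102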
This saved me tons of time, and as NSScreencast releases new episodes I can continue to watch them offline.
Takeaway: GNU parallel is easy to use and speeds things up considerably!