Today I decided to catch up with NSScreencast. NSScreencast currently has over 100 episodes, and I had stopped somewhere around episode 40.
I like to watch them offline when I have time, but downloading the missing ~60 episodes by hand would take too long. Having recently installed GNU Parallel, I decided to give it a spin.
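If you want to follow along and don’t have GNU Parallel yet, it’s available through the usual package managers; with Homebrew on macOS (just an assumption about your setup) this should do:
# assuming macOS with Homebrew - use your platform's package manager otherwise
brew install parallel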
My goal was to generate a PDF of every episode page as well as to download every episode’s mp4 file.
This blog post is a short write-up of how one might scrape NSScreencast without fancy tools. If you are an NSScreencast subscriber and like to watch the episodes offline, this might help you too.
Generate PDFs of every episode (timing parallel)
Note: the episode description pages can be accessed without authentication.
I wanted to use wkhtmltopdf to generate the PDFs, so I first checked whether it worked as desired:
wkhtmltopdf http://www.nsscreencast.com/episodes/1 1.pdf
The PDF was generated without problems and looked good.
To get a baseline, I ran the whole thing once with a regular bash loop and timed it:
time for i in {1..102}; do wkhtmltopdf http://www.nsscreencast.com/episodes/$i $i.pdf; done
real 7m47.588s
user 1m46.571s
sys 0m17.431s
This generated all 102 PDFs in the current working directory. But it was quite slow.
For comparison, here’s the same PDF generation with parallel (YMMV):
time parallel wkhtmltopdf http://www.nsscreencast.com/episodes/{} {}.pdf ::: {1..102}
real 1m7.893s
user 1m58.293s
sys 0m19.502s
That’s incredibly fast - and it was enough to make me give up on regular bash loops for this use case, since the gap would only widen when downloading huge movie files.
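The speedup comes from parallel running several jobs at once (by default roughly one per CPU core) while the bash loop handles one episode at a time. Each argument after ::: is substituted for the {} placeholder, and the -j option sets how many jobs run concurrently if the default doesn’t fit - a sketch, with 8 chosen arbitrarily:
parallel -j 8 wkhtmltopdf http://www.nsscreencast.com/episodes/{} {}.pdf ::: {1..102}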
Authenticate with NSScreencast
Note: the episodes themselves are behind a paywall.
Since I’m a paying subscriber, I needed to log in first to retrieve a valid session.
Looking at Chrome’s developer tools, I copied the sign-in request as cURL to extract the Set-Cookie header:
EMAIL=you@example.com
PASSWORD=awesomeSecretPassword
# sign in, write the response headers to a file, then pull the session cookie out of Set-Cookie
COOKIE=$(curl -D header -d "email=$EMAIL&password=$PASSWORD" \
'https://www.nsscreencast.com/user_sessions' > /dev/null && \
cat header | grep Set-Cookie | sed 's/Set-Cookie: \(.*\); d.*/\1/' && \
rm header)
I had to use sed to trim the Set-Cookie header down to just the key=value pair so that I could later feed it back into cURL.
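To see what that sed expression does, here’s a made-up Set-Cookie header (the real cookie name and value will differ) being reduced to the part cURL needs:
echo 'Set-Cookie: _nsscreencast_session=abc123; domain=.nsscreencast.com; path=/; HttpOnly' \
| sed 's/Set-Cookie: \(.*\); d.*/\1/'
# prints: _nsscreencast_session=abc123
Alternatively, cURL can keep track of cookies itself with a cookie jar (-c cookies.txt on the login request, -b cookies.txt on later requests), which avoids the header parsing entirely.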
Scraping NSScreencast
Having a valid session cookie stored in $COOKIE, I again looked at Chrome to get the URL of an episode. As it turned out, I could reuse the episode URL and just append “.mp4” to it, which redirected me to the episode’s video.
Running cURL once validated this:
curl -O -LJ -b "$COOKIE" http://www.nsscreencast.com/episodes/1.mp4
Seconds later the first episode finished downloading. Sweet.
Using parallel, I sped up the whole process and downloaded all of them:
parallel curl -O -LJ -b "$COOKIE" http://www.nsscreencast.com/episodes/{}.mp4 ::: {1..102}
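As a quick sanity check that everything arrived, you can count the downloaded files - with -J curl may use the server-supplied filename rather than the episode number, but all of them should end in .mp4:
ls *.mp4 | wc -l
# should print 102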
This saved me tons of time, and as NSScreencast releases new episodes I can continue to watch them offline.
Takeaway: GNU parallel is easy to use and speeds things up considerably!