Richard W.M. Jones: Downloading all the 78rpm rips at the Internet Archive

I’m a bit of a fan of 1930s popular music on gramophone records, so much so that I own an original early-30s gramophone player and an extensive collection of discs. So the announcement that the Internet Archive had released a collection of 29,000 records was pretty amazing.

[Edit: If you want a light introduction to this, I recommend this double CD]

I wanted to download it … all!

But apart from this gnomic explanation it isn’t obvious how, so I had to work it out. Here’s how I did it …

First, start with the Advanced Search form. Using the second form on that page, put collection:georgeblood in the query box, select only the identifier field, and set the format to CSV. Set the limit to 30000 (there are more than 25,000 records) and download the huge CSV:

$ ls -l search.csv
-rw-rw-r--. 1 rjones rjones 2186375 Aug 14 21:03 search.csv
$ wc -l search.csv
25992 search.csv
$ head -5 search.csv
"identifier"
"78_jeannine-i-dream-of-you-lilac-time_bar-harbor-society-orch.-irving-kaufman-shilkr_gbia0010841b"
"78_a-prisoners-adieu_jerry-irby-modern-mountaineers_gbia0000549b"
"78_if-i-had-the-heart-of-a-clown_bobby-wayne-joe-reisman-rollins-nelson-kane_gbia0004921b"
"78_how-many-times-can-i-fall-in-love_patty-andrews-and-tommy-dorsey-victor-young-an_gbia0013066b"

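The same query can probably be expressed as a URL rather than a form submission. The sketch below (Python 3) only builds that URL. The `advancedsearch.php` endpoint and the `q`, `fl[]`, `rows` and `output` parameter names are assumptions based on what the form appears to submit, so compare the result with the URL your browser shows before relying on it. The function name `search_csv_url` is made up for this example.

```python
from urllib.parse import urlencode


def search_csv_url(collection="georgeblood", limit=30000):
    """Build an Advanced Search URL asking for identifiers only, as CSV.

    Endpoint and parameter names are assumptions, not taken from the post.
    """
    params = [
        ("q", "collection:%s" % collection),  # the query box
        ("fl[]", "identifier"),               # the only field selected
        ("rows", str(limit)),                 # the limit
        ("output", "csv"),                    # the format
    ]
    return "https://archive.org/advancedsearch.php?" + urlencode(params)


print(search_csv_url())
```

If the assumptions hold, saving the response from that URL should give the same search.csv as the form.
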
A bit of URL exploration found a fairly straightforward way to turn those identifiers into directory listings. For example:

78_jeannine-i-dream-of-you-lilac-time_bar-harbor-society-orch.-irving-kaufman-shilkr_gbia0010841b
→ https://archive.org/download/78_jeannine-i-dream-of-you-lilac-time_bar-harbor-society-orch.-irving-kaufman-shilkr_gbia0010841b
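
As a small illustration of that mapping, here is a Python 3 sketch that reads the search CSV and yields one directory-listing URL per identifier. The helper name `listing_urls` is hypothetical. It takes the CSV contents as a string so it can be tried without touching the network.

```python
import csv
import io


def listing_urls(csv_text):
    """Yield a directory-listing URL for each identifier in the search CSV."""
    for row in csv.reader(io.StringIO(csv_text)):
        # Skip blank lines and the "identifier" header row.
        if not row or row[0] == "identifier":
            continue
        yield "https://archive.org/download/%s/" % row[0]


sample = '"identifier"\n"78_a-prisoners-adieu_jerry-irby-modern-mountaineers_gbia0000549b"\n'
for url in listing_urls(sample):
    print(url)
```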

What I want to do is pick the first MP3 file in each directory and download it. I’m not fussy about how to do that, and Python has both a CSV library and an HTML fetching library. This script turns the CSV file of identifiers into a list of MP3 URLs. You could easily adapt it to download FLAC files instead.

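#!/usr/bin/python
# Python 2 script; needs BeautifulSoup 3.

import csv
import re
import urllib2
import urlparse
from BeautifulSoup import BeautifulSoup

with open('search.csv', 'rb') as csvfile:
    csvreader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in csvreader:
        # Skip blank lines and the CSV header row.
        if not row or row[0] == "identifier":
            continue
        # Directory listing for this item.
        url = "https://archive.org/download/%s/" % row[0]
        page = urllib2.urlopen(url).read()
        soup = BeautifulSoup(page)
        # Escape the dot so only hrefs ending in a literal ".mp3" match.
        links = soup.findAll('a', attrs={'href': re.compile(r"\.mp3$")})
        # An item with no MP3 would raise IndexError on links[0], so skip it.
        if not links:
            continue
        # Only want the first link in the page.
        link = links[0].get('href', None)
        # The href is relative to the listing, so make it absolute.
        link = urlparse.urljoin(url, link)
        print link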
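
The script above is written for Python 2 and BeautifulSoup 3. For readers on Python 3, the link-picking step could be sketched with the standard library alone, as below. This is an adaptation rather than the original method. The names `Mp3LinkParser`, `first_mp3_url` and `fetch_listing` are invented for the example, and it assumes the directory listing is ordinary HTML with relative `href` values.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class Mp3LinkParser(HTMLParser):
    """Collect the href of every <a> tag whose href ends in ".mp3"."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.endswith(".mp3"):
            self.links.append(href)


def first_mp3_url(listing_url, html):
    """Return the absolute URL of the first MP3 link, or None if there is none."""
    parser = Mp3LinkParser()
    parser.feed(html)
    if not parser.links:
        return None
    return urljoin(listing_url, parser.links[0])


def fetch_listing(listing_url):
    """Fetch a directory listing page as text (makes a network request)."""
    with urlopen(listing_url) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling `first_mp3_url(url, fetch_listing(url))` for each listing URL would then print the same kind of list as the Python 2 script.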

When you run this it converts each identifier into a download URL:

Edit: Amusingly, on the original post WordPress turns the next pre section of MP3 URLs into music players. I recommend listening to them!

$ ./download.py | head -10
https://archive.org/download/78_jeannine-i-dream-of-you-lilac-time_bar-harbor-society-orch.-irving-kaufman-shilkr_gbia0010841b/Jeannine%20I%20Dream%20Of%20You%20%22Lilac%20%20-%20Bar%20Harbor%20Society%20Orch..mp3
https://archive.org/download/78_a-prisoners-adieu_jerry-irby-modern-mountaineers_gbia0000549b/A%20Prisoner%27s%20Adieu%20-%20Jerry%20Irby%20-%20Modern%20Mountaineers.mp3
https://archive.org/download/78_if-i-had-the-heart-of-a-clown_bobby-wayne-joe-reisman-rollins-nelson-kane_gbia0004921b/If%20I%20Had%20The%20Heart%20of%20A%20Clown%20-%20Bobby%20Wayne.mp3
https://archive.org/download/78_how-many-times-can-i-fall-in-love_patty-andrews-and-tommy-dorsey-victor-young-an_gbia0013066b/How%20Many%20Times%20%28Can%20I%20Fal%20-%20Patty%20Andrews%20And%20Tommy%20Dorsey.mp3
https://archive.org/download/78_ill-forget-you_alan-dean-ball-burns-joe-lipman_gbia0002540a/I%27ll%20Forget%20You%20-%20Alan%20Dean%20-%20Ball%20-%20Burns.mp3
https://archive.org/download/78_it-aint-gonna-rain-no-mo-ya-no-va-a-llover_international-novelty-orchestra-wend_gbia0014114a/It%20Ain%27t%20Gonna%20Rain%20No%20M%20-%20International%20Novelty%20Orchestra.mp3
https://archive.org/download/78_i-still-keep-dreaming_leroy-holmes-and-his-orchestra-sourwine-johnny-corva_gbia0004815b/I%20Still%20Keep%20Dreaming%20-%20Leroy%20Holmes%20and%20his%20Orchestra.mp3
https://archive.org/download/78_it-aint-nobodys-bizness_lulu-belle--scotty-browne-sampsel-markowitz_gbia0010017a/It%20Ain%27t%20Nobody%27s%20Bizness%20-%20Lulu%20Belle%20%26%20Scotty.mp3
https://archive.org/download/78_i-still-get-a-thrill-thinking-of-you_art-lund-johnny-thompson-coots-davis_gbia0002767a/I%20Still%20Get%20A%20Thrill%20%28Thinking%20Of%20You%29%20-%20Art%20Lund.mp3
https://archive.org/download/78_in-the-gloaming_art-hickmans-orchestra-logan_gbia0006430a/In%20The%20Gloaming%20-%20Art%20Hickman%27s%20Orchestra.mp3

And after that you can download as many 78s as you can handle 🙂 by doing:

$ ./download.py > downloads
$ wget -nc -i downloads
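
If wget is not to hand, the skip-what-you-already-have behaviour of `-nc` can be approximated in Python 3. The sketch below is not a reimplementation of wget: the way it derives local file names (last path component, percent-decoded) is a choice made for this example and may not match wget's naming in every case. `local_filename` and `download_missing` are hypothetical helper names.

```python
import os
from urllib.parse import unquote, urlsplit
from urllib.request import urlretrieve


def local_filename(url):
    """Use the percent-decoded last path component as the local file name."""
    return unquote(os.path.basename(urlsplit(url).path))


def download_missing(urls, dest_dir="."):
    """Download each URL unless a file of that name is already in dest_dir.

    Returns the list of paths that were actually fetched.
    """
    fetched = []
    for url in urls:
        target = os.path.join(dest_dir, local_filename(url))
        if os.path.exists(target):
            continue  # already downloaded, so leave it alone (like -nc)
        urlretrieve(url, target)
        fetched.append(target)
    return fetched
```

To use it with the file written above, read the non-empty lines of downloads into a list and pass that list to `download_missing`.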


Source From: fedoraplanet.org.
