Desert Island Discs: all the records, books, and luxuries

Introduction

I was curious to find out more about BBC Radio 4’s Desert Island Discs but couldn’t find a spreadsheet with all the data. So I wrote a program to extract the data from the BBC website.

The top ten books and the number of times they were chosen are:

No. Book Number of times chosen
1 War and Peace by Leo Tolstoy 39
2 Encyclopaedia Britannica 32
3 A La Recherche Du Temps Perdu by Marcel Proust 32
4 Encyclopaedia 30
5 The History of the Decline and Fall of the Roman Empire by Edward Gibbon 18
6 The Oxford Book of English Verse 17
7 The Wind in the Willows by Kenneth Grahame 15
8 The Lord of the Rings by J R R Tolkien 14
9 The Oxford English Dictionary 13
10 The Pickwick Papers by Charles Dickens 13

Of the music played, the top thirty musicians are:

No. Musician Number of times chosen
1 Wolfgang Amadeus Mozart 969
2 Ludwig van Beethoven 819
3 Johann Sebastian Bach 762
4 Franz Schubert 395
5 Giuseppe Verdi 361
6 Giacomo Puccini 335
7 Edward Elgar 328
8 George Frideric Handel 300
9 The Beatles 291
10 Peter Ilyich Tchaikovsky 271
11 Johannes Brahms 260
12 Frank Sinatra 250
13 Richard Strauss 223
14 Claude Debussy 184
15 Sergey Rachmaninov 178
16 Benjamin Britten 166
17 Maurice Ravel 161
18 Ralph Vaughan Williams 160
19 Jean Sibelius 153
20 Felix Mendelssohn 151
21 Gustav Mahler 141
22 Igor Stravinsky 123
23 Ella Fitzgerald 120
24 Édith Piaf 119
25 William Shakespeare 114
26 Bob Dylan 109
27 Antonín Dvořák 109
28 Joseph Haydn 108
29 Noël Coward 105
30 Arthur Sullivan 104

Classical composers take nine of the top ten places; The Beatles, at number 9, are the only exception.

The top twenty luxuries chosen are:

No. Luxury Number of times chosen
1 Piano 137
2 Writing materials 64
3 Painting materials 43
4 Guitar 36
5 Bed 32
6 Radio receiver 28
7 Golf clubs and balls 28
8 Typewriter and paper 25
9 Telescope 23
10 Pen and paper 19
11 Champagne 19
12 Wine 16
13 Perfume 15
14 Tape recorder 14
15 Typewriter 13
16 Binoculars 13
17 Family photo album 10
18 Tea 9
19 Soap 9
20 Paper and pencils 9

If you want a list of all episodes, the music played, and the books, luxuries and favourite tracks chosen, you can find them here. The list is available as an Excel spreadsheet and a CSV file. The Excel spreadsheet has the full list of top books, musicians, and luxuries.

The rest of this post is about how I wrote the Python program to create the spreadsheet.

Background

I was learning Python as part of a machine learning course. This was a good opportunity to write a full program in Python with an actual use case, which would allow me to learn more Python concepts.

Choosing a web scraping framework

My first task was to find a library that would make scraping data from a website easy. There are many websites listing Python frameworks. One of them is ScrapHero. After looking at the pros and cons, I chose BeautifulSoup 4 (BS4). It was relatively easy to learn, robust and had good documentation.

I downloaded BS4 and became productive straight away, so I stuck with it and looked no further. If your needs are more complex, you may want to look at other frameworks.

Web scraping basics

It had been a long time since I’d done any web scraping. I can’t even remember which language I used at the time.

This is the basic process for scraping data off a website:

  1. Find the web page you want to extract data from.

  2. Look at the structure of the page using the developer tools of your browser. Both Chrome (and derivatives) and Firefox have good developer tools.

  3. Find the data you want on the page.

  4. Press Ctrl+Shift+C and click the element to inspect its source code.

  5. Work out how to navigate to that element using HTML and CSS.

  6. Write code!

Example of scraping web data

If you learn best from source code, you can go straight to the source code on GitHub.

I’m going to use the Michael Lewis Desert Island Discs episode for this example.

Here’s a view of the tracks chosen by Martin Lewis and a view of the inspector:

Music list

On the left side of the screenshot above, you’ll see a track highlighted: I Want You Back by The Jackson 5. On the right side is the source code for it. (You may need to zoom in to read the screenshot.) By looking at the code, you can see that each track is described in a <div> with the CSS class segment__track. Within the <div> is an <h3> tag containing a <span> tag (with class artist) that holds the name of the artist. Below that, inside a <p> tag, is a <span> tag with the name of the song.

The code to extract track data is in scraper.py and the relevant methods are shown below:

def extract_tracks_from_list(self, soup):
    """
    The episode page usually has a structured list of tracks.
    """

    tracks = TrackList()
    tracks_element = soup.find_all('div', class_='segment__track')
    for track_element in tracks_element:
        try:
            track = self.extract_artist_and_song_from_list(track_element)
            tracks.add(track)
        except Exception as e:
            print_error(f'Ignoring error extracting song/artist from element: {track_element}', e)

    return tracks

def extract_artist_and_song_from_list(self, track_element, class_='artist'):
    artist = ''
    song = ''
    if track_element:
        # Sometimes there is more than one artist, in which case join the names with ' & '
        artist_element = track_element.find_all('span', class_)
        for e in artist_element:
            artist += ' & ' if artist else ''
            artist += e.text

        song = track_element.p.span.text if track_element.p.span is not None else ''

    return Track(clean_string(artist), clean_string(song))

The method extract_tracks_from_list finds all the <div> elements with class segment__track and iterates over them. For each one, the method extract_artist_and_song_from_list extracts the name of the artist and the song title. The artist is extracted by looking for <span> tags with the class passed in as class_ (whose value in this instance is artist); if there is more than one, the names are joined with ' & '. The song is extracted by looking for a <p> tag containing a <span> tag (track_element.p.span.text).
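
To try the same navigation without fetching a BBC page, here is a minimal, self-contained sketch. The HTML snippet is an assumption reconstructed from the description above, not markup copied from the BBC site, and the second track in it is a made-up placeholder. The function is a simplified stand-in for the methods in scraper.py: it returns a plain tuple instead of a Track, and it also guards against a missing <p> tag.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, reconstructed from the description of the episode page.
SAMPLE_HTML = """
<div class="segment__track">
  <h3><span class="artist">The Jackson 5</span></h3>
  <p><span>I Want You Back</span></p>
</div>
<div class="segment__track">
  <h3><span class="artist">Artist A</span> <span class="artist">Artist B</span></h3>
  <p><span>Example Song</span></p>
</div>
"""


def extract_artist_and_song(track_element):
    """Return (artist, song) for one segment__track <div>."""
    # Collect every artist <span>; several artists are joined with ' & '.
    artists = [span.get_text(strip=True)
               for span in track_element.find_all('span', class_='artist')]
    # Guard against a missing <p>, or a <p> without a <span>.
    paragraph = track_element.p
    song_span = paragraph.span if paragraph is not None else None
    song = song_span.get_text(strip=True) if song_span is not None else ''
    return ' & '.join(artists), song


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
tracks = [extract_artist_and_song(div)
          for div in soup.find_all('div', class_='segment__track')]
```

Running this against the sample gives one (artist, song) pair per <div>, with the two placeholder artists in the second <div> joined into a single string.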

Program structure

The program is written in an object-oriented way. I did some fairly basic object-oriented analysis to determine the classes. In the following description of Desert Island Discs, the italicised words become classes in the program:

Desert Island Discs has people on it who are called castaways. Each castaway appears in his or her own episode. The castaway has a name and job title. Each episode has a URL on the BBC website. The castaway picks a track list, which consists of up to eight tracks. A track consists of an artist and a song. The castaway also picks a favourite track, a book and a luxury to take to the desert island. The program will require a parser to extract data from the BBC website and an output writer to display the data in a formatted way.
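
As a rough illustration of how that description could map onto classes, here is a sketch using dataclasses. Only Track (with an artist and a song) and TrackList (with an add method) appear in the excerpt above; the other class names, all field names, and the example values are assumptions, and the real program may be organised differently.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Track:
    artist: str
    song: str


@dataclass
class TrackList:
    tracks: List[Track] = field(default_factory=list)

    def add(self, track: Track) -> None:
        self.tracks.append(track)


@dataclass
class Castaway:
    name: str
    job: str


@dataclass
class Episode:
    # Hypothetical fields: one episode per castaway, plus their choices.
    url: str
    castaway: Castaway
    tracks: TrackList = field(default_factory=TrackList)
    favourite: Optional[Track] = None
    book: str = ''
    luxury: str = ''


# Made-up example values, for illustration only.
episode = Episode(url='https://example.org/episode',
                  castaway=Castaway('Example Castaway', 'Writer'))
episode.tracks.add(Track('The Jackson 5', 'I Want You Back'))
episode.luxury = 'Piano'
```

Defaulting the book, luxury and favourite track to empty values reflects the fact that these details are not always available for an episode.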

Data complications

Hopefully, you’ve seen how easy it is to extract data from a well-structured page. The greatest difficulty is that real-world data is not represented consistently. The BBC Desert Island Discs website has over 3,000 episodes, and across them there are several variations for representing the same data. For example, sometimes the tracks don’t appear as a neat list (as above) but in the text describing the episode, like this:

Music list in description

Similarly, the luxury, book and favourite track can appear in several places. They are also denoted by different phrases, such as "Favourite", "Favourite track", "Castaway’s Favourite" and so on.
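
One way to cope with the different phrases is a single regular expression that accepts all of them. The sketch below is not taken from the program: it assumes the label starts a line and is followed by a colon or hyphen, which may not hold for every episode page, and the function name is hypothetical.

```python
import re

# Matches labels such as "Favourite:", "Favourite track:" and
# "Castaway's Favourite:" (straight or curly apostrophe), then captures
# the rest of the line as the value.
FAVOURITE_RE = re.compile(
    r"^\s*(?:castaway[’']s\s+)?favourite(?:\s+track)?\s*[:\-]\s*(?P<value>.+?)\s*$",
    re.IGNORECASE | re.MULTILINE,
)


def find_favourite(text):
    """Return the text after the first favourite-track label, or None."""
    match = FAVOURITE_RE.search(text)
    return match.group('value') if match else None
```

Keeping the optional parts in non-capturing groups means a new variant of the label can be added in one place without changing the code that uses the result.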

The program uses Python regular expressions to extract some of the data. You may want to read about them if you’re not familiar with them (although most of the regular expressions used are fairly simple).

The program does a reasonable job of extracting most information for most episodes. It tries a number of ways of extracting data if the default method fails.
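
The idea of trying several methods in turn can be sketched as a list of extractor functions that are called in order until one returns something. This is an illustration of the pattern only: the function names and the two luxury patterns are hypothetical and are not the ones used in the program.

```python
import re


def first_successful(extractors, text):
    """Try each extractor in order; return the first non-empty result."""
    for extractor in extractors:
        try:
            result = extractor(text)
        except (AttributeError, IndexError, ValueError):
            # This strategy did not fit the page; move on to the next one.
            continue
        if result:
            return result
    return ''


def luxury_from_label(text):
    # Default strategy: a line of the form "Luxury: ...".
    # Raises AttributeError when there is no such line.
    return re.search(r'^Luxury:\s*(.+)$', text, re.MULTILINE).group(1).strip()


def luxury_from_sentence(text):
    # Fallback strategy: a sentence of the form "... luxury is/was ...".
    match = re.search(r'luxury (?:is|was) (?:an? )?([^.\n]+)', text, re.IGNORECASE)
    return match.group(1).strip() if match else ''


LUXURY_EXTRACTORS = [luxury_from_label, luxury_from_sentence]
```

Calling first_successful(LUXURY_EXTRACTORS, page_text) returns an empty string when every strategy fails, which corresponds to a blank cell in the output.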

However, successfully extracting data from a website depends on the structure of the website and the pages on it. If the URLs, HTML or CSS change, the program is likely to fail, depending on how drastic the change is.

Output

The program can output the data it finds to a CSV file. By default, the delimiter is a tab. From this tab-delimited file, I’ve created an Excel spreadsheet. You can see details for all episodes of Desert Island Discs in the spreadsheet. You’ll sometimes see missing data. Where data has not been extracted, it’s usually because it has been represented in an uncommon way on the website. To check whether the program failed to extract the data or whether it is missing from the BBC website, you can click the link for the episode.
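
For reference, writing a tab-delimited file with Python’s standard csv module looks roughly like this. The column names and the row are made up for illustration, and the program’s own output writer may work differently.

```python
import csv
import io

# Hypothetical column names and a made-up row, for illustration only.
FIELDNAMES = ['castaway', 'job', 'book', 'luxury', 'url']
rows = [
    {'castaway': 'Example Castaway', 'job': 'Writer',
     'book': 'War and Peace by Leo Tolstoy', 'luxury': 'Piano',
     'url': 'https://example.org/episode'},
]


def write_episodes(rows, stream):
    """Write rows to stream as tab-delimited text with a header line."""
    writer = csv.DictWriter(stream, fieldnames=FIELDNAMES, delimiter='\t',
                            lineterminator='\n')
    writer.writeheader()
    writer.writerows(rows)


# An in-memory buffer keeps the sketch self-contained. For a real file, use
# open(path, 'w', newline='', encoding='utf-8') instead.
buffer = io.StringIO()
write_episodes(rows, buffer)
output = buffer.getvalue()
```

A tab is a convenient delimiter here because book and song titles often contain commas, which would otherwise need quoting.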

To import the output CSV file into Excel 2016:

  1. Open a new spreadsheet.
  2. Put the cursor on cell A1.
  3. Click the Data tab in the toolbar.
  4. Click From Text.
  5. Select the CSV file.
  6. In the Text Import Wizard:
    1. Select Delimited.
    2. In the File Origin, select 65001: Unicode (UTF-8).
    3. Click Next.
    4. Select Tab as the delimiter (ensure other delimiters are not selected).
    5. Click Next.
    6. Select General as the Column data format.
    7. Click Finish.
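
The same tab-delimited file can also be read in Python rather than Excel, for example to count the most popular choices. The sketch below uses made-up sample data and an assumed column name (luxury); check the header line of the real file for the actual names.

```python
import csv
import io
from collections import Counter

# Made-up sample in a tab-delimited layout; the column names are assumptions.
SAMPLE = (
    "castaway\tluxury\n"
    "Castaway A\tPiano\n"
    "Castaway B\tGuitar\n"
    "Castaway C\tPiano\n"
    "Castaway D\t\n"
)


def top_values(stream, column, n=10):
    """Count the non-empty values in one column and return the n most common."""
    reader = csv.DictReader(stream, delimiter='\t')
    values = ((row.get(column) or '').strip() for row in reader)
    counts = Counter(value for value in values if value)
    return counts.most_common(n)


# For a real file, pass open(path, newline='', encoding='utf-8') instead.
top_luxuries = top_values(io.StringIO(SAMPLE), 'luxury')
```

Rows with an empty cell are skipped, so episodes where the luxury could not be extracted do not show up as a blank entry in the counts.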

The spreadsheet on GitHub has additional formatting and hyperlinks.
