Desert Island Discs: all the records, books, and luxuries

If you are interested in just seeing the data for Desert Island Discs, you can download it here.

Note: the output has been updated since this article was first posted (4 Feb 2020), so the numbers in the summaries below are out of date. The latest data is in the output files.

Introduction

I was curious to find out more about BBC Radio 4’s Desert Island Discs but couldn’t find a spreadsheet with all the data. So I wrote a program to extract the data from the BBC website.

The top ten books and the number of times they were chosen are:

No.	Book	Number of times chosen
1	War and Peace by Leo Tolstoy	39
2	Encyclopaedia Britannica	32
3	A La Recherche Du Temps Perdu by Marcel Proust	32
4	Encyclopaedia	30
5	The History of the Decline and Fall of the Roman Empire by Edward Gibbon	18
6	The Oxford Book of English Verse	17
7	The Wind in the Willows by Kenneth Grahame	15
8	The Lord of the Rings by J R R Tolkien	14
9	The Oxford English Dictionary	13
10	The Pickwick Papers by Charles Dickens	13

Of the music played, here are the top thirty musicians:

No.	Musician	Number of times chosen
1	Wolfgang Amadeus Mozart	969
2	Ludwig van Beethoven	819
3	Johann Sebastian Bach	762
4	Franz Schubert	395
5	Giuseppe Verdi	361
6	Giacomo Puccini	335
7	Edward Elgar	328
8	George Frideric Handel	300
9	The Beatles	291
10	Peter Ilyich Tchaikovsky	271
11	Johannes Brahms	260
12	Frank Sinatra	250
13	Richard Strauss	223
14	Claude Debussy	184
15	Sergey Rachmaninov	178
16	Benjamin Britten	166
17	Maurice Ravel	161
18	Ralph Vaughan Williams	160
19	Jean Sibelius	153
20	Felix Mendelssohn	151
21	Gustav Mahler	141
22	Igor Stravinsky	123
23	Ella Fitzgerald	120
24	Édith Piaf	119
25	William Shakespeare	114
26	Bob Dylan	109
27	Antonín Dvořák	109
28	Joseph Haydn	108
29	Noël Coward	105
30	Arthur Sullivan	104

It’s interesting that classical composers fill the top ten, with The Beatles in ninth place as the only exception.

The top twenty luxuries chosen are:

No.	Luxury	Number of times chosen
1	Piano	137
2	Writing materials	64
3	Painting materials	43
4	Guitar	36
5	Bed	32
6	Radio receiver	28
7	Golf clubs and balls	28
8	Typewriter and paper	25
9	Telescope	23
10	Pen and paper	19
11	Champagne	19
12	Wine	16
13	Perfume	15
14	Tape recorder	14
15	Typewriter	13
16	Binoculars	13
17	Family photo album	10
18	Tea	9
19	Soap	9
20	Paper and pencils	9

If you want a list of all episodes, the music played, and the books, luxuries and favourite tracks chosen, you can find them here. The list is available as an Excel spreadsheet and a CSV file. The Excel spreadsheet has the full list of top books, musicians, and luxuries.

The rest of this post is about how I wrote the Python program to create the spreadsheet.

Background

I was learning Python as part of a machine learning course. This was a good opportunity to write a full program in Python with an actual use case, which would allow me to learn more Python concepts.

Choosing a web scraping framework

My first task was to find a library that would make scraping data from a website easy. Many websites list Python scraping frameworks; ScrapHero is one of them. After looking at the pros and cons, I chose BeautifulSoup 4 (BS4). It was relatively easy to learn, robust and had good documentation.

I downloaded BS4 and instantly became productive. So I stuck to it and looked no further. If your needs are more complex, you may want to look at other frameworks.

Web scraping basics

It had been a long time since I’d done any web scraping. I can’t even remember which language I used at the time.

This is the basic process for scraping data off a website:

Find the web page you want to extract data from.
Look at the structure of the page using the developer tools of your browser. Both Chrome (and derivatives) and Firefox have good developer tools.
Find the data you want on the page.
Press Ctrl+Shift+C and click the element to inspect its source code.
Work out how to navigate to that element using HTML and CSS.
Write code!

Example of scraping web data

If you prefer to learn by reading code, you can go straight to the source code on GitHub.

I’m going to use the Michael Lewis Desert Island Discs episode for this example.

Here’s a view of the tracks chosen in that episode and a view of the inspector:

On the left side of the above screenshot, you’ll see a track highlighted: I Want You Back by The Jackson 5. On the right side is the source code for it. (You may need to zoom in to read the screenshot.) By looking at the code, you can see that each track is described in a <div> with CSS class segment__track. Within the <div> is an <h3> tag containing a <span> tag (with class artist) that has the name of the artist. Below that is a <p> tag containing a <span> tag with the name of the song.

The code to extract track data is in scraper.py and the relevant methods are shown below:

def extract_tracks_from_list(self, soup):
    """
    Extract the tracks from the structured list that most episode pages have.
    """

    tracks = TrackList()
    # Each track sits in its own <div class="segment__track">
    track_elements = soup.find_all('div', class_='segment__track')
    for track_element in track_elements:
        try:
            track = self.extract_artist_and_song_from_list(track_element)
            tracks.add(track)
        except Exception as e:
            # One malformed track should not abort the whole episode
            print_error(f'Ignoring error extracting song/artist from element: {track_element}', e)

    return tracks

def extract_artist_and_song_from_list(self, track_element, class_='artist'):
    artist = ''
    song = ''
    if track_element:
        # Sometimes there is more than one artist, in which case join them with ' & '
        artist_elements = track_element.find_all('span', class_=class_)
        for e in artist_elements:
            artist += ' & ' if artist else ''
            artist += e.text

        # The song title is in the <span> inside the first <p>, when both are present
        song_element = track_element.p.span if track_element.p is not None else None
        song = song_element.text if song_element is not None else ''

    return Track(clean_string(artist), clean_string(song))

The method extract_tracks_from_list finds all the <div> elements with class segment__track and iterates over them. For each one, the method extract_artist_and_song_from_list extracts the name of the artist and the song title. The artist is extracted by looking for <span> tags of class class_ (whose value in this instance is artist); if there are several, the names are joined with " & ". The song is extracted by looking for a <p> tag containing a <span> tag (track_element.p.span.text).
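To see the navigation in isolation, here is a minimal sketch that runs the same kind of BeautifulSoup calls against a hand-written HTML fragment. The fragment is an assumption modelled on the description above, not a copy of the BBC markup. The function name extract_tracks, the plain tuples it returns and the second track's artist and song are placeholders made up for illustration.

```python
from bs4 import BeautifulSoup

# Hand-written fragment modelled on the structure described above
# (an assumption, not the BBC's actual markup).
SAMPLE_HTML = """
<div class="segment__track">
  <h3><span class="artist">The Jackson 5</span></h3>
  <p><span>I Want You Back</span></p>
</div>
<div class="segment__track">
  <h3><span class="artist">Artist A</span> <span class="artist">Artist B</span></h3>
  <p><span>Example Song</span></p>
</div>
"""


def extract_tracks(html):
    """Return a list of (artist, song) tuples found in the fragment."""
    soup = BeautifulSoup(html, 'html.parser')
    tracks = []
    for track_element in soup.find_all('div', class_='segment__track'):
        # Collect every artist <span>; join multiple artists with ' & '
        artists = [e.get_text(strip=True)
                   for e in track_element.find_all('span', class_='artist')]
        # Guard against a missing <p> or <span> rather than raising
        p_element = track_element.p
        song_element = p_element.span if p_element is not None else None
        song = song_element.get_text(strip=True) if song_element is not None else ''
        tracks.append((' & '.join(artists), song))
    return tracks


tracks = extract_tracks(SAMPLE_HTML)
```

Running the extraction on a small fixed fragment like this is also a convenient way to check that it still works after the site's markup changes.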

Program structure

The program is written in an object-oriented way. I did some fairly basic object-oriented analysis to determine the classes. In the following description of Desert Island Discs, the italicised words become classes in the program:

Desert Island Discs has people on it who are called castaways. Each castaway appears in his or her own episode. The castaway has a name and job title. Each episode has a URL on the BBC website. The castaway picks a track list, which consists of up to eight tracks. A track consists of an artist and a song. The castaway also picks a favourite track, a book and a luxury to take to the desert island. The program will require a parser to extract data from the BBC website and an output writer to display the data in a formatted way.
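As a rough illustration of how that description could map onto classes, here is a sketch using dataclasses. Track and TrackList (with its add method) appear in the scraper code above. The Castaway and Episode classes, all the attribute names and the example URL are assumptions made for this sketch and may not match the actual program.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Track:
    artist: str
    song: str


@dataclass
class TrackList:
    tracks: List[Track] = field(default_factory=list)

    def add(self, track: Track) -> None:
        self.tracks.append(track)


@dataclass
class Castaway:
    name: str
    job_title: str


@dataclass
class Episode:
    url: str
    castaway: Castaway
    track_list: TrackList = field(default_factory=TrackList)
    # These may be missing when the page does not list them
    favourite_track: Optional[Track] = None
    book: str = ''
    luxury: str = ''


# Placeholder values, not real episode data
episode = Episode(
    url='https://example.invalid/episode',
    castaway=Castaway(name='Example Castaway', job_title='Example job'),
)
episode.track_list.add(Track('The Jackson 5', 'I Want You Back'))
```

Giving the optional fields defaults means an episode can still be built when some data could not be extracted.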

Data complications

Hopefully, you’ve seen how easy it is to extract data from a well-structured page. The greatest difficulty is that real-world data is not represented consistently. The BBC Desert Island Discs website has over 3,000 episodes, and there are several variations for representing the same data across them. For example, sometimes the tracks don’t appear as a nice list (as above) but instead in the text describing the episode, like this:

Similarly, the luxury, book and favourite track can appear in several places. They are also denoted by different phrases, such as "Favourite", "Favourite track", "Castaway’s Favourite" and so on.
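One regular expression can cover several label variants at once. The sketch below handles the three phrases quoted above. It assumes the label is followed by a colon and the value runs to the end of the line, which may not hold for every page. The names FAVOURITE_PATTERN and find_favourite are made up for this example.

```python
import re

# Matches "Favourite", "Favourite track" and "Castaway's Favourite",
# followed by a colon (the colon is an assumption about the page text).
FAVOURITE_PATTERN = re.compile(
    r"(?:Castaway[’']s\s+)?Favourite(?:\s+track)?\s*:\s*(?P<value>.+)",
    re.IGNORECASE,
)


def find_favourite(text):
    """Return the text after a 'Favourite...' label, or '' if none is found."""
    match = FAVOURITE_PATTERN.search(text)
    return match.group('value').strip() if match else ''
```

For instance, find_favourite('Favourite track: I Want You Back by The Jackson 5') returns the part after the colon, and text without any of the labels gives an empty string.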

To extract some data, Python regular expressions are used. You may want to read about them if you’re not familiar (although most of the regular expressions used are fairly simple).

The program does a reasonable job of extracting most information for most episodes. It tries a number of ways of extracting data if the default method fails.
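The idea of trying several approaches in turn can be sketched as a list of small strategy functions, each returning None when it finds nothing. The two strategies, their patterns and all the names below are hypothetical and only illustrate the fallback pattern; they are not taken from the actual program.

```python
import re


def from_labelled_line(text):
    """Default strategy: a line of the form 'Luxury: <value>'."""
    match = re.search(r'^Luxury\s*:\s*(.+)$', text, re.MULTILINE | re.IGNORECASE)
    return match.group(1).strip() if match else None


def from_sentence(text):
    """Fallback strategy: a phrase such as 'luxury item was a telescope.'"""
    match = re.search(
        r'luxury(?: item)? (?:is|was) (?:an? |the )?([^.\n]+)',
        text,
        re.IGNORECASE,
    )
    return match.group(1).strip() if match else None


# Strategies are tried in order; the first non-empty result wins
LUXURY_STRATEGIES = (from_labelled_line, from_sentence)


def extract_luxury(text):
    for strategy in LUXURY_STRATEGIES:
        value = strategy(text)
        if value:
            return value
    return ''  # nothing matched; leave the field blank
```

Adding support for a newly discovered page variation then only means writing another strategy function and appending it to the tuple.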

However, successfully extracting data from a website depends on the structure of the website and the pages on it. If the URLs, HTML or CSS change, the program is likely to fail, depending on how drastic the change is.

Output

The program can output the data it finds in a CSV file. By default, the delimiter is a tab. From this tab-delimited file, I’ve created an Excel spreadsheet. You can see details for all episodes of Desert Island Discs in the spreadsheet. You’ll sometimes see missing data. Where data has not been extracted, it’s usually because it has been represented in an uncommon way on the website. To check if the program didn’t extract the data or if it’s missing on the BBC website, you can click the link for the episode.
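Writing a tab-delimited file needs only Python's built-in csv module. The sketch below is an assumption about how such output could be produced, not the program's actual writer. The column names and the row values are placeholders, and it writes to an in-memory buffer so that it runs without touching the disk.

```python
import csv
import io

# Hypothetical column names; the real output has its own set of columns
FIELDNAMES = ['castaway', 'job_title', 'url', 'book', 'luxury']


def write_rows(rows, out_file):
    """Write dictionaries as tab-delimited rows with a header line."""
    writer = csv.DictWriter(
        out_file,
        fieldnames=FIELDNAMES,
        delimiter='\t',
        lineterminator='\n',
    )
    writer.writeheader()
    writer.writerows(rows)


# Placeholder row, not real episode data
rows = [{
    'castaway': 'Example Castaway',
    'job_title': 'Example job',
    'url': 'https://example.invalid/episode',
    'book': 'War and Peace by Leo Tolstoy',
    'luxury': 'Piano',
}]

buffer = io.StringIO()
write_rows(rows, buffer)
output = buffer.getvalue()
```

To write a real file instead, pass a file opened with open(path, 'w', newline='', encoding='utf-8'). The UTF-8 encoding matches the File Origin setting used in the Excel import steps.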

To import the output CSV file into Excel 2016:

Open a new spreadsheet.
Put the cursor on cell A1.
Click the Data tab in the toolbar.
Click From Text.
Select the CSV file.
In the Text Import Wizard:
1. Select Delimited.
2. In the File Origin, select 65001: Unicode (UTF-8).
3. Click Next.
4. Select Tab as the delimiter (ensure other delimiters are not selected).
5. Click Next.
6. Select General as the Column data format.
7. Click Finish.

The spreadsheet on GitHub also has extra formatting and hyperlinks added.
