Basic Britishpoliticalspeech.org Scraper (CSV)

This Python-based scraper collects speeches by UK political leaders from Britishpoliticalspeech.org. When run in full, it writes a CSV file containing basic metadata about each speech together with the speech text. The file can then be used for further analysis, for instance with tools from the pandas library.

import sys
import csv
import requests
import re
from bs4 import BeautifulSoup

# This function loads a webpage
def load_page(url):
    with requests.get(url) as f:
        page = f.text
    return page
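
The loader above returns whatever the server sends back, including error pages. A slightly more defensive variant is sketched below: it sets a timeout and raises an exception on HTTP error codes. The name load_page_checked, the session parameter and the 10-second default are illustrative assumptions and not part of the original scraper.

```python
import requests

def load_page_checked(url, session=None, timeout=10):
    """Fetch a page, failing loudly on timeouts and HTTP error codes.

    Hypothetical variant of load_page(); `session` and `timeout` are
    assumptions added for illustration.
    """
    session = session or requests.Session()
    response = session.get(url, timeout=timeout)
    response.raise_for_status()   # raises requests.HTTPError on 4xx/5xx responses
    return response.text
```

Passing a session in also makes the function easy to exercise without network access, since any object with a compatible get() method can stand in for it.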

Locate the Data

Here we define two functions. The first, get_speech_data(), extracts metadata from the main content table of the archive. The second, get_speech(), reads the individual speech pages linked from that table.

From the main content table we extract data on:

name of the speech
date on which the speech was held
party to which the speaker of the speech belonged
the hyperlink to the specific speech page

In addition, every speech gets an id built from the party name and its row number in the table (for example Conservative_0).
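
As a minimal sketch of how the id and the link can be assembled: the helper names make_id and make_link and the example href are made up for illustration, and urljoin is used here in place of plain string concatenation because it copes with hrefs that start with a slash.

```python
from urllib.parse import urljoin

BASE_URL = 'http://britishpoliticalspeech.org/'

def make_id(party, count):
    """Combine the party name and the row counter into one id string."""
    return party + '_' + str(count)

def make_link(href):
    """Resolve an href taken from the table against the site root."""
    return urljoin(BASE_URL, href)

print(make_id('Conservative', 0))
print(make_link('speech-archive.htm?speech=1'))    # hypothetical href
print(make_link('/speech-archive.htm?speech=1'))   # a leading slash resolves to the same URL
```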

def get_speech_data(url):
    content_page = BeautifulSoup(load_page(url), 'lxml')       #Open the webpage
    if not content_page:                                            
        print('Something went wrong!', file=sys.stderr)
        sys.exit()
    data = []
    for count, row in enumerate(content_page.find_all('tr')[2:]): #Find the data we are looking for
        dates = row.find_all('td')[0]
        parties = row.find_all('td')[1]
        speakers = row.find_all('td')[2]
        speech = row.find_all('td')[3]
        link = row.find('a').get('href')
        data.append({                               #Add the data to a dictionary
            'id' : parties.text + '_' + str(count),
            'date': dates.text,
            'party': parties.text,
            'name speech': speech.text,
            'link': 'http://britishpoliticalspeech.org/' + link
        })
    return data 
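
The row-by-row extraction can be tried offline on a small table. The sketch below uses made-up sample HTML (the real archive page may be structured differently), looks up the cells of each row once, and skips rows that lack the expected cells or link. The function name parse_rows and the skip parameter are assumptions for illustration; the built-in 'html.parser' is used so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Made-up sample table; the real archive page may be structured differently.
SAMPLE_HTML = """
<table>
  <tr><th>Date</th><th>Party</th><th>Speaker</th><th>Speech</th></tr>
  <tr><td>03/10/2018</td><td>Conservative</td><td>May</td>
      <td><a href="speech.htm?id=1">Leader's speech, Birmingham 2018</a></td></tr>
  <tr><td>26/09/2018</td><td>Labour</td><td>Corbyn</td>
      <td><a href="speech.htm?id=2">Leader's speech, Liverpool 2018</a></td></tr>
</table>
"""

def parse_rows(html, skip=1):
    """Return one dictionary per table row, skipping the first `skip` rows."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for count, row in enumerate(soup.find_all('tr')[skip:]):
        cells = row.find_all('td')          # look up the cells once per row
        anchor = row.find('a')
        if len(cells) < 4 or anchor is None:
            continue                        # not a data row
        party = cells[1].get_text(strip=True)
        rows.append({
            'id': party + '_' + str(count),
            'date': cells[0].get_text(strip=True),
            'party': party,
            'name speech': cells[3].get_text(strip=True),
            'link': anchor.get('href'),
        })
    return rows

for entry in parse_rows(SAMPLE_HTML):
    print(entry)
```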

From the specific speech page we extract data on:

the full speech text
the name of the speaker (collected here as it was incomplete in the main content table list)
the location in which the speech was held (easier to scrape here)

For speeches whose text is empty or has been removed for copyright reasons, this function returns an empty dictionary, so no speech text, speaker or location is stored for them.

def get_speech(url):
    speech_page = BeautifulSoup(load_page(url), 'lxml')                  #Open the webpage
    interesting_html = (speech_page.find(class_='speech-content').text.strip()
        .replace('\xa0\n', '').replace('\n','').replace('\x85','').replace('\u2011','')) #Find the full text of the speech
    skip_check = 'Owing to a copyright issue this speech has been removed.' #Notice shown in place of a removed speech; used below to decide whether to skip the page
    speaker_html = speech_page.find(class_='speech-speaker').text.strip().split('(', 1)[0] #Find the speaker of the speech
    location_html = speech_page.find(class_='speech-location').text.strip() #Find the location at which the speech was held
    if 'Location: ' in location_html:
        location_html = location_html.replace('Location: ', '')
    if not interesting_html or skip_check in interesting_html: # or not speaker_html or not location_html don't really care about not finding these
        #print('Skipped - No information available for {}'.format(url), file=sys.stderr)
        return {}                                                      
    return {'speech' : interesting_html, 'speaker' : speaker_html, 'location' : location_html} #Add the data to a dictionary
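
The string clean-up steps in get_speech() can be checked in isolation. The helpers below are a sketch with hypothetical names, and the input strings are invented examples of what the page elements might contain. One deliberate difference: clean_speaker trims whitespace after splitting, so no trailing space is left before the bracket.

```python
def clean_speaker(raw):
    """Keep the text before the first '(' and trim surrounding whitespace."""
    return raw.split('(', 1)[0].strip()

def clean_location(raw):
    """Drop a leading 'Location: ' label if it is present."""
    prefix = 'Location: '
    raw = raw.strip()
    return raw[len(prefix):] if raw.startswith(prefix) else raw

def clean_speech(raw):
    """Strip the text and remove the same characters as get_speech() does."""
    raw = raw.strip()
    for junk in ('\xa0\n', '\n', '\x85', '\u2011'):
        raw = raw.replace(junk, '')
    return raw

print(clean_speaker('Theresa May (Conservative)'))   # invented input format
print(clean_location('Location: Birmingham'))
print(clean_speech('Thank you.\nConference,\x85 we meet'))
```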

Scraping the Data

The following code applies the functions defined above to scrape the data and then writes the result to a CSV file.

#This code applies the scraping functions
index_url = 'http://britishpoliticalspeech.org/speech-archive.htm'         # Contains the list of speeches
speech_data = get_speech_data(index_url)                      # Get speech metadata
for row in speech_data:
    #print('Scraping info on {}.'.format(row['name speech'])) # Might be useful for debugging
    url = row['link']
    speech_info = get_speech(url)                    # Gets the speeches themselves
    for key, value in speech_info.items():
        row[key] = value                              # Add the new data to our dictionary
print('Done scraping!')

Done scraping!
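
The inner loop that copies keys from speech_info into row has the same effect as dict.update(). The short sketch below uses invented sample rows and a hypothetical helper name, and also shows what happens when get_speech() returns an empty dictionary: the row keeps only its metadata.

```python
def merge_speech(row, speech_info):
    """Copy every key from speech_info into row and return the row."""
    row.update(speech_info)
    return row

# Invented sample data for illustration.
full = merge_speech({'id': 'Labour_1'}, {'speech': 'Thank you...', 'speaker': 'Jeremy Corbyn'})
skipped = merge_speech({'id': 'Labour_2'}, {})   # speech page was skipped

print(full)
print(skipped)
```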

#This code writes the data in the dictionary in a csv file
with open('metadata.csv', 'w', encoding='utf-8', newline='') as f:   # Open a csv file for writing; the csv module expects newline=''
    fieldnames=['id','speaker', 'party', 'location', 'date', 'name speech',
                'speech']                                 # These are the values we want to store
    writer = csv.DictWriter(f,
                            delimiter=',',                # Common delimiter
                            quotechar='"',                # Common quote character
                            quoting=csv.QUOTE_NONNUMERIC, # Make sure that all strings are quoted
                            fieldnames=fieldnames
                            )
    writer.writeheader()                                  # Create headers in our csv file
    for row in speech_data:
        writer.writerow({k:v for k,v in row.items() if k in fieldnames})
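
An alternative to filtering each row by hand is to let DictWriter drop unwanted keys itself. The sketch below writes to an in-memory buffer rather than metadata.csv and uses invented rows. It relies on two documented DictWriter options: extrasaction='ignore' discards keys that are not in fieldnames, and restval supplies the value for missing keys, such as the fields of a skipped speech.

```python
import csv
import io

fieldnames = ['id', 'speaker', 'speech']
rows = [
    {'id': 'Conservative_0', 'speaker': 'Theresa May',
     'speech': 'Thank you...', 'link': 'not written'},
    {'id': 'Labour_1'},                      # a row whose speech page was skipped
]

buffer = io.StringIO(newline='')             # stands in for the output file
writer = csv.DictWriter(buffer,
                        fieldnames=fieldnames,
                        quoting=csv.QUOTE_NONNUMERIC,
                        extrasaction='ignore',   # silently drop keys such as 'link'
                        restval='')              # value used for missing keys
writer.writeheader()
writer.writerows(rows)

buffer.seek(0)
read_back = list(csv.DictReader(buffer))     # read the rows back to inspect them
print(read_back)
```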

Reading the Metadata

In this last part, run the following code to load the metadata CSV file into a pandas table and check that the data has been scraped properly.

import pandas as pd

df = pd.read_csv('metadata.csv')
df

	id	speaker	party	location	date	name speech	speech
0	Conservative_0	Theresa May	Conservative	Birmingham	03/10/2018	Leader's speech, Birmingham 2018	Thank you very much for that warm welcome. You...
1	Labour_1	Jeremy Corbyn	Labour	Liverpool	26/09/2018	Leader's speech, Liverpool 2018	Thank you for that welcome. I want to start by...
2	Liberal Democrat_2	Vince Cable	Liberal Democrat	Brighton	18/09/2018	Leader's speech, Brighton 2018	Conference, we meet at an absolutely crucial m...
3	Conservative_3	Theresa May	Conservative	Manchester	04/10/2017	Leader's speech, Manchester 2017	A little over forty years ago in a small villa...
4	Labour_4	Jeremy Corbyn	Labour	Brighton	27/09/2017	Leader's speech, Brighton 2017	We meet here this week as a united Party, adva...
...	...	...	...	...	...	...	...
357	Liberal_357	Sir Henry Campbell-Bannerman	Liberal	Hull	08/03/1899	Leader's speech, Hull 1899	Sir James Reckitt, ladies and gentlemen, I am ...
358	Conservative_358	Lord Salisbury	Conservative	London	16/11/1897	Leader's speech, London 1897	My Lord Derby, my lords, ladies and gentlemen,...
359	Liberal_359	Sir William Harcourt	Liberal	Norwich	17/03/1897	Leader's speech, Norwich 1897	My Lords and Gentlemen, - I will say ‘My lords...
360	Liberal_360	Earl of Rosebery	Liberal	Huddersfield	27/03/1896	Leader's speech, Huddersfield 1896	Mr. Walker, ladies and gentlemen. It is very ...
361	Liberal_361	Earl of Rosebery	Liberal	Cardiff	18/01/1895	Leader's speech, Cardiff 1895	Mr. Bird, ladies and gentlemen, - I am deeply ...

362 rows × 7 columns
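
As a small sketch of the kind of further analysis pandas allows, the dates in the table above are written day first, so they can be parsed with an explicit format and then used to derive a year column or to count speeches per party. The frame below rebuilds only the party and date columns of the first three rows shown above; the variable names are illustrative.

```python
import pandas as pd

# Party and date columns of the first three rows in the table above.
sample = pd.DataFrame({
    'party': ['Conservative', 'Labour', 'Liberal Democrat'],
    'date': ['03/10/2018', '26/09/2018', '18/09/2018'],
})

# Parse day/month/year strings into proper timestamps.
sample['date'] = pd.to_datetime(sample['date'], format='%d/%m/%Y')
sample['year'] = sample['date'].dt.year

# Number of speeches per party.
speeches_per_party = sample['party'].value_counts()

print(sample)
print(speeches_per_party)
```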

Stylometric Analysis on Churchill's Political Speeches (SAPS)
