Basic Britishpoliticalspeech.org Scraper (CSV)
This Python-based scraper collects speeches by UK political leaders from Britishpoliticalspeech.org. When run in full, it outputs a CSV file containing basic metadata about each speech together with the speech text. The file can then be used for further analysis, for instance with tools from the pandas library.
import sys
import csv
import requests
import re
from bs4 import BeautifulSoup

# This function loads a webpage and returns its HTML as text
def load_page(url):
    with requests.get(url) as f:
        page = f.text
    return page
Locate the Data
Here we define two functions. The first, get_speech_data(), extracts metadata from the main content table of the archive. The second, get_speech(), looks at the specific speech pages linked from that table.
From the main content table we extract data on:
name of the speech
date on which the speech was held
party to which the speaker of the speech belonged
the hyperlink to the specific speech page
Additionally, an id built from the party name and the row's position in the table is added to every speech.
def get_speech_data(url):
    content_page = BeautifulSoup(load_page(url), 'lxml') #Open the webpage
    rows = content_page.find_all('tr')[2:] #Find the data we are looking for (the first two rows are skipped)
    if not rows: #A parsed page is always truthy, so check for table rows instead
        print('Something went wrong!', file=sys.stderr)
        sys.exit(1)
    data = []
    for count, row in enumerate(rows):
        cells = row.find_all('td')
        dates = cells[0]
        parties = cells[1]
        speakers = cells[2] #Not stored: the speaker is taken from the speech page instead
        speech = cells[3]
        link = row.find('a').get('href')
        data.append({ #Add the data to a dictionary
            'id' : parties.text + '_' + str(count),
            'date': dates.text,
            'party': parties.text,
            'name speech': speech.text,
            'link': 'http://britishpoliticalspeech.org/' + link
        })
    return data
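The record built for each table row can be sketched without any network access. This is a minimal illustration and not the notebook's own code: `build_record` is a hypothetical helper, the sample href (including its speech number) is made up, and `urljoin` is swapped in for plain string concatenation so that an href with a leading slash would not produce a double slash.

```python
from urllib.parse import urljoin

BASE_URL = 'http://britishpoliticalspeech.org/'

def build_record(count, date, party, title, href):
    """Assemble one metadata record from already-extracted cell strings."""
    party = party.strip()
    return {
        'id': f'{party}_{count}',          # party name plus row position
        'date': date.strip(),
        'party': party,
        'name speech': title.strip(),
        'link': urljoin(BASE_URL, href),   # resolves relative and rooted hrefs
    }

# Hypothetical cell values; the href is invented for illustration only
record = build_record(0, '03/10/2018', ' Conservative ',
                      "Leader's speech, Birmingham 2018",
                      '/speech-archive.htm?speech=364')
print(record['id'], record['link'])
```

Stripping the cell text before building the id is a design choice of this sketch; the function above uses the raw cell text.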
From the specific speech page we extract data on:
the full speech text
the name of the speaker (collected here as it was incomplete in the main content table list)
the location in which the speech was held (easier to scrape here)
In this function we skip speeches in which the speech text is not available due to copyright.
def get_speech(url):
    speech_page = BeautifulSoup(load_page(url), 'lxml') #Open the webpage
    interesting_html = (speech_page.find(class_='speech-content').text.strip()
                        .replace('\xa0\n', '').replace('\n','').replace('\x85','').replace('\u2011','')) #Find the full text of the speech
    skip_check = 'Owing to a copyright issue this speech has been removed.' #Notice shown for removed speeches; these are skipped
    speaker_html = speech_page.find(class_='speech-speaker').text.strip().split('(', 1)[0].strip() #Find the speaker of the speech (text before any bracket)
    location_html = speech_page.find(class_='speech-location').text.strip() #Find the location at which the speech was held
    if 'Location: ' in location_html:
        location_html = location_html.replace('Location: ', '')
    if not interesting_html or skip_check in interesting_html: #A missing speaker or location is not a reason to skip
        #print('Skipped - No information available for {}'.format(url), file=sys.stderr)
        return {}
    return {'speech' : interesting_html, 'speaker' : speaker_html, 'location' : location_html} #Add the data to a dictionary
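The string handling inside get_speech() can be tried on its own, without fetching a page. The sketch below is an assumption-laden illustration: the three helper names are hypothetical, and the sample inputs are invented strings shaped like the ones the function expects to find.

```python
SKIP_NOTICE = 'Owing to a copyright issue this speech has been removed.'

def clean_speaker(raw):
    """Keep only the text before the first bracket, without stray spaces."""
    return raw.split('(', 1)[0].strip()

def clean_location(raw):
    """Drop a leading 'Location: ' label if one is present."""
    raw = raw.strip()
    prefix = 'Location: '
    return raw[len(prefix):] if raw.startswith(prefix) else raw

def is_usable(speech_text):
    """A speech is usable when it has text and is not the copyright notice."""
    return bool(speech_text) and SKIP_NOTICE not in speech_text

# Invented sample inputs
print(clean_speaker('Theresa May (Conservative)'))
print(clean_location(' Location: Birmingham '))
print(is_usable(SKIP_NOTICE))
```

Unlike a plain replace(), `clean_location` only removes the label at the start of the string, which is a slightly stricter choice made for this sketch.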
Scraping the Data
The following code applies the functions defined above to scrape the desired data and then writes the output to a CSV file.
#This code applies the scraping functions
index_url = 'http://britishpoliticalspeech.org/speech-archive.htm' # Contains the list of speeches
speech_data = get_speech_data(index_url) # Get speech metadata
for row in speech_data:
    #print('Scraping info on {}.'.format(row['name speech'])) # Might be useful for debugging
    url = row['link']
    speech_info = get_speech(url) # Gets the speeches themselves
    for key, value in speech_info.items():
        row[key] = value # Add the new data to our dictionary
print('Done scraping!')
Done scraping!
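The merge step in the loop above can be exercised offline with a stand-in for get_speech(). This is only a sketch: `enrich` and `fake_fetch` are hypothetical names, the example.org links are placeholders, and the optional `delay` between requests is an addition of this sketch, not something the original loop does.

```python
import time

def enrich(rows, fetch, delay=0.0):
    """Merge fetch(row['link']) into each row; return how many rows got no text."""
    skipped = 0
    for row in rows:
        info = fetch(row['link'])
        if not info:          # an empty dict means the speech was skipped
            skipped += 1
        row.update(info)      # same effect as copying key by key
        if delay:
            time.sleep(delay) # optional pause between requests
    return skipped

def fake_fetch(url):
    # Hypothetical stand-in for get_speech() so the sketch runs without a network
    if url.endswith('1'):
        return {'speech': 'Thank you.', 'speaker': 'A. Speaker', 'location': 'Hull'}
    return {}

rows = [
    {'id': 'Liberal_0', 'link': 'http://example.org/1'},
    {'id': 'Labour_1', 'link': 'http://example.org/2'},
]
skipped = enrich(rows, fake_fetch)
print(skipped, rows[0]['speaker'])
```

Counting the skipped rows makes it easy to see how many speeches ended up without text before the CSV is written.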
#This code writes the data in the dictionary to a csv file
with open('metadata.csv', 'w', encoding='utf-8', newline='') as f: # Open a csv file for writing (newline='' as the csv module recommends)
    fieldnames=['id','speaker', 'party', 'location', 'date', 'name speech',
                'speech'] # These are the values we want to store
    writer = csv.DictWriter(f,
                            delimiter=',', # Common delimiter
                            quotechar='"', # Common quote character
                            quoting=csv.QUOTE_NONNUMERIC, # Make sure that all strings are quoted
                            fieldnames=fieldnames
                            )
    writer.writeheader() # Create headers in our csv file
    for row in speech_data:
        writer.writerow({k:v for k,v in row.items() if k in fieldnames})
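To see how the writer treats extra keys (such as 'link') and rows whose speech was skipped, the same DictWriter settings can be tried against an in-memory buffer. This is a sketch under assumptions: `rows_to_csv` is a hypothetical helper, the sample rows are invented, and `extrasaction='ignore'` replaces the dictionary comprehension used above.

```python
import csv
import io

FIELDNAMES = ['id', 'speaker', 'party', 'location', 'date', 'name speech', 'speech']

def rows_to_csv(rows):
    """Serialise rows to CSV text, ignoring extra keys and blanking missing ones."""
    buffer = io.StringIO(newline='')
    writer = csv.DictWriter(buffer,
                            fieldnames=FIELDNAMES,
                            quoting=csv.QUOTE_NONNUMERIC,  # quote every string
                            extrasaction='ignore',         # drop keys such as 'link'
                            restval='')                    # fill absent fields
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Invented sample rows: one complete, one without speech details
sample_rows = [
    {'id': 'Conservative_0', 'speaker': 'Theresa May', 'party': 'Conservative',
     'location': 'Birmingham', 'date': '03/10/2018',
     'name speech': "Leader's speech, Birmingham 2018",
     'speech': 'Thank you, "friends".', 'link': 'http://example.org/1'},
    {'id': 'Example_1', 'party': 'Example', 'date': '01/01/1900',
     'name speech': 'Example speech', 'link': 'http://example.org/2'},
]
csv_text = rows_to_csv(sample_rows)
parsed = list(csv.DictReader(io.StringIO(csv_text, newline='')))
print(parsed[0]['speech'])
```

Reading the text back with DictReader confirms that commas and quotation marks inside a speech survive the round trip.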
Reading the Metadata
In this last part, you can run the following code to display a tabular overview (with pandas) of the data stored in the metadata CSV file and check that the metadata has been scraped properly.
import pandas as pd
df = pd.read_csv('metadata.csv')
df
 | id | speaker | party | location | date | name speech | speech |
---|---|---|---|---|---|---|---|
0 | Conservative_0 | Theresa May | Conservative | Birmingham | 03/10/2018 | Leader's speech, Birmingham 2018 | Thank you very much for that warm welcome. You... |
1 | Labour_1 | Jeremy Corbyn | Labour | Liverpool | 26/09/2018 | Leader's speech, Liverpool 2018 | Thank you for that welcome. I want to start by... |
2 | Liberal Democrat_2 | Vince Cable | Liberal Democrat | Brighton | 18/09/2018 | Leader's speech, Brighton 2018 | Conference, we meet at an absolutely crucial m... |
3 | Conservative_3 | Theresa May | Conservative | Manchester | 04/10/2017 | Leader's speech, Manchester 2017 | A little over forty years ago in a small villa... |
4 | Labour_4 | Jeremy Corbyn | Labour | Brighton | 27/09/2017 | Leader's speech, Brighton 2017 | We meet here this week as a united Party, adva... |
... | ... | ... | ... | ... | ... | ... | ... |
357 | Liberal_357 | Sir Henry Campbell-Bannerman | Liberal | Hull | 08/03/1899 | Leader's speech, Hull 1899 | Sir James Reckitt, ladies and gentlemen, I am ... |
358 | Conservative_358 | Lord Salisbury | Conservative | London | 16/11/1897 | Leader's speech, London 1897 | My Lord Derby, my lords, ladies and gentlemen,... |
359 | Liberal_359 | Sir William Harcourt | Liberal | Norwich | 17/03/1897 | Leader's speech, Norwich 1897 | My Lords and Gentlemen, - I will say ‘My lords... |
360 | Liberal_360 | Earl of Rosebery | Liberal | Huddersfield | 27/03/1896 | Leader's speech, Huddersfield 1896 | Mr. Walker, ladies and gentlemen. It is very ... |
361 | Liberal_361 | Earl of Rosebery | Liberal | Cardiff | 18/01/1895 | Leader's speech, Cardiff 1895 | Mr. Bird, ladies and gentlemen, - I am deeply ... |
362 rows × 7 columns
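Beyond eyeballing the table, a few quick checks can confirm that the file is well formed. The sketch below uses only the standard library instead of pandas; `summarise` is a hypothetical helper, the in-memory sample stands in for metadata.csv, and its third row is invented to show how a missing speech is counted. The date format `%d/%m/%Y` is assumed from the values shown in the table.

```python
import csv
import io
from collections import Counter
from datetime import datetime

def summarise(csv_file):
    """Return speeches per party, rows without text, and the date range."""
    parties = Counter()
    missing_text = 0
    dates = []
    for row in csv.DictReader(csv_file):
        parties[row['party']] += 1
        if not row['speech']:
            missing_text += 1
        # Dates are assumed to be day/month/year, as in the table above
        dates.append(datetime.strptime(row['date'], '%d/%m/%Y').date())
    return parties, missing_text, min(dates), max(dates)

# Illustrative stand-in for metadata.csv; the last row is invented
sample = io.StringIO(
    '"id","speaker","party","location","date","name speech","speech"\r\n'
    '"Conservative_0","Theresa May","Conservative","Birmingham","03/10/2018",'
    '"Leader\'s speech, Birmingham 2018","Thank you very much"\r\n'
    '"Liberal_361","Earl of Rosebery","Liberal","Cardiff","18/01/1895",'
    '"Leader\'s speech, Cardiff 1895","Mr. Bird, ladies and gentlemen"\r\n'
    '"Example_2","","Example","","01/01/1900","Example speech",""\r\n'
)
parties, missing_text, oldest, newest = summarise(sample)
print(parties, missing_text, oldest, newest)
```

To run it on the real file, pass `open('metadata.csv', encoding='utf-8', newline='')` instead of the sample buffer; the function expects at least one data row.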