2021 Presidential Elections: using GeoJSON and Wikipedia

Introduction

While open data is increasingly made available by public and private institutions, it’s not always possible to use it for specific purposes. In the case of elections, it’s not uncommon for the final, officially approved data to take months (if not years) to be published, which impairs the ability to react quickly to ongoing political events.

One possibility is to manually collect information as it is made available, and this is where Wikipedia can become a very useful source of structured data: Wikipedia editors and volunteers create and update voting data to enrich the articles on the election process. This data is not available in a “ready to consume” form but it is structured.

In this short notebook we will see how to find the right data, process it and then use it to visualise results. For this we will use GeoPandas to display voting boundaries based on a GeoJSON file, enrich a GeoDataFrame with the extracted election results and analyse the voting patterns using a choropleth map.

Identifying and obtaining the data

The article on the 2021 Presidential Elections in Portugal is available at https://pt.wikipedia.org/wiki/Eleições_presidenciais_portuguesas_de_2021 ; we will be using the Portuguese Wikipedia article since it is the one that has the updated information in table form.

There are several ways to present the data; in this notebook we will use the municipality (concelho) as the unit of analysis, since that’s the smallest one available on the page. (The smallest administrative unit overall, the parish (freguesia), sits directly below it: each concelho is composed of one or more parishes.)

[Image: the results-by-municipality table in the Wikipedia article]

Checking the page source we can find the relevant table and observe the markup that identifies it.

[Image: the page source around the table]

Specifically, we see that the table comes after a span with an id of Resultados_por_concelho.

With that information we now proceed to get the entire Wikipedia article; using BeautifulSoup we can parse it and turn it into a structured object (a parse tree) that is much easier to navigate than the raw HTML. For example, getting the title:

import requests
from bs4 import BeautifulSoup

wurl = "https://pt.wikipedia.org/wiki/Elei%C3%A7%C3%B5es_presidenciais_portuguesas_de_2021"
session = requests.Session()
wpage = session.get(wurl, timeout=10)
## Stop here if the request failed instead of parsing an error page
wpage.raise_for_status()
soup = BeautifulSoup(wpage.text, 'html.parser')
soup.title
<title>Eleições presidenciais portuguesas de 2021 – Wikipédia, a enciclopédia livre</title>

We make use of that to search for our table: the one that immediately follows a span with the specific id we have found before:

table = soup.find('span',{"id": "Resultados_por_concelho"}).find_next("table")
table["class"]
['wikitable']
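
The lookup can be tried offline on a tiny HTML fragment. The fragment below is made up for illustration and is not the real article markup, and table_after_anchor is a hypothetical helper. The sketch also guards against a missing anchor: find returns None when nothing matches, and chaining find_next on it would raise an AttributeError.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that mimics the layout described above:
# a heading span with a known id, followed later by the table we want.
html = """
<h2><span id="Resultados_por_concelho">Resultados por concelho</span></h2>
<p>Some text between the heading and the table.</p>
<table class="wikitable"><tr><th>Concelhos</th></tr></table>
"""


def table_after_anchor(soup, anchor_id):
    """Return the first table that follows the span with the given id, or None."""
    anchor = soup.find("span", {"id": anchor_id})
    if anchor is None:
        return None
    return anchor.find_next("table")


demo_soup = BeautifulSoup(html, "html.parser")
found = table_after_anchor(demo_soup, "Resultados_por_concelho")
missing = table_after_anchor(demo_soup, "no_such_id")
print(found["class"], missing)
```

If the article is ever restructured and the id changes, the helper returns None instead of failing halfway through the chain.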

The HTML table is “structured”, albeit not the easiest format to immediately consume: we must turn the rows and the cells into a tabular format. The following code achieves that: it identifies the header rows by the number of cells, builds the header (what will be the first row), and then adds the results of all municipalities for all candidates. The result is a list of dictionaries, one for each municipality.

import collections

header = []
results_list = []
for row in table.find_all('tr'):
    votes_dict = collections.OrderedDict()
    cells = row.find_all(["td", "th"])
    ## First header row (3 cells): keep the column names, drop the "%" label
    if len(cells) == 3:
        for cell in cells:
            t = cell.find(string=True).rstrip()
            header.append(t)
        header.remove("%")
    ## Second header row (7 cells): one column per candidate
    if len(cells) == 7:
        for cell in cells:
            t = cell.find(string=True).rstrip()
            if t != "":
                header.append(t)
        ## "Votantes" is the last data column, so move it to the end
        header.append(header.pop(header.index('Votantes')))
    ## Data row (9 cells): one value per header entry
    if len(cells) == 9:
        for cell, head in zip(cells, header):
            ## Decimal comma -> decimal point so the values can be parsed later
            t = cell.find(string=True).rstrip().replace(",", ".")
            votes_dict[head] = t
        results_list.append(votes_dict)

## Show the first entry of the list
results_list[0]
OrderedDict([('Concelhos', 'Abrantes'),
             ('MRS', '61.34'),
             ('AG', '9.63'),
             ('AV', '15.13'),
             ('JF', '4.76'),
             ('MM', '4.72'),
             ('TMG', '1.68'),
             ('VS', '2.74'),
             ('Votantes', '14 896')])
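
The header bookkeeping is easier to follow in isolation. The sketch below replays it on plain lists; the cell texts of the two header rows are an assumption inferred from the code and the output above, not copied from the page.

```python
# Assumed texts of the 3-cell header row and the 7-cell candidate row.
first_header_row = ["Concelhos", "%", "Votantes"]
second_header_row = ["MRS", "AG", "AV", "JF", "MM", "TMG", "VS"]

header = []

# 3-cell row: keep everything except the "%" group label.
header.extend(first_header_row)
header.remove("%")

# 7-cell row: add one column per candidate, skipping empty cells.
header.extend(text for text in second_header_row if text != "")

# "Votantes" was added before the candidates but is the last data column,
# so move it to the end to line up with the 9-cell data rows.
header.append(header.pop(header.index("Votantes")))

print(header)
```

The final order matters because the data rows are paired with the header positionally through zip.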

This format - list of dictionaries - is particularly suited for what we will do next: the creation of a pandas dataframe:

import pandas as pd
res_df = pd.DataFrame(results_list)
res_df.head()
            Concelhos    MRS    AG     AV     JF    MM   TMG    VS  Votantes
0            Abrantes  61.34  9.63  15.13   4.76  4.72  1.68  2.74    14 896
1              Águeda  67.58  9.55  11.98   2.25  3.15  2.63  2.87    16 834
2     Aguiar da Beira  68.11  6.19  15.80   1.02  2.40  2.22  4.26     1 709
3           Alandroal  59.90  6.31  16.78  11.43  2.62  1.19  1.76     1 789
4  Albergaria-a-Velha  68.93  9.10  10.97   1.96  3.43  2.35  3.26     9 365

Through trial and error it became clear that some aspects of the dataframe need to be tweaked a tiny bit: the space used as a thousands separator must be removed, and the numeric columns converted to a numeric type (a float for all of them, although an int could be used for the “Votantes” (voters) column).

## Remove the space used as a thousands separator
res_df["Votantes"] = res_df["Votantes"].str.replace(" ", "")
## Columns 1 to 8 (the seven candidates plus "Votantes") become floats
repl = res_df.iloc[:, 1:9].astype(float)
res_df[repl.columns] = repl
res_df.dtypes.to_frame()
0
Concelhos object
MRS float64
AG float64
AV float64
JF float64
MM float64
TMG float64
VS float64
Votantes float64
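
As a variation, the same clean-up can be written with pd.to_numeric while keeping Votantes as an integer. This is a sketch on a two-row stand-in for res_df (values copied from the output above); stripping every kind of whitespace, including non-breaking spaces, is an assumption about how thousands separators may show up in scraped text.

```python
import pandas as pd

# Two rows copied from the output above, still as strings.
demo_df = pd.DataFrame({
    "Concelhos": ["Abrantes", "Aguiar da Beira"],
    "MRS": ["61.34", "68.11"],
    "Votantes": ["14 896", "1 709"],
})

# Drop any whitespace used as a thousands separator (regular or non-breaking),
# then convert; errors="raise" makes unexpected values fail loudly.
demo_df["Votantes"] = pd.to_numeric(
    demo_df["Votantes"].str.replace(r"\s", "", regex=True), errors="raise"
).astype(int)
demo_df["MRS"] = pd.to_numeric(demo_df["MRS"], errors="raise")

print(demo_df.dtypes)
```

Failing loudly is a deliberate choice here: a silently coerced NaN would later appear as an uncoloured municipality on the map.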

This dataframe would enable us, by itself, to perform all sorts of interesting analyses, but the focus of this notebook is how to present it on a map, and for that we will use GeoPandas.

Visualising data on a map

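Electoral results are often visualised through a choropleth map; for that we need the geographic boundaries of each municipality (here, a GeoJSON file) and the values to colour them with (the results dataframe built above).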

We start by importing the necessary libraries, setting up some options and loading the GeoJSON file (it was added to the repository, so we can read it directly from its URL).

Using geoplot it’s very simple to produce a map of the imported boundaries.

import geopandas
import geoplot
import pyproj
from matplotlib import pyplot as plt
## This might be needed, or not, depending on where the notebook runs
pyproj.datadir.set_data_dir("/usr/local/share/proj")
geo_file="https://raw.githubusercontent.com/fsmunoz/pt-act-parlamentar/presidenciais2021/concelhos_cont1.json"
municipality_boundaries = geopandas.read_file(geo_file)
geoplot.polyplot(municipality_boundaries, figsize=(18, 14))
plt.show()
_images/Presidenciais_2021_wiki_13_0.png

We now have a map and we have a dataframe with the information we want to visualise: we now combine them into a single GeoDataFrame object, essentially a regular Pandas dataframe that has a column named geometry with the geographic information.

The GeoJSON file was imported into a GeoDataFrame object already, and we can see that the column NAME_2 is the one that contains the name of the Municipality:

municipality_boundaries.head()
ID_0 ISO NAME_0 ID_1 NAME_1 ID_2 NAME_2 HASC_2 CCN_2 CCA_2 TYPE_2 ENGTYPE_2 NL_NAME_2 VARNAME_2 geometry
0 182 PRT Portugal 1 Évora 1 Évora PT.EV.EV 0 0705 Concelho Municipality POLYGON ((-7.72069 38.68486, -7.77012 38.75388...
1 182 PRT Portugal 1 Évora 2 Alandroal PT.EV.AL 0 0701 Concelho Municipality POLYGON ((-7.22761 38.76683, -7.26394 38.77336...
2 182 PRT Portugal 1 Évora 3 Arraiolos PT.EV.AR 0 0702 Concelho Municipality POLYGON ((-7.78996 38.89579, -7.86466 38.90903...
3 182 PRT Portugal 1 Évora 4 Borba PT.EV.BO 0 0703 Concelho Municipality POLYGON ((-7.41777 38.86806, -7.42294 38.90348...
4 182 PRT Portugal 1 Évora 5 Estremoz PT.EV.ES 0 0704 Concelho Municipality POLYGON ((-7.46612 38.92604, -7.45290 38.96442...
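
For readers unfamiliar with the format: a GeoJSON file is a JSON document holding a FeatureCollection, where each feature pairs a geometry with a dictionary of properties. When the file is read, the properties become columns such as NAME_2 and the shape ends up in the geometry column. The fragment below is a hand-written, heavily simplified stand-in (the coordinates are made up and the real file carries many more properties), parsed with the standard library only.

```python
import json

# Hand-written stand-in for one municipality; coordinates are illustrative only.
geojson_text = """
{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "properties": {"NAME_1": "Évora", "NAME_2": "Alandroal"},
      "geometry": {
        "type": "Polygon",
        "coordinates": [[[-7.3, 38.6], [-7.2, 38.6], [-7.2, 38.7], [-7.3, 38.6]]]
      }
    }
  ]
}
"""

collection = json.loads(geojson_text)

# Each feature carries its attributes and its shape side by side.
names = [f["properties"]["NAME_2"] for f in collection["features"]]
shapes = [f["geometry"]["type"] for f in collection["features"]]
print(names, shapes)
```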

With that we can merge the results dataframe into it, matching the NAME_2 column on one side with Concelhos on the other. The result is a single GeoDataFrame with the columns of both dataframes; since this is an inner merge, only municipalities whose names appear in both are kept.

geodata = municipality_boundaries.merge(res_df, how='inner', left_on=["NAME_2"], right_on=["Concelhos"])
geodata.head()
ID_0 ISO NAME_0 ID_1 NAME_1 ID_2 NAME_2 HASC_2 CCN_2 CCA_2 ... geometry Concelhos MRS AG AV JF MM TMG VS Votantes
0 182 PRT Portugal 1 Évora 1 Évora PT.EV.EV 0 0705 ... POLYGON ((-7.72069 38.68486, -7.77012 38.75388... Évora 54.78 12.19 16.59 7.92 4.06 2.74 1.73 21847.0
1 182 PRT Portugal 1 Évora 2 Alandroal PT.EV.AL 0 0701 ... POLYGON ((-7.22761 38.76683, -7.26394 38.77336... Alandroal 59.90 6.31 16.78 11.43 2.62 1.19 1.76 1789.0
2 182 PRT Portugal 1 Évora 3 Arraiolos PT.EV.AR 0 0702 ... POLYGON ((-7.78996 38.89579, -7.86466 38.90903... Arraiolos 53.37 8.64 10.99 19.12 4.60 1.31 1.97 2959.0
3 182 PRT Portugal 1 Évora 4 Borba PT.EV.BO 0 0703 ... POLYGON ((-7.41777 38.86806, -7.42294 38.90348... Borba 57.29 9.41 18.37 7.91 3.39 1.90 1.74 2524.0
4 182 PRT Portugal 1 Évora 5 Estremoz PT.EV.ES 0 0704 ... POLYGON ((-7.46612 38.92604, -7.45290 38.96442... Estremoz 53.22 10.46 23.32 6.73 2.93 1.52 1.83 4837.0

5 rows × 24 columns
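
A merge with how='inner' on names drops, without any warning, every municipality that is spelled differently in the two sources. One way to check for that is an outer merge with indicator=True, sketched here on plain pandas dataframes with a few stand-in rows. The mismatching "Vila Exemplo" entry is invented for illustration; applying the same call to the GeoDataFrame is assumed to work in the same way rather than shown.

```python
import pandas as pd

# Stand-ins for the boundaries and the results; "Vila Exemplo" is a made-up name.
boundaries = pd.DataFrame({"NAME_2": ["Évora", "Alandroal", "Vila Exemplo"]})
results = pd.DataFrame({
    "Concelhos": ["Évora", "Alandroal", "Vila-Exemplo"],
    "MRS": [54.78, 59.90, 50.00],
})

# An outer merge keeps every row; the _merge column says where each came from.
check = boundaries.merge(
    results, how="outer", left_on="NAME_2", right_on="Concelhos", indicator=True
)

only_in_map = check.loc[check["_merge"] == "left_only", "NAME_2"].tolist()
only_in_results = check.loc[check["_merge"] == "right_only", "Concelhos"].tolist()
print(only_in_map, only_in_results)
```

Two empty lists mean every name matched; anything else points at spellings to reconcile before plotting.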

We now have the information we need to plot the choropleth map by specifying the column we want to focus on, like the number of voters in each municipality.

geoplot.choropleth(geodata, hue=geodata["Votantes"],cmap="inferno_r", figsize=(15,15))
plt.title("Number of voters")
plt.show()
_images/Presidenciais_2021_wiki_19_0.png

With a bit more work (mostly around setting up the grid and arranging the subplots) we are able to produce a single plot composed of subplots: 7 showing the percentage of each candidate plus a final one with the same information as above.

import matplotlib.pyplot as plt

## Mostly randomly chosen after assigning the most obvious ones
cmap_dict = {
  "MRS": "Oranges",
  "AG": "pink",
  "AV": "Greys",
  "JF": "Reds",
  "MM": "Purples",
  "TMG": "Blues",
  "VS": "Greens",
  "Votantes": "inferno_r"
}

## 2x4 matrix
fig, axes = plt.subplots(ncols=4, nrows=2,figsize=(30,30))

## axes.flat walks the 2x4 grid row by row, so each column gets its own subplot
for ax, candidate in zip(axes.flat, res_df.columns[1:9]):
    geoplot.choropleth(geodata, hue=geodata[candidate], cmap=cmap_dict[candidate], ax=ax)
    ax.set_title(candidate)

plt.subplots_adjust(wspace=0, hspace=0)
_images/Presidenciais_2021_wiki_21_0.png
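
The loop fills the 2x4 grid row by row: the first four columns go on the top row and the remaining four on the bottom one. The mapping from a running index to a grid position can be checked without any plotting; the column list below simply repeats the order seen in the dataframe.

```python
# Column names in the order used by the loop above.
columns = ["MRS", "AG", "AV", "JF", "MM", "TMG", "VS", "Votantes"]
ncols = 4

# divmod(index, ncols) gives (row, column) for a grid filled row by row.
positions = {name: divmod(index, ncols) for index, name in enumerate(columns)}
print(positions["MRS"], positions["MM"], positions["Votantes"])
```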

And that’s it!

Final thoughts

This example is relatively straightforward and can be enhanced in many ways: once the main GeoDataFrame is built, it’s easy to enrich it with data from additional sources. Plots are also not limited to choropleth maps: the dataframe we have built can be used both for geographic visualisation through GeoPandas (and other tools) and as the basis for more quantitative work, such as analysis of variance.