Zürich Speaks

TWIST2018 Project


Text mining 100+ years of Kanton Zürich's referenda and initiatives



*Peter has some nice papers with previous research

main data sources:


  • Kantonal level CSV contains URLs to machine-readable PDFs with voting information

  • Gemeinde level CSV contains per-Gemeinde historical voting records

  • CSVs are joined by unique vote ID (STAT_VORLAGE_ID)

  • PDFs are converted to TXT via pdftotext and can be joined to the CSV files by the field ABSTIMMUNGSTAG
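The join between the two CSV levels can be sketched as follows. This is a minimal illustration using only the standard library, not the project's actual code: the sample rows, the date format, and the columns PDF_URL, GEMEINDE and JA_PROZENT are made up for the example; only STAT_VORLAGE_ID and ABSTIMMUNGSTAG come from the description above.

```python
import csv
import io

# Hypothetical sample data. Only STAT_VORLAGE_ID and ABSTIMMUNGSTAG are
# column names taken from the project description; everything else is assumed.
KANTON_CSV = u"""STAT_VORLAGE_ID,ABSTIMMUNGSTAG,PDF_URL
101,2017-05-21,http://example.org/a.pdf
102,2017-09-24,http://example.org/b.pdf
"""

GEMEINDE_CSV = u"""STAT_VORLAGE_ID,GEMEINDE,JA_PROZENT
101,Gemeinde A,54.2
101,Gemeinde B,47.9
102,Gemeinde A,61.0
"""


def join_by_vote_id(kanton_file, gemeinde_file, key="STAT_VORLAGE_ID"):
    """Attach the Kantonal metadata to every Gemeinde-level row."""
    # Index the Kantonal rows by vote ID for constant-time lookup.
    kanton_by_id = {row[key]: row for row in csv.DictReader(kanton_file)}
    joined = []
    for row in csv.DictReader(gemeinde_file):
        meta = kanton_by_id.get(row[key])
        if meta is None:
            continue  # skip Gemeinde rows without a Kantonal record
        merged = dict(meta)
        merged.update(row)
        joined.append(merged)
    return joined


rows = join_by_vote_id(io.StringIO(KANTON_CSV), io.StringIO(GEMEINDE_CSV))
for r in rows:
    print(r["STAT_VORLAGE_ID"], r["ABSTIMMUNGSTAG"], r["GEMEINDE"], r["JA_PROZENT"])
```

With real files, the two `io.StringIO` objects would be replaced by opened file handles for the Kantonal and Gemeinde CSVs.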

using the code and data

(mostly Python 2.7 or bash)

  • scrapes the URLs from the Kantonal CSV file and saves them locally. (Actually we got the PDFs from the organizers on a usb stick, because the scraper was getting IP blocked.) Note that the files Bundesamt.pdf are not URL linked in the CSV files.

  • loops over the PDFs and converts them to TXT with pdftotext.

  • reads the individual TXT files, cleans up the text a bit, and writes a CSV file with some keys for joining later: full_text.csv (zipped).

  • (experimental) reads the combined text from full_text.csv, and also the metadata from the Kantonal CSV file. It attempts to split each TXT file into multiple elements, one for each ballot measure, using some file-specific keywords. The code then maps the elements to ballot measures based on their rank in this split array. Output file is fulltextmapped.csv.

  • reads fulltextmapped.csv and calculates the polarity (-1 to 1) and the subjectivity (0 to 1) with textblob_de, as well as a readability score. Output file is fulltextmapped_sentiment.csv, with the three scores added as the last 3 columns.
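The experimental splitting step above might look roughly like the sketch below. It is an assumption-laden illustration rather than the project's code: the keyword pattern `Vorlage \d+`, the sample text, the vote IDs, and the helper names `split_measures` and `map_by_rank` are all hypothetical. It splits one document at each keyword match and pairs the resulting chunks with vote IDs by position.

```python
import re


def split_measures(text, keyword_pattern):
    """Split text into chunks, each starting at a match of keyword_pattern.

    Anything before the first match (e.g. a page header) is dropped.
    """
    starts = [m.start() for m in re.finditer(keyword_pattern, text)]
    if not starts:
        return [text.strip()]  # no keyword found: keep the document whole
    bounds = starts + [len(text)]
    return [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]


def map_by_rank(chunks, vote_ids):
    """Pair the i-th chunk with the i-th vote ID.

    Returns an empty dict when the counts differ, since a rank-based
    mapping would be unreliable in that case.
    """
    if len(chunks) != len(vote_ids):
        return {}
    return dict(zip(vote_ids, chunks))


# Hypothetical example document and vote IDs.
text = (u"Kantonale Volksabstimmung\n"
        u"Vorlage 1: Text zur ersten Vorlage.\n"
        u"Vorlage 2: Text zur zweiten Vorlage.\n")

chunks = split_measures(text, r"Vorlage \d+")
mapped = map_by_rank(chunks, ["101", "102"])
for vote_id in sorted(mapped):
    print(vote_id, "->", mapped[vote_id])
```

Returning an empty mapping on a count mismatch is one possible design choice; it makes documents where the keyword split failed easy to find and inspect by hand.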


The most important rule of TWIST 2018 is:

Be Excellent to each other.

As this is an open source event, we will encourage all teams to publish their work under open licenses in open repositories, such as but not restricted to GitHub. The organizers, sponsors, and event staff shall not claim or request any endorsement or special rights and privileges to any work you do at the event. All project documentation created or shared during the event for projects published as above will be republished and promoted under a Creative Commons license as detailed below.

The contents of this website, unless otherwise stated, are licensed under a Creative Commons Attribution 4.0 International License.

Latest update: 2018-08-26
Maintainer: Matthias Mazenauer

Launched at TWIST 2018 by