One problem with public policy (for me, as an extremely lazy person) is the amount of reading required to get an overview of what matters to local political figures. To get a better idea of what the Seattle City Council members consider important, I figured I could simply look at what they seem to talk about most often, with two goals: putting together a good picture of what matters to each council member, and building a dataset from which I can draw conclusions about how I may want to vote in the future.
To get a good idea of what council members find important, I figured it best to go straight to the source. Seattle's local government offers a freely accessible news RSS feed. I used Python's feedparser library to grab all RSS feed entries from the last 10 years and stored them in a local Mongo database. Each RSS entry comes with a date, a link to the article or post, and some other metadata. I then sifted through the 20,248 entries (as of May 15, 2024) and identified every feed entry with a link to council.seattle.gov.
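For the curious, here is a minimal sketch of that collection step. The feed URL and the Mongo database and collection names are placeholders, not necessarily the ones I used:

```python
import feedparser
from pymongo import MongoClient

# Hypothetical feed URL; the actual Seattle news feed URL may differ.
FEED_URL = "https://news.seattle.gov/feed/"

client = MongoClient("mongodb://localhost:27017")
entries = client["seattle"]["rss_entries"]  # hypothetical db/collection names

feed = feedparser.parse(FEED_URL)
for entry in feed.entries:
    # Upsert on the link so re-running the scrape doesn't duplicate entries.
    entries.update_one(
        {"link": entry.link},
        {"$set": {
            "title": entry.get("title"),
            "published": entry.get("published"),
        }},
        upsert=True,
    )

# Announcements from the council can then be pulled out by host.
council_posts = list(entries.find({"link": {"$regex": r"council\.seattle\.gov"}}))
```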
Using Python's BeautifulSoup library, I scraped the raw text of each of these council announcements and put it into a separate collection in the same Mongo database. Then, using regex, I extracted, for each announcement, the council member whose office published it. After that, I used nltk to remove stopwords and build word-frequency distributions to get a sense of the subjects of the announcements. Using these frequency distributions, I came up with a set of categories for each announcement (they appear as the subjects in the table further below).
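Here is a rough sketch of that scraping and word-counting step; the URL, the regex, and the parsing details are illustrative rather than the exact ones I used:

```python
import re

import nltk
import requests
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.probability import FreqDist

nltk.download("stopwords")
nltk.download("punkt")

# One announcement URL from the Mongo collection (placeholder).
html = requests.get("https://council.seattle.gov/2024/05/some-announcement/").text
soup = BeautifulSoup(html, "html.parser")
text = soup.get_text(separator=" ")

# Illustrative pattern: pages typically credit an office like
# "Councilmember Lisa Herbold" somewhere in the text.
match = re.search(r"Councilmember\s+([A-Z]\w+\s+[A-Z]\w+)", text)
member = match.group(1) if match else None

# Word-frequency distribution over non-stopword tokens.
stops = set(stopwords.words("english"))
tokens = [w.lower() for w in nltk.word_tokenize(text) if w.isalpha()]
fdist = FreqDist(w for w in tokens if w not in stops)
print(member, fdist.most_common(20))
```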
After determining these categories, I used a zero-shot, deep-learning text classification model to measure the strength of the relationship between each category and each announcement. I used an open-source model and ran it locally; the specific model is available here. Using the model's normalized outputs, I measured this strength for every category for each announcement. I then transformed these measurements to more closely follow Gaussian distributions using the transformation \(\mathcal{G} = \frac{1}{2}\log{\left(\frac{p}{1-p}\right)}\), and finally centered and scaled the results.
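A sketch of the classification and transformation follows. The model name below is a placeholder (a commonly used open-source zero-shot classifier), standing in for whichever model I linked above:

```python
import numpy as np
from transformers import pipeline

# Placeholder model: a commonly used open-source zero-shot classifier.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

categories = ["housing", "labor", "public_transit"]  # illustrative subset
announcement_text = "..."  # raw text of one scraped announcement

# With the default settings the scores are normalized across categories.
result = classifier(announcement_text, categories)
scores = dict(zip(result["labels"], result["scores"]))  # labels come back sorted
p = np.array([scores[c] for c in categories])

# G = (1/2) * log(p / (1 - p)): pull scores from (0, 1) onto the real line.
g = 0.5 * np.log(p / (1 - p))

# In practice, centering and scaling happen per category across all announcements.
g = (g - g.mean()) / g.std()
```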
All the code used is available in the GitHub repo, although, be warned, the collection and processing code is extremely messy. If this weren't just a one-time scrape, I would have kept it cleaner.
As many who know me will attest, I am the first to dismiss the capabilities of deep learning as more hype than substance. I still hold this position (at least until the public's expectations of deep learning align with its true capabilities), and my use of deep learning here does not go unexamined. Something to keep in mind from this point forward is that all of these data points were generated from the model's embeddings of the tokens that make up the category and the tokens that make up the article. The model has no actual semantic knowledge of the world, and we will address that: both how it rears its head in the data, and how it may affect our conclusions.
Before we go any further, let me show you an example. You may have noticed that some of the subjects I looked for were COVID-related, but I have data from 2014 to 2024. Let's see how the model characterized announcements from before COVID.
As you can see, there's a legitimate argument that anything under 1 is just random noise. We can combat this by setting anything under 1 to 0 and subtracting 1 from the remaining values. That way, a value of 0 can be read as meaning the subject of the announcement and the subject tested are orthogonal. Unless otherwise stated, assume the use of this sparse dataset moving forward.
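A minimal sketch of that thresholding, assuming the transformed scores live in a pandas DataFrame with one column per category:

```python
import pandas as pd

def sparsify(scores: pd.DataFrame, threshold: float = 1.0) -> pd.DataFrame:
    """Zero out values below `threshold`, then shift the survivors down by
    `threshold`, so that 0 reads as 'orthogonal to this subject'."""
    kept = scores.where(scores >= threshold, 0.0)
    return kept.where(kept == 0.0, kept - threshold)

# sparse = sparsify(transformed)  # `transformed` being the scaled scores above
```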
The first step we should take is to make sure that the council members are actually talking about different things. We can do this using a MANOVA test (a sketch of the test follows the table below).
## subject pval
## 0 business 3.981530e-01
## 1 political_mayor 3.415042e-01
## 2 covid_masks 3.219814e-01
## 3 utilities_heat 2.902140e-01
## 4 public_services 1.652649e-01
## 5 economics 1.301079e-01
## 6 utilities_light 8.775632e-02
## 7 welfare_unemployment 2.200756e-02
## 8 utilities_electric 1.716985e-02
## 9 police_controversy 1.325129e-02
## 10 real_estate 4.846750e-03
## 11 utilities_water 1.455497e-03
## 12 welfare_food 7.413008e-04
## 13 police_general 2.473001e-04
## 14 crime 1.819535e-04
## 15 covid_general 1.022644e-04
## 16 infrastructure 8.502279e-05
## 17 civil_unrest 6.062282e-05
## 18 police_political 1.843063e-05
## 19 budget 2.336022e-06
## 20 public_transit 1.396869e-06
## 21 sports 3.377876e-07
## 22 political_legislation 3.198043e-07
## 23 public_safety 2.930172e-07
## 24 covid_vaccines 1.196167e-07
## 25 parks 2.694141e-09
## 26 housing 3.191576e-11
## 27 civil_rights 9.708386e-12
## 28 environmental 5.153350e-13
## 29 public_health 1.368171e-13
## 30 traffic 3.340381e-14
## 31 political_general 2.768527e-14
## 32 homelessness 7.302153e-15
## 33 labor 9.854326e-18
## 34 political_city_council 1.352759e-51
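Here is a sketch of how one might run this, assuming a DataFrame `df` with a `member` column plus one numeric column per subject; the per-subject p-values above presumably come from follow-up univariate tests like the ones shown:

```python
from scipy import stats
from statsmodels.multivariate.manova import MANOVA

subjects = [c for c in df.columns if c != "member"]

# Omnibus test: do announcement profiles differ across council members?
formula = " + ".join(subjects) + " ~ member"
print(MANOVA.from_formula(formula, data=df).mv_test())

# Follow-up one-way ANOVA per subject, sorted by descending p-value
# (this is the shape of the table above).
pvals = {
    s: stats.f_oneway(*(g[s].to_numpy() for _, g in df.groupby("member"))).pvalue
    for s in subjects
}
for s, p in sorted(pvals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{s:<25s} {p:.6e}")
```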
It would appear that the members of the city council are not always talking about the same things. I would say that's good: it means there's a diversity of thought within the council. Let's compute some aggregates and see what each council member is talking about. We can do this with a parallel coordinates plot.
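A sketch of that aggregate view, using the same hypothetical `df` as above and pandas' built-in parallel-coordinates helper:

```python
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

# Mean (sparse) score per council member per subject.
means = df.groupby("member").mean(numeric_only=True).reset_index()

parallel_coordinates(means, "member", colormap="tab10")
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()
```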
We can also compare subsets; for example, the three most prevalent council members in the dataset.
We can see here that council members Herbold and Sawant tend not to talk about the same subjects.
Let's see if there are any clusters of posts, and whether any particular council members have outsized prevalence in those clusters. We will do this with a sparse PCA transformation, then observe the announcements in 3 reduced dimensions.
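A sketch of the reduction, again assuming the hypothetical `df` and `subjects` from above; the regularization strength is illustrative:

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import SparsePCA

X = df[subjects].to_numpy()  # announcements-by-subjects matrix

spca = SparsePCA(n_components=3, alpha=1.0, random_state=0)  # alpha is illustrative
coords = spca.fit_transform(X)  # shape: (n_announcements, 3)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(coords[:, 0], coords[:, 1], coords[:, 2], s=4)
ax.set_xlabel("PC1"); ax.set_ylabel("PC2"); ax.set_zlabel("PC3")
plt.show()
```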
That is certainly an interesting and unusual underlying structure, but I think there is a simple explanation for it: it is a natural consequence of using deep learning in the way that we did. By using a model to estimate the strength of the association of each announcement with a number of categories, we are creating a unit hypercube space in which each announcement is embedded. By reducing the dimensionality of the hypercube, we create a rotation of a standard 3-dimensional cube, in which each post is embedded. Since announcements from the city council are presumably usually about one main subject, most announcements lie near the vertices of the hypercube, and subsequently near the vertices of the cube as well. This is likely made clearer by the sparse transformation, as the regularization zeroes out each announcement's weaker relationships with the other categories.
Unfortunately this structure doesn't suggest that any kind of clustering would be useful, but we may still be able to gain some insight from the embeddings.
Let’s do the same as we did previously and examine just Herbold, Sawant, and Mosqueda.
Tip: click on a name in the legend to hide / show announcements from that council member
We can see that each of the three most prevalent council members has their own strong and weak areas when it comes to announcements. Sawant and Herbold each have a strong presence along two axes, while Mosqueda's announcements are mostly concentrated along a single axis. We can use the sparse loadings to get an idea of what these subjects are.
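A sketch of reading subjects off the loadings, assuming the `spca` and `subjects` objects from the sketches above:

```python
import numpy as np

loadings = spca.components_  # shape: (3, n_subjects); rows are PC1, PC2, PC3

def subjects_with_zero_loadings(zero_axes, tol=1e-6):
    """Subjects whose loadings on every axis in `zero_axes` are (near) zero."""
    mask = np.all(np.abs(loadings[zero_axes]) <= tol, axis=0)
    return [s for s, m in zip(subjects, mask) if m]

print(subjects_with_zero_loadings([1, 2]))  # PC2 and PC3 ~ 0: Mosqueda's axis
print(subjects_with_zero_loadings([2]))     # just PC3 ~ 0: Sawant
print(subjects_with_zero_loadings([0]))     # PC1 ~ 0: Herbold
```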
For Mosqueda, we look for subjects with PC2 and PC3 loadings close to 0.
## [1] "labor" "political-legislation" "welfare-unemployment"
## [4] "welfare-food"
For Sawant, we can look at subjects with just the PC3 loading close to 0.
## [1] "labor" "housing" "political-legislation"
## [4] "covid-general" "covid-vaccines" "covid-masks"
## [7] "real estate" "welfare-unemployment" "welfare-food"
And for Herbold, we can look at subjects with the PC1 loading close to 0.
## [1] "traffic" "public transit" "labor"
## [4] "housing" "political-legislation" "political-mayor"
## [7] "political-general" "real estate" "sports"