I have been using Feedly daily for the last couple of years as my primary news feed application. I have subscribed to over 180 publications over the years, so I was curious to analyze my own news diet. Thankfully, Feedly has an open API, and I was able to use it to extract a lot of data about the publications I had subscribed to. Here are some observations.
News publications outweigh all other kinds of publications, in terms of amount of content they create, by a huge margin.
Most of the publications I have subscribed to are related to design and technology. Only a handful are related to news or politics. Following is a graph of topics that the different publications covered:
Yet the biggest chunk of content comes from news-related publications, by a huge margin. Following is a graph of the number of articles each publication produced per week. Clearly, news and politics publications were much more active than the others.
Although this is not representative of the internet at large, my own daily consumption of content on the internet was dominated by news publications more than anything else.
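To make the per-week comparison concrete, here is a minimal sketch of how such a rate could be computed from article timestamps. It is not the code behind the graph: the function name `articles_per_week`, the source names and the sample timestamps are all made up for illustration, and it assumes the publication times have already been pulled out of the feed data.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def articles_per_week(published, now, weeks=4):
    """Average articles per week for each source over the last `weeks` weeks.

    `published` is an iterable of (source, datetime) pairs (hypothetical format).
    """
    cutoff = now - timedelta(weeks=weeks)
    # Count only the articles that fall inside the window.
    counts = Counter(source for source, ts in published if cutoff < ts <= now)
    return {source: n / weeks for source, n in counts.items()}


# Made-up sample data: one busy source, one quiet one.
now = datetime(2018, 1, 29, tzinfo=timezone.utc)
sample = (
    [("news-site", now - timedelta(hours=h)) for h in range(1, 201)]
    + [("design-blog", now - timedelta(days=d)) for d in range(1, 9)]
)
rates = articles_per_week(sample, now, weeks=2)
for source, rate in sorted(rates.items(), key=lambda kv: -kv[1]):
    print(f"{source}: {rate:.1f} articles/week")
```

Averaging over a few weeks rather than one smooths out slow news days, at the cost of hiding short bursts.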
The number of articles produced by various news publications also varied enormously. The highest number of articles were produced by Firstpost, more than 1,800 per week.
On deeper investigation, it turned out that almost 50% of the articles published by Firstpost were bot generated: they were basically republished Reuters articles. I think Firstpost has a partnership with Reuters for world news. Even then, Firstpost seems to produce almost 900 articles of its own every week, which still makes it at least the fourth highest.
With so many articles to read, it is fair to say that I end up reading only the headlines of more than 90% of them. So I decided to see what information I get from the headlines.
I chose ten publications in the US that are well known and respected. I wanted to include examples from mainstream TV media, newspapers and independent media. The publications I chose were CNN, Fox News, The New York Times, Vox, Democracy Now, Mother Jones, Slate, The Intercept, Politico and ProPublica.
To analyze the headlines, I took the 100 most recently published articles from each of these sources. Since they publish at very different speeds, the 100 articles from CNN all came from the same day, whereas the 100 from Slate spanned a couple of days. I could have kept the time period the same and let the number of articles vary, but I felt it made more sense to analyze the same amount of content from every publication.
Following are some things I found by analyzing the latest 100 headlines from each of these publications.
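The trade-off between a fixed article count and a fixed time window can be illustrated with a small sketch that measures how much time the latest N articles of a source cover. The helper name `span_of_latest` and the two sample publishing schedules are hypothetical, chosen only to show the effect.

```python
from datetime import datetime, timedelta


def span_of_latest(timestamps, n=100):
    """Time between the newest and the oldest of the latest `n` timestamps."""
    latest = sorted(timestamps, reverse=True)[:n]
    if len(latest) < 2:
        return timedelta(0)
    return latest[0] - latest[-1]


# Made-up schedules: one source posts every 10 minutes, the other every 45.
start = datetime(2018, 1, 1)
fast = [start + timedelta(minutes=10 * i) for i in range(150)]
slow = [start + timedelta(minutes=45 * i) for i in range(150)]

print("fast source:", span_of_latest(fast))
print("slow source:", span_of_latest(slow))
```

With the same count of 100, the slower source reaches several times further back in time, which is the asymmetry described above.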
While it seems that every publication always talks about Trump, the difference between publications can be significant.
Politico mentioned Trump in the headline of slightly less than 50% of its articles, while for the others it was about one article in four. Fox News was surprisingly on the lower side. I was curious to see what they were talking about instead; see the analysis below.
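A count like this can be sketched in a few lines. The snippet below is an illustration rather than the script used for the numbers above: `mention_share` is a hypothetical helper, and the headlines and source names are invented. It matches the keyword as a whole word, so "Trump's" counts but "Trumpet" does not.

```python
import re


def mention_share(headlines, keyword):
    """Fraction of headlines containing `keyword` as a whole word (case-insensitive)."""
    if not headlines:
        return 0.0
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    hits = sum(1 for headline in headlines if pattern.search(headline))
    return hits / len(headlines)


# Invented headlines, for illustration only.
sample = {
    "source-a": [
        "Trump signs order",
        "Markets rally",
        "Senate votes on budget",
        "What Trump's tweet means",
    ],
    "source-b": [
        "Storm hits coast",
        "New phone released",
        "Trumpet festival opens",
        "Trump visits Asia",
    ],
}
shares = {source: mention_share(h, "Trump") for source, h in sample.items()}
for source, share in shares.items():
    print(f"{source}: {share:.0%}")
```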
The overlap between stories by different publications is actually quite small. Here is an analysis of 100 recent articles from each of the publications.
Stories across publications are grouped horizontally, i.e. stories on a similar topic appear on the same horizontal line. This makes it easy to spot who covers which story, and how many times. Each row under a publication represents a different story; for brevity, only keywords are shown instead of the complete headline. Here is an interactive version with all the links.
The links at the bottom do not follow the horizontal grouping since each article is unique to that publication. Again, view the interactive version here https://th1000links.herokuapp.com/
Here are certain observations from the visualization above:
- When I started this exercise, I thought there would be a huge overlap between the topics these publications covered. I was hoping to make a visualization where each of the hundred articles from one source had a corresponding piece in the other sources. Surprisingly, that was not the case; in fact, it was quite the opposite. The stories that were unique to a source vastly outnumbered the common ones. That means that even if we solve the problem of combining stories on similar topics, there are just too many topics.
- While my dataset is not large enough to claim which stories were covered or ignored by a particular source, it does give a general sense of who cares about what, which stories spread across all publications and which ones stayed concentrated in one type of media. For example, ProPublica and CNN had no topic in common, and ProPublica also seems to do multiple stories on a particular topic. Vox seems to occupy a space between the mainstream and independent media in terms of the topics it chooses to cover.
A side note on the technical challenge of making the visualization above: open source text analysis tools such as IBM’s cloud do not work very well for this. I can understand why, since classifying a particular article one way or the other requires a lot of contextual understanding. I did use a combination of packages to find similarity and keywords, but at the end of the day a lot of manual labor went into cleaning up the results. You can find the code here.
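To give a rough idea of what grouping headlines by similarity involves, here is a minimal sketch that uses plain keyword overlap (Jaccard similarity) and a greedy single pass. This is not the code linked above and not the combination of packages I used; the stopword list, the 0.3 threshold, the function names and the sample headlines are all assumptions made for illustration.

```python
import re

# A tiny, hand-picked stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "the", "of", "in", "on", "to", "for", "and",
             "is", "as", "at", "by", "with", "after", "over"}


def keywords(headline):
    """Lowercased words of a headline, minus stopwords and very short words."""
    words = re.findall(r"[a-z0-9]+", headline.lower())
    return frozenset(w for w in words if w not in STOPWORDS and len(w) > 2)


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def group_headlines(items, threshold=0.3):
    """Greedily put each (source, headline) into the most similar existing group."""
    groups = []  # each group: {"keys": frozenset of keywords, "items": [...]}
    for source, headline in items:
        keys = keywords(headline)
        best, best_score = None, 0.0
        for group in groups:
            score = jaccard(keys, group["keys"])
            if score > best_score:
                best, best_score = group, score
        if best is not None and best_score >= threshold:
            best["items"].append((source, headline))
            best["keys"] = best["keys"] | keys
        else:
            groups.append({"keys": keys, "items": [(source, headline)]})
    return [group["items"] for group in groups]


# Invented headlines, for illustration only.
items = [
    ("source-a", "Senate passes tax bill"),
    ("source-b", "Tax bill passes Senate after late vote"),
    ("source-c", "Wildfire spreads in California"),
    ("source-a", "California wildfire forces evacuations"),
    ("source-b", "New museum opens downtown"),
]
groups = group_headlines(items)
for group in groups:
    print([f"{source}: {headline}" for source, headline in group])
```

A sketch like this breaks down quickly on real headlines: two pieces on the same story can share no words at all, and two unrelated pieces can share a name. That is where most of the manual cleanup came from.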
This only scratches the surface of understanding the content I get from Feedly. The content we consume on the internet shapes our thoughts and actions, so it is critical to carefully monitor and understand that intake.