How Much Information Is Lost When We Summarize? Can Deep Learning Algorithms Offer Some Clues?
When we summarize information, we do our best to leave out details while ensuring our audience receives the most important parts of the story we are trying to deliver, just as I am doing in writing this article. But just how much of that story goes missing in the process? How much of what we left out was actually important to the audience's understanding of the story?
To measure this quantitatively, I built a deep learning model that scores how strongly a block of text (an article, a conversation, anything really) relates to each of five topics [politics, science, sports, weather, world news], using data scraped from Reddit. After training the model on the Reddit data, I used it to rank podcast episodes based on the summaries written by the podcaster, and then again based on the full transcripts. Across 637 podcast episodes, the average change in rank between the summary-based ranking and the transcript-based ranking was 192 positions. That's over a 30% change in rank.
Gathering The Data
Before we can build a model, we need data, and a lot of it. Fortunately, Reddit not only has the data; the architecture of the site makes it one of the best places to turn to for this project. For those unfamiliar with Reddit, it's a social news aggregation and discussion website organized into subreddits. Subreddits are topic-specific: if you want to talk or post about sports, you go to the sports subreddit.
To gather the data, I wrote some code to scrape all the posts made over the past 3 months from five different subreddits [r/politics, r/science, r/sports, r/weather, r/worldnews]. Now we have large blocks of text (articles, comments, etc.) with their associated topics: posts from r/sports are about sports, posts from r/politics are about politics, and so on.
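The labeling step above is mechanical: whichever subreddit a post came from is its topic. Here is a minimal sketch of how scraped posts could be paired with topic labels; the post dicts and field names are illustrative assumptions, and the actual scraping code lives in the project's GitHub repo.

```python
# Illustrative sketch: turn scraped Reddit posts into (text, label) pairs.
# The dict shape {"subreddit": ..., "title": ..., "selftext": ...} is an
# assumption about what a scraper might return, not the project's real schema.

SUBREDDITS = ["politics", "science", "sports", "weather", "worldnews"]

def label_posts(posts):
    """Pair each post's text with the index of its subreddit's topic."""
    dataset = []
    for post in posts:
        topic = post["subreddit"].lower()
        if topic not in SUBREDDITS:
            continue  # skip anything outside our five topics
        text = (post["title"] + " " + post.get("selftext", "")).strip()
        dataset.append((text, SUBREDDITS.index(topic)))
    return dataset

posts = [
    {"subreddit": "sports", "title": "Finals recap", "selftext": "What a game."},
    {"subreddit": "weather", "title": "Storm warning issued"},
]
print(label_posts(posts))
# → [('Finals recap What a game.', 2), ('Storm warning issued', 3)]
```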
Training The Deep Learning Classifier
For this part of the project, I used Google's Word2Vec model to quantify the text. The goal of Word2Vec is to represent words as coordinates in N-dimensional space. The most popular example of this idea is the derivation King - Man + Woman ≈ Queen. Since the deeper details are out of the scope of this discussion, feel free to read more here or have a glance at how I use it in my code on GitHub.
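To make the King - Man + Woman ≈ Queen idea concrete, here is a toy illustration with made-up 3-D vectors. Real Word2Vec embeddings live in hundreds of dimensions and are learned from a large corpus; the vectors below are hand-picked purely so the arithmetic works out.

```python
import numpy as np

# Hand-made toy vectors, chosen so the analogy arithmetic lands on "queen".
vecs = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.9, 0.1, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "queen": np.array([0.1, 0.8, 0.9]),
}

def nearest(query, vocab):
    """Return the word whose vector is most similar (cosine) to the query."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(vocab, key=lambda w: cos(vocab[w], query))

query = vecs["king"] - vecs["man"] + vecs["woman"]
print(nearest(query, {w: v for w, v in vecs.items() if w != "king"}))
# → queen
```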
Now that I have a way of representing all this scraped text as mathematical values (coordinates, specifically), I can build my deep neural network (DNN). For those unfamiliar, a DNN is a function approximation algorithm. The idea is to make a guess, measure how wrong that guess was, adjust based on that wrongness, and try again until the error can be reduced no further. If you would like to know more, you can find more information on DNNs here.
Using PyTorch, building the 750 x 164530 node DNN was easier than it probably sounds (refer to GitHub). Over 10,000 epochs of training, the model gave the highest score to the correct topic for a given block of text 60% of the time. This could be improved with a lot more computing power and time. Nonetheless, we have a model!
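The guess/measure/adjust loop described above can be sketched in a few lines of PyTorch. The 300-dim input (a common Word2Vec dimensionality), the toy batch, the epoch count, and the layer sizes here are placeholder assumptions; the full architecture and training code are in the GitHub repo.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Sketch of a topic classifier: averaged word vectors in, five topic scores out.
model = nn.Sequential(
    nn.Linear(300, 750),   # 300-dim input is a placeholder Word2Vec size
    nn.ReLU(),
    nn.Linear(750, 5),     # one score per topic
)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Fake batch standing in for vectorized Reddit posts.
x = torch.randn(32, 300)
y = torch.randint(0, 5, (32,))

losses = []
for epoch in range(100):          # the real model trained for ~10,000 epochs
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)   # make a guess, measure wrongness
    loss.backward()               # see how each weight contributed
    optimizer.step()              # adjust and try again
    losses.append(loss.item())

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

The loop is the whole story: the loss quantifies "wrongness," backpropagation attributes it to individual weights, and the optimizer nudges them to reduce it.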
Summaries and Transcripts of Podcast Episodes
Without a shadow of a doubt, my favorite podcast is Hidden Brain on NPR. This podcast features rich discussions of intense social topics that range from interesting to outright fascinating. On the site, you will find a summary below each episode's title. For most episodes you will also find a button with 4 horizontal lines leading to a full transcript of the episode.
Summaries range from 2 to 6 sentences long. As you can imagine, 40+ minutes of discussion far exceeds 6 sentences. So this raises the question: how much information is lost when we summarize?
I scraped the summary and transcript of each episode from the site and repeated the Word2Vec quantification process used earlier when building the classifier from the Reddit data. Then I pushed each summary and transcript through the classifier to see how each scored across the five topics mentioned earlier. After ranking the episodes by topic score, the average change in rank across the 637 episodes was 192 positions.
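The rank-change metric itself is simple to sketch: rank the episodes once by their summary scores, once by their transcript scores, and average the absolute change in position. The episode IDs and scores below are made up for illustration.

```python
# Sketch of the rank-change metric. Scores are illustrative stand-ins for
# the classifier's top-topic output on each summary or transcript.

def average_rank_change(summary_scores, transcript_scores):
    """Each argument maps episode id -> score; a higher score ranks first."""
    def ranks(scores):
        ordered = sorted(scores, key=scores.get, reverse=True)
        return {ep: pos for pos, ep in enumerate(ordered)}
    r_sum, r_tr = ranks(summary_scores), ranks(transcript_scores)
    return sum(abs(r_sum[ep] - r_tr[ep]) for ep in r_sum) / len(r_sum)

summaries   = {"ep1": 0.9, "ep2": 0.5, "ep3": 0.1}
transcripts = {"ep1": 0.2, "ep2": 0.8, "ep3": 0.6}
print(average_rank_change(summaries, transcripts))
# → 1.3333333333333333
```

Dividing the average change (192) by the number of episodes (637) is what gives the roughly 30% figure quoted above.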
The results were a bit surprising at first, but less so when I took a step back and really thought about many of my discussions with friends. How often do I leave information out of an experience I'm summarizing, leading a friend to ask "Wait, how did that part happen?", forcing me to backtrack and explain using details I left out? Suddenly, a 30% change sounded reasonable. And of course, the more complex the story, the more likely clarifying questions like that are to arise.
On the other hand, how many details are truly needed to deliver the message concisely without losing the audience’s interest through offering too much detail? Of course, mileage varies by topic of interest.
Of course, no research would be complete without the code and mathematics used to achieve the results! You can find all the code and data used for this research on my GitHub, specifically here. Feel free to check out my GitHub and connect with me on LinkedIn. Thank you for reading!