What I’ve Been Up To: Text Mining

I’ve been a lot quieter on social media lately than usual. Thanks, dissertation. I thought I would take a break from Chicago-style footnotes and R programming to share a little bit of my work… and some of the pretty graphs I have made that likely won’t survive the cutting room and make it into my final draft.

In brief, my dissertation investigates the utility of talk-back boards, a type of museum exhibit characterized by a “question on a wall” format with some mechanism for public responses, such as Post-it notes or a computer terminal. I collected thousands of talk-back board responses from two museums: Seminary Ridge Museum (SRM) in Gettysburg, PA, and Women’s Rights National Historical Park (WRNHP) in Seneca Falls, NY. SRM’s talk-back board was installed upon the museum’s opening in 2013 and asks visitors, “What is the unfinished work of freedom?” WRNHP’s was installed during a museum renovation in 1993 and removed just this past year; it asked visitors, “What will it be like when men and women are truly equal?”

Using these data sets, my ongoing work seeks to answer two primary questions. First, are visitors engaging with the exhibits at either site, and is this detectable through talk-back boards? Second, can talk-back boards be used to detect the presence of historical empathy, and to what degree are SRM and WRNHP respondents engaging in historical empathy when answering these questions?

To answer these questions, I employ a variety of techniques… but in this post I’m only going to talk about two: text mining and topic extraction.

Word Clouds

Like many others, I do not believe the word cloud hype. Word clouds are one of the simplest text mining techniques: they simply convert a chart of frequencies into a plot of different-sized words. The larger the word, the more often that word appeared in the data set. Other than size, nothing else carries information in most word clouds; the location and color of each word are generally irrelevant unless the researcher is doing something different. Word clouds cannot and should not be used as any sort of statistical tool because… well, there’s no analysis being done here. Take a look:

Word cloud created from my Seminary Ridge Museum talk-back board dataset

What can we really learn from this graphic that couldn’t be learned from a chart or even a brief glance at the raw data? Not much, really. The most frequently used words are clear: peace, people, God, freedom, love, equal, rights, equality, government, and Obama. A different person reading this graphic may pick out different words, such as the odd presence of the word “arrow” (from my coding scheme to represent illustrated talk-back responses) or the presence of “Obama” (a lot of people at SRM did not like the Obama image near the talk-back board).
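For the curious, the “chart of frequencies” behind any word cloud is easy to build. What follows is only a minimal sketch in Python, not the tool used for the figure above; the three responses and the tiny stop-word list are invented for illustration.

```python
from collections import Counter
import re

# Hypothetical responses, invented for illustration (not from the SRM dataset).
responses = [
    "Peace and freedom for all people",
    "Freedom means equal rights",
    "Peace on earth",
]

# A tiny, made-up stop-word list; real tools ship much longer ones.
STOP_WORDS = {"and", "for", "all", "on", "the", "a", "of"}

def word_frequencies(texts):
    """Count how often each non-stop word appears across all texts."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts

frequencies = word_frequencies(responses)

# Print the three most frequent words; a word cloud would scale font size by these counts.
for word, count in frequencies.most_common(3):
    print(f"{word:<10}{count}")
```

A word cloud adds nothing to this table except font sizes, which is more or less the complaint.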

Word clouds have a place in analysis because a lot of people simply do not like to read charts and clouds offer a different visualization, but that won’t stop me from thinking word clouds are basically a pretty gimmick. Unlike Jacob Harris, I don’t “die a little inside” every time I see a word cloud, but my eye twitches a bit. More useful are word trend charts that divide the data into even segments, which are basically a form of time series analysis. Compare the plot below to the word cloud above; there’s a lot more information.

SRM Time Series Segment Plot, Top 20 Most Common Responses
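The idea behind a segment plot like this one can be sketched in a few lines. This is a rough illustration only, assuming the responses are stored in the order they were collected; the function name `segment_counts` and the sample responses are hypothetical, and the actual plot was not produced with this code.

```python
import re

def segment_counts(texts, word, n_segments):
    """Split texts (kept in collection order) into n roughly even segments
    and count how many times `word` occurs in each segment."""
    size, extra = divmod(len(texts), n_segments)
    counts, start = [], 0
    for i in range(n_segments):
        # The first `extra` segments take one additional response each.
        end = start + size + (1 if i < extra else 0)
        segment = texts[start:end]
        counts.append(
            sum(re.findall(r"[a-z']+", t.lower()).count(word) for t in segment)
        )
        start = end
    return counts

# Hypothetical responses in collection order, invented for illustration.
responses = [
    "Peace for all", "Peace on earth", "Peace and love",
    "Equal rights", "Freedom now", "Love wins",
]

trend = segment_counts(responses, "peace", 3)
print(trend)
```

Plotting one such list per word, side by side, gives the trend lines: unlike a word cloud, it shows whether a word is fading or growing over the life of the exhibit.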

Lemmatization & Topics

The first step toward useful text mining analysis is cleaning up the data. To do this, I applied lemmatization to the data. Lemmatization effectively condenses the dataset by grouping words with the same root into a single root word (known as the lemma). For example, the set (stop, stops, stopped, stopping) was condensed into the single word “stop.” As an example of lemmatization’s usefulness for analysis, take the SRM segment plot above and notice that “equal” and “equality” appear 5th and 8th most often. Combining the two into a single term would change the way we interpret the data, as “equal” would become one of the most common terms throughout the data.
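The condensing effect is easy to see in a toy example. The sketch below uses a hand-written lookup table in Python rather than a real lemmatizer; the table `LEMMAS` is hypothetical, and merging “equality” into “equal” is treated here as an analyst’s choice rather than something every lemmatizer does by default.

```python
from collections import Counter

# Hypothetical lemma table for illustration; real lemmatizers rely on full dictionaries.
LEMMAS = {
    "stops": "stop",
    "stopped": "stop",
    "stopping": "stop",
    "equality": "equal",  # an analyst's merge, assumed here for the example
}

def lemmatize(words):
    """Replace each word with its lemma, leaving unknown words unchanged."""
    return [LEMMAS.get(word, word) for word in words]

words = ["stop", "stops", "stopped", "stopping", "equal", "equality"]

raw_counts = Counter(words)               # every surface form counted separately
lemma_counts = Counter(lemmatize(words))  # forms grouped under their lemma

print(raw_counts)
print(lemma_counts)
```

Six distinct words collapse into two lemmas, which is exactly why a frequency ranking can shift once the data are lemmatized.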

Topic extraction is a lot like factor analysis (and actually is factor analysis, depending on the software used). To better explain lemmatization and topic extraction, take for example the following response from WRNHP: “men will see we are not just for cooking and cleaning but we can work and make money.” A simple topic extraction on this single response with lemmatization results in “man, cook, clean, work, money.” Note that the actual subject of this statement, “we,” which in this case refers to women in the general sense, is not extracted and is not incorporated into the overall topic extraction analysis. However, a rigorous explanation of each response is not the purpose of topic extraction; the purpose is to extract the “gist” of a response, do so quickly, generate as few topics as possible, and do so with consistency across the entire dataset.

Topic extraction works by constructing a word × document frequency matrix and then running a factor analysis on it. WordStat 7 is a great tool for this compared to the step-by-step coding of R. Below is the topic extraction from the SRM dataset:

Topic | Keywords | EIGEN | %VAR | FREQ | CASES
HISTORY & LEARNING | HISTORY; REPEAT; PAST; LEARN; FUTURE; FORGET; REMEMBER | 3.40 | 1.27 | 197 | 150
ACCEPTANCE | RACE; ORIENTATION; SEXUAL; GENDER; RELIGION; MATTER; SEX; SEXUALITY; COLOR | 2.42 | 1.78 | 305 | 190
MEN & WOMEN CREATED EQUAL | MAN; EQUAL; CREATE; WOMAN | 1.97 | 1.38 | 528 | 374
CHRIST & FREEDOM | CHRIST; JESUS; SET; LORD; FREEDOM; FREE | 1.89 | 1.31 | 668 | 554
LIMITED GOVERNMENT | LIMIT; TERM; FEDERAL; GOVERNMENT | 1.87 | 1.30 | 160 | 139
GOD BLESS | BLESS; GOD; AMERICA | 1.83 | 1.24 | 529 | 415

The topic with the highest returned eigenvalue is the most striking. A large number of respondents consistently related history, the past, and the future to freedom, and it is encouraging that the top extracted topic in a history museum is one about the past. The second topic is also notable for its breadth, meaning that respondents who spoke of equality did so in terms of race, sexuality, gender, and religion simultaneously.
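To make the word × document matrix less abstract, here is a small sketch in plain Python. It is not WordStat’s procedure: it only builds the matrix and measures how strongly two words’ rows move together, which is the kind of correlation a factor analysis starts from. The factoring step itself (the eigenvalues in the table) is left to a statistics package. The four documents and the vocabulary are invented for illustration.

```python
import math
import re

# Hypothetical responses, invented for illustration (not from the SRM data).
documents = [
    "learn from history and remember the past",
    "history repeats so remember the past",
    "god bless america",
    "god bless us all",
]
vocabulary = ["history", "past", "god", "bless"]

def word_document_matrix(docs, vocab):
    """Return one row per word; each row holds that word's count in every document."""
    tokenized = [re.findall(r"[a-z']+", doc.lower()) for doc in docs]
    return {word: [tokens.count(word) for tokens in tokenized] for word in vocab}

def pearson(x, y):
    """Pearson correlation between two equal-length rows (0.0 if either row is constant)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

matrix = word_document_matrix(documents, vocabulary)

print(matrix["history"])                                  # counts of "history" per document
print(pearson(matrix["history"], matrix["past"]))         # words that appear together
print(pearson(matrix["history"], matrix["god"]))          # words that never appear together
```

Words whose rows correlate strongly, like “history” and “past” here, are the ones a factor analysis would tend to load onto the same topic.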

Topic extraction can also be expressed graphically. Rather than share a table of WRNHP topic extraction results, below is a graph of those results in the form of a co-occurrence chart. In this chart, each word is plotted in relation to every other word based on each pairing’s relative co-occurrence, as measured by the Jaccard coefficient. Color-coded groupings indicate broadly defined topics (not calculated by factor analysis), and lines between words indicate a strong co-occurrence. The strongest connections can be seen at the center of the pink cluster, among some of the most common words in the WRNHP data: woman, man, equal, world, love, respect, and peace. This pink cluster generally contains all answers that directly addressed issues connected to sex or gender and world peace.

WRNHP Co-occurrence Chart
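The Jaccard coefficient behind a chart like this is simple: for two words, divide the number of responses containing both by the number of responses containing either. A minimal sketch in Python follows, with four invented responses (already reduced to sets of words) standing in for the WRNHP data; the function name `jaccard` is just a label for this example.

```python
def jaccard(word_a, word_b, docs):
    """Jaccard coefficient of two words: responses containing both,
    divided by responses containing at least one of them."""
    with_a = {i for i, tokens in enumerate(docs) if word_a in tokens}
    with_b = {i for i, tokens in enumerate(docs) if word_b in tokens}
    union = with_a | with_b
    return len(with_a & with_b) / len(union) if union else 0.0

# Hypothetical responses as sets of lemmatized words, invented for illustration.
docs = [
    {"woman", "man", "equal"},
    {"woman", "man", "respect"},
    {"world", "peace"},
    {"woman", "equal"},
]

print(jaccard("woman", "man", docs))    # share two of the three responses that mention either
print(jaccard("world", "peace", docs))  # always appear together
print(jaccard("woman", "peace", docs))  # never appear together
```

A coefficient near 1 means two words almost always show up in the same responses, which is what the lines in the chart are drawing attention to.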

As can be seen, a great deal of insight can be gained from these relatively simple text mining techniques. Before applying them, I did not have the sense that a subset of SRM respondents (197 to be precise) often spoke in exclusively historical terms. Similarly, the co-occurrence chart above indicates strong co-occurrence among the terms within the “pink” grouping, all of which are generally connected to emotion, imagination, and/or empathy in some way.

Moving forward, I’ve already applied similar methods to historical empathy. If I get another chance to breathe in between dissertation writing, teaching, and job applications, then maybe I’ll share some more.

In conclusion, enjoy this really annoying word cloud I made of Abraham Lincoln’s head. Look out, Obama, Lincoln’s got his eye on you.

tagxedo_lincoln

Cheers, Josh.

 

