Tag Archives: Programming

Still Playing Catch-Up

As I was flipping through the February 2014 issue of the American Historical Review, I was encouraged to see that the American historical profession’s flagship journal seems to be doing a pretty decent job of publishing the impressive work of female historians. Three of its four main articles were written by women, and four of the five books in its “Featured Reviews” section were also by women. That’s encouraging. But what about the rest of the February issue? Figuring out how many women are among the 176 contributors to this single issue is a lot harder. And what about not just this issue, but all five issues the journal publishes annually? And what about not just this year, but every year since its inception in 1895?

Looking at gender representation in the American Historical Review is exactly the kind of historical project that lends itself well to digital analysis. Collecting individual author information from 120 years of publication history would take an enormous amount of tedious labor. Fortunately, the information is already online. I wrote a Python script to scrape the table of contents from every AHR issue and then, with the help of Bridget Baird, began to process all of this text to extract the books that were reviewed in the AHR, their authors, and the names of the people reviewing them. The data was something of a nightmare, but we were eventually able to get everything we wanted: around 60,000 books, authors, and reviewers. The challenge then became: was there a way to automatically identify the gender of all of these different people? Especially for a dataset spanning more than a hundred years, we needed a way to take into account potential changes in naming conventions. A historian named Leslie who was born before 1950 was likely to be a man, but a Leslie born after 1950 was likely to be a woman. Bridget’s solution was for us to write a program that relies on a database of names from the Social Security Administration dating back to 1880 to account for these changes. This approach is not without problems: it only includes American names while subtly reinforcing an insidious gender binary framework. Nevertheless, it does contribute a useful new digital humanities methodology, one that we are planning to explore in more depth with Lincoln Mullen.
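
For readers curious about the mechanics, the year-sensitive lookup can be sketched in a few lines of Python. This is a minimal illustration rather than our actual program: the `SSA_ROWS` sample, the ten-year window, and the 75% threshold are all hypothetical stand-ins.

```python
from collections import defaultdict

# Hypothetical sample rows of the SSA baby-name data: (name, year, sex, count).
# The real dataset from ssa.gov has one such row per name/year/sex back to 1880.
SSA_ROWS = [
    ("leslie", 1920, "M", 900), ("leslie", 1920, "F", 300),
    ("leslie", 1980, "M", 200), ("leslie", 1980, "F", 1800),
]

# Index the counts by (name, year) for fast lookup.
counts = defaultdict(lambda: {"M": 0, "F": 0})
for name, year, sex, n in SSA_ROWS:
    counts[(name, year)][sex] += n

def guess_gender(name, birth_year, window=10, threshold=0.75):
    """Guess a gender from how a name was used around an estimated birth year."""
    m = f = 0
    for yr in range(birth_year - window, birth_year + window + 1):
        c = counts.get((name.lower(), yr))
        if c:
            m, f = m + c["M"], f + c["F"]
    total = m + f
    if total == 0:
        return "Unknown"
    if m / total >= threshold:
        return "M"
    if f / total >= threshold:
        return "F"
    return "Unknown"
```

The same name can resolve differently depending on when its bearer was likely born, which is the whole point: `guess_gender("Leslie", 1920)` leans male, while `guess_gender("Leslie", 1980)` leans female.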

This might come as a real shock, but the American Historical Review didn’t feature very many women for much of its publication history. Over the first eighty years of the AHR’s existence there were rarely more than a handful of books written by female authors in any given issue – as a percentage of all authors, women made up less than 10% of reviewed books through the 1970s. But things began to change in the late 1970s, when female authors began a steady ascent in the AHR’s reviews. By the end of the 1980s women’s books had nearly doubled in the journal. By the twenty-first century there were three times as many women as there had been in the 1970s.

[Figure: gender_percent_byyear]

Gender of book authors (as a percent of all authors) in the American Historical Review between 1895 and 2013. The number of authors categorized as “Unknown” in the early years stems from the widespread use of initials (e.g. K. T. Drew). Most of these authors were likely men, but we’ve erred on the safe side in categorizing them as Unknown. In the later years, many of the “Unknowns” stem from non-U.S. names.

But other numbers paint a less rosy picture. Lincoln Mullen’s recent work on history dissertations showed a similarly steady upward trajectory in the number of female-authored history dissertations since 1950. Although it has plateaued in recent years, women have very nearly closed the gap in terms of newly completed history dissertations. But the glass ceiling remains stubbornly low in terms of what happens from that point onwards. In book reviews published in the AHR male authors continue to outnumber female authors by a factor of nearly 2 to 1. Whereas there is now a gap of around 3-5% separating the proportion of male and female dissertation authors, that gap jumps to 25-35% in terms of the proportion of male and female book authors being reviewed in the American Historical Review.

[Figure: mf_diss_book_bluegreen]

Gender of dissertation authors and of book authors in the American Historical Review. Note: The above chart only looks at authors whose gender was successfully identified by the program. It is also something of an apples-to-oranges comparison given that Lincoln and I were using slightly different methods, but it gives a rough sense for the gap between dissertations and the AHR.

On the reviewer side of the equation, things aren’t much better. There are still more than twice as many male reviewers as female reviewers in the AHR. But gender inflects this relationship in less direct ways. In particular, we can look at the gender dynamics of who reviews whom. About three times as many men write reviews of male-authored books as do women. In the case of female-authored books, there are slightly more male reviewers than female reviewers but the ratio is much closer to 50/50. In short, women are much more likely to write reviews of other women. And while men still write reviews of the majority of female-authored books, they tend to gravitate towards male authors – who are, of course, already over-represented in the AHR.

[Figure: male_authors_withreviewers]

Gender of reviewers for male-authored books. Note: The above chart only looks at authors and reviewers whose gender was successfully identified by the program.

[Figure: female_authors_withreviewers]

Gender of reviewers for female-authored books. Note: The above chart only looks at authors and reviewers whose gender was successfully identified by the program.

Bridget and I were also able to extract the subjects used by the AHR to categorize their reviews. Although these conventions changed quite a bit over time, I took a stab at aggregating them into some broad categories for the past forty years. Essentially, I wanted to find out the gender representation within different historical fields. As you can see in the chart below, the proportion of men and women is not the same for all fields. Caribbean/Latin American history has had something approaching equal representation for the past decade-and-a-half. In both African history and Ancient/Medieval history female historians made some quite dramatic gains during the late-nineties and aughts. The guiltiest parties, however, are also the two subject categories that publish the most book reviews: Modern/Early Modern Europe and the United States/Canada. Both of them have made steady progress but still hover at around two-thirds male.

[Figure: categories_gender_bytime]

The different subjects are sorted left-to-right by the number of reviews in the AHR. Again, please note that the above chart only looks at authors whose gender was successfully identified by the program.

Women are now producing history dissertations at nearly the same rate as men, but the flagship journal of the American historical profession has yet to catch up. There are, of course, a lot of factors at play. This gap might reflect a substantial time-lag as a younger, more evenly-balanced generation gradually moves its way through the ranks even as an older, male-skewed generation continues to publish monographs. It might reflect biases in the wider publishing industry, or the fact that female historians continue to bear a disproportionate amount of the time-burden of caring for families. That the AHR continues to publish far more reviews of male authors than female authors is depressing, but unfortunately not surprising given the systemic inequalities that continue to exist across the profession.

Coding a Middle Ground: ImageGrid

Openness is the sacred cow of the digital humanities. Making data publicly available, writing open-source code, or publishing in open-access journals are not just ideals, but often the very glue that binds the field together. It’s one of the aspects of digital humanities that I find most appealing. Despite this, I have only slowly begun to put this ideal into practice. Earlier this year, for instance, I posted over one hundred book summaries I had compiled while studying for my qualifying exams. Now I’m venturing into the world of open-source by releasing a program I used in a recent research project.

The program tries to tackle one of the fundamental problems facing many digital humanists who analyze text: the gap between manual “close reading” and computational “distant reading.” In my case, I was trying to study the geography within a large corpus of nineteenth-century Texas newspapers. First I wrote Python scripts to extract place-names from the papers and calculate their frequencies. Although I had some success with this approach, I still ran into the all-too-familiar limit of historical sources: their messiness. Nineteenth-century newspapers are extremely challenging to translate into machine-readable text. When performing Optical Character Recognition (OCR), the smorgasbord nature of newspapers poses real problems. Inconsistent column widths, a potpourri of advertisements, vast disparities in text size and layout, stories running from one page to another – the challenges go on and on. Consequently, extracting the word “Havana” from OCR’d text is not terribly difficult, but writing a program that identifies whether it occurs in a news story versus an advertisement is much harder. Given the quality of the OCR’d text in my particular corpus, deriving this kind of context proved next-to-impossible.
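
The place-name counting step, at least, is straightforward. Here’s a minimal sketch of the approach – not my actual script; the tiny `PLACES` gazetteer and the sample text are invented for illustration:

```python
import re
from collections import Counter

# A tiny hypothetical gazetteer; the real scripts matched a much larger
# list of place-names against the OCR'd newspaper text.
PLACES = {"havana", "chicago", "galveston", "new orleans"}

def count_places(text, places=PLACES):
    """Count occurrences of known place-names in a chunk of OCR'd text."""
    # Normalize: lowercase and strip punctuation so "Havana," matches "havana".
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    # Check single tokens and two-token phrases (for names like "new orleans").
    for i, tok in enumerate(tokens):
        if tok in places:
            counts[tok] += 1
        if i + 1 < len(tokens):
            bigram = f"{tok} {tokens[i + 1]}"
            if bigram in places:
                counts[bigram] += 1
    return counts

sample = "Steamer arrived from Havana. Markets in Chicago and New Orleans firm; Havana sugar up."
```

Counting like this is easy; as the paragraph above notes, the hard part is knowing whether each hit sits in a news story or an advertisement.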

The messy nature of digitized sources illustrates a broader criticism I’ve heard of computational distant reading: that it is too empirical, too precise, and too neat. Messiness, after all, is the coin of the realm in the humanities: we revel in things like context, subtlety, perspective, and interpretation. Computers are good at generating numbers, but not so good at generating all that other stuff. My program could tell me precisely how many times “Chicago” was printed in every issue of every newspaper in my corpus. What it couldn’t tell me was the context in which it occurred. Was it more likely to appear in commercial news? Political stories? Classified ads? Although I could read a sample of newspapers and manually track these geographic patterns, even this task proved daunting: the average issue contained close to one thousand place-names and stretched to more than 67,000 words (longer than Mrs. Dalloway, Fahrenheit 451, or All Quiet on the Western Front).

I needed a middle ground. I decided to move backwards, from the machine-readable text of the papers to the images of the newspapers themselves. What if I could broadly categorize each column of text according both to its geography (local, regional, national, etc.) and its type of content (news, editorial, advertisement, etc.)? I settled on the idea of overlaying a grid onto the page image. A human reader could visually skim across the page and select cells in the grid to block off each chunk of content, whether a news column, a political cartoon, or a classified ad. Once the grid was divided into blocks, the reader could easily calculate the proportions of each kind of content.

My collaborator, Bridget Baird, used the open-source programming language Processing to develop a visual interface to do just that. We wrote a program called ImageGrid that overlaid a grid onto an image, with each cell in the grid containing attributes. This “middle-reading” approach allowed me a new access point into the meaning and context of the paper’s geography without laboriously reading every word of every page. A news story on the debate in Congress over the Spanish-American War could be categorized primarily as “News” and secondarily as both “National” and “International” geography. By repeating this process across a random sample of issues, I began to find spatial patterns.
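
ImageGrid itself is written in Processing, but the underlying bookkeeping is simple. A hypothetical Python sketch of the data model – grid cells tagged with a primary content category and a set of secondary geography tags – might look like this:

```python
from collections import Counter

# Hypothetical data model: each tagged cell of the grid carries a primary
# content category plus a set of secondary geography tags. ImageGrid itself
# is written in Processing; this only illustrates the bookkeeping.
cells = [
    {"primary": "News", "secondary": {"National", "International"}},
    {"primary": "News", "secondary": {"Local"}},
    {"primary": "Advertisement", "secondary": {"Local"}},
    {"primary": "Advertisement", "secondary": {"National"}},
]

def proportions(cells):
    """Share of tagged page space devoted to each primary content category."""
    counts = Counter(cell["primary"] for cell in cells)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}
```

With cells tagged this way, the “page space” proportions reported below fall out of a one-line count.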

[Figure: Grid with primary categories as colors and secondary categories as letters]

For instance, I discovered that a Texas paper from the 1840s dedicated proportionally more of its advertising “page space” to local geography (such as city grocers, merchants, or tailors) than did a later paper from the 1890s. This confirmed what we might expect, as a growing national consumer market by the end of the century gave rise to more and more advertisements originating from outside of Texas. More surprising, however, was the pattern of international news. The earlier paper contained three times as much foreign news (relative “page space” categorized as news content and international geography) as did the later paper. This was entirely unexpected. The 1840s should have been a period of relative geographic parochialism compared to the ascendant imperialism of the 1890s that marked the United States’ noisy emergence as a global power. Yet the later paper dedicated proportionally less of its news to the international sphere than the earlier one. This pattern would have remained hidden had I used either a close-reading or a distant-reading approach alone. Instead, a blended “middle-reading” through ImageGrid brought it into view.

We realized that this “middle-reading” approach could be readily adapted not just to my project, but to other kinds of humanities research. A cultural historian studying American consumption might use the program to analyze dozens of mail-order catalogs and quickly categorize the various kinds of goods – housekeeping, farming, entertainment, etc. – marketed by companies such as Sears-Roebuck. A classicist could analyze hundreds of Roman mosaics to quantify the average percentage of each mosaic dedicated to religious or military figures and the different colors used to portray each one.

Inspired by the example set by scholars such as Bethany Nowviskie, Jeremy Boggs, Julie Meloni, Shane Landrum, Tim Sherratt, and many, many others, we released ImageGrid as an open-source program. A more detailed description of the program is on my website, along with a web-based applet that provides an interactive introduction to the ImageGrid interface. The program itself can be downloaded either on my website or on its GitHub repository, where it can be modified, improved, and adapted to other projects.

Topic Modeling Martha Ballard’s Diary

In A Midwife’s Tale, Laurel Ulrich describes the challenge of analyzing Martha Ballard’s exhaustive diary, which records daily entries over the course of 27 years: “The problem is not that the diary is trivial but that it introduces more stories than can be easily recovered and absorbed.” (25) This fundamental challenge is the one I’ve tried to tackle by analyzing Ballard’s diary using text mining. There are advantages and disadvantages to such an approach – computers are very good at counting the instances of the word “God,” for instance, but less effective at recognizing that “the Author of all my Mercies” should be counted as well. The question remains, how does a reader (computer or human) recognize and conceptualize the recurrent themes that run through nearly 10,000 entries?

One answer lies in topic modeling, a method of computational linguistics that attempts to find words that frequently appear together within a text and then group them into clusters. I was introduced to topic modeling through a separate collaborative project that I’ve been working on under the direction of Matthew Jockers (who also recently topic-modeled posts from Day in the Life of Digital Humanities 2010). Matt, ever-generous and enthusiastic, helped me to install MALLET (MAchine Learning for LanguagE Toolkit), developed by Andrew McCallum at UMass as “a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.” MALLET lets you feed in a series of text files, which it processes to generate a user-specified number of word clusters it thinks are related topics. I don’t pretend to have a firm grasp on the inner statistical/computational plumbing of how MALLET produces these topics, but in the case of Martha Ballard’s diary, it worked. Beautifully.

With some tinkering, MALLET generated a list of thirty topics of twenty words each, which I then labeled with a descriptive title. Below is a quick sample of what the program “thinks” are some of the topics in the diary:

  • MIDWIFERY: birth deld safe morn receivd calld left cleverly pm labour fine reward arivd infant expected recd shee born patient
  • CHURCH: meeting attended afternoon reverend worship foren mr famely performd vers attend public supper st service lecture discoarst administred supt
  • DEATH: day yesterday informd morn years death ye hear expired expird weak dead las past heard days drowned departed evinn
  • GARDENING: gardin sett worked clear beens corn warm planted matters cucumbers gatherd potatoes plants ou sowd door squash wed seeds
  • SHOPPING: lb made brot bot tea butter sugar carried oz chees pork candles wheat store pr beef spirit churnd flower
  • ILLNESS: unwell mr sick gave dr rainy easier care head neighbor feet relief made throat poorly takeing medisin ts stomach

When I first ran the topic modeler, I was floored. A human being would intuitively lump words like attended, reverend, and worship together based on their meanings. But MALLET is completely unconcerned with the meaning of a word (which is fortunate, given the difficulty of teaching a computer that, in this text, discoarst actually means discoursed). Instead, the program is only concerned with how the words are used in the text, and specifically what words tend to be used similarly.

Besides a remarkably impressive ability to recognize cohesive topics, MALLET also allows us to track those topics across the text. With help from Matt and using the statistical package R, I generated a matrix with each row as a separate diary entry, each column as a separate topic, and each cell as a “score” signaling the relative presence of that topic. For instance, on November 28, 1795, Ballard attended the delivery of Timothy Page’s wife. Consequently, MALLET’s score for the MIDWIFERY topic jumps up significantly on that day. In essence, topic modeling accurately recognized, in a mere 55 words (many abbreviated into a jumbled shorthand), the dominant theme of that entry:

“Clear and pleasant. I am at mr Pages, had another fitt of ye Cramp, not So Severe as that ye night past. mrss Pages illness Came on at Evng and Shee was Deliverd at 11h of a Son which waid 12 lb. I tarried all night She was Some faint a little while after Delivery.”

The power of topic modeling really emerges when we examine thematic trends across the entire diary. As a simple barometer of its effectiveness, I used one of the generated topics that I labeled COLD WEATHER, which included words such as cold, windy, chilly, snowy, and air. When its entry scores are aggregated into months of the year, it shows exactly what one would expect over the course of a typical year:

[Figure: Cold Weather]
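
The aggregation behind a chart like this is straightforward. Here is a hypothetical Python sketch – the `entries` sample and its scores are invented, whereas the real scores come from MALLET’s output for the full diary:

```python
import calendar
from collections import defaultdict

# Hypothetical (date, score) pairs: each diary entry's weight for one topic,
# as it might come out of the entry-by-topic matrix described above.
entries = [
    ("1795-01-12", 0.41), ("1795-01-30", 0.38),
    ("1795-07-04", 0.02), ("1795-07-19", 0.03),
    ("1796-01-08", 0.45), ("1796-07-11", 0.01),
]

def monthly_means(entries):
    """Average a topic's entry scores by calendar month across all years."""
    by_month = defaultdict(list)
    for date, score in entries:
        month = int(date.split("-")[1])  # "1795-01-12" -> 1
        by_month[month].append(score)
    return {calendar.month_abbr[m]: sum(v) / len(v) for m, v in sorted(by_month.items())}
```

Averaging every January together, every February together, and so on is what produces the seasonal curve in the chart.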

As a barometer, this made me a lot more confident in MALLET’s accuracy. From there, I looked at other topics. Two topics seemed to deal largely with HOUSEWORK:

1. house work clear knit wk home wool removd washing kinds pickt helping banking chips taxes picking cleaning pikt pails

2. home clear washt baked cloaths helped washing wash girls pies cleand things room bak kitchen ironed apple seller scolt

When charted over the course of the diary, these two topics trace how frequently Ballard mentions these kinds of daily tasks:

[Figure: Housework]

Both topics moved in tandem, with a high correlation coefficient of 0.83, and both steadily increased as she grew older (excepting a curious divergence in the last several years of the diary). This is somewhat counter-intuitive, as one would think the household responsibilities for an aging grandmother with a large family would decrease over time. Yet this pattern bolsters the argument made by Ulrich in A Midwife’s Tale, in which she points out that the first half of the diary was “written when her family’s productive power was at its height.” (285) As her children married and moved into different households, and her own husband experienced mounting legal and financial troubles, her daily burdens around the house increased. Topic modeling allows us to quantify and visualize this pattern, a pattern not immediately visible to a human reader.
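
For the curious, the correlation coefficient mentioned above is a standard Pearson correlation. A minimal sketch, using invented yearly scores rather than the diary’s actual numbers:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical yearly scores for the two HOUSEWORK topics.
topic1 = [0.10, 0.12, 0.15, 0.18, 0.22, 0.25]
topic2 = [0.08, 0.11, 0.14, 0.19, 0.21, 0.26]
```

Two series that rise and fall together, as these do, produce a coefficient near 1; unrelated series hover near 0.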

Even more significantly, topic modeling allows us a glimpse not only into Martha’s tangible world (such as weather or housework topics), but also into her abstract world. One topic in particular leaped out at me:

feel husband unwel warm feeble felt god great fatagud fatagued thro life time year dear rose famely bu good

The most descriptive label I could assign this topic would be EMOTION – a tricky and elusive concept for humans to analyze, much less computers. Yet MALLET did a largely impressive job in identifying when Ballard was discussing her emotional state. How does this topic appear over the course of the diary?

[Figure: Emotion]

Like the housework topic, there is a broad increase over time. In this chart, the sharp changes are quite revealing. In particular, we see Martha more than double her use of EMOTION words between 1803 and 1804. What exactly was going on in her life at this time? Quite a bit. Her husband was imprisoned for debt and her son was indicted by a grand jury for fraud, causing a cascade effect on Martha’s own life – all of which Ulrich describes as “the family tumults of 1804-1805.” (285) Little wonder that Ballard increasingly invoked “God” or felt “fatagued” during this period.

I am absolutely intrigued by the potential for topic modeling in historic source material. In many ways, it seems that Martha Ballard’s diary is ideally suited for this kind of analysis. Short, content-driven entries that usually touch upon a limited number of topics appear to produce remarkably cohesive and accurate topics. In some cases (especially in the case of the EMOTION topic), MALLET did a better job of grouping words than a human reader. But the biggest advantage lies in its ability to extract unseen patterns in word usage. For instance, I would not have thought that the words “informed” or “hear” would cluster so strongly into the DEATH topic. But they do, and not only that, they do so more strongly within that topic than the words dead, expired, or departed. This speaks volumes about the spread of information – in Martha Ballard’s diary, death is largely written about in the context of news being disseminated through face-to-face interactions. When used in conjunction with traditional close reading of the diary and other forms of text mining (for instance, charting Ballard’s social network), topic modeling offers a new and valuable way of interpreting the source material.

I’ll end my post with a topic near and dear to Martha Ballard’s heart: her garden. To a greater degree than any other topic, GARDENING words boast incredible thematic cohesion (gardin sett worked clear beens corn warm planted matters cucumbers gatherd potatoes plants ou sowd door squash wed seeds) and over the course of the diary’s average year they also beautifully depict the fingerprint of Maine’s seasonal cycles:

[Figure: Gardening]


Note: this post is part of an ongoing series detailing my work on text mining Martha Ballard’s diary.

Text Analysis of Martha Ballard’s Diary (Part 3)

One of the most basic applications of text mining is simply counting words. I began by stripping out punctuation (in order to avoid differentiating mend and mend. as two separate words), putting every word into lowercase, and then ignoring a list of stop words (the, and, for, etc.). By writing a program to count occurrences of the 500 most common words, I could get a general (and more quantitative) sense of what topics Martha Ballard wrote about in her diary. Unsurprisingly, her vocabulary followed a familiar long-tailed curve: like most people, she used a relatively small number of words with extreme frequency. For example, the most common word (mr) occurred 10,050 times, while her 500th most common word (relief) occurred 67 times:

[Figure: Top500Words]
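
The counting step itself takes only a few lines of Python. A minimal sketch – the abbreviated `STOPWORDS` list and the sample entry are stand-ins, not the lists I actually used:

```python
import re
from collections import Counter

STOPWORDS = {"the", "and", "for", "a", "of", "to", "in"}  # abbreviated list

def top_words(text, n=500):
    """Strip punctuation, lowercase, drop stop words, and count what remains."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return counts.most_common(n)

entry = "Clear and pleasant. mr Ballard went to mend the fence and mend. the harness"
```

Because the regex keeps only letters, “mend” and “mend.” collapse into a single word, exactly as described above.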

Because each word has information attached to it – specifically what date it was written – we can look at long-term patterns for a particular word’s usage. However, looking at only raw word frequencies can be problematic. For example, if Ballard wrote the word yarn twice as often in 1801 as 1791, it could mean that she was doing a lot more knitting in her old age. But it could also mean that she was writing a lot more words in her diary overall. In order to address this issue, for any word I was examining I made sure to normalize its frequency – first by dividing it by the total word count for that year, then by dividing it by the average usage of the word over the entire diary. This allowed me to visualize how a word’s relative frequency changed from year to year.
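
That two-step normalization can be sketched as follows. The counts here are invented to mirror the yarn example: between 1791 and 1801 the raw count doubles, but so does the diary’s overall volume, so the normalized frequency comes out the same for both years.

```python
# Hypothetical yearly counts for one word, plus total words written per year.
word_counts = {1791: 20, 1796: 15, 1801: 40}
year_totals = {1791: 20000, 1796: 30000, 1801: 40000}

def relative_frequency(word_counts, year_totals):
    """Divide each year's count by that year's total word count, then divide
    by the word's average yearly rate across the whole diary."""
    rates = {yr: word_counts[yr] / year_totals[yr] for yr in word_counts}
    avg = sum(rates.values()) / len(rates)
    return {yr: r / avg for yr, r in rates.items()}
```

A value of 1.0 means the word appeared at its diary-wide average rate that year; values above or below 1.0 mark genuinely heavy or light usage.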

In order to visualize the information, I settled on trying out sparklines: “small, intense, simple datawords” advocated by infographics guru Edward Tufte and meant to give a quick, somewhat qualitative snapshot of information. To test my method, I used a theme that Laurel Ulrich describes in A Midwife’s Tale: land surveying. In particular, during the late 1790s Martha’s husband Ephraim became heavily involved in surveying property. In the raw word count list, both survey and surveying appear in the top 500 words, so I combined the two and looked at how Martha’s use of them in her diary changed over the years (1785-1812):

[Sparkline: survey(ing)]

Looking at the sparkline, we get a visual sense for when surveying played a larger role in Martha’s diary – around the middle third, or roughly 1795-1805, which corresponds relatively well to Ulrich’s description of Ephraim’s surveying adventures. As a basis for comparison, the word clear appeared with numbing regularity (almost always in reference to the weather):

[Sparkline: clear]
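
Tufte-style sparklines don’t even require a plotting library: a crude text version can be built from a handful of Unicode block characters. This is a minimal sketch, not the method behind the images in this post:

```python
BARS = "▁▂▃▄▅▆▇█"

def sparkline(values):
    """Render a numeric series as a compact one-line text sparkline."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1  # avoid dividing by zero for a flat series
    return "".join(BARS[int((v - lo) / span * (len(BARS) - 1))] for v in values)
```

Each value maps to one of eight bar heights, so a whole year of normalized frequencies compresses into a single “dataword.”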

Using word frequencies and sparklines, I could investigate and visualize other themes in the diary as well.

Religion

Out of the 500 most frequent words in the diary, only three of them relate directly to religion: meeting (#28), worship (#143), and god (#220).

[Sparkline: meeting]

[Sparkline: worship]

[Sparkline: god]

Meeting, which was used largely in a religious context (going to a church meeting) but also in a socio-political context (attending town meetings), had a relatively consistent rate of use, although it trended slightly upwards over time. Worship (which Martha largely used in the sense of “went to publick worship”), meanwhile, was more erratic and trended slightly downwards. Finally, and perhaps most interestingly, there was Martha’s use of the word god. Almost non-existent in the first third of her diary, it occurred much more frequently, but also more erratically, over the final two-thirds. Not only was it a relatively infrequent word overall (flax, horse, and apples all occur more often), but its usage pattern suggests that Martha Ballard did not directly invoke a higher power on a personal level with any kind of regularity (at least in her diary). Instead, she was much more comfortable referring to the more social, community-based activity of attending a religious service. While a qualitative close reading of the text would give a richer impression of Martha’s spirituality, a quantitative approach demonstrates how little “real estate” she dedicates to religious themes in her diary.

Death

[Sparkline: death]

[Sparkline: dead]

[Sparkline: funeral]

[Sparkline: expired]

[Sparkline: interd]

Most of the words related to death show an erratic pattern. There are peaks and valleys across the years without much correlation between the different words, and the only word that appears with any kind of consistency is interd (interred). In this case, word frequency and sparklines are relatively weak as an analytical tool. They don’t speak to any kind of coherent pattern, and at most they vaguely point towards additional questions for study – what causes the various extreme peaks in usage? Is there a common context in which Martha uses each of the words? Why was interd so much flatter than the others?

Family

In this final section, I’ll offer up a small taste of how analyzing word frequency can reveal interpersonal relationships. I used the particular example of Dolly (Martha’s youngest daughter):

[Sparkline: dolly]

The sparkline does a phenomenal job of driving home a drastic change in how Martha refers to her daughter. In a matter of a year or two in the mid-1790s, she goes from writing about Dolly frequently to almost never mentioning her. Why? Some quick detective work (or reading page 145 in A Midwife’s Tale) shows that the plummet coincides almost perfectly with Dolly’s marriage to a man named Barnabas Lambart in 1795. But why on earth would Martha go from mentioning Dolly all the time in her diary to going entire years without writing her name? Did Martha disapprove of her daughter’s marriage? Was it a shotgun wedding?

The answer, while not so scandalous, is nonetheless an interesting one that text analysis and visualization help to elucidate. In short, Martha still writes about her daughter after 1795, but instead of referring to her as Dolly, she begins to refer to her as Dagt Lambd (Daughter Lambert). This is a fascinating shift, and one whose full significance might get lost in a traditional reading. A human poring over these detailed entries might get a vague impression that Martha has started calling her daughter something different, but the sparkline above drives home just how abrupt and dramatic that transformation really was. Martha, by and large, stopped calling her youngest daughter by her first name and instead adopted her new husband’s surname. Such a vivid symbolic shift opens up a window onto an array of broader issues, including marriage patterns, familial relationships, and gender dynamics.

Conclusions

Counting word frequency is a somewhat blunt instrument that, if used carefully, can certainly yield meaningful results. In particular, utilizing sparklines to visualize individual word frequencies offers up two advantages for historical inquiry:

  1. Coherently display general trends
  2. Reveal outliers and anomalies

First, sparklines are a great way to get a quick impression of how a word’s use changes over time. For example, we can see above that the frequency of the word expired steadily increases throughout the diary. While this can often simply reiterate suspected trends, it can ground these hunches in refreshingly hard data. By the end of the diary, a reader might have a general sense for how certain themes appear, but a text analysis can visualize meaningful patterns and augment a close reading of the text.

Second, sparklines can vividly reveal outliers. In the course of reading hundreds of thousands of words across nearly 10,000 entries, it’s quite easy to lose sight of the forest for the trees (to use a tired metaphor). Visualizing word frequencies gives historians a broader perspective on a piece of the text, and the resulting charts act as signposts pointing the viewer towards specific areas for further investigation (such as the red-flag-raising rupture in how frequently Dolly appears). Relatively basic word frequency by itself (such as what I’ve done here) does not necessarily explain anomalies, but it can do an impressive job of highlighting important ones.

Text Analysis of Martha Ballard’s Diary (Part 2)

Given Martha Ballard’s profession as a midwife, it is no surprise that she carefully recorded the 814 births she attended between 1785 and 1812. She gave these events precedence over more mundane occurrences by noting them in a separate column from the main entry. Doing so allowed her to keep track not only of the births themselves but also of payments and restitution for her work. These hundreds of births constituted one of the bedrocks of Ballard’s experience as a skilled and prolific midwife, and this is reflected in her diary.

As births were such a consistent and methodically recorded theme in Ballard’s life, I decided to begin my programming with a basic examination of the deliveries she attended. This examination would take the form of counting the number of deliveries throughout the course of the diary and grouping them by various time-related characteristics, namely: year, month, and day of the week.

Process and Results

The first basic step for performing a more detailed text analysis of Martha Ballard’s diary was to begin cleaning up the data. One step was to take all the words and (temporarily) turn every uppercase letter into a lowercase letter. This kept Python from seeing “Birth” and “birth” as two separate words. For the purposes of this particular program, it was more important to distill words into a basic unit rather than maintain the complexity of capitalized characters.
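As a sketch of this cleanup step (the example sentence is invented, and stripping punctuation alongside the lowercasing is an extra assumption of mine, not necessarily part of our original program):

```python
import string

def normalize(text):
    """Lowercase and strip punctuation so 'Birth' and 'birth' collapse into one token."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

print(normalize("Birth. A Birth at mr Husseys."))
# → ['birth', 'a', 'birth', 'at', 'mr', 'husseys']
```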

Once the data was scrubbed, we could turn to writing a program that would count the number of deliveries recorded in the diary. The program we wrote does the following:

  1. Check whether Ballard wrote anything in the “birth” column (the first column of the entries, which she also used to keep track of deliveries)
  2. If she did, check whether that column contains any of the words “birth”, “brt”, or “born”
  3. Print the remaining entries that contained text in the “birth” column but none of the above words. From this short list I manually added an additional seven entries to the program, in which she appeared to have attended a delivery but did not record it using those words.

Using these parameters, the program could iterate through the text and recognize the occurrence of a delivery. Now we could begin to organize these births.
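The steps above can be sketched roughly as follows (the entry layout follows Part 1, the `sample` data is a hypothetical stand-in, and the seven hand-added entries are not reproduced here):

```python
BIRTH_WORDS = ("birth", "brt", "born")

def is_delivery(entry):
    """Return True if the entry's 'birth' column appears to record a delivery.

    Assumes each entry is structured [[month, day, year], weekday,
    main_text, summaries, birth_column].
    """
    column = entry[4].lower()
    if not column.strip():
        return False  # nothing written in the birth column
    return any(word in column for word in BIRTH_WORDS)

sample = [
    [[1, 3, 1785], 3, "mrs foster went home", "", ""],
    [[2, 5, 1785], 1, "at mr Pollards", "", "Birth. mr Pollards son"],
]
print(sum(is_delivery(e) for e in sample))  # → 1
```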

First, we returned the birth counts for each year of the diary, which were then inserted into a table and charted in Excel:

[Chart: deliveries attended per year]

At the risk of turning my analysis into a John Henry-esque woman vs. machine, I compared my figures to the chart that Laurel Ulrich created in A Midwife’s Tale that tallied the births Ballard attended (on page 232 of the soft-cover edition). The two charts follow the same broad pattern:

[Chart: yearly deliveries, my program’s count vs. Ulrich’s count]

Note: I reverse-built her chart by creating a table from the printed chart, then making my own bar graph. Somewhere in the translation I seem to have misplaced one of the deliveries (Ulrich lists 814 total, whereas I keep counting 813 on her graph). Sorry!

However, a closer look reveals small discrepancies in the numbers for individual years. I calculated each year’s discrepancy using Ulrich’s numbers as the “true” figures (she is the acting President of the AHA, after all) from which my own deviated, and found that the average deviation for a given year was 4.86%:

Year | Manual (Ulrich) | Computer Program | Difference | Deviation (from Ulrich)
1785 28 24 4 14.29%
1786 33 35 2 6.06%
1787 33 33 0 0.00%
1788 27 28 1 3.70%
1789 40 43 3 7.50%
1790 34 35 1 2.94%
1791 39 39 0 0.00%
1792 41 43 2 4.88%
1793 53 50 3 5.66%
1794 48 48 0 0.00%
1795 50 55 5 10.00%
1796 59 56 3 5.08%
1797 54 55 1 1.85%
1798 38 38 0 0.00%
1799 50 51 1 2.00%
1800 27 23 4 14.81%
1801 18 14 4 22.22%
1802 11 12 1 9.09%
1803 19 18 1 5.26%
1804 11 11 0 0.00%
1805 8 8 0 0.00%
1806 10 11 1 10.00%
1807 13 13 0 0.00%
1808 3 3 0 0.00%
1809 21 22 1 4.76%
1810 17 18 1 5.88%
1811 14 14 0 0.00%
1812 14 14 0 0.00%
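The deviation column can be reproduced with a few lines of Python (the `rows` sample simply re-uses a few figures from the table above):

```python
def deviation(ulrich, program):
    """Percent deviation of the program's count from Ulrich's manual count."""
    return abs(ulrich - program) / ulrich * 100

# A few rows from the table above: year -> (manual count, program count)
rows = {1785: (28, 24), 1787: (33, 33), 1801: (18, 14)}
for year, (manual, machine) in sorted(rows.items()):
    print(year, f"{deviation(manual, machine):.2f}%")
```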

Keeping the knowledge in the back of my mind that my birth analysis differed slightly from Ulrich’s, I went on to compare my figures with other factors, including the frequency of deliveries by month over the course of the diary.

[Chart: deliveries by month]

If we extend the results of this chart and assume a standard nine-month pregnancy, we can also roughly estimate in which months Ballard’s neighbors were most likely to be having sex. Unsurprisingly, the warmer period between May and August appears to be a particularly fertile time:

[Chart: estimated conceptions by month]
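A quick sketch of the month arithmetic behind this estimate, assuming a standard nine-month pregnancy (the `conception_month` helper is my own illustration):

```python
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def conception_month(birth_month):
    """Shift a birth month back nine months, wrapping around the year boundary.

    Months are numbered 1-12.
    """
    return (birth_month - 9 - 1) % 12 + 1

# A March birth points back to a June conception the previous year:
print(MONTHS[conception_month(3) - 1])
```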

Finally, I looked at how often births occurred on different days of the week. There wasn’t a strong pattern, beyond the fact that Sunday and Thursday seemed to be abnormally common days for deliveries. I’m not sure why that was the case, but would love to hear speculation from any readers.

[Chart: deliveries by day of the week]

Analysis

The discrepancies between the program’s tally of deliveries and Ulrich’s delivery count speak to broader issues in “digital” text mining versus “manual” text mining:

Data Quality

Ulrich’s analysis is the result of countless hours spent eye-to-page with the original text. And as every history teacher drills into their students, working directly with the primary documents minimizes the layers of interpretation that can alter the original meaning. In comparison, my analysis is the result of the original text passing through several levels of transformation, like a game of telephone:

Original text -> Typed transcription -> HTML tables -> Python list -> Text file -> Excel table/chart

Each level increases the chance of a mistake. For instance, a quick manual examination of the online version of the diary for 1785 finds an instance of a delivery (marked by ‘Birth’) that shows up in the online HTML but does not appear in the “raw” HTML files our program is processing and analyzing.

On the other hand, a machine doesn’t get tired and miscount a word tally or accidentally skip an entry.

Context

Ulrich brings to bear on her textual analysis years of historical training and experience, along with a deeply intimate understanding of Ballard’s diary. This allows her to take into account one of the most important aspects of reading a document: context. Meanwhile, our program’s ability to understand context is limited quite specifically to the criteria we use to build it. If Ballard attended a delivery but did not mark it in the standard “birth” column like the others, she might mention it more subtly in the main body of the entry. Whereas Ulrich could recognize this and count it as a delivery, our program cannot (at least with the current criteria).

Where the “traditional” skills of a historian come into play with data mining is in the arena of defining these criteria. Using her understanding of the text on a traditional level, Ulrich could create far superior criteria to mine for counting the number of deliveries Martha Ballard attended. The trick comes in translating a historian’s instinctual eye into a carefully spelled-out list of criteria for the program.

Revision

One area that is advantageous for digital text mining is that of revising the program. Hypothetically, if I realized at a later point that Ballard was also tallying births using another method (maybe a different abbreviated word), it’s fairly simple to add this to the program’s criteria, hit the “Run” button, and immediately see the updated figures for the number of deliveries. In contrast, it would be much, much more difficult to do so manually, especially if the realization came at, say, entry number 7,819. The prospect of re-skimming thousands of entries to update your totals would be fairly daunting.

Text Analysis of Martha Ballard’s Diary (Part 1)

“mr Ballard left home bound for Oxford. I had been Sick with the Collic. mrs Savage went home. mrs foster Came at Evening. it snowd a little.”

This is the first entry in the diary of Martha Ballard. Martha Ballard was a rural Maine midwife who kept an extensive diary between 1785 and 1812 and whose life was immortalized in 1990 in historian Laurel Thatcher Ulrich’s award-winning A Midwife’s Tale. Over the course of three decades, Ballard kept a meticulous, near-daily accounting of her life spanning over 10,000 entries.

When reading A Midwife’s Tale, I was struck by how readily the text would seem to lend itself to digital analysis. In an interview, Ulrich noted, “The very thing that had attracted me to the diary in the first place was also the thing that made it difficult to work with. I mean there’s just so much.” To ground herself, she began by simply counting things: “And I would go day by day for every other year of the diary, and I would tick off what was in each entry: baking or brewing, spinning or washing, or trading, sewing, mending, deliveries, general medical accounts, going to church, visitors, people coming for meals, etc.” Because of the sprawling scope, she took this quantitative approach only for the even-numbered years in the diary. The fact that she was working in the late eighties without a computer makes her work even more impressive.

After poking around online I came across DoHistory.org, a website developed and maintained by the Film Study Center at Harvard University and hosted by (who else, really) George Mason’s CHNM. The website presents the diary to the public in two formats: the viewer can either browse through photographed pages of the diary or read the transcript of the pages (transcribed through a monumental effort by Robert R. McCausland and Cynthia MacAlman McCausland):

[Images: a photographed diary page and its transcription]

When I realized the entire diary was online, it got me thinking about possibilities for text mining. As an aspiring digital humanist with little “hard” skills beyond basic GIS, I had been meaning to learn how to program for quite some time. In Martha Ballard’s diary, I had an intriguing source of data with which to learn how to do so. Now I just had to learn how to program. With the patient help of several programming-savvy family members, I gradually learned the basics of Python and how to apply it to Martha Ballard’s diary. What follows are the first steps we took to process the diary’s raw data into an accessible digital format.

Process

At first, I briefly considered learning how to scrape the text of the diary off the website. After some investigation, I decided that was a little beyond my abilities, so I copped out to the much easier route of sending an email to Kelly Schrum at CHNM, who kindly forwarded my request to Ammon Shepherd, who emailed me a zip file containing 1,431 html documents, one for each page of the diary. The html files of the transcribed diary are a basic, three-column table. My first step was to find a way to strip out the html tags and organize the text into a systematic database of individual entries. Fortunately, Ballard’s meticulousness and consistency lent itself well to such an approach.

The diary’s format translates quite nicely into creating a list of lists – the “main” diary being a list of all the entries, and each entry being a list in and of itself. The first program we wrote was to open each html file and begin extracting the different sections of text (which were conveniently marked by html tags). Iterating through each entry allowed us to separate the different columns in her diary into different items in the list. Here is the breakdown of our “list of lists”:

  1. Diary
    1. Entry
      1. Date
        1. Month
        2. Day
        3. Year
      2. Day of the Week
      3. Main Text of Entry
      4. Day Summaries (Column 3 of actual diary entry)
      5. Birth(s) (Recorded in Column 1 of actual diary entry)

In creating the list, we had to separate out the raw data from the html tags that formatted it. Fortunately, the folks who built the html files originally used an extremely systematic formatting process that actually made the job of distilling one from the other quite straightforward. A Python module called Pickle allowed us to export the list of entries as a manageable single file that we could then easily import into future programs to manipulate.
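A minimal sketch of the pickle step (the `diary` sample below is an invented stand-in for the real parsed data, and the file path is arbitrary):

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the parsed diary: a list of entry lists,
# each [[month, day, year], weekday, main_text, summaries, births]
diary = [
    [[1, 1, 1785], 7, "mr Ballard left home bound for Oxford.", "", ""],
    [[1, 3, 1785], 3, "Tuesday. mrs. Foster went home.", "", ""],
]

path = os.path.join(tempfile.gettempdir(), "ballard_diary.pkl")

# Export the whole list of lists as a single manageable file...
with open(path, "wb") as f:
    pickle.dump(diary, f)

# ...then reload it in a later program without re-parsing 1,431 html files
with open(path, "rb") as f:
    reloaded = pickle.load(f)

print(reloaded == diary)  # → True
```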

For example, the third entry in the diary translates into something like this:

  1. Diary
    1. Entry (3)
      1. Date
        1. 1 (January)
        2. 3
        3. 1785
      2. 3 (Tuesday – Ballard numbered the weekdays, beginning with Sunday as 1)
      3. “Tuesday. mrs. Foster went home. I had threats of thee Collic; by takein peper found releif.”
      4. Empty
      5. Empty

The list allows us to access pieces of information by “calling” their position. It helped me to think of the entire diary list as a warehouse containing almost 10,000 boxes (entries), with each box containing five compartments, and the first of those compartments divided into three sub-compartments. If you were to open any of the boxes and look inside the first compartment, then inside the first sub-compartment, you would always find a number representing the month of that particular entry. If you were to look inside the third compartment of the entry/box, you would always find the main text for that day’s entry.
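In Python terms, the warehouse analogy looks something like this (the entry shown re-uses the third diary entry from above; note that Python counts positions from zero):

```python
# A single box (entry) from the warehouse, structured
# [[month, day, year], weekday, main_text, summaries, births]:
entry = [[1, 3, 1785], 3,
         "Tuesday. mrs. Foster went home. I had threats of thee Collic; "
         "by takein peper found releif.", "", ""]

month = entry[0][0]    # first compartment, month sub-compartment
year = entry[0][2]     # first compartment, year sub-compartment
main_text = entry[2]   # third compartment: the main text of the entry

print(month, year)  # → 1 1785
```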

The advantage of setting up the data in a list structure is the ability to access these specific pieces of information easily and to compare them across entries. In many ways, processing the text to make it readable and programmable is one of the biggest challenges of text mining. Deciding on the most logical way to organize and break down over 1,400 files lays the groundwork for the fun part: writing programs to actually analyze the diary of Martha Ballard.

***Special-edition sneak preview of future posts in this series***

A simple counting program reveals that the main text of Martha Ballard’s diary alone contains 377,315 words, spanning I-couldn’t-make-this-number-up 9,999 entries. That is a lot of data to play with.