The bird in hand: Humanities research in the age of open data (Digital Science Report)
Posted: October 24, 2016
Originally published as Daniel Paul O’Donnell. 2016. “The Bird in Hand: Humanities Research in the Age of Open Data.” In The State of Open Data: A Selection of Analyses and Articles about Open Data, Edited by Figshare, 34–35. Digital Science Report. London: Digital Science.
Traditionally, humanities scholars have resisted describing their raw material as “data.”
Instead, they speak of “sources” and “readings.” “Primary sources” are the
texts, objects, and artifacts they study; “secondary sources” are the works
of other commentators used in their analyses; “readings” can be either the
arguments that represent the end product of their research or the extracts
and quotations they use for support.
These definitions are contextual. The primary source for one argument can be
the secondary source for another or, as in the case of a “critical edition” of a
historical text, simultaneously primary and secondary. Almost any document,
artifact or record of human activity can be a topic of study. Arguments proposing
previously unrecognized sources (“high school yearbooks, cookbooks, or wear
patterns in the floors of public places”) are valued acts of scholarship. 1
This resistance to “data” is a recognition of real differences in the way humanists
collect and use such material. In other domains, data are generated through
experiment, observation, and measurement. Darwin goes to the Galapagos
Islands, observes the finches, and fills notebooks with what he sees. His notes
(i.e. his “data”) “represent information in a formalized manner suitable for
communication, interpretation, or processing” 2. They are “the facts, numbers,
letters, and symbols that describe an object, idea, condition, situation, or other
factors” 3. Given the extent to which they are generated, it has been argued that
they might be described better as capta, “taken,” than data, “given”. 4
The material of humanities research traditionally is much more datum than
captum, finch than note. Since the humanities involve the study of the meaning
of human thought, culture, and history, such material typically involves other
people’s work. It is often unique and its interpretation is usually provisional,
depending on broader understandings of purpose, context and form that are
themselves open to analysis, argument and modification. In the humanities, we
more often end up debating why we think something is a finch than what we
can conclude from observing it.
Perhaps most telling is the fact that humanities sources, unlike scientific
data, are usually practically as well as theoretically non-rivalrous 5. Humanities
researchers rarely have an incentive (or capability) to prevent others from
accessing their raw material and entire research domains (e.g. Jane Austen
studies) can work for centuries from the same few primary sources. Priority
disputes that occur regularly in the sciences 6 are almost non-existent within
the humanities. 1
The digital age is changing one aspect of this traditional disciplinary difference.
Mass digitization and new tools make it possible to extract material
algorithmically from large numbers of cultural artifacts. Where researchers
used to be limited to sources in archives and libraries to which they had
physical access, digital archives and metadata now make it easier to work
across complete historical or geographic corpora: all surviving periodicals from
19th-century England, for example, or every known pamphlet from the Civil
War. In the digital age, humanities resources can be capta as well as data.
Such changes allow for new types of research and improve the efficacy of some
traditional approaches. But they also raise existential questions about
long-standing practices. Traditionally, humanities researchers have tended to work
with details from a limited corpus to make larger arguments: “close readings” of
selected passages in a given text to produce larger interpretations of the work
as a whole; or of passages from a few selected works to support arguments
about larger events, movements or schools. In one famous but far from atypical
example, author Ian Watt uses readings from five novels and three authors as the
main primary sources in his discussion of the Rise of the Novel. 7
In the age of open data, it is tempting to see this as being, in essence, a
small-sample analysis lacking in statistical power. 8 But such data-centric criticism of
traditional humanities arguments can be a form of category error. Humanities
research is as a rule more about interpretation than solution. It is about why
you understand something the way you do rather than why something is
the way it is. It treats its sources as examples to support an argument rather than
phenomena to be observed in the service of a solution. While Watt’s title,
“The Rise of the Novel,” can be understood as implying a historical scope
that his sample cannot support, his subtitle, “Studies in Defoe, Richardson,
and Fielding,” shows that he actually was making an argument about the
interpretation of three canonical authors based on his understanding of
the novel’s early history – an understanding that by definition always will be
provisional and open to amendment.
The real challenge for the humanities in the age of digital open data is
recognizing the value of both types of sources: the material we can now
generate algorithmically at previously unimaginable scales and the continuing
value of the exemplary source or passage. As the raw material of humanities
research begins to acquire formal qualities associated with data in other fields,
the danger is going to be that we forget that our research requires us to be
sensitive to both object and observation, datum and captum, finch and note. In
asking ourselves what we can do with a million books 9, we need to remember
that we remain interested in the meaning of individual titles and passages.
1 Borgman, Christine L. 2007. Scholarship in the Digital Age: Information, Infrastructure, and the Internet. Cambridge, Mass: MIT Press.
2 Consultative Committee for Space Data Systems. 2012. “Reference Model for an Open Archival Information System (OAIS).” CCSDS 650.0-M-2.
3 National Research Council. 1999. A Question of Balance: Private Rights and the Public Interest in Scientific and Technical Databases. Washington: National Academies Press. http://public.eblib.com/choice/publicfullrecord.aspx?p=3375284.
4 Jensen, H. E. 1950. “Editorial Note.” In Through Values to Social Interpretation: Essays on Social Contexts, Actions, Types, and Prospects, vii–xi. Sociological Series. Duke University Press.
5 Kitchin, Rob. 2014. The Data Revolution. Thousand Oaks, CA: SAGE Publications Ltd.
7 Watt, Ian P. (1957) 1987. The Rise of the Novel: Studies in Defoe, Richardson, and Fielding. London: Hogarth.
8 Jockers, Matthew L. 2013. Macroanalysis: Digital Methods and Literary History. Urbana, IL: University of Illinois Press.
10 Marche, Stephen. 2012. “Literature Is Not Data: Against Digital Humanities.” Los Angeles Review of Books, October. https://lareviewofbooks.org/