The interesting and important questions about the role of data in scholarship address the processes by which something becomes data.
This chapter explores efforts to define data in theoretical and operational terms, and then identifies factors that influence their creation and use.
Buckland (1991) distinguished between information as process, as knowledge, or as thing. Donald Case (2002, 2012) collected dozens of definitions of information, grouping them by how they dealt with uncertainty, physicality, structure/process, intentionality, and truth. Jonathan Furner (2004) applied three criteria to selecting definitions of information: coherence, parsimony, and utility. Later, he identified three families of conceptions of information that are broadly useful: semiotic, socio-cognitive, and epistemic (Furner, 2010).
The thinking of Foucault (1994), Lakoff (1987), and many other philosophers and scholars has been influenced by Borges’ subtle skewering of classification mechanisms.
The most concrete definitions of data are found in operational contexts. Data remains a fluid concept, allowing archives to adapt to new forms of data as they appear.
In both operational and general research contexts, types of data may be distinguished by grouping them in useful ways. Various types of records are associated with observational, experimental, and computational data, such as historical records, field records, and handwritten notes. Records serve as a useful fourth category of data origin, encompassing forms of data that do not fit easily into the categories of observation, experiment, or computation, or that result from any of these categories.
Efforts to categorize digital data collections also shed light on the origins and value of data to communities. One such categorization distinguishes research data collections, resource or community data collections, and reference data collections. These three categories are useful for assessing the degree of investment that communities make in their data and the amount of sharing that takes place.
The meaning of “data” is particularly ambiguous in the humanities (Borgman, 2009; Unsworth et al., 2006). Distinctions between primary and secondary sources are the closest analog to raw and processed groupings in the sciences and social sciences. An important distinction about data in the humanities is how uncertainty is treated in knowledge representations (Kouw, Van den Heuvel, & Scharnhorst, 2013). Implicit in research methods and in representations of data are epistemic choices of how to reduce uncertainty.
Differences in the scale of data, in turn, influence what becomes data, the methods applied, the forms of inquiry, and the goals and purposes of research. The amount and type of labor involved in data collection influence selection, volume, processing, and management.
Becoming data is usually a process in which a scholar recognizes that something could be used as evidence of phenomena and then collects, acquires, analyzes, and interprets those units as data.
Data, knowledge, and representation are inseparable.
Sources are data that originate with the investigators on a given project. Resources are extant data reused for a given project.
Metadata, most simply defined as “data about data,” are representations used to bridge distances such as time, context, method, theory, technology, or processing. Metadata can bridge distances from sources to resources by formalizing and standardizing how data are named and described. To name something is to make an ontological commitment that some category of things exists (Furner, 2010). Ontologies, taxonomies, thesauri, and other forms of metadata facilitate interoperability within communities.
Whereas metadata provide multiple entry points to data, provenance bridges distances by ordering relationships about the origins of data.
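The contrast between metadata as entry points and provenance as ordered relationships can be made concrete with a small sketch. The names used here (`DataRecord`, `find_by`, `lineage`) and the example records are hypothetical, invented for illustration; they do not correspond to any particular metadata standard or provenance model, and real systems are considerably richer.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DataRecord:
    """A hypothetical unit of data with descriptive metadata and one provenance link."""
    identifier: str
    metadata: dict = field(default_factory=dict)   # names and descriptions: entry points
    derived_from: Optional["DataRecord"] = None    # provenance: the record this one came from
    process: Optional[str] = None                  # how it was derived, if known


def find_by(records, key, value):
    """Metadata as entry point: return identifiers of records whose metadata match."""
    return [r.identifier for r in records if r.metadata.get(key) == value]


def lineage(record):
    """Provenance as ordering: walk back from a record to its origin."""
    chain = []
    current = record
    while current is not None:
        chain.append(current.identifier)
        current = current.derived_from
    return chain


# Illustrative records only; none of these describe a real data set.
field_notes = DataRecord("field-notes-001", {"method": "observation", "format": "handwritten"})
transcription = DataRecord(
    "transcription-001",
    {"method": "observation", "format": "text"},
    derived_from=field_notes,
    process="transcribed",
)
summary_table = DataRecord(
    "summary-table-001",
    {"method": "observation", "format": "table"},
    derived_from=transcription,
    process="coded and tabulated",
)
records = [field_notes, transcription, summary_table]

by_format = find_by(records, "format", "text")
history = lineage(summary_table)
```

In this sketch, `find_by` reaches a record through any descriptive attribute, whereas `lineage` returns a single ordered chain from the derived table back to the original field notes.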
Data have many kinds of value, only one of which is monetary. While a comprehensive analysis of the economics of research data is much needed, it is well beyond the scope of this book. Research data are best understood as a commons resource. Threats to knowledge commons include commodification, degradation, and lack of sustainability. The commodification of scholarly literature and the rapid rise in prices charged to libraries were among the drivers of the open access movement. Similar concerns appear to be influencing the open data movement. Digital data are fragile and vulnerable to loss; sustainability is difficult to achieve. Concerns arise about whether some data, such as the human genome or maps of public spaces, are too valuable as shared resources to be controlled as commodities.
Some basic principles of the economics of information also are important to the discussion of the value of research data. Goods can be classified along two dimensions into a simple two-by-two matrix. The first dimension is the degree to which individuals can be excluded from use. The second dimension is subtractability, also known as rivalry: the degree to which one person’s use of a good reduces what is available to others.
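The two-by-two matrix can be sketched in a few lines of code. The cell labels used here (public good, common-pool resource, toll or club good, private good) are the ones conventionally used in the commons literature and are an assumption about the matrix intended, since the text above names only the two dimensions. The function name and the boolean simplification of what are really matters of degree are likewise illustrative.

```python
from enum import Enum


class GoodType(Enum):
    PUBLIC_GOOD = "public good"
    COMMON_POOL_RESOURCE = "common-pool resource"
    TOLL_OR_CLUB_GOOD = "toll or club good"
    PRIVATE_GOOD = "private good"


def classify_good(exclusion_is_easy: bool, subtractability_is_high: bool) -> GoodType:
    """Place a good in the two-by-two matrix.

    Both dimensions are degrees in practice; treating them as booleans
    is a simplification made for this sketch.
    """
    if exclusion_is_easy:
        if subtractability_is_high:
            return GoodType.PRIVATE_GOOD
        return GoodType.TOLL_OR_CLUB_GOOD
    if subtractability_is_high:
        return GoodType.COMMON_POOL_RESOURCE
    return GoodType.PUBLIC_GOOD


# Enumerate all four cells of the matrix.
matrix = {
    (easy, high): classify_good(easy, high)
    for easy in (False, True)
    for high in (False, True)
}
```

Where a given collection of research data falls in the matrix depends on how it is governed: the same data might behave as a toll good behind a subscription and as something closer to a public good when released openly.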
Laws, policies, and practices applying to the ownership, control, and release of data vary by funding agency, government, and jurisdiction. Whether or not data are intellectual property, they often are treated as such. The most difficult set of property issues involves “orphan works,” which are entities for which the copyright status cannot be determined, or if it can be determined, then the owner cannot be identified, cannot be reached, or does not respond. Orphaned data may be the next frontier of rights issues.
The ethics of what can be treated as data, how, when, and why are evolving rapidly as digital records become the norm and as data mining becomes more sophisticated. Data about people are among the most contentious.
Data need not have a material form to have a real existence. They can be ephemeral and fleeting. Data, per se, are difficult to exchange. It is representations of data, or inscriptions of data, that are exchanged or made mobile. External influences on research practice, such as economics and values, property rights, and ethics, can be significant determinants of what entities become data in any context.