Provenance

W3C Incubator Report

Provenance XG Final Report

W3C Incubator Group Report 08 December 2010

This Version:
http://www.w3.org/2005/Incubator/prov/XGR-prov-20101214/
Latest Published Version:
http://www.w3.org/2005/Incubator/prov/XGR-prov/
Editors:
Yolanda Gil, University of Southern California Information Sciences Institute (USC / ISI)
James Cheney, University of Edinburgh (School of Informatics)
Paul Groth, VU University Amsterdam
Olaf Hartig, Humboldt-Universität zu Berlin
Simon Miles, King's College London
Luc Moreau, University of Southampton
Paulo Pinheiro da Silva, Rensselaer Polytechnic Institute
Contributors:
Sam Coppens, IBB TELIS, Ghent University
Daniel Garijo, Universidad Politécnica de Madrid
Jose Manuel Gomez, Isoco
Paolo Missier, University of Manchester
Jim Myers, RPI
Satya Sahoo, Case Western Reserve University
Jun Zhao, University of Oxford

Copyright © 2010 W3C® (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark and document use rules apply.


Abstract

Given the increased interest in provenance in the Semantic Web area and in the Web community at large, the W3C established the Provenance Incubator Group as part of the W3C Incubator Activity with a charter to provide a state-of-the-art understanding and develop a roadmap in the area of provenance and possible recommendations for standardization efforts. This document summarizes the findings of the group.

Status of This Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of Final Incubator Group Reports is available. See also the W3C technical reports index at http://www.w3.org/TR/.

This document was published by the W3C Provenance Incubator Group as an Incubator Group Report. If you wish to make comments regarding this document, please send them to [email protected]. All feedback is welcome.

Publication of this document by W3C as part of the W3C Incubator Activity indicates no endorsement of its content by W3C, nor that W3C has, is, or will be allocating any resources to the issues addressed by it. Participation in Incubator Groups and publication of Incubator Group Reports at the W3C site are benefits of W3C Membership.

Incubator Groups have as a goal to produce work that can be implemented on a Royalty Free basis, as defined in the W3C Patent Policy. Participants in this Incubator Group have agreed to offer patent licenses according to the W3C Royalty-Free licensing requirements described in Section 5 of the W3C Patent Policy for any portions of the XG Reports produced by this XG that are subsequently incorporated into a W3C Recommendation produced by a Working Group which is chartered to take the XG Report as an input.

1. Introduction

Provenance refers to the sources of information, such as entities and processes, involved in producing or delivering an artifact. The provenance of information is crucial in deciding whether information is to be trusted, how it should be integrated with other diverse information sources, and how to give credit to its originators when reusing it. In an open and inclusive environment such as the Web, users find information that is often contradictory or questionable. People make trust judgments based on provenance that may or may not be explicitly offered to them. Reasoners in the Semantic Web would benefit from explicit representations of provenance to make informed trust judgments about the information they use. With the arrival of massive amounts of Semantic Web data (e.g., Linked Open Data), information about the origin of that data, i.e., provenance, becomes an important factor in developing new Semantic Web applications. Therefore, a crucial enabler of Semantic Web deployment is the ability to explicitly express provenance in a form that is accessible and understandable to both machines and humans.

Provenance is concerned with a very broad range of sources and uses. For example, businesses may exploit provenance in their quality assurance procedures for manufactured processes. The provenance of a document or an image in terms of its origins and prior ownerships is crucial to describe publication rights and therefore to determine their legal use. In a scientific context, data is integrated depending on the collection and pre-processing methods used. Further, decisions about the validity or reliability of an experimental result are based on how each analysis step was carried out. In this context, provenance can enable reproducibility. Despite this diversity, there are many common threads underpinning the representation, capture, and use of provenance that need to be better understood to enable a new generation of Web applications enabled by provenance.

There are many pockets of research and development that have studied relevant aspects of provenance. The Semantic Web and agents communities have developed algorithms for reasoning about unknown information sources in a distributed network. Logic reasoners can produce justifications for how an answer was derived, and explanations that can help find and fix errors in ontologies. The information retrieval and argumentation communities have investigated ways to amalgamate alternative views and sources of contradictory and complementary information, taking into account its origins. The database and distributed systems communities have looked into the issue of provenance in their respective areas. Provenance has also been studied for workflow systems in e-Science as a means to represent the processes that generate new scientific results. Licensing standards bodies take into account the attribution of information as it is reused in new contexts. However, by and large, these results tend to be shared within their respective communities, without broad dissemination to, or uptake by, the Web community.

Given the increased interest in provenance in the Semantic Web area and in the Web community at large, the W3C established the Provenance Incubator Group as part of the W3C Incubator Activity with a charter to provide a state-of-the-art understanding and develop a roadmap in the area of provenance and possible recommendations for standardization efforts. This document summarizes the findings of the group.

2. What is provenance

Provenance is too broad a term to admit a single, universal definition. As with related terms such as "process", "accountability", "causality" or "identity", one can argue about its meaning forever (philosophers have indeed debated concepts such as identity and causality for thousands of years without converging). Our goal was to develop a working definition reflecting how the W3C Provenance Incubator Group views provenance in the context of the Web.

To develop this view, we first carried out the activities reported in the rest of this document. That is, we did not start out by trying to agree on a definition of provenance; rather, the group came to a shared view once we had a common background and context, based on months of discussions.

2.1 A Working Definition of Provenance

Provenance is a very broad topic that has many meanings in different contexts. The W3C Provenance Incubator Group developed a working definition of provenance on the Web:

Provenance of a resource is a record that describes entities and processes involved in producing and delivering or otherwise influencing that resource. Provenance provides a critical foundation for assessing authenticity, enabling trust, and allowing reproducibility. Provenance assertions are a form of contextual metadata and can themselves become important records with their own provenance.

2.2 Provenance, Metadata, and Trust

Provenance is often conflated with metadata and trust. These terms are related, but they are not the same.

2.2.1 Provenance and Metadata

Metadata is used to represent properties of objects (e.g. an image). Many of those properties have to do with provenance, so the two are often equated. How does metadata relate to provenance?

Descriptive metadata of a resource only becomes part of its provenance when one also specifies how it relates to the derivation of the resource. For example, a file can have a metadata property that states its size. This is not typically considered provenance information, since it does not relate to how the file was created. The same file can have metadata regarding its creation date, which would be considered provenance-relevant metadata. So even though a lot of metadata potentially has to do with provenance, the two terms are not equivalent. In summary, provenance is often represented as metadata, but not all metadata is necessarily provenance.
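
The distinction between descriptive and provenance-relevant metadata can be illustrated with a minimal Python sketch. The property names (`size_bytes`, `created`, `creator`, `derived_from`) and the set `PROVENANCE_RELEVANT` are hypothetical, chosen only for this example; they are not taken from any vocabulary discussed by the group.

```python
# Hypothetical set of metadata keys that say something about how a
# resource came to be (an assumption made for this illustration only).
PROVENANCE_RELEVANT = {"created", "creator", "derived_from"}


def split_metadata(metadata):
    """Split a metadata dict into (provenance-relevant, purely descriptive)."""
    provenance = {k: v for k, v in metadata.items() if k in PROVENANCE_RELEVANT}
    descriptive = {k: v for k, v in metadata.items() if k not in PROVENANCE_RELEVANT}
    return provenance, descriptive


file_metadata = {
    "size_bytes": 2048,        # descriptive: says nothing about origin
    "created": "2010-12-08",   # provenance-relevant: when the file was produced
    "creator": "alice",        # provenance-relevant: who produced it
}
provenance, descriptive = split_metadata(file_metadata)
```

In this sketch the file size ends up in `descriptive`, while the creation date and creator end up in `provenance`. Which keys count as provenance-relevant is a modelling decision and would differ between applications.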

2.2.2 Provenance and Trust

Trust is a term with many definitions and uses, but in many cases establishing trust in an object or an entity involves analyzing its origins and authenticity. How does trust relate to provenance?

Trust is often equated with provenance, and it is indeed related but it is not the same. Trust is derived from provenance and from other data quality metrics, and typically is a subjective judgment that depends on context and use. With provenance, the focus is on how to represent, manage, and use information about resource origins, but not on detailed approaches as to how trust may be derived from it. In essence, provenance is a platform for trust algorithms and approaches on the Web.

Authentication is often conflated with provenance because it leads to establishing trust. However, current authentication mechanisms, such as digital signatures and access control, address the verification of an identity or the access to a resource. Provenance information may be used for authentication purposes (for example, the creator of a document may provide a signature that can be verified by a third party), but it is only one component of authentication.

2.3 Alternative Views on Provenance

There are other definitions in the literature that emphasize different views of provenance. Other views include: 1) Provenance as Process, 2) Provenance as a Directed Acyclic Graph, 3) Why-Provenance, 4) Where-Provenance, 5) How-Provenance, 6) Provenance as Annotations, 7) Event oriented view. See Chapter 3 of the survey Foundations of Provenance on the Web for a description of these different views of provenance.

The following literature surveys summarize provenance research over the last 10 years:

3. Importance of provenance

In order to provide a better understanding of the need for provenance and its importance, the group carried out the following activities:

3.1 Original Use Cases

The group compiled 33 use cases with the aim of developing an understanding of the need for provenance in a variety of situations and application areas. Each use case follows a template that includes some background and current practices, a description of what can be achieved without the use of provenance and then with the use of some provenance technology.

The use cases covered a broad range of motivations for provenance. The topics included eScience, eGovernment, business, manufacturing, cultural heritage, library sciences, engineering design, emergency response, public policy, privacy, linked data, attribution, licensing, trust, metadata management, and others.

The use cases were reviewed by a team of curators and revised by the authors to ensure sufficient detail. Six use cases were merged as a result of this curation because they overlapped with other use cases.

3.2 Flagship Scenarios

After the analysis of the use cases and the identification of the dimensions of concern, the group decided that three broad motivating scenarios should be developed to concisely illustrate the need for provenance. These scenarios were designed to be representative of the larger set of use cases and cover a broad range of provenance issues:

  1. News Aggregator Scenario: a news aggregator site that assembles news items from a variety of sources (such as news sites, blogs, and tweets), where provenance records can help with verification, credit, and licensing
  2. Disease Outbreak Scenario: a data integration and analysis activity for studying the spread of a disease, involving public policy and scientific research, where provenance records support combining data from very diverse sources, justification of claims and analytic results, and documentation of data analysis processes for validation and reuse
  3. Business Contract Scenario: an engineering contract scenario, where the compliance of a deliverable with the original contract and the analysis of its design process can be done through provenance records

3.2.1 News Aggregator Scenario

Many web users would like to have mechanisms to automatically determine whether a web document or resource can be used, based on the original source of the content, the licensing information associated with the resource, and any usage restrictions on that content. Furthermore, in cases of mashed-up content it would be useful to be able to ascertain automatically whether or not to trust it by examining the processes that created, processed, and delivered it. To illustrate these issues, we present the following scenario of a fictitious website, BlogAgg, that aggregates news information and opinion from the Web.

BlogAgg aims to provide rich real-time news to its readers automatically. It does this by aggregating information from a wide variety of sources, including microblogging websites, news websites, publicly available blogs, and other sources of opinion. It is imperative for BlogAgg to present only credible and trusted content on its site to satisfy its customers, attract new business, and avoid legal issues. Importantly, it wants to ensure that the news it aggregates is correctly attributed to the right person so that they may receive credit. Additionally, it wants to present the most attractive content it can find, including images and video. However, unlike other aggregators, it wants to track many different web sites to try to find the most up-to-date news and information.

Unfortunately for BlogAgg, the source of the information is often not apparent from the data that it aggregates from the Web. In particular, it must employ teams of people to check that selected content is both high-quality and can be used legally. The site owner would like this quality control process to be handled automatically. It is also important to check that it can legally use any pictures or information, and credit appropriate sources.

For example, one day BlogAgg discovers that #panda is a trending topic on Twitter. It finds that a tweet, "#panda being moved from Chicago Zoo to Florida! Stop it from sweating http://bit.ly/ssurl", is being retweeted across many different microblogging sites. BlogAgg wants to find the original author of the microblog post who first got the word out. It would like to check whether that author is a trustworthy source and verify the news story. It would also like to credit the original source on its site, and in these credits it would like to include the email address of the originator if allowed by that person.

Following the shortened URL, BlogAgg discovers a site protesting the move of the panda. BlogAgg wants to determine what organization is responsible for the site so that its name can run next to the snippet of text that BlogAgg runs. In determining the snippet of text to use, BlogAgg needs to determine whether the text originated at the panda protest site or was quoted from another site. Additionally, the site contains an image of a panda that appears as if it is sweating. BlogAgg would like to automatically use a thumbnail version of this image on its site. It therefore needs to determine whether the image license allows this, or whether that license has expired and is no longer in force. Furthermore, BlogAgg would like to determine if the image was modified, by whom, and if the underlying image can be reused (i.e. whether the license of the image is actually correct). Additionally, it wants to find out whether any modifications were just touch-ups or were significant modifications. Using the various information about the content it has retrieved, BlogAgg creates an aggregated post. For the post, it provides a visual seal showing how much the site trusts the information. By clicking on the seal, the user can inspect how the post was constructed and from what sources, together with a description of how the trust rating was derived from those sources.

Note that BlogAgg would want to carry out this same process for thousands to hundreds of thousands of sites a day. It would want to automate its aggregation process as much as possible. It is important for BlogAgg to be able to detect when this aggregation process does not work, in particular when it cannot determine the origins of the content it uses. Such cases need to be flagged and reported to BlogAgg's operators.

The provenance issues highlighted in this scenario include:

  • checking licenses when reusing content
  • verifying that a document/resource complies with licensing policies of pieces it reused
  • integrating unstructured content, i.e., documents and media (in contrast with integrating structured data)
  • content aggregation: aggregating RSS feeds, or product information, or news (in the case of the scenario)
  • checking authority
  • recency of information
  • verification of original sources
  • conveying to an end user the derivation of a source of information
  • versioning and evolution of unstructured content, whether documents or other media (e.g., images)
  • tracking use/reuse of content
  • scalable provenance management
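
Two of the checks in this scenario, finding the originator of a reposted item and testing whether a license is still in force, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the record layout (`author`, `copied_from`, `allows_reuse`, `expires`), the identifiers, and the function names are all hypothetical, and real microblogging services do not necessarily expose such links.

```python
from datetime import date

# Hypothetical provenance records: each item names the item it was copied
# from (None marks an original post).
records = {
    "tweet3": {"author": "carol", "copied_from": "tweet2"},
    "tweet2": {"author": "bob", "copied_from": "tweet1"},
    "tweet1": {"author": "alice", "copied_from": None},
}


def find_originator(item_id, records):
    """Follow copied_from links back to the original item.

    Returns the id of the original item, or None when the chain is broken
    or cyclic, so that the item can be flagged for a human operator.
    """
    seen = set()
    while item_id is not None:
        if item_id in seen or item_id not in records:
            return None  # cycle or missing record: origin cannot be determined
        seen.add(item_id)
        parent = records[item_id]["copied_from"]
        if parent is None:
            return item_id
        item_id = parent
    return None


def license_in_force(license_info, today):
    """True when the license permits reuse and has not expired."""
    expires = license_info.get("expires")
    allows_reuse = license_info.get("allows_reuse", False)
    return allows_reuse and (expires is None or today <= expires)
```

Returning `None` rather than guessing reflects the scenario's requirement that content whose origin cannot be determined is flagged and reported instead of being published.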

3.2.2 Disease Outbreak Scenario

Many uses of the Web involve the combination of data from diverse sources. Data can only be meaningfully reused if the collection processes are exposed to users. This enables the assessment of the context in which the data was created, its quality and validity, and the appropriate conditions for use. Often data is reused across domains of interest that end up mutually influencing each other. This scenario focuses on the reuse of data across disciplines in both anticipated and unanticipated ways.

Alice is an epidemiologist studying the spread of a new disease called owl flu (a fictitious disease made up for this example), with support from a government grant. Many studies relevant to public policy are funded by the government, with the expectation that the conclusions will provide guidance for public policy decision-makers. These decisions need to be justified by a cost-benefit analysis. In practice, this means that results of studies not only need to be scientifically valid, but the source data, intermediate steps and conclusions all need to be available for other scientists or non-experts to evaluate and reproduce. In the United Kingdom, for example, there are published guidelines that social scientists like Alice are required to follow in reporting their results.

Alice's study involves collecting reports from hospitals and other health services as well as recruiting participants, distributing and collecting surveys, and performing telephone interviews to understand how the disease spreads. The data collected through this process are initially recorded on paper and then transcribed into an electronic form. The paper records are confidential but need to be retained for a set period of time.

Alice will also use data from public sources (Data.gov, Data.gov.uk), or repositories such as the UK Data Archive (data-archive.ac.uk). Currently, a number of e-Social Science archives (such as NeISS, e-Stat and NESSTAR) are being developed for keeping and adding value to such data sets, which Alice may also use. Alice may also seek out "natural experiment" data sources that were originally gathered for some other purpose (such as customer sales records, Tweets on a given day, or free geospatial data).

Once the data are collected and transcribed, Alice processes and interprets the results and writes a report summarizing the conclusions. Processing the data is labor-intensive and may involve a combination of many tools, ranging from spreadsheets and generic statistics packages (such as SPSS or Stata) to analysis packages that cater specifically to Alice's research area. Some of the data sources may also provide querying or analysis facilities that Alice uses at the time of obtaining the data.

Alice may find many challenges in integrating the data in different sources, since units or semantics of fields are not always documented. When different data sources use different representations, Alice may need to recode or integrate this data by hand or by guessing an appropriate conversion function - subjective choices that may need to be revisited later. To prepare the final report and make it accessible to other scientists and non-experts, Alice may also make use of advanced visualization tools, for example to plot results on maps.

The conclusions of the report may then be incorporated into policy briefing documents used by civil servants or experts on behalf of the government to identify possible policy decisions that will be made by (non-expert) decision makers, to help avoid or respond to future outbreaks of owl flu. This process may involve considering hypotheticals or reevaluating the primary data using different methodologies than those applied originally by Alice. The report and its linked supporting data may also be provided online for reuse by others or archived permanently in order to permit comparing the predictions made by the study with the actual effects of decisions.

Bob is a biologist working to develop diagnostic and chemotherapeutic targets in the human pathogen responsible for owl flu. Bob's experimental data may be combined with data produced in Alice's epidemiological study, and Bob's level of trust in this data will be influenced by the detail and consistency of the provenance information accompanying Alice's data. Bob generates data using different experiment protocols such as expression profiling, proteome analysis, and creation of new strains of pathogens through gene knockout. These experiment datasets are also combined with data from external sources such as biology databases (NCBI Entrez Gene/GEO, TriTrypDB, EBI's ArrayExpress) and information from biomedical literature (PubMed) that have different curation methods and quality associated with them. Biologists need to judge the quality, timeliness and relevance of these sources, particularly when data needs to be integrated or combined.

These sources are used to perform "in silico" experiments via scientific workflows or other programming techniques. These experiments typically involve running several computational steps, for example running machine learning algorithms to cluster gene expression patterns or to classify patients based on genotype and phenotype information. The results need to meet a high standard of evidence so that the biologists can publish them in peer-reviewed journals. Therefore, justification information documenting the process used to derive the results is required. This information can be used to validate the results (by ensuring that certain common errors did not occur), to track down obvious errors or understand surprising results, to understand how different results arising from similar processes and data were obtained, and to reproduce results.

As more data on owl flu outbreaks and treatments become available over time, Alice's epidemiological studies and Bob's research on the behavior of the owl flu virus will need to be updated by repeating the same analytical processes incorporating the new data.

This scenario illustrates several distinctive aspects of provenance:

  • data integration: combining structured and unstructured data, data from different sources, linked data
  • archiving: understanding how data sources evolve over time through provenance and versioning information
  • justification: summarizing provenance records and other supporting evidence for high-level decision making
  • reuse: using data or analytic products published by others in a new context
  • repeatability: using provenance to rerun prior analyses with new data
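
The repeatability aspect can be sketched as follows: if the analysis is recorded as an ordered list of named steps, the same steps can be rerun when new data arrive. This is a minimal Python sketch, not a workflow system; the step names, the sample values, and the function `run_pipeline` are hypothetical and chosen only for illustration.

```python
def run_pipeline(steps, data):
    """Apply each named step in order and record what each step produced."""
    log = []
    for name, func in steps:
        data = func(data)
        log.append({"step": name, "output": data})
    return data, log


# The recorded analysis: an ordered list of (name, function) pairs.
steps = [
    ("drop_missing", lambda rows: [r for r in rows if r is not None]),
    ("mean", lambda rows: sum(rows) / len(rows)),
]

# First run on the data available at the time of the original study.
result_original, log_original = run_pipeline(steps, [4, None, 6])

# Later run: the same recorded steps applied to an extended data set.
result_updated, log_updated = run_pipeline(steps, [4, None, 6, 8])
```

The log produced by each run is itself a simple provenance record: it names the steps in the order they were applied and keeps the intermediate outputs, which supports both validation and rerunning.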

3.2.3 Business Contract Scenario

In scientific collaborations and in business, individual entities often enter into some form of contract to provide specific services and/or to follow certain procedures as part of the overall effort. Proof that work was performed in conformance with the contract is often required in order to receive payment and/or to settle disputes. Such proof must, for example, document work that was performed on specific items (samples, artifacts), provide a variety of evidence that would preclude various types of fraud, allow a combination of evidence from multiple witnesses, and be robust enough to allow provision of partial information to protect privacy or trade secrets. To illustrate such demands of proof, and the other requirements which stem from having such information, we consider the following use case.

Bob's Website Factory (BWF) is a fictitious company that creates websites which include secured functionality, e.g. for payments. Customers choose a template website structure, enter specifications according to a questionnaire, upload company graphics, and BWF will create an attractive and distinct website. The final stage of purchasing a website involves the customer agreeing to a contract setting out the respective responsibilities of the customer and BWF. The final contract document, including BWF's digital signature, will be automatically sent by email to both parties.

BWF has agreed a contract with a customer, Customers Inc., for the supply of a website. Customers Inc. are not happy with the product they receive, and assert that contractual requirements on quality were not met. Specifically, BWF finds it must defend itself against claims that work to create a site to the specifications was not performed or was performed by someone without adequate training, and that the security of payments to the site is faulty because proper quality control and testing procedures were not followed. Finally, Customers Inc. claim that records were tampered with to remove evidence of the latter improper practices.

BWF wish to defend themselves by providing proof that the contract was executed as agreed. However, they have concerns about what information to include in such a proof. Many websites are not designed from scratch but are based on an existing design in response to the customer's requests or problems. Also, sometimes parts of the designs of multiple different sites, designed for other customers, are combined to make a new website. Because it must protect both its own intellectual property and confidential information regarding other customers, BWF wishes to reveal only the information needed to defend against Customers Inc.'s claims.

There are many kinds of objects relevant to describing BWF's development processes, from source code text in programs to images. To provide adequate proof, BWF may need to include documentation on all of these different objects, making it possible to follow the development of a final design through its multiple stages. The contract number of a site is not enough to unambiguously identify that artifact, as designs move through multiple versions: the proof requires showing that a site was not developed from a version of a design which did not meet requirements or which used security software known to be faulty.

With regard to showing that there was adequate quality control and testing of the newly designed or redesigned sites, BWF needs to demonstrate that a design was approved in independent checks by at least two experts. In particular, the records should show that the experts were truly independent in their assessments, i.e. they did not both base their judgement on one, possibly erroneous, summarized report of the design.

Finally, Customers Inc. claim that there is a discrepancy in the records, suggesting tampering by BWF to hide their incompetence, as the development division apparently claimed to have received instructions from the experts checking the design that it was OK before the experts themselves claim to have supplied such a report. BWF suspects, and wishes to check, that this is due to a difference in semantics between the reported times, e.g. in one case it regards the receipt of the report by the developers, in the other it regards the receipt of the acknowledgement of the report from the developers by the experts. These reports should be shared in a format that both parties understand.
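
The suspected difference in semantics can be made concrete with a small Python sketch. It assumes, purely for illustration, that each reported time is labelled with the event it refers to and that the four events of one report exchange must occur in a fixed order; the event names, the times, and the function `is_consistent` are hypothetical.

```python
from datetime import datetime

# Assumed order in which the four events must occur for one report.
EXPECTED_ORDER = ["report_sent", "report_received", "ack_sent", "ack_received"]


def is_consistent(timestamps):
    """Check that the recorded events respect EXPECTED_ORDER.

    timestamps maps an event name to the time it was recorded; events that
    were not recorded are skipped.
    """
    known = [timestamps[e] for e in EXPECTED_ORDER if e in timestamps]
    return all(earlier <= later for earlier, later in zip(known, known[1:]))


ten = datetime(2010, 5, 3, 10, 0)
ten_thirty = datetime(2010, 5, 3, 10, 30)

# Reading both times as being about the report itself looks like tampering:
# the developers received the report before the experts sent it.
naive_reading = {"report_received": ten, "report_sent": ten_thirty}

# Reading the experts' time as receipt of the acknowledgement is consistent.
labelled_reading = {"report_received": ten, "ack_received": ten_thirty}
```

Here `is_consistent(naive_reading)` is false while `is_consistent(labelled_reading)` is true, which shows why a shared format needs to state what each time refers to and not only its value.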

The provenance issues highlighted in this scenario include:

  • Checking whether past actions comply with stated obligations
  • Understanding how one product is derived from another
  • Filtering the information revealed in provenance by privacy and ownership concerns
  • Discovering where two sources have a hidden mutual dependency
  • Resolving apparent inconsistencies in multiple accounts of the same event
  • Verifying that those who performed actions had the expertise or authority to do so

4. Requirements for provenance

As seen in the previous section, provenance touches on many different domains and applications. Each of these has different requirements for provenance. Here, we present the requirements extracted by the group from the collected use cases. The group produced several detailed documents:

  • The incubator group also collected 140 technical requirements. We refer to these technical requirements indirectly, but we do not include them here.

4.1 Provenance Dimensions

The group found it useful to organize requirements and use cases in terms of key dimensions that concern provenance. An overview of these dimensions is shown in the following table:

Category    Dimension         Description
Content     Object            The artifact that a provenance statement is about.
Content     Attribution       The sources or entities that contributed to create the artifact in question.
Content     Process           The activities (or steps) that were carried out to generate or access the artifact at hand.
Content     Versioning        Records of changes to an artifact over time and what entities and processes were associated with those changes.
Content     Justification     Documentation recording why and how a particular decision is made.
Content     Entailment        Explanations showing how facts were derived from other facts.
Management  Publication       Making provenance available on the Web.
Management  Access            The ability to find the provenance for a particular artifact.
Management  Dissemination     Defining how provenance should be distributed and its access be controlled.
Management  Scale             Dealing with large amounts of provenance.
Use         Understanding     How to enable the end user consumption of provenance.
Use         Interoperability  Combining provenance produced by multiple different systems.
Use         Comparison        Comparing artifacts through their provenance.
Use         Accountability    Using provenance to assign credit or blame.
Use         Trust             Using provenance to make trust judgments.
Use         Imperfections     Dealing with imperfections in provenance records.
Use         Debugging         Using provenance to detect bugs or failures of processes.

4.2 Content Requirements

Content refers to the types of information that would need to be represented in a provenance record. That is, what structures and attributes would need to be defined in order to contain the kinds of provenance information that we envision need to be captured.

We need to be able to establish the artifact or object that statements of provenance are about, and be able to refer to that object. This object can be a variety of things. On the Web, this will be a web resource, essentially anything that can be identified with a URI, such as web documents, datasets, assertions, or services. Provenance may refer to aspects or portions of an object. For example, objects may be organized in collections, then subgroups selected, then portions of some objects modified, etc.

Attribution is a critical component of provenance. It refers to the sources (i.e., typically any web resource that has an associated URI, such as documents, web sites, or data) or entities (i.e., people, organizations, and other identifiable groups) that contributed to the creation of the artifact in question. In addition, the provenance representation of attribution should also enable us to see the true origin of any statement of attribution to an entity. Technically, this may require that the statement be verified through the use of an authentication system (perhaps using digital signatures). Additionally, one may want to know whether a statement was made by the original entity or was reconstructed and then asserted by a third party. Attribution may also require some form of anonymization, for privacy or identity protection reasons.
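
One way to keep the origin of an attribution statement visible is to record who asserted it alongside who is being credited. The Python sketch below is only an illustration of that idea: the class `Attribution`, its field names, and the example.org URIs are hypothetical, and the sketch does not verify statements (that would require the authentication mechanisms mentioned above).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribution:
    """One statement crediting a contributor with an artifact."""

    artifact: str     # URI of the artifact the statement is about
    contributor: str  # URI of the person, organization, or source credited
    asserted_by: str  # URI of whoever made this attribution statement

    def is_first_hand(self) -> bool:
        """True when the credited contributor made the statement themselves."""
        return self.contributor == self.asserted_by


# Statement made by the contributor herself.
first_hand = Attribution(
    artifact="http://example.org/report",
    contributor="http://example.org/people/alice",
    asserted_by="http://example.org/people/alice",
)

# The same credit, reconstructed and asserted by a third party.
reconstructed = Attribution(
    artifact="http://example.org/report",
    contributor="http://example.org/people/alice",
    asserted_by="http://example.org/aggregator",
)
```

Keeping `asserted_by` separate from `contributor` lets a consumer treat first-hand and reconstructed attributions differently, for example when deciding how much weight to give them.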

Process refers to the activities (or steps) that were carried out to generate the artifact. These activities encompass the execution of a computer program that we can explicitly point to, a physical act that we can only refer to, and some action performed by a person that can only be partially represented. Provenance representations should represent how activities are related to form a process. Provenance information may need to refer to descriptions of the activities, so that it becomes possible to support reasoning about the processes involved in provenance and support descriptive queries about them. Processes can be represented at a very abstract level, focusing only on important aspects, or at a very fine-grained level, including minute details sufficient to enable exact reproduction of the process.

Dealing with evolution and versioning is a critical requirement for a provenance representation. As an artifact evolves over time, its provenance should be augmented in specific ways that reflect the changes made over prior versions and what entities and processes were associated with those changes. When one has full control over an artifact and its provenance records this may be a simple matter of good record keeping, but this is a challenge in open distributed environments such as the Web. Consider the representation of provenance when republishing, for example by retweeting, reblogging, or repackaging a document. It should also be possible to represent when a set of changes warrants the designation of a new version of the object in question. An important aspect to consider is how resource access properties, such as the access time, the server accessed, the party accessing the resource, and the party responsible for the server, impact the version of an artifact.
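
As a minimal sketch of recording the changes made over a prior version, assume that a resource description is simply a set of (subject, predicate, object) triples. The names `Change`, `diff_versions`, and `apply_change`, as well as the `ex:` identifiers, are hypothetical and are not taken from any vocabulary discussed in this report.

```python
from typing import FrozenSet, NamedTuple, Set, Tuple

Triple = Tuple[str, str, str]


class Change(NamedTuple):
    """Difference between two versions of a description (hypothetical type)."""
    additions: FrozenSet[Triple]
    removals: FrozenSet[Triple]


def diff_versions(old: Set[Triple], new: Set[Triple]) -> Change:
    """Return the triples added to and removed from `old` to obtain `new`."""
    return Change(frozenset(new - old), frozenset(old - new))


def apply_change(old: Set[Triple], change: Change) -> Set[Triple]:
    """Rebuild the newer version from the older one plus the recorded change."""
    return (set(old) - change.removals) | change.additions


# Two illustrative versions of the same description.
v1 = {("ex:doc", "ex:title", "Draft"), ("ex:doc", "ex:author", "ex:alice")}
v2 = {("ex:doc", "ex:title", "Final"), ("ex:doc", "ex:author", "ex:alice")}
change = diff_versions(v1, v2)
```

Storing only the change keeps each version's record small, at the cost of having to replay changes to reconstruct an older or newer version. Whether a given change amounts to a new version remains a policy decision that this sketch does not address.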

A particular kind of provenance information is justifications of decisions. The purpose of a justification is to allow those decisions to be discussed and understood. Justifications should be backed up by supporting evidence, which needs to be collected according to a well-defined procedure, ideally in an automatic fashion. It is important to capture both the arguments for and against particular conclusions as well as the ability to capture the evidence behind particular hypotheses. Technically, this may require systems that are provably correct and cater to long-term preservation. Additionally, if justifications are based on external information, systems must have a mechanism to exchange provenance information.

Inference may be required to derive information from the original provenance records. Some provenance information may be directly asserted by the relevant sources of some data or actors in a process, while other information may be derived from that which was asserted. In general, one fact may entail another, and this is important in the case of provenance data, which inherently describes the past, for which the majority of facts may not now be known. It is also important to capture the assumptions that were used when performing inference. For example, an RDF triple store that supports inference may need the capability to provide the provenance of the results given for a SPARQL query.
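
The distinction between asserted and derived information can be illustrated with a small sketch. It assumes, purely for illustration, that derivation between artifacts is transitive; that assumption is itself part of what would need to be recorded. The function name `infer_derivations` and the `ex:` identifiers are hypothetical.

```python
def infer_derivations(asserted):
    """Label each derivation as 'asserted' or 'inferred'.

    `asserted` is a set of (derived, source) pairs. Assumption: derivation
    is treated as transitive, which a real application may not accept.
    """
    status = {pair: "asserted" for pair in asserted}
    changed = True
    while changed:  # repeat until no new pair can be added
        changed = False
        for derived, middle in list(status):
            for start, source in list(status):
                if middle == start and (derived, source) not in status:
                    status[(derived, source)] = "inferred"
                    changed = True
    return status


asserted = {("ex:report", "ex:chart"), ("ex:chart", "ex:dataset")}
status = infer_derivations(asserted)
# status now also contains ("ex:report", "ex:dataset"), marked "inferred".
```

Keeping the label next to each statement lets a consumer decide whether to rely on inferred statements at all.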

4.3 Management Requirements

Provenance management refers to the mechanisms that make provenance available and accessible in a system.

An important issue is the publication of provenance. Provenance information must be made available on the Web. Related issues include how provenance is exposed, discovered, and distributed. A transparent provenance representation language must be chosen and made available so others can refer to it in interpreting the provenance. The publisher of provenance information should be associated with provenance records. Technically, this requires tools to enable such publication.

Once provenance is available, it must be accessible. That is, it must be possible to find it by specifying the artifact of interest. In some cases, it must be possible to determine what the authoritative source of provenance is for some class of entities. Query formulation and execution mechanisms must be defined for provenance representation.

In realistic settings, provenance information will have to be subject to dissemination control. Provenance information may be associated with access policies about what aspects of provenance can be made available given a requestor's credentials. Provenance may have associated use policies about how an artifact can be used given its origins. This may include licensing information stated by the artifact's creators regarding what rights of use are granted for the artifact. Finally, provenance information may be withheld from access for privacy protection or intellectual property protection. Dissemination control must also take into account how security can be integrated.
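
A very small sketch of an access policy over provenance statements follows. It assumes a single ordered set of roles and one required role per statement; real policies would likely be richer. The class `ProvenanceStatement`, the role names, and the `ex:` identifiers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProvenanceStatement:
    """One provenance statement plus the minimum role allowed to see it."""
    subject: str
    relation: str
    value: str
    required_role: str = "public"


# Hypothetical roles, ordered from least to most privileged.
ROLE_RANK = {"public": 0, "partner": 1, "auditor": 2}


def visible_statements(statements, requester_role):
    """Return only the statements the requester's role is allowed to see."""
    rank = ROLE_RANK[requester_role]
    return [s for s in statements if ROLE_RANK[s.required_role] <= rank]


records = [
    ProvenanceStatement("ex:part42", "ex:designedBy", "ex:acme"),
    ProvenanceStatement("ex:part42", "ex:testedAt", "ex:lab7", "partner"),
    ProvenanceStatement("ex:part42", "ex:inspectedBy", "ex:bob", "auditor"),
]
public_view = visible_statements(records, "public")
auditor_view = visible_statements(records, "auditor")
```

Note that a filtered view is an incomplete provenance record from the consumer's point of view, which connects dissemination control to the handling of imperfections discussed under use requirements.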

The scale of provenance information is a major concern, as the size of the provenance records may by far exceed the scale of the artifacts themselves. Despite the presence of large amounts of provenance, efficient access to provenance records must be possible. Tradeoffs must be made with respect to the granularity of the provenance records kept and the actual amount of detail needed by users of provenance.

4.4 Use Requirements

We need to take into account requirements for provenance based on the use of any provenance information that we have recorded. The same provenance records may need to accommodate a variety of uses as well as diverse users/consumers.

Important considerations are how to make provenance information understandable to its users/consumers and usable. Just because the information they need is recorded in the provenance representation does not mean that they would be able to use it for their purposes. An important challenge that we face is how to allow for multiple levels of abstraction in the provenance records of an artifact as well as multiple perspectives or views concerning such provenance. In addition, appropriate presentation and visualization of provenance information is an important consideration, as users will likely prefer something other than a set of provenance statements. To achieve understandability and usability, it is important to be able to combine general provenance information with domain-specific information.

Because provenance information may be obtained from heterogeneous systems and different representations and used across multiple applications, interoperability is an important requirement. A query may be issued to retrieve information from provenance records created by different systems that then need to be integrated. At a finer grain, the provenance of a given artifact may be specified by multiple systems and need to be combined. Users may want to also know what sources contributed specific provenance statements, so they can make decisions in case of conflicts.

Another important use of provenance is for comparison of artifacts based on their origins. Two artifacts may seem very different while their provenance may indicate significant commonalities. Conversely, two artifacts may seem alike, and their provenance may reveal important differences. Technically, this would require that the underlying provenance representation be amenable to comparison, for example, through graph comparison.
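
As one simple stand-in for graph comparison, the sketch below measures how many ancestors two artifacts share. It assumes provenance is available as a mapping from each artifact to its direct sources, and it uses set overlap (a Jaccard ratio) rather than any full graph-matching technique. All names and identifiers are hypothetical.

```python
def ancestors(graph, artifact):
    """Collect every direct or indirect source of `artifact`."""
    seen = set()
    stack = list(graph.get(artifact, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, ()))
    return seen


def provenance_overlap(graph, a, b):
    """Share of ancestors that two artifacts have in common (0.0 to 1.0)."""
    anc_a, anc_b = ancestors(graph, a), ancestors(graph, b)
    union = anc_a | anc_b
    return len(anc_a & anc_b) / len(union) if union else 0.0


# Two artifacts that look different but share one source.
graph = {
    "ex:map": ["ex:census", "ex:boundaries"],
    "ex:table": ["ex:census", "ex:survey"],
}
overlap = provenance_overlap(graph, "ex:map", "ex:table")
```

Here the map and the table share one of three distinct ancestors. A measure of this kind ignores the processes involved, so it can only hint at commonalities that a fuller comparison would have to confirm.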

Provenance data can be used for accountability. Specifically, accountability may mean allowing users to verify that work performed meets a contract decided upon earlier, determining the license that a composite object has due to the licenses of its components, or checking that an account of the past suggested by a provenance record is compliant with regulations. Accountability requires that the users can rely on the provenance record and authenticate its sources.

A very important use of provenance is trust. Provenance information is used to make trust judgments on a given entity. Trust is often based on attribution information, by checking the reputation of the entities involved in the provenance, perhaps based on past reliability ratings, known authorities, or third-party recommendations. Similarly, measures of the relative information quality can be used to choose among competing evidence from diverse sources based on provenance. Finally, users should be able to access and understand how trust assessments are derived from provenance.

Using provenance information may imply handling imperfections. Provenance information may be incomplete in that some information may be missing, or incorrect if there are errors. Provenance information may also be provided with some uncertainty or be of a probabilistic nature. These imperfections may be caused by problems with the recording of the provenance information but they may also arise because the user does not have access to the complete and accurate provenance records even if they exist. This would be the case when provenance is summarized or compressed, in which case many details may be abstracted away or missing, or when the user does not have the right permissions to access some aspects of the provenance records, potentially because of privacy concerns. Finally, provenance may also be subject to deception and be fraudulent, partially or in its entirety.

Another use of provenance is debugging. Users may want to detect failure symptoms in the provenance records and diagnose problems in the process that generated an artifact, whether conducted in a software system or by people. For example, users may want to detect when an error was produced because of the use of a common source of information. Debugging may also involve comparison of records from multiple witnesses, e.g. a workflow engine and a computer operating system, to catch instances where fine-grained records are inconsistent with the coarser-grained view.
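
The comparison of records from two witnesses can be sketched as follows. The sketch assumes that the coarse-grained witness reports a time window per step and that the fine-grained witness reports timestamped events tagged with a step name; both formats, and the function name `inconsistent_events`, are hypothetical.

```python
def inconsistent_events(step_windows, events):
    """Return fine-grained events that contradict the coarse-grained record.

    `step_windows` maps a step name to its (start, end) time in seconds.
    `events` is a list of (step name, timestamp) pairs from another witness.
    """
    problems = []
    for step, timestamp in events:
        window = step_windows.get(step)
        # Flag events for unknown steps or outside the reported window.
        if window is None or not (window[0] <= timestamp <= window[1]):
            problems.append((step, timestamp))
    return problems


workflow_view = {"align": (100, 200)}
os_view = [("align", 150), ("align", 250), ("plot", 120)]
problems = inconsistent_events(workflow_view, os_view)
# problems == [("align", 250), ("plot", 120)]
```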

4.5 Requirements on RDF for supporting Provenance

The group found that the following requirements should be considered in the context of future extensions of RDF:

  • Identity -- A key challenge is to be able to refer to the artifact that we are describing the provenance for. Within the RDF context, the artifact could be a single RDF statement, a set of statements or an arbitrary set of Web resources.
  • Evolution -- An important requirement is the ability to describe the provenance of a dynamic, evolving resource. Over time, there may be updates and even new versions that change some aspect of the resource. A challenge is to describe how the new incarnations of the resource relate to one another, and to determine whether provenance records should be self-contained and attached to each incarnation, or instead refer to prior ones for details. As resources may be republished, perhaps repackaged, summarized, or mixed, their provenance records need to reflect such processes and their implications on the contents.
  • Entailment -- Another important requirement is the ability to distinguish what is directly asserted by the entities and processes that produce the resource from other information that may be inferred from those assertions or perhaps derived or hypothesized by a third party.
  • Publication -- A publisher of provenance information needs to use some provenance representation language and link the provenance assertions to the actual resource information. Publishers may choose to publish only a subset of the provenance records, and should be able to identify themselves possibly with a signature that is verifiable by others.
  • Querying -- Provenance information may be made accessible in some manner, and there must be mechanisms to find the provenance for a given resource. Query formulation and execution must be provided for provenance information. Ideally, there should be a convenient way to formulate queries that span primary and provenance information.

Based on these requirements, the group argued for additional desirable capabilities that the current standard RDF model does not offer, including proper identification of RDF statements and an annotation framework permitting a standard approach for linking meta-information like provenance with sets of RDF triples. The group also argued for the development of a common approach to exchange provenance information between systems and publish it on the Web. This requirement would begin to address the remaining three provenance requirements. Later in this report, we discuss the basis for the development and specification of such an approach.
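
To make the idea of linking provenance to sets of triples concrete, the following sketch groups triples under a graph identifier and attaches provenance statements to that identifier. This is only one possible design, chosen here as an assumption for illustration; the quad layout, the `ex:` identifiers, and the function name are hypothetical and no RDF library is used.

```python
# Each entry is (subject, predicate, object, graph identifier).
quads = [
    ("ex:outbreak1", "ex:caseCount", "42", "ex:graph1"),
    ("ex:outbreak1", "ex:region", "ex:north", "ex:graph1"),
    ("ex:outbreak1", "ex:caseCount", "57", "ex:graph2"),
    # Provenance statements are about the graph identifiers themselves.
    ("ex:graph1", "ex:publishedBy", "ex:agencyA", "ex:meta"),
    ("ex:graph2", "ex:publishedBy", "ex:agencyB", "ex:meta"),
]


def values_with_publisher(quads, subject, predicate):
    """Answer a query that spans primary data and its provenance."""
    publisher = {s: o for s, p, o, g in quads if p == "ex:publishedBy"}
    return [
        (o, publisher.get(g))
        for s, p, o, g in quads
        if s == subject and p == predicate
    ]


answers = values_with_publisher(quads, "ex:outbreak1", "ex:caseCount")
# answers == [("42", "ex:agencyA"), ("57", "ex:agencyB")]
```

The two conflicting case counts can be told apart only because each set of triples is identifiable and carries a statement about its publisher.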

4.6 Summary

The provenance dimensions described here summarize the requirements that users have with respect to provenance. Users need to be able to point to and ask for the provenance of an object. They want to know who is responsible for information and if there is adequate justification for that information. They may want to understand how information was processed to produce a particular query result or artifact and how information evolves over time. They want to use provenance information to ascertain trust, compare artifacts, debug programs, and hold other parties to account. They need easy access to large amounts of provenance information across many different systems, and those systems should take into account their privacy when making provenance information available.

5. State of the art and technology gaps

Prior and ongoing work on provenance is spread out in many areas. The published literature on provenance is vast and rapidly growing, with a reported half of the articles published in the last two years (Moreau 2010).

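To obtain a comprehensive understanding of existing work on provenance the group carried out several activities, including:

  • Building an annotated bibliography of relevant publications and technologies.
  • Developing mappings between existing provenance vocabularies.
  • Analyzing the state of the art with respect to the three flagship scenarios.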

These materials provide a checkpoint on the state of the art of provenance during the Incubator Group's activities in 2009-2010, and provide a basis for the recommendations and future roadmap of this report.

This section summarizes the most salient results from these activities.

5.1 Provenance Bibliography

The group did not think it necessary to conduct a comprehensive literature review of provenance, since several surveys of provenance and trust have already been published. Instead, the group agreed to make a limited effort to create an annotated bibliography collection that would give concrete entry points to the literature in this area.

Driven by the three flagship scenarios and their requirements, the group constructed a shared bibliography of relevant publications and technologies. We then used tagging to organize the entries according to the three scenarios as well as the provenance dimensions above.

The bibliography collection can be browsed and viewed by author, keyword, type (journal versus conference), and year.

The entire set of references is available for download in bib format.

5.2 Provenance Vocabularies

Among the relevant technologies, a few stood out as covering approximately similar applications and provenance needs using roughly comparable ontologies or data models. These Provenance Vocabularies include:

  • Open Provenance Model. Outcome of the Provenance Challenge series (initiated in 2005), after discussion and consensus of part of the community. For this reason, and because it is general and broad enough, it was selected as the model to which the rest of the provenance vocabularies have been mapped. (The mappings will be further explained in the next section). OPM is used to describe histories in terms of processes (things happening), artifacts (what things happen to), and agents (what controls things happening). These three are kinds of nodes within a graph, where each edge denotes a causal relationship. Edges have named types depending on the kinds of node they relate: a process used an artifact; an artifact was generated by a process; one artifact was derived from another artifact; one process was triggered by another process; a process was controlled by an agent.
  • Provenir Ontology. Common provenance model, which forms the core component of a modular provenance management framework for eScience. Three base classes in the Provenir Ontology are used for representing the primary components of provenance, that is, "data" (continuant entities that represent the starting material, intermediate material, and end products of a scientific experiment), "agent" (which models the continuant entities that causally affect the individuals of process) and "process" (which models the occurrent entities that affect individuals of data).
  • Provenance Vocabulary. Developed to describe provenance of Linked Data on the Web. The Provenance Vocabulary is defined as an OWL ontology and it is partitioned into a core ontology and supplementary modules. To avoid making the core ontology too complex, the modules provide less frequently used concepts and a broad range of specializations of the core concepts. At present the Provenance Vocabulary provides three supplementary modules: Types, Files and Integrity Verification.
  • Proof Markup Language. PML is an interlingua for representing and sharing explanations generated by various intelligent systems such as hybrid web-based question answering systems, text analytic components, theorem provers, task processors, web services, rule engines, and machine learning components. The interlingua is split into three modules (provenance, justification, and trust relations) to reduce maintenance and reuse costs. While the provenance of PML in the Semantic Web is clear in the choice of names (e.g. InferenceEngines), there are numerous examples where PML has been applied to non-text data and non-logic-based processing and thus the term definitions do not appear restrictive.
  • Dublin Core. Dublin Core Metadata Terms provide a means to describe resources such that others will be able to interpret those descriptions. In particular, it provides a common vocabulary of core terms which can act as metadata keys, qualifications of those terms for specific applications, definitions of data types for the values of resource metadata, and so on. Amongst the terms available are many which relate to the provenance of the resource: who created it, when it was changed, etc.
  • PREMIS. Stands for "PREservation Metadata: Implementation Strategies". It is a data dictionary for supporting long-term preservation: defines a core set of semantic units that repositories should know in order to perform their preservation functions. It focuses on the provenance of the archived, digital objects (files, bitstreams, aggregations), not on the provenance of the descriptive metadata.
  • Web of Trust Schema (WOT). Schema designed to facilitate the use of Public Key Cryptography tools such as PGP or GPG to sign RDF documents and document these signatures.
  • Semantic Web Publishing Vocabulary. An RDF-Schema vocabulary for expressing information provision related meta-information and for assuring the origin of information with digital signatures.
  • Changeset Vocabulary. Describes changes to RDF-based resource descriptions. A resource description is a set of RDF triples that "in some way comprise a description of a resource." [Tunnicliffe and Davis, 2009] The change of a resource description is represented by a cs:ChangeSet entity which encapsulates the differences between two versions of the description. Such differences are represented by additions and removals of RDF triples.
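
The Open Provenance Model description above (three kinds of nodes, with edge types determined by the kinds of node they relate) can be sketched as a small data structure. The edge names follow OPM terminology, but the class `OpmGraph`, its methods, and the `ex:` identifiers are hypothetical and do not correspond to any existing OPM library.

```python
# Allowed edge types, each as (kind of effect node, kind of cause node).
EDGE_TYPES = {
    "used": ("process", "artifact"),
    "wasGeneratedBy": ("artifact", "process"),
    "wasDerivedFrom": ("artifact", "artifact"),
    "wasTriggeredBy": ("process", "process"),
    "wasControlledBy": ("process", "agent"),
}
NODE_KINDS = {"process", "artifact", "agent"}


class OpmGraph:
    """Minimal provenance graph that checks edge types against node kinds."""

    def __init__(self):
        self.kinds = {}   # node name -> node kind
        self.edges = []   # (effect, edge type, cause)

    def add_node(self, name, kind):
        if kind not in NODE_KINDS:
            raise ValueError(f"unknown node kind: {kind}")
        self.kinds[name] = kind

    def add_edge(self, effect, edge_type, cause):
        expected = EDGE_TYPES[edge_type]
        actual = (self.kinds[effect], self.kinds[cause])
        if actual != expected:
            raise TypeError(f"{edge_type} relates {expected}, got {actual}")
        self.edges.append((effect, edge_type, cause))


g = OpmGraph()
g.add_node("ex:photo", "artifact")
g.add_node("ex:thumbnail", "artifact")
g.add_node("ex:resize", "process")
g.add_node("ex:alice", "agent")
g.add_edge("ex:resize", "used", "ex:photo")
g.add_edge("ex:thumbnail", "wasGeneratedBy", "ex:resize")
g.add_edge("ex:thumbnail", "wasDerivedFrom", "ex:photo")
g.add_edge("ex:resize", "wasControlledBy", "ex:alice")
```

Each edge points from an effect to its cause, so following edges from `ex:thumbnail` walks backwards through its history.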

The group developed concrete mappings between the terms in these different provenance vocabularies. These mappings can help users better understand the similarities and differences between the provenance terminologies, facilitate the development of applications that can utilize the mappings for provenance interoperability, and enable the provenance research community to move towards the adoption of a common provenance terminology.

The mappings use the Open Provenance Model (OPM) as a reference vocabulary. The mappings between the provenance terms are formally encoded using the W3C recommended Simple Knowledge Organization System (SKOS) vocabulary. The rationale for the mappings was documented in detail.
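
How an application might use such mappings for interoperability can be sketched as a lookup from a term in another vocabulary to an OPM term. The mapping properties named below are standard SKOS mapping properties, but the mapping entries themselves are invented placeholders using a made-up `vocabA:` vocabulary; they are not the mappings produced by the group.

```python
# Hypothetical mapping triples: (source term, SKOS mapping property, OPM term).
MAPPINGS = [
    ("vocabA:Activity", "skos:closeMatch", "opm:Process"),
    ("vocabA:Item", "skos:exactMatch", "opm:Artifact"),
    ("vocabA:Reviewer", "skos:broadMatch", "opm:Agent"),
]

# Mapping properties treated as safe for direct substitution (an assumption).
STRONG_MATCHES = frozenset({"skos:exactMatch", "skos:closeMatch"})


def to_opm(term, mappings=MAPPINGS, accepted=STRONG_MATCHES):
    """Return the OPM term mapped to `term`, or None if no accepted mapping."""
    for source, relation, target in mappings:
        if source == term and relation in accepted:
            return target
    return None


strict = to_opm("vocabA:Reviewer")
loose = to_opm("vocabA:Reviewer", accepted=STRONG_MATCHES | {"skos:broadMatch"})
# strict is None; loose is "opm:Agent"
```

Making the accepted mapping properties a parameter lets each application decide how much loss of precision it tolerates when translating between vocabularies.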

5.3 Analysis of The State of the Art

We carried out detailed analyses of the requirements and provenance dimensions exhibited by the flagship scenarios, and the extent to which current practice and ongoing research addresses these needs. The State of the Art Report presents these detailed analyses along with references to the bibliography. Here, we summarize the discussion of how current technology is used to address these kinds of problems.

5.3.1 State of the Art Analysis of the News Aggregator Scenario

The News Aggregator Scenario envisions a system that can automatically tell where a piece of content (i.e. object) on the Web comes from and who is responsible for that content after it has been aggregated.

Aggregation Today

Content aggregation is widely used on the Web. Examples of content aggregation for news include sites such as The Huffington Post, Digg, and Google News. Personal aggregation is facilitated by feed technologies (RSS, Atom) and their associated readers (e.g. Google Reader). Newer aggregators can provide a merged view of content, thus hiding some of the provenance of the information to increase visual appeal. This is similar to what is envisioned in the News Aggregator scenario.

Tracking Content

A number of systems have looked at tracking content, in particular quotes, across the Web. Some sites provide ways for tracking distinctive phrases through the blogosphere. Other work has expanded on this to track how information is propagated through the network and thus which blogs and media have the greater influence. Similarly, there has also been work on influence in the microblogging service Twitter, including the difference between the influence of content and the influence of users. Researchers have also studied how the social networks of both Twitter and Digg impact the propagation of information through these networks.

Most of these systems rely on crawls of the Web that are produced specifically for each application. However, there are tools and services for producing these crawls, for example by periodically crawling and then analyzing blogs.

Need for Explicit Provenance

It is important to note that these systems deduce the provenance of an object (e.g. a piece of text) after the fact from crawled data. However, determining provenance after the fact is often difficult. For example, during the Iranian Green Revolution protests it was extremely difficult to determine the actual origins of tweets about what was happening on the ground. Later in 2009, Twitter launched its own service to explicitly capture the notion of retweeting. There is documented evidence of the difficulty in determining provenance for accountability and trust from crawled documents. There are mechanisms for explicit tracking of the origin of blogs and microblogs. These are used to notify a blog when another blog has linked to it. We note that these systems are isolated to specific technology platforms and do not encompass the whole of the Web.

Licensing

A crucial reason for tracking provenance in the News Aggregator Scenario is the ability to determine if an image (or other content) can be reused. There are sites to track and maintain licensing rights to music or books. Fundamental technology related to this is the digital representation of licenses.

5.3.2 State of the Art Analysis of the Disease Outbreak Scenario

Data provenance

Provenance is central to the requirement that scientific research be reproducible. Paper laboratory notebooks have been in use for hundreds of years as a primary means of recording provenance. In the last few decades, automated mechanisms including laboratory information management systems (LIMS), databases, and electronic notebooks have seen growing use. However, such systems have had significant adoption barriers and continue to have significant limitations, particularly as community-scale infrastructure. Tracking provenance across laboratories, sharing data with provenance, and integrating data from multiple sources remain very labor intensive today. Provenance records for reference information and community data sets are either produced by hand (i.e. by scientists filling in data entry forms on submission to a data archive or repository), produced by ad hoc applications developed specifically for a given kind of data, or not produced at all.

Within curated biological databases, provenance is often recorded manually in the form of human-readable change logs. Some scientists are starting to use wikis to build community databases.

There has been a stream of research on provenance in databases but so far little of it has been transferred into practice, in part because most such proposals are difficult to implement, and often involve making changes to the core database system and query language behavior. Such techniques have not yet matured to the point where they are supported by existing database systems or vendors, which limits their accessibility to non-researchers.

Workflow provenance and e-Science

On the other hand, a number of workflow management systems and Semantic Web systems have been developed and are in active use by communities of scientists. Many of these systems implement some form of provenance tracking internally, and have begun to standardize on a few common representations for the provenance data to enable interchange and integration. Some systems also track the provenance of the workflow templates themselves. Moreover, research on storing and querying such data effectively can more easily be transferred to practice since it relies on an explicit model of data and control flow between computational steps.

Development of electronic notebooks, e-Science systems, and underlying content and provenance middleware has begun to address end-to-end provenance management of physical, computational, and coordination processes. Over the last two decades, numerous open source and commercial electronic notebook systems have been developed. Electronic notebook capabilities range from wiki-style free-text annotation to LIMS/database-style automated documentation of repetitive work using standardized experimental protocols. State-of-the-art systems provide integrated management of provenance across many types of activities. Some systems track provenance from experiment planning through laboratory work to eventual deposition in a reference data repository. Standardization would be a significant benefit to researchers trying to implement such end-to-end capabilities using tools from multiple commercial and open source providers.

Provenance as justification for public policy

Justification information, describing how the data has been processed from raw observations to processed results that support scientific conclusions, is legally required in some scientific settings. Regulatory agencies, e.g. in areas such as food safety and medical research, specifically require research and testing documentation and also define acceptable practice for managing such records electronically. Some researchers in social sciences are developing community-scale techniques directly addressing these problems through computer systems, but most such justification information is created and maintained by user effort.

5.3.3 State of the Art Analysis of the Business Contract Scenario

The Business Contract scenario envisages mechanisms to explain the details of procedures, in this case instances of design and production, such that they can be compared with the normative statements made in contracts, specifications or other legal documents. It further assumes the ability to filter this evidence on the basis of confidentiality. While we are not aware of any system providing exactly the functionality described (standards-based, machine interpretable records that respect privacy and other concerns), there are many 'business/contract fulfillment' and related services which address particular aspects.

Tracking Design

It is critical to track which products a design decision or production action ultimately affects, so that it is possible to show that what was done in producing an individual product fulfilled obligations and to determine which set of products a decision affected. Some product recalls are due to particular manufacturing issues. Without knowing the connection between the manufacturing actions and products, many products (for example, vehicles) unaffected by the problem may be recalled, thereby costing a company a great deal more than necessary.

Computer-Aided Design and Requirements Management

Computer-aided design (CAD) systems can include features to capture what is occurring as a design is created. Aside from storing a history of changes to a design, requirements management systems allow the rationale behind design choices to be captured, which can be an essential part of the record in explaining how contractual obligations were intended to be met (particularly where a contract states that some factor must be 'taken into account') and in assessing whether additional design changes will impact a product's ability to meet stated requirements.

The provenance of a design goes beyond just the changes made through a single CAD system, both because one design may be based on another previously developed design for another customer and because the design is just a part of a larger manufacturing and retail process. Ways in which the interconnection between parts of a process can be captured include the use of common formats for describing designs and standardised archiving mechanisms used by all stages of a process.

5.4 Gap Analysis

Given the state of the art in the area of provenance, the group wanted to identify major gaps in technology that stand in the way of making shared provenance on the Web a reality. Since the major issues are illustrated by the three flagship scenarios, this gap analysis was done with the scenarios as a point of reference.

5.4.1 Gap Analysis for the News Aggregator Scenario

Existing provenance solutions only address a small portion of the scenario and are not interlinked or accessible to one another. For each step within the News Aggregator scenario, there are existing technologies or relevant research that could solve that step. For example, one can properly insert licensing information into a photo using a Creative Commons license and the Extensible Metadata Platform. One can track the origin of tweets either through retweets or using some extraction technologies within Twitter. However, the problem is that across multiple sites there is no common format and API to access and understand provenance information, whether it is explicitly indicated or implicitly determined. To inquire about retweets or trackbacks one needs to use different APIs and understand different formats. Furthermore, there is no (widely deployed) mechanism to point to provenance information on another site. For example, once a tweet is traced to the end of Twitter there is no way to follow where that tweet came from.

System developers rarely include provenance management or publish provenance records. Systems largely do not document the software by which changes were made to data and what those pieces of software did to data. However, there are existing technologies that allow this to be done, for example to document the transformations of images. There are also general provenance models that would allow this to be expressed, but they are not currently widely deployed. There are no widely accepted architectural solutions to managing the scale of the provenance records, as they may be significantly larger than the base information itself and also evolve over time.

While many sites provide for identity and there are several widely deployed standards for identity, there are no existing mechanisms for tying identity to objects or provenance traces. This is a fundamental open problem on the Web, and affects provenance solutions in that provenance records must be attached to the object(s) they describe.

Finally, although there have been proposals for how to use provenance to make trust judgments on open information sources, there are no broadly accepted methodologies to automatically derive trust from provenance records. Another issue that has been largely unaddressed is the incompleteness of provenance records and the potential for errors and inconsistencies in a widely distributed and open setting such as the Web.

5.4.2 Gap Analysis for the Disease Outbreak Scenario

This scenario is data-centric, and there is currently a large gap between ideas that have been explored in research and techniques that have been adopted in practice, particularly for provenance in databases. This is an important problem that already imposes a high cost on curators and users, because provenance needs to be added by hand and then interpreted by human effort, rather than being created, maintained, and queried automatically. However, there are several major obstacles to automation, including the heterogeneity of systems that need to communicate with each other to maintain provenance, and the difficulty of implementing provenance-tracking efficiently within classical relational databases. Thus, further research is needed to validate practical techniques before this gap can be addressed.

In the workflow provenance and Semantic Web systems area, provenance techniques are closer to maturity. This is partly because the technical problems are less daunting: the information is coarser-grained, typically describing larger computation steps rather than individual data items, and it focuses on computations from immutable raw data rather than on versioning and data evolution. There is already some consensus on graph-based data models for exchanging provenance information, and this technology gap can probably be addressed by a focused standardization effort.

Guidance to users about how to publish provenance at different granularities is also very important, for example when deciding whether to publish the provenance of an individual object or of a collection of objects. Users need to know how to use the different existing provenance vocabularies to express such different types of provenance and what the consequences will be, for example, how people will use this information and what information is needed to make it useful.

5.4.3 Gap Analysis for the Business Contract Scenario

Overall, there is a gap in practice: provenance technologies are simply not being used for the described purpose. Even with encouragement, the existing state of the art does not make it a simple task to achieve the scenario, because of a lack of standards, guidelines and tools. Specifically, we can consider the gaps with regard to content, management and use, as modelled above.

There is no established way to express what needs to be expressed regarding the provenance of an engineered product, in such a way that this can be used by both the developer and the customer (and any independent third party). Each needs to first identify and retrieve the provenance of some product, then determine from the provenance where one item is a later version of another, where implementation activity was triggered by a successful quality check etc. Moreover, even if a commonly interpretable form for the latter provenance information was agreed, there is not an established way to then augment the data so that we can verify the quality of the process it describes, e.g. how to make it comparable against contractual obligations, or to ensure it cannot be tampered with prior to providing such proof. While provenance models, digital signatures, deontic formalisms, versioning schemes and so on provide an adequate basis for solutions, there is a substantial gap in establishing how these should be applied in practice. Without common expectations on how provenance is represented, a producer cannot realistically document its activities in a way which all of its customers can use.
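As a rough illustration of the tamper-evidence requirement mentioned above, the following sketch seals a provenance record with a keyed hash. It deliberately swaps in an HMAC for a true digital signature: an HMAC only protects parties that share the secret key and does not give an independent third party a way to verify the record. The record layout, the key, and the function names are hypothetical.

```python
import hashlib
import hmac
import json


def seal(record: dict, key: bytes) -> str:
    """Return a hex tag over a canonical JSON form of the record."""
    # Sorting keys and fixing separators makes the serialization deterministic.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(key, canonical, hashlib.sha256).hexdigest()


def is_untampered(record: dict, key: bytes, tag: str) -> bool:
    """Check the record against a previously issued tag in constant time."""
    return hmac.compare_digest(seal(record, key), tag)


# Hypothetical record describing one quality check in the engineering process.
record = {"item": "website-design-v2", "checked_by": "independent-expert", "result": "pass"}
shared_key = b"demo-key-not-for-production"
tag = seal(record, shared_key)
```

A real deployment would more likely use public-key signatures so that the customer and any third party can verify a record without being able to forge one.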

Assuming that common representations are established, there is a gap in what the technology provides for storing and accessing that information. This goes beyond simply using existing databases due to the form of information conveyed in provenance. For example, if an independent expert documents, in some store local to themselves, that they checked a website design, this must be connected to the rest of the website's manufacturing process to provide a full account. While web technologies could be harnessed for the purpose, there is no established approach to interlinking across parts of provenance data. Moreover, the particular characteristics of provenance data may affect how other data storage requirements must be met: how to scale up to the quantities of interlinked information we can expect with automatic provenance recording, how to limit access to only non-confidential information etc. Finally, the provenance, once recorded, has to be accessible and there are no existing standards for exposing provenance data on the web.

With regard to the use of provenance, the scenario requires various sophisticated functions, and it is non-obvious how existing technologies would realise them. For example, the various parties need to be able to understand the engineering processes at different levels of granularity, to resolve apparent conflicts in what appears to be expressed in the provenance data, to acquire indications and evidence of which parts of the record can be relied on, to determine whether the provenance shows that the engineering process conformed to a contract, to check whether two supposedly independent assessments rely on the same, possibly faulty, source, and so on. To encourage mass adoption of provenance technologies, it must be apparent how to achieve these kinds of function, through guidelines and standardised approaches.
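One of the functions listed above, checking whether two supposedly independent assessments rely on the same source, can be sketched as a graph computation. The example assumes that provenance is available as a simple mapping from each item to the items it was directly derived from; that mapping, the item names, and the function names are illustrative assumptions rather than part of any existing provenance model.

```python
def sources_of(item: str, derived_from: dict) -> set:
    """Collect every direct and indirect source of an item."""
    seen = set()
    stack = [item]
    while stack:
        current = stack.pop()
        for parent in derived_from.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def common_sources(first: str, second: str, derived_from: dict) -> set:
    """Sources shared by two items; an empty set suggests independence."""
    return sources_of(first, derived_from) & sources_of(second, derived_from)


# Hypothetical derivation records for two assessments of the same design.
derived_from = {
    "assessment-A": ["usability-survey", "traffic-dataset"],
    "assessment-B": ["traffic-dataset"],
    "traffic-dataset": ["server-logs"],
}
shared = common_sources("assessment-A", "assessment-B", derived_from)
```

Here both assessments depend on the traffic dataset and, through it, on the server logs, so an error in either would affect both.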

5.5 Summary of Major Technology Gaps in the State of the Art

There are many proposed approaches and technology solutions that are relevant to provenance. Despite this large body of growing work, there are several major technology gaps to realizing the requirements exemplified by the flagship scenarios. Organized by the three major provenance dimensions described above, key technology gaps are:

  • With respect to provenance content:
    • No mechanism to refer to the identity/derivation of an information object.
    • No guidance on what level of granularity should be used in describing provenance of complex objects.
    • No common standard for exposing and expressing provenance information that captures processes as well as the other content dimensions.
    • No guidance on publishing provenance updates.
    • No standard techniques for versioning and expressing provenance relationships linking versions of data.
    • No standard formats to characterise whether provenance is of adequate quality for proof (e.g. through signatures).
    • No pre-specified way to ensure provenance content can be compared against norms and expectations (e.g. contracts, required processes).
  • With respect to provenance management:
    • No well-defined standard for linking provenance between sites.
    • No guidance as to how existing standards can be put together to provide provenance (e.g. linking to identity, licenses).
    • No guidance as to how application developers should go about exposing provenance in their web systems.
    • No proven approaches to manage the scale of the provenance records to be recorded and processed.
    • No standard mechanisms to find and access provenance information for each item that needs to be checked.
    • No well-defined means of ensuring only essential non-confidential provenance is released when querying.
  • With respect to provenance use:
    • No clear understanding as to how to relate provenance at different levels of abstraction, or automatically extract high-level summaries of provenance from detailed records.
    • No general solutions to understand provenance published on the Web by another party.
    • No standard representations to support integration of provenance across different sources.
    • No standard representations to support comparison of provenance.
    • No broadly applicable approaches for dealing with imperfections in provenance.
    • No broadly applicable methodology for making trust judgments based on provenance when there is varying information quality.
    • No standard methods for validating whether provenance is of adequate quality for proof.
    • No mechanism to query whether provenance data shows that laws, regulations or contracts have been complied with.
    • No standard mechanism to assess whether two supposedly independent assessments rely on the same, possibly faulty, source.
    • No means to resolve conflicts in (possibly inferred) provenance data.

6. Provenance in Web Architecture

There are no standard mechanisms to publish and access provenance on the Web. It is not sufficient to have an agreement on a language to represent provenance; there must also be an agreement on how the provenance can be found for any given resource.

This section discusses possibilities to integrate provenance into the Web architecture. It focuses on how provenance information about Web resources could be exposed as part of the HTTP based message exchange with which these resources can be accessed. More detailed discussion can be found at the Provenance and Web Architecture report.

6.1 Overview

According to the Web architecture, an HTTP server serves representations of Web resources. These resources might be static documents (i.e. files on the server), but they can also be dynamically generated documents. Each resource has a URL. When an HTTP client does an HTTP GET request on that URL, the server responds with a representation of the resource referred to by the URL. This representation can be understood as a specific (negotiable) serialization that represents a specific state of the resource. Negotiation can be done on three dimensions: media type, encoding (i.e. charset), and language.
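To make the three negotiation dimensions concrete, the following minimal sketch builds (but does not send) an HTTP GET request that states a preference along each dimension. The URL and the preference values are placeholders chosen for illustration.

```python
from urllib.request import Request


def build_negotiated_request(url: str) -> Request:
    """Build a GET request that negotiates media type, charset, and language."""
    return Request(
        url,
        headers={
            "Accept": "text/html, application/rdf+xml;q=0.8",  # media type
            "Accept-Charset": "utf-8",                          # encoding (charset)
            "Accept-Language": "en, de;q=0.5",                  # language
        },
        method="GET",
    )


# Placeholder URL; no network access happens until the request is opened.
req = build_negotiated_request("http://example.org/report")
```

The server is free to pick whichever representation best matches these preferences, which is why a single resource may have several representations, each with its own provenance.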

We aim to extend this access pattern in a way that enables clients to access provenance information about the retrieved representation and the represented Web resource.

6.2 Subject of Provenance Information

The provenance information could be about the Web resource. This option might be problematic because Web resources may change over time. Nonetheless, some provenance statements always hold, irrespective of the state of a Web resource. Note, OPM in its current form cannot be used to represent this kind of provenance because OPM focuses on "immutable pieces of state."

The provenance information could be about the representation of the Web resource. Each representation served by an HTTP server may have a unique provenance; this holds in particular for representations that are created on the fly. However, representations of the same state of a Web resource have at least some provenance information in common.

The provenance information could be about the state of the Web resource. While this option ignores the creation of the representation that the HTTP server actually serves, it might be more feasible for passing provenance by reference (see below) because it may avoid establishing a provenance record for each representation served.

6.3 Provenance Passing

We distinguish three patterns to pass provenance via the HTTP based message exchange: by value, by reference, mixed.

6.3.1 Passing by Value

The idea of this provenance passing pattern is to add all (known) provenance information about the served representation directly to the HTTP response. This information could be embedded in the HTTP response header or in the representation itself as discussed later.

Pros:

  • The provenance information is always in sync with the retrieved representation.
  • Once the provenance information has been added to the response it can be forgotten by the server (i.e. no need to store it).

Cons:

  • Provenance may be much bigger than the representation itself, causing a lot of overhead.

The integration of a mechanism for provenance negotiation (see below) may address the disadvantages of this pattern.

6.3.2 Passing by Reference

The idea of this provenance passing pattern is to understand the provenance record as another Web resource and to add the URI of the corresponding record to the HTTP response. This reference could also be embedded in the HTTP response header or in the representation itself.

Supporting this provenance passing pattern requires a server to mint a new URI for each provenance record. Furthermore, the look-up of these provenance records has to be enabled. In response to such a look-up the server could either reconstruct the provenance information on the fly or it could access a provenance store to which it added provenance records, generated at the time when the original response was sent. Reconstructing the provenance records might be problematic because the reconstructed record may provide a different account than what actually happened, and may vary depending on the reconstruction methods used.

Pros:

  • Very small overhead in the original response.

Cons:

  • Puts a burden on the server to maintain and keep provenance for all delivered representations (or for all states of the Web resources). This might be a big issue for resources that change frequently.
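The by-reference pattern can be sketched as follows, under several assumptions that are not established by this report: the reference is carried in an HTTP Link header, the link relation is named "provenance" (no such relation is assumed to be registered), and an in-memory dictionary stands in for a provenance store. The base URI and function names are hypothetical.

```python
import uuid

# Hypothetical in-memory provenance store, keyed by minted record URI.
PROVENANCE_STORE: dict = {}


def respond_by_reference(body: bytes, record: dict,
                         base: str = "http://example.org/prov/"):
    """Store the record under a freshly minted URI and reference it in the headers."""
    prov_uri = base + uuid.uuid4().hex
    PROVENANCE_STORE[prov_uri] = record
    headers = {
        "Content-Length": str(len(body)),
        # Assumption: a "provenance" link relation carried in a Link header.
        "Link": f'<{prov_uri}>; rel="provenance"',
    }
    return headers, body


def look_up(prov_uri: str) -> dict:
    """Serve a stored record; a real server might reconstruct it instead."""
    return PROVENANCE_STORE[prov_uri]


headers, body = respond_by_reference(b"<html>...</html>", {"creator": "alice"})
```

The sketch also makes the main cost visible: every response adds an entry to the store, which grows without bound unless records are shared across representations or expired.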

6.3.3 Passing Partially by Value and by Reference

The idea of this provenance passing pattern is to pass some provenance information by value while providing additional references. The references could be separate from the embedded provenance information and refer i) to the complete or ii) to a more detailed provenance record. Alternatively, they could be included in the embedded provenance information via URIs that identify common pieces of provenance (e.g. an agent, a common source artifact, etc.), assuming that a look-up of these URIs yields additional information.
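A minimal sketch of the mixed pattern, assuming a provenance record is a flat dictionary and that JSON is an acceptable carrier. Neither assumption comes from this report, and the key names (including "fullRecord") are hypothetical. A few selected fields are passed by value, and a URI points to the complete record.

```python
import json


def mixed_provenance(record: dict, summary_keys: tuple, full_record_uri: str) -> str:
    """Embed selected fields by value and add a reference to the full record."""
    embedded = {key: record[key] for key in summary_keys if key in record}
    embedded["fullRecord"] = full_record_uri  # hypothetical key name
    return json.dumps(embedded, sort_keys=True)


# Hypothetical record; the agent is itself identified by a dereferenceable URI.
record = {
    "agent": "http://example.org/people/alice",
    "created": "2010-12-08",
    "steps": ["crop", "resize", "watermark"],
}
payload = mixed_provenance(record, ("agent", "created"), "http://example.org/prov/42")
```

The embedded agent URI illustrates the second variant described above: a client that wants more detail about the agent can look up that URI without fetching the whole record.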

6.4 Embedding Provenance

As mentioned before, the provenance information or references could be embedded at the HTTP level or in the representation itself. Both of these provenance embedding patterns can be used for each of the provenance passing patterns discussed above.

6.4.1 Embedded at the HTTP Level

It might be possible to pass provenance (by value or by reference) in the header of an HTTP response. This would require the use of an appropriate header field. For large provenance records passed by value, this option might not be feasible due to the limit on header size. Furthermore, it is not clear how provenance, passed by value, can be provided via a header field.

Alternatively, provenance could be embedded at the HTTP level via a multipart MIME message.
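The multipart alternative can be illustrated with Python's standard email package, which builds MIME structures. In this sketch the first part carries the representation and the second carries the provenance. The choice of the "mixed" subtype, the part order, and the use of application/rdf+xml for the provenance part are assumptions made for illustration only.

```python
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def multipart_with_provenance(html: str, provenance_rdf_xml: str) -> MIMEMultipart:
    """Bundle a representation and its provenance into one multipart message."""
    message = MIMEMultipart("mixed")
    # Part 1: the representation itself.
    message.attach(MIMEText(html, "html", "utf-8"))
    # Part 2: the provenance, passed by value (assumed media type).
    message.attach(MIMEApplication(provenance_rdf_xml.encode("utf-8"), "rdf+xml"))
    return message


message = multipart_with_provenance("<html><body>Report</body></html>", "<rdf:RDF/>")
parts = message.get_payload()
```

Unlike a header field, a multipart body is not subject to header size limits, but clients that do not understand the multipart structure will no longer see a plain representation.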

6.4.2 Embedded in the Representations

Instead of embedding provenance at the HTTP level, it can be embedded (by value or by reference) in the representation itself. This provenance embedding pattern is only possible for representations serialized using a media type with metadata capabilities. The actual approach as to how provenance is embedded in the representation would depend on the media type.

For representations of RDF graphs, serialized in RDF/XML, Turtle, N3, etc., it is not clear yet how the embedded provenance description can be associated with the embedding representation (i.e. what should be the subject of provenance statements). Once this problem is solved, provenance passed by value can be represented by additional RDF triples using a suitable provenance vocabulary. To pass provenance by reference an appropriate RDF property has to be established (dct:provenance might be an option).
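As a sketch of passing provenance by reference inside an RDF representation, the function below emits a single N-Triples statement that uses the Dublin Core property dct:provenance. Which URI should serve as the subject is exactly the open question noted above; using the URL of the served document is an assumption made only for this example, as are the example URIs.

```python
# Full URI of the Dublin Core Terms "provenance" property.
DCT_PROVENANCE = "http://purl.org/dc/terms/provenance"


def provenance_reference_triple(subject_uri: str, record_uri: str) -> str:
    """Return one N-Triples line linking a subject to its provenance record."""
    return f"<{subject_uri}> <{DCT_PROVENANCE}> <{record_uri}> ."


# Assumption: the document URL stands in for the subject of the statement.
triple = provenance_reference_triple(
    "http://example.org/data/report.rdf",
    "http://example.org/prov/42",
)
```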

For representations of web pages, serialized in (X)HTML, provenance passed by reference could be embedded using the link element. This option requires the registration of a suitable link type (e.g. "provenance") that has to be used for the rel attribute. Passing provenance by value could be done using RDFa; however, as with the RDF graphs, it is also not clear yet how the embedded provenance description can be associated with the embedding representation (i.e. with the actual HTML serialization that embeds the provenance description).
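On the client side, finding a provenance reference in a web page could look like the sketch below. It assumes the hypothetical link type "provenance" discussed above and uses only the standard library HTML parser. The class and function names are illustrative.

```python
from html.parser import HTMLParser


class ProvenanceLinkFinder(HTMLParser):
    """Collect href values of link elements whose rel includes "provenance"."""

    def __init__(self):
        super().__init__()
        self.provenance_uris = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        # rel may hold several space-separated link types.
        rel_values = (attributes.get("rel") or "").lower().split()
        if "provenance" in rel_values and attributes.get("href"):
            self.provenance_uris.append(attributes["href"])


def find_provenance_links(html: str) -> list:
    finder = ProvenanceLinkFinder()
    finder.feed(html)
    return finder.provenance_uris


PAGE = (
    '<html><head><link rel="stylesheet" href="style.css">'
    '<link rel="provenance" href="http://example.org/prov/42"/>'
    "</head><body><p>Report</p></body></html>"
)
links = find_provenance_links(PAGE)
```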

6.5 Provenance Negotiation

Different clients / users may have different needs (e.g. provenance described at different levels of detail or no provenance at all). Hence, the aforementioned options do not have to be considered as exclusive, "one size has to fit all" approaches. Instead, it should be possible for clients to negotiate the kind of response they want to receive. This would require the introduction of another dimension of content negotiation. Such an extension of HTTP might be particularly relevant for provenance passed by value.
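The following sketch shows what such a negotiation could look like on the server side. The request header name "Accept-Provenance" and its three values are invented for this example; no such header exists in HTTP, and a real extension might look quite different.

```python
def negotiate_provenance(request_headers: dict, record: dict, record_uri: str) -> dict:
    """Decide how much provenance to return, based on a hypothetical header."""
    # "Accept-Provenance" is a made-up header name used only for illustration.
    preference = request_headers.get("Accept-Provenance", "none").strip().lower()
    if preference == "value":
        return {"provenance": record}          # pass by value
    if preference == "reference":
        return {"provenance_uri": record_uri}  # pass by reference
    return {}                                  # client asked for no provenance


record = {"creator": "alice", "steps": ["collect", "analyse"]}
uri = "http://example.org/prov/42"
by_value = negotiate_provenance({"Accept-Provenance": "value"}, record, uri)
by_reference = negotiate_provenance({"Accept-Provenance": "reference"}, record, uri)
nothing = negotiate_provenance({}, record, uri)
```

Defaulting to no provenance keeps responses unchanged for existing clients, which addresses the overhead concern raised for passing by value.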

7. A Roadmap for Provenance on the Web

The group analyzed the major gaps in light of the immediate needs for provenance in the use cases that it considered. This section outlines a roadmap towards addressing those gaps; more details can be found in the report on broad recommendations and priorities.

7.1 Broad Recommendations

The group synthesized eight general recommendations:

  • Recommendation #1: There should be a standard way to represent at a minimum three basic provenance entities:
    1. a handle (URI) to refer to an object (resource)
    2. a person/entity that the object is attributed to
    3. a processing step done by a person/entity to an object to create a new object
  • Recommendation #2: A provenance framework should include a mechanism to access provenance-related information addressed by other standards, such as:
    • licensing information of the object
    • digital signature for the object
    • digital signature for provenance records
  • Recommendation #3: A provenance framework should include a standard way for sites to make provenance information about their content available to other parties in a selective manner, and for others to access that provenance information
  • Recommendation #4: A provenance framework should include a standard way to express the provenance of provenance assertions, as there can be several accounts of provenance, at different granularities, that may conflict
  • Recommendation #5: A provenance framework should include a representation of provenance that is detailed enough to enable reapplying the process to reproduce it
  • Recommendation #6: A provenance framework should allow referring to versions of objects as they evolve over time, or to temporal information statements of when the object was created, modified, or accessed. In particular it should provide for a representation of how one version (or parts thereof) was derived from another version (or parts thereof).
  • Recommendation #7: A provenance framework should include a standard way to represent a procedure which has been enacted (in the scenario, this is to compare that procedure with what was required to be done)
  • Recommendation #8: A provenance framework should include a way to determine commonality of derivation in two resources (in the scenario, this is needed to judge the independence or otherwise of two reports)
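The three basic entities named in Recommendation #1 can be pictured as a very small data model. The sketch below is only one possible reading: the class names, field names, and example URIs are assumptions for illustration and do not correspond to any agreed vocabulary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Agent:
    """A person or entity that objects and processing steps are attributed to."""
    uri: str


@dataclass(frozen=True)
class Resource:
    """An object identified by a handle (URI) and attributed to an agent."""
    uri: str
    attributed_to: Agent


@dataclass(frozen=True)
class ProcessingStep:
    """A step performed by an agent on existing objects to create a new object."""
    performed_by: Agent
    used: tuple          # input Resource objects
    generated: Resource  # the newly created Resource


alice = Agent("http://example.org/people/alice")
raw = Resource("http://example.org/data/raw", alice)
report = Resource("http://example.org/data/report", alice)
step = ProcessingStep(performed_by=alice, used=(raw,), generated=report)
```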

Recommendations #1, #2, and #3 are present in all three flagship scenarios. Recommendations #4, #5, and #6 are more central to the second scenario, while #7 and #8 are more central to the third scenario.

7.2 Short-Term and Long-Term Priorities

The group agreed that recommendations #1, #2, and #3 from the group's effort were the highest priorities and that they should be addressed within a provenance standardization effort. While acknowledging that the priorities of the recommendations depend on the context, those three recommendations are considered to be the most common and to represent the core set of issues to be addressed. Moreover, there already exist a number of systems that provide some form of provenance capability in these areas demonstrating convergent practices. At the same time, there are a number of developments, such as the growth in popularity of open data initiatives and Linked Data, that create a pressing need for standard representation of provenance and interoperability across systems. Failure to address effective standardization in the recommended areas now could impede effective reuse of open data, and potentially dissipate the existing community momentum and enthusiasm.

Thus, there is both a clear need and a clear starting point for a standardization effort to address these recommendations within the next two years.

The remaining recommendations #4, #5, #6, #7, and #8 represent issues for which there is less understanding or community agreement of the best approaches. While a number of research projects have considered these problems, proposed mechanisms, and deployed solutions in specific projects, there is not yet consensus on how, for example, to represent fine-grained provenance for RDF, XML, or relational data, how to integrate provenance with versioning information, or even how to version and archive past versions of data on the web (an important problem in its own right). Some aspects of these problems involve issues related to description and categorization of processes and how to describe mutable resources that have been debated for thousands of years. It is nevertheless likely that, as basic technology and standards are developed, the need for practical techniques to address these (and other) provenance issues will lead to progress. There is also a case to be made for providing an upgrade path that caters for these elements. Thus, standardization should consider these elements when catering for extensibility.

Therefore, while there may not yet be a clear case for standardization in some areas, it is important that: a core standard be developed with future extension in mind; these problems continue to receive attention in the research community and more broadly; and the question of standardization in these areas be revisited in 3-5 years once the problems and possible solutions are better understood.

7.3 Bootstrapping Provenance Standards

The first recommendation implies that the community would be able to reach agreement on a representation of basic provenance entities. This task would have required major effort, but the group found solid grounds for immediate progress in work that has already been done.

The group found that there is a core set of provenance terms that are common across the different provenance terminologies. Despite the diverse motivations and perspectives that led to these terminologies, the group was able to establish mappings among them and successfully demonstrate that there are many common concepts in provenance.

In addition, the group identified that the core OPM terms, along with a set of provenance terms from the nine other provenance terminologies, can be a bootstrapping basis for a future common provenance model.

7.4 Potential for Immediate Impact

If the consensus on provenance technologies and the progress on technical challenges described above are fruitful, then we can anticipate the provenance of resources to be readily available to users in all domains involving production or exchange of data. This could in turn change common practice in many domains, as knowing the origins of data may allow users to more accurately interpret it and feel more able to rely on it.

The group found evidence of the widespread demand for knowing the provenance of data in a broad range of use cases. These domains spanned scientific research (in biology, physics, geology etc.), medicine, emergency services, museum collections, legal disputes, government policy, industrial engineering, environmental data collection, as well as in everyday usage of public web resources and personal data. Each domain will have specific issues regarding provenance, but we found questions and technical requirements to be fairly generalisable, e.g. who said this and when, why did I end up with results which look like this, can I rely on this data, etc.

If common standards for provenance are adopted and provenance data is routinely and automatically recorded, then a side-effect of provenance could be to connect data across application domains. For example, consider the following scenario. Environmental sensor data is gathered and made public, a social science study is conducted investigating social effects of environmental changes, and conclusions from this finally make their way into an online newspaper article. While each process is conducted independently with domain-specific tools, due to the automatic recording of provenance data, a connection is apparent between the newspaper article and the environmental sensors on which its analysis ultimately depends.
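The cross-domain connection in this scenario can be sketched by merging provenance recorded separately in each domain and then searching for a derivation path. The sketch assumes each domain exports a simple mapping from an item to its direct sources using shared identifiers. The identifiers and function names are hypothetical.

```python
from collections import deque


def merge_records(*records: dict) -> dict:
    """Combine independently recorded derivation mappings into one graph."""
    merged = {}
    for record in records:
        for item, sources in record.items():
            merged.setdefault(item, set()).update(sources)
    return merged


def derivation_path(start: str, target: str, derived_from: dict):
    """Breadth-first search for a chain of derivations from start back to target."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for source in sorted(derived_from.get(path[-1], ())):
            if source not in visited:
                visited.add(source)
                queue.append(path + [source])
    return None  # no recorded connection


# Provenance recorded independently by three domain-specific tools.
newsroom = {"article": ["study"]}
social_science = {"study": ["sensor-data"]}
environment = {"sensor-data": ["sensor-17"]}

merged = merge_records(newsroom, social_science, environment)
path = derivation_path("article", "sensor-17", merged)
```

No single domain's record reveals the connection; it only appears once the records can be combined, which is what a common interchange format would enable.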

8. Recommendations

The group articulated the requirements for provenance in a variety of contexts on the Web. It has identified technology gaps in terms of standard mechanisms to represent and access provenance to address those needs. The group formulated a roadmap for provenance on the Web that includes short term and long term priorities. The group also agreed to concrete starting points to ensure rapid progress towards a standardization effort, as the potential impact of provenance standards on the Web is very broad and immediate.

The group recommends the formation of a Working Group to address core concepts of provenance with the charter below.

While the proposed charter leaves out aspects of provenance where there is not yet a clear case for standardization, it is important that the core standard be developed with future extension in mind as those aspects continue to receive attention in the research community and can be revisited in 3-5 years when the issues and possible solutions are better understood.

8.1 Proposed Charter for a Provenance Interchange Working Group

Introduction/Mission

The mission of the proposed Provenance Interchange Working Group is to support the widespread publication and use of the provenance of Web documents, data, and resources. It will define a language for exchanging provenance, and publish concrete specifications of the language using existing W3C standards.

1. Background

The W3C Incubator Group on Provenance identified rapidly growing needs for provenance in social, scientific, industry, and government contexts, involving data and information integration across the Web. Provenance is unique in that it inherently draws on distributed information. Therefore, collecting provenance and making sense of it requires consulting different heterogeneous systems.

Over time, multiple techniques to capture and represent various forms of provenance have been devised, and are sometimes known under the names of lineage, pedigree, proof, or traceability. As noted in the group's state-of-the-art report, the lack of a standard model is a significant impediment to realizing such applications. This matters as provenance is key to establishing trust in documents, data, and resources. However, the group's work also indicates that many provenance models exist with significantly different expressivity, fundamentally different assumptions about the system they are embedded in, and radically different performance impact. The idea that a single way of representing and collecting provenance could be adopted internally by all systems does not seem to be realistic today.

A pragmatic approach is to consider a core provenance language and extension mechanisms that allow any provenance model to be translated into such a lingua franca and exchanged between systems. Heterogeneous systems can then export their provenance into such a core language, and applications that need to make sense of provenance in heterogeneous systems can then import it and reason over it.

2. Scope

The Provenance Interchange Working Group has the following objectives:

  • define a provenance interchange language and methods to publish and access provenance;
  • the scope of this language will be applicable to any resource, not just for Semantic Web objects;
  • the provenance interchange language should have a low entry point to facilitate widespread adoption, and make it easy to do simple things;
  • it should have a small core model and allow for extensions (i.e., profiles, integration of other more expressive/complementary vocabularies/frameworks);
  • the Working Group should release some deliverables early and end in 18 months.

3. Deliverables and Schedule

3.1 Deliverables

By "language", we refer to the provenance interchange language to be defined, and the inferences allowed over it. The Working Group will select an appropriate name for the language.

  • D1. Conceptual Model (W3C Recommendation). This document consists of a natural language description and a graphical illustration of concepts involved in the language. Such a document will help broaden the appeal and uptake of provenance beyond the community of technical experts.
  • D2. Formal Model (W3C Recommendation). The purpose of this document is to provide a normative formalization of the conceptual model, making use of Semantic Web languages beginning with RDFS and OWL.
  • D3. Formal Semantics (W3C Note, optional). This optional note consists of a mathematical definition of the language. It will focus on facets of formalization that have not been captured in the formal model.
  • D4. Accessing and Querying Provenance (W3C Note). This document specifies how provenance can be accessed or queried in embedded documents and from remote services. Specifically, it defines how to access provenance embedded in an HTML document using RDFa, how to access provenance from a service by means of HTTP, and how to query provenance through a SPARQL endpoint.
  • D5. Guidelines for producing XML of the model (W3C Note). This document specifies an XML serialization for the language.
  • D6. Best Practice Cookbook (W3C Note). This document includes a limited set of best practice profiles that link with other relevant models, such as Dublin Core provenance related concepts, licensing in Creative Commons, and the OpenID identity mechanism for people.
  • D7. Primer (W3C Note). This educational document provides users with an easy to understand description of the model.

Comments about the deliverables:

  • The conceptual model (D1) and the formal model (D2) will be developed in parallel, ensuring that concepts can be formalized adequately, and vice-versa, that the formalization is explained intuitively.
  • The proposed Working Group is committed to formalizing the provenance interchange language using RDFS and OWL, in a first instance (D2). Depending on the kinds of inferences to be supported, other Semantic Web languages may also be considered, where appropriate. A by-product of this formalization is the mapping of the provenance interchange language to RDF graphs.
  • The proposed Working Group will consider defining formal semantics for the language (D3). Its intent is to disambiguate concepts to ensure inter-operability; the Working Group will specify its exact scope.
  • A serialization to XML (D5) will help disseminate the language to communities beyond the Semantic Web community.

3.2 Milestones

Reports will undergo the W3C development process: Working Draft (WD), starting with a First Public Working Draft (FPWD), Working Draft in Last Call (LC), Candidate Recommendation (CR), Proposed Recommendation (PR) and Recommendation (Rec).

Specification   FPWD   LC     CR     PR     Rec
D1              T+6    T+9    T+12   T+15   T+18
D2              T+6    T+9    T+12   T+15   T+18
D3 (Optional)   T+12   T+18   n/a    n/a    n/a
D4              T+9    T+15   n/a    n/a    n/a
D5              T+9    T+12   n/a    n/a    n/a
D6              T+15   T+18   n/a    n/a    n/a
D7              T+12   T+18   n/a    n/a    n/a

4. Provenance Concepts

The proposed Working Group will leverage the activities of the W3C Provenance Incubator Group, its understanding of the state-of-the-art, extensive requirements capture, use cases and flagship scenarios, and mapping of provenance vocabularies.

Drawing on existing vocabularies/ontologies (namely: Changeset Vocabulary, Dublin Core, Open Provenance Model (OPM), PREMIS, Proof Markup Language (PML), Provenance Vocabulary, Provenir ontology, SWAN Provenance Ontology, Semantic Web Publishing Vocabulary, WOT Schema), a set of concepts has been identified to constitute the core of a standard provenance interchange language. The number of concepts is intentionally limited, so as to ensure a cohesive and tractable core. Other concepts can be relevant to provenance, but it is anticipated that they would be defined by means of the envisaged extension mechanism of the provenance interchange language.

In the following list, the names appearing as titles are used for intuition. Concepts with similar intuition in existing vocabularies are provided. Examples from one of the three flagship scenarios from the W3C Provenance Incubator Group are also shown.

  1. Resource: includes both static and dynamic (immutable or mutable) resources. The proposed Working Group can decide whether to subclass this concept and make the distinction explicit.
    • opm:Artifact, pmlp:IdentifiedThing, provenir:data, "continuant" (obo:BFO_0000002), pmlp:Document, pmlp:DocumentFragment
      • Example: BlogAgg would like to know the state of an image before and after modification to see if it was modified appropriately
    • may include a user query (e.g., pmlp:Query)
  2. Process execution: refers to execution of a computation, workflow, program, service, etc. Does not refer to a query.
    • opm:Process, provenir:process, "process" (obo:BFO_0000007)
      • Example: Alice collects data from public sources and "natural experiment" data. Alice then processes and interprets the results and writes a report summarizing the conclusions. All these steps should be captured.
  3. Recipe link: we will not define what the recipe is; what we mean here is just a standard way to refer to a recipe (a pointer). The development of standard ways to describe these recipes is out of scope.
    • pmlp:InferenceRule, pmlp:DeclarativeRule, pmlp:MethodRule, "function" (obo:BFO_0000034)
      • Example: Alice is processing data and executes a linear regression implementation as one of the steps; the recipe could refer to a linear regression algorithm.
  4. Agent: entity (human or otherwise) involved in the process execution. An agent can be the creator or a contributor.
    • opm:Agent, provenir:agent, prv:Actor, pmlp:Agent
      • Example: Alice starts and facilitates the tool SPSS when doing data analysis.
  5. Role
    • opm:Role, "role" (obo:BFO_0000023)
      • Example: Whether a data file was used as a training or test data set when running machine learning algorithms.
  6. Location: a link to a description of location. Defining how the spatial information will be represented is out of scope; the language will point to an existing ontology.
    • provenir:spatial_parameter, provenir:located_in, provenir:adjacent_to
      • Example: The location where the disease was declared.
  7. Derivation
    • opm:WasDerivedFrom, opm:WasDerivedFromStar, provenir:derives_from
      • Example: The thumbnail image was derived from the panda image.
  8. Generation
    • opm:WasGeneratedBy, opm:WasGeneratedByStar
      • Example: A thumbnail image was generated by BlogAgg using the panda image.
  9. Use
    • opm:Used, opm:UsedStar, prv:usedBy
      • Example: The panda image was used by BlogAgg to generate a thumbnail image.
      • Example: John Markoff used SPSS.
  10. Ordering of Processes
    • opm:WasTriggeredBy, provenir:preceded_by, provenir:preceded_by*
      • Example: Report writing was triggered by the interpretation of results.
      • Example: Bob, a researcher studying the flu epidemic, starts a process to send email about the status of a (long-running) experiment process. The notification process is preceded by the experiment process.
  11. Version
    • dc:replaces, provenir:transformation_of, pmlp:SourceUsage
      • Example: When Alice releases a new report this would express that this version should be used rather than the previous one.
      • Example: Alice consults a website URI whose content changes over time, a document that has versions going through edits, etc.
  12. Participation
    • provenir:has_participant, "participates in" (obo:BFO_0000056), "has participant" (obo:BFO_0000057), prv:involvedActor
      • Example: Alice participates in reviewing a paper and approving it for publication; she is not an author but participates in the process.
  13. Control: a subclass of participation. Related to this is a notion of "responsibility", i.e. an entity that stands behind the artifact that was produced (Alice controls the process, but the organization she works for is responsible and remains responsible even after she leaves). It may be a useful shortcut to add.
    • opm:WasControlledBy, prv:operatedBy
      • Example: SPSS was operated by Alice.
  14. Provenance Container
    • opm:OPMGraph, dc:provenance, pmlp:NodeSet
      • Example: Bob's Website Factory provides proof in the form of a set of provenance statements that the contract was executed as agreed.
  15. Views or Accounts
    • opm:Account
      • Example: Bob's Website Factory and Customers Inc provide two different and conflicting sets of information (i.e. accounts) describing the provenance of the production of the same website.
  16. Time
    • opm:Time, opm:Used, opm:WasGeneratedBy, opm:WasDerivedFrom, opm:wasControlledBy, prv:wasPerformedAt, dc:modified, provenir:has_temporal_value, provenir:temporal_parameter, "begins to exist during" (obo:BFO_0000068), "ceases to exist during" (obo:BFO_0000069), "temporal region" (obo:BFO_0000008)
      • Example: BlogAgg wants to find the correct originator of the microblog who first put the word out.
      • Example: Alice performs quality checks on the data before analyzing it.
      • Example: The timestamp associated with a published dataset.
      • Example: The time when Alice modifies a previous report.
  17. Collections: Should be a lightweight notion, mainly focused on "part of". Might be treated as a resource ultimately.
    • prv:containedBy, provenir:contained_in, dc:hasPart
      • Example: A layer is part of an image.
      • Example: An image is contained in a news item.
      • Example: A report contains a data plot.
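
To make the Derivation concept more concrete, the sketch below stores asserted one-step derivations as pairs and computes the multi-step (transitive) derivations from them, which is how we read the "Star" variants such as opm:WasDerivedFromStar listed above. The resource identifiers and the function name `derived_from_star` are hypothetical and chosen only for illustration; the report does not prescribe a data structure or an inference procedure.

```python
# Asserted one-step derivations as (derived resource, source resource).
# Identifiers are hypothetical, loosely following the flagship scenarios.
derivations = [
    ("thumbnail", "cropped_panda"),
    ("cropped_panda", "panda_image"),
    ("report", "dataset"),
]


def derived_from_star(edges, resource):
    """Return every resource that `resource` was derived from,
    directly or through a chain of one-step derivations."""
    # Index the edges: derived resource -> set of direct sources.
    sources = {}
    for derived, source in edges:
        sources.setdefault(derived, set()).add(source)

    # Depth-first walk over the derivation edges.
    seen = set()
    stack = [resource]
    while stack:
        current = stack.pop()
        for source in sources.get(current, ()):
            if source not in seen:
                seen.add(source)
                stack.append(source)
    return seen


if __name__ == "__main__":
    # The thumbnail traces back to both the cropped and the original image.
    print(sorted(derived_from_star(derivations, "thumbnail")))
```

The `seen` set also guards against cycles, so the walk terminates even if the asserted edges are inconsistent. The same pattern would apply to the other multi-step relations (generation, use, ordering of processes) if they were stored as edge lists.
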
Acknowledgements:
Chris Bizer, James Cheney, Sam Coppens, Kai Eckert, Andre Freitas, Irini Fundulaki, Daniel Garijo, Yolanda Gil, Jose Manuel Gomez, Paul Groth, Olaf Hartig, Deborah McGuinness, Simon Miles, Paolo Missier, Luc Moreau, James Myers, Michael Panzer, Paulo Pinheiro da Silva, Christine Runnegar, Satya Sahoo, Yogesh Simmhan, Raphaël Troncy and Jun Zhao.
We thank all the outside participants who consulted with the group over its lifetime.