
This article is part of DataBits, stories about data management, techniques, and tools. DataBits is curated by the LTER Information Managers. For more information and to contribute a DataBits article, reach out to the Network Office or Marina Frantz, current editor of DataBits.

John Porter, Virginia Coast Reserve LTER

Imagine academic literature without footnotes. Clarifying information typically found in footnotes would need to be included directly in the text. Alternatively, look at Wikipedia.org. There, hyperlinks abound, making it quick and easy to access ancillary information. Following a link is optional: readers who don’t need the clarification provided by a link or footnote can forge on, but it is there for those who do.

Like the scientific literature, metadata for LTER datasets can also benefit from footnotes, in the form of “annotations.” The Ecological Metadata Language (EML) used by LTER, the Environmental Data Initiative (EDI), and many other organizations is rich internally, with places for presenting a wide variety of information about datasets. Historically, it has been primarily inward-focused, meaning that an EML metadata document is expected to contain most of the relevant information within itself. However, recent releases of the EML standard have included an exciting new capability: annotation.

Annotations enable the metadata to include details, in a computer-interpretable form, that unambiguously identify something or capture relationships between something in the metadata and external resources that provide more detail on the entity being annotated. For example, different EML metadata documents can describe the same unit of measure in different ways. “Milligrams per liter,” “milligram/liter,” “mg/l,” “milligrams/liter (mg/l),” “mg per liter” and “milligrams per liter (mg/l)” are all unit descriptions that have been used to describe the number of milligrams of a substance contained in a liter of liquid. For experts in the field, deciphering this variety of descriptions poses few challenges. But outside the ecological sciences, “mg” might mean “milligravities” to a space researcher or “magnesium” to a chemist, and if capitalized, it might mean the disease “myasthenia gravis” to a medical doctor, a brand of car to an auto enthusiast, a “media gateway” to a computer networking specialist, or a “minimum guarantee” to a business expert.

Ideally, metadata should provide an unambiguous representation of what is meant by a unit description. One option would be to constrain metadata creators to a specific set of unit descriptions (e.g., everyone uses “mg.L-1”), but developing such a list for all the possible units an ecological researcher might use is very challenging, and getting everyone to use that list is even more difficult. Moreover, some researchers have strong preferences for particular representations. So if having a “master list” somewhere isn’t an ideal solution, what is?

A solution for this problem might be a footnote. If a footnote were attached to a unit description and linked to a description of the unit, perhaps including formal relations to SI units, multipliers for unit conversions and alternative representations, many potential problems would be avoided. In the world of ecological metadata, an <annotation> can take the place of that footnote in an efficient way that makes it machine-interpretable and extremely flexible.

<annotation> elements in an EML document take on a very odd form (at least for humans). An example is: 

<annotation>
  <propertyURI label="has unit">
    http://qudt.org/schema/qudt/hasUnit
  </propertyURI>
  <valueURI label="Milligram Per Liter">
    http://qudt.org/vocab/unit/MilliGM-PER-L
  </valueURI>
</annotation>

Why such an odd way of doing it? What is happening here? First, remember that machine readability is important, so the <annotation> uses a widely accepted framework known as RDF (the Resource Description Framework) that is both extremely flexible and powerful, even if it is not very human-friendly. RDF uses web-address-like “Uniform Resource Identifiers” (URIs) to link to concepts, rather than some other sort of identifier. To keep annotations general, each one has two parts. Starting with the second, the “valueURI” points to a well-established ontology (more on ontologies later) that contains a formal definition for “Milligram Per Liter.” Figure 1 shows the web page describing the unit, which eliminates any confusion about which unit is actually being used.

So what is the first element in the annotation, the “propertyURI”? It is needed to allow annotations to be general in nature, not tied to a specific function. Here it links to a resource that describes what “has unit” means. Such flexibility allows us to define any sort of relationship, such as “also known as” for people whose names have changed over time, or “substance measured in the numerator” if we wanted to specify milligrams of nitrogen.

RDF data is arranged in “triples,” each with a subject, a predicate and an object. In our example annotation the valueURI is the object and the propertyURI is the predicate, but what is the subject? Not shown was where the <annotation> was placed in the EML document. It was located inside an <attribute> element, so the annotation implicitly takes that enclosing <attribute> as the subject of the triple.
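To make the subject-predicate-object idea concrete, here is a minimal Python sketch that reads an annotated <attribute> and prints the triple it implies. It is only an illustration under stated assumptions: the fragment is a stripped-down stand-in for real EML, the attribute id "att.nitrate" and name "nitrate" are invented, and using the enclosing element's id as the subject is this sketch's convention. The function name annotation_triples is likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, stripped-down fragment: the attribute id and name are invented
# for illustration; only the two URIs come from the example in the article.
EML_FRAGMENT = """
<attribute id="att.nitrate">
  <attributeName>nitrate</attributeName>
  <annotation>
    <propertyURI label="has unit">
      http://qudt.org/schema/qudt/hasUnit
    </propertyURI>
    <valueURI label="Milligram Per Liter">
      http://qudt.org/vocab/unit/MilliGM-PER-L
    </valueURI>
  </annotation>
</attribute>
"""


def annotation_triples(attribute):
    """Return (subject, predicate, object) tuples for one <attribute> element.

    Assumption: the enclosing element's id stands in for the subject.
    """
    subject = attribute.get("id")
    triples = []
    for annotation in attribute.findall("annotation"):
        # Strip the whitespace that surrounds the URIs in the pretty-printed XML.
        predicate = annotation.findtext("propertyURI", default="").strip()
        obj = annotation.findtext("valueURI", default="").strip()
        triples.append((subject, predicate, obj))
    return triples


attribute = ET.fromstring(EML_FRAGMENT)
triples = annotation_triples(attribute)
for subject, predicate, obj in triples:
    print(subject, predicate, obj)
```

Read aloud, the single triple says: the attribute "att.nitrate" has unit Milligram Per Liter.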

The utility of annotations is greatly augmented by the use of ontologies. Ontologies are collections of concepts connected by relationships. Our example above used QUDT.org (Quantity, Unit, Dimension, Type), which focuses on units. Others with a broader focus, such as the Environment Ontology (ENVO), have a wider array of concepts and relationships (some drawn from the LTER Keyword Thesaurus). For example, “Forest Biome” can be placed in a variety of contexts and relationships (Figure 2), showing that broadleaf forest biomes are a subset of forest biomes and include temperate forest types, among them three types of temperate broadleaf forest. Figure 2 emphasizes parent-child relationships (broadleaf forest biome is a child of forest biome), but other types of relationships can also be used as predicates.

Linking annotations to ontologies provides access to information that, importantly, we don’t have to provide. As with dictionaries, we get to use ontologies, but ideally we don’t have to create them ourselves! Aspects of creating annotations may be automated, so that researchers can more easily create rich metadata. For example, several LTER sites are annotating units using a web service that matches many ad hoc unit descriptions with the QUDT ontology (https://vocab.lternet.edu/unitsws.html).
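The matching step can be pictured with a small Python sketch. This is not the web service mentioned above, whose interface is not described here; it is a hand-written lookup table, and the names KNOWN_UNITS, normalize and unit_annotation are all hypothetical. It only shows the general idea: tidy up an ad hoc unit description, look it up, and emit an <annotation> element when a match is found.

```python
import re
import xml.etree.ElementTree as ET

HAS_UNIT = "http://qudt.org/schema/qudt/hasUnit"
MG_PER_L = "http://qudt.org/vocab/unit/MilliGM-PER-L"

# Hand-written table for illustration only (an assumption; a real workflow
# would rely on a curated service or ontology rather than this dictionary).
KNOWN_UNITS = {
    "milligrams per liter": ("Milligram Per Liter", MG_PER_L),
    "milligram per liter": ("Milligram Per Liter", MG_PER_L),
    "mg per liter": ("Milligram Per Liter", MG_PER_L),
    "mg per l": ("Milligram Per Liter", MG_PER_L),
}


def normalize(description):
    """Lowercase, drop parenthetical asides, and spell '/' as 'per'."""
    text = description.lower()
    text = re.sub(r"\(.*?\)", "", text)   # e.g. remove "(mg/l)"
    text = text.replace("/", " per ")
    return re.sub(r"\s+", " ", text).strip()


def unit_annotation(description):
    """Return an <annotation> XML string, or None if the unit is unknown."""
    match = KNOWN_UNITS.get(normalize(description))
    if match is None:
        return None
    label, uri = match
    annotation = ET.Element("annotation")
    prop = ET.SubElement(annotation, "propertyURI", {"label": "has unit"})
    prop.text = HAS_UNIT
    value = ET.SubElement(annotation, "valueURI", {"label": label})
    value.text = uri
    return ET.tostring(annotation, encoding="unicode")


AD_HOC = [
    "Milligrams per liter",
    "milligram/liter",
    "mg/l",
    "milligrams/liter (mg/l)",
    "mg per liter",
    "milligrams per liter (mg/l)",
]
for description in AD_HOC:
    print(repr(description), "->", unit_annotation(description))
```

Under these assumptions, all six of the spellings listed earlier in this article resolve to the same valueURI, while an unrecognized description simply returns None and is left for a human to annotate.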

A growing number of LTER metadata documents are starting to use <annotation> elements (especially for units). However, they are buried in the raw EML. Making them more accessible to users will require enhancing metadata display and search interfaces. For search interfaces, having unambiguous unit annotations can facilitate identification of similar datasets based on the units they share. Without annotations, only datasets that share a particular representation of a unit (e.g., mg/l) can be linked. With annotations, all datasets sharing the same valueURI can be quickly identified. Once datasets are located, metadata displays that add annotation-based hyperlinks can allow users to quickly get the details on any units about which they are uncertain. Finally, for automated processing, unit conversions can be facilitated using information drawn from an ontology. Those benefits come just from annotation of units, but similar benefits may accrue from annotating locations, people, methods, ecosystems and organisms. We are only at the dawn of using annotations, but they hold great promise for enhancing the utility of ecological data.
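The search benefit described above can be sketched in a few lines of Python. Everything here is illustrative: the dataset identifiers and unit strings are invented, and group_by is a hypothetical helper rather than part of any existing search interface. The point is only to contrast grouping by the free-text unit with grouping by the valueURI.

```python
from collections import defaultdict

MG_PER_L = "http://qudt.org/vocab/unit/MilliGM-PER-L"
DEG_C = "http://qudt.org/vocab/unit/DEG_C"

# Invented records: (dataset id, unit as written in the metadata, valueURI).
RECORDS = [
    ("dataset-A", "mg/l", MG_PER_L),
    ("dataset-B", "milligrams per liter", MG_PER_L),
    ("dataset-C", "mg per liter", MG_PER_L),
    ("dataset-D", "degrees C", DEG_C),
]


def group_by(records, field):
    """Group dataset ids by one field: 1 = unit text, 2 = valueURI."""
    groups = defaultdict(list)
    for record in records:
        groups[record[field]].append(record[0])
    return dict(groups)


by_text = group_by(RECORDS, 1)   # free-text units: every spelling is its own group
by_uri = group_by(RECORDS, 2)    # annotations: spellings collapse onto one URI

print("Datasets sharing the text 'mg/l':", by_text["mg/l"])
print("Datasets sharing the mg/L valueURI:", by_uri[MG_PER_L])
```

Grouping on the text finds only the one dataset that happens to spell the unit "mg/l", while grouping on the valueURI finds all three milligram-per-liter datasets.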