Adventures in metadata ...

Metadata are good. Everyone should create metadata. But writing raw XML is bad. rOpenSci has created the EML package to make this easier.

In theory. My first look at EML suggested it would be intimidating to a casual user of R. But the proof is in the pudding, so I thought I should give it a test drive with some actual data. I’ve put some data generated in my Ecology Lab classes up on figshare, which makes this an ideal opportunity.

devtools::install_github("ropensci/EML")

The data are on figshare, so I can read them directly from the download URL.

library(tidyverse)
## Warning: package 'tibble' was built under R version 3.5.2
aquariums <- read_csv("https://ndownloader.figshare.com/files/9932452")
## Warning: Missing column names filled in: 'X14' [14]
# write out the data file for later, creating the data/ directory if needed
dir.create("data", showWarnings = FALSE)
write_csv(aquariums, "data/aquarium_experiment_2017.csv")

OK. Now I want to make a metadata record for this data.

library(EML)

The first problem I’m running into is the absence of a vignette to tell me what to do. Even which function to start with! I thought there was one on GitHub …

Ah, there are vignettes, but they don’t appear in the documentation. I can see the source on GitHub though, so I’m following along with the “Creating EML” vignette. It looks a bit intimidating because they build everything up manually. There are three levels to this thing: the EML file contains dataTable elements, which in turn contain a variety of more specific elements. There are things besides the dataTables in the EML, but they are the most important bit. We start from the bottom up. First we need a data frame describing the attributes of our dataTable, with one row for each variable in the data. Some of the columns depend on the type of the variable: a definition for strings, a format for dates, and so on.

aq_attrib <- data.frame(
  attributeName = names(aquariums),
  attributeDefinition = c(
    "Year",
    "Month",
    "Day of month",
    "Tank ID",
    "Time 24 hour",
    "Team ID",
    "pH",
    "Temperature, degrees celsius",
    "Ammonium concentration, mg/L",
    "Data quality Flag, Ammonium",
    "Nitrite concentration, mg/L",
    "Data quality Flag, Nitrite",
    "Nitrate concentration, mg/L",
    "Data quality Flag, Nitrate",
    "Total Phosphorous concentration, mg/L",
    "Data quality Flag, Phosphorous",
    "Comments field"
  ),
  formatString = c(
    "YYYY",
    "MM",
    "DD",
    NA,
    "hh:mm",
    NA,
    NA,
    NA,
    NA,
    NA,
    NA,
    NA,
    NA,
    NA,
    NA,
    NA,
    NA
  ),
  definition = c(
    NA,
    NA,
    NA,
    NA,
    NA,
    "Team ID",
    NA,
    NA,
    NA,
    NA,
    NA,
    NA,
    NA,
    NA,
    NA,
    NA,
    "Comments field"
  ),
  unit = c(
    NA,
    NA,
    NA,
    "number",
    NA,
    NA,
    "dimensionless",
    "celsius",
    "milligramsPerLiter",
    NA,
    "milligramsPerLiter",
    NA,
    "milligramsPerLiter",
    NA,
    "milligramsPerLiter",
    NA,
    NA
  ),
  numberType = c(
    NA,
    NA,
    NA,
    "integer",
    NA,
    NA,
    "real",
    "real",
    "real",
    NA,
    "real",
    NA,
    "real",
    NA,
    "real",
    NA,
    NA
  ),
  stringsAsFactors = FALSE  
)
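
Typing out seventeen-element vectors that are mostly NA makes it easy to slip by one position. A minimal sketch of an alternative, using only base R: start with every type-specific column set to NA, then fill in values by attribute name. The three column names below (Team, pH, Temp) are hypothetical stand-ins and may not match the names in the figshare file.

```r
# Hypothetical column names, standing in for names(aquariums)
cols <- c("Team", "pH", "Temp")

# Start with every type-specific field empty
attrib <- data.frame(
  attributeName = cols,
  attributeDefinition = c("Team ID", "pH", "Temperature, degrees celsius"),
  formatString = NA_character_,
  definition = NA_character_,
  unit = NA_character_,
  numberType = NA_character_,
  stringsAsFactors = FALSE
)

# Fill in fields by name rather than by position
attrib$definition[attrib$attributeName == "Team"] <- "Team ID"
attrib$unit[attrib$attributeName == "pH"] <- "dimensionless"
attrib$unit[attrib$attributeName == "Temp"] <- "celsius"
attrib$numberType[attrib$attributeName %in% c("pH", "Temp")] <- "real"

attrib
```

The result has the same columns as the hand-typed table, so it should be usable in the same way, although that has not been tried against the real data here.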

One thing that wasn’t immediately obvious was how to specify the units. But there’s a vignette for that! get_unitList() gave me a list. A big one. Most of the units were obvious, but pH gave me pause. After digging through definitions on Wikipedia I believe it is dimensionless, but I could be wrong.
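
For what it is worth, the usual definition supports that choice: pH is the negative base-10 logarithm of the hydrogen ion activity, and an activity is a ratio with no units. A quick check in R, with a hypothetical helper `ph_from_activity()` and the textbook approximation that activity equals concentration in mol/L (an assumption that only holds for dilute solutions):

```r
# Hypothetical helper: pH from hydrogen ion activity
# (activity approximated here by concentration in mol/L)
ph_from_activity <- function(activity) -log10(activity)

# Neutral water at 25 degrees C has roughly 1e-7 mol/L of hydrogen ions
ph_neutral <- ph_from_activity(1e-7)
ph_neutral
```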

By this point I was wishing I’d selected a smaller example! But onwards. The last type of variable in the data is a factor, or enumerated string. My data quality flags fall into that category, so I need to make a table describing all the levels of those. One row per level. This is a good spot for row-wise dataframe creation.

aq_factors <- tribble(
  ~attributeName, ~code, ~definition,
  "Flag_NH4", "Bad", "Not good data",
  "Flag_NH4", "Good", "Good data",
  "Flag_NH4", "Out-of-range", "Exceeded maximum measurement on instrument",
  "Flag_NO2", "Bad", "Not good data",
  "Flag_NO2", "Good", "Good data",
  "Flag_NO2", "Out-of-range", "Exceeded maximum measurement on instrument",
  "Flag_NO3", "Bad", "Not good data",
  "Flag_NO3", "Good", "Good data",
  "Flag_NO3", "Out-of-range", "Exceeded maximum measurement on instrument",
  "Flag_P", "Bad", "Not good data",
  "Flag_P", "Good", "Good data",
  "Flag_P", "Out-of-range", "Exceeded maximum measurement on instrument"
  )
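
All four flag columns share the same three levels, so the table can also be generated rather than typed. A minimal sketch in base R, assuming the levels really are identical for every flag column; `aq_factors_alt` is a name made up here so it does not overwrite the table above.

```r
# Columns that share the same set of quality flags
flag_columns <- c("Flag_NH4", "Flag_NO2", "Flag_NO3", "Flag_P")

# One row per level, written once
flag_levels <- data.frame(
  code = c("Bad", "Good", "Out-of-range"),
  definition = c("Not good data", "Good data",
                 "Exceeded maximum measurement on instrument"),
  stringsAsFactors = FALSE
)

# Repeat the levels for every flag column
aq_factors_alt <- data.frame(
  attributeName = rep(flag_columns, each = nrow(flag_levels)),
  code = rep(flag_levels$code, times = length(flag_columns)),
  definition = rep(flag_levels$definition, times = length(flag_columns)),
  stringsAsFactors = FALSE
)

nrow(aq_factors_alt)  # 12 rows, one per column and level
```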

So that’s the data described; now we create an object that EML knows about. I thought I could be clever and derive the classes from the data, but I ran into a snag: Time was brought in as class hms, which inherits from difftime. I coerced it to a character vector and then my cleverness worked, but it turns out the set of values allowed for col_classes is not the same as the classes of the objects. So there is no way to avoid typing them out.

aq_attribList <- set_attributes(aq_attrib, aq_factors, 
                                col_classes = c("Date",
                                                "Date",
                                                "Date",
                                                "numeric",
                                                "Date",
                                                "character",
                                                "numeric",
                                                "numeric",
                                                "numeric",
                                                "factor",
                                                "numeric",
                                                "factor",
                                                "numeric",
                                                "factor",
                                                "numeric",
                                                "factor",
                                                "character"))

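A partial shortcut is possible: guess most of the values from the R classes and override the exceptions by name. The sketch below is not part of EML. `guess_col_class()` is a hypothetical helper, the data frame is a toy stand-in for aquariums, and the five allowed values ("numeric", "character", "factor", "ordered", "Date") are taken from my reading of the set_attributes() help, so treat them as an assumption.

```r
# Hypothetical helper: map an R column to one of the col_classes values
guess_col_class <- function(x) {
  if (is.ordered(x)) return("ordered")
  if (is.factor(x)) return("factor")
  if (inherits(x, c("Date", "POSIXt", "difftime"))) return("Date")
  if (is.numeric(x)) return("numeric")
  "character"
}

# Toy data standing in for aquariums; hms columns inherit from difftime
toy <- data.frame(
  Year = c(2017, 2017),
  Time = as.difftime(c(9.5, 14), units = "hours"),
  pH = c(7.2, 7.9),
  Team = c("A", "B"),
  stringsAsFactors = FALSE
)

classes <- vapply(toy, guess_col_class, character(1))

# Columns that are numbers in R but dates in the metadata need an override
classes["Year"] <- "Date"
classes
```

The overrides are the point: Year, Month and Day are plain numbers to R, so no lookup on the class alone can get them right.
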
The next step is to describe the physical structure of the data. This was easy because the defaults of set_physical() are for a CSV file.

aq_physical <- set_physical("data/aquarium_experiment_2017.csv",
                            url = "https://ndownloader.figshare.com/files/9932452")
## Automatically calculated file size using file.size("data/aquarium_experiment_2017.csv")
## Automatically calculated authentication size using digest::digest("data/aquarium_experiment_2017.csv", algo = "md5", file = TRUE)
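
The two messages name the values that were filled in: a file size and an MD5 checksum. A minimal sketch of computing the same kind of values by hand, on a small temporary file so it runs anywhere. It uses `tools::md5sum()` from base R instead of digest; both should give the MD5 of the file contents, but that equivalence is an assumption here and was not checked against set_physical().

```r
# Write a small stand-in file so the sketch does not depend on the download
path <- tempfile(fileext = ".csv")
writeLines(c("Tank,pH", "1,7.2"), path)

# Size in bytes, as reported by file.size()
size_bytes <- file.size(path)

# MD5 checksum of the file contents, as a 32-character hex string
md5_hex <- unname(tools::md5sum(path))

size_bytes
md5_hex
```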

Now we put the whole thing together into a dataTable object.

aq_dataTable <- eml$dataTable(entityName = "aquarium_experiment_2017.csv",
                    entityDescription = "Nutrient cycling in new aquaria",
                    physical = aq_physical,
                    attributeList = aq_attribList)

Now there are a few more things to do, but I’m wondering if this is already sufficient to create an EML file. I need to create a “dataset”, and then put that into an eml object.

creator <- eml$creator(
  individualName = list(givenName = "Andrew", surName = "Tyre"),
  electronicMailAddress = "atyre2@unl.edu")

data_set <- eml$dataset(
  title = "Nutrient cycling experiment in NRES 222",
  creator = creator,
  pubDate = "2017",
  intellectualRights = "CC0",
  distribution = eml$distribution(),
  coverage = eml$coverage(), 
  purpose = eml$purpose(),
  maintenance = eml$maintenance(),
  contact = creator,
  dataTable = aq_dataTable)

eml_222 <- list(
  packageId = "b0efd6ed-3b61-40f7-b29e-318dad7eb095",
  system = "uuid",
  dataset = data_set)

eml_validate(eml_222)
## [1] FALSE
## attr(,"errors")
## [1] "Element 'enumeratedDomain': Missing child element(s). Expected is one of ( codeDefinition, externalCodeSet, entityCodeList )."
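
One plausible cause, and it is only a guess from the output above: read_csv() warned that column 14 had no name and was filled in as X14, while the factor table refers to Flag_NO3. If the names in the attribute table and the factor table disagree, a factor column ends up with no codes. A minimal sketch of a check for that, using toy vectors rather than the real objects:

```r
# Hypothetical toy inputs: one flag column picked up a filled-in name
attribute_names <- c("NO3", "X14", "P", "Flag_P")
col_classes <- c("numeric", "factor", "numeric", "factor")
factor_table_names <- c("Flag_NO3", "Flag_NO3", "Flag_P", "Flag_P")

factor_columns <- attribute_names[col_classes == "factor"]

# Factor columns with no code definitions at all
missing_codes <- setdiff(factor_columns, factor_table_names)

# Code definitions that match no column
unmatched_codes <- setdiff(factor_table_names, attribute_names)

missing_codes
unmatched_codes
```

With the real objects, the same comparison would use names(aquariums), the col_classes vector, and aq_factors$attributeName.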

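OK, so not going to get off that easy! I left out a lot of fields in my rush to get some kind of usable starting point, and some of them aren’t described in the vignette, so I was hoping I could skip those. The one error validation reports, though, is about an enumeratedDomain with no code definitions, which suggests that at least one of my factor columns has no matching rows in aq_factors. So I’ve created some EML, but it isn’t valid.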

EML is a great idea; I love that I could create metadata programmatically for a dataset. This will make updating really easy. The package has progressed enormously from the first time I tried this, although it still has a ways to go. The documentation in particular seems neglected. In an ideal world I’d fork the repository and start pounding out documentation. Unfortunately the world isn’t ideal, and I have to shelve this for now.

Andrew Tyre
Professor of Wildlife Ecology