Jamie's Museum Informatics Blog

Tuesday, March 18, 2014

Custom Color Palettes in Tableau

Tableau and Color Palettes

Tableau is a powerful business intelligence tool that allows for interactive visualizations of large sets of data. I have successfully used Tableau to connect to text data sets of more than 2 million rows and have been able to generate large interactive visualizations for biodiversity data sets and departmental annual reports for the UM College of LS&A. Tableau allows users to define their own color palettes using a simple xml based schema, and I have recently coded color schemes for 1) University of Michigan based branding, 2) Wes Anderson Palettes , and 3) some Bauhaus inspired color sets.

You can ignore all the following and just jump to the two links below to get started:

How to customize colors in Tableau
My Tableau preferences file from github (which includes the color palettes)

This file assumes a windows operating system, so the file has the default line endings generating by the Context Text Editor in windows.

Wolverine Maize and Blue

The university of Michigan has a design principles guide that defines brand standards and establishes a style guide for UM colors.

UM Color Palette

I defined a set of UM based color palettes around this scheme

UM Maize to Blue -- a diverging color from maize to blue for use for tagging continuous data
UM Web -- the set of UM web colors for identifiying categorical data
UM Print Neutral -- the set of neutral and muted colors for print media
UM Print Bright -- the set of bright colors for print media
UM Print All -- All primary and secondary colors for print media including metallic
UM Expanded -- A small set of expanded colors

Use for a visualization with UM web colors would look like the following:

Wes Anderson Color Palette

The excellent Wes Anderson Palletes tumblr has a fun set of Wes Anderson inspired color palettes. Some of theses have been transformed to color palettes for R (https://github.com/karthik/wesanderson) and I wanted to have access to these in Tableau .

Don't point that gun at him, he's an unpaid intern.

These color palettes all start with the Wes prefix and include all color palettes available from the tumblr on March 18th, 2014.

A 'Life Aquatic' palette visualization is:

Bauhaus

My personal aesthetic tends to be drawn to more modern and simple form-follows-function Bauhaus inspired color themes. I've added a simple Bauhaus inspired set along with a color-theory wheel. I will possibly also add a Le Corbusier color set as well.

12 color wheel from Johannes Itten

This set includes:

Bauhaus -- Simple 5 color Bauhaus influenced theme
Farbkreis -- 12 color color wheel from Johannes Itten

The five color Bauhaus palette looks like:

Friday, January 31, 2014

GULO: Museum Databases as a Collection of Globally Unique Linked Objects

Museums and Modern Data Stores for Connected Data

Over the past six months I have been doing some thinking (and a little bit of coding) about database schema and interface designs that would support hosting any information about any museum collection. One major goal of my design planning is to think about a museum database as more than just a database of transcribed label data or a management solution for knowing where objects are location. Ideally I think a museum datastore should be a rich collection of metadata that allows for linking individual objects to each other as well as to the digital resources associated with the collection.

My current position as a ~~museum informatician~~ business systems analyst at the University of Michigan supports museums in multiple disciplines including anthropology, archaeology, botany, paleontology and zoology. I therefore can not limit my thinking about museum data to biological collections. However since I have a background in developing software and data stores for genomics and phylogenetics I generally think about how the existing ontology anchored database schema like Chado and BioSQL could be extended to support museum collections and allow for data connections between physical accessions in a museum and the digital resources associated with those collections. I have began some initial work on extending the Chado database schema to support museum collections, and I outline some of the early results of this experience below.

Database Engineering Decisions

In my opinion an ideal database system for museum collections will have the following properties:

Interoperability - An open schema design and data store that can interoperate internally with existing software resources as well as be able to interoperate with external resources such as centralized metadata warehouses like iDigBio or GBIF as well as archival media repositories such as DLXS, Hathi Trust or the Digital Preservation Network.
Flexibility - Can store any meta-data about a museum collection or museum object as well as any digital media associated with that item. Ideally this will not be limited to just a label transcription, but can store phenotypic measurement data directly in the data store. Slight changes to the type of data that you want to store should not require any modifications of the underlying tables in the database.
Linkability - Data about museum objects can be effectively linked to existing or newly derived resources about that object. This allows for queries that link information about the object across multiple domains of knowledge. This allows for the all important JOIN queries. External data for museum objects include but are not limited to:

Molecular sequence data such genome scale molecular sequence data associated with an accession. These range from whole genome sequence assemblies to SNP and
Literature references to the use of the object (ie. objects used in phylogenetic studies or population natural diversity studies) and the description that object (ie. type specimens)
Digital media collected in the field or derived once the object has been accessioned
Collections objects can be linked to a named collection and individual objects can be associated with more than one collection

Stability - Stability means the underlying schema and database storage software passes the ACID test. The data store needs to be a reliable archival quality data store.

I think that that the weak-data typing paradigm that has been employed in open source genomic databases can be effectively combined with existing ideas on generating shared open and connected data for museums. I have therefore tried to combined these ideas in the database engineering decisions that I've made to create a flexible museum data storage system. The key concepts I have incorporated in a an RDBMS schema for museum collections are:

Weak-data typing - The weak data typing paradigm allows for storing any information about an object. This system is designed for archival quality storage of meta data and is not indexed for speedy retrieval of specific types of data. Think of this as a way of storing museum data in archival boxes for retrieval similar to the way we use general purpose compactors to store museum objects in a space efficient manner. We can use this system express the data in other formats such as RDF for speedy task specific queries.
Controlled Vocabularies - The use of controlled vocabularies provides the means for connected information about an object to a dictionary of what that information is. Ideally this would be a vocabulary that is based on a published ontology. Most importantly the system allows for mapping among controlled vocabularies. Thus it is possible to translate a system one dictionary of terms to another dictionary of terms. For example, we can translate information in an internal set of terms to Darwin Core terms.
GUIDs - Globally Unique Identifiers (GUIDs) are the cornerstone for building the web of connected data.
Nested set model - Objects in collections follow the nested set model object representation. This allows for tracking parent-child relationships of museum objects. For example genome information about an individual egg, collected from an individual fish specimen in jar of multiple fish can be tracked as three sets of meta-data that are connected and query-able.

The MuseSQL Schema

I chose to start by engineering a museum specific module extension to the Chado database schema. I called the SQL extension module for museum data MuseSQL (pronounced ♪musical♫) since this is a Museum SQL store. I chose to use Chado due to 1) it's wide use in genomics as part of GMOD, 2) a large library of existing interoperable modules and 3) my familiarity with loading both OBO and OWL ontologies into the Chado controlled vocabulary tables.

Information on how to store and use controlled vocabulary tables in Chado have been extensively discussed elsewhere (Mungall et al, CV module documentation) so I will just generally outline them here. The Chado CV terms allow for any type of information about an object to be stored in a stable way that can be referenced back to a definition of what that term means. Since this method includes support for the storing relationships among terms, we can also store these definitions as an ontology. The database even allows for mapping the relationship among terms from different ontologies to be stored in the same database. Thus it is possible to use the relationship table to store a cross referenced set of terms mapped onto another set of terms. For example, it is a straightforward exercise to map terms from an existing set of ad-hoc database columns to Darwin core terms, or to map Darwin Core terms to another controlled vocabularies such as schema.org.

The core concepts regarding museum collections that exist in the MuseSQL module include:

Museum objects - These are the physical object in a museum collections. This is about the stuff .. be it jars of fish, pressed plants, boxes of seeds etc.
Digital objects - These are any type of digital resources including image scans of an object, images captured in the field, audio capture in the field, 3d scans of an object etc. I don't require a dependency between a digital object and a museum physical object so that digital media captured by a museum can be held even when there is not voucher for that digital object. This could for example include recordings of birdsong captured in the field, or digital photographs taken in association .
Collection - A collection of museum objects or digital objects
Transactions - Time delineated events that concern a collection or object. This can include accessioning, de-accessioning, loans or transfer to another institution.
Institutions - Including the people and places associated with a museum or a transaction among museums

An initial implementation of this design has been developed in MySQL. An overview of the schema for the museum module tables and the controlled vocabulary tables looks like the following:

These tables are outline in more detail below, but this gives the birds-eye view of the design and shows the general links among concepts. The table names all follow Chado naming conventions in which names of tables are in all lowercase and do not have spaces or special characters. Linker tables have a '_' character naming the two tables that they link.

Controlled Vocabularies

The controlled vocabulary tables are what support the tagging of properties of objects within the collection. These tables are discussed in detail in the the Chado documentation of the CV module, and are represented here to show how the MuseSQL tables connect to the controlled vocabulary tables.

Museum Objects

The museum objects represent the physical items that a individual museum unit is responsible for curating.

The tables for museum objects include:

muso - the main museum object table. This is the table that represents an individual item in a museum. There are some important columns here that are worth pointing out:

muso_parent_id - this column allows for individual items to be associated with a parent item. For example if we want to store measurement information of the individuals from a jar of multiple fish. The individual fish can be treated as objects that are children of the jar object. The jar object would be used to record the label information, while the individual fish objects would be used to store the phenotypic measurement data.
left_index & right_index - these are the indexes that provide nested-set model support of storing the collection data such that it is possible to do a search for all child objects for a given parent object
guid - this is the GUID for the individual item. It would have been possible to provide support for GUIDs in the musprop table, but I wanted to store a single GUID within the table as a column to allow for quick queries and indexes against the GUID. This would be the internal institutional GUID. The musoprop table would support any additional GUIDs for that item. For example a GUID assigned by iDigBio could be supported in the database.
institution_id - is the institution that owns the object. This column allows for multiple museum units within a single university to share a single datastore while allowing for custom queries about just the items that an individual unit is responsible for. It also allows for a datastore to include metadata for museum objects that were borrowed from another institution.
catalog_number - supports an internal catalog number that may or may-not be a GUID. If the item has multiple additional historical catalog numbers, these can all be supported in the musoprop table. This allows for institutions to maintain historical catalog numbers that are not GUID compliant.

musoprop - properties of a museum object, for example we would have a row each for each bit of data derived from a specimen label (collector, collection date, verbatum_longitude, verbatum_latidtude etc)
museloc - a physical location within a museum. This could be a cabinet, compactor display case etc. This is where the item is physically located.
muselocprop - A property of a museum object location
musocol - A collection of museum objects
musocolprop - Properties of the collection of museum object.

Digital Objects

Digital objects are the digital assets association with a museum collection. These could really by any digital data that the museum is responsible for the curation of.

The digital object tables include:

digo - a digital object. The digital object could be things like a sound recording from the field, an image of a specimen, a 3D scan of a specimen, a DNA sequence tracefile associated with a specimen or any other digital media that may need to be curated by a museum.
This table includes a GUID for the digital object as well as a DOI for an archival location of the digital object. The use of both a GUID and a DOI may be redundant, but I can see cases where an object may need to have a GUID before a DOI can be associated with it.
digoprop - a property of the digital object. This would be any metadata associated with the digital object and could include things like EXIF data from an image.
digocol - a collection of digital objects. Individual objects can be associated with more than one collection.
digocol - a property of the collection of digital objects.
digocol_digo - a linker table that allows for indivudal digital objects to be linked into a digital collection. It is possible for individual objects to be part of one or more collections, or to not be part of any collection at all
digo_muso - a linker table that allows for digital objects to be linked to museum objects. This is what allows for museum specimens to be linked to digital information (such as image scans) about that specimen. This would allow for meta-data such as label information to remain linked to the physical object, while the digital objects would just need to record information about the digital asset. Thus multiple images of an object from multiple angles could be stored without any need for a redundancy in the meta-data about the object.

Institutions

The institution tables record the people and places associate with transactions of museum resources. I would possibly replace the institution tables below with contact module tables from Chado if I can work out a system for reliably using the contact tables in chado to derive links to the transactions table of MuseSQL. One key difference between the chado module and the MuseSQL tables is that the institution and person tables below allow for GUIDs to be associated with the people and places. For example, person GUIDS could make use of ORCIDs.

The tables include:

person - the person associated with a transaction.
personprop - a property associated with that person. this is where you would store information like address, contact email, phone number etc
institution - an institution. This would be the institution that owns objects within a collection (for example UMMZ or MICH), or another external institution that is involved in a transaction event.
institutionprop - the properties of the institution.
institution_person - allows for the association of individuals with an institution or multiple institutions
institution_person - allows for describing the nature of the association with an institution, ie curator, professor, volunteer, docent etc.

Transactions

Transactions are any time-delineated event that affects a collection of museum objects. This could be a loan, a set of accessions, a set of deaccessions, a pesticide treatment for a collection, a transfer to another internal unit or any other transaction or event that occurs to a collection of objects. Since a collection of objects can be just a single object, then this supports any size collection ranging from one to all of the items in the collection. Thus transactions can support any set of things in your collection. It is possible in this schema for an object to be part of many different collections so it is possible for the many different types of transactions to be recorded for any one item in the collection.

Transaction records include the following tables:

musocol_transaction - named currently as a linker table since it is linking a person that is responsible for a transaction to the set of objects that is involved in the transaction
musocol_transactionprop - a property of the transaction

Meeting the Engineering Goals

Looking back the the engineering goals for the database we can consider how the MuseSQL schema meets those goals:

Interoperability - Since this model is based on an open source MySQL RDBMS language there are many systems that can directly connect to the underlying data store. This includes programming languages like Java, Perl, and PHP as well as data visualization and analysis interfaces such a R or Tableau. It is therefore possible to build many types of GUIs to access and display the underlying database in the language of your choice. It is also possible to translate the MySQL code to support a PostgreSQL RDBMS or other RDMBS and thus have the MuseSQL schema available from many other RDBMS storage system. Even better the design principles of the data store allow the RDBMS to be translated to a RDF triple store which would make the data available in federated queries.
Flexibility - This system can store truly store any information about a museum resource. Modifications in the type of data that users want to store in the database only require a modification or addition to the records in the controlled vocabularies. The MuseSQL table designs do not change with changing needs in data storage, and thus development of an API and GUI against the database can rely on a stable table structure.
Linkability - The digo_muso table allows digital objects to be linked to physical objects in a museum and both the museum object set and digital object set have tables that allow for combining objects into collections. Although physical objects can be linked to digital resources, it is also possible for digital objects and physical museum objects to stand alone without any sort of dependency on a physical resource or digital record.
It is also possible to link the museum objects to the following data and analysis modules via linker tables

Chado Publication Module to store references to and about the object in the literature
Chado Sequence Module to store sequence data derived from museum specimens
Chado Natural Diversity Module to store genotyping and phenotype data associated with collections
Chado Comparative Analysis Module to store sequence comparative analysis data
Chado Mage Module to store data derived from microarray experiments
Chado Contact Module could even be used instead of the instituion tables above
Chado Phylogeny Module would allow for museum speciments to be directly connected to phylogenetics. This would allow for phylogenetically informed searches of the database.
Chado Organisms Module could hold a taxonomic reference list

Stability - MySQL complies with ACID, and the workflows for data entry and deletion can be controlled to make MuseSQL a stable archival quality database resource.

Wolverines are Gluttons

General methods for connecting to the MuseSQL data store would ideally involve the development of an API that treats the dataset as a collection of Globally Unique Linked Objects. Since I am employed as a UM wolverine, Gulo seemed like the appropriate name for the API and data collection framework. Wolverines (Gulo gulo) are gluttons that are not very finnickey about what they eat and will consume carrion fast. So what better name for a flexible museum database system that could quickly consume any information for any object. GULO also references the basic idea that objects in the collection would be assigned a globally unique identifier in a way that allows objects and meta-data to be universally linked to each other.

One important outcome from this design is that it allows for a RDBMS archival storage system that can be forward translated to RDF triple-stores. Since the system integrates GUIDs and controlled vocabularies, it is a very simple matter to translate these data into to representative triple stores that can be hosted online and used in federated queries of RDF stores.

So Where are GULO and MuseSQL?

The MuseSQL schema is currently a test framework that exists only on my laptop as MySQL tables and SQL code. The code is and backed up to our LSA/MIS SVN server but is currently not deployed anywhere in production. I have written code to consume ontologies and vocabularies such as Darwin Core and have tested the ability to tag test museum specimen data. Although the MuseSQL schema is not something that is currently in production at UM, I would hope to have the opportunity to develop aspects of MuseSQL and GULO in the near future. I just wanted to put this blog post out there to get public feedback on the schema now rather than years down the road when it would be less helpful.

My Opinions are my own and do not reflect the opinions of the University of Michigan, the UM College of Literature and Science and Arts, the Management Information Systems group, the University of Michigan Museum Units or any wolverines.

Tuesday, November 5, 2013

Illustrating the First Record of Species Occurrence with Tableau

The following is an experiment of features in Tableau, and is not a presentation of normalized compete data for the Flora of Michigan.

The following is a test of using the Tableau program to explore a collection of Herbarium resources at the University of Michigan. These data have not been normalized and do contain some inaccuracies. The species indicated as invasive are based on a query that does contain some problems. For an accurate and verified Flora of Michigan please see the Flora of Michigan web site which includes an accurate vetted assessment of alien status in Michigan. The new Field Manual of Michigan Flora is also a must-have for anyone interested in the Flora of Michigan.

The Tableau software allows us to connect to a live database such as MySQL or MS Access, and create informative visualizations. These visualizations can be exported as images, pdfs, and even presented on the web as interactive worksheets. In the example below, I connected to a summary table in an MS Access database that was converted to a Tableau data extract for optimized queries. These data extracts can be updated from the live data on a regular basis either manually or in an automated fashion. This allows for fully optimized read-only access to the data, while allowing for regular updates to reflect the current state of knowledge for a collection.

Overview of a Michigan Herbarium Database

Using Tableau it is possible to capture a tremendous about of information about a museum collection in a single view. For example, taking a database of herbarium specimens as input, the image below is an overview of the collections of native and invasive species collected in Michigan. This shows the county level distribution of the collection and provides an the indication of when the taxon was first collected in the state.

This view summarizes nearly a quarter of a million specimens, and has multiple parts summarizing this data.

Map and Summary

The upper right hand area of the view shows a map of the distribution of accessions and a summary for that view.

The upper right hand panel shows a summary of the number of distinct taxa names, the number of distinct collector names, and the total number of accessions in the collection. The current view shows 8,432 distinct taxon names in the collection occur in the state of Michigan. This number includes some misspellings as well as many subspecies, varieties and named hybrid species. There are over 6,000 collector names that have accession in the collection, but this is currently an overestimate since it is doing a match on the full name of the collector. Misspellings and accessions with multiple collectors are inflating this number.

The size of the dots over each county represent the number of accessions for that county, with putative invasive species shown in orange and all other species indicated in blue.

Collection Years

The bottom panels of the view explores the temporal pattern of the collection.

The bar chart at the top represents the number of state records that are recorded for that year. These are the first known vouchered occurrences for those species in the state of Michigan. Other spottings may have been recorded, but these records are physically represented in the collection. Similar to the distribution map above, this chart shows putative invasive species in orange and all other species in blue.

The bar chart at the bottom represents the total number of accessions collected in that year. Comparing the top chart to the bottom illustrates that many of the important early state records were some of the only specimens collected in that year. We can also clearly see the effect of World War II on collecting in Michigan.

List of Taxa

The top middle panel gives the list of distinct taxon names.

This is a scrollable list that gives the number of accessions for that taxon in the state. Putative invasive species are shown in orange and all other species are indicated in blue.

"Acer ruburm an invasive!" you say .. yes this is native in Michigan. See ... I told you this is an illustration of a process that can be facilitated by Tableau, not publishable hard science. One of the things that Tableau facilitates is finding errors and outliers in your data set that can be fixed by data normalization before taking your findings to print.

Who did all the Work

The top left panel is a bubble plot of the scientists that have contributed to the collection.

The larger the dot the more vouchered specimens the individual biologist have contributed to the collection. In the interactive view, hovering over an individual dot reveals the name of the scientists and the number of accessions in the collection.

Filter Records by Subcollection

The real power in the Tableau interface is the interaction among the panels. For example, we can select what components of the collection we want an overview for by selecting the sub-collections in the panel on the right.

Bryophytes (Mosses)

Vascular Plants

It is possible to show any combinations of the subcollections. For example the Monocots, Dicots, Gymnosperms and Pteridophytes.

Fungi

The following shows only the fungi records for Michigan in the database.

Filter Record by Location

All collections from Washtenaw County

It is also possible to use a view in one panel to filter the data displayed on other panels. For example, selecting Washtenaw county the location of the University of Michigan will show the collection statistics for that county. The plot of the year the taxon was first collected now represents vouchered county records for Washtenaw county. The taxa list is the list of all species known to have been collected in Washtenaw county.

Washtenaw County Native Species

Selecting on the blue portion of the pie in Washtenaw county will only display data for putative native species in Washtenaw county.

Washtenaw County Putative Invasive Species

Similarly selecting only the orange part of the pie will show data for putative invasive species in Washtenaw county. Again the year of first occurrence is the first year that putative invasive species was collected from Washtenaw county.

All Records from the Michigan Upper Peninsula

We can spatially select multiple counties as well. For example, we can limit the data display to all records in the Upper Peninsula of Michigan. The year first collected bar chart will show the year the species was first collected in the UP, and the taxa list is the list of all species collected in the UP.

Filter Records by Taxonomy

Fagaceae

I like the Oak family so we can limit a display to the Fagaceae.

Quercus

Limiting to all species of Quercus.

Quercus bicolor

We can even show just a single species, Quercus bicolor while ignoring the named hybrids.

Contributions of Single Individuals

Ed Voss

For example we can show the collections from Ed Voss in Michigan. The county map represents his individual contributions to the flora of Michigan, and the Year First Collected panel represents the first time he personally collected that species in the state of Michigan.

Fungus Collections from A.H. Smith

We can combine a filter on sub-collection and collector to visualize the fungus collections from Alexander H. Smith. This illustrates the diversity of his collecting, and shows that much of his work was conducted in the vicinity of the University of Michigan campus in Ann Arbor, and the University of Michigan Biological Field Station at the northern tip of the Upper Peninsula.

Why I like Tableau

As you can see, Tableau provides an exciting way to view natural history collections that adds a level of interactivity to a traditional distribution dot map. These views can help highlight needs in existing collections, and can provide an interactive way to highlight any accuracy issues in the underlying data.

Tableau provides a view of the biodiversity of a region that is anchored in vouchered specimens, and will hopefully serve as a venue for gaining new biological insights from a collection. Once data are available in an interactive form on the web, users will have access to on-demand lists and reports that will add value to any museum collection database.

Wednesday, October 30, 2013

Using Tableau to Explore the Flora of Michigan

One of the first things I worked at UM on was using Tableau to dynamically explore the Flora of Michigan. The example shown below uses a connection to an MS Access database that has been complied by folks at the UM Herbarium. The database shown here represents transcribed records of accessions in the UM Herbarium that have contributed to our knowledge of the Flora of Michigan.

This video is a few months old now but I just wanted to see if I can embed video on the Google Blog platform. I have since transferred these interface to other views that would allow for direct access to the database to workers in the field. For example you could now use your mobile phone to report where you are, a javascript will convert this data to county locality and then show you the accessions that are in the flora for that county using a Tableau interface. That way while in the field you can know that yes, I should collect that stupid common oak because it would actually represent a county record for that species.

The university has recently purchased a site license to Tableau, and will be rolling out a public interface to Tableau early next year. Even more exiting is that I am currently a Beta tester for the v8.1 version of Tableau that allows for direct interaction with the R statistical programming language. This will allow for direct interaction with biodiversity analysis tools like Vegan to be directly available for analysis against live biodiversity databases.

Getting Started

Hopefully this blog will serve as a place for me to post some notes regarding my work in museum informatics. Since July of 2013 I have been employed as Senior Business Systems Analyst working in the Management Information Systems Group in the College of Literature,Science and the Arts at the University of Michigan. The goal of my position is to serve as a museum informatician working with the various museums within LS&A. The specific museums I will be engaged with over the next few years include the Kelsey Museum of Archaeology, the Herbarium, Museum of Anthropology, Museum of Paleontology and the Museum of Zoology.

My current day-to-day activities involve working with the library to add tools to the Digital Library Extension Service (DLXS) that is used to host archives of digital media derived from museum objects. Recently I have been modifying the DLXS source code to be able to support microdata tags from schema.org to faciliate web indexing of digital media by web search tools such as Google, Yahoo and Bing. The DLXS system is currently used to host over 250 collections of digital media hosted by the UM Digital Library Production Services including many digital media archives from LS&A Museums. I am also working on developing new visualization tools for museum collections using tools such as Tableau to facilitate interacting with the information stored in current museum databases, and I am working to help transition existing collections to the KE EMu database system.

Most of my experiences to date have been in plant genomics, and I look forward to bringing genome scale views of databases and information to the needs in museum informatics at UM.