Bioschemas Workshop
Niall Beard
Bioinformatics Education Summit
13th May 2019
Preliminary Agenda
Expected Learning Outcomes
• Understand what schema.org is and how it can be applied to a
project
• Understand what Bioschemas is, how it differs from
schema.org, and what vocabularies are available
• Know the benefits and limitations to using schema.org
• Gain an understanding of how to apply (bio/)schema.org to
your site.
Workshop style
• Please do interrupt me if:
– You have any questions
– If you have difficulty reading the slides
– If I’m not speaking clearly enough
– Or if I am going too fast or too slow
What is…
Search Engines
[Diagram] Search engines connect users with information.
User-side signals: query text, demographic, location, device type
Information-side signals: document content, web traffic, link count, freshness
21 ‘signals’
Search Engines
[Diagram, continued] The same signals as above, now fed into algorithms that guess matches: text matching, named entity recognition, TF-IDF, NLP.
Take out some of the guesswork…
• Search engines need to predict what a page is
about…
• What if, instead, search engines let information
providers explicitly define their pages’ contents
• Rather than relying on algorithmic guesswork!
Slide courtesy of Alasdair Gray
Schema.org
• A lightweight way of structuring data online
• Created by a consortium of search engines to improve
experience and search efficacy
• Thousands of different vocabularies to describe information
online
Metadata model (e.g. the Recipe type)
<div itemscope itemtype="http://schema.org/Recipe">
  <div itemprop="nutrition" itemscope
       itemtype="http://schema.org/NutritionInformation">
    Nutrition facts:
    <span itemprop="calories">144 kcal</span>,
  </div>
  Ingredients:
  - <span itemprop="recipeIngredient">800g small new potato</span>
  - <span itemprop="recipeIngredient">3 shallot</span>
  . . .
<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "Recipe",
  "name": "Potato Salad",
  "nutrition": {
    "@type": "NutritionInformation",
    "calories": "144 kcal"
  },
  "recipeIngredient": ["800g small new potato", "3 shallot"],
  . . .
Readable by search engines
[Diagram] Pages embed schema.org markup alongside their content, and search engines read that markup from each page.
A training event – marked up in schema.org
– as shown by Google
https://search.google.com/structured-data/testing-tool
https://toolbox.google.com/datasetsearch
Search engines favour websites
containing schema.org in their search
results
Readable by Registries
[Diagram] Resources embed schema.org markup, and registries read that markup from each resource.
Schema.org is community made
• Schema.org is made up of decentralized
extensions from different industries
Schema.org is community made
• Extensions that see good usage get ‘folded-in’
to the core schema.org vocabularies
Schema.org is community made
• To take advantage of schema.org for
Bioinformatics, we need to make our own
community
Bioinformatics
/ Life science
Community
Part 2
Bioschemas
See: “The FAIR Guiding Principles for scientific data management and stewardship”,
Mark D Wilkinson et al, 2016
Schema.org is community made
• … Bioschemas is a community that proposes Life
science specifications to schema.org
Bioinformatics
/ Life science
Community
Bioschemas
• Bioschemas is a community project which:
– Creates Types for Life science resources
• Proteins, Samples, Beacons, Tools, Training, etc
– Creates Profiles to Refine & Enhance Types
• Marginality
• Cardinality
• Controlled Vocabularies
– Creates tools to make Bioschemas markup easier to
create, validate, and extract
Types
• Types = New vocabularies to propose to schema.org
– Some are Biological Types
– Some are Generic Types that are
useful to Life scientists
– These new types will be hosted at
bio.schema.org
– Currently at:
http://bio.sdo-bioschemas-227516.appspot.com
Biological Types
http://bio.sdo-bioschemas-227516.appspot.com/BioChemEntity
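As an illustrative sketch only (the entity name, identifier, and URL below are invented placeholders, and the exact required properties are defined by the relevant Bioschemas profile), a page describing a biological entity could embed JSON-LD such as:
<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "BioChemEntity",
  "name": "Example protein",
  "identifier": "EXAMPLE:0001",
  "url": "https://example.org/entities/EXAMPLE0001"
}
</script>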
Profiles
• Profiles = Refinement & Interoperability Layer
- Because every industry and domain shares
in these specifications…
- Every domain includes its own properties
- So we inherit lots of properties we don’t
care about
Schema.org is messy!
Profiles - Tidying up Schema.org
• For example:
– Dataset inherits from schema.org/CreativeWork
– CreativeWork (and therefore Dataset) contains
properties for:
• Character
• IsFamilyFriendly
• Material (e.g. leather, wool, cotton, paper)
• Genre
• Bioschemas offers an indication of how relevant /
recommended each property is, by grouping into
• Minimum | Recommended | Optional
Profiles
• Profiles = Refinement & Interoperability Layer
- schema.org’s generality means it does not
recommend which ontologies to annotate
with
- The lack of restrictions on cardinality makes it
difficult to parse the data (if you’re not a
huge search engine)
Schema.org is not great for interoperability!
Profiles - Improving interoperability
• Bioschemas profiles include cardinality
restrictions and controlled vocabularies
tailored to our use-cases
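For example (an invented sketch, not an excerpt from an official profile), an event described against a Bioschemas-style profile might always give exactly one start date and draw its keywords from a controlled vocabulary such as EDAM:
<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "Event",
  "name": "Introduction to Bioschemas",
  "startDate": "2019-05-13",
  "endDate": "2019-05-13",
  "location": "Example University",
  "keywords": "http://edamontology.org/topic_0091"
}
</script>
The property names are standard schema.org; the profile’s contribution is the rules about which of them must appear, how many times, and with what kinds of values.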
Profiles and their adoption
Profile Development process
• Determining the schema is a process of empirical surveying and expert opinion.
• We do a cross-walk to find which fields are missing and use this to gauge their marginality.
Profile Development process
[Flowchart] Go through each attribute (row) of the schema and agree on answers to the following questions:
– If we already have it: do we want to keep it? If we don’t have it: do we want to include it?
– Is the description provided okay, or do we want to rewrite it? (Column F)
– Should it be Minimum / Recommended / Optional? (Column G)
– Should there be one or many of them? (Column H)
– Should values be restricted to a controlled vocab? (Column I)
Profile Development process
• Discussions through our public mailing list
Profile Development process
We use GitHub to request new properties,
identify and manage bug fixes, and publicly
present our decision making
Case Study
TeSS:
Training materials,
Events, and Courses
Part 3
ELIXIR All Hands 2018, June 2018, Berlin, Germany
The ELIXIR Training Portal - TeSS
https://tess.elixir-europe.org
TeSS
• A training portal that indexes metadata from across the
web.
• Presents a wide selection of openly available training
resources across the bioinformatics discipline.
• Displays these in a navigable, easy-to-find manner in a
feature-rich environment.
View upcoming events of interest
https://tess.elixir-europe.org/events
Find training materials from around the Web
https://tess.elixir-europe.org/materials
TeSS Features
Search and Filter
• 270+ upcoming events
• 800+ training materials
• Filter with 10+ different facets
Institutional Login
• Login with ELIXIR AAI using your institutional or Google credentials with 1-click sign-on, to:
– Favourite resources
– Add new events & materials
– Create new training workflows
Events
• Stay informed about upcoming events of interest:
– E-mail subscription
– Import into calendar applications
TeSS Features
Link with other registries
• Training events and materials can be linked with resources from other registries:
– Tools & data services from bio.tools
– Databases, standards, & policies from fairsharing.org
Ontological Classification
• The BioPortal Annotator web service predicts topics of resources added to TeSS. These can be approved/rejected easily by our curation group.
Events map
• View filtered events plotted on a map to find the most accessible & relevant events.
Content sourcing
• Rely on community to register resources?
• Community needs to be moderated (to avoid spammers)
• Hard to get critical mass of community involvement
• Rely on curators to enter content?
• Curators need to be paid / incentivized
• Data entry is boring
• A drop in curation/moderation attention can lead to inaccurate,
malicious, or insufficient content
• Instead develop a solution that
• Takes metadata directly from sources
• Adds any resources to TeSS as they appear
• Updates any resources that have changed
How TeSS works
[Architecture diagram] The front end offers a search interface, a workflow viewer, training workflows, and forms for users to enter data. In the back end, an automated aggregator runs custom scrapers that find relevant online training resources and extract metadata from training material and event pages into a metadata catalogue of events, materials, and workflows.
Automatic extraction techniques
• There are several techniques we can use to extract metadata from content provider websites, depending on what’s on the site:
• Interface with an API
– Handy but rare; difficult for websites to implement
– Content aggregators must write a bespoke API client for each
• Structured data already embedded in the page (RSS, ICS)
– Limited amount of data
• HTML scraping
– A fragile technique that can break when there are changes to the website
Trade-off between ease of adoption and usefulness to aggregators
[Chart] Each extraction technique is plotted by how easy it is to implement on a website against how useful it is to an aggregator.
Content provider extraction technique statistics

Technique                 Events   Materials   Total
Schema.org / Bioschemas      9         6         15
HTML                         3         5          8
XML/JSON/YAML/CSV            4         3          7
iCal                         5        --          5
JSON API                    --         2          2
RSS                          1        --          1
Total                                             38
Content aggregation via Bioschemas
[Architecture diagram] The same architecture as above, except that a generic schema.org/Bioschemas scraper replaces one of the custom scrapers in the automated aggregator, extracting embedded markup from training material and event pages into the metadata catalogue.
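Concretely, the schema.org/Bioschemas scraper only needs to find and parse the embedded JSON-LD block in each page. A minimal, hypothetical event page it could read looks like this (the event name, date, and URL are invented):
<html>
<head>
<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "Event",
  "name": "Genome Assembly Workshop",
  "startDate": "2019-06-01",
  "url": "https://example.org/events/genome-assembly"
}
</script>
</head>
<body> ... the human-readable event page ... </body>
</html>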
Tools and
Techniques for
Implementation
Part 4
Technique for adding Bioschemas to a
website
• 1. Identify an appropriate schema (or schemas) for your content type
• 1.a If it doesn’t exist, e-mail the W3C mailing list or add an issue to the GitHub issue tracker
Issue tracker: https://github.com/BioSchemas/specifications
Mailing list: https://www.w3.org/community/bioschemas/
Technique for adding Bioschemas to a
website
• 2. Draw a table with your metadata fields on the left-hand side and the schema.org properties on the right.
• Map the ones that correlate (for example, see the sketch below).
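For instance, a mapping table for an events page might look like this (the left-hand names are hypothetical local field names; the right-hand names are real schema.org Event properties):

Your metadata field      schema.org property
Event title              name
Start date/time          startDate
End date/time            endDate
Venue                    location
Description              description
Event web page           url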
Technique for adding Bioschemas to a
website
• 3a. Use the Bioschemas
generator to create a
JSON-LD snippet that
you can (hopefully)
copy and paste into
your site. (This would
mean creating one for
every new schema.org
record you want to add)
http://www.macs.hw.ac.uk/SWeL/BioschemasGenerator/
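The generator’s output is ordinary JSON-LD to paste into your page inside a script tag. The snippet below is only a sketch of the kind of output you might get for a training material; the field values are placeholders, not real generator output:
<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "CreativeWork",
  "name": "Introduction to the UNIX shell",
  "description": "A hands-on beginners' tutorial",
  "keywords": "command line, shell",
  "author": { "@type": "Person", "name": "A. Trainer" },
  "license": "https://creativecommons.org/licenses/by/4.0/"
}
</script>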
Technique for adding Bioschemas to a
website
• 3b. If you can modify your site, paste in the JSON-LD template of the schema (from 3a) and render your metadata variables as the values for the corresponding keys, using the mapping from step 2 (a sketch follows below).
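A sketch of what such a template might look like in a server-rendered events page; the {{ ... }} placeholders are hypothetical template variables, and the exact syntax depends on your framework:
<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "Event",
  "name": "{{ event.title }}",
  "startDate": "{{ event.start_date }}",
  "endDate": "{{ event.end_date }}",
  "location": "{{ event.venue }}",
  "url": "{{ event.url }}"
}
</script>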
Technique for adding Bioschemas to a
website
• 3c. If your site uses a CMS such as WordPress or Drupal, explore whether there is an appropriate schema.org plugin you can use (or ask on the Bioschemas mailing list)
Tutorials
• Bioschemas Training Portal
– There is a step-by-step tutorial there for adding schema.org to Jekyll pages / GitHub Pages sites.
– Hopefully there will be more to come
https://bioschemas.gitbook.io/training-portal
Tools
• Bioschemas Generator
– Form-based tool to generate valid Bioschemas
JSON-LD
– http://www.macs.hw.ac.uk/SWeL/BioschemasGenerator/
• Validata [under construction]
– Web application for validating bioschemas markup
https://bioschemas.org/software/
Tools
• GoCrawlt
– JSON-LD schema.org extractor
• Buzzbang [on hold]
– Search engine that crawls the web for Bioschemas
JSON-LD
https://bioschemas.org/software/
Freebies from Schema.org
• Google Search Console
– Shows you what schema.org data Google is picking
up from your site, any errors, and advice on how to
fix them
– https://search.google.com/search-console
Freebies from Schema.org
• Google Structured Data Testing Tool
– Extracts the schema.org markup from a given web page or
from a code snippet, validates it, and shows you
what errors there are
– https://search.google.com/structured-data/testing-tool
Freebies from Schema.org ecosystem
• 3rd party plug-ins
– Lots available to help
add schema.org to your
framework
Slide courtesy of Alasdair Gray
Editor's Notes

  1. Collection of schemas can be used to describe online objects
  2. Schema.org very lightweight
  3. Going clockwise from top right – we have international organizations, communities surrounding technologies, national institutions, and other academic institutions. All output training events and/or materials and share via their own websites. Many, many opportunities in many, many locations.
  4. 273 Upcoming events – 7540 collected previously.
  5. 831 Training materials