Bioschemas Workshop
Niall Beard
Bioinformatics Education Summit
13th May 2019
Preliminary Agenda
Expected Learning Outcomes
• Understand what schema.org is and how it can be applied to a
project
• Understand what Bioschemas is, how it differs from
schema.org, and what vocabularies are available
• Know the benefits and limitations to using schema.org
• Gain an understanding of how to apply (bio/)schema.org to
your site.
Workshop style
• Please do interrupt me if:
– You have any questions
– If you have difficulty reading the slides
– If I’m not speaking clearly enough
– Or if I am going too fast or too slow
What is…
Search Engines
[Diagram] Search engines connect users with information.
User-side signals: query text, demographic, location, device type
Information-side signals: document content, web traffic, link count, freshness
21 ‘signals’
Search Engines
[Diagram, continued] The same signals as above, now fed into algorithms that guess matches: text matching, named entity recognition, TF-IDF, NLP.
Take out some of the guesswork…
• Search engines need to predict what a page is
about…
• What if, instead, search engines let information
providers explicitly define their pages’ contents
• Rather than relying on algorithmic guesswork!
Slide courtesy of Alasdair Gray
Schema.org
• A lightweight way of structuring data online
• Created by a consortium of search engines to improve
experience and search efficacy
• Thousands of different vocabularies to describe information
online
Metadata model (e.g. the Recipe type)
<div itemscope itemtype="http://schema.org/Recipe">
  <div itemprop="nutrition" itemscope
       itemtype="http://schema.org/NutritionInformation">
    Nutrition facts:
    <span itemprop="calories">144 kcal</span>,
  </div>
  Ingredients:
  - <span itemprop="recipeIngredient">800g small new potato</span>
  - <span itemprop="recipeIngredient">3 shallot</span>
  . . .
<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "Recipe",
  "name": "Potato Salad",
  "nutrition": {
    "@type": "NutritionInformation",
    "calories": "144 kcal"
  },
  "recipeIngredient": ["800g small new potato", "3 shallot"],
  . . .
Readable by search engines
[Diagram] Pages embed schema.org markup alongside their content, and search engines read that markup from each page.
A training event – marked up in schema.org
– as shown by Google
https://search.google.com/structured-data/testing-tool
https://toolbox.google.com/datasetsearch
Search engines favour websites
containing schema.org in their search
results
Readable by Registries
[Diagram] Resources embed schema.org markup, and registries read that markup from each resource.
Schema.org is community made
• Schema.org is made up of decentralized
extensions from different industries
Schema.org is community made
• Extensions that see good usage get ‘folded-in’
to the core schema.org vocabularies
Schema.org is community made
• To take advantage of schema.org for
Bioinformatics, we need to make our own
community
Bioinformatics
/ Life science
Community
Part 2
Bioschemas
See: “The FAIR Guiding Principles for scientific data management and stewardship”,
Mark D Wilkinson et al, 2016
Schema.org is community made
• … Bioschemas is a community that proposes Life
science specifications to schema.org
Bioinformatics
/ Life science
Community
Bioschemas
• Bioschemas is a community project which:
– Creates Types for Life science resources
• Proteins, Samples, Beacons, Tools, Training, etc
– Creates Profiles to Refine & Enhance Types
• Marginality
• Cardinality
• Controlled Vocabularies
– Creates tools to make Bioschemas markup easier to
create, validate, and extract
Types
• Types = New vocabularies to propose to schema.org
– Some are Biological Types
– Some are Generic Types that are
useful to Life scientists
– These new types will be hosted at
bio.schema.org
– Currently at:
http://bio.sdo-bioschemas-227516.appspot.com
Biological Types
http://bio.sdo-bioschemas-227516.appspot.com/BioChemEntity
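As an illustrative sketch only (the entity name, identifier, and URL below are invented placeholders, and the exact required properties are defined by the relevant Bioschemas profile), a page describing a biological entity could embed JSON-LD such as:
<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "BioChemEntity",
  "name": "Example protein",
  "identifier": "EXAMPLE:0001",
  "url": "https://example.org/entities/EXAMPLE0001"
}
</script>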
Profiles
• Profiles = Refinement & Interoperability Layer
- Because every industry and domain shares
in these specifications…
- Every domain includes its own properties
- So we inherit lots of properties we don’t
care about
Schema.org is messy!
Profiles - Tidying up Schema.org
• For example:
– Dataset inherits from schema.org/CreativeWork
– CreativeWork (and therefore Dataset) contains
properties for:
• Character
• IsFamilyFriendly
• Material (e.g. leather, wool, cotton, paper)
• Genre
• Bioschemas offers an indication of how relevant /
recommended each property is, by grouping into
• Minimum | Recommended | Optional
Profiles
• Profiles = Refinement & Interoperability Layer
- schema.org’s generality means it does not
recommend which ontologies to annotate
with
- The lack of restrictions on cardinality makes it
difficult to parse the data (if you’re not a
huge search engine)
Schema.org is not great for interoperability!
Profiles - Improving interoperability
• Bioschemas profiles include cardinality
restrictions and controlled vocabularies
tailored to our use-cases
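For example (an invented sketch, not an excerpt from an official profile), an event described against a Bioschemas-style profile might always give exactly one start date and draw its keywords from a controlled vocabulary such as EDAM:
<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "Event",
  "name": "Introduction to Bioschemas",
  "startDate": "2019-05-13",
  "endDate": "2019-05-13",
  "location": "Example University",
  "keywords": "http://edamontology.org/topic_0091"
}
</script>
The property names are standard schema.org; the profile’s contribution is the rules about which of them must appear, how many times, and with what kinds of values.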
Profiles and their adoption
Profile Development process
• Determining the schema is a process of empirical surveying and expert opinion.
• We do a cross-walk to find which fields are missing and use this to gauge their marginality.
Profile Development process
[Flowchart] Go through each attribute (row) of the schema and agree on answers to the following questions:
– If we already have it: do we want to keep it? If we don’t have it: do we want to include it?
– Is the description provided okay, or do we want to rewrite it? (Column F)
– Should it be Minimum / Recommended / Optional? (Column G)
– Should there be one or many of them? (Column H)
– Should values be restricted to a controlled vocab? (Column I)
Profile Development process
• Discussions through our public mailing list
Profile Development process
We use GitHub to request new properties,
identify and manage bug fixes, and publicly
present our decision making
Case Study
TeSS:
Training materials,
Events, and Courses
Part 3
ELIXIR All Hands 2018, June 2018, Berlin, Germany
The ELIXIR Training Portal - TeSS
https://tess.elixir-europe.org
TeSS
• A training portal that indexes metadata from across the
web.
• Presents a wide selection of openly available training
resources across the bioinformatics discipline.
• Displays these in a navigable, easy-to-find manner in a
feature-rich environment.
View upcoming events of interest
https://tess.elixir-europe.org/events
Find training materials from around the Web
https://tess.elixir-europe.org/materials
TeSS Features
Search and Filter
• 270+ upcoming events
• 800+ training materials
• Filter with 10+ different facets
Institutional Login
• Login with ELIXIR AAI using your institutional or Google credentials with 1-click sign-on, to:
– Favourite resources
– Add new events & materials
– Create new training workflows
Events
• Stay informed about upcoming events of interest:
– E-mail subscription
– Import into calendar applications
TeSS Features
Link with other registries
• Training events and materials can be linked with resources from other registries:
– Tools & data services from bio.tools
– Databases, standards, & policies from fairsharing.org
Ontological Classification
• The BioPortal Annotator web service predicts topics of resources added to TeSS. These can be approved/rejected easily by our curation group.
Events map
• View filtered events plotted on a map to find the most accessible & relevant events.
Content sourcing
• Rely on community to register resources?
• Community needs to be moderated (to avoid spammers)
• Hard to get critical mass of community involvement
• Rely on curators to enter content?
• Curators need to be paid / incentivized
• Data entry is boring
• A drop in curation/moderation attention can lead to inaccurate,
malicious, or insufficient content
• Instead develop a solution that
• Takes metadata directly from sources
• Adds any resources to TeSS as they appear
• Updates any resources that have changed
How TeSS works
[Architecture diagram] The front end offers a search interface, a workflow viewer, training workflows, and forms for users to enter data. In the back end, an automated aggregator runs custom scrapers that find relevant online training resources and extract metadata from training material and event pages into a metadata catalogue of events, materials, and workflows.
Automatic extraction techniques
• There are several techniques we can use to extract metadata from content provider websites, depending on what’s on the site:
• Interface with an API
– Handy but rare; difficult for websites to implement
– Content aggregators must write a bespoke API client for each
• Structured data already embedded in the page (RSS, ICS)
– Limited amount of data
• HTML scraping
– A fragile technique that can break when there are changes to the website
Trade-off between ease of adoption and usefulness to aggregators
[Chart] Each extraction technique is plotted by how easy it is to implement on a website against how useful it is to an aggregator.
Content provider extraction technique statistics

Technique                 Events   Materials   Total
Schema.org / Bioschemas      9         6         15
HTML                         3         5          8
XML/JSON/YAML/CSV            4         3          7
iCal                         5        --          5
JSON API                    --         2          2
RSS                          1        --          1
Total                                             38
Content aggregation via Bioschemas
[Architecture diagram] The same architecture as above, except that a generic schema.org/Bioschemas scraper replaces one of the custom scrapers in the automated aggregator, extracting embedded markup from training material and event pages into the metadata catalogue.
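Concretely, the schema.org/Bioschemas scraper only needs to find and parse the embedded JSON-LD block in each page. A minimal, hypothetical event page it could read looks like this (the event name, date, and URL are invented):
<html>
<head>
<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "Event",
  "name": "Genome Assembly Workshop",
  "startDate": "2019-06-01",
  "url": "https://example.org/events/genome-assembly"
}
</script>
</head>
<body> ... the human-readable event page ... </body>
</html>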
Tools and
Techniques for
Implementation
Part 4
Technique for adding Bioschemas to a
website
• 1. Identify an appropriate schema (or schemas) for your content type
• 1.a If it doesn’t exist, e-mail the W3C mailing list or add an issue to the GitHub issue tracker
Issue tracker: https://github.com/BioSchemas/specifications
Mailing list: https://www.w3.org/community/bioschemas/
Technique for adding Bioschemas to a
website
• 2. Draw a table with your metadata fields on the left-hand side and the schema.org properties on the right.
• Map the ones that correlate (for example, see the sketch below).
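For instance, a mapping table for an events page might look like this (the left-hand names are hypothetical local field names; the right-hand names are real schema.org Event properties):

Your metadata field      schema.org property
Event title              name
Start date/time          startDate
End date/time            endDate
Venue                    location
Description              description
Event web page           url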
Technique for adding Bioschemas to a
website
• 3a. Use the Bioschemas
generator to create a
JSON-LD snippet that
you can (hopefully)
copy and paste into
your site. (This would
mean creating one for
every new schema.org
record you want to add)
http://www.macs.hw.ac.uk/SWeL/BioschemasGenerator/
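The generator’s output is ordinary JSON-LD to paste into your page inside a script tag. The snippet below is only a sketch of the kind of output you might get for a training material; the field values are placeholders, not real generator output:
<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "CreativeWork",
  "name": "Introduction to the UNIX shell",
  "description": "A hands-on beginners' tutorial",
  "keywords": "command line, shell",
  "author": { "@type": "Person", "name": "A. Trainer" },
  "license": "https://creativecommons.org/licenses/by/4.0/"
}
</script>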
Technique for adding Bioschemas to a
website
• 3b. If you can modify your site, paste in the JSON-LD template of the schema (from 3a) and render your metadata variables as the values for the corresponding keys, using the mapping from step 2 (a sketch follows below).
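A sketch of what such a template might look like in a server-rendered events page; the {{ ... }} placeholders are hypothetical template variables, and the exact syntax depends on your framework:
<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "Event",
  "name": "{{ event.title }}",
  "startDate": "{{ event.start_date }}",
  "endDate": "{{ event.end_date }}",
  "location": "{{ event.venue }}",
  "url": "{{ event.url }}"
}
</script>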
Technique for adding Bioschemas to a
website
• 3c. If your site uses a CMS such as WordPress or Drupal, explore whether there is an appropriate schema.org plugin you can use (or ask on the Bioschemas mailing list)
Tutorials
• Bioschemas Training Portal
– There is a step-by-step tutorial there for adding schema.org to Jekyll pages / GitHub Pages sites.
– Hopefully there will be more to come
https://bioschemas.gitbook.io/training-portal
Tools
• Bioschemas Generator
– Form-based tool to generate valid Bioschemas
JSON-LD
– http://www.macs.hw.ac.uk/SWeL/BioschemasGenerator/
• Validata [under construction]
– Web application for validating bioschemas markup
https://bioschemas.org/software/
Tools
• GoCrawlt
– JSON-LD schema.org extractor
• Buzzbang [on hold]
– Search engine that crawls the web for Bioschemas
JSON-LD
https://bioschemas.org/software/
Freebies from Schema.org
• Google Search Console
– Shows you what schema.org data Google is picking
up from your site, any errors, and advice on how to
fix them
– https://search.google.com/search-console
Freebies from Schema.org
• Google Structured Data Testing Tool
– Extracts the schema.org markup from a given web page or
from a code snippet, validates it, and shows you
what errors there are
– https://search.google.com/structured-data/testing-tool
Freebies from Schema.org ecosystem
• 3rd party plug-ins
– Lots available to help
add schema.org to your
framework
Slide courtesy of Alasdair Gray
Editor's Notes

  1. Collection of schemas can be used to describe online objects
  2. Schema.org very lightweight
  3. Going clockwise from top right – we have international organizations, communities surrounding technologies, national institutions, and other academic institutions. All output training events and/or materials and share via their own websites. Many, many opportunities in many, many locations.
  4. 273 Upcoming events – 7540 collected previously.
  5. 831 Training materials