An Identifier Scheme for the Digitising Scotland Project
Alasdair J G Gray
Department of Computer Science,
Heriot-Watt University,
Edinburgh
@gray_alasdair
www.macs.hw.ac.uk/~ajg33
Özgür Akgün, Uni. of St Andrews
Ahmad Alsadeeqi, Heriot-Watt Uni.
Peter Christen, Australian National Uni.
Tom Dalton, Uni. of St Andrews
Alan Dearle, Uni. of St Andrews
Chris Dibben, Uni. of Edinburgh
Eilidh Garrett, Uni. of Essex
Graham Kirby, Uni. of St Andrews
Alice Reid, Uni. of Cambridge
Lee Williamson, Uni. of Edinburgh
Digitising Scotland Project
Large-scale family reconstruction studies and pedigrees
• Transcription of data
• Linking of data
Performed at scale
• Whole nation
• Large timeframe
1 June 2017 ADRN Conference 2
Project Team
Backgrounds
• Demographers • Historians • Computer Scientists
Distributed team
• St Andrews • Cambridge • Edinburgh • Edinburgh • Australia
Transcribing Scotland’s Vital Records: 1855 – 1974
• 24M records
  • Birth
  • Marriage
  • Death
• 18M individuals
Data Linkage Challenges
Low quality data
Probabilistic matches
Scalability
Skewed name distributions
[Figure: example records to link. John Grant (Fisherman), Fiona Sinclair, Ian Grant (Smithy, born 1861); Stuart Adam (Wheelwright), Morag Scott, Flora Adam (Seamstress, born 1866); married 1884; John Grant (Farmer), Fiona Sinclaire, Iain Grant (born 1860).]
Linking Skye Data
Discussing records
Peter: Eilidh, I’m having problems with the Skye record B-BABY-8293.
Eilidh: Peter, which transcribed certificate is that?
Peter: It is the record for Chris Dibben, born 18 March 1893.
Eilidh: That is the child on record 5457. It should link to the death on record 5754, 4 December 1959.
Peter: Thanks, found it now. It is record D-DEATH-2182.
Existing Identifier Schemes
Historians (example: 5457)
• Incremental integer
• Easily confused with other record types
• Identifies certificate not actors
• Based on order of transcription
• Not derived from data
• Unique for a file
• Excel spreadsheet
Record Linkage (example: B-BABY-8293)
• Encode type of certificate and actor on certificate
• Four digits generated by linkage process
• Different from those used by the historians
• Different for each run of linkage pre-processing
Desiderata for Identifiers
1. Identifier for each actor on a certificate
2. Exchangeable between researchers
3. Unique generation process from the data
4. Immutable to data changes, e.g. typo discovered in data
5. Human derivable from data records
6. Human interpretable
7. Compact to enable efficient computation
8. Susceptible to blocking
9. Globally unique
10. Consistent approach for all record types
11. Compatible with pre-existing NRS approach
12. Compatible with Open Data Standards
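
Desideratum 4 (immutability under data corrections) can be illustrated with a small sketch. It contrasts a hypothetical content-hash identifier with one built only from registration details. The record fields, the entry number "0012" and both function names are assumptions made for illustration; they are not taken from the project's code.

```python
import hashlib


def content_hash_id(record: dict) -> str:
    """Hypothetical content-based identifier: a hash over every transcribed field."""
    canonical = "|".join(f"{key}={record[key]}" for key in sorted(record))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()[:12]


def registration_id(record: dict) -> str:
    """Identifier built only from registration details, never from names or dates of birth."""
    return "_".join([
        f"{record['type']}{record['year']}",
        record["district"],
        record["subdistrict"],
        record["entry"],
        record["role"],
    ])


# Illustrative record (all values invented for this sketch).
before = {
    "type": "B", "year": "1861", "district": "164", "subdistrict": "00",
    "entry": "0012", "role": "baby", "forename": "Ian", "surname": "Grant",
}
# A transcription typo is later corrected: "Ian" becomes "Iain".
after = dict(before, forename="Iain")

print(content_hash_id(before) == content_hash_id(after))  # the hash-based ID changes
print(registration_id(before) == registration_id(after))  # the registration-based ID does not
```

Because the second identifier ignores the error-prone fields, correcting a name leaves it untouched, which is the property desideratum 4 asks for.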
Identifier Scheme
B1903_164_00_baby
typeYear_district_subdistrict_entryNumber_role
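
A minimal sketch of how identifiers following this pattern might be built, parsed and grouped for blocking. The class and function names are hypothetical, and the single-letter type codes "M" and "D" are assumed by analogy with "B". The example on the slide, as transcribed here, shows only four underscore-separated parts, so the entry number "0042" below is invented.

```python
from typing import NamedTuple


class ActorId(NamedTuple):
    cert_type: str     # "B" appears on the slide; "M" and "D" are assumed
    year: int
    district: str
    subdistrict: str
    entry_number: str
    role: str

    def __str__(self) -> str:
        return "_".join([
            f"{self.cert_type}{self.year}",
            self.district,
            self.subdistrict,
            self.entry_number,
            self.role,
        ])


def parse_actor_id(text: str) -> ActorId:
    """Split an identifier back into its five parts."""
    # Roles such as "groom_father" contain an underscore, so split at most 4 times.
    type_year, district, subdistrict, entry_number, role = text.split("_", 4)
    return ActorId(type_year[0], int(type_year[1:]), district,
                   subdistrict, entry_number, role)


def blocking_key(identifier: str) -> str:
    """One possible blocking key: certificate type, year and district."""
    parsed = parse_actor_id(identifier)
    return f"{parsed.cert_type}{parsed.year}_{parsed.district}"


baby = ActorId("B", 1903, "164", "00", "0042", "baby")
print(str(baby))
print(parse_actor_id("M1884_164_00_0007_groom_father").role)
print(blocking_key(str(baby)))
```

Keeping the role as the final component means that a simple prefix of the identifier names the certificate, while the full string names one actor on it.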
Certificate Roles
Birth
• baby
• mother
• father
• registrar
• informant
Marriage
• groom
• groom_father
• groom_mother
• bride
• bride_father
• bride_mother
• witness1
• witness2
• officiant
• registrar
Death
• deceased
• mother
• father
• spouse1 … spouseN
• informant
• doctor
• registrar
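
The role lists above could be used to enumerate every actor identifier on a certificate. The sketch below assumes the typeYear_district_subdistrict_entryNumber_role pattern, assumes "M" and "D" as type codes alongside "B", and uses invented entry numbers; the function names are hypothetical.

```python
# Roles with a fixed count per certificate, as listed on the slide.
FIXED_ROLES = {
    "B": ["baby", "mother", "father", "registrar", "informant"],
    "M": ["groom", "groom_father", "groom_mother",
          "bride", "bride_father", "bride_mother",
          "witness1", "witness2", "officiant", "registrar"],
}


def roles_for(cert_type: str, n_spouses: int = 0) -> list:
    """Return the role names for a certificate type.

    Death certificates carry a variable number of spouses (spouse1 ... spouseN).
    """
    if cert_type == "D":
        spouses = [f"spouse{i}" for i in range(1, n_spouses + 1)]
        return ["deceased", "mother", "father", *spouses,
                "informant", "doctor", "registrar"]
    return list(FIXED_ROLES[cert_type])


def actor_ids(cert_type: str, year: int, district: str, subdistrict: str,
              entry_number: str, n_spouses: int = 0) -> list:
    """Build one identifier per actor on a single certificate."""
    prefix = f"{cert_type}{year}_{district}_{subdistrict}_{entry_number}"
    return [f"{prefix}_{role}" for role in roles_for(cert_type, n_spouses)]


birth_ids = actor_ids("B", 1903, "164", "00", "0042")
death_ids = actor_ids("D", 1959, "164", "00", "0101", n_spouses=2)
print(birth_ids)
print(death_ids)
```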
Conclusions
• Agreed identifier scheme
typeYear_district_subdistrict_entryNumber_role
• Meets desiderata
• Reliant on “clean” parts of certificate
• Compatible with NRS
• Improved team communications
Alasdair Gray
www.macs.hw.ac.uk/~ajg33/
A.J.G.Gray@hw.ac.uk
@gray_alasdair
Acknowledgements:
• Julia Jennings
• Christine Jones
• Diego Ramiro-Farinas


Editor's Notes

  1. Outline: background of the Digitising Scotland project and its aims; need for an identifier scheme; agreed-upon scheme.
  2. Large-scale family reconstruction studies and pedigrees.
  3. Different backgrounds, different expertise, different terminology. We have run a series of workshops to bring us together and understand each other's approaches and terminologies.
  4. Civil registration of births, deaths and marriages in Scotland began on 1 January 1855. All historical vital events records have been converted into digital image format with a supporting index. Modern vital events data (from 1974 onwards) are available electronically. The DS project will digitise the 24 million Scottish vital events record images (births, marriages and deaths) since 1855. This will allow research access to individual-level information on some 18 million individuals, a large proportion of those who have lived in Scotland. Transcription outsourced. Now starting to receive data. Queens Centre for Data Digitisation and Analysis.
  5. Data is of low quality: transcription errors, no occupation standards, etc. Skewed name distributions (a large proportion of people in a village/town have the same name). Scalability (linking 24M records).
  6. Low-quality linkage due to the challenges, in particular skewed name distributions. Need to regularly discuss within the team.
  7. Different teams use their own identifier schemes. There is no relationship between identifier scheme and original record.
  8. Existing approaches are reliant on order of transcription. CS hash-based approaches are reliant on data content: the ID changes if the data changes.
  9. Use the information on the registration book. The registration district is on the book or on the microfiche image. Rathven, Banff has no subdistrict. Need to capture the different roles.