Update MARC importer language mapping table #9344

Merged 2 commits on May 28, 2024

Conversation

hornc
Collaborator

@hornc hornc commented May 26, 2024

Closes #8140

Updates the MARC import mapping table so that 3-character language codes deprecated by the LOC are corrected to their current codes.

This will ensure records imported from older MARCs have the up-to-date codes in Open Library.

Technical

This will not modify language codes on existing records (#8139); it only affects new imports.
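
A minimal sketch of the kind of lookup this change relies on, assuming a plain dict keyed by the code found in the MARC record. The names `LANGUAGE_CODE_FIXES` and `normalize_language_code` are hypothetical, not the importer's actual identifiers; the entries are only the ones visible in this PR's diff excerpts.

```python
# Hypothetical names; only the mapping entries come from the diff excerpts in this PR.
LANGUAGE_CODE_FIXES = {
    'jap': 'jpn',
    'fra': 'fre',
    'gwr': 'ger',
    'it ': 'ita',
    # LOC MARC deprecated code updates
    'cam': 'khm',  # Khmer
    'esp': 'epo',  # Esperanto
}


def normalize_language_code(code: str) -> str:
    """Return the current code for a listed deprecated or misspelled code.

    Codes without an entry pass through unchanged, so only imports that
    carry a listed code are affected.
    """
    return LANGUAGE_CODE_FIXES.get(code, code)


print(normalize_language_code('cam'))  # khm
print(normalize_language_code('eng'))  # eng (no entry, unchanged)
```

Because the lookup runs only when a record is imported, records already in the database keep whatever code they were saved with.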

Testing

Screenshot

Stakeholders

@cdrini
@tfmorris

@hornc hornc added Theme: MARC records Module: Import Issues related to the configuration or use of importbot and other bulk import systems. [managed] labels May 26, 2024
@mekarpeles mekarpeles merged commit 3b5ef43 into internetarchive:master May 28, 2024
4 checks passed
@mekarpeles mekarpeles self-assigned this May 28, 2024
Contributor

@tfmorris tfmorris left a comment

Looks like this got merged without any of the "stakeholders" review, but here are my review comments.

'jap': 'jpn',
'fra': 'fre',
'gwr': 'ger',
Contributor

gwr looks like it could be a typo fix, but cro and sze seem unlikely. There's a pretty big archive of MARC records which have been imported, so it should be possible to see how frequently (if at all) these are used.
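
One way to run that frequency check, sketched under the assumption that the language codes have already been extracted from the archived MARC records into an iterable of strings. The function name `count_suspect_codes` and the sample data are hypothetical and purely illustrative; they are not real import statistics.

```python
from collections import Counter
from typing import Iterable


def count_suspect_codes(codes: Iterable[str], suspects: set[str]) -> Counter:
    """Count how often each suspect code appears in a stream of language codes."""
    return Counter(code for code in codes if code in suspects)


# Made-up sample data for illustration only.
sample = ['eng', 'gwr', 'fre', 'sze', 'eng', 'gwr']
counts = count_suspect_codes(sample, {'gwr', 'cro', 'sze'})
print(counts['gwr'], counts['sze'], counts['cro'])  # 2 1 0
```

A code that never shows up in the archive reports a count of 0, which would support treating its mapping as cruft.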

Collaborator Author

sze looks like it could be an ISO code for the Seze language (https://en.wikipedia.org/wiki/Seze_language), which has nothing to do with slo, and I think chu -> cro has a similar ambiguity. Without comments, I'm not sure what those mappings are protecting against, but to me they look more likely to re-assign non-MARC language codes to unrelated languages. I imagine there were some historical records that those changes worked for, but this code can only protect against systematic, likely codes that we might encounter through regular imports of older catalogs.

Contributor

I guess my natural bias is towards not changing things I don't understand, particularly since the code represents (or should represent) decades of accumulated knowledge, but I'd be hard pressed to argue for preserving such an ancient bit of cruft.

'it ': 'ita',
# LOC MARC Deprecated code updates
'cam': 'khm', # Khmer
'esp': 'epo', # Esperanto
Contributor

Looks like esk (Eskimo) is missing, which is one of the codes that drew complaints.

Collaborator Author

@hornc hornc May 29, 2024

I only added the codes which had a clear one-to-one correct mapping -- your list was super helpful showing the corrected mappings. Many deprecated codes don't have a single obvious mapping, which prevents this kind of automated fix, and there seems to be a range of reasons why a code is deprecated. Some seem to be technical dialect-vs-language factors, like lan -> oci -- that makes me think some cataloged items might lose information: if an item really is in the Languedocien dialect and was cataloged correctly, it would now be listed under a family (Occitan), with a time period, which may or may not relate to the item. It could go the other way though. I don't know enough of the details, but that struck me as potentially quite a difference.

I think your advice on what is needed to correct the ~217 esk codes in https://github.com/internetarchive/openlibrary/issues/8733#issuecomment-1901168076 is still good, and I'm not sure it can be automated (without the risk of mis-assigning codes based on naive assumptions).

Contributor

It might be worth tossing in a comment about esk (and any other unmappable codes) so that people reviewing the list don't think that they were forgotten.
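
A rough sketch of what such a note could look like if it lived next to the mapping table as data instead of a bare comment. The name `UNMAPPED_DEPRECATED_CODES` and the helper are hypothetical; only the esk entry and the issue number come from this thread.

```python
# Hypothetical: deprecated codes deliberately left out of the automatic mapping,
# with the reason recorded so reviewers can see they were not forgotten.
UNMAPPED_DEPRECATED_CODES = {
    'esk': 'Eskimo: no single one-to-one replacement; see issue #8733',
}


def is_deliberately_unmapped(code: str) -> bool:
    """True if the code is deprecated but intentionally has no automatic fix."""
    return code in UNMAPPED_DEPRECATED_CODES
```

Keeping the reason alongside the code means anyone auditing the table can tell a deliberate omission from an oversight.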

Labels
Module: Import Issues related to the configuration or use of importbot and other bulk import systems. [managed] Theme: MARC records
3 participants