
I'm not sure how common a scenario this is, but what are some approaches to creating associated entities before you have the actual main entity they're supposed to be linked to? (Basically, having separate endpoints for creating the associated and main entities, with the associated entities having to be created first, at least from a UI standpoint.)

Take uploading a bunch of documents related to a medical record before the record is saved - where should the documents reside, assuming they're uploaded to some "structured" cloud storage service like AWS S3?

I was considering uploading to a temporary storage directory and then moving the documents to their appropriate location once the record is saved, until I realized S3 has no efficient bulk object move method. Furthermore, this also introduces the problem of identifying 'ghost' documents - documents that were uploaded but never associated with a record.

Another solution would be to save both the documents and the record at the same time, which simplifies things for the API; however, this leads to performance issues if the client is browser-based and uploading a bunch of really large files.

Any other alternatives I'm missing here?

  • is there some reason you can't create the main resource first? just move any data that you don't have to some sub record? hold the upload until the last page of the wizard? etc?
    – Ewan
    Commented Jan 8 at 21:51

2 Answers


Perhaps you're documenting a visit to the radiologist, so in addition to pulse rate and blood pressure we're recording lots of giant X-ray images in lots of S3 image files.

S3 has no efficient bulk object move method.

Correct.

So upon upload, place the image file in its final resting place from the get-go. It might be deleted, but it will never move.

Roll a new GUID for each uploaded file, determine today's date, and assign an S3 key like:

    import uuid
    from datetime import datetime, timezone

    today = datetime.now(timezone.utc)
    ymd = f"{today:%Y/%m/%d}"
    path = f"my_bucket/image/{ymd}/{uuid.uuid4()}"

Remember those pathnames, storing them in the electronic medical record. When that record is ready, store it as f"my_bucket/emr/{ymd}/{guid2}".


At some point your document retention policy will say that you've kept a file for enough years and it's time to discard it. The date-oriented pathnames will help with that. (Also, they play nicely with pagination of the AWS web console.)
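To make that concrete, here is a minimal sketch of how the date-oriented pathnames could drive retention, assuming a policy expressed in days and keys of the form `my_bucket/image/YYYY/MM/DD/...` (the function name and `root` parameter are illustrative, not from the original answer). Each yielded prefix could then be fed to a list-and-bulk-delete pass:

```python
from datetime import date, timedelta

def expired_prefixes(first_day: date, today: date, retention_days: int,
                     root: str = "my_bucket/image"):
    """Yield one date-based S3 prefix per day that has aged out of retention."""
    cutoff = today - timedelta(days=retention_days)
    day = first_day
    while day < cutoff:
        yield f"{root}/{day:%Y/%m/%d}/"
        day += timedelta(days=1)
```

Note that S3 lifecycle rules can often express this directly; the point here is just that a timestamped key layout makes "everything older than N days" a cheap prefix computation.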


You might want a midnight cron job which reads all recent EMR references to the day's image files, compares that with an S3 directory listing, and does something with "orphan" or "ghost" image files that have a zero ref-count.
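The core of that cron job is a set difference; a minimal sketch, assuming you have already listed the day's keys from S3 (e.g. via `list_objects_v2` under the date prefix) and fetched the keys referenced by EMR records from your DB (both inputs are assumed here, not shown):

```python
def find_ghosts(s3_keys, referenced_keys):
    """Return keys present in S3 but referenced by no EMR record (ref-count zero).

    s3_keys: keys listed under today's date prefix in S3.
    referenced_keys: keys stored in EMR records for the same day.
    """
    return sorted(set(s3_keys) - set(referenced_keys))
```

What to do with the result (delete immediately, quarantine, or alert) depends on how long an upload can legitimately sit unreferenced before its record is saved.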

  • That's an interesting approach, and it's making me reconsider using owner-oriented paths (my_bucket/emr/{patient_id}/{guid} vs the solution you described), since I'm not using prefix-based access policies anyway. Thanks for your suggestion. Commented Jan 8 at 23:16
  • Sooo, throwing a {patient_id} into the mix might be a Good Thing, if you have it handy. Dealing with a million opaque GUIDs in one directory isn't the most fun in the world. But if that patient ID only becomes available after an X-ray image is uploaded, then hey, that's life, stick with just the GUID. // In any event, you will always be faced with data management issues like "what was stored recently?" and "what is so ancient that we can discard it?" Since S3 has very poor tools for answering such questions, you kind of need a timestamp somewhere in the pathname.
    – J_H
    Commented Jan 8 at 23:28
  • @string_loginUsername, I commented about the patient Id on DavidT's answer, but make sure it is not something like a medical record number (in the USA). It should be some sort of surrogate key so you aren't putting PII in an S3 key. Commented Jan 10 at 0:18

I agree with J_H's answer, but there are several additional points I would like to make:

It is implied that you will be storing metadata about the upload in another data store. Someone can always typo an attribute, so encoding those attributes in the S3 path is problematic: you can't easily change an object's key after the fact.

Encoding the date into the S3 path allows you to provide some additional audit/monitoring functions. For example, at the end of the day you can confirm that the number of objects in S3 matches the number of metadata records in your DB (this can also be done intra-day, but a fudge factor will be needed).

S3 was historically eventually consistent (it has offered strong read-after-write consistency since December 2020), but downstream processing can still fail transiently, so a common strategy is to publish an event to a queue after uploading data to S3 so that any processing required after the upload can be retried a couple of times to ensure it succeeds.
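Whatever queueing mechanism you use, the consumer side usually boils down to a retry-with-backoff wrapper around the post-upload processing step. A minimal sketch (the helper name and parameters are illustrative; in production you would scope the caught exception types and log each failure):

```python
import time

def with_retries(fn, attempts=3, base_delay=0.5):
    """Call fn(), retrying with exponential backoff; re-raise after the last attempt."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)
```

A real queue consumer (SQS, etc.) typically gets this behavior from the queue's redelivery and dead-letter settings instead of hand-rolled loops, but the shape of the logic is the same.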

You may want to move the patient_id to the end of the path i.e.:

my_bucket/image/{ymd}/{document_guid}/{patient_id}

This is a balancing of two concerns:

  • It is possible that someone uploads a file for the wrong patient - it's still possible to search for/find the file if you have the guid (and upload date).
  • In the worst case, if some data corruption happens in your primary DB, you have some chance of being able to match up files to patients (you would need to reprocess all the date prefixes anyway as part of the recovery process).
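A small sketch of building and parsing keys in that shape. Assumptions: `patient_id` is a surrogate key (per the comments below, not an MRN or other PII), and the helper names are illustrative:

```python
import uuid
from datetime import date

def build_key(day: date, patient_id: str, bucket: str = "my_bucket") -> str:
    """Key shaped as bucket/image/YYYY/MM/DD/{guid}/{patient_id}."""
    return f"{bucket}/image/{day:%Y/%m/%d}/{uuid.uuid4()}/{patient_id}"

def parse_key(key: str) -> dict:
    """Recover the date, GUID, and patient id from a key in that shape."""
    bucket, kind, yyyy, mm, dd, guid, patient_id = key.split("/")
    return {"date": f"{yyyy}-{mm}-{dd}", "guid": guid, "patient_id": patient_id}
```

Because the patient id is the last path segment, a recovery pass can walk the date prefixes and rebuild the file-to-patient mapping without consulting the primary DB.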
  • I think the inclusion of the patient Id is fine as long as it is a surrogate key of some sort, rather than a Medical Record Number that gets included in documentation. Depending on applicable laws, an MRN might be deemed personally identifiable information, which might not be good to include in an S3 key. Just something to be aware of, that's all. Commented Jan 10 at 0:15
