Hello, intelligent people, noob here!

I am planning on building a multi-user system for photo/object storage, and I was planning on using S3. I have the whole front end planned out, but I have a question about the bucket system.

Should I have one bucket holding every user's data, 4-5 buckets with the users distributed across them, or one bucket for each user?

Each user will be storing about 35 GB on average, and I want this to run smoothly with as few as 3 users and as many as 300,000,000 in the future (so as scalable as possible).

Which method should I choose, and what did Dropbox do during their S3 days?

  • As someone who has built a system that sounds similar to this before, I've got a few suggestions. First, use GUIDs, not filenames. You don't want random Chinese characters showing up in your S3 keys; GUIDs are far easier to manage in the long run, easier to hack together scripts with, and will save you heartache. Secondly, folder your GUIDs, say as s3://users-files.s3.amazonaws.com/12/345/123-456-789-abc-def (see the sketch after these comments). You will at some point need to dive into the S3 console to do some debugging, and having millions of files in one folder makes the console unusable.
    – KHobbits
    Commented Feb 4, 2017 at 4:01
  • 1
    Oh of course. It's also insecure and somewhat of a privacy violation to leave file names or the files themselves in clear text. Thanks for the bucket info though! Commented Feb 4, 2017 at 14:44
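
A minimal sketch of the foldered-GUID layout the first comment describes; the prefix depths, the path layout, and the function name are illustrative, not a fixed convention:

    import uuid

    def make_object_key(user_id: str) -> str:
        """Build an opaque, foldered S3 key for one uploaded object."""
        # A GUID keeps original filenames (and any non-ASCII characters)
        # out of the key entirely.
        guid = uuid.uuid4().hex
        # Short prefixes act as "folders" so no single folder in the S3
        # console ends up holding millions of objects.
        return f"users/{user_id}/{guid[:2]}/{guid[2:5]}/{guid}"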

1 Answer


You definitely do not need a bucket for each user. Never mind that it seems very unlikely AWS support would approve a request to raise your account's default bucket limit from 100 to 300,000,000; bucket creation is also not intended to be done aggressively or in real time.

The high-availability engineering of Amazon S3 is focused on get, put, list, and delete operations. Because bucket operations work against a centralized, global resource space, it is not appropriate to create or delete buckets on the high-availability code path of your application. It is better to create or delete buckets in a separate initialization or setup routine that you run less often.

http://docs.aws.amazon.com/AmazonS3/latest/dev/BucketRestrictions.html

Design your application so that it doesn't matter whether you use one bucket or several. How? For each user, store the bucket_id where that user's data is stored. Start with everybody in bucket_id 1; later you have the flexibility to put new users in new buckets if that becomes necessary... or to migrate some users to different buckets... or to situate a user's storage in a bucket nearer to the user's typical location.
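
A rough sketch of that indirection, assuming a simple user-record lookup; the dictionaries stand in for whatever datastore you use, and the bucket and user names are hypothetical:

    import boto3

    BUCKET_NAMES = {1: "myapp-user-data-1"}     # bucket_id -> bucket name
    USER_BUCKET_ID = {"alice": 1, "bob": 1}     # user -> bucket_id

    def bucket_for_user(user_id: str) -> str:
        """Resolve which bucket holds this user's objects."""
        return BUCKET_NAMES[USER_BUCKET_ID[user_id]]

    def upload(user_id: str, key: str, body: bytes) -> None:
        # Every S3 call goes through the lookup, so moving a user to a
        # new bucket later only means updating their bucket_id (and
        # copying their existing objects).
        s3 = boto3.client("s3")
        s3.put_object(Bucket=bucket_for_user(user_id), Key=key, Body=body)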

S3 will automatically scale its capacity to meet the demands of your traffic. You can make that process easier by designing your object paths so that keys are assigned nonsequentially near the left-hand side of the key.

S3 scales its capacity by splitting index partitions, so, for example, giving each object a path that begins with the date of the upload would be a really bad idea, because your bucket index develops a hot spot with heavy uploads in a small part of the keyspace.

See http://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html
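
One way to get that nonsequential left-hand prefix, sketched under the assumption that you already generate a GUID per object (the prefix length here is illustrative):

    import hashlib

    def partition_friendly_key(user_id: str, guid: str) -> str:
        """Prepend a short hash so writes spread across the key space."""
        # A few hex characters derived from the GUID put effectively
        # random data at the left edge of the key, instead of a date or
        # sequential id that would concentrate uploads in one index
        # partition.
        prefix = hashlib.md5(guid.encode()).hexdigest()[:4]
        return f"{prefix}/{user_id}/{guid}"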

For the same reason, don't give your buckets lexically sequential names within a region.


What Dropbox may have been doing is probably not relevant.
