1

I currently have a solution that requires leadership election for HA. It processes files as they arrive on an SFTP server and subscribes to an ActiveMQ feed, so we don't want multiple running instances trying to act at the same time.

I am currently using Apache ZooKeeper with the Curator library to manage leadership election.

I am now moving the application to AWS, and I'm naively wondering why I can't just use Fargate, with an instance count of 1, to replace running multiple instances with leadership election. That way, if the instance fails, another will be brought up, and there will only ever be one instance running at a time.

Have I missed something here?

10
  • 1
    Do you regard "blindly trusting that Fargate will never actually run 2 copies of a container when the instance count is set to 1" as "missing something"? For the avoidance of doubt, I have no particular knowledge of what Fargate does here, but it's possible it's been engineered to behave differently under some circumstances than you seem to be assuming. Commented Jan 25 at 13:26
  • Well, yeah, if they give you the option to choose the max number of running containers to be 1, why wouldn't you assume that? Commented Jan 25 at 13:45
  • 1
    During deployments, depending on configuration, Fargate might start up the new task and wait for it to be healthy before stopping the old one. You can configure it to stop the old one first though, if that's what's needed. However, you'd also need to make sure your container health checks are rock solid; I've seen (admittedly rare) situations where a task lost network connectivity due to some issue with the underlying host, but because the health check didn't check for that, the task was still considered healthy, and it wasn't replaced.
    – PMah
    Commented Jan 25 at 14:07
  • 1
    @simonalexander2005 because distributed computing is hard, and particularly hard over an unreliable network. And all networks are unreliable. AWS can hide a lot of the complexity, but underneath it all you can't magically solve the Two Generals Problem. Commented Jan 25 at 14:09
  • 1
    Can you explain why you mention not running multiple instances in the first paragraph, but say you are currently running multiple instances with leadership election in the third paragraph? This seems to be a contradiction.
    – Jon Raynor
    Commented Jan 25 at 16:22

2 Answers

3

Using Fargate is saying, "I don't want to deal with the instance." This includes things like instance patching, which is handled by the cloud provider. In some cases that is perfectly fine, and hopefully it meets your current requirements. You will be able to deploy code as needed, but you don't control the instance itself.

If you want to use Fargate, set the minimum count to 0, desired to 1, and maximum to 1. This should keep only one instance available and scale in to zero instances when there is nothing to process, which saves cost. Note that there is some latency when going from 0 to 1 instances.

If the instance becomes unhealthy, ECS could spin up a new task before terminating the unhealthy one. See this post for more details:

https://aws.amazon.com/blogs/containers/a-deep-dive-into-amazon-ecs-task-health-and-task-replacement/
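
Whether old and new tasks overlap during a rolling deployment depends on the service's `minimumHealthyPercent` and `maximumPercent` deployment settings, which are separate from the scaling counts above. Here is a minimal sketch of the arithmetic, assuming the rounding described in the ECS documentation (lower bound rounded up, upper bound rounded down). The function name is made up for illustration, and the 100/200 values are what I understand the usual service defaults to be:

```python
def task_count_bounds(desired_count, minimum_healthy_percent, maximum_percent):
    """Return (min_running, max_running) tasks allowed during a deployment.

    Assumption: the lower bound rounds up and the upper bound rounds down.
    """
    # Ceiling division for the lower bound, floor division for the upper bound.
    lower = -(-desired_count * minimum_healthy_percent // 100)
    upper = desired_count * maximum_percent // 100
    return lower, upper


# Typical defaults: the new task may start while the old one is still running.
print(task_count_bounds(1, 100, 200))  # (1, 2)

# Stop-then-start: never more than one task, but a gap with none running.
print(task_count_bounds(1, 0, 100))  # (0, 1)
```

With 0/100 the old task is stopped before the new one starts, at the cost of a short window with nothing running. This only describes deployments, so I would treat it as reducing overlap rather than as a hard single-instance guarantee.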

Note that if AWS is automatically terminating and starting instances for you, the underlying process should be able to handle this and resume where it left off, so it needs checkpoints and a way to figure out where to pick up if it was terminated. Remember, the cloud provider is doing this for you, and it could happen during processing. There can also be false positives in the health checks.
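
To make the checkpoint idea concrete, here is a minimal sketch, not a definitive implementation. It assumes the work can be split into ordered records, and it uses a local JSON file as a stand-in for durable storage (task-local disk on Fargate is ephemeral, so a real checkpoint would need to live somewhere that survives the task). All names are hypothetical:

```python
import json
import os
import tempfile
from pathlib import Path


def load_checkpoint(path: Path) -> int:
    """Return the index of the next unprocessed record (0 if no checkpoint)."""
    try:
        return json.loads(path.read_text())["next_record"]
    except FileNotFoundError:
        return 0


def save_checkpoint(path: Path, next_record: int) -> None:
    """Write the checkpoint atomically: temp file first, then rename over it."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent)
    with os.fdopen(fd, "w") as tmp:
        json.dump({"next_record": next_record}, tmp)
        tmp.flush()
        os.fsync(tmp.fileno())
    os.replace(tmp_name, path)


def process_records(records, checkpoint_path: Path, handle) -> None:
    """Process records in order, resuming after the last checkpointed one."""
    start = load_checkpoint(checkpoint_path)
    for index in range(start, len(records)):
        handle(records[index])
        # If the task dies between handle() and this line, the record is
        # handled again after restart, so handle() should be idempotent.
        save_checkpoint(checkpoint_path, index + 1)
```

A restarted task calls `process_records` with the same checkpoint path and skips whatever was already recorded as done.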

Another option is using EC2 Spot Instances. Just spin up a Spot Instance and let it do its thing. Some simple monitoring can tell you if it's not healthy and needs to be stopped or terminated. This is a nice solution if you have sporadic or infrequent processing needs. If this is a process that needs to run all the time, a standard EC2 instance could also be used. There are many instance types to choose from. Health monitoring with alerts can be used to stop, terminate, or start instances as needed, so there is no need to keep other servers waiting around. In that case, you control the instance and start and stop it as needed. It could be on a schedule if you know that processing occurs at certain times of the day. If you have specific OS, patching, or software requirements, EC2 can be a good choice because it lets you completely control the instance.

2

I think you should assume that, at some point, there will be more than one instance running and potentially processing data at the same time. For example there might be a situation in which your instance is being replaced with a fresh one (e.g. due to behind-the-scenes patching), and the new instance comes online before the old one is fully offline.
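
One way to tolerate that overlap is to have each instance atomically claim a piece of work before touching it. The following is a sketch under assumptions: it uses exclusive file creation on a shared directory as a stand-in for whatever conditional write your storage offers (in AWS, a DynamoDB conditional put would be one candidate), and the function and parameter names are invented for illustration:

```python
import os
import tempfile
from pathlib import Path


def try_claim(claims_dir: Path, work_id: str, owner: str) -> bool:
    """Atomically claim work_id; return True only for the first claimant."""
    marker = claims_dir / f"{work_id}.claim"
    try:
        # O_CREAT | O_EXCL fails if the marker already exists, so two
        # overlapping instances cannot both succeed for the same work_id.
        fd = os.open(marker, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(owner)
    return True


if __name__ == "__main__":
    claims = Path(tempfile.mkdtemp())
    print(try_claim(claims, "file-001", "old-task"))  # True
    print(try_claim(claims, "file-001", "new-task"))  # False
```

This sketch has no expiry, so a claim held by a task that died stays claimed forever. A real version would need some kind of lease or timeout on top.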

Unless your system can handle this kind of situation and recover gracefully, I would look at including another AWS service that has exactly-once processing guarantees: SQS FIFO queues. It would make sense to remove the ActiveMQ portion (one less thing to manage) and to publish messages to an SQS FIFO queue, which delivers them to your Fargate job with an exactly-once guarantee.

You can scale up the processing of the SFTP files if needed by letting Fargate spin up more instances to read the queue as the queue depth increases, while still ensuring exactly-once delivery.

Yes, that means incorporating more AWS-native services, but if you're willing to move to AWS you might as well take advantage of everything they have to offer. If you ever move off of AWS, you'll need to change things again so it's not really that big of a deal IMHO.

3
  • Thanks, that's interesting. The data I'm using is currently provided on Active MQ by a third party service, so I can't change that; and the rate of messages (could be tens or hundreds per second) makes me think that this approach might not work. SFTP is a different thing - I'm currently running a task that polls the SFTP server every x minutes; but then needs to run for up to half an hour processing the file when it arrives - are there services that are more suited to that, so I'm not paying to run a task doing nothing most of the time? Commented Jan 26 at 8:48
  • @simonalexander2005 "are there services that are more suited to that, so I'm not paying to run a task doing nothing most of the time?" Are you familiar with lambdas?
    – JimmyJames
    Commented Jan 26 at 19:20
  • If your ActiveMQ third-party service can provide exactly-once delivery guarantees, then you should be fine with something like Fargate. The only downside is that AWS won't really know about your ActiveMQ delivery behaviour, so you'll need to pay close attention to how you deal with unfinished work. Familiarize yourself with ECS state change events (docs.aws.amazon.com/AmazonECS/latest/developerguide/…) so that you can take appropriate action when notified that your container is going to be killed mid-work. Commented Jan 27 at 2:13
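
For the "killed mid-work" case in the last comment: as far as I know, ECS stops a task by sending the container SIGTERM and following up with SIGKILL after a stop timeout (30 seconds by default), so the process itself can also react. A minimal sketch, assuming the work can be cut into small items; all names are hypothetical:

```python
import signal
import threading

# Set once a shutdown has been requested; checked between work items.
stop_requested = threading.Event()


def request_stop(signum, frame):
    """Signal handler: only record the request, do no real work here."""
    stop_requested.set()


def install_handlers():
    """Register the handler for SIGTERM. Must be called from the main thread."""
    signal.signal(signal.SIGTERM, request_stop)


def run(work_items, handle):
    """Process items until done or a stop is requested; return what is left."""
    remaining = list(work_items)
    while remaining and not stop_requested.is_set():
        handle(remaining.pop(0))
    return remaining
```

Call `install_handlers()` at startup, then use whatever `run` returns to decide what to checkpoint or hand back before the process exits. Each item needs to finish well inside the stop timeout for this to help.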
