An automated AWS service failing due to a missing IAM permission can have surprising causes. Examining the CloudTrail log that the failed call leaves behind is a handy way to quickly pinpoint missing or incorrect permissions.
We recently found ourselves debugging an IAM permission set in the context of launching EMR clusters.
Launching a cluster requires an IAM role with an extensive set of permissions – it needs to launch instances, possibly create security groups, create SQS queues, and more.
The default role AWS provides covers all of these and much more, including ec2:TerminateInstances and sqs:Delete* on * – take a look at aws emr create-default-roles help for a complete list!
To avoid running automated clusters under such a powerful [and potentially damaging] role alongside the rest of the infrastructure, we initially ran them in a separate AWS account – which brought a number of different permission issues: buckets shared between accounts, object read permissions, and so on.
Eventually we decided to run the EMR clusters in our production account, but first we needed a more restricted IAM role – in particular, we needed to limit permissions by resource. The docs cover which permissions are needed to run instances, and the actual permissions used by a launched EMR cluster can be deduced from its launch parameters; still, problems arise if the resources don’t match exactly how the EMR implementation issues the commands, and the cluster fails to run with a rather spartan permission-denied message.
A solution for debugging the role is CloudTrail: by executing the process and investigating its trace, we can [iteratively] construct such a role.
Our first idea was to set limits with the --tags parameter of the aws(1) emr create-cluster command – since all resources created by EMR are tagged with these tags, it should be possible to give the EMR role permission to create/destroy only resources carrying those tags.
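For illustration, a tagged launch might look like the following – the cluster parameters here are placeholders, only the --tags flag matters for this discussion:
# illustrative values – substitute your own release label, sizes, roles and tags
aws emr create-cluster --name sample-cluster \
    --release-label emr-5.8.0 \
    --instance-type r3.2xlarge --instance-count 3 \
    --service-role emr-role \
    --ec2-attributes InstanceProfile=emrrole-instance-profile \
    --tags EMRTag=EMRValue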
According to the resource-level permissions docs, the tags are passed to every command used to launch the EC2 instances. So we went ahead and granted RunInstances (among others) to the EMR service role, limiting it to resources carrying an arbitrary tag of ours: EMRTag=EMRValue. This would also ensure that we can track every resource created by EMR by filtering on that tag.
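A sketch of the kind of statement this amounts to – not our exact policy, just an illustration relying on the ec2:ResourceTag condition key, with the account ID and region used in the examples below:
{
  "Effect": "Allow",
  "Action": ["ec2:RunInstances", "ec2:TerminateInstances"],
  "Resource": ["arn:aws:ec2:us-east-1:012345678990:instance/*"],
  "Condition": {
    "StringEquals": { "ec2:ResourceTag/EMRTag": "EMRValue" }
  }
}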
However, the clusters wouldn’t launch, dumping the not-too-informative message “the EMR service role hasn’t enough permissions”. Here’s where investigating what’s going on in CloudTrail comes in handy.
CloudTrail uploads the logs to an S3 bucket, and can optionally publish a notification to an SNS topic as well. The create-subscription subcommand can create the bucket for us, set up its policy, and start the logging process in a single command:
aws cloudtrail create-subscription --name SampleTrail --s3-new-bucket sample-bucket
With this done, we can kick off EMR and wait for CloudTrail to upload the logs for the run to the bucket.
Once the logs have arrived in the S3 bucket, we can begin analyzing them. There are many tools for piping, displaying and filtering the logs – largely a matter of taste. In this post we’ll focus on a quite universal one: the Unix command line.
CloudTrail stores the logs in a one-subdirectory-per-day layout, so if you don’t feel like selecting the exact period (and the traffic is not that high) you can download today’s directory using the aws s3 sync command:
aws s3 sync s3://sample-bucket/AWSLogs/012345678990/CloudTrail/us-east-1/2017/09/07/ logs
where logs is a local directory to sync to.
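If the day’s directory is still large, the sync can be narrowed further with --exclude/--include patterns. Since the delivery timestamp is part of each file name, something along these lines should fetch roughly one hour’s worth of logs (the pattern is only an illustration – adjust it to your bucket and date):
# fetch only files delivered between 15:00 and 15:59 UTC on 2017-09-07
aws s3 sync s3://sample-bucket/AWSLogs/012345678990/CloudTrail/us-east-1/2017/09/07/ logs \
    --exclude "*" --include "*_20170907T15*"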
The logs are stored compressed, and since decompressing them can take a lot of space even with moderate traffic, it’s good practice to work with them in their compressed form – this usually means zcat(1), zgrep(1), zless(1), etc., instead of their z-less counterparts [1].
The format for logs is JSON, so jq(1) [2] is an excellent tool for examining them. You can always [z]less(1) a file as a quick reminder of the event format (jq(1) doesn’t work on compressed files so be sure to pipe it through zcat(1) if needed):
zcat 012345678990_CloudTrail_us-east-1_20170831T1510Z_Oiv3a7oQ66XZHfaJ.json.gz | jq . | less
where jq . is a no-op filter that acts as a formatter:
{
"Records": [
{
"eventVersion": "1.05",
"userIdentity": {
"...": "..."
},
"...": "...",
"eventName": "HeadObject",
"...": "...",
"errorCode": "NoSuchKey",
"errorMessage": "The specified key does not exist.",
"requestParameters": {
"...": "..."
}
}
]
}
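Before zooming in on a particular call, it can help to get an overview of everything that failed. A one-liner along these lines (just a sketch, adjust to taste) counts failing calls by event source, event name and error code:
# count failed calls by source, name and error code
zcat * | jq -r '.Records[] | select(.errorCode != null) |
    "\(.eventSource) \(.eventName) \(.errorCode)"' | sort | uniq -c | sort -rn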
We now filter the events we’re looking for – instance creation (RunInstances) by EMR (which has its own user agent) that returned an error code of AccessDenied:
zcat * | jq '.Records[] | select (.eventName == "RunInstances" ) |
select(.userAgent == "elasticmapreduce.aws.internal") |
select(.errorCode == "AccessDenied")'
getting:
{
"eventVersion": "1.05",
"userIdentity": {
"type": "AssumedRole",
"arn": "arn:aws:sts::012345678990:assumed-role/emr-role/CCSSession",
"accountId": "012345678990",
"sessionContext": {
"...": "..."
},
"invokedBy": "elasticmapreduce.aws.internal"
},
"eventSource": "ec2.amazonaws.com",
"eventName": "RunInstances",
"awsRegion": "us-east-1",
"sourceIPAddress": "elasticmapreduce.aws.internal",
"userAgent": "elasticmapreduce.aws.internal",
"requestParameters": {
"instancesSet": {
"items": [
{
"imageId": "ami-01234456",
"minCount": 1,
"maxCount": 1
}
]
},
"groupSet": {
"items": [
{
"groupId": "sg-01456788"
}
]
},
"userData": "<sensitiveDataRemoved>",
"instanceType": "r3.2xlarge",
"blockDeviceMapping": {
"items": [
{
"deviceName": "/dev/sdb",
"virtualName": "ephemeral0"
}
]
},
"availabilityZone": "us-east-1a",
"monitoring": {
"enabled": false
},
"disableApiTermination": false,
"instanceInitiatedShutdownBehavior": "terminate",
"iamInstanceProfile": {
"arn": "arn:aws:iam::012345678990:instance-profile/emrrole-instance-profile"
},
"ebsOptimized": false
}
A look at the event quickly reveals the issue – no tags in sight! To learn what happened with the tags, we go back to zcat * | jq '.Records[]' | less and search for EMRTag (“/EMRTag”), the tag we ran the create-cluster command with. We then find a CreateTags event:
{
"...": "...",
"eventName": "CreateTags",
"sourceIPAddress": "elasticmapreduce.aws.internal",
"userAgent": "elasticmapreduce.aws.internal",
"requestParameters": {
"resourcesSet": {
"items": [
{
"resourceId": "i-0bf5979193b89e6d8"
}
]
},
"tagSet": {
"items": [
{
"key": "aws:elasticmapreduce:job-flow-id",
"value": "j-1GFIHT09EFRK0"
},
{
"key": "EMRTag",
"value": "EMRValue"
},
"..."
]
}
}
}
and we conclude that tags aren’t created on instance creation, but later in a separate event – which dooms our idea of limiting the RunInstances API call by tag.
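Incidentally, instead of paging through less, a filter along these lines would surface those CreateTags events directly (assuming the same EMRTag key):
# show the CreateTags events carrying our tag
zcat * | jq '.Records[] | select(.eventName == "CreateTags") |
    select(.requestParameters.tagSet.items[]?.key == "EMRTag")'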
However, the event information gives us some ideas on how to limit the permissions. We decided on limiting by security group (RunInstances needs permissions on the security groups it’s going to put the instances in), since these groups are used exclusively by EMR – so it can launch as many instances as it needs, as long as it places them in those groups.
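A sketch of what such a statement could look like, reusing the security group ID from the event above – note that RunInstances also touches other resource types (instances, subnets, network interfaces, volumes, images, ...), which are deliberately left wide open here for brevity:
{
  "Effect": "Allow",
  "Action": ["ec2:RunInstances"],
  "Resource": [
    "arn:aws:ec2:us-east-1:012345678990:security-group/sg-01456788",
    "arn:aws:ec2:us-east-1:012345678990:instance/*",
    "arn:aws:ec2:us-east-1:012345678990:subnet/*",
    "arn:aws:ec2:us-east-1:012345678990:network-interface/*",
    "arn:aws:ec2:us-east-1:012345678990:volume/*",
    "arn:aws:ec2:us-east-1:012345678990:key-pair/*",
    "arn:aws:ec2:us-east-1::image/ami-*"
  ]
}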
As another example, after a while we found out that although EMR wasn’t erroring out on launch anymore, each run generated some failed CreateQueue SQS events. We quickly searched for the matching events:
zcat * | jq '.Records[] | select(.eventName == "CreateQueue") |
select(.errorCode == "AccessDenied")' | less
getting:
{
"userIdentity": {
"...": "..."
},
"eventSource": "sqs.amazonaws.com",
"eventName": "CreateQueue",
"errorCode": "AccessDenied",
"errorMessage": "User: arn:aws:sts::012345678990:assumed-role/EMRRole/ElasticMapReduceSession is not authorized to perform: sqs:createqueue on resource: arn:aws:sqs:us-east-1:012345678990:AWS-ElasticMapReduce-j-16QCFP740K4YR",
"...": "..."
}
Those are queues created and dropped by EMR when the cluster is launched with --enable-debugging – although the failure isn’t fatal. Again, the event gives us an idea of how to grant the proper permissions – sqs:* on any queue named AWS-ElasticMapReduce-*, reserving those for EMR:
{
"Action": ["sqs:*"],
"Effect": "Allow",
"Resource": ["arn:aws:sqs:us-east-1:012345678990:AWS-ElasticMapReduce-*"]
}
We’ve thus managed to reduce a wide-open permission (not limited by resource) to a permission over a specific set of resources (limited by SQS queue name prefix).
After the change, we can search for the successful events (careful with the filter ordering in this one: the queueName filter errors out on events that carry no requestParameters.queueName, so we narrow down by event name first):
zcat * | jq '.Records[] | select(.eventName == "CreateQueue") |
select(.errorCode == "AccessDenied" | not) |
select(.requestParameters.queueName | contains("AWS-ElasticMapReduce"))' |
less
and confirm we got it right:
{
"userIdentity": {
"...": "..."
},
"eventSource": "sqs.amazonaws.com",
"eventName": "CreateQueue",
"requestParameters": {
"queueName": "AWS-ElasticMapReduce-j-1YCVQDMLXT42Y",
"attribute": {
"ReceiveMessageWaitTimeSeconds": "10",
"DelaySeconds": "0",
"MessageRetentionPeriod": "86400",
"MaximumMessageSize": "262144",
"VisibilityTimeout": "30"
}
},
"responseElements": {
"queueUrl": "https://sqs.us-east-1.amazonaws.com/897883143566/AWS-ElasticMapReduce-j-1YCVQDMLXT42Y"
},
"...": "..."
}
This raises the question – given a role, is it possible to restrict its permissions as much as possible in a (mostly) automated way?
We’ve been pondering the idea of a tool that could launch a process and watch for its effects on CloudTrail and Access Advisor – even if the process isn’t fully automated, it could provide a listing of the exact resources accessed, making it easier for the operator to extract the minimal set of permissions from it.
A tool that comes close to this is Netflix’s Aardvark, which works as an aggregator and front-end for Access Advisor.
In the event of an error due to a missing IAM permission, we can find all the information needed for debugging in the CloudTrail log the failed event leaves behind.
Since high AWS traffic can lead to large logs in a short amount of time, we want tools for filtering out the information we’re looking for. In this post we’ve demonstrated how to filter through the JSON logs by combining two CLI tools (aws(1) and jq(1)), but any log aggregator with filtering capabilities can do the job as well.
[1] A notable exception is ack(1), which can’t work on compressed files.
[2] https://stedolan.github.io/jq/