Connect SFTP to S3 in AWS Glue

AWS has a lot of data transfer tools, but none of them can actually transfer from SFTP to S3 out of the box.

Luckily, Glue is very flexible, and it is possible to run a pure Python script there.

Without further ado, here is a basic Python script that can run in Glue (as well as locally). It reads all files in the root of an SFTP server and uploads them to an S3 bucket.

import boto3
import paramiko

s3 = boto3.resource("s3")
bucket = s3.Bucket(name="destination-bucket")
bucket.load()


ssh = paramiko.SSHClient()
# In prod, explicitly add the host's RSA key instead of using AutoAddPolicy
# (this needs `from base64 import decodebytes`):
# ssh.get_host_keys().add('sftp.example.com', 'ssh-rsa', paramiko.RSAKey(data=decodebytes(b"""AAAAB3NzaC1yc2EAAAABIwAAAQEA0hV...""")))
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())

ssh.connect(
    hostname="sftp.example.com",
    username="thisdataguy",
    password="very secret",
)

sftp = ssh.open_sftp()

for filename in sftp.listdir():
    print(f"Downloading {filename} from sftp...")
    # mode: ssh treats all files as binary anyway, so 'b' is ignored.
    with sftp.file(filename, mode="r") as file_obj:
        print(f"Uploading {filename} to s3...")
        bucket.put_object(Body=file_obj, Key=f"destdir/{filename}")
        print(f"All done for {filename}")

# Release the SFTP session and the underlying SSH connection.
sftp.close()
ssh.close()
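
One optional refinement: `listdir()` returns every entry in the directory, subdirectories included, and the loop above would most likely fail when it tries to open one of those as a file. Here is a minimal sketch of a filter, assuming the entries come from paramiko's `listdir_attr()` (which returns objects with `filename` and `st_mode`). The helper name `regular_file_names` and the stand-in entries are mine, for illustration only.

```python
import stat
from types import SimpleNamespace


def regular_file_names(entries):
    """Return the names of the entries that are regular files.

    `entries` is expected to look like the result of paramiko's
    SFTPClient.listdir_attr(): objects with `filename` and `st_mode`.
    """
    return [
        entry.filename
        for entry in entries
        # st_mode can be None if the server did not send permissions.
        if entry.st_mode is not None and stat.S_ISREG(entry.st_mode)
    ]


# Stand-in entries (hypothetical) so the sketch runs without an SFTP server.
fake_entries = [
    SimpleNamespace(filename="data.csv", st_mode=stat.S_IFREG | 0o644),
    SimpleNamespace(filename="archive", st_mode=stat.S_IFDIR | 0o755),
]
print(regular_file_names(fake_entries))
```

In the script, the loop header would then become `for filename in regular_file_names(sftp.listdir_attr()):`.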

There is only one thing to take care of: Paramiko is not available by default in Glue, so in the job setup you need to point the Python library path to a wheel of Paramiko that you have uploaded to S3.
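
The same setting can be scripted. In the Glue API, the Python library path corresponds to the `--extra-py-files` job parameter, so a job definition might look like the sketch below. The job name, role ARN and S3 paths are placeholders I made up. The boto3 call is kept in a separate function so the definition can be inspected without AWS credentials.

```python
def python_shell_job_definition(name, role_arn, script_uri, wheel_uri):
    """Build the keyword arguments for the Glue create_job call."""
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "pythonshell",  # a plain Python job, not Spark
            "ScriptLocation": script_uri,
            "PythonVersion": "3",
        },
        # Equivalent of the "Python library path" field in the console.
        "DefaultArguments": {"--extra-py-files": wheel_uri},
    }


def create_job(definition):
    """Send the definition to Glue. Needs boto3 and valid credentials."""
    import boto3  # imported here so the rest of the sketch runs offline

    return boto3.client("glue").create_job(**definition)


# All values below are hypothetical placeholders.
job = python_shell_job_definition(
    name="sftp-to-s3",
    role_arn="arn:aws:iam::123456789012:role/GlueSftpRole",
    script_uri="s3://my-glue-assets/scripts/sftp_to_s3.py",
    wheel_uri="s3://my-glue-assets/libs/paramiko.whl",
)
print(job["DefaultArguments"])
```

Calling `create_job(job)` would then register the job, assuming the role and the S3 objects exist.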

Get started with AWS and Python

When you start with AWS for the first (or even second) time, it is a bit tricky to get your head around all the nuts and bolts that need to be connected together. If on top of this you try to work with AWS in Beijing from outside China, the web GUI makes your work even harder because of slowness or even timeouts.

This script sets up a full set of resources for you (VPC, route table, security group, subnet, internet gateway, instance, with the relevant associations and attachments) for easy testing or bootstrapping of your infrastructure.

It is mostly meant as a testing aid, so it does not handle every possible option, but I find it invaluable to get started. You just need the AWS basics:

and it will do the rest for you. You need to provide a tag name (defaults to ‘roles’) and value, and all resources will be created and located via this tag, to allow for easy spawning and tearing down.
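
To make the tagging idea concrete, here is a minimal sketch of how resources can be marked at creation time and found again later with the EC2 API. It is not the script's actual code: the helper names and the fake client are my own, and only the `TagSpecifications` and `Filters` shapes come from the EC2 API itself.

```python
def tag_specification(resource_type, tag, role):
    """TagSpecifications entry that marks a resource when it is created."""
    return {"ResourceType": resource_type, "Tags": [{"Key": tag, "Value": role}]}


def tag_filter(tag, role):
    """Filters value that finds resources carrying the tag again."""
    return [{"Name": f"tag:{tag}", "Values": [role]}]


def find_vpc_ids(ec2_client, tag, role):
    """Return the ids of all VPCs marked with the given tag name and value."""
    response = ec2_client.describe_vpcs(Filters=tag_filter(tag, role))
    return [vpc["VpcId"] for vpc in response["Vpcs"]]


class FakeEc2Client:
    """Stand-in for boto3.client("ec2"), so the sketch runs offline."""

    def describe_vpcs(self, Filters):
        # Pretend exactly one VPC carries the requested tag.
        return {"Vpcs": [{"VpcId": "vpc-0123456789abcdef0"}]}


print(tag_specification("vpc", "roles", "testing"))
print(find_vpc_ids(FakeEc2Client(), "roles", "testing"))
```

With a real client, the specification would be passed as `TagSpecifications=[tag_specification("vpc", tag, role)]` to `create_vpc`, and tearing down becomes a matter of looking up each resource type by the same filter.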

usage: fullspawn.py3 [-h] [--tag TAG] [--up | --down] [--wet | --dry]
                     [--log {DEBUG,INFO,WARNING,ERROR,CRITICAL}] [--cidr CIDR]
                     [--ami AMI] [--keypair KEYPAIR] [--profile PROFILE]
                     [--instance INSTANCE]
                     role

Spawns a full AWS self-contained infrastructure.

positional arguments:
  role                  Tag value used for marking and fetching resources.

optional arguments:
  -h, --help            show this help message and exit
  --tag TAG, -t TAG     Tag name used for marking and fetching resources.
                        (default: roles)
  --up, -u              Creates a full infra. (default: up)
  --down, -d            Destroys a full infra. (default: up)
  --wet, -w             Actually performs the action. (default: dry)
  --dry                 Only shows what would be done, not doing anything.
                        (default: dry)
  --log {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        Verbosity level. (default: WARNING)
  --cidr CIDR           The network range for the VPC, in CIDR notation. For
                        example, 10.0.0.0/16 (default: 10.0.42.0/28)
  --ami AMI             The AMI id for your instance. (default: ami-33734044)
  --keypair KEYPAIR     A keypair aws knows about. (default: yourkey)
  --profile PROFILE     Profile to use for credentials. Will use AWS_PROFILE
                        environment variable if set. (default: default)
  --instance INSTANCE   Instance type. (default: t2.micro)
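
For the curious, the paired switches (`--up | --down`, `--wet | --dry`) are what argparse calls mutually exclusive groups. The sketch below shows one way such a parser could be built. It is a guess at the structure rather than the script's real code, it covers only a few of the options, and the `action` and `mode` attribute names are my own.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(
        prog="fullspawn.py3",
        description="Spawns a full AWS self-contained infrastructure.",
        # Appends "(default: ...)" to every help string.
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    parser.add_argument(
        "role", help="Tag value used for marking and fetching resources."
    )
    parser.add_argument(
        "--tag", "-t", default="roles",
        help="Tag name used for marking and fetching resources.",
    )

    # Both switches write to the same attribute; only one may be given.
    direction = parser.add_mutually_exclusive_group()
    direction.add_argument(
        "--up", "-u", dest="action", action="store_const", const="up",
        default="up", help="Creates a full infra.",
    )
    direction.add_argument(
        "--down", "-d", dest="action", action="store_const", const="down",
        default="up", help="Destroys a full infra.",
    )

    # Dry run unless --wet is passed explicitly.
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument(
        "--wet", "-w", dest="mode", action="store_const", const="wet",
        default="dry", help="Actually performs the action.",
    )
    mode.add_argument(
        "--dry", dest="mode", action="store_const", const="dry",
        default="dry", help="Only shows what would be done, not doing anything.",
    )
    return parser


args = build_parser().parse_args(["-t", "tag", "--down", "--wet", "testing"])
print(args.action, args.mode, args.tag, args.role)  # down wet tag testing
```

Defaulting to a dry run means a destructive action always has to be requested explicitly.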

For instance:

# Let's see what would happen when creating a full infra...
./fullspawn.py3 -t tag --up --dry testing
# Looks good, let's do it.
./fullspawn.py3 -t tag --up --wet testing
# oops, this was a stupid tag name
./fullspawn.py3 -t tag --down --wet testing

You probably want to have a look at some variables inside the script, which set a few defaults that might not be relevant for you. I am thinking of the AMI (AMI), the keypair (KEYPAIR) and the ingress rules (INGRESS), all defined before the argparse calls.

The code is available on GitHub.

Enjoy!