Connect SFTP to S3 in AWS Glue

AWS has a lot of data transfer tools, but none of them can actually transfer from SFTP to S3 out of the box.

Luckily, Glue is very flexible, and it is possible to run a pure Python script there.

Without further ado, here is a basic Python script which can run in Glue (as well as locally): it reads all files in the root of an SFTP server and uploads them into an S3 bucket.

import boto3
import paramiko

s3 = boto3.resource("s3")
bucket = s3.Bucket(name="destination-bucket")
bucket.load()


ssh = paramiko.SSHClient()
# In prod, add explicitly the rsa key of the host instead of using the AutoAddPolicy:
# ssh.get_host_keys().add('example.com', 'ssh-rsa', paramiko.RSAKey(data=decodebytes(b"""AAAAB3NzaC1yc2EAAAABIwAAAQEA0hV...""")))
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())

ssh.connect(
    hostname="sftp.example.com",
    username="thisdataguy",
    password="very secret",
)

sftp = ssh.open_sftp()

for filename in sftp.listdir():
    print(f"Downloading {filename} from SFTP...")
    # mode: SFTP treats all files as binary anyway, so 'b' is ignored.
    with sftp.file(filename, mode="r") as file_obj:
        print(f"Uploading {filename} to S3...")
        bucket.put_object(Body=file_obj, Key=f"destdir/{filename}")
        print(f"All done for {filename}")

sftp.close()
ssh.close()
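The loop above assumes everything in the SFTP root is a regular file; a directory entry would make `sftp.file()` fail. A small sketch of a guard, assuming paramiko's `listdir_attr()` (which returns entries carrying a Unix `st_mode`) — `regular_files` is a hypothetical helper name, not part of any library:

```python
import stat

def regular_files(entries):
    """Keep only regular files from SFTPAttributes-like entries.

    Each entry is expected to expose 'filename' and a Unix 'st_mode',
    as the entries returned by paramiko's SFTPClient.listdir_attr() do.
    """
    return [e.filename for e in entries if stat.S_ISREG(e.st_mode)]

# In the loop above, this would replace sftp.listdir():
# for filename in regular_files(sftp.listdir_attr()):
#     ...
```

This skips directories, symlinks and anything else that is not a plain file, instead of crashing on the first one.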

There is only one thing to take care of: Paramiko is not available in Glue by default, so in the job setup you need to point the Python library path to a paramiko wheel uploaded to S3.
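That job setup can also be scripted. A minimal sketch with boto3, assuming a Python shell job (the script uses no Spark at all) and placeholder names, ARNs and S3 paths throughout; the console's "Python library path" field corresponds to the `--extra-py-files` argument:

```python
import boto3

glue = boto3.client("glue")

# All names, ARNs and S3 locations below are placeholders.
glue.create_job(
    Name="sftp-to-s3",
    Role="arn:aws:iam::123456789012:role/GlueJobRole",
    Command={
        "Name": "pythonshell",  # plain Python, no Spark needed
        "ScriptLocation": "s3://my-glue-scripts/sftp_to_s3.py",
        "PythonVersion": "3",
    },
    DefaultArguments={
        # Equivalent of the "Python library path" field in the console.
        "--extra-py-files": "s3://my-glue-scripts/paramiko.whl",
    },
)
```

This is a configuration fragment: it only succeeds with valid AWS credentials, a real role, and the script and wheel already uploaded to S3.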


7 thoughts on “Connect SFTP to S3 in AWS Glue”

  1. Thanks for this post! How were you able to get paramiko to run in Glue? Some of the libraries it depends on use C extensions, which aren’t supported in Glue. I haven’t been able to get it to import successfully.

  2. I was not able to get this working just by installing paramiko; it asks for dependencies like bcrypt, pycparser, cffi and cryptography.

    Before this I tried to install pysftp. No luck there either. I am trying Glue because my Lambda is timing out due to large file sizes on the SFTP server. pysftp works with Lambda, though, but not in Glue.

    Any help would be appreciated.

  3. I keep getting this error-
    ConnectionClosedError: Connection was closed before we received a valid response from endpoint URL: “https://xxx.s3.amazonaws.com/test.csv”.

  4. I keep getting this error-
    ConnectionClosedError: Connection was closed before we received a valid response from endpoint URL: “https://xxx.s3.amazonaws.com/test.csv”.

    Any pointers?

  5. I have the exact code as this example and keep getting this error:

    ssh.connect(
      File "/home/spark/.local/lib/python3.10/site-packages/paramiko/client.py", line 356, in connect
        to_try = list(self._families_and_addresses(hostname, port))
      File "/home/spark/.local/lib/python3.10/site-packages/paramiko/client.py", line 202, in _families_and_addresses
        addrinfos = socket.getaddrinfo(
      File "/usr/local/lib/python3.10/socket.py", line 955, in getaddrinfo
        for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
    socket.gaierror: [Errno -2] Name or service not known

    Any idea what I’m doing wrong?
