AWS has a lot of data transfer tools, but none of them can actually transfer from SFTP to S3 out of the box.
Luckily, Glue is very flexible, and it is possible to run a pure Python script there.
Without further ado, here is a basic Python script which can run in Glue (as well as locally): it reads all files at the root of an SFTP server and uploads them to an S3 bucket.
import boto3
import paramiko

s3 = boto3.resource("s3")
bucket = s3.Bucket(name="destination-bucket")
bucket.load()

ssh = paramiko.SSHClient()
# In prod, explicitly add the RSA key of the host instead of using the AutoAddPolicy:
# (decodebytes comes from the base64 module)
# ssh.get_host_keys().add('example.com', 'ssh-rsa', paramiko.RSAKey(data=decodebytes(b"""AAAAB3NzaC1yc2EAAAABIwAAAQEA0hV...""")))
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())

ssh.connect(
    hostname="sftp.example.com",
    username="thisdataguy",
    password="very secret",
)

sftp = ssh.open_sftp()

for filename in sftp.listdir():
    print(f"Downloading {filename} from sftp...")
    # mode: SFTP treats all files as binary anyway, so 'b' is ignored.
    with sftp.file(filename, mode="r") as file_obj:
        print(f"Uploading {filename} to s3...")
        bucket.put_object(Body=file_obj, Key=f"destdir/{filename}")
        print(f"All done for {filename}")
There is only one thing to take care of: paramiko is not available by default in Glue, so in the job setup you need to point the Python library path to a paramiko wheel uploaded to S3.
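In the console this is the "Python library path" field of the job. If you prefer to create the job programmatically, the same S3 path goes into the --extra-py-files default argument. A minimal sketch with boto3, where the bucket, paths, role and wheel filename are all placeholders to adapt:

import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="sftp-to-s3",
    Role="MyGlueServiceRole",  # IAM role with access to the script, the wheel and the destination bucket
    # A plain Python shell job, not a Spark one
    Command={
        "Name": "pythonshell",
        "ScriptLocation": "s3://my-glue-scripts/sftp_to_s3.py",
        "PythonVersion": "3",
    },
    # Equivalent of the "Python library path" field in the console
    DefaultArguments={
        "--extra-py-files": "s3://my-glue-scripts/paramiko-3.4.0-py3-none-any.whl",
    },
    # Python shell jobs can run on a fraction of a DPU
    MaxCapacity=0.0625,
)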
Thanks for this post! How were you able to get paramiko to run in Glue? Some of the libraries it depends on use C extensions, which aren't supported in Glue. I haven't been able to get it to import successfully.
I do not recall having any issue using it. I uploaded the wheel from https://pypi.org/project/paramiko/#files to S3, and then gave the S3 path as the "Python library path" of this script (which is of type Python shell). It just worked after that.
I am getting this error even after installing paramiko with pip: ModuleNotFoundError: No module named 'paramiko'. May I know what's wrong with this?
I was not able to get this working just by installing paramiko; it asks for dependencies like bcrypt, pycparser, cffi and cryptography.
Before this I tried to install pysftp, with no luck there either. I am trying Glue because my Lambda is timing out due to large file sizes on the SFTP server. pysftp works with Lambda, but not in Glue.
Any help would be appreciated.
I keep getting this error:
ConnectionClosedError: Connection was closed before we received a valid response from endpoint URL: "https://xxx.s3.amazonaws.com/test.csv".
Any pointers?
I have the exact code as this example and keep getting this error:

    ssh.connect(
  File "/home/spark/.local/lib/python3.10/site-packages/paramiko/client.py", line 356, in connect
    to_try = list(self._families_and_addresses(hostname, port))
  File "/home/spark/.local/lib/python3.10/site-packages/paramiko/client.py", line 202, in _families_and_addresses
    addrinfos = socket.getaddrinfo(
  File "/usr/local/lib/python3.10/socket.py", line 955, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known

Any idea what I'm doing wrong?