aboutsummaryrefslogtreecommitdiff
diff options
context:
space:
mode:
authorArun Isaac2025-01-04 01:34:35 +0000
committerArun Isaac2025-01-04 01:39:23 +0000
commit1718ff1bf4611e05d2fb952240369e27fe1504bd (patch)
treee02862b8e3542547a022e7c6fe9c5f3f37010d46
parent22d1c69a4921d01e19c178cbfef8d361f5f731f6 (diff)
downloadglobus-weblinks-1718ff1bf4611e05d2fb952240369e27fe1504bd.tar.gz
globus-weblinks-1718ff1bf4611e05d2fb952240369e27fe1504bd.tar.lz
globus-weblinks-1718ff1bf4611e05d2fb952240369e27fe1504bd.zip
Implement downloader in python too.
Passing cookies to wget on the command line is a security risk. On a shared machine, other users can see your full command line. Passing them in using the --load-cookies option is too tedious—the file format required is archaic and hard to replicate by hand. So, we simply implement the downloader in python too. In any case, this makes for a more cohesive user experience.
-rw-r--r--README.md17
-rwxr-xr-xglobus-weblinks28
-rw-r--r--manifest.scm2
3 files changed, 35 insertions, 12 deletions
diff --git a/README.md b/README.md
index 189b83d..0b0cbfa 100644
--- a/README.md
+++ b/README.md
@@ -2,7 +2,7 @@ This python script is a quick hack to download data from Globus via HTTPS and wi
# Dependencies
-[globus-sdk](https://pypi.org/project/globus-sdk/) is the only dependency. The easiest way is to use GNU Guix. You will need the [guix-bioinformatics channel](https://git.genenetwork.org/guix-bioinformatics/about/).
+[globus-sdk](https://pypi.org/project/globus-sdk/) and requests are the only dependencies. The easiest way to obtain them is to use GNU Guix. You will need the [guix-bioinformatics channel](https://git.genenetwork.org/guix-bioinformatics/about/).
```
guix shell -m manifest.scm
```
@@ -11,18 +11,19 @@ guix shell -m manifest.scm
Log in to the Globus web app, go to the `Collections` page, and find the collection you are interested in. When you click on it, you will be taken to an `Overview` page which will show the `UUID` of the collection. That is the endpoint ID.
-# Authorize app and get HTTPS links for all files in your collection
+# Extract cookies from browser
-Run the globus-weblinks script passing in your endpoint ID. The script will prompt you for authorization. Once authorized, it will print out HTTPS links to all your files. Write the links to a file.
+You will need cookies to authenticate the HTTPS download. You need to extract these cookies from a browser session. This is a somewhat cumbersome process. Here's one way to do it. In the Globus web app, download any file from your collection whilst inspecting network traffic. Copy the HTTPS request for the file by right clicking it and selecting "Copy as cURL". One of the parameters in the copied curl command should be the cookie header we need. Put it in a file `cookies.json` like so:
```
-./globus-weblinks YOUR-ENDPOINT-ID > weblinks
+{
+ "mod_globus_OIDC": "aloooooooooooongrandomcookiestring"
+}
```
-# Download your files using wget
+# Download your files
-You can now download your files using `wget`. But first, you will need cookies to authenticate the download. We need to extract these cookies from a browser session. This is a somewhat cumbersome process. Here's one way to do it. In the Globus web app, download any file from your collection whilst inspecting network traffic. Copy the HTTPS request for the file by right clicking it and selecting "Copy as cURL". One of the parameters in the copied curl command should be the cookie header we need. Use it with wget like so.
+Now, all that remains is to download your files. Do it like so:
```
-wget --header 'Cookie: mod_globus_OIDC=aloooooooooooongrandomcookiestring' -i weblinks
+./globus-weblinks YOUR-ENDPOINT-ID cookies.json
```
-
Enjoy!
diff --git a/globus-weblinks b/globus-weblinks
index 74fd6f0..92f598d 100755
--- a/globus-weblinks
+++ b/globus-weblinks
@@ -1,9 +1,12 @@
#! /usr/bin/env python3
import argparse
-from pathlib import PurePath
+from pathlib import Path, PurePath
+import json
+import requests
import sys
import globus_sdk
+from urllib.parse import urlparse
# This is the tutorial client ID from
# https://globus-sdk-python.readthedocs.io/en/stable/tutorial.html.
@@ -31,12 +34,31 @@ def find_files(transfer_client, endpoint_id, path=PurePath("/")):
else:
yield path / file["name"]
+def download_file(url, cookies):
+ filepath = Path(urlparse(url).path).relative_to("/")
+ filepath.parent.mkdir(parents=True, exist_ok=True)
+ with open(filepath, "wb") as f:
+ for chunk in (requests.get(url, cookies=cookies, stream=True)
+ .iter_content(chunk_size=1024*1024)):
+ f.write(chunk)
+
parser = argparse.ArgumentParser(description="Get web links for Globus collection")
parser.add_argument("endpoint_id", metavar="endpoint-id", help="Endpoint ID of collection")
+parser.add_argument("cookies", help="JSON file with cookies from Globus web app")
args = parser.parse_args()
transfer_client = globus_sdk.TransferClient(
authorizer=globus_sdk.AccessTokenAuthorizer(get_transfer_token()))
endpoint = transfer_client.get_endpoint(args.endpoint_id)
-for path in find_files(transfer_client, args.endpoint_id):
- print(endpoint["https_server"] + str(path))
+urls = [endpoint["https_server"] + str(path)
+ for path in find_files(transfer_client, args.endpoint_id)]
+total = len(urls)
+print(f"Found {total} files")
+
+with open(args.cookies) as f:
+ cookies = json.load(f)
+
+for i, url in enumerate(urls, 1):
+ print(f"{i}/{total}: Downloading {url}")
+ download_file(url, cookies)
+print("Download complete!")
diff --git a/manifest.scm b/manifest.scm
index d07898a..87ee2ec 100644
--- a/manifest.scm
+++ b/manifest.scm
@@ -1,2 +1,2 @@
(specifications->manifest
- (list "python" "python-globus-sdk"))
+ (list "python" "python-globus-sdk" "python-requests"))