Multipart Upload
Intro
Multipart Upload is a way to upload large files to an S3-compatible Object Storage such as GDX Cloud by splitting them into smaller parts and uploading each part in parallel.
Uploading multiple parts simultaneously improves upload speed and provides better reliability and resumability in case of network errors or interruptions: if the upload of a single part fails, the other parts remain unaffected and the process can resume at any time. Once all parts are uploaded, they are combined into a single object on the server.
This has several advantages:
- Upload speed: uploading parts in parallel increases throughput and makes the overall upload faster.
- Failure recovery: if the connection drops during an upload, the object is still safe. Parts that have already been uploaded are retained, and the upload can resume with the missing parts.
- Pause and resume: the upload can be stopped and resumed as needed, without restarting the entire upload.
This feature is ideal for uploading large objects, maximizing network throughput, or uploading files over an unstable network where failures are common.
Multipart Upload step by step
The multipart upload process involves the following steps:
- Initiate the upload: Start the upload process by sending a CreateMultipartUpload API request. This request returns an upload ID, which is used to identify the upload in subsequent API requests.
- Upload parts: Upload parts of the file in parallel by sending UploadPart API requests with the upload ID and a part number. The part number should start at 1 and increment for each part.
- Complete the upload: Once all parts are uploaded, send a CompleteMultipartUpload API request with the upload ID and a list of part numbers and their corresponding ETags (a hash of the data).
- Verify the upload: Confirm the successful completion of the upload by downloading the entire file using GetObject and comparing it to the original file.
Usage
Let’s see how to work with this feature using the GDX Cloud Console or AWS s3api CLI commands.
Using the GDX Cloud Console
- Sign in to the GDX Cloud Console
- Navigate to your bucket
- Click the “Upload” button
- Select large files to upload
- The multipart upload will be handled automatically for files over a certain size
The console provides a user-friendly interface for managing multipart uploads with progress tracking and automatic retry handling.
Using AWS S3 API
You can use the AWS s3api CLI commands to manually control multipart uploads:
Create a multipart upload
Let’s start initializing a multipart upload:
aws s3api create-multipart-upload --endpoint-url https://s3.gdx.datnass.com --bucket my-gdx-cloud-bucket --key dork.log
This will print an UploadId in the output; let's take note of it.
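The output will look something like this (the UploadId value here is illustrative):
{
    "Bucket": "my-gdx-cloud-bucket",
    "Key": "dork.log",
    "UploadId": "VXBsb2FkSWQtZXhhbXBsZQ"
}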
Show the ongoing multipart uploads
If at any point we need to check which multipart uploads are still ongoing:
aws s3api list-multipart-uploads --endpoint-url https://s3.gdx.datnass.com --bucket my-gdx-cloud-bucket
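The response lists the in-progress uploads for the bucket, along these lines (fields abridged, values illustrative):
{
    "Uploads": [
        {
            "UploadId": "VXBsb2FkSWQtZXhhbXBsZQ",
            "Key": "dork.log",
            "Initiated": "2024-01-01T00:00:00.000Z"
        }
    ]
}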
Upload a few parts
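The file first needs to be split into part files. As a minimal sketch, assuming GNU coreutils, the two part files used below could be produced with split (for a real upload, each part except possibly the last must meet the 5 MiB minimum listed under Limits):
split -n 2 --numeric-suffixes=1 -a 1 --additional-suffix=.log ~/dork.log ~/dork-part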
Then we can upload the two parts:
aws s3api upload-part --endpoint-url https://s3.gdx.datnass.com --bucket my-gdx-cloud-bucket --key dork.log --upload-id <UploadId> --part-number 1 --body ~/dork-part1.log
aws s3api upload-part --endpoint-url https://s3.gdx.datnass.com --bucket my-gdx-cloud-bucket --key dork.log --upload-id <UploadId> --part-number 2 --body ~/dork-part2.log
Each command will print an ETag value in its output; let's take note of them.
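Each response contains just the part's ETag, for example (value illustrative):
{
    "ETag": "\"5d41402abc4b2a76b9719d911017c592\""
}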
Upload a part by copying from another object
When uploading a part, we can also omit the body and instead copy the data from an existing object, using upload-part-copy:
aws s3api upload-part-copy --endpoint-url https://s3.gdx.datnass.com --bucket my-gdx-cloud-bucket --key dork.log --upload-id <UploadId> --part-number 1 --copy-source "my-gdx-cloud-bucket/my-source-object"
Alternatively, we can choose to copy only a portion of the source object. For example, we could copy only the first 1024 bytes:
aws s3api upload-part-copy --endpoint-url https://s3.gdx.datnass.com --bucket my-gdx-cloud-bucket --key dork.log --upload-id <UploadId> --part-number 1 --copy-source "my-gdx-cloud-bucket/my-source-object" --copy-source-range bytes=0-1023
Just like with an ordinary part upload, take note of the ETag value printed in the output.
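Here the ETag is nested inside a CopyPartResult, for example (values illustrative):
{
    "CopyPartResult": {
        "LastModified": "2024-01-01T00:00:00.000Z",
        "ETag": "\"5d41402abc4b2a76b9719d911017c592\""
    }
}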
Show the uploaded parts
If at any point we need to check which parts have already been uploaded:
aws s3api list-parts --endpoint-url https://s3.gdx.datnass.com --bucket my-gdx-cloud-bucket --key dork.log --upload-id <UploadId>
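The response includes the number, size, and ETag of each uploaded part, along these lines (fields abridged, values illustrative):
{
    "Parts": [
        { "PartNumber": 1, "Size": 5242880, "ETag": "\"5d41402abc4b2a76b9719d911017c592\"" },
        { "PartNumber": 2, "Size": 5242880, "ETag": "\"6057f13c496ecf7fd777ceb9e79ae285\"" }
    ]
}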
Complete a multipart upload
Finally, once all the parts have been uploaded, we can create the final object out of them by completing the upload:
aws s3api complete-multipart-upload --endpoint-url https://s3.gdx.datnass.com --bucket my-gdx-cloud-bucket --key dork.log --upload-id <UploadId> --multipart-upload "Parts=[{ETag=<ETag first part>,PartNumber=1},{ETag=<ETag second part>,PartNumber=2}]"
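On success, the response reports the bucket, key, and ETag of the assembled object. We can then perform the verification step from earlier: download the object with GetObject and compare it to the original file. A minimal sketch, assuming the parts were produced by splitting ~/dork.log as shown above:
aws s3api get-object --endpoint-url https://s3.gdx.datnass.com --bucket my-gdx-cloud-bucket --key dork.log ~/dork-downloaded.log
md5sum ~/dork.log ~/dork-downloaded.log
If the two checksums match, the upload completed correctly.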
Limits
Multipart uploads are subject to some size limits, summarized in the following table:
| Item | Limit |
|---|---|
| Maximum object size | 5 TiB |
| Maximum number of parts per upload | 10,000 |
| Part numbers | 1 to 10,000 (inclusive) |
| Minimum part size | 5 MiB |
| Maximum part size | 5 GiB |
| Maximum number of parts returned in a list parts request | 1,000 |
| Maximum number of multipart uploads returned in a list multipart uploads request | 1,000 |
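As a worked example of how these limits interact: a 5 TiB object uploaded in the maximum 10,000 parts requires parts of 5 TiB / 10,000 ≈ 524 MiB on average, well above the 5 MiB minimum; conversely, with 5 MiB parts the largest possible object is 10,000 × 5 MiB ≈ 48.8 GiB. Note that, in line with the usual S3 semantics, the minimum part size applies to every part except the last.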