WANG Han
PhD student in Saw Swee Hock School of Public Health
Organizing chair of Robotics Innovation Challenge (RIC)
National University of Singapore (NUS)
Specializing in healthcare data analytics with AI
Supervised by Prof. Feng Mengling
My Email: ephwha at nus.edu.sg / wh1996fz at live.com


Setting up AWS cloud environment for Singapore Healthcare AI Datathon 2021

18 Nov 2021 - Datathon

Last year, when I was preparing Huawei Cloud for our annual Singapore Healthcare AI Datathon, I did not think of writing a summary, which would have greatly helped with this year's setup.

This year, I am setting up AWS as one of the cloud environments, so I have decided to write down the process.

Elastic Compute Cloud (EC2)

We provide one EC2 for every team to carry out their computing tasks.

First of all, AWS has a default vCPU limit for every account, which in our case is not enough. A limit increase needs to be requested. Here is the user guide for this.

To pick the right instance type, there are a few considerations.

  1. Storage: Participants will need to store their medical imaging datasets, which can reach a few hundred GB.
  2. GPU: Some teams need to train deep learning models, so GPUs are necessary.
  3. Environment: Common tools and packages should be pre-installed to save the teams' set-up time.

In this year’s datathon, we choose AWS’s g4dn.12xlarge instance, which has 4 NVIDIA T4 GPUs, 48 vCPUs and 192 GB of system memory. g4dn.12xlarge also comes with 900 GB of NVMe SSD instance storage that can be mounted. Take note that the data on this NVMe SSD is deleted when the instance is stopped, so don’t keep anything other than temporary data there unless the instance will be kept running until the end of the datathon. For teams that are not using GPUs, we choose m5.8xlarge instances.
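
To use the NVMe instance store as scratch space, it has to be formatted and mounted first. A minimal sketch, assuming the default ubuntu user and that the instance-store device shows up as /dev/nvme1n1 (check with lsblk, as the device name can differ):

    lsblk                              # identify the 900 GB instance-store device
    sudo mkfs -t ext4 /dev/nvme1n1     # create a filesystem on it (destroys any existing data)
    sudo mkdir -p /scratch
    sudo mount /dev/nvme1n1 /scratch   # mount it as scratch space
    sudo chown ubuntu:ubuntu /scratch  # let the default user write to it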

Launching the first instance (example of GPU instance)

  1. Search for the “Deep Learning AMI (Ubuntu 18.04) Version 52.0” AMI, which has the mainstream deep learning environments pre-installed
  2. Choose the instance type g4dn.12xlarge
  3. Configure the storage needed; increase the root volume to 1000 GB if the budget allows (the default is 110 GB, which is almost full because the pre-installed environment already takes up around 100 GB)
  4. Configure one security group with port 22 open in the inbound rules (either use the default security group or create a new one)
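
For reference, the same launch can be scripted with the AWS CLI. A rough sketch, where the AMI ID, key pair name and security group ID are placeholders to look up in your own account, and the root device name may differ by AMI:

    aws ec2 run-instances \
        --image-id ami-xxxxxxxx \
        --instance-type g4dn.12xlarge \
        --key-name datathon-key \
        --security-group-ids sg-xxxxxxxx \
        --block-device-mappings '[{"DeviceName":"/dev/sda1","Ebs":{"VolumeSize":1000,"VolumeType":"gp3"}}]' \
        --count 1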

Setting up the environment

  1. sudo apt update
  2. Install Jupyter Lab: source activate python3 && pip install jupyterlab
  3. Install R and R Kernel for Jupyter: conda install -c conda-forge r-irkernel
  4. Install other packages that are required
  5. Test to make sure Jupyter Lab, the Python kernels and the R kernel work properly over SSH port forwarding, and that the GPUs can be seen by your deep learning library (see the sketch after this list).
  6. Download the mimic-cxr-jpg dataset directly to the machine
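
A quick way to verify step 5, with the key path and IP as placeholders; the last check assumes PyTorch, so substitute the framework your teams actually use:

    # on the EC2 instance: start Jupyter Lab without a browser
    source activate python3
    jupyter lab --no-browser --port 8888

    # on the local machine: forward the port over SSH, then open http://localhost:8888
    ssh -i datathon-key.pem -L 8888:localhost:8888 ubuntu@<ec2-public-ip>

    # on the EC2 instance: confirm the GPUs are visible
    nvidia-smi
    python -c "import torch; print(torch.cuda.device_count())"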

After S3 is set up

  1. Create an IAM role with the AmazonS3ReadOnlyAccess and AmazonSSMManagedInstanceCore policies and associate it with the EC2 in the console
  2. Test to make sure the EC2 can see and download files from S3
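
A minimal check from inside the EC2; no access keys are needed on the instance since the attached IAM role is used, and the bucket name below is a placeholder:

    aws s3 ls s3://<datathon-bucket>/                 # list the bucket contents
    aws s3 cp s3://<datathon-bucket>/sample_file .    # download a test file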

After RDS is set up

  1. Create Python and R example notebooks to show how to connect to the RDS
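
The notebooks themselves would typically wrap the connection in psycopg2 on the Python side and RPostgres on the R side; as a quick connectivity check from the shell of the EC2, something like this should work, with the endpoint, user and database names as placeholders:

    psql -h <rds-endpoint>.ap-southeast-1.rds.amazonaws.com -p 5432 -U team01 -d mimiciv -c "SELECT 1;"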

Back up the instance

  1. Create an image from the EC2
    1. the size of the EBS volume can be increased when you launch a new instance from the image
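
The same step with the AWS CLI, with the instance ID and names as placeholders:

    aws ec2 create-image \
        --instance-id i-0123456789abcdef0 \
        --name "datathon-gpu-base" \
        --description "GPU instance with Jupyter, R kernel and datasets pre-installed"
    # add --no-reboot to skip restarting the instance, at the risk of file-system inconsistency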

Launching more instances

  1. Use the Python AWS SDK (boto3) to create instances and private keys in batch
  2. Associate the IAM role with all the EC2 instances in the console
  3. Take note of the IP address and the private key of each EC2 for distribution to the teams
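
Step 1 is done with the Python AWS SDK (boto3); purely as an illustration of the same idea, a plain AWS CLI loop could look like the following, with the AMI ID, security group and team count as placeholders:

    for i in $(seq -w 1 20); do
        # one private key per team
        aws ec2 create-key-pair --key-name "team-${i}" \
            --query 'KeyMaterial' --output text > "team-${i}.pem"
        chmod 400 "team-${i}.pem"

        # launch from the backup image created above
        aws ec2 run-instances \
            --image-id ami-xxxxxxxx \
            --instance-type g4dn.12xlarge \
            --key-name "team-${i}" \
            --security-group-ids sg-xxxxxxxx \
            --tag-specifications "ResourceType=instance,Tags=[{Key=Name,Value=team-${i}}]" \
            --count 1
    done

    # collect the public IPs for distribution
    aws ec2 describe-instances \
        --filters "Name=tag:Name,Values=team-*" "Name=instance-state-name,Values=running" \
        --query 'Reservations[].Instances[].[Tags[?Key==`Name`]|[0].Value,PublicIpAddress]' \
        --output text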


Relational Database Service (RDS)

We build our EHR databases on RDS for every team to query from. This year, we provide MIMIC-IV with ED and CXR, as well as eICU.

Preparing a backup of the databases

  1. Log in to our own machine that has the EHR databases installed
  2. pg_dump each database into a single .sql file
    1. The eicu database on our server does not include the views, so they need to be imported separately from eicu-views.pgsql
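
A sketch of the dump commands, assuming the databases are called mimiciv and eicu and are accessible to the postgres user:

    pg_dump -U postgres -d mimiciv -f mimiciv.sql
    pg_dump -U postgres -d eicu    -f eicu.sql
    # the eICU views are restored later from the separate eicu-views.pgsql script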

Create an EC2

  1. Make sure the EC2 has enough space (e.g. 512GB) in the root volume
  2. Install postgresql & postgresql-client on the EC2 to enable psql command (example for PostgreSQL 12)
  3. Make sure port 22 is open in the inbound rules of the security group to allow SSH access
  4. scp the .sql backup files from your own machine to the EC2
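
The corresponding commands, with the key path and IP as placeholders; installing PostgreSQL 12 on Ubuntu 18.04 may additionally require the PGDG apt repository:

    # on the EC2 (Ubuntu): install the server and client packages to get psql
    sudo apt update
    sudo apt install -y postgresql-12 postgresql-client-12

    # from the machine holding the dumps: copy them over
    scp -i datathon-key.pem mimiciv.sql eicu.sql eicu-views.pgsql ubuntu@<ec2-public-ip>:~/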

Create an RDS

  1. Launch an RDS instance (start with a low-end instance to save cost, upgrade to a better instance a few days before datathon starts)
    1. Set the maintenance window to low-usage hours (like 4AM)
  2. Create a separate security group for the RDS with only one inbound rule: port 5432, with the EC2’s security group as the source
  3. Connect to the RDS from a local pgAdmin4 session by SSH tunnelling through the EC2
  4. Create the empty databases in the RDS (e.g. mimiciv, eicu)
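
A sketch of steps 3 and 4, assuming the RDS master user is postgres and with the key, EC2 IP and RDS endpoint as placeholders:

    # local machine: open an SSH tunnel to the RDS through the EC2
    ssh -i datathon-key.pem -N -L 5433:<rds-endpoint>.ap-southeast-1.rds.amazonaws.com:5432 ubuntu@<ec2-public-ip>

    # point pgAdmin4 (or psql) at localhost:5433 and create the empty databases
    psql -h localhost -p 5433 -U postgres -c "CREATE DATABASE mimiciv;"
    psql -h localhost -p 5433 -U postgres -c "CREATE DATABASE eicu;"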

Importing the database backup from EC2 (based on official guide from AWS)

  1. Import the database with psql from the EC2
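
For example, from the EC2, with the endpoint and master user as placeholders:

    psql -h <rds-endpoint>.ap-southeast-1.rds.amazonaws.com -U postgres -d mimiciv -f mimiciv.sql
    psql -h <rds-endpoint>.ap-southeast-1.rds.amazonaws.com -U postgres -d eicu -f eicu.sql
    # the eICU views are created separately
    psql -h <rds-endpoint>.ap-southeast-1.rds.amazonaws.com -U postgres -d eicu -f eicu-views.pgsql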

Roles management

  1. Set up the respective team and teammaster roles and grant the necessary privileges
  2. Create individual team roles with different passwords so that each team can create tables in its own schema
  3. Let each team log in via pgAdmin4 using their own username and password
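
A minimal sketch of the role setup, run against the RDS; the role names, passwords and the schema name of the loaded database are placeholders to adapt to your own build:

    psql -h <rds-endpoint>.ap-southeast-1.rds.amazonaws.com -U postgres -d mimiciv -c "
      -- read-only group for all participants (repeat the grants for every data schema)
      CREATE ROLE team NOLOGIN;
      GRANT USAGE ON SCHEMA mimic_core TO team;
      GRANT SELECT ON ALL TABLES IN SCHEMA mimic_core TO team;

      -- one login role per team, each with its own schema for scratch tables
      CREATE ROLE team01 LOGIN PASSWORD 'change-me' IN ROLE team;
      CREATE SCHEMA AUTHORIZATION team01;
    "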


Simple Storage Service (S3)

We put our medical imaging datasets in S3 for teams to download if they need them. (reference)

  1. Create an S3 bucket
  2. Create an IAM user whose only permission policy is AmazonS3FullAccess, and download its access key
  3. Install AWS CLI on your local machine
  4. Configure the AWS CLI credentials using the access key ID and secret access key from the IAM user created with AmazonS3FullAccess (default region name: ap-southeast-1, default output format: json)
  5. Copy the data from our local machine to S3 with aws s3 cp local_file s3://bucket_name
  6. Create a VPC gateway endpoint for S3 so that the EC2 instances can talk to S3 directly
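
Putting these steps together on the local machine, with the bucket name, dataset path, VPC ID and route table ID as placeholders:

    aws configure                     # enter the access key ID, secret key, region ap-southeast-1, output json
    aws s3 mb s3://<datathon-bucket>  # create the bucket
    aws s3 cp ./mimic-cxr-jpg s3://<datathon-bucket>/mimic-cxr-jpg --recursive

    # gateway endpoint so the EC2 instances reach S3 over the VPC
    aws ec2 create-vpc-endpoint \
        --vpc-id vpc-xxxxxxxx \
        --service-name com.amazonaws.ap-southeast-1.s3 \
        --route-table-ids rtb-xxxxxxxx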