18 Nov 2021 - Datathon
Last year when I was preparing Huawei Cloud for our annual Singapore Healthcare AI Datathon event, I did not think of writing a summary, which could have greatly helped the setup this year.
This year, I am setting up AWS as one of the cloud environments, so I decided to write down the process.
We provide one EC2 for every team to carry out their computing tasks.
First of all, AWS has a default vCPU limit for every account, which in our case is not enough. A limit increase needs to be requested. Here is the user guide for this.
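A rough way to size the request is to multiply the number of instances by the vCPUs per instance type. The sketch below uses hypothetical team counts (10 GPU teams, 5 CPU teams) and only prints the AWS CLI commands for review instead of running them. The quota codes are assumed to correspond to the "Running On-Demand G and VT instances" and "Running On-Demand Standard instances" quotas; verify them in the Service Quotas console before submitting anything.

```shell
# Hypothetical team counts; replace with the real numbers for your event.
GPU_TEAMS=10
CPU_TEAMS=5

# vCPUs per instance type (g4dn.12xlarge = 48, m5.8xlarge = 32).
G4DN_12XL_VCPUS=48
M5_8XL_VCPUS=32

# required_vcpus <instance_count> <vcpus_per_instance>
required_vcpus() {
  echo $(( $1 * $2 ))
}

# quota_request_cmd <quota_code> <desired_value>
# Prints (does not run) the quota increase request.
# The quota codes passed in below are assumptions; check them in the console.
quota_request_cmd() {
  echo "aws service-quotas request-service-quota-increase" \
       "--service-code ec2 --quota-code $1 --desired-value $2"
}

gpu_vcpus=$(required_vcpus "$GPU_TEAMS" "$G4DN_12XL_VCPUS")
cpu_vcpus=$(required_vcpus "$CPU_TEAMS" "$M5_8XL_VCPUS")

quota_request_cmd L-DB2E81BA "$gpu_vcpus"   # assumed: G and VT instances
quota_request_cmd L-1216C47A "$cpu_vcpus"   # assumed: Standard instances
```

It is worth adding a small margin on top of the computed values for the helper EC2 and any spare instances.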
To pick the right instance type, there are a few considerations.
In this year’s datathon, we choose AWS’s g4dn.12xlarge instance, which has 4 NVIDIA T4 GPUs, 48 vCPUs and 192 GiB of system memory. g4dn.12xlarge also comes with 900 GB of NVMe SSD instance storage that can be mounted. Take note that the data on this NVMe SSD is lost when the instance is stopped, so don’t keep anything there (except temporary files) unless the instance will be kept running until the end of the datathon. For teams that are not using a GPU, we choose m5.8xlarge instances.
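Mounting the NVMe instance storage could look like the following sketch. The device name /dev/nvme1n1 and the mount point /scratch are assumptions, not values from our setup; confirm the device with lsblk first. The helper prints each command by default and only executes it when DRY_RUN=0 is set.

```shell
# Dry-run helper: prints the command unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Assumed device name and mount point; confirm the device with `lsblk`.
DEVICE=${DEVICE:-/dev/nvme1n1}
MOUNT_POINT=${MOUNT_POINT:-/scratch}

mount_instance_store() {
  run sudo mkfs -t ext4 "$DEVICE"            # format the empty instance store volume
  run sudo mkdir -p "$MOUNT_POINT"           # create the mount point
  run sudo mount "$DEVICE" "$MOUNT_POINT"    # mount the volume
  run sudo chown "$(id -un):$(id -gn)" "$MOUNT_POINT"   # let the login user write to it
}

mount_instance_store
```

Because the volume comes back empty after a stop and start, these steps have to be repeated whenever the instance is restarted that way.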
Launching the first instance (example of GPU instance)
g4dn.12xlarge
Setting up the environment
sudo apt update
source activate python3 && pip install jupyterlab
conda install -c conda-forge r-irkernel
Download the mimic-cxr-jpg dataset directly to the machine
After S3 is set up
Create an IAM role with AmazonS3ReadOnlyAccess and AmazonSSMManagedInstanceCore and associate it to the EC2 in the console
After RDS is set up
Back up the instance
Launching more instances
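Backing up the configured instance and launching copies of it for the other teams could be scripted roughly as below. This assumes the backup is taken as an AMI, which is one common way to do it and not necessarily the only one. The instance ID, AMI ID, key pair name, image name and team count are all hypothetical placeholders, and the commands are only printed unless DRY_RUN=0 is set.

```shell
# Dry-run helper: prints the command unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Hypothetical identifiers; none of these come from the actual setup.
INSTANCE_ID=${INSTANCE_ID:-i-0123456789abcdef0}
AMI_ID=${AMI_ID:-ami-0123456789abcdef0}
KEY_NAME=${KEY_NAME:-datathon-key}
TEAM_COUNT=${TEAM_COUNT:-3}

# Create an AMI from the first, fully configured instance.
backup_instance() {
  run aws ec2 create-image --instance-id "$INSTANCE_ID" --name "datathon-gpu-base"
}

# Launch one instance per team from the AMI, tagged team1, team2, ...
launch_team_instances() {
  i=1
  while [ "$i" -le "$TEAM_COUNT" ]; do
    run aws ec2 run-instances \
      --image-id "$AMI_ID" \
      --instance-type g4dn.12xlarge \
      --count 1 \
      --key-name "$KEY_NAME" \
      --tag-specifications "ResourceType=instance,Tags=[{Key=Name,Value=team${i}}]"
    i=$((i + 1))
  done
}

backup_instance
launch_team_instances
```

Data on the NVMe instance storage is not part of such an image, so anything kept there has to be restored separately on each new instance.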
We build our EHR databases on RDS for every team to query. This year, we provide MIMIC-IV with ED and CXR, as well as eICU.
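As an overview, the database workflow in this section (dump on our own server, copy to an EC2, import into RDS, connect through an SSH tunnel) can be sketched as below. The host names, user and database names are hypothetical placeholders, and the target databases are assumed to already exist on the RDS. Commands are only printed unless DRY_RUN=0 is set.

```shell
# Dry-run helper: prints the command unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# All hosts, users and database names are hypothetical placeholders.
SOURCE_HOST=${SOURCE_HOST:-localhost}
EC2_HOST=${EC2_HOST:-ubuntu@ec2.example.com}
RDS_HOST=${RDS_HOST:-mydb.example.ap-southeast-1.rds.amazonaws.com}
DB_USER=${DB_USER:-postgres}
DATABASES=${DATABASES:-"mimiciv eicu"}

# On our own server: dump each database into a single .sql file.
dump_databases() {
  for db in $DATABASES; do
    run pg_dump -h "$SOURCE_HOST" -U "$DB_USER" -d "$db" -f "${db}.sql"
  done
}

# From our own machine: copy the dumps to the EC2.
copy_dumps_to_ec2() {
  for db in $DATABASES; do
    run scp "${db}.sql" "${EC2_HOST}:~/"
  done
}

# On the EC2: import each dump into the (already created) database on RDS.
import_dumps_into_rds() {
  for db in $DATABASES; do
    run psql -h "$RDS_HOST" -U "$DB_USER" -d "$db" -f "${db}.sql"
  done
}

# From our own machine: forward local port 5433 to the RDS through the EC2,
# so that pgadmin4 can connect to localhost:5433.
open_tunnel() {
  run ssh -N -L "5433:${RDS_HOST}:5432" "$EC2_HOST"
}

dump_databases
copy_dumps_to_ec2
import_dumps_into_rds
open_tunnel
```

Running the import from an EC2 in the same region keeps the large transfer inside AWS, which is the reason for copying the dumps there first.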
Preparing a backup of the databases
pg_dump each database into a single .sql file
The eicu database on our server does not have views, so a separate dump from eicu-views.pgsql needs to be imported
Create an EC2
Install postgresql & postgresql-client on the EC2 to enable the psql command (example for PostgreSQL 12)
scp the .sql backup files from your own machine to the EC2
Create an RDS
Connect with pgadmin4 by SSH tunneling through the EC2
Importing the database backup from EC2 (based on official guide from AWS)
Import using psql from the EC2
Roles management
Create team and teammaster roles, grant necessary privileges
We put our medical imaging datasets in S3 for teams to download if they need them. (reference)
Create an IAM role with AmazonS3FullAccess, download the access key
Run aws configure with the access key ID and secret access key from the IAM role created with AmazonS3FullAccess (default region name: ap-southeast-1, default output format: json)
aws s3 cp local_file s3://bucket_name
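For a whole dataset directory, aws s3 sync is usually more convenient than copying files one by one. The sketch below shows an upload from the machine holding the data and a download on a team's EC2. The bucket name, the dataset directory and the /scratch target are hypothetical placeholders, and the commands are only printed unless DRY_RUN=0 is set.

```shell
# Dry-run helper: prints the command unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Hypothetical bucket and dataset directory names.
BUCKET=${BUCKET:-bucket_name}
DATASET_DIR=${DATASET_DIR:-mimic-cxr-jpg}

# Upload a local dataset directory under the same prefix in the bucket.
upload_dataset() {
  run aws s3 sync "$DATASET_DIR" "s3://${BUCKET}/${DATASET_DIR}"
}

# download_dataset <target_dir>
# Download the dataset on a team's EC2 into the given directory.
download_dataset() {
  run aws s3 sync "s3://${BUCKET}/${DATASET_DIR}" "$1"
}

upload_dataset
download_dataset /scratch/mimic-cxr-jpg
```

On an EC2 that has a role with AmazonS3ReadOnlyAccess attached, the download should work without running aws configure, since the CLI picks up the instance role credentials.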