Mary Marek-Spartz, MSP LTER – University of Minnesota

The Minneapolis-St. Paul urban LTER site spans the seven-county metropolitan area of one of the most spatially distributed population centers in the country (in terms of humans per square mile). As a result, nearly all research projects operating within our site use and produce spatial data. At 2,975 square miles, the Twin Cities Metro Area is covered by some large data layers from a range of state and University organizations, some at 1-meter resolution. This makes the GIS workloads of MSP LTER researchers too cumbersome for most personal computers, but not so cumbersome that they justify vying for space at our Supercomputing Institute and meeting all of the prerequisites that come with using that resource. Most GIS users at our site use R spatial packages with the RStudio IDE as an alternative to proprietary, platform-specific GIS software (Yay!). Depending on the researcher’s workstation setup, there may not be enough memory available to their R session for large-ish GIS tasks, and they get the dreaded “vector memory exhausted (limit reached?)” error.

It would be easy enough to spin up a virtual machine with our University IT department and install R, Python, and geospatial libraries on it, but many researchers have become very comfortable with the RStudio IDE for running R (and even Python) and visualizing data. And who can blame them? RStudio is already extremely popular in the ecology community, and the company is quickly branching out into languages beyond R, even planning to change its name to “Posit” next year.

However, installing RStudio Server (or Posit Server?) on a University-run server proved challenging. Well, installing it was easy, since RStudio Server has an open-source distribution, but running it with all the admin permission restrictions in place at our institution was going to be difficult, to say the least. Not to mention that our IT department, like many IT departments, was (and is) fielding countless requests from the many faculty, staff, and students working remotely.

We needed a way to call up a high-powered computer in the cloud, configure all the necessary GIS packages and RStudio on the machine, and then give the researcher who needed it a link to the server where they would be greeted with the comfortable RStudio IDE in the browser for copying their scripts right in. We needed a solution that was fast, flexible and affordable (albeit not free like using University machines). 

I turned to Amazon Lightsail, an AWS product that provides pay-per-use Virtual Private Server instances of various sizes for a monthly price. Lightsail users can select the type of machine they want, the region, and the operating system, and provide a startup script that runs at launch. Using a tutorial I found on YouTube, I was able to launch an instance and install RStudio Server on it in minutes. I then had access to the RStudio interface in the browser at the server’s IP address on port 8787, the port RStudio Server uses by default.
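
For anyone who wants to confirm that the server is reachable before sending a link to a researcher, here is a minimal sketch. The function names are made up for illustration, the IP address is a documentation placeholder, and it assumes that port 8787 has been opened in the instance's firewall settings and that `curl` is available on the machine running the check.

```shell
#!/bin/bash
# Build the browser URL for an RStudio Server instance.
# 8787 is RStudio Server's default port; the host is supplied by the caller.
rstudio_url() {
  local host="$1"
  local port="${2:-8787}"
  printf 'http://%s:%s\n' "$host" "$port"
}

# Succeed if anything answers HTTP at that URL within 10 seconds.
# This only shows that the port responds, not that a login will work.
rstudio_is_up() {
  curl --silent --output /dev/null --max-time 10 "$(rstudio_url "$@")"
}

# Example (203.0.113.10 is a placeholder, not a real server):
#   rstudio_is_up 203.0.113.10 && echo "Open $(rstudio_url 203.0.113.10)"
```

If the check fails right after launch, the startup script may simply still be running; compiling GDAL in particular can take a while.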

At this point, the instance was only ready for tabular data analysis in RStudio. It was not yet outfitted with the GIS libraries that R spatial packages depend on: GDAL (Geospatial Data Abstraction Library), PROJ (for projections), and GEOS (for geometry operations). My instance was running Amazon Linux 2, which had a version of GDAL available as an extension package, but this version was too old for most of the R geospatial packages we use. I needed to configure a custom install of GDAL from osgeo.org on the instance. I ended up with the startup script below:

#!/bin/bash
# from https://www.youtube.com/watch?v=zJuFpqB01u4&t=4s
# install R
sudo amazon-linux-extras install -y R4
sudo yum install -y R # may be part of the above command
# install RStudio-Server
wget https://download2.rstudio.org/server/centos7/x86_64/rstudio-server-rhel-2022.02.3-492-x86_64.rpm
sudo yum install -y --nogpgcheck rstudio-server-rhel-2022.02.3-492-x86_64.rpm
sudo yum install -y curl-devel
# not a real user or password
sudo useradd me0001
# sudo must apply to chpasswd (the command that needs root), not to echo
echo "me0001:password" | sudo chpasswd

## GDAL:
# from https://gist.github.com/mojodna/2f596ca2fca48f08438e
# and https://github.com/aws/elastic-beanstalk-roadmap/issues/199
# and https://gist.github.com/HerveNivon/fe3a327bc28b142e51beb38ef11844c0
# need to install gnu compiler and proj/geos dependencies for gdal from extras

sudo yum -y update
# install the EPEL repository definition first, then make sure it is enabled
sudo amazon-linux-extras install epel -y
sudo yum-config-manager --enable epel
sudo yum -y install make automake gcc gcc-c++ libcurl-devel proj-devel geos-devel proj-nad proj-epsg

GDAL_VERSION=2.4.4
cd /tmp
curl -L "http://download.osgeo.org/gdal/${GDAL_VERSION}/gdal-${GDAL_VERSION}.tar.gz" | tar zxf -
cd "gdal-${GDAL_VERSION}/"
./configure --prefix=/usr/local --without-python
make -j4
sudo make install
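# copy the shared library to where the system linker looks, then refresh its cache
sudo cp /usr/local/lib/libgdal.so.20* /usr/lib64/
sudo ldconfig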
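
Because the whole point of the custom build is to get a GDAL that is new enough for the R packages, it can help to check the installed version before installing anything in R. The sketch below is one way to do that; the function names are hypothetical, and the minimum version passed in is an assumption that should be taken from the requirements of whichever R package is being installed.

```shell
#!/bin/bash
# version_ge A B: succeed when version A is at least version B.
# Relies on GNU sort's -V (version sort), which ships with Amazon Linux 2.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# check_gdal MIN: compare the GDAL found on PATH against a minimum version.
check_gdal() {
  local min="$1"
  local found
  # gdal-config is installed alongside GDAL and reports its version
  found="$(gdal-config --version)" || return 1
  if version_ge "$found" "$min"; then
    echo "GDAL ${found} is new enough (need >= ${min})"
  else
    echo "GDAL ${found} is too old (need >= ${min})" >&2
    return 1
  fi
}

# Example, run on the instance after the build finishes
# (2.0.1 is an assumed minimum; check the R package's stated requirements):
#   check_gdal 2.0.1 && sudo Rscript -e 'install.packages("sf", repos = "https://cloud.r-project.org")'
```

With the script above, `gdal-config --version` should report 2.4.4 if the build in /usr/local is the one found first on the PATH.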

I can start up as much as a 32 GB instance configured with the above script and hand access over to a researcher for a few days so they can complete their GIS task. Such a machine would cost a few bucks a day, and I can hand it back over to AWS when we are done using it to avoid racking up a large bill for the month. To safeguard against overspending, I acquired an AWS login through the University account that is connected to our project’s budget string and provides alerts when we exceed a specified amount of spending during the month.
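
The launch and hand-back steps can also be scripted with the AWS command line tool instead of the Lightsail web console. What follows is only a sketch under several assumptions: the AWS CLI is installed and configured with credentials, the startup script is saved locally as a file, and the instance name, zone, blueprint ID, and bundle ID shown in the example are placeholders (the real values can be listed with `aws lightsail get-blueprints` and `aws lightsail get-bundles`). The wrapper functions are hypothetical, and they print commands rather than run them unless DRY_RUN is set to 0.

```shell
#!/bin/bash
# With DRY_RUN=1 (the default here) commands are printed instead of executed.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# launch_instance NAME ZONE BLUEPRINT BUNDLE SCRIPT
# Creates one instance and passes SCRIPT as the launch (user data) script.
launch_instance() {
  run aws lightsail create-instances \
    --instance-names "$1" \
    --availability-zone "$2" \
    --blueprint-id "$3" \
    --bundle-id "$4" \
    --user-data "file://$5"
}

# delete_instance NAME: hand the machine back so it stops accruing charges.
delete_instance() {
  run aws lightsail delete-instance --instance-name "$1"
}

# Example with placeholder values:
#   launch_instance gis-temp us-east-2a amazon_linux_2 BUNDLE_ID startup.sh
#   delete_instance gis-temp
```

Deleting an instance discards everything on its disk, so any outputs the researcher wants to keep need to be downloaded first.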

In the coming months, I plan to write into the startup script a command that terminates the instance after two days automatically and to make use of AWS containers to mount some of our most used GIS data to the servers for easy import into RStudio. I am also planning to get a certificate and domain name, possibly from the University, so we can have a umn.edu address. My hope is that this solution will work at the scale of the MSP LTER for a while and provide an option with few barriers to entry, as users of the instance do not need to know any Linux or AWS to use the RStudio Server environment.
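
One simple starting point for the two-day limit, sketched here rather than tested on a live instance, is to schedule an operating-system halt from within the startup script. The function names are hypothetical. Note that halting is not the same as deleting: the instance would still exist in the Lightsail account until it is deleted there, so a true automatic termination would need an AWS-side call as well.

```shell
#!/bin/bash
# Convert a number of whole days to minutes, the unit `shutdown +N` expects.
days_to_minutes() {
  echo $(( $1 * 24 * 60 ))
}

# Schedule an operating-system halt N days after this line runs.
# The schedule does not survive a reboot, so it would need to be set again.
schedule_halt() {
  sudo shutdown -h "+$(days_to_minutes "$1")"
}

# Example, as the last line of the startup script:
#   schedule_halt 2    # runs: sudo shutdown -h +2880
```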