

CHAPTER 12

Using Cloud-Based Analytics to Save Lives

P. Dhingra

Microsoft Corporation, Seattle, WA, USA

K. Tolle

Microsoft Research, Seattle, WA, USA

D. Gannon

School of Informatics and Computing, Indiana University, Bloomington, IN, USA



INTRODUCTION

A common problem with today’s early warning systems for floods and tsunamis is that they are highly inaccurate or too coarse grained to be useful to first responders and the public. When information is overstated, people tend to become complacent and fail to follow directives provided in future warnings when there is real danger. If the information is too coarse, people are unsure of how they can protect themselves in a disaster situation and first responders are unsure of how best to deploy their resources. Ed Clark from the National Oceanic and Atmospheric Administration’s (NOAA) National Water Center in Tuscaloosa, Alabama, states the potential future flood damage this way, “In the next 30 years or so, 2538 people in the United States will lose their life[sic] to flooding and there will be approximately $300 billion in losses in infrastructure and property damage.” (Keynote Presentation, “Water Resources and the Cyber-Infrastructure Revolution,” 3rd Consortium of Universities for the Advancement of Hydrologic Science, Inc. (CUAHSI) Conference on Hydroinformatics, Tuscaloosa, AL, July 15th, 2015.) Worldwide, the Centers for Disease Control and Prevention report that flood-related deaths are higher than those from any other type of natural disaster, representing 40% of the total deaths (Centers for Disease Control and Prevention, 2014). With climate change, severe weather may become more likely (http://news.nationalgeographic.com/news/2013/13/130215-severe-storm-climate-change-weather-science/), and there is a real possibility that death toll numbers could be even higher.



The time to save people from a flood caused by a tsunami, superstorm, or extreme precipitation event is before it happens. Timely, precise, and granular predictions about natural disasters, enabled by cloud computing and cyberinfrastructure, are uniquely positioned to reduce the potentially devastating impacts. Incorporating terrestrial information such as water levels, water table concentrations (ground water), and even the amount of permeable surfaces could help predict floods, mudslides, or sinkholes, but would not necessarily improve the precision of the weather forecast itself. We can make predictions more precise by using high-powered computation and running different ensemble models in parallel. Extreme weather impact estimates can now be improved by using machine learning. And with advanced analytics and mobile personalization, individualized weather warnings can be sent via cell phones.

All of these are possible if you have access to the appropriate instrument data, high storage capacity, large processing power, easy-to-use statistical machine-learning software, visualization and simulation tools, and the power of the Internet, WiFi, and cellular networks. This chapter discusses an end-to-end cyberinfrastructure that can make improved early warning systems and near real-time disaster prediction possible.

BACKGROUND

Previously, an effort of this scale could only be undertaken with government institutional infrastructure and financial support. And certainly in this case, government infrastructure and government research laboratories working in conjunction with government-funded academic researchers have laid the groundwork for what is now possible (Plale et al., 2006).

Open data initiatives supported by the US Government (https://www.data.gov) are providing oceans of data to the public, researchers, and industry. And it is data, along with on-demand, large-scale, networked computing resources, that will provide the means to create a new generation of applications to help tackle some of the toughest problems we face today—in particular, those that threaten our lives and our livelihoods.

But instead of a system hosted on government infrastructure, the system we propose can be set up in the cloud by anyone. With cloud computing, a startup company or a small research team can rent all the resources needed to collect data from millions of sensors, store petabytes of data, process data through pipelines, apply statistical modeling and machine learning to do forecasting, and create relevant customizable notification services. It may be the case that direct access to the sensors is not possible because they are owned and operated by academic or government entities. In this case, those entities could make the streams available to individual research teams through a high-throughput data hub.





CLOUD COMPUTING: ENABLING PUBLIC, PRIVATE, AND ACADEMIC PARTNERSHIPS

Governments provide the funding and resources for the vast majority of oceanographic, terrestrial, and atmospheric data. The analysis of these data takes place largely in government research laboratories or academia. Moreover, the processing, storage, and computation typically take place on premises. However, to support the next generation of data gathering and analysis, each institution must secure the funding to increase its local computing power and the additional physical infrastructure required to house, power, and cool these systems.

The increasing amount of openly shared data, which fosters collaboration and discovery among academia, research institutions, and government agencies, also creates a vicious circle: during these collaborations, each institution conducting research requires more and more compute capacity to conduct often redundant experimentation on publicly available data sources.


As data increases in size and complexity, marshalling the needed compute resources becomes a barrier to research if it is undertaken by each institution independently. Cloud computing makes it possible for a large collaboration to share resources that can be scaled up to meet the demand and scaled back when not needed. By accessing a single data store, multiple experiments can be conducted by geographically dispersed teams with no replication of compute and data resources. Data can be stored once and accessed from separate virtual machines (VMs) anywhere in the cloud.


Simply put, cloud computing is a much more efficient and cost-effective solution than expanding the capacity of aging data centers. So if this is the case, why have institutions not generally adopted cloud infrastructure to increase research capacity? There are several reasons:

• Inexperience with cloud computing

• Fault/disaster tolerance concerns

• Ensuring capacity

• Inability to use existing tools/applications/languages

• Security and privacy issues

Cloud Computing in Ocean and Atmospheric Sciences, First Edition, 2016, 221-244

Author's personal copy

224 Cloud Computing in Ocean and Atmospheric Sciences




To adequately address these issues it is necessary for us to review some basics about cloud computing and also describe some more recent advances.

A Cloud Overview

Commercial cloud solutions are available from a number of providers, with the most well-known being Amazon Web Services, Microsoft Azure, Google Cloud, and IBM Cloud Services. The examples we use in this chapter are based on Microsoft’s Azure. Although many of the features we describe are generic, we will also make use of some Azure-specific capabilities.

Cloud offerings typically fall into three categories: Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS) as shown in Fig. 12.1 below.

Infrastructure as a Service (IaaS)

Public cloud systems provide users with the ability to “rent” virtual machines (VMs) running a variety of operating systems. For example, from the Azure Gallery, one can instantly create Windows, Linux, Structured Query Language (SQL) Server, Oracle, SharePoint, IBM, SAP, and BizTalk VMs. In addition to the VMs that come standard with most cloud providers, users can also leverage the open-source community. On Azure, the VM Depot (https://vmdepot.msopentech.com) is a community-driven catalog of open-source VM images, containing a variety of Linux and FreeBSD images.

Platform as a Service (PaaS)

In PaaS, the cloud not only provides the “iron,” or computational hardware, it also provides the operating system and some high-level services. This solution still requires the user to provide the software, but removes the hassle of managing operating systems (O/S), such as applying O/S patches. This leaves application developers free to focus on building solutions rather than the underlying software infrastructure.

Software as a Service (SaaS)

SaaS goes one step beyond PaaS. In this instance, all of the software is built by a cloud provider. Users benefit from their favorite applications running in the cloud without having to install or update them. An example of such a system is Office 365, in which users log into and use Office applications like Excel. Though the experience is like using Excel on their own computer, the reality is that the application interface, the storage, and processing are all taking place in the cloud.


Deployment and Security Model

Commercial clouds have the ability to be very secure platforms. How secure they are is configurable by the user. There are three options in Azure: Public, Private, and Hybrid cloud. Each is described below.

Public Cloud

The public cloud is the least secure option because all of the hardware, platform, and software services are reachable via the Internet through the cloud-provider data center. The public cloud is the most cost-effective solution and is well suited for data that do not require extensive security.

Private Cloud

When an organization selects the private cloud option, it is provided with a dedicated system for computation, storage, software, and networking. Private cloud is suitable for any entity that wants to share data internally but protect it from external access. With the private cloud, an organization can ensure that organizational data and the hardware it sits on cannot be accessed from outside the network. Private clouds do have a premium cost associated with these services.

Cloud Computing in Ocean and Atmospheric Sciences, First Edition, 2016, 221-244

Author's personal copy

226 Cloud Computing in Ocean and Atmospheric Sciences

Hybrid Cloud

Selecting the hybrid cloud can provide the best of both worlds. Workloads that need less security can be processed in the public cloud, while more sensitive data and processing can be run on premises or in the private cloud as needed.

Most large organizations have a mix of requirements and will therefore benefit from features of the hybrid cloud. For example, a company trying to determine the best reservoir height to maximize profit while protecting communities from flood risk can leverage the public flood models of the National Flood Interoperability Experiment (NFIE) in the public cloud while maintaining its intellectual property workload in its private cloud.

Another example is support for government agencies. Much of the data they own and collect is made available to researchers to facilitate projects like the one described here. However, there are also related data that absolutely must remain secure to protect national security. This is a perfect application for a hybrid cloud solution. In the case of Azure, this is supported by the ability to easily provide hybrid connectivity and support services between public and private cloud deployments via ExpressRoute (http://azure.microsoft.com/en-us/documentation/services/expressroute/) and other services described as follows.

ExpressRoute

Azure ExpressRoute creates private connections between Azure data centers and on-premises infrastructure. These connections do not go over the public Internet and offer more reliability, faster speeds, lower latencies, and higher security than typical Internet connections. This service can also lower cloud-computing costs if users frequently transfer large amounts of data between the public cloud and on-premises servers.

Service Bus

Azure Service Bus (http://azure.microsoft.com/en-us/documentation/services/service-bus/) provides message passing between public and private cloud applications. This allows the owner of private data services to filter the streams and expose publicly only those they desire to make public.
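As an illustration, the following Python sketch uses the Azure Service Bus client library to forward an individual, approved sensor reading from a private application to a queue that public consumers can read; the connection string, queue name, and reading format are hypothetical.

from azure.servicebus import ServiceBusClient, ServiceBusMessage

CONN_STR = "<service-bus-connection-string>"  # placeholder credential

def publish_reading(reading: dict) -> None:
    # Forward one approved gauge reading to the publicly readable queue.
    with ServiceBusClient.from_connection_string(CONN_STR) as client:
        with client.get_queue_sender(queue_name="public-gauge-stream") as sender:
            sender.send_messages(ServiceBusMessage(str(reading)))

publish_reading({"gauge_id": "USGS-02462000", "stage_m": 3.1, "time": "2015-07-15T14:00Z"})

A receiving application in the public cloud would open a corresponding queue receiver and process the messages as they arrive.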

Azure Backup

Azure Backup Service (http://azure.microsoft.com/en-us/documentation/services/backup/) (ABS) provides a single backup solution for private, public, and hybrid cloud deployments. In hybrid systems, it is possible to configure ABS to back up on-premises VMs as well as public cloud VMs. Many, though not all, cloud-computing systems offer certified privacy and security compliance. If an extremely high level of security is needed, researchers are encouraged to ensure that the cloud provider they use has a suitable level of privacy compliance, that they select the appropriate privacy and security options when configuring their cloud solution, and that the cloud provider has been accredited with meeting compliance audits by the agencies relevant to their data.

For instance, Microsoft Azure has successfully been audited for Content Delivery and Security Association (CDSA), Criminal Justice Information Services (CJIS), Health Insurance Portability and Accountability Act (HIPAA), as well as many other security and privacy certifications (http://azure.microsoft.com/en-us/support/trust-center/compliance/).

Azure Site Recovery

Azure Site Recovery (http://azure.microsoft.com/en-us/documentation/services/site-recovery/) provides a business continuity and disaster recovery (BCDR) strategy by orchestrating replication, failover, and recovery of virtual machines and physical servers. It enables one to quickly copy VMs in an Azure data center or an on-premises data center.

Learning About the Cloud

Using cloud computing is often transparent to users, and management of these systems is similar to, and in some cases much easier than, managing a local resource. There are training courses offered by cloud providers and third-party trainers that can bring knowledgeable data center personnel quickly up to speed, as well as training that is more targeted at researchers themselves. For example, Microsoft Research offers an Azure for Research online course which is specifically targeted at onboarding researchers and students on how to use the cloud for scientific endeavors.

Fault Tolerance

Cloud providers automatically replicate data, so access can be much more reliable than in most on-premises data centers. For instance, Azure not only mirrors data and services in one location, it also replicates them in a different geographic location to ensure that if one cloud-computing center is compromised, the data and services associated with it can be preserved—even if the primary data center housing that system is destroyed by a disaster.
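Replication behavior is chosen when a storage account is created. The sketch below uses the Azure storage management library for Python to request geo-redundant storage (GRS), which mirrors data to a paired region automatically; the subscription ID, resource group, and account name are placeholders.

from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

storage = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")

poller = storage.storage_accounts.begin_create(
    resource_group_name="flood-demo-rg",
    account_name="floodgaugedata",        # must be globally unique
    parameters={
        "location": "westus2",
        "kind": "StorageV2",
        "sku": {"name": "Standard_GRS"},  # geo-redundant: mirrored to a paired region
    },
)
account = poller.result()                 # wait for provisioning to finish
print(account.name, account.sku.name)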

Compute Capacity

As previously mentioned, compute capacity in the cloud is scalable and “on demand”. There are several aspects to this capability. First, most cloud providers have data centers that are based on virtualized servers. Applications are configured as services that run in virtual machines (VMs) on the servers. Typically, an application service receives input commands as web requests, produces output in the form of data stored in the cloud, and replies to the requester as web responses. The application code is typically deployed as a VM image that the cloud O/S can load onto an appropriate server.

There are several different ways to create the VM image. The “traditional” approach is to configure a complete copy of an O/S (Linux or Windows) and install the application there. A more “modern” approach is based on the concept of a container, which is a package containing the application service and its needed resources but not the entire O/S. The container is then deployed on a running VM that provides the core O/S resources. Many containers can be hosted on a single VM. Deploying a containerized application can be 10 to 100 times faster than deploying a full VM. Furthermore, a single containerized application can be run without change on data centers from many different cloud providers. In both the VM and the container approach to packaging, it is standard practice to store the image in a repository so that it can be easily accessed, downloaded, and deployed on a server in any of the cloud-provider data centers.
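To make the container idea concrete, here is a brief sketch using the Docker SDK for Python to start a containerized service from a prebuilt image on a VM that is already running; the image and command are stand-ins for a real application service.

import docker

client = docker.from_env()  # talk to the Docker daemon on this VM

container = client.containers.run(
    image="python:3.11-slim",               # prebuilt image pulled from a repository
    command="python -m http.server 8080",   # stand-in for the application service
    ports={"8080/tcp": 8080},               # expose the service port
    detach=True,                            # return immediately; container keeps running
)
print(container.short_id, container.status)

Because the image omits the full operating system, pulling and starting it typically takes seconds rather than the minutes needed to boot a complete VM.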

Depending on the requirements of the application, the user can select the appropriate server type. For example, if an application requires one central processing unit (CPU) core and 4 gigabytes (GB) of memory, a corresponding server can be configured. If the application requires 32 cores and 128 GB of memory, that too can be made available as a VM host. A second dimension of scalability is the number and location of the servers hosting copies of the application VM or container. If many users of a particular application request access at the same time, it may be necessary to deploy multiple instances of the application across the data center or different geographic locations. Cloud providers have standard tools to “auto-scale” an application if the load is heavy. This same mechanism can be used to reduce the number of running instances if the load drops.
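The available server configurations can be enumerated programmatically. This sketch uses the Azure compute management library for Python to list the VM sizes in a region that satisfy a 32-core, 128-GB requirement; the subscription ID and region are placeholders.

from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

compute = ComputeManagementClient(DefaultAzureCredential(), "<subscription-id>")

# List sizes offered in one region and keep those large enough for the workload.
for size in compute.virtual_machine_sizes.list(location="southcentralus"):
    if size.number_of_cores >= 32 and size.memory_in_mb >= 128 * 1024:
        print(size.name, size.number_of_cores, "cores,", size.memory_in_mb // 1024, "GB")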

Finally, many applications profit from being factored into smaller components which can each be containerized and deployed separately. For example, the application may have a Web-server component that deals with user interaction. The Web server may hand off “work” to separate computational components, or it may invoke the services of other applications such as machine learning systems, databases, mapping services, etc. The application then becomes a web of communicating subsystems. Although this sounds complex, most large-scale cloud services are designed in this manner because it more efficiently scales the application and reduces maintenance.
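A minimal sketch of this factoring is a small Flask front end that accepts forecast requests over HTTP and hands the work to a storage queue that a separate worker component drains; the endpoint, queue name, and request fields are hypothetical.

import json
from flask import Flask, request, jsonify
from azure.storage.queue import QueueClient

app = Flask(__name__)
work_queue = QueueClient.from_connection_string(
    conn_str="<storage-connection-string>", queue_name="forecast-work"
)

@app.route("/forecast", methods=["POST"])
def submit_forecast():
    job = request.get_json()                  # e.g. {"basin": "black-warrior", "hours": 12}
    work_queue.send_message(json.dumps(job))  # a worker container picks this up later
    return jsonify({"status": "queued"}), 202

if __name__ == "__main__":
    app.run()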

Also, “data fabrics” enable the sharing of data across VMs in such a way that one need not replicate data—saving on storage costs as well as compute resources. And as mentioned before, these data sources are protected from loss. They can also be replicated intentionally, so that different datasets can be located closest to users who require the most immediate access.

Azure Storage

Azure provides a range of scalable storage for big data applications: blobs, tables, queues, and file systems. In addition, the Azure storage stack includes support for relational databases, nonrelational databases, and caching solutions. We will discuss these in more detail in the following sections.

Azure Blob for Storing Unstructured Data

Blob storage is used to store unstructured data that can be accessed via the Hypertext Transfer Protocol/HTTP Secure (HTTP/HTTPS) protocols. Data can have controlled access and be made available either publicly or privately. Blobs are stored in “containers,” and a single Azure blob storage container can store up to 500 terabytes of data. There is no limit to the number of containers you can create.
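For example, uploading a radar scan to a blob container might look like the following Python sketch; the container, file, and blob names are illustrative, and the storage client library has evolved since this chapter was written.

from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<storage-connection-string>")
container = service.create_container("radar-scans")   # blobs live inside named containers

# Upload one file; the blob name can encode a folder-like path for organization.
with open("nexrad_kmob_20150715.nc", "rb") as data:
    container.upload_blob(name="2015/07/15/nexrad_kmob.nc", data=data)

Whether the container allows anonymous public reads or only authenticated private access is a property set on the container itself.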

Azure Table Storage for Structured, Nonrelational Data

Azure table storage is a NoSQL (http://nosql-database.org/) data store in Azure for storing large amounts of structured, nonrelational data. It can store terabytes of data as well as quickly serve the data to a Web application. This type of storage is suitable for datasets that are already denormalized and do not need complicated joins, foreign key relations, or stored procedures.
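A short sketch using the Azure tables client library for Python shows this denormalized pattern; the table name, partition scheme, and fields are hypothetical.

from azure.data.tables import TableServiceClient

service = TableServiceClient.from_connection_string("<storage-connection-string>")
table = service.create_table_if_not_exists("gaugereadings")

# Each entity is a flat set of properties keyed by PartitionKey + RowKey.
table.create_entity({
    "PartitionKey": "USGS-02462000",   # one partition per gauge
    "RowKey": "2015-07-15T14:00Z",     # unique within the partition
    "stage_m": 3.1,
    "discharge_cms": 142.0,
})

# Retrieve all readings for one gauge without joins or stored procedures.
readings = table.query_entities("PartitionKey eq 'USGS-02462000'")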

Tools

Because cloud systems allow users to set up VMs, the tools and applications they use can be set up to run in the cloud. In fact, with the container approach to application deployment, it is easy to build and test a container running on a laptop. Virtually any tool or software library that the application needs can be installed in the container. The most popular container repository is Docker Hub (https://www.docker.com/), and there are many prebuilt containers there with many of the standard programming tools (Python, C, Fortran, MySQL, etc.).

From such a hub, one can download a container and add an application with very little effort. It is also important to realize that the cloud providers have additional services pre-installed in the cloud that are easy to integrate into applications. For example, Microsoft Azure provides Azure Machine Learning, a complete machine-learning toolkit, as well as special services for managing event streams and for building mobile applications.