Data-as-a-service must become the new standard for datasets

Using open source software, cloud services and a data-as-a-service strategy, companies can get more value from their data, faster.

This year, Amazon S3 turned 13 years old. It has become the standard way in which data first lands in the cloud due to its limitless scalability, simple administration and low cost. Today, Amazon Web Services (AWS) offers more than 100 services, with offerings for virtually every layer of the technology stack. Over the years, AWS has steadily expanded its services from «simple storage» to «servers by the hour» and «serverless» services, while software engineers have enjoyed ever-expanding services that improve their work-life and make them more productive.

We now call this idea «as-a-service». What we mean is that the complex and burdensome aspects of infrastructure are hidden and managed on the user’s behalf, allowing them to focus on more meaningful work. In addition, there is no wait time: With the click of a button, users can have thousands of instances provisioned on their behalf. Just as Google forever changed our expectations for simple, fast access to information, AWS has forever changed how we expect to consume infrastructure and other core technologies.

While infrastructure and technology have improved the user experience of Amazon’s online shoppers, access to data has gone in the opposite direction. Data has steadily become more difficult to access, more challenging to use and more prone to security threats. As companies have moved their data into specialized technology offerings, like data lakes, NoSQL and various cloud services, the challenges of accessing and managing data have become hopelessly complex. As a result, data consumers are unable to gain access to data themselves and instead go to IT for their needs, where they take their place in line, waiting their turn.

Just as software engineers once waited weeks and months to have their servers racked and stacked by IT before they could begin to deploy their applications, data consumers wait weeks and months to have data provisioned for their needs. This is a massive opportunity cost for companies today. There are more than 200 million data consumers globally – even assuming modest costs per individual, the associated productivity losses quickly accumulate to hundreds of billions of dollars each year.
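The arithmetic behind that estimate can be sketched in a few lines. Only the 200 million figure comes from the text; the per-consumer loss values are hypothetical inputs chosen to show the scale, not measured data.

```python
# Number of data consumers worldwide, as quoted in the text.
DATA_CONSUMERS = 200_000_000


def annual_loss(loss_per_consumer_usd):
    """Total yearly loss for a given (hypothetical) loss per consumer."""
    return DATA_CONSUMERS * loss_per_consumer_usd


def loss_needed_for(total_usd):
    """Per-consumer yearly loss required to reach a given total."""
    return total_usd / DATA_CONSUMERS


# Loss per consumer at which the total reaches 100 billion dollars.
break_even = loss_needed_for(100_000_000_000)

# Total if each consumer lost a hypothetical 1,000 dollars of productivity.
example_total = annual_loss(1_000)
```

Under these assumptions, a loss of 500 dollars per consumer per year is enough to reach 100 billion dollars in total, and 1,000 dollars per consumer reaches 200 billion.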

The idea of «as-a-service» which Amazon championed for infrastructure to the benefit of software engineers now needs to be applied to enterprise data to the benefit of data consumers. This includes data scientists, analysts, business intelligence users and others that depend on access to data to do their jobs daily.

Just as software developers can provision infrastructure and services for a new application, on-demand and with virtually zero lead time, data consumers should be able to provision data for training a machine-learning model, working with their favorite tools, without relying on IT to do this work on their behalf. It should be possible to create new dashboards in a few minutes rather than weeks or months.

Dataset bottlenecks

Data is far more massive, complex and variable than infrastructure and software services. While a Fortune 500 company may deal in thousands of instances on their favorite cloud platform, an individual analytics job can easily involve dozens of data sources and billions of data points, as well as transformations and enrichment in advance of the actual analysis.

Another bottleneck is the scarcity of data engineers in companies today. For each data engineer there are typically more than 100 data consumers. As a result, every data consumer ends up standing in line, waiting for their turn with IT, while data engineers are always putting out the next fire rather than working on larger, more strategic initiatives.

Through a combination of open source technologies and best practices, companies can develop a data-as-a-service strategy. Through this approach, data engineers become more productive in their support of data consumers while ensuring the governance, security and availability of the service. In addition, data consumers can spend the majority of their time doing what they do best: making sense of the data to help the business operate more effectively.

What are the building blocks of data-as-a-service?

First, companies need to move away from making endless copies of data they move around between different technologies and environments. Examples include things like extracts, cubes, data marts and aggregation tables, which are created to give different users faster access to a subset of enterprise data.

Instead, companies should develop a strategy where datasets are provisioned on-demand using advanced capabilities that provide high-performance access to data from any source and simultaneously apply transformation, ensure access controls and mask sensitive data dynamically. While this idea has been around for many years, it has been plagued with complexity, slow performance and no ability to provide self-service for the data consumer. Today, advances in hardware and new open-source projects like Apache Arrow simplify and accelerate access to data, making this approach feasible in a way it has never been before.
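A minimal sketch of what on-demand provisioning could look like, using only the Python standard library. Everything here is a hypothetical illustration: the `CUSTOMERS` rows, the `User` type, and the `provision` and `mask_email` helpers are not taken from any product or project named in this article. The point is that filtering, access control and masking are applied at read time, per request, rather than by producing another stored copy of the data.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical source rows; in practice these would stay in the source system.
CUSTOMERS = [
    {"id": 1, "email": "ana@example.com", "region": "EU", "signup": date(2019, 3, 1)},
    {"id": 2, "email": "bob@example.com", "region": "US", "signup": date(2019, 5, 9)},
    {"id": 3, "email": "cleo@example.com", "region": "EU", "signup": date(2019, 6, 20)},
]


@dataclass(frozen=True)
class User:
    """Hypothetical data consumer with simple entitlements."""
    name: str
    regions: frozenset  # regions this user is allowed to read
    sees_pii: bool      # whether sensitive fields are shown unmasked


def mask_email(email):
    """Keep the first character and the domain, hide the rest."""
    local, _, domain = email.partition("@")
    return local[0] + "***@" + domain


def provision(user, rows, since):
    """Lazily yield the rows this user may see; nothing is stored."""
    for row in rows:
        if row["region"] not in user.regions:
            continue  # access control
        if row["signup"] < since:
            continue  # transformation: time filter chosen by the consumer
        out = dict(row)  # per-row copy so the source row is never modified
        if not user.sees_pii:
            out["email"] = mask_email(row["email"])  # dynamic masking
        yield out


analyst = User("analyst", frozenset({"EU"}), sees_pii=False)
result = list(provision(analyst, CUSTOMERS, since=date(2019, 4, 1)))
```

Here the analyst receives only the EU customer who signed up after 1 April 2019, with the email address masked, while the source rows are left untouched. A production system would push these filters down to the underlying sources instead of scanning rows in Python.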

Companies also need to think in terms of a central, vetted enterprise catalog of their data assets. Ask your analysts where they would find data to answer a question about our customers in Europe over the past 180 days and the answer is frequently: «We would ask IT.» But things are very different in their personal lives – if they were searching for hotels near the stadium of their favorite sports team, they would simply ask Google and find the answer instantly. It should be just as easy to find data at work as it is at home.

Data consumers frequently require customized datasets that have not yet been created, such as datasets focused on a period of time, geography or business unit. Traditionally, data consumers would wait for IT to create a data mart on their behalf. With data-as-a-service, the data consumer can do this work themselves.

Data-as-a-service is a strategy that companies can implement in the cloud, on-premises or in a hybrid model. Companies manage their data in many different silos, including relational databases, data warehouses, data marts, NoSQL databases and object stores like Amazon S3. By following a data-as-a-service strategy, companies can make all their data assets available to data consumers. Using open source software, cloud services and a data-as-a-service strategy, companies can get more value from their data, faster.

Source: Kelly Stirman