Accelerate Data Science: Why a Graph should be at the heart of your Data Platform
A majority of Data Science takes data about previous customer interactions and uses it to interrogate the current state, predict future activity and confirm hypotheses about changes to implement. Ensuring the availability and accuracy of the data required to make such predictions is critical, but Data Scientists often have to curate and clean the data themselves, and reportedly spend 80% of their time doing so. Data engineers can use cloud technologies to curate and clean data, and to provide an execution environment for data science workloads.
We can solve the problems that inhibit data-driven processes, namely inefficient data science output and high team turnover, by providing an environment that reduces the friction of data curation. In that environment, curation is performed up front, using information architecture and data engineering to model the data into ontologies and ingest it into a graph.
The graph brings together data from across business area silos and SaaS services into a single up-to-date view, to be used by data scientists instead of the individual data sources. Under this model, data sources do not need to change in order to be integrated; instead, integration is performed at graph ingest, ideally through a reliable update process such as change data capture (CDC). As such, the graph is a reliable abstraction of the underlying data sources, providing a single point of query for data scientists across the business.
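To make the ingest step concrete, here is a minimal sketch of applying CDC-style change events from several source systems to one graph. It is an illustration under stated assumptions rather than a reference implementation: the `ChangeEvent` shape, the `Graph` class, the source names ("crm", "billing", "support") and the relation names are all hypothetical, and a real platform would use a graph database and a CDC tool rather than in-memory dictionaries.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any

NodeId = tuple[str, str]  # (ontology label, business key)


@dataclass
class ChangeEvent:
    """Hypothetical CDC record emitted by a source system."""
    source: str                     # originating silo, e.g. "crm"
    op: str                         # "upsert" or "delete"
    label: str                      # ontology class, e.g. "Customer"
    key: str                        # business key within that class
    properties: dict[str, Any] = field(default_factory=dict)
    # Outgoing relationships as (relation, target_label, target_key).
    links: list[tuple[str, str, str]] = field(default_factory=list)


class Graph:
    """Tiny in-memory property graph, standing in for a graph database."""

    def __init__(self) -> None:
        self.nodes: dict[NodeId, dict[str, Any]] = {}
        self.edges: set[tuple[NodeId, str, NodeId]] = set()

    def apply(self, event: ChangeEvent) -> None:
        """Apply one change event; the source system itself is untouched."""
        node_id = (event.label, event.key)
        if event.op == "delete":
            self.nodes.pop(node_id, None)
            # Drop every edge that starts or ends at the deleted node.
            self.edges = {e for e in self.edges if node_id not in (e[0], e[2])}
            return
        self.nodes.setdefault(node_id, {}).update(event.properties)
        for relation, target_label, target_key in event.links:
            target = (target_label, target_key)
            self.nodes.setdefault(target, {})  # placeholder until its own event arrives
            self.edges.add((node_id, relation, target))

    def linked_to(self, label: str, key: str) -> list[tuple[str, str, str]]:
        """Return (label, key, relation) for every node pointing at the given node."""
        node_id = (label, key)
        return sorted(
            (src[0], src[1], rel) for src, rel, dst in self.edges if dst == node_id
        )


# Events from three separate silos, all describing the same customer.
events = [
    ChangeEvent("crm", "upsert", "Customer", "c1", {"name": "Ada"}),
    ChangeEvent("billing", "upsert", "Invoice", "i9", {"total": 120.0},
                [("BILLED_TO", "Customer", "c1")]),
    ChangeEvent("support", "upsert", "Ticket", "t3", {"status": "open"},
                [("RAISED_BY", "Customer", "c1")]),
]

graph = Graph()
for event in events:
    graph.apply(event)

# One query spans billing and support data without touching either system.
print(graph.linked_to("Customer", "c1"))
```

The point of the sketch is the shape of the design: each source only emits changes, the mapping to ontology classes happens at ingest, and the data scientist asks one question of one graph.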
We have reached an inflexion point in cloud computing, where "will this system scale" is rarely the hard question: commodity cloud offerings such as AWS Aurora and GCP Bigtable allow very large volumes of structured data to be stored and queried, with offloading to elastic query systems such as AWS EMR and GCP BigQuery for batched workloads.
This level of scaling and query power is now a commodity, in the hands of everyone, at a pay-as-you-go metered price that scales with the size of the platform. The more difficult question is how to mobilise teams effectively to use the vast amounts of data around the business. Architecturally, the solution is to stream data from across business silos and from SaaS providers into a data platform that runs on a public cloud, where workloads can elastically spin resources up and down as required. This obviates the need for a large capital outlay on IT equipment, while still providing resources to support peak workloads.
Thus, instead of working hard to scale up data systems, we face greater difficulty scaling teams. Specifically, when every solution scales computationally to your needs, we need to look at the ease with which the system can scale with your team.
This is where the choice of components inside your data platform makes a significant difference, and why having a graph at its heart gives you an edge.
We can demonstrate the benefits by looking at the typical workflow of a data scientist:
- Analyse data sources and formulate hypotheses
- Gather, clean and curate data
- Query data to test hypotheses
- Iterate on queries to refine findings
- Present most significant findings to stakeholders
- Operationalise findings into predictive models using machine learning
- Productionise models into business area systems
When your data scientists are spending only 20% of their time doing the actual science, there are two ways to achieve more output – hire more data scientists, or provide them with an environment where they can spend less time curating data.
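The trade-off between those two options can be sketched as simple arithmetic. The 80% curation share comes from the figure quoted earlier; the team size of 5, the 40-hour week and the reduced curation share of 60% are purely illustrative assumptions, and `science_hours` is a hypothetical helper.

```python
def science_hours(headcount: int, curation_percent: int, hours_per_week: int = 40) -> float:
    """Weekly hours a team spends on actual science rather than curation."""
    return headcount * hours_per_week * (100 - curation_percent) / 100


baseline = science_hours(headcount=5, curation_percent=80)
more_hires = science_hours(headcount=10, curation_percent=80)      # double the team
less_curation = science_hours(headcount=5, curation_percent=60)    # same team, less curation

print(baseline, more_hires, less_curation)
```

Under these assumptions, cutting the curation share from 80% to 60% yields the same science hours as doubling the headcount.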
Hiring more data scientists is the immediate solution, and although it is becoming more difficult, it is not impossible. The bigger problem is retaining them: when the environment for doing data science demands what is essentially data engineering rather than refining hypotheses, staff turnover is much higher, and growing well-performing teams is difficult.