When it comes to working with data, we rarely know beforehand the right way to derive insights and extract value. Often, it is impossible to know exactly what can be done with data until we start exploring it. This makes it difficult to set expectations and goals for a data-intensive project. These challenges are made even more complex when data scientists don’t control or have access to data sources, or when the shape and content of the data changes over time.
As a result, code and architectural decisions can become chaotic, ad hoc, and difficult to maintain. Data scientists tend to be sidelined, with no real plan for integrating their efforts with the rest of the business or with other software developers. The science part of data science seems to get in the way of applying trusted best practices from the software industry.
But it doesn’t have to be this way.
In this talk, we’ll cover the differences between traditional software development and development for data-intensive products. I’ll give an overview of data collection, pipeline engineering, data exploration, and productionization of machine learning algorithms. Along the way, I’ll discuss what makes data products different, and describe tools and rituals that organizations and teams can adopt to address the specific difficulties of data-oriented projects.
Properly applied, these practices improve alignment between data scientists, developers, and stakeholders, and increase the speed of development and the quality and maintainability of the product.