Test your data like you test your code

Duration:
45 minutes

Abstract

I will introduce the concept of data unit tests and explain why they matter in the workflow of data scientists building data products. In this talk, you will learn about a new tool you can use to ensure the quality of the products you build.

Talk | PyData: Data Engineering

Description

When data scientists build data products, they usually need to combine multiple data sources to train their models and then serve predictions. Making sure that both the code and the data behave as expected throughout the full lifetime of the project is complex. To ensure the quality of the code, automated testing is an established best practice in software engineering, backed by a large body of supporting material. However, ensuring the quality of the input and output data holistically is not yet as well covered.

In this talk, I will explain the concept of data unit tests and why they are important. Then I will present an overview of the current libraries that help build data unit tests. Finally, I will explain how we integrated them into our workflow at GetYourGuide.
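
To make the idea concrete, here is a minimal sketch of what a data unit test could look like. It is written in plain Python and does not use any of the libraries covered in the talk; the column names (`activity_id`, `price`, `rating`), the sample rows, and the `check_rows` helper are all hypothetical and chosen only for illustration.

```python
from typing import Any, Optional

# Hypothetical sample rows; the column names are illustrative only.
ROWS = [
    {"activity_id": 1, "price": 25.0, "rating": 4.5},
    {"activity_id": 2, "price": 40.0, "rating": 3.8},
]


def _in_range(value: Optional[float], low: float, high: float) -> bool:
    """True when value is present and lies within [low, high]."""
    return value is not None and low <= value <= high


def check_rows(rows: list[dict[str, Any]]) -> list[str]:
    """Return a list of readable failures; an empty list means the data passed."""
    failures = []
    ids = [row.get("activity_id") for row in rows]

    # Expectation 1: the key column has no missing values.
    if any(i is None for i in ids):
        failures.append("activity_id contains missing values")

    # Expectation 2: the key column is unique.
    if len(set(ids)) != len(ids):
        failures.append("activity_id contains duplicates")

    # Expectation 3: prices are present and non-negative.
    if not all(_in_range(row.get("price"), 0.0, float("inf")) for row in rows):
        failures.append("price must be present and non-negative")

    # Expectation 4: ratings are present and within the assumed 0 to 5 scale.
    if not all(_in_range(row.get("rating"), 0.0, 5.0) for row in rows):
        failures.append("rating must be between 0 and 5")

    return failures


def test_rows_meet_expectations() -> None:
    # Written like a code unit test, but the subject under test is the data.
    assert check_rows(ROWS) == []
```

The test function can be collected by a runner such as pytest alongside regular code tests, so a violated expectation about the data fails the build in the same way a broken function would.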


The speaker

Theodore Meynard

Theodore Meynard is a data scientist at GetYourGuide. He works on our ranking algorithm to help customers find the best activities to book and locations to explore. He is one of the co-organisers of the PyData Berlin meetup. When he is not programming, he loves riding his bike in search of the best bakery-patisserie in town.