Architecting Data: A Programmer's Guide to Synthetic Data
- Duration: 180 minutes
Abstract
Finding good datasets for data products, or good assets for websites, can be time-consuming. For instance, data professionals might need data from heavily regulated industries like healthcare and finance, while software developers might want to skip the tedious task of collecting images, text, and videos for a website. Luckily, both scenarios can now benefit from the same solution: Synthetic Data.
Synthetic Data is artificially generated data created with machine learning models, algorithms, and simulations. This workshop shows you how to enter that synthetic world by building a full-stack tech product out of five interrelated projects: reproducible data pipelines, a dashboard, machine learning models, a web interface, and a documentation site. So, if you want to enhance your data projects or find great assets to build websites with, come and spend 3 fun and knowledge-rich hours in this workshop.
Description
Audience
This tutorial is targeted at intermediate-level programmers looking to get started using synthetic data in their projects. The session will be particularly useful for data professionals, full-stack web developers, and educators searching for new ways to enhance their workflows and improve their projects.
Prerequisites
- 1 year of programming experience with Python
- Being comfortable with loops, functions, list comprehensions, and if-else statements
- At least 5 GB of free disk space on your computer
Outline
Total time budgeted - 3 hours
- Introduction and Setup (~10 minutes)
- Environment setup. An optional free-to-use environment will be provided on Binder, GitPod, Google Colab, and GitHub Codespaces
- Agenda for the session
- Instructors intro
- Motivation for the workshop
- Section I - Building Blocks (~40 minutes)
- Introduction to Synthetic Data
- What is it and why use it?
- How to generate synthetic data with plain Python code?
- Introduction to the different frameworks available
- Creating a synthetic data generator module
- Exercise (5 min)
- Analytics
- Analysing and comparing real data vs synthetic data
- Creating an analytical proof of concept product with synthetic data
- Exercise (5 min)
- 10-minute break
- Section II - Engineering (~60 minutes)
- Data Engineering
- Task - Create synthetic datasets and build ETL pipelines for different use cases
- Synthetic Data Use Case - Generating data with errors to simulate how data professionals receive data in the real world
- Exercise (5 min)
- Software Engineering
- Task - Develop a simple website using Python tools such as FastAPI and Jinja templates
- Synthetic Data Use Case - Generate website assets, including images, videos, and text
- Exercise (5 min)
- 10-minute break
- Section III - Machine Learning (~30 minutes)
- Quick intro to Machine Learning
- Task - Create and evaluate different models and pipelines
- Synthetic Data Use Cases
- Data Augmentation
- Increase in Privacy
- Evaluation of Machine Learning Models
- Exercise (5 min)
- Concluding Thoughts
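
To give a flavour of the "plain Python" item in Section I and the "data with errors" use case in Section II, here is a minimal sketch using only the standard library. It is an illustration, not the workshop's actual material: the function name `make_patients`, the field names, the value ranges, and the `error_rate` parameter are all assumptions made up for this example.

```python
import random


def make_patients(n, seed=0, error_rate=0.0):
    """Return n synthetic patient records as a list of dicts.

    All field names and value ranges here are hypothetical.
    """
    rng = random.Random(seed)  # seeded generator so runs are reproducible
    first_names = ["Ana", "Ben", "Chen", "Dara", "Eli"]
    fields = ["name", "age", "systolic_bp"]
    records = []
    for i in range(n):
        record = {
            "id": i,
            "name": rng.choice(first_names),
            "age": rng.randint(18, 90),
            "systolic_bp": round(rng.gauss(120, 15), 1),
        }
        # With probability error_rate, blank out one field to mimic
        # the missing values found in real-world data.
        if rng.random() < error_rate:
            record[rng.choice(fields)] = None
        records.append(record)
    return records


if __name__ == "__main__":
    for row in make_patients(5, seed=42, error_rate=0.2):
        print(row)
```

Seeding the generator keeps the dataset reproducible, which matters when the same synthetic data feeds a pipeline, a dashboard, and a model.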