From telemetry data to CSVs with Python, Spark and Azure Databricks

Transform GBs of telemetry data into CSV format leveraging Pandas, PySpark and Azure Databricks

Nicolò Giso

Tags: Data, Internet of Things (IoT), Public Cloud, Use Case, Python

See in schedule: Wed, Jul 28, 15:15-15:45 CEST (30 min)

Tenova is an engineering company that works alongside client-partners to design and develop innovative technologies and services that improve their business, creating solutions that help metals and mining companies reduce costs, save energy, limit environmental impact and improve working conditions for their employees.

In the context of Industry 4.0, Tenova provides each piece of equipment with a field gateway, named Tenova Edge, to collect telemetry data, perform edge analytics with AI models and send data to the Tenova Platform (hosted on Microsoft Azure) for further processing.

To develop analytics solutions, data scientists and process engineers need the data in a manageable format.

Furthermore, continuous retraining of AI models is necessary to guarantee high performance and reliable results.

For all of these reasons, we needed to implement an ETL solution to transform the raw data into formats ready for analysis and retraining. In particular, the key requirement was to convert the JSON Lines files coming from the field into CSV files ready to be used.
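To make the requirement concrete: a JSON Lines file holds one JSON object per line, which Pandas can read directly. The snippet below is a minimal sketch of the small-scale conversion, not the production pipeline; the field names ("deviceId", "timestamp", "temperature") and file paths are purely illustrative.

```python
# Minimal sketch (not the production notebooks): turning a small JSON Lines
# telemetry file into CSVs with Pandas. Field names and paths are assumptions.
import pandas as pd

# telemetry.jsonl contains one JSON object per line, e.g.:
# {"deviceId": "furnace-01", "timestamp": "2021-07-28T10:15:00Z", "temperature": 1520.3}
df = pd.read_json("telemetry.jsonl", lines=True)

# One CSV per device; the real solution writes to Azure storage rather than local files
for device_id, device_df in df.groupby("deviceId"):
    device_df.to_csv(f"{device_id}.csv", index=False)
```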

The CSV files have to satisfy the following conditions:
- each file contains the data for a single device
- there is only one file per device per day
- each file has a midnight row containing, for each cell, the value recorded at midnight or, if none was recorded, the last value of the previous day (SPOILER: this is where the fun happens! A rough sketch of this logic follows below.)
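As a rough illustration of the midnight-row logic, the sketch below assumes each device's data is already in a Pandas DataFrame indexed by timestamp, with one column per signal; the function name and DataFrame layout are assumptions for illustration, not the actual notebook code.

```python
import pandas as pd

def add_midnight_row(today: pd.DataFrame,
                     prev_day: pd.DataFrame,
                     midnight: pd.Timestamp) -> pd.DataFrame:
    """Ensure the day's DataFrame starts with a midnight row: for each column,
    take the value recorded exactly at midnight if present, otherwise the last
    value recorded on the previous day."""
    if midnight in today.index:
        midnight_row = today.loc[[midnight]]
        today = today.drop(index=midnight)
    else:
        # no sample exactly at midnight: start from an all-empty row
        midnight_row = pd.DataFrame(index=[midnight], columns=today.columns)

    # per-column last known value of the previous day (forward-fill handles gaps)
    last_of_prev_day = prev_day.ffill().iloc[-1]
    midnight_row = midnight_row.fillna(last_of_prev_day)

    return pd.concat([midnight_row, today]).sort_index()
```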

For this purpose, we have implemented a series of Databricks notebooks, run daily by Azure Data Factory, that leverage PySpark and Pandas to transform the raw JSON Lines files into nicely formatted CSVs.
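A minimal sketch of what such a notebook might do is shown below, assuming the raw JSON Lines files are mounted under a path like /mnt/raw/ and carry deviceId and timestamp fields; the paths, column names and partitioning strategy are assumptions, not the actual implementation.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # already provided inside a Databricks notebook

# spark.read.json() reads JSON Lines (one JSON object per line) by default
raw = spark.read.json("/mnt/raw/2021-07-28/*.jsonl")

daily = raw.withColumn("date", F.to_date("timestamp"))

(daily
 .repartition("deviceId", "date")   # keep each device/day in a single partition...
 .write
 .partitionBy("deviceId", "date")   # ...so each output folder gets a single CSV part file
 .mode("overwrite")
 .option("header", True)
 .csv("/mnt/curated/csv/"))
```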

Type: Talk (30 mins); Python level: Intermediate; Domain level: Beginner


Nicolò Giso

Tenova

Passionate about Data Science and Machine Learning, involved in data science projects in the Tenova Digital Team, from ETL through modeling to deployment.

Some of the projects in which I took part are:
- Implementation of a custom model for image recognition
- Modeling of furnace processes to optimize performance
- Building ML services infrastructure leveraging Microsoft Azure Cloud services such as Azure Machine Learning Workspace, Azure Databricks and Azure DevOps