Tags: Architecture, Best Practice, Big Data, Public Cloud
See in schedule: Thu, Jul 29, 11:30-12:00 CEST (30 min)
Designing a data lake calls for well-researched decisions about storage, management, scalability, and availability. Managing schema evolution, however, remains a difficult task: the structure of data differs from one company to the next, which makes it hard to generalize a solution to the schema evolution problem.
At Episource, we faced exactly this challenge: the data we care about is the output of our NLP engine. Episource's machine learning and natural language processing platform processes millions of pages of medical documents, with up to 15 ML/DL models working together to produce the results. The output of such a pipeline is a complex, deeply nested JSON structure. With each major release, our NLP engine evolves, and the structure of its inference data evolves with it. As the data grew in size and complexity, storing it and making it searchable became a pressing necessity. We needed a solution that kept schema compatibility, versioning, and data integrity intact, and that left data reads and writes unaffected by schema mismatches.
After several iterations and proofs of concept, we settled on a solution that uses the Avro format to evolve our data's schema. Avro is a format comparable to Parquet, but one designed to accommodate schema evolution. To keep track of changes made to the system, schema versions are saved in a schema registry. To read the Avro data stored in S3, our data lake uses Athena, a distributed SQL engine based on Presto. Python libraries glue the various components of this pipeline together.
The following are some of the things that a participant can expect to learn during this talk:
1. Best practices for storage, control, scalability, and availability in a data lake
2. Managing schema evolution in a data lake
3. How to use both "schema-on-write" and "schema-on-read"
Type: Talk (30 mins); Python level: Beginner; Domain level: Beginner
Prakshi's technical background is in designing application architectures at Episource that bridge business and technical requirements, especially architectures that handle Big Data processing gracefully, and in optimizing common quality attributes such as flexibility, scalability, security, and manageability. Specialties: AWS Cloud, Big Data tools, Serverless Computing, DevOps, MLOps.