Data catalog and data lake based on MongoDB: Building tech stack from scratch

RU / Day 1 / 18:30 / Track 2

Problem: cataloging a large number of unmanaged data sources. Audience: data engineers, data analysts, data solution developers, and data solution architects.

Ivan will talk about building DataCrafter, a MongoDB-based data catalog that covers large, heterogeneous public datasets in complex formats from unmanaged sources.

The catalog includes such rarely implemented features as:

automatic data schema creation;
automatic classification/identification of data types (cadastral numbers, emails, company IDs, links, etc.);
automated documentation;
automatic data quality assessment.
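
As a rough illustration of what the first, second, and fourth features can mean in practice, here is a minimal Python sketch that works on plain dictionaries such as those returned by a MongoDB query. It is not DataCrafter's implementation: the function names, the regular expressions (including the assumed shape of a cadastral number), the 0.9 threshold, and the use of fill rate as a quality signal are all hypothetical choices made for the example.

```python
import re
from collections import defaultdict

# Hypothetical patterns for illustration only; a real catalog would need
# many more rules and stricter validation.
SEMANTIC_PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "url": re.compile(r"^https?://\S+$"),
    # Assumed shape of a cadastral number, e.g. 77:01:0001001:1234
    "cadastral_number": re.compile(r"^\d{2}:\d{2}:\d{6,7}:\d+$"),
}


def infer_schema(documents):
    """Map each top-level field to the sorted Python type names observed."""
    schema = defaultdict(set)
    for doc in documents:
        for field, value in doc.items():
            schema[field].add(type(value).__name__)
    return {field: sorted(types) for field, types in schema.items()}


def classify_field(values, threshold=0.9):
    """Return the first semantic type matched by at least `threshold`
    of the non-empty string values, or None if nothing qualifies."""
    strings = [v for v in values if isinstance(v, str) and v]
    if not strings:
        return None
    for name, pattern in SEMANTIC_PATTERNS.items():
        hits = sum(1 for s in strings if pattern.match(s))
        if hits / len(strings) >= threshold:
            return name
    return None


def fill_rate(documents, field):
    """Share of documents where `field` is present and not None or empty."""
    if not documents:
        return 0.0
    filled = sum(1 for d in documents if d.get(field) not in (None, ""))
    return filled / len(documents)


# Made-up sample records standing in for documents from one collection.
sample = [
    {"name": "A", "contact": "a@example.com", "parcel": "77:01:0001001:1234"},
    {"name": "B", "contact": "", "parcel": "50:20:0010203:45", "employees": 12},
]
schema = infer_schema(sample)
contact_type = classify_field([d.get("contact") for d in sample])
contact_fill = fill_rate(sample, "contact")
```

Working on sampled documents rather than a declared schema is what makes this approach plausible for unmanaged sources, where no schema is available up front.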

The talk will focus on the experiments that preceded the catalog, the technology stacks considered, the problems the catalog solves, and its limitations.
