The problem: cataloging a large number of unmanaged data sources. The audience: data engineers, data analysts, data solution developers, and data solution architects.
Ivan's talk covers the work on building DataCrafter, a data catalog built on MongoDB for large, heterogeneous public datasets in complex formats from unmanaged sources.
The catalog includes such rarely implemented features as:
- automatic data schema creation;
- automatic classification/identification of data types (cadastral numbers, email addresses, company IDs, links, etc.);
- automated documentation;
- automatic data quality assessment.
The talk will focus on the experiments that preceded the catalog, the technology stack, the problems being solved, and the limitations.
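
To make the schema creation and data type identification features more concrete, here is a minimal sketch of how such a step might look. It is not DataCrafter's implementation: the function names `classify_value` and `infer_schema`, the `PATTERNS` table, and the regular expressions (including the assumed cadastral number layout) are all hypothetical and chosen only for illustration.

```python
import re
from collections import Counter, defaultdict

# Hypothetical, deliberately simplified patterns. A real catalog would need
# many more rules and stricter validation than these.
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "url": re.compile(r"^https?://\S+$"),
    # Assumed layout for a cadastral number: NN:NN:NNNNNNN:N...
    "cadastral_number": re.compile(r"^\d{2}:\d{2}:\d{6,7}:\d+$"),
}


def classify_value(value):
    """Return a semantic type label for one value.

    Non-string values fall back to their Python type name; strings that
    match no pattern are labeled plain "str".
    """
    if not isinstance(value, str):
        return type(value).__name__
    for name, pattern in PATTERNS.items():
        if pattern.match(value):
            return name
    return "str"


def infer_schema(records):
    """Infer a field -> type mapping from a list of dict records.

    Each value votes for a type label; the most frequent label per field wins.
    """
    votes = defaultdict(Counter)
    for record in records:
        for field, value in record.items():
            votes[field][classify_value(value)] += 1
    return {field: counts.most_common(1)[0][0] for field, counts in votes.items()}
```

Calling `infer_schema` on a list of records returns one label per field, so a column of email addresses is labeled `email` rather than a generic string. Majority voting is one simple way to tolerate a few dirty values in a field; the share of values that disagree with the winning label could also serve as a rough data quality signal.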