Data catalog and data lake based on MongoDB: Building a tech stack from scratch

Day 1

RU

The problem: cataloging a large number of unmanaged data sources. The audience: data engineers, data analysts, data solution developers, and data solution architects.

Ivan will talk about building DataCrafter, a data catalog based on MongoDB that covers large, heterogeneous public data in complex formats from unmanaged sources.

The catalog includes such rarely implemented features as:

  • automatic data schema creation;
  • automatic classification/identification of data types (cadastral numbers, email addresses, company IDs, links, etc.);
  • automated documentation;
  • automatic data quality assessment.
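
To make the listed features more concrete, here is a minimal sketch of how schema creation, data type identification, and a simple quality metric might look for MongoDB-style documents. It is not DataCrafter's implementation: the function names (`classify_value`, `infer_schema`, `fill_rate`), the two regular expressions, and the sample records are all assumptions made for illustration, and plain Python dicts stand in for documents read from a collection.

```python
import re
from collections import defaultdict

# Hypothetical patterns for illustration only; production rules would be
# stricter and would cover many more types (cadastral numbers, company IDs).
SEMANTIC_PATTERNS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "url": re.compile(r"https?://\S+"),
}


def classify_value(value):
    """Return a semantic label for a matching string, else the Python type name."""
    if isinstance(value, str):
        for label, pattern in SEMANTIC_PATTERNS.items():
            if pattern.fullmatch(value):
                return label
    return type(value).__name__


def infer_schema(documents):
    """Map each top-level field to the sorted list of types seen across documents."""
    schema = defaultdict(set)
    for doc in documents:
        for field, value in doc.items():
            schema[field].add(classify_value(value))
    return {field: sorted(types) for field, types in schema.items()}


def fill_rate(documents, field):
    """Share of documents in which the field is present and not None."""
    if not documents:
        return 0.0
    filled = sum(1 for doc in documents if doc.get(field) is not None)
    return filled / len(documents)


# Made-up sample records standing in for documents from an unmanaged source.
sample = [
    {"name": "Acme", "contact": "info@example.com", "site": "https://example.com"},
    {"name": "Beta", "contact": None, "employees": 12},
]
schema = infer_schema(sample)
contact_fill = fill_rate(sample, "contact")
```

The sketch only inspects top-level fields and treats the first matching pattern as the answer; nested documents, arrays, and ambiguous values would need extra handling in any real catalog.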

The talk focuses on the experiments that preceded the catalog, the technology stacks considered, the problems being solved, and the limitations.
