Talks SmartData 2021 conference

Vladimir Ozerov

Querify Labs

Alexey Goncharuk

Querify Labs

Day 2 / 20:00 / Track 2 / RU /

How to design a high-performance distributed SQL engine

Distributed SQL engines must process data across multiple servers. Using Apache Flink and Presto as examples, Vladimir will explain how distributed SQL engines are structured and which approaches they use to improve query performance.

Andrey Terekhov

Yandex

Day 3 / 20:00 / Track 2 / RU /

How data delivery works in Yandex and why we're no longer afraid to transfer JSONs

Almost any company that works with data finds it necessary to store and process it in different systems depending on the task. This creates demand for a service that can transfer data between those systems quickly and efficiently. To solve this problem, Yandex developed Data Transfer, a cross-system data replication service, and Andrey plans to talk about it.

Nikolay Golov

ManyChat

Day 2 / 18:30 / Track 2 / RU /

Design steps of building analytical data platform in the clouds

Imagine that a company needs to build a powerful analytical platform. ManyChat built such a platform, choosing the latest tools to maximize convenience and minimize the cost of ownership. Nikolay plans to describe the selection process at each step of building the platform, the possible risks, and the resulting experience.

Dmitry Bugaychenko

Sber

Day 4 / 17:00 / Track 3 / RU /

D-people workplace — Sber's experience

Dmitry will explain how to organize data access and data work for different specialists: engineers, analysts, and data scientists. He will also describe how the approaches to allocating computing resources and organizing access have evolved, how the tool set and modeling approaches have changed, and how the approaches to moving results into production have developed.

Tejas Chopra

Netflix

Day 2 / 17:00 / Track 1 / EN /

An experience report on strategies for working with Cloud Storage

This talk focuses on techniques employed in hybrid storage systems to reduce cloud footprint and improve efficiencies.

Sabir Akhadov

Databricks Inc

Day 1 / 18:30 / Track 1 / EN /

Delta Lake data layout optimization

In this talk, Sabir will walk you through the physical data layout optimizations available with Delta Lake. He will discuss the factors that make a query execute fast.

Aleksandr Volochnev

Datastax

Day 3 / 17:00 / Track 3 / RU /

Workshop. Building an efficient data model for high-performance applications with Apache Cassandra (part 1)

In this workshop, we will study the data modeling methodology and go step by step through the basic principles of creating an effective data model. We will look at typical cases and common mistakes, and learn the rules that help you get the most out of your DBMS and avoid common problems.

Aleksandr Volochnev

Datastax

Day 3 / 18:30 / Track 3 / RU /

Workshop. Building an efficient data model for high-performance applications with Apache Cassandra (part 2)

Ash Berlin-Taylor

Astronomer.io

Day 1 / 17:00 / Track 1 / EN /

Apache Airflow 2.3 and beyond: What comes next?

Find out what is coming down the pipe for Apache Airflow in version 2.3 and beyond.

Andy Pavlo

Carnegie Mellon University

Day 1 / 20:00 / Track 1 / EN /

Lessons learned from using machine learning to optimize database configurations

In this talk, Andy will discuss the challenges of using ML to optimize DBMS knobs and the solutions developed to address them. The presentation will be in the context of the OtterTune database tuning service. Andy will also highlight insights learned from real-world installations of OtterTune for MySQL, Postgres, and Oracle.

Jacek Laskowski

Day 3 / 17:00 / Track 1 / EN /

Apache Spark as an in-memory-only data processing engine?

Is it possible to set up Spark so that it never touches hard drives and hence is memory-fast? That's the question Jacek is going to answer during the talk. You'll learn a bit about the internals of Apache Spark, which parts are or could be memory-only, and what challenges that poses.

Artem Yudovin

Profitero

Day 2 / 20:00 / Track 3 / RU /

From one big ETL job to experimenting with data pipelines

Before making any change to a pipeline in a production environment, you need to assess its potential impact on the system. This talk shows that a pipeline is sometimes so complex and entangled in dependencies that it is almost impossible to predict the outcome without experimenting.

Artem Aliev

Huawei

Day 1 / 20:00 / Track 3 / RU /

Trino (Presto) DB: Zero copy lakehouse

We'll talk about Trino. You'll learn about working with data from primary sources, combining and enriching it, and subsecond queries. We'll also talk about hidden capabilities and new functionality available in the project and its forks.

Ton Badal

Synthesized

Day 2 / 18:30 / Track 1 / EN /

Optimizing test data coverage in functional testing

In this talk, Ton will discuss how to get faster and more secure access to data for testing purposes by generating private data that (a) emulates the state of a dataset or database and (b) increases testing coverage. Several open-source tools are available, but usually the devil is in the details.

Dmitry Ibragimov

Leroy Merlin

Day 3 / 20:00 / Track 3 / RU /

NiFi on a large scale: Architecture, monitoring, best practices

This talk will look at the NiFi ETL tool — its pros and cons, tools and methods for monitoring, and the development process for a large number of teams.

Viktor Kessler

Dremio

Day 3 / 20:00 / Track 1 / RU /

Dremio SQL Lakehouse: Fast data for all

In this talk, we'll discuss open Data Lake architecture and the Apache Parquet and Apache Arrow data formats. We'll also cover why we need the Apache Iceberg and Delta Lake table formats, and how the Nessie project helps build a SQL Lakehouse on a Data Lake.

Sergey Yarymov

MTS

Day 4 / 20:00 / Track 2 / RU /

How we build Feature Store

BigData MTS has grown and matured, but some of the problems it ran into while developing ML still remain. And, as it turned out, the team is not alone in fighting them.

Gianluigi Vigano

Vertica

Maurizio Felici

Vertica

Marco Gessner

Vertica

Day 2 / 20:00 / Track 1 / EN /

How to bring advanced analytics to hybrid data storage with Vertica

In this session, we will take a deep dive, with practical examples, into how to map external data with Vertica, which options Vertica offers for pushing queries down to external data repositories, and the technologies behind them. Differences between Vertica and some other solutions will also be explained.

Dmitry Anoshin

Microsoft

Day 4 / 18:30 / Track 3 / RU /

Two types of data engineers

The data engineer's role is critical. What skills should a data engineer have, and how well should they know code, algorithms, and data science? Dmitry has identified two types of data engineers and will describe them in this session.

Kirill Ovchinnikov

MTS

Day 2 / 17:00 / Track 3 / RU /

Data processing and verification for computer vision in MTS sales offices throughout Russia

In this talk, Kirill will explain how MTS launched a computer vision AI service on edge devices in 500 of the company's offices, which pitfalls the team faced, and how they managed to keep the entire fleet of devices up to date while processing and verifying data from all offices.

Nikolay Markov

Aligned Research Group

Maksim Statsenko

Yandex

Natalia Khapaeva

MTS

Nikolay Troshnev

Valdis Pukis

Evolution

Day 4 / 20:00 / Track 3 / RU /

Round table: What if not Hadoop

The participants in the discussion will raise tricky questions in the spirit of "how convenient is it to store raw data NOT in HDFS" and "is it possible to simply move everyone to a SQL engine", as well as "is it possible to call the daemon with the words Data Mesh, Delta Lake, Anchor" and "how to build a Kappa architecture in real life and what is it all about".

Nikolay Valiotti

Valiotti Analytics

Day 4 / 20:00 / Track 1 / RU /

Self-service BI: Data model building practices

Analysis of self-service BI application in terms of data model building.

Ekaterina Kolpakova

Citymobil

Day 3 / 18:30 / Track 2 / RU /

A tale about how we build DWH: From MySQL replicas to Exasol + ClickHouse

In this talk, Ekaterina will explain why Citymobil chose Exasol as the DBMS for its warehouse and Data Vault as the data model.

Vishnu Chanderraju

eyeota.com

Day 3 / 18:30 / Track 1 / EN /

Spark Yoga — saving time & money with lean data pipelines

The talk will focus exclusively on Apache Spark, with cost-based engineering as its theme.

Artem Shutak

Mail.ru Group

Day 3 / 17:00 / Track 2 / RU /

Insert into ClickHouse and not die

There are several ways to insert data into ClickHouse correctly, and even more ways to do it incorrectly. We'll talk about how to add data to ClickHouse, which pitfalls you can run into, and how to avoid them.

Ivan Begtin

Infoculture

Day 1 / 18:30 / Track 2 / RU /

Data catalog and data lake based on MongoDB: Building tech stack from scratch

Ivan will talk about building the DataCrafter data catalog on top of MongoDB, using large, heterogeneous public data in complex formats from unmanaged sources.

Pasha Finkelstein

JetBrains

Day 4 / 18:30 / Track 2 / RU /

Workshop. Making engineers' lives easier with Big Data Tools

During this workshop, we will discuss what a data engineer's life consists of and how we help them with Big Data Tools.

Itai Admi

Treeverse

Day 4 / 18:30 / Track 1 / EN /

Create a git-like experience for Data Lake analytics

Learn how lakeFS simplifies the management of a Data Lake by enabling git-like operations over files in object storage. See how common processes like experimentation, reproducing data and ensuring data quality are simplified with workflows centered around branching, committing, and the merging of data.

Evgeny Ermakov

Yandex Go

Nikolay Grebenshchikov

Yandex Go

Day 1 / 17:00 / Track 3 / RU /

Greenplum and Anchor modeling: How dreams shatter against reality

In this talk, Evgeny and Nikolay will explain how dreams of architectural beauty shatter against reality.

Evgeny Nikolaev

Avito

Day 1 / 17:00 / Track 2 / RU /

DWH as a product

The data warehouse appeared at Avito more than 7 years ago. Since then, the business has grown severalfold and the infrastructure has become more complex. Evgeny will explain how a product approach to platform development helps solve dozens of analytical problems every day without multiplying the size of the DWH team.

Roman Kondakov

Querify Labs

Day 2 / 17:00 / Track 2 / RU /

How to employ Apache Calcite for building a SQL layer for any system

A brief talk on how to implement a SQL layer for any storage using Apache Calcite.

Dmitry Zuev

Ozon

Day 1 / 20:00 / Track 2 / RU /

"Functional" Spark

In this talk, Dmitry will explain how to write functional-style Spark code in Scala at maximum speed.

Kirill Rybachuk

Cherry Labs

Day 4 / 17:00 / Track 2 / RU /

ML model lifecycle at Cherry Labs

How to build an ML pipeline in a computer vision startup.

Valerie Wiedemann

Exasol

Christian Langmayr

Exasol

Day 4 / 17:00 / Track 1 / EN /

How an analytical database stopped me smoking: A practical story with Exasol

You'll be introduced to Exasol, the world's fastest analytical database. You will discover how Exasol can simplify your life and make having a data warehouse fun again.

Mikhail Solodyagin

Tele2

Sergey Yunk

Tele2

Vadim Suhanov

Tele2

Day 2 / 18:30 / Track 3 / RU /

Airflow 2.x SaaS

Implementing Airflow SaaS in a K8s private cloud, and the experience of migrating from Airflow 1.x to Airflow 2.x SaaS.

Denis Efarov

Mail.ru Group

Day 1 / 18:30 / Track 3 / RU /

Hadoop 3: Erasure coding catastrophe

Erasure coding in Hadoop 3: a story about how the pursuit of smart savings can turn out to be (almost) a disaster, and how to avoid it. Based on real petabytes of data and a sea of tears.