Talks SmartData 2021 conference

Vladimir Ozerov Querify Labs
Alexey Goncharuk  Querify Labs
Day 2 / 20:00 / Track 2 / RU / Hardcore. A complex, low-level talk that requires the audience to know the technology.

How to design a high-performance distributed SQL engine

Distributed SQL engines must process data across multiple servers. In this talk, Vladimir will use Apache Flink and Presto as examples to explain how distributed SQL engines are designed and which approaches they use to increase query performance.

Andrey Terekhov Yandex
Day 3 / 20:00 / Track 2 / RU / For practicing engineers

How data delivery works in Yandex and why we're no longer afraid to transfer JSONs

Almost any company that works with data has to store and process it in different systems, depending on the task. This creates demand for a service that can move data between those systems quickly and efficiently. To solve this problem, Yandex developed Data Transfer, a cross-system data replication service, and Andrey will talk about it.

Nikolay Golov ManyChat
Day 2 / 18:30 / Track 2 / RU / For practicing engineers

Design steps of building analytical data platform in the clouds

Imagine that a company needs to build a powerful analytical platform. ManyChat built one, choosing the latest tools for maximum convenience and minimal cost of ownership. Nikolay will describe the selection process at each step of building the platform, the possible risks, and the resulting experience.

Dmitry Bugaychenko Sber
Day 4 / 17:00 / Track 3 / RU / Introduction to the technology

D-people workplace — Sber's experience

Dmitry will explain how to organize data access and day-to-day work with data for different specialists: engineers, analysts, and data scientists. He will also describe how the approaches to allocating computing resources and organizing access have evolved, how the tool set and modeling approaches have changed, and how the approaches to moving results into production have developed.

Tejas Chopra Netflix
Day 2 / 17:00 / Track 1 / EN / For practicing engineers

An experience report on strategies for working with Cloud Storage

This talk focuses on techniques employed in hybrid storage systems to reduce cloud footprint and improve efficiencies.

Sabir Akhadov Databricks Inc
Day 1 / 18:30 / Track 1 / EN / For practicing engineers

Delta Lake data layout optimization

In this talk, Sabir will walk you through the physical data layout optimizations available with Delta Lake. The talk will also discuss the factors that make a query execute fast.

Aleksandr Volochnev Datastax
Day 3 / 17:00 / Track 3 / RU / For practicing engineers

Workshop. Building an efficient data model for high-performance applications with Apache Cassandra (part 1)

In this workshop, we will study the Data Modeling Methodology and walk step by step through the basic principles of creating an effective data model. We will look at typical cases and common mistakes, and learn the rules that help you get the most out of your DBMS and avoid common problems.

Aleksandr Volochnev Datastax
Day 3 / 18:30 / Track 3 / RU / For practicing engineers

Workshop. Building an efficient data model for high-performance applications with Apache Cassandra (part 2)

In this workshop, we will study the Data Modeling Methodology and walk step by step through the basic principles of creating an effective data model. We will look at typical cases and common mistakes, and learn the rules that help you get the most out of your DBMS and avoid common problems.

Ash Berlin-Taylor Astronomer.io
Day 1 / 17:00 / Track 1 / EN / For practicing engineers

Apache Airflow 2.3 and beyond: What comes next?

Find out what is coming down the pipe for Apache Airflow in version 2.3 and beyond.

Andy Pavlo Carnegie Mellon University
Day 1 / 20:00 / Track 1 / EN / Introduction to the technology

Lessons learned from using machine learning to optimize database configurations

In this talk, Andy will discuss the challenges of using ML to optimize DBMS knobs and the solutions developed to address them. The presentation will be in the context of the OtterTune database tuning service. Andy will also highlight the insights learned from real-world installations of OtterTune for MySQL, Postgres, and Oracle.

Jacek Laskowski
Day 3 / 17:00 / Track 1 / EN / Introduction to the technology

Apache Spark as an in-memory-only data processing engine?

Is it possible to set up Spark so that it never touches hard drives and hence runs at memory speed? That's the question Jacek is going to answer during the talk. You'll learn a bit about the internals of Apache Spark, which parts are or could be memory-only, and what challenges that poses.

Artem Yudovin Profitero
Day 2 / 20:00 / Track 3 / RU / For practicing engineers

From one big ETL job to experimenting with data pipelines

Before making any change to a pipeline in a production environment, you need to assess its potential impact on the system. This talk shows that a pipeline is sometimes so complex and entangled in dependencies that it is almost impossible to predict the outcome without experimenting.

Artem Aliev Huawei
Day 1 / 20:00 / Track 3 / RU / Hardcore. A complex, low-level talk that requires the audience to know the technology.

Trino (Presto) DB: Zero copy lakehouse

We'll talk about Trino. You'll learn about working with data directly from primary sources, combining and enriching it, and subsecond queries. We'll also talk about hidden capabilities and new functionality available in the project and its forks.

Ton Badal Synthesized
Day 2 / 18:30 / Track 1 / EN / Introduction to the technology

Optimizing test data coverage in functional testing

In this talk, Ton will discuss how to get faster and more secure access to data for testing purposes by generating private data that (a) emulates the state of a dataset or database and (b) increases test coverage. Several open-source tools are available, but the devil is usually in the details.

Dmitry Ibragimov Leroy Merlin
Day 3 / 20:00 / Track 3 / RU / For practicing engineers

NiFi on a large scale: Architecture, monitoring, best practices

This talk will look at the NiFi ETL tool — its pros and cons, tools and methods for monitoring, and the development process for a large number of teams.

Viktor Kessler Dremio
Day 3 / 20:00 / Track 1 / RU / For practicing engineers

Dremio SQL Lakehouse: Fast data for all

In this talk, we'll discuss open Data Lake architecture and the Apache Parquet and Apache Arrow data formats. We'll also cover why we need the Apache Iceberg and Delta Lake table formats, and how the Nessie project helps build a SQL Lakehouse on a Data Lake.

Sergey Yarymov MTS
Day 4 / 20:00 / Track 2 / RU / For practicing engineers

How we build Feature Store

BigData MTS has grown and matured, but some of the problems it ran into while developing ML still remain. And, as it turned out, the team is not alone in fighting them.

Gianluigi Vigano Vertica
Maurizio Felici Vertica
Marco Gessner Vertica
Day 2 / 20:00 / Track 1 / EN / For practicing engineers

How to bring advanced analytics to hybrid data storage with Vertica

In this session, we will go deep, with practical examples, into how to map external data with Vertica, which options Vertica offers for pushing queries down to external data repositories, and the technologies behind them. Differences between Vertica and some other solutions will also be explained.

Dmitry Anoshin Microsoft
Day 4 / 18:30 / Track 3 / RU / Introduction to the technology

Two types of data engineers

The data engineer's role is critical. What skills should a data engineer have, and how well should they know code, algorithms, and data science? Dmitry has identified two types of data engineers and will describe them in this session.

Kirill Ovchinnikov MTS
Day 2 / 17:00 / Track 3 / RU / For practicing engineers

Data processing and verification for computer vision in MTS sales offices throughout Russia

In this talk, Kirill will explain how MTS launched a computer vision AI service on edge devices in 500 company offices, what pitfalls the team faced, and how they kept the entire fleet of devices up to date while processing and verifying data from all offices.

Nikolay Markov Aligned Research Group
Maksim Statsenko Yandex
Natalia Khapaeva MTS
Nikolay Troshnev
Valdis Pukis Evolution
Day 4 / 20:00 / Track 3 / RU / Introduction to the technology

Round table: What if not Hadoop

The participants will raise various tricky questions in the spirit of "how convenient is it to store raw data NOT in HDFS" and "is it possible to simply move everyone to a SQL engine", as well as "is it possible to summon a demon with the words Data Mesh, Delta Lake, Anchor" and "how to build a Kappa architecture in real life, and what is it all about".

Nikolay Valiotti Valiotti Analytics
Day 4 / 20:00 / Track 1 / RU / Introduction to the technology

Self-service BI: Data model building practices

An analysis of applying self-service BI from the standpoint of data model building.

Ekaterina Kolpakova Citymobil
Day 3 / 18:30 / Track 2 / RU / For practicing engineers

A tale about how we build DWH: From MySQL replicas to Exasol + ClickHouse

In this talk, Ekaterina will explain why Citymobil chose Exasol as the DBMS for its warehouse and Data Vault as the data model.

Vishnu Chanderraju eyeota.com
Day 3 / 18:30 / Track 1 / EN / For practicing engineers

Spark Yoga — saving time & money with lean data pipelines

The talk will focus exclusively on Apache Spark, with cost-based engineering as its theme.

Artem Shutak Mail.ru Group
Day 3 / 17:00 / Track 2 / RU / For practicing engineers

Insert into ClickHouse and not die

There are several ways to insert data into ClickHouse correctly, and even more ways to do it incorrectly. We'll talk about how to add data to ClickHouse, what pitfalls you can run into, and how to avoid them.

Ivan Begtin Infoculture
Day 1 / 18:30 / Track 2 / RU / For practicing engineers

Data catalog and data lake based on MongoDB: Building tech stack from scratch

Ivan will talk about building DataCrafter, a MongoDB-based data catalog, on top of large, heterogeneous public data in complex formats from unmanaged sources.

Pasha Finkelstein JetBrains
Day 4 / 18:30 / Track 2 / RU / Introduction to the technology

Workshop. Making engineers' lives easier with Big Data Tools

During this workshop, we will discuss what a data engineer's life consists of and how we help them with Big Data Tools.

Itai Admi Treeverse
Day 4 / 18:30 / Track 1 / EN / For practicing engineers

Create a git-like experience for Data Lake analytics

Learn how lakeFS simplifies the management of a Data Lake by enabling git-like operations over files in object storage. See how common processes like experimentation, reproducing data and ensuring data quality are simplified with workflows centered around branching, committing, and the merging of data.

Evgeny Ermakov Yandex Go
Nikolay Grebenshchikov Yandex Go
Day 1 / 17:00 / Track 3 / RU / For practicing engineers

Greenplum and Anchor modeling: How dreams shatter against reality

In this talk, Evgeny and Nikolay will explain how dreams of architectural beauty shatter against reality.

Evgeny Nikolaev Avito
Day 1 / 17:00 / Track 2 / RU / Introduction to the technology

DWH as a product

The data warehouse appeared at Avito more than 7 years ago. Since then, the business has grown several times over, and the infrastructure has become more complex. Evgeny will explain how a product approach to platform development helps solve dozens of analytical problems every day without growing the DWH team many times over.

Roman Kondakov Querify Labs
Day 2 / 17:00 / Track 2 / RU / Hardcore. A complex, low-level talk that requires the audience to know the technology.

How to employ Apache Calcite for building a SQL layer for any system

A brief talk on how to implement a SQL layer for any storage using Apache Calcite.

Dmitry Zuev Ozon
Day 1 / 20:00 / Track 2 / RU / For practicing engineers

"Functional" Spark

In this talk, Dmitry will explain how to write functional-style Spark code in Scala at maximum speed.

Kirill Rybachuk Cherry Labs
Day 4 / 17:00 / Track 2 / RU / For practicing engineers

ML model lifecycle at Cherry Labs

How to build an ML pipeline in a computer vision startup.

Valerie Wiedemann EXASOL
Christian Langmayr Exasol
Day 4 / 17:00 / Track 1 / EN / Introduction to the technology

How an analytical database stopped me smoking: A practical story with Exasol

You'll be introduced to Exasol, the world's fastest analytical database. You will discover how Exasol can simplify your life and make having a data warehouse fun again.

Mikhail Solodyagin Tele2
Sergey Yunk Tele2
Vadim Suhanov Tele2
Day 2 / 18:30 / Track 3 / RU / For practicing engineers

Airflow 2.x SaaS

An Airflow SaaS implementation in a K8s private cloud and the experience of migrating from Airflow 1.x to Airflow 2.x SaaS.

Denis Efarov Mail.ru Group
Day 1 / 18:30 / Track 3 / RU / For practicing engineers

Hadoop 3: Erasure coding catastrophe

Erasure coding in Hadoop 3: a story about how the pursuit of smart savings can turn out to be (almost) a disaster, and how to avoid it. Based on real petabytes of data and a sea of tears.