Talks

A tale about how we build DWH: From MySQL replicas to Exasol + ClickHouse
Day 3
08:30 PM
RU
In this talk, Ekaterina explains why Citymobil chose Exasol as the DBMS for its warehouse and Data Vault as the data model.
- #architecture
- #datavault
- #dwh
- #etl
- Ekaterina Kolpakova
  Citymobil
Conversation on TV
Day 3
09:15 PM
RU
Television broadcast from the main studio of the SmartData conference.
- Video of the talk
Workshop. Building an efficient data model for high-performance applications with Apache Cassandra (part 2)
Day 3
08:30 PM
RU
In this workshop, we will study the data modeling methodology and go step by step through the basic principles of creating an effective data model. We will look at typical cases and common mistakes, and learn the rules that help you get the most out of your DBMS and avoid common problems.
- #datamodeling
- #storage
- #workshop
- Video of the talk
- Aleksandr Volochnev
  Datastax
Conversation on TV
Day 3
07:45 PM
RU
Television broadcast from the main studio of the SmartData conference.
- Video of the talk
Big Data Tools: Demo
Day 1
08:00 PM
RU
In this talk, we'll discuss ETL workflows inside the Big Data Tools plugin. With this plugin, you can conveniently work with Zeppelin notebooks, monitor Spark and Hadoop applications, and preview cloud file systems and HDFS files right from IntelliJ-based IDEs.
- #partner
- #techtalk
- Oleg Chirukhin
  JetBrains
Create a git-like experience for Data Lake analytics
Day 4
08:30 PM
EN
Learn how lakeFS simplifies the management of a Data Lake by enabling git-like operations over files in object storage. See how common processes like experimentation, reproducing data and ensuring data quality are simplified with workflows centered around branching, committing, and the merging of data.
- #datavirtualisation
- #tooling
- Itai Admi
  Treeverse
Opening
Day 2
06:50 PM
RU
We will go over the schedule and sessions and share useful information. Join the broadcast to find out what's on the air soon!
- Video of the talk
- Pasha Finkelstein
  JetBrains
- Sergey Boytsov
  JetBrains
Data processing and verification for computer vision in MTS sales offices throughout Russia
Day 2
07:00 PM
RU
In this talk, Kirill will explain how MTS launched a computer vision AI service on edge devices in 500 company offices, what pitfalls the team faced, and how they kept the entire fleet of devices up to date while processing and verifying data from all offices.
- #edge
- #internetofthings
- Kirill Ovchinnikov
  MTS
Apache Airflow 2.3 and beyond: What comes next?
Day 1
07:00 PM
EN
Find out what is coming down the pipe for Apache Airflow in version 2.3 and beyond.
- #airflow
- #tooling
- Ash Berlin-Taylor
  Astronomer.io
Insert into ClickHouse and not die
Day 3
07:00 PM
RU
There are several ways to insert data into ClickHouse correctly, and even more ways to do it incorrectly. We'll talk about how to add data to ClickHouse, what pitfalls you can run into, and how to avoid them.
- #dataingestion
- #optimization
- #storage
- Artem Shutak
  Odnoklassniki
Dremio SQL Lakehouse: Fast data for all
Day 3
10:00 PM
RU
In this talk, we'll discuss Data Lake open architecture and the Apache Parquet and Apache Arrow data formats. We'll also cover why we need the Apache Iceberg and Delta Lake table formats, and how the Nessie project helps build a SQL Lakehouse on a Data Lake.
- #datalake
- #lakehouse
- #queryengine
- #queryoptimization
- #tooling
- Viktor Kessler
  Dremio
Workshop: adding SQL to your app in 30 minutes
Day 3
09:30 PM
RU
Panel discussion is not recorded!

Apache Calcite is a framework that lets you add a SQL interface to your app. In this live coding session, we will teach an imaginary DBMS to execute SQL queries.
- #partner
- #smoking_room
- Vladimir Ozerov
  Querify Labs
Conversation on TV
Day 2
07:45 PM
RU
Television broadcast from the main studio of the SmartData conference.
- Video of the talk
D-people workplace — Sber's experience
Day 4
07:00 PM
RU
Dmitry will explain how to organize data access and data work for different specialists: engineers, analysts, and data scientists. He will also describe how approaches to allocating computing resources and organizing access have evolved, how the tool set and modeling approaches have changed, and how approaches to moving results into production have developed.
- #process
- Dmitry Bugaychenko
  Sber
Airflow 2.x SaaS
Day 2
08:30 PM
RU
Implementing Airflow SaaS in a K8s private cloud, and the experience of migrating from Airflow 1.x to Airflow 2.x SaaS.
- #airflow
- #cloud
- #k8s
Conversation on TV
Day 4
07:45 PM
RU
Television broadcast from the main studio of the SmartData conference.
- Video of the talk
An experience report on strategies for working with Cloud Storage
Day 2
07:00 PM
EN
This talk focuses on techniques employed in hybrid storage systems to reduce cloud footprint and improve efficiencies.
- #architecture
- #cloud
- #storageoptimization
- Video of the talk
- Tejas Chopra
  Netflix
DWH as a product
Day 1
07:00 PM
RU
The data warehouse appeared at Avito more than 7 years ago. Since then, the business has grown severalfold and the infrastructure has become more complex. Evgeny will explain how a product approach to platform development helps solve dozens of analytical problems every day without growing the DWH team severalfold.
- #dataasaproduct
- #process
- Evgeny Nikolaev
  Avito
Lessons learned from using machine learning to optimize database configurations
Day 1
10:00 PM
EN
In this talk, Andy will discuss the challenges of using ML to optimize DBMS knobs and the solutions his team developed to address them. The presentation will be in the context of the OtterTune database tuning service. Andy will also highlight insights learned from real-world installations of OtterTune for MySQL, Postgres, and Oracle.
- #databaseoptimization
- #datastorage
- #performance
- #tuning
- Video of the talk
- Andy Pavlo
  Carnegie Mellon University
Conversation on TV
Day 1
07:45 PM
RU
Television broadcast from the main studio of the SmartData conference.
- Video of the talk
How data delivery works in Yandex and why we're no longer afraid to transfer JSONs
Day 3
10:00 PM
RU
Almost any company that works with data needs to store and process it in different systems depending on the task. This creates demand for a service that can quickly and efficiently transfer data between those systems. To solve this problem, Yandex developed Data Transfer, a cross-system data replication service, and Andrey will talk about it.
- #architecture
- #dataingestion
- Andrey Terekhov
  Yandex
SmartData 2021 Closing
Day 4
11:00 PM
RU
We take stock, remember the bright moments and talk about our plans. Join the broadcast, so you don't miss anything!
- Video of the talk
- Kseniya Tomak
  Dodo Engineering
- Sergey Boytsov
  JetBrains
How to design a high-performance distributed SQL engine
Day 2
10:00 PM
RU
Distributed SQL engines must process data across multiple servers. Using Apache Flink and Presto as examples, Vladimir will explain how distributed SQL engines are built and what approaches they use to increase query performance.
- #queryengine
- #queryoptimization
- #tooling
- Vladimir Ozerov
  Querify Labs
- Alexey Goncharuk
  Querify Labs
Conversation on TV
Day 1
09:15 PM
RU
Television broadcast from the main studio of the SmartData conference.
- Video of the talk
Self-service BI: Data model building practices
Day 4
10:00 PM
RU
An analysis of self-service BI applications from the standpoint of data model building.
- #datamodeling
- Nikolay Valiotti
  Valiotti Analytics
How to bring advanced analytics to hybrid data storage with Vertica
Day 2
10:00 PM
EN
In this session, we will go deep, with practical examples, into how to map external data with Vertica, which options Vertica offers for pushing queries down to external data repositories, and the technologies behind them. Differences between Vertica and some other solutions will also be explained.
- #architecture
- #database
- #datavirtualization
- #process
- #queryengine
- #storage
From one big ETL job to experimenting with data pipelines
Day 2
10:00 PM
RU
Before making any changes to a pipeline in a production environment, you need to assess the potential impact on the system. You will find that sometimes a pipeline is so complex and so entangled in dependencies that it is almost impossible to predict the outcome without experimenting.
- #etl
- #pipeline
- #process
- Artem Yudovin
  Profitero
SmartData 2021 Opening
Day 1
06:45 PM
RU
We will go over the schedule and sessions and share useful information. Join the broadcast to find out what's on the air soon!
- Video of the talk
- Pasha Finkelstein
  JetBrains
- Igor Mosyagin
  Klarna
Discussion: Quasi-Mutable Data Storages
Day 4
07:00 PM
RU
Panel discussion is not recorded!

We will talk about Hudi, Delta Lake, Iceberg, and other storages. Quasi-mutable data storage formats are not only trending, but also mysterious. In this discussion, we will try to figure out what's on the market and where it is all going.
- #smoking_room
- Pasha Finkelstein
  JetBrains
- Oleg Chirukhin
  JetBrains
Delta Lake data layout optimization
Day 1
08:30 PM
EN
In this talk, Sabir will walk you through the physical data layout optimizations available with Delta Lake and discuss the factors that make a query execute fast.
- #storage
- #storageoptimization
- Sabir Akhadov
  Databricks Inc
Opening
Day 3
06:50 PM
RU
We will go over the schedule and sessions and share useful information. Join the broadcast to find out what's on the air soon!
- Video of the talk
- Kseniya Tomak
  Dodo Engineering
- Maksim Statsenko
  Yandex
Workshop. Building an efficient data model for high-performance applications with Apache Cassandra (part 1)
Day 3
07:00 PM
RU
In this workshop, we will study the data modeling methodology and go step by step through the basic principles of creating an effective data model. We will look at typical cases and common mistakes, and learn the rules that help you get the most out of your DBMS and avoid common problems.
- #datamodeling
- #storage
- #workshop
- Video of the talk
- Aleksandr Volochnev
  Datastax
NiFi on a large scale: Architecture, monitoring, best practices
Day 3
10:00 PM
RU
This talk will look at the NiFi ETL tool — its pros and cons, tools and methods for monitoring, and the development process for a large number of teams.
- #monitoring
- #ops
- Dmitry Ibragimov
  Leroy Merlin
Conversation on TV
Day 4
09:15 PM
RU
Television broadcast from the main studio of the SmartData conference.
- Video of the talk
Two types of data engineers
Day 4
08:30 PM
RU
The data engineer's role is critical. What skills should a data engineer have, and how well should they know code, algorithms, and data science? Dmitry has identified two types of data engineers and will describe them in this session.
- #process
- Dmitry Anoshin
  Microsoft
Apache Spark as an in-memory-only data processing engine?
Day 3
07:00 PM
EN
Is it possible to set up Spark so that it never touches hard drives and hence runs at memory speed? That's the question Jacek is going to answer during the talk. You'll learn a bit about the internals of Apache Spark, which parts are or could be memory-only, and what challenges that poses.
- #spark
- #tooling
- Video of the talk
- Jacek Laskowski
Trino (Presto) DB: Zero copy lakehouse
Day 1
10:00 PM
RU
We'll talk about Trino. You'll learn about working with data from primary sources, combining and enriching it, and subsecond queries. We'll also talk about hidden capabilities and new functionality, both in the project itself and in its forks.
- #datavirtualisation
- #queryengine
- #queryoptimization
- #tooling
- Artem Aliev
  Huawei
How we build Feature Store
Day 4
10:00 PM
RU
BigData MTS has grown and matured, but some of the problems it ran into while developing ML still remain. And, as it turned out, the team is not alone in fighting them.
- #featurestore
- Video of the talk
- Sergey Yarymov
  MTS
Interview with Pasha Finkelstein
Day 3
07:00 PM
RU
Panel discussion is not recorded!

Pasha has worked in many areas of IT: system administration, development, management, and data engineering. He now works on Big Data Tools at JetBrains.
- #smoking_room
- Pasha Finkelstein
  JetBrains
- Oleg Chirukhin
  JetBrains
Round table: What if not Hadoop
Day 4
10:00 PM
RU
The discussion participants will raise various tricky questions in the spirit of "how convenient is it to store raw data NOT in HDFS" and "is it possible to simply move everyone to a SQL engine", as well as "is it possible to summon a demon with the words Data Mesh, Delta Lake, Anchor" and "how to build a Kappa architecture in real life and what is it all about".
- #datalake
- #hadoop
- #storage
- Video of the talk
MLOps at Ozon
Day 3
08:00 PM
RU
In this talk, Dmitry will talk about the specifics of how DS teams work and their infrastructure at Ozon.
- #partner
- #techtalk
- Dmitry Gronsky
  Ozon
"Functional" Spark
Day 1
10:00 PM
RU
In this talk, Dmitry will explain how to write Spark code functionally in Scala at maximum speed.
- #developer
- #scala
- Dmitry Zuev
  Ozon
Workshop. Making engineers' lives easier with Big Data Tools
Day 4
08:30 PM
RU
During this talk, we will discuss what a data engineer's life consists of and how we help them with Big Data Tools.
- #bigdatatools
- #spark
- #tools
- #workshop
- Video of the talk
- Pasha Finkelstein
  JetBrains
Optimizing test data coverage in functional testing
Day 2
08:30 PM
EN
In this talk, Ton will discuss how to get faster and more secure access to data for testing purposes by generating private data that (a) emulates the state of a dataset/database and (b) increases testing coverage. There are several OSS tools available, but usually, the devil is in the detail.
- #dataops
- #datatesting
- Ton Badal
  Synthesized
Opening
Day 4
06:50 PM
RU
We will go over the schedule and sessions and share useful information. Join the broadcast to find out what's on the air soon!
- Video of the talk
- Kseniya Tomak
  Dodo Engineering
- Sergey Boytsov
  JetBrains
Hadoop 3: Erasure coding catastrophe
Day 1
08:30 PM
RU
Erasure coding in Hadoop 3: a story about how the pursuit of smart savings can turn out to be (almost) a disaster, and how to avoid it. Based on real petabytes of data and a sea of tears.
- #hadoop
- Denis Efarov
  Odnoklassniki
Projector: What It Is and How It Works
Day 2
08:00 PM
RU
Projector is a self-hosted technology that launches IntelliJ-based IDEs and Swing-based apps on a server, providing you with access to them from anywhere using browsers and native apps. Let's find out how it works and what's inside.
- #partner
- #techtalk
- Oleg Chirukhin
  JetBrains
Building cross-IDs for web analytics
Day 2
09:15 PM
RU
In this talk, Artur will discuss all aspects of building a remote user authentication system on the web, taking into account current technical and legal realities.
- #analytics
- #cross-device
- #legal
- Artur Hachuyan
  Tazeros
How to employ Apache Calcite for building a SQL layer for any system
Day 2
07:00 PM
RU
A brief talk on how to implement a SQL layer for any storage using Apache Calcite.
- #queryengine
- #queryoptimization
- Roman Kondakov
  Querify Labs
Data catalog and data lake based on MongoDB: Building tech stack from scratch
Day 1
08:30 PM
RU
Ivan will talk about creating the DataCrafter data catalog, built on MongoDB and drawing on large, heterogeneous public data in complex formats from unmanaged sources.
- Ivan Begtin
  Infoculture
How an analytical database stopped me smoking: A practical story with Exasol
Day 4
07:00 PM
EN
You'll be introduced to Exasol, the world's fastest analytical database. You will discover how Exasol can simplify your life and make having a data warehouse fun again.
- #database
- #storage
- Valerie Wiedemann
  Exasol
- Christian Langmayr
  Exasol
Greenplum and Anchor modeling: How dreams shatter against reality
Day 1
07:00 PM
RU
In this talk, Evgeny and Nikolay will explain how dreams of architectural beauty shatter against reality.
- #anchor
- #architecture
- #datavault
- #dwh
- Nikolay Grebenshchikov
  Yandex Go
- Evgeny Ermakov
  Yandex Go
How cloud changes database architecture and why we care about it
Day 1
09:30 PM
RU
Over the last ten years, cloud computing has made a gigantic leap and fundamentally changed the way we approach building systems. In this talk, we will discuss how the modern capabilities of cloud infrastructure change the core principles and architecture of a database. We will see how separating compute and storage improves the scalability and availability of the system while giving end users more predictable costs.
- #partner
- #techtalk
- Alexey Goncharuk
  Querify Labs
ML model lifecycle at Cherry Labs
Day 4
07:00 PM
RU
How to build an ML pipeline in a computer vision startup.
- Kirill Rybachuk
  Cherry Labs
Design steps of building analytical data platform in the clouds
Day 2
08:30 PM
RU
Imagine that a company needs to build a powerful analytical platform. ManyChat built such a platform, choosing the latest tools to maximize convenience and minimize the cost of ownership. Nikolay plans to describe the selection process at each step of building the platform, the possible risks, and the resulting experience.
- Nikolay Golov
  ManyChat
Spark Yoga — saving time & money with lean data pipelines
Day 3
08:30 PM
EN
The talk will focus exclusively on Apache Spark, with cost-based engineering as its theme.
- #costoptimization
- #spark
- Vishnu Chanderraju
  eyeota.com