Introduction:
In the world of big data analytics, crunching through raw datasets repeatedly can feel like running in circles—slow, resource-heavy, and far from efficient. Enter Materialized Views (MVs) and Refreshable Materialized Views (RMVs), the game-changers that transform raw data into instant insights. Think of them as your fast-track bridge between massive data lakes and lightning-fast analytics, delivering results in record time with minimal effort.
What is Materialized View?
Materialized views shift heavy computation from query time to insert time, delivering lightning-fast SELECT queries. In ClickHouse, they act like real-time triggers, running queries on incoming data and storing the results in a target table. These precomputed results—whether aggregated, filtered, or transformed—reduce data size and simplify queries, making analytics incredibly efficient. Think of materialized views as constantly updating indexes, keeping your data ready for quick insights as soon as it lands!
How ClickHouse store precomputed query result?
When you create a materialized view, ClickHouse computes the query result and stores it on disk. When queried, it retrieves the precomputed data directly from storage, avoiding recalculations.
Materialized views can be created using SQL queries on one or more tables. They support updates through operations like INSERT, UPDATE, and DELETE. When the source table changes, the materialized view automatically updates to keep the results consistent.
Creating Materialized View in ClickHouse:
The votes table stores voting records, including details like vote type, creation date, and user ID.
CREATE TABLE votes
(
`Id` UInt32,
`PostId` Int32,
`VoteTypeId` UInt8,
`CreationDate` DateTime64(3, 'UTC'),
`UserId` Int32,
`BountyAmount` UInt8
)
ENGINE = MergeTree
ORDER BY (VoteTypeId, CreationDate, PostId)To analyze the dataset, let's count the total number of records in the votes table:
SELECT count()
FROM votes
FINAL
┌─count()──┐
│ 238984011│
└──────────┘To optimize vote analysis, we aggregate daily upvotes and downvotes:
SELECT toStartOfDay(CreationDate) AS day,
countIf(VoteTypeId = 2) AS UpVotes,
countIf(VoteTypeId = 3) AS DownVotes
FROM votes
GROUP BY day
ORDER BY day ASC
LIMIT 10
┌─────────────────day─┬─UpVotes─┬─DownVotes─┐
│ 2008-07-31 00:00:00 │ 6 │ 0 │
│ 2008-08-01 00:00:00 │ 182 │ 50 │
│ 2008-08-02 00:00:00 │ 436 │ 107 │
│ 2008-08-03 00:00:00 │ 564 │ 100 │
│ 2008-08-04 00:00:00 │ 1306 │ 259 │
│ 2008-08-05 00:00:00 │ 1368 │ 269 │
│ 2008-08-06 00:00:00 │ 1701 │ 211 │
│ 2008-08-07 00:00:00 │ 1544 │ 211 │
│ 2008-08-08 00:00:00 │ 1241 │ 212 │
│ 2008-08-09 00:00:00 │ 576 │ 46 │
└─────────────────────┴─────────┴───────────┘
10 rows in set. Elapsed: 0.133 sec. Processed 238.98 million rows, 2.15 GB (1.79 billion rows/s., 16.14 GB/s.)
Peak memory usage: 363.22 MiB.While this query efficiently retrieves insights, executing it frequently on a large dataset is resource-intensive. To speed up analytics, we create a materialized view.
Creating a target table ,it will store precomputed daily vote counts.
CREATE TABLE up_down_votes_per_day
(
`Day` Date,
`UpVotes` UInt32,
`DownVotes` UInt32
)
ENGINE = SummingMergeTree
ORDER BY DayTo initialize the materialized view, we insert aggregated data:
INSERT INTO up_down_votes_per_day
SELECT CAST(toStartOfDay(CreationDate), 'Date') AS Day,
countIf(VoteTypeId = 2) AS UpVotes,
countIf(VoteTypeId = 3) AS DownVotes
FROM votes
GROUP BY DayInstead of scanning millions of rows, we can now query the precomputed table:
SELECT count() FROM up_down_votes_per_day
┌─count()─┐
1.│ 5723 │
└─────────┘SELECT
Day,
UpVotes,
DownVotes
FROM up_down_votes_per_day FINAL
ORDER BY Day ASC
LIMIT 10
┌────────Day─┬─UpVotes─┬─DownVotes─┐
1. │ 2008-07-31 │ 6 │ 0 │
2. │ 2008-08-01 │ 182 │ 50 │
3. │ 2008-08-02 │ 436 │ 107 │
4. │ 2008-08-03 │ 564 │ 100 │
5. │ 2008-08-04 │ 1306 │ 259 │
6. │ 2008-08-05 │ 1368 │ 269 │
7. │ 2008-08-06 │ 1701 │ 211 │
8. │ 2008-08-07 │ 1544 │ 211 │
9. │ 2008-08-08 │ 1241 │ 212 │
10. │ 2008-08-09 │ 576 │ 46 │
└────────────┴─────────┴───────────┘
10 rows in set. Elapsed: 0.006 sec. Processed 5.72 thousand rows, 57.23 KB (1.02 million rows/s., 10.19 MB/s.)
Peak memory usage: 65.70 KiB.
Performance Comparison:
Query | Rows Processed | Data Scanned | Execution Time | Memory Used | Storage Space |
| Direct Query on votes | 238.98M rows | 2.15 GB | 0.491 sec | 3.79 MiB | 1.7GB |
| Query on Materialized View (up_down_votes_per_day) | 5.72K rows | 57.23 KB | 0.006 sec | 65.70 KiB | 52.13KB |
Refreshable Materialized View:
Refreshable Materialized Views (RMVs) provide a more flexible approach to data updates by allowing periodic refreshes instead of real-time updates triggered by new inserts. This makes them particularly useful when data needs to be recalculated at scheduled intervals rather than continuously, ensuring efficient resource utilization while maintaining up-to-date analytics.
Creating Refreshable Materialized View:
Imagine you have a table called orders, storing e-commerce transactions. You want to maintain a daily summary of total sales per product but refresh it every hour instead of updating in real-time.
First, let's define the table that will store the raw transaction data.
CREATE TABLE orders
(
`order_id` UInt32,
`product_id` UInt32,
`order_date` DateTime,
`amount` Float32
)
ENGINE = MergeTree
ORDER BY order_dateThe target table will store precomputed aggregated data for better query performance.
CREATE TABLE daily_sales (
product_id UInt32,
order_date DateTime,
total_sales Float32,
) ENGINE = MergeTree()
ORDER BY (product_id, order_date);Now, we create a refreshable materialized view that updates every hour:
CREATE MATERIALIZED VIEW daily_sales_mv
REFRESH EVERY 1 HOUR
TO daily_sales AS
SELECT
product_id
toDate(order_date) AS order_date,
SUM(amount) AS total_sales,
COUNT(*) AS total_orders
FROM orders
GROUP BY product_id, order_date;To begin synchronization, we manually insert the aggregated data into daily_sales:
INSERT INTO daily_sales
SELECT
product_id,
toDate(order_date) AS order_date,
SUM(amount) AS total_sales,
COUNT(*) AS total_orders
FROM orders
GROUP BY product_id,order_dateAfter executing the above steps, you can now query daily_sales to see the precomputed results
SELECT * FROM orders
┌─order_id─┬─product_id─┬──────────order_date─┬─amount─┐
1. │ 1 │ 101 │ 2024-02-05 10:15:00 │ 25.5 │
2. │ 2 │ 102 │ 2024-02-05 10:30:00 │ 40 │
3. │ 3 │ 103 │ 2024-02-05 11:00:00 │ 15.75 │
4. │ 4 │ 101 │ 2024-02-05 11:15:00 │ 30.25 │
5. │ 5 │ 102 │ 2024-02-05 12:00:00 │ 50 │
6. │ 6 │ 103 │ 2024-02-05 12:30:00 │ 20 │
7. │ 7 │ 101 │ 2024-02-05 13:00:00 │ 27.5 │
8. │ 8 │ 102 │ 2024-02-05 13:15:00 │ 60 │
9. │ 9 │ 103 │ 2024-02-05 14:00:00 │ 18 │
10. │ 10 │ 101 │ 2024-02-05 14:30:00 │ 35.75 │
└──────────┴────────────┴─────────────────────┴────────┘
10 rows in set. Elapsed: 0.003 sec.Now you can see the pre-computed data in the target table
SELECT *
FROM daily_sales
┌─product_id─┬──────────order_date─┬─total_sales─┬─total_orders─┐
1. │ 101 │ 2024-02-05 00:00:00 │ 119 │ 4 │
2. │ 102 │ 2024-02-05 00:00:00 │ 150 │ 3 │
3. │ 103 │ 2024-02-05 00:00:00 │ 53.75 │ 3 │
└────────────┴─────────────────────┴─────────────┴──────────────┘
3 rows in set. Elapsed: 0.003 sec.As new transactions arrive in the orders table, they won’t be reflected in daily_sales until the next refresh.
If needed, you can trigger a manual refresh:
SYSTEM REFRESH VIEW daily_sales_mv;Once the refresh is complete, the latest aggregated sales data will be available in daily_sales.
SELECT *
FROM daily_sales
┌─product_id─┬──────────order_date─┬─total_sales─┬─total_orders─┐
1. │ 101 │ 2024-02-05 00:00:00 │ 151.5 │ 5 │
2. │ 102 │ 2024-02-05 00:00:00 │ 225 │ 4 │
3. │ 103 │ 2024-02-05 00:00:00 │ 76.55 │ 4 │
4. │ 104 │ 2024-02-05 00:00:00 │ 45 │ 1 │
5. │ 105 │ 2024-02-05 00:00:00 │ 55.25 │ 1 │
└────────────┴─────────────────────┴─────────────┴──────────────┘
5 rows in set. Elapsed: 0.003 sec.
Scenarios Where Each Type is More Beneficial:
Use Materialized View When:
Use Refreshable Materialized Views When:
Performance Benefits in Analytics:
How Materialized Views Reduce Query Time:
Reducing System Usage and Resource Usage:
Comparison: Materialized View vs Refreshable Materialized View:
Feature | Materialized View | Refreshable Materialized View |
| Update Method | Real-time on insert | Manual or scheduled refresh |
| Best for | High-frequency queries, real-time analytics | Periodic reports, complex joins |
| Performance | Faster for incremental aggregations | Reduces system load for large datasets |
| Use Case | Continuous monitoring dashboards | Batch analytics, periodic data recomputation |
Conclusion:
Choosing between Materialized Views and Refreshable Materialized Views depends on your specific analytics needs. If you require real-time updates and incremental processing, Materialized Views are the best option. However, if your queries involve complex joins, periodic recomputation, or require greater control over refresh timing, Refreshable Materialized Views provide the flexibility you need.
By leveraging these powerful features in ClickHouse, businesses can optimize query performance, reduce system load, and accelerate data analytics, ensuring fast, efficient, and scalable insights for modern data-driven applications.
Miru IT Park, Vallankumaranvillai,
Nagercoil, Tamilnadu - 629 002.
Unit 303, Vanguard Rise,
5th Main, Konena Agrahara,
Old Airport Road, Bangalore - 560 017.
Call: +91 6383016411
Email: sales@mafiree.com