Traditionally, one of the most powerful techniques used to accelerate query processing in data warehouses is the pre-computation of relevant summaries or materialized views.
The initial implementation introduced in Apache Hive 3.0.0 focuses on introducing materialized views and automatic query rewriting based on those materializations in the project. Materialized views can be stored natively in Hive or in other custom storage handlers (ORC), and they can seamlessly exploit exciting new Hive features such as LLAP acceleration. Then, the optimizer relies in Apache Calcite to automatically produce full and partial rewritings for a large set of query expressions comprising projections, filters, join, and aggregation operations.
In this document, we provide details about materialized view creation and management in Hive against the source parquet tables.
The syntax to create a materialized view in Hive is very similar to the CTAS statement syntax, supporting common features such as partition columns, custom storage handler, or passing table properties.
Standard Syntax:
CREATE MATERIALIZED VIEW [IF NOT EXISTS] [db_name.]materialized_view_name
[DISABLE REWRITE]
[COMMENT materialized_view_comment]
[PARTITIONED ON (col_name, ...)]
[CLUSTERED ON (col_name, ...) | DISTRIBUTED ON (col_name, ...) SORTED ON (col_name, ...)]
[
[ROW FORMAT row_format]
[STORED AS file_format]
| STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]
]
[LOCATION hdfs_path]
[TBLPROPERTIES (property_name=property_value, ...)]
AS
<query>;
Table description:
0: jdbc:hive2://zk0-hdilla.xi2kmm3bon0engqedn> desc formatted hive_parquet;
+-------------------------------+----------------------------------------------------+-----------------------------+
| col_name | data_type | comment |
+-------------------------------+----------------------------------------------------+-----------------------------+
| # col_name | data_type | comment |
| clientid | string | |
| querytime | string | |
| market | string | |
| deviceplatform | string | |
| devicemake | string | |
| devicemodel | string | |
| state | string | |
| country | string | |
| querydwelltime | double | |
| sessionid | bigint | |
| sessionpagevieworder | bigint | |
| | NULL | NULL |
| # Detailed Table Information | NULL | NULL |
| Database: | default | NULL |
| OwnerType: | USER | NULL |
| Owner: | anonymous | NULL |
| CreateTime: | Mon Jun 21 11:38:49 UTC 2021 | NULL |
| LastAccessTime: | UNKNOWN | NULL |
| Retention: | 0 | NULL |
| Location: | abfs://hdillap-2021-06-09t16-52-55-399z@hiverepl.dfs.core.windows.net/hive/warehouse/managed/hive_parquet | NULL |
| Table Type: | MANAGED_TABLE | NULL |
| Table Parameters: | NULL | NULL |
| | COLUMN_STATS_ACCURATE | {\"BASIC_STATS\":\"true\"} |
| | bucketing_version | 2 |
| | numFiles | 1 |
| | numRows | 59793 |
| | rawDataSize | 657723 |
| | totalSize | 1419783 |
| | transactional | true |
| | transactional_properties | insert_only |
| | transient_lastDdlTime | 1624275529 |
| | NULL | NULL |
| # Storage Information | NULL | NULL |
| SerDe Library: | org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe | NULL |
| InputFormat: | org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat | NULL |
| OutputFormat: | org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat | NULL |
| Compressed: | No | NULL |
| Num Buckets: | -1 | NULL |
| Bucket Columns: | [] | NULL |
| Sort Columns: | [] | NULL |
| Storage Desc Params: | NULL | NULL |
| | serialization.format | 1 |
+-------------------------------+----------------------------------------------------+-----------------------------+
CREATE MATERIALIZED VIEW hivemv1
PARTITIONED ON (country)
STORED AS ORC
AS
SELECT * FROM hive_parquet;
CREATE MATERIALIZED VIEW hivemv3
PARTITIONED ON (country)
STORED AS ORC
AS
SELECT * FROM hive_parquet order by deviceplatform;
Currently we support the following operations that aid at managing the materialized views in Hive:
-- Drops a materialized view
DROP MATERIALIZED VIEW [db_name.]materialized_view_name;
-- Shows materialized views (with optional filters)
SHOW MATERIALIZED VIEWS [IN database_name] ['identifier_with_wildcards’];
-- Shows information about a specific materialized view
DESCRIBE [EXTENDED | FORMATTED] [db_name.]materialized_view_name;
0: jdbc:hive2://zk0-hdilla.xi2kmm3bon0engqedn> show materialized views;
+------------+--------------------+-----------------+
| mv_name | rewrite_enabled | mode |
+------------+--------------------+-----------------+
| # MV Name | Rewriting Enabled | Mode |
| hivemv | Yes | Manual refresh |
| hivemv1 | Yes | Manual refresh |
| hivemv2 | Yes | Manual refresh |
| | NULL | NULL |
+------------+--------------------+-----------------+
Once a materialized view has been created, the optimizer will be able to exploit its definition semantics to automatically rewrite incoming queries using materialized views, and hence, accelerate query execution.
The rewriting algorithm can be enabled and disabled globally using the hive.materializedview.rewriting configuration property (default value is true) and at materialized view level as below:
ALTER MATERIALIZED VIEW [db_name.]materialized_view_name ENABLE|DISABLE REWRITE;
When data in the source tables used by a materialized view changes, the rebuild operation for a materialized view needs to be triggered by the user. In particular, the user should execute the following statement:
ALTER MATERIALIZED VIEW [db_name.]materialized_view_name REBUILD;
Hive supports incremental view maintenance, i.e., only refresh data that was affected by the changes in the original source tables. Incremental view maintenance will decrease the rebuild step execution time. In addition, it will preserve LLAP cache for existing data in the materialized view.
FYI, Hive will attempt to rebuild a materialized view incrementally, falling back to full rebuild if it is not possible. Current implementation only supports incremental rebuild when there were INSERT operations over the source tables, while UPDATE and DELETE operations will force a full rebuild of the materialized view.
To execute incremental maintenance, following conditions should be met:
A rebuild operation acquires an exclusive write lock over the materialized view, i.e., for a given materialized view, only one rebuild operation can be executed at a given time.
If the materialized view uses non-transactional tables and hence, we cannot verify whether its contents are outdated, however we still want to use the automatic rewriting. For such occasions, we can combine a rebuild operation run periodically, e.g., every 5 minutes, and define the required freshness of the materialized view data using the hive.materializedview.rewriting.time.window configuration parameter, for instance:
SET hive.materializedview.rewriting.time.window=10min;
The parameter value can be also overridden by a concrete materialized view just by setting it as a table property when the materialization is created.
Please note: By default, hive.materializedview.rewriting.time.window will be set to 0min which means auto rebuild is disabled. To enable at global level add the same with specific time interval under Ambari -> Hive config -> Custom hive-site. Also, the change will be applicable to the MVs created post this change.
Post the rewrite window, the update with MV could be validated by `desc formatted mv_name`.
....
Rewrite Enabled: | Yes | NULL
Outdated for Rewriting: | Yes | NULL
....
CREATE MATERIALIZED VIEW hivemv3
PARTITIONED ON (country)
STORED AS ORC
TBLPROPERTIES (hive.materializedview.rewriting.time.window"="10min")
AS
SELECT * FROM hive_parquet;
With distributed on or sort by on roadmap, to sort the data within materialized view – create the mv with order by clause with the select query.
CREATE MATERIALIZED VIEW hivemv14
PARTITIONED ON (country)
STORED AS ORC
AS
SELECT * FROM hive_parquet ORDER BY (devicemake);
The data stored in the ORC file is sorted and could be validated with command `/usr/bin/hive –orcfiledump -d <location_of_orc_file>`.
Refer: Attached sample ORC data file.
+----------------------------------+----------------------------------------------------+-----------------------------+
| col_name | data_type | comment |
+----------------------------------+----------------------------------------------------+-----------------------------+
| # col_name | data_type | comment |
| clientid | string | |
| querytime | string | |
| market | string | |
| deviceplatform | string | |
| devicemake | string | |
| devicemodel | string | |
| state | string | |
| querydwelltime | double | |
| sessionid | bigint | |
| sessionpagevieworder | bigint | |
| | NULL | NULL |
| # Partition Information | NULL | NULL |
| # col_name | data_type | comment |
| country | string | |
| | NULL | NULL |
| # Detailed Table Information | NULL | NULL |
| Database: | default | NULL |
| OwnerType: | USER | NULL |
| Owner: | anonymous | NULL |
| CreateTime: | Mon Jun 21 11:44:00 UTC 2021 | NULL |
| LastAccessTime: | UNKNOWN | NULL |
| Retention: | 0 | NULL |
| Location: | abfs://hdillap-2021-06-09t16-52-55-399z@hiverepl.dfs.core.windows.net/hive/warehouse/managed/hivemv1 | NULL |
| Table Type: | MATERIALIZED_VIEW | NULL |
| Table Parameters: | NULL | NULL |
| | COLUMN_STATS_ACCURATE | {\"BASIC_STATS\":\"true\"} |
| | bucketing_version | 2 |
| | numFiles | 88 |
| | numPartitions | 88 |
| | numRows | 59793 |
| | rawDataSize | 39334953 |
| | totalSize | 841901 |
| | transient_lastDdlTime | 1624275840 |
| | NULL | NULL |
| # Storage Information | NULL | NULL |
| SerDe Library: | org.apache.hadoop.hive.ql.io.orc.OrcSerde | NULL |
| InputFormat: | org.apache.hadoop.hive.ql.io.orc.OrcInputFormat | NULL |
| OutputFormat: | org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat | NULL |
| Compressed: | No | NULL |
| Num Buckets: | -1 | NULL |
| Bucket Columns: | [] | NULL |
| Sort Columns: | [] | NULL |
| | NULL | NULL |
| # Materialized View Information | NULL | NULL |
| Original Query: | select * from hive_parquet | NULL |
| Expanded Query: | SELECT `clientid`, `querytime`, `market`, `deviceplatform`, `devicemake`, `devicemodel`, `state`, `querydwelltime`, `sessionid`, `sessionpagevieworder`, `country` FROM (select `hive_parquet`.`clientid`, `hive_parquet`.`querytime`, `hive_parquet`.`market`, `hive_parquet`.`deviceplatform`, `hive_parquet`.`devicemake`, `hive_parquet`.`devicemodel`, `hive_parquet`.`state`, `hive_parquet`.`country`, `hive_parquet`.`querydwelltime`, `hive_parquet`.`sessionid`, `hive_parquet`.`sessionpagevieworder` from `default`.`hive_parquet`) `hivemv1` | NULL |
| Rewrite Enabled: | Yes | NULL |
| Outdated for Rewriting: | No | NULL |
+----------------------------------+----------------------------------------------------+-----------------------------+
0: jdbc:hive2://zk0-hdilla.xi2kmm3bon0engqedn> select * from hivemv1 limit 5;
INFO : Compiling command(queryId=hive_20210621163758_1a523251-bfd0-40ca-a9fd-ef463120b3d8): select * from hivemv1 limit 5
INFO : Semantic Analysis Completed (retrial = false)
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:hivemv1.clientid, type:string, comment:null), FieldSchema(name:hivemv1.querytime, type:string, comment:null), FieldSchema(name:hivemv1.market, type:string, comment:null), FieldSchema(name:hivemv1.deviceplatform, type:string, comment:null), FieldSchema(name:hivemv1.devicemake, type:string, comment:null), FieldSchema(name:hivemv1.devicemodel, type:string, comment:null), FieldSchema(name:hivemv1.state, type:string, comment:null), FieldSchema(name:hivemv1.querydwelltime, type:double, comment:null), FieldSchema(name:hivemv1.sessionid, type:bigint, comment:null), FieldSchema(name:hivemv1.sessionpagevieworder, type:bigint, comment:null), FieldSchema(name:hivemv1.country, type:string, comment:null)], properties:null)
INFO : Completed compiling command(queryId=hive_20210621163758_1a523251-bfd0-40ca-a9fd-ef463120b3d8); Time taken: 1.111 seconds
INFO : Executing command(queryId=hive_20210621163758_1a523251-bfd0-40ca-a9fd-ef463120b3d8): select * from hivemv1 limit 5
INFO : Completed executing command(queryId=hive_20210621163758_1a523251-bfd0-40ca-a9fd-ef463120b3d8); Time taken: 0.01 seconds
INFO : OK
+-------------------+--------------------+-----------------+-------------------------+---------------------+----------------------+----------------+-------------------------+--------------------+-------------------------------+----------------------+
| hivemv1.clientid | hivemv1.querytime | hivemv1.market | hivemv1.deviceplatform | hivemv1.devicemake | hivemv1.devicemodel | hivemv1.state | hivemv1.querydwelltime | hivemv1.sessionid | hivemv1.sessionpagevieworder | hivemv1.country |
+-------------------+--------------------+-----------------+-------------------------+---------------------+----------------------+----------------+-------------------------+--------------------+-------------------------------+----------------------+
| 11786 | 22:33:53 | en-US | Android | Motorola | Quench XT5 | Saint John | 7.0328606 | 0 | 1 | Antigua And Barbuda |
| 11786 | 22:30:29 | en-US | Android | Motorola | Quench XT5 | Saint John | 68.3177076 | 0 | 0 | Antigua And Barbuda |
| 11786 | 22:35:02 | en-US | Android | Motorola | Quench XT5 | Saint John | 8.1046491 | 0 | 2 | Antigua And Barbuda |
| 11786 | 22:35:25 | en-US | Android | Motorola | Quench XT5 | Saint John | 26.3155831 | 0 | 3 | Antigua And Barbuda |
| 11786 | 22:36:00 | en-US | Android | Motorola | Quench XT5 | Saint John | 3.8841237 | 0 | 4 | Antigua And Barbuda |
+-------------------+--------------------+-----------------+-------------------------+---------------------+----------------------+----------------+-------------------------+--------------------+-------------------------------+----------------------+
5 rows selected (1.851 seconds)
One of the key use cases of statistics is query optimization. Statistics serve as the input to the cost functions of the optimizer so that it can compare different plans and choose among them.
analyze table hivemv1 partition(country) compute statistics for columns;
+----------------------------------+----------------------------------------------------+----------------------------------------------------+
| col_name | data_type | comment |
+----------------------------------+----------------------------------------------------+----------------------------------------------------+
| # col_name | data_type | comment |
| clientid | string | |
| querytime | string | |
| market | string | |
| deviceplatform | string | |
| devicemake | string | |
| devicemodel | string | |
| state | string | |
| country | string | |
| querydwelltime | double | |
| sessionid | bigint | |
| sessionpagevieworder | bigint | |
| | NULL | NULL |
| # Detailed Table Information | NULL | NULL |
| Database: | default | NULL |
| OwnerType: | USER | NULL |
| Owner: | anonymous | NULL |
| CreateTime: | Fri Jun 18 05:48:03 UTC 2021 | NULL |
| LastAccessTime: | UNKNOWN | NULL |
| Retention: | 0 | NULL |
| Location: | abfs://hdillap-2021-06-09t16-52-55-399z@hiverepl.dfs.core.windows.net/hive/warehouse/managed/hivemv2 | NULL |
| Table Type: | MATERIALIZED_VIEW | NULL |
| Table Parameters: | NULL | NULL |
| | COLUMN_STATS_ACCURATE | {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"clientid\":\"true\",\"country\":\"true\",\"devicemake\":\"true\",\"devicemodel\":\"true\",\"deviceplatform\":\"true\",\"market\":\"true\",\"querydwelltime\":\"true\",\"querytime\":\"true\",\"sessionid\":\"true\",\"sessionpagevieworder\":\"true\",\"state\":\"true\"}} |
| | bucketing_version | 2 |
| | numFiles | 1 |
| | numRows | 59793 |
| | rawDataSize | 45057355 |
| | totalSize | 737187 |
| | transient_lastDdlTime | 1623995315 |
| | NULL | NULL |
| # Storage Information | NULL | NULL |
| SerDe Library: | org.apache.hadoop.hive.ql.io.orc.OrcSerde | NULL |
| InputFormat: | org.apache.hadoop.hive.ql.io.orc.OrcInputFormat | NULL |
| OutputFormat: | org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat | NULL |
| Compressed: | No | NULL |
| Num Buckets: | -1 | NULL |
| Bucket Columns: | [] | NULL |
| Sort Columns: | [] | NULL |
| | NULL | NULL |
| # Materialized View Information | NULL | NULL |
| Original Query: | select * from hivesampletable | NULL |
| Expanded Query: | select `hivesampletable`.`clientid`, `hivesampletable`.`querytime`, `hivesampletable`.`market`, `hivesampletable`.`deviceplatform`, `hivesampletable`.`devicemake`, `hivesampletable`.`devicemodel`, `hivesampletable`.`state`, `hivesampletable`.`country`, `hivesampletable`.`querydwelltime`, `hivesampletable`.`sessionid`, `hivesampletable`.`sessionpagevieworder` from `default`.`hivesampletable` | NULL |
| Rewrite Enabled: | Yes | NULL |
| Outdated for Rewriting: | No | NULL |
+----------------------------------+----------------------------------------------------+----------------------------------------------------+
https://cwiki.apache.org/confluence/display/Hive/Materialized+views
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.