# UserName & Password
sylgg0918
IamGeek0

# start Hadoop
/home/sylgg0918/bin/myhadoop.sh start

# start mysql
service mysqld start

# start hiveserver2
nohup hive --service hiveserver2 &

# start metastore
nohup hive --service metastore &

# connect hive
beeline
!connect jdbc:hive2://hadoop102:10000
sylgg0918
You can just press Enter without entering a password.

#########################################
# During the testing phase, if the execution is too slow, you can enable local mode.
# Enter the following command in the Hive command line (effective only for the current session).
set mapreduce.framework.name=local;

# Log location
# hiveserver2 log location:
/opt/module/hive-3.1.3/logs

1. Introduction to Hive
1.1 What is Hive
Hive was open-sourced by Facebook and is a Hadoop-based data warehouse tool. It can map structured data files to a table and provides SQL-like query functionality. So why does Hive exist? What problems was it created to solve? Let's quickly understand Hive through an example.
In Hadoop, we need to write Mapper, Reducer, and Driver classes and implement the corresponding logic, which is relatively cumbersome.
test table (a single column, id):
ArtistS
ArtistS
sss
sas
su lei
sda
su
If implemented using Hive SQL, it can be done in just one line.
select count(*) from test group by id;
1.2 The essence of Hive
The essence of Hive is that it is a Hadoop client used to convert HQL (Hive SQL) into MapReduce programs.
(1) The data of each Hive table is stored in HDFS.
(2) Hive performs data analysis using MapReduce as the underlying execution engine (and it can also be configured to use Spark or Tez).
(3) The execution program runs on YARN.
1.3 Hive architecture principles

Metastore: Provides an interface for accessing metadata. Metadata refers to the databases, tables, and fields created by users in Hive. The Metastore does not store actual data; it only provides an interface for accessing metadata.
By default, metadata is stored in the built-in Derby database, which only supports single-client access. In production environments, to support multi-user development, it is recommended to use MySQL to store the Metastore.
CLI: The command-line client; it can only be used locally, on the machine where Hive is installed.
Driver:
(1) Parser (SQLParser): Converts the SQL string into an Abstract Syntax Tree (AST).
(2) Semantic Analyzer: Further divides the AST into QueryBlocks.
(3) Logical Plan Generator: Generates the logical plan from the syntax tree.
(4) Logical Optimizer: Optimizes the logical plan.
(5) Physical Plan Generator: Generates the physical plan based on the optimized logical plan.
(6) Physical Optimizer: Optimizes the physical plan.
(7) Executor (Execution): Executes the plan, obtains the query result, and returns it to the client.
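The output of this compilation pipeline can be inspected with Hive's EXPLAIN statement, which prints the plan generated for a query. A minimal illustration (the emp table used here is created later, in section 6):
-- Show the plan Hive produces for a simple aggregation
explain
select deptno, count(*) from emp group by deptno;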

JDBC/ODBC: Used for remote access to Hive via JDBC or ODBC clients. These clients connect to the HiveServer2 service.
Difference between JDBC and ODBC:
(1) JDBC has better portability than ODBC:
Typically, after installing an ODBC driver, additional configuration is required before it can be used. Different configurations may not be compatible across different servers, so you need to configure it each time. In contrast, JDBC only requires selecting the appropriate JDBC database driver, with no extra configuration needed. During installation, the JDBC driver handles all necessary configuration automatically.
(2) Different programming languages:
JDBC is used in Java programming, while ODBC is generally used in C/C++ programming.
Example: Suppose a user executes a DDL (Data Definition Language) statement. Its execution sequence is as follows:
- The user creates a table using
CREATE TABLE.
- The corresponding table path is recorded in the Metastore.
- The table path recorded in the Metastore corresponds to a directory on HDFS, where the table's data will be stored.
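A hedged sketch of that sequence with a concrete (hypothetical) table name:
-- 1. The user issues the DDL statement
CREATE TABLE demo_tbl (id INT);
-- 2. The Metastore records the table's metadata, including its storage path,
--    by default ${hive.metastore.warehouse.dir}/demo_tbl
-- 3. That path corresponds to a directory on HDFS, where the table's data files will live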
2. Hive Installation
3. Hive Tips and Tricks
3.1 Common Hive Commands
hive -H can be used to view Hive-related commands.
Option | Description | Example |
-d, --define | Variable substitution to apply to Hive commands, e.g., -d A=B or --define A=B | ㅤ |
--database <database> | Specify a database to use | --database stu |
-e <quoted-query-string> | Non-interactive SQL usage: Normally, we interactively access Hive through bin/hive or HiveServer2. Now, we want to execute SQL directly using Linux commands. | hive -e "insert into stu values(1,'aa')" |
-f <filename> | If the SQL is too long, you can put it into a file and then execute it. This is also a form of non-interactive execution. | hive -f stu.sql |
-H,--help | Print help information | hive -H |
--hiveconf <property=value> | When starting a Hive task with special requirements and you do not want to use the configurations in the configuration file, you can use this parameter to add or override configurations from the file. However, note that it is effective only for this Hive session. | bin/hive -hiveconf mapreduce.job.reduces=10; |
--hivevar <key=value> | Variable substitution to apply to hive commands. e.g. --hivevar A=B | ㅤ |
-i <filename> | Initialization SQL file | ㅤ |
-S, --silent | Silent mode in interactive shell | ㅤ |
-v, --verbose | Can display the executed SQL statements | ㅤ |
3.2 Hive Configuration Files
The priority of the three configuration methods below increases in the following order: Configuration File < Command-line Parameters < Parameter Declaration. Note that some system-level parameters, such as log4j settings, must be set using the first two methods because they are read before the session is established.
- Configuration File Method
- Default configuration file: hive-default.xml (used for reference only; does not take effect)
- User-defined configuration file: hive-site.xml (the global configuration file; usually the one that is modified)
Tips: User-defined configurations override the default configurations. Additionally, Hive also reads Hadoop configurations since it is started as a Hadoop client. Hive configurations override Hadoop configurations. Configuration file settings are effective for all Hive processes started on the local machine.
- Command-line Parameter Method
bin/hive -hiveconf mapreduce.job.reduces=10;
Tips: Effective only for this Hive session.
- Parameter Declaration Method: If you have already entered the Hive interactive interface (started via
bin/hive) and want to modify configurations, you can use the following method.
hive(default)> set mapreduce.job.reduces=10;
Tips: Effective only for this Hive session.
3.3 Common Hive Configuration Properties
Hive client displays the current database and table headers
- Add the following two configurations in
hive-site.xml:
<!-- Display the current database and table headers on the client -->
<property>
    <name>hive.cli.print.header</name>
    <value>true</value>
    <description>Whether to print the names of the columns in query output.</description>
</property>
<property>
    <name>hive.cli.print.current.db</name>
    <value>true</value>
    <description>Whether to include the current database in the Hive prompt.</description>
</property>
After the configuration, the Hive client can display the current database and table headers during runtime.

Hive Runtime Log Path Configuration
- By default, Hive logs are stored at /tmp/sylgg0918/hive.log (under the current username).

- Change Hive log location to
/opt/module/hive/logs
# Rename conf/hive-log4j2.properties.template to hive-log4j2.properties
mv hive-log4j2.properties.template hive-log4j2.properties
# Open the configuration file
vim hive-log4j2.properties
# Modify the log storage location
property.hive.log.dir = /opt/module/hive-3.1.3/logs

Hive JVM Heap Memory Settings
In the new version of Hive, the default JVM heap memory at startup is 256M. If the heap size is too small, running complex SQL in local mode may often cause the error:
java.lang.OutOfMemoryError: Java heap space. Therefore, it is recommended to adjust the HADOOP_HEAPSIZE parameter in advance.
Q: What is Local Mode?
A: Local mode means that when the data volume is small, tasks are not submitted to YARN.
- If using the Hive client, "local" refers to the current process.
- If using HiveServer2, "local" refers to the HiveServer2 process.
- Rename conf/hive-env.sh.template to hive-env.sh
mv hive-env.sh.template hive-env.sh
- Modify the parameter export HADOOP_HEAPSIZE (line 40) in hive-env.sh to 2048
export HADOOP_HEAPSIZE=2048

- Restart Hive, otherwise the changes will not take effect
Disable Hadoop Virtual Memory Check
Disable virtual memory check in
yarn-site.xml (if it is already disabled, no need to configure)
- Stop Hadoop first
- Modify the configuration file
vim /opt/module/
# Add the following configuration
<!-- Disable virtual memory check -->
<property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
</property>
- Distribute the modified yarn-site.xml
xsync yarn-site.xml
- Restart YARN
4. DDL(Data Definition Language)
4.1 Database
4.1.1 Create Database
Syntax
CREATE DATABASE [IF NOT EXISTS] database_name
[COMMENT database_comment]
[LOCATION hdfs_path]    -- Specify the HDFS location for this database
[WITH DBPROPERTIES(property_name=property_value)]
Example
-- Create a database without specifying a location
-- If no location is specified, the default path is ${hive.metastore.warehouse.dir}/database_name.db on HDFS
CREATE DATABASE db_hive1;
-- Create a database with a specified location
CREATE DATABASE db_hive2 LOCATION '/db_hive2';
-- Create a database with DB properties (use DESC DATABASE EXTENDED database_name to view)
CREATE DATABASE db_hive3 WITH DBPROPERTIES('create_date'='2022-11-18');
4.1.2 Query Database
- View Databases
Syntax
SHOW DATABASES [LIKE 'identifier_with_wildcards']
Example
-- Show all databases
SHOW DATABASES;
-- Show all databases starting with 'db_hive'
SHOW DATABASES LIKE 'db_hive*';
-- Show databases starting with 'db_hive' and the 'default' database
SHOW DATABASES LIKE 'db_hive*|default';
- View Database Information
Syntax
DESCRIBE DATABASE [EXTENDED] db_name;
Example
-- View basic information
DESC DATABASE db_hive3;
-- View more detailed information about the database
DESC DATABASE EXTENDED db_hive3;
4.1.3 Update Database
Users can use the
ALTER DATABASE command to modify certain information about a database. The modifiable information includes DB properties, location, and owner user.
Note: Changing the database location does not affect the paths of existing tables; it only changes the default parent directory for newly created tables.
Syntax
- Modify properties
ALTER DATABASE database_name SET DBPROPERTIES(property_name=property_value);
- Modify location
ALTER DATABASE database_name SET LOCATION hdfs_path;
- Modify owner user
ALTER DATABASE database_name SET OWNER USER user_name;
Example
-- Modify dbproperties
ALTER DATABASE database_name SET DBPROPERTIES('create_date'='2022-11-01');
4.1.4 Delete Database
Syntax
DROP DATABASE [IF EXISTS] database_name [RESTRICT|CASCADE]
Tips:
- RESTRICT: Strict mode. If the database is not empty, the deletion will fail. This is the default mode.
- CASCADE: Cascade mode. If the database is not empty, all tables within the database will be deleted as well.
Example
-- Delete an empty database
DROP DATABASE db_hive2;
-- Delete a non-empty database
DROP DATABASE db_hive3 CASCADE;
Switching databases is the same as in MySQL; simply use
USE database_name.
4.2 Table
Data Types
Primitive Data Types
Type | Description | Definition |
tinyint | 1-byte signed integer | ㅤ |
smallint | 2 byte signed integer | ㅤ |
int | 4 byte signed integer | ㅤ |
bigint | 8 byte signed integer | ㅤ |
boolean | Boolean type, true or false | ㅤ |
float | Single-precision floating-point number | ㅤ |
double | Double-precision floating-point number | ㅤ |
decimal | Decimal precise numeric type (for example, DECIMAL(16,2) means the total number of digits including integer and fractional parts is 16, and the fractional part can have 2 digits). | decimal(16,2) |
varchar | Character sequence, maximum length must be specified. The valid range for the maximum length is [1, 65535]. | varchar(32) |
string | String, no maximum length needs to be specified. | ㅤ |
timestamp | Time Type | ㅤ |
binary | Binary Data | ㅤ |
Complex Data Types
Type | Description | Definition | How to get value |
array | An array is a collection of values of the same type. | array<string> | arr[0] |
map | A map is a collection of key-value pairs of the same type. | map<string, int> | map['key'] |
struct | A struct consists of multiple fields, each with its own name and data type. | struct<id:int, name:string> | struct.id |
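The access patterns in the last column can be tried directly on literal values, without creating a table (a minimal sketch; the values are arbitrary):
-- array: index starts at 0
select array('bingbing', 'lili')[0];                                -- bingbing
-- map: look up by key
select map('xiaohaihai', 18, 'xiaoyangyang', 16)['xiaohaihai'];     -- 18
-- struct: access a field through a column alias
select s.id from (select named_struct('id', 1, 'name', 'aa') as s) t;   -- 1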
Type Conversion
Hive's primitive data types support type conversion, which includes implicit conversion and explicit conversion.
- Implicit Conversion
Users do not need to perform any operations; Hive will automatically convert the data. The specific rules are as follows:

For example, if we execute the SQL statement:
SELECT '1' + 1;
What is the result type? The answer is DOUBLE.
Why? According to the conversion rules, a
STRING can be implicitly converted to DOUBLE, but not to INT or BIGINT, while an INT can be converted to DOUBLE. The smallest type to which both operands can be converted is therefore DOUBLE, so the result type is DOUBLE.
- Explicit Conversion
Syntax
cast(expr as <type>)
Example
-- Convert string to int (implicit conversion cannot achieve this, see the table above)
select cast('111' as int);
-- Convert string to int (cannot be achieved if the string is not a number, e.g., 'abc'; the result will be NULL)
select cast('abc' as int);
4.2.1 Create Table
Standard Table Creation Method
Syntax
/**
 * File: create_example_table.sql
 * Description: SQL file demonstrating the syntax for creating a large Hive/SQL table.
 * Note: Keywords in brackets [] are optional clauses.
 */
-- 1. Core Statement and Table Name
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
(
    -- 2. Column Definitions
    col_name_1 data_type [COMMENT col_comment],
    col_name_2 data_type [COMMENT col_comment],
    -- ... more columns
    col_name_n data_type
)
-- 3. Table-Level Comment (Optional)
[COMMENT table_comment]
-- 4. Partitioning Clause (Optional)
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
-- 5. Clustering and Sorting (Optional, for Bucket Optimization)
[CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
-- 6. Row Format (Optional, specifies record termination and field delimiters)
[ROW FORMAT row_format]
-- 7. Storage Format (Optional, specifies the underlying file type)
[STORED AS file_format]    -- Common formats: TEXTFILE, PARQUET, ORC, AVRO
-- 8. Data Location (Optional, specifies the HDFS path)
[LOCATION hdfs_path]
-- 9. Table Properties (Optional, key-value metadata)
[TBLPROPERTIES (property_name=property_value, ...)];
Partition columns cannot have the same name as existing table columns; otherwise, an error will occur.
Keyword Explanation
TEMPORARY | Temporary Table: This table is visible only in the current session. When the session ends, the table is deleted. Simply put, if you log in via a client, perform operations, and then exit the client, the table will no longer exist. |
EXTERNAL | External Table: In contrast to internal (managed) tables, a managed table means Hive fully controls the table, including metadata and data in HDFS. An external table means Hive only manages the metadata and does not fully control the data in HDFS. |
PARTITIONED BY | Create a Partitioned Table |
CLUSTERED BY… SORTED BY… INTO … BUCKETS | Create a Bucketed Table, where each bucket is sorted by the field(s) specified after SORTED BY. |
ROW FORMAT | Specify SERDE (Serializer and Deserializer): Hive uses a SerDe to serialize and deserialize each row of data. What is a SerDe? Read file flow: HDFS files → InputFileFormat → <key,value> → Deserializer → Row object. Write file flow: Row object → Serializer → <key,value> → OutputFileFormat → HDFS files. If no serializer or deserializer is provided, Hive will use the default SerDe. You can check it with SHOW CREATE TABLE.
Syntax 1:ROW FORMAT DELIMITED
[FIELDS TERMINATED BY char]
[COLLECTION ITEMS TERMINATED BY char]
[MAP KEYS TERMINATED BY char]
[LINES TERMINATED BY char]
[NULL DEFINED AS char]
DELIMITED: Indicates that each field in the file is separated by a specific delimiter; the default SerDe will be used to serialize and deserialize each row. FIELDS TERMINATED BY: column delimiter. COLLECTION ITEMS TERMINATED BY: delimiter between elements in a map, struct, or array. MAP KEYS TERMINATED BY: delimiter between keys and values in a map. LINES TERMINATED BY: line delimiter.
Syntax 2:ROW FORMAT SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]
serde_name must be the full class name (e.g., com.xxx.xxx.xxx). The SERDE keyword can specify other built-in SerDes or user-defined SerDes, such as JSON SerDe, which can be used to handle JSON strings. |
STORED AS | Example: STORED AS TEXTFILE |
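A hedged sketch that combines several of the clauses above (Syntax 1 delimiters for complex types plus bucketing); the table name and columns are made up for illustration:
create table person_info (
    name     string,
    friends  array<string>,
    children map<string, int>
)
clustered by (name) sorted by (name) into 4 buckets
row format delimited
    fields terminated by ','
    collection items terminated by '_'
    map keys terminated by ':'
    lines terminated by '\n'
stored as textfile;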
Create Table As Select (CTAS)
Syntax
CREATE [TEMPORARY] TABLE [IF NOT EXISTS] table_name [COMMENT table_comment] [ROW FORMAT row_format] [STORED AS file_format] [LOCATION hdfs_path] [TBLPROPERTIES (property_name=property_value, ...)] [AS select_statement]
This syntax allows users to create a table directly from the results of a SELECT query, with the table structure matching the structure of the query results.
Create Table Like
Syntax
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name [LIKE exist_table_name] [ROW FORMAT row_format] [STORED AS file_format] [LOCATION hdfs_path] [TBLPROPERTIES (property_name=property_value, ...)]
This syntax allows users to replicate the structure of an existing table. Unlike the CTAS (CREATE TABLE AS SELECT) syntax, the table created using this method does not contain any data.
Example
Internal Table
By default, tables created in Hive are managed (internal) tables.
create table if not exists student(
    id int,
    name string
)
row format delimited fields terminated by '\t'
location '/user/hive/warehouse/student';
Prepare test data
1001	student1
1002	student2
1003	student3
1004	student4
1005	student5
1006	student6
1007	student7
1008	student8
1009	student9
1010	student10
1011	student11
1012	student12
1013	student13
1014	student14
1015	student15
1016	student16
Upload data to
/user/hive/warehouse/student
hadoop fs -put student.txt /user/hive/warehouse/student
Delete the table, and you will find that the data in HDFS is gone as well.
drop table student;

External Table
create external table if not exists student(
    id int,
    name string
)
row format delimited fields terminated by '\t'
location '/user/hive/warehouse/student';
Prepare test data
1001	student1
1002	student2
1003	student3
1004	student4
1005	student5
1006	student6
1007	student7
1008	student8
1009	student9
1010	student10
1011	student11
1012	student12
1013	student13
1014	student14
1015	student15
1016	student16
Upload data to
/user/hive/warehouse/student
hadoop fs -put student.txt /user/hive/warehouse/student
Delete the table, and you will find that the data in HDFS is still there.
drop table student;

SERDE and Complex Data Types
If you have a JSON file in the following format that needs to be analyzed and processed by Hive, consider how to design the table.
{ "name": "dasongsong", "friends": [ "bingbing", "lili" ], "students": { "xiaohaihai": 18, "xiaoyangyang": 16 }, "address": { "street": "hui long guan", "city": "beijing", "postal_code": 10010 } }
When designing table fields, the table columns should correspond to the top-level fields in the JSON string. For JSON strings with nested structures, use complex data types (array, map, struct) for the corresponding columns, as in the table below.
create table teacher(
    name string,
    friends array<string>,
    students map<string, int>,
    address struct<city:string, street:string, postal_code:int>
)
row format serde 'org.apache.hadoop.hive.serde2.JsonSerDe'
location '/user/hive/warehouse/teacher';
Save the JSON above into teacher.txt as a single line (the JsonSerDe expects one JSON object per line, so do not keep the pretty-printed format shown above) and upload it to /user/hive/warehouse/teacher.
hadoop fs -put teacher.txt /user/hive/warehouse/teacher
Query the values within it
select friends[0], students['xiaohaihai'], address.city from teacher;

create table as select & create table like
create table teacher1 as select * from teacher;
It will include the data from the
teacher table.
create table teacher2 like teacher;
It won’t include the data from the
teacher table.
4.2.2 View Table
Show All Tables
Syntax
SHOW TABLES [IN database_name] LIKE ['identifier_with_wildcards'];
Tips: Explanation of
LIKE wildcard expressions:
- * represents any number of any characters
- | represents an OR relationship
Example
-- Show all tables in the 'default' database starting with 'stu'
show tables in default like 'stu*';
View Table Information
Syntax
DESCRIBE [EXTENDED | FORMATTED] [db_name.]table_name
Option | Description |
EXTENDED | Show detailed information |
FORMATTED | Format the detailed information for better readability |
Example
-- View basic information
DESC stu;
-- View more detailed information in a formatted way
DESC FORMATTED stu;
4.2.3 Update Table
Rename Table
Syntax
ALTER TABLE table_name RENAME TO new_table_name
Example
-- Rename the table 'stu' to 'stu1'
ALTER TABLE stu RENAME TO stu1;

Modify Column Information
Tips: This only modifies the table's metadata. If the table contains data in HDFS, the data on HDFS must be handled manually. For example, if the original columns are
a, b, c and you want to reorder them to c, a, b, the metadata will reflect the new order, but the actual HDFS data will not change. If you do not manually adjust the HDFS data, reading the table will return incorrect values (e.g., the c column will return what was originally in the a column).
Add Column
This statement allows users to add new columns, which will be appended at the end.
Syntax
ALTER TABLE table_name ADD COLUMNS (col_name data_type [COMMENT col_comment], ...)
Update Column
This statement allows users to modify a specific column's name, data type, comment, and position in the table.
Syntax
ALTER TABLE table_name CHANGE [COLUMN] col_old_name col_new_name column_type [COMMENT col_comment] [FIRST|AFTER column_name]
Replace Column
This statement also allows users to replace all existing columns in the table with new columns.
Syntax
ALTER TABLE table_name REPLACE COLUMNS (col_name data_type [COMMENT col_comment], ...)
Example
-- View table structure
DESC stu1;
-- Add a new column 'age' of type INT
ALTER TABLE stu1 ADD COLUMNS(age INT);
-- Add a new column 'gender'
ALTER TABLE stu1 ADD COLUMNS(gender STRING);
-- Change the type of 'gender' to INT and place it after 'id'
ALTER TABLE stu1 CHANGE COLUMN gender gender INT AFTER id;
-- Error while processing statement: FAILED: Execution Error, return code 1
-- from org.apache.hadoop.hive.ql.exec.DDLTask. Unable to alter table. The
-- following columns have types incompatible with the existing columns in
-- their respective positions:
-- If the above error occurs, execute the following:
SET hive.metastore.disallow.incompatible.col.type.changes=false;
-- View table structure
DESC stu1;
-- Update a column (rename 'age' to 'ages' and change type from INT to DOUBLE)
ALTER TABLE stu1 CHANGE COLUMN age ages DOUBLE;
-- Replace columns (replace existing columns with new ones)
ALTER TABLE stu1 REPLACE COLUMNS(id INT, name STRING);
4.2.4 Delete Table
Syntax
DROP TABLE [IF EXISTS] table_name;
Example
-- Delete the table 'stu'
DROP TABLE stu;
4.2.5 Truncate Table
Tips: TRUNCATE removes all rows from a managed table; it cannot be used to delete the data of an external table.
Syntax
TRUNCATE [TABLE] table_name
Example
-- Truncate the table (removes all rows from the managed table)
TRUNCATE TABLE student;
5. DML(Data Manipulation Language)
5.1 Load
The
LOAD statement can be used to import files into a Hive table.
Tips: The
LOCAL keyword is worth noting. If you are using the hive command to connect directly, LOCAL refers to the current server. If you are using the beeline command, which is the JDBC connection we commonly use, LOCAL refers to the path on the HiveServer2 machine.
Syntax
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)];
option | description |
local | Indicates loading data from the local system into a Hive table; otherwise, data is loaded from HDFS into the Hive table. |
overwrite | Indicates overwriting existing data in the table; otherwise, data is appended. |
partition | Indicates uploading to a specified partition; if the target is a partitioned table, the partition must be specified. |
Example
-- Create a table
CREATE TABLE student (
    id INT,
    name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
- Load local file to Hive
-- If the LOCAL keyword is used, it is equivalent to hadoop fs -put: the source file will not be removed.
load data local inpath '/opt/module/datas/student.txt' into table student;

-- Execute again. Since it uses INTO instead of OVERWRITE INTO, this is an append operation rather than an overwrite.
-- Therefore, the table should now contain two copies of the data.
LOAD DATA LOCAL INPATH '/opt/module/datas/student.txt' INTO TABLE student;

-- Since it uses OVERWRITE INTO, this is an overwrite operation. The original two copies of the data should now be reduced to one.
LOAD DATA LOCAL INPATH '/opt/module/datas/student.txt' OVERWRITE INTO TABLE student;
- Load HDFS file to Hive
# Upload file to HDFS
hadoop fs -put /home/sylgg0918/student.txt /user/sylgg0918
-- Load data from HDFS. After the import, check HDFS to see if the file still exists.
-- (Without the LOCAL keyword, this behaves like a move operation, and the source file will be removed.)
LOAD DATA INPATH '/user/sylgg0918/student.txt' INTO TABLE student;
5.2 Insert
Insert the query results into a table
Syntax
INSERT (INTO | OVERWRITE) TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement;
Example
-- Create a new table
CREATE TABLE student1 (
    id INT,
    name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
-- Insert data based on query results
INSERT OVERWRITE TABLE student1 SELECT id, name FROM student;
-- Append one more copy of the data
INSERT INTO TABLE student1 SELECT id, name FROM student;
Insert the given values into a table
Syntax
INSERT (INTO | OVERWRITE) TABLE tablename [PARTITION (partcol1[=val1], partcol2[=val2] ...)] VALUES values_row [, values_row ...]
Example
-- Insert fixed values into the table
INSERT INTO TABLE student1 VALUES (1, 'wangwu'), (2, 'zhaoliu');
-- Insert fixed values into a partitioned table (illustrative; assumes a table partitioned by the listed columns)
INSERT INTO TABLE student1 PARTITION(id, age) VALUES (1, 'wangwu'), (2, 'zhaoliu');
Write the query results to the target path
Tips: Make sure the path after
directory is completely correct. If there is any mistake, it will overwrite all files in the incorrect path you provided. The damage can be as severe as rm -rf.
Syntax
INSERT OVERWRITE [LOCAL] DIRECTORY directory [ROW FORMAT row_format] [STORED AS file_format] select_statement;
Example
-- Without the LOCAL keyword, the output would be written directly to an HDFS path
INSERT OVERWRITE LOCAL DIRECTORY '/home/sylgg0918/datas'
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.JsonSerDe'
SELECT id, name FROM student;

5.3 Export & Import
The EXPORT statement can export both the table data and its metadata to an HDFS path. The IMPORT statement can load the exported content back into Hive, restoring both the table data and the metadata. EXPORT and IMPORT can be used for data migration between two Hive instances.
Syntax
- Export
EXPORT TABLE tablename TO 'export_target_path'
- Import
IMPORT [EXTERNAL] TABLE new_or_original_tablename FROM 'source_path' [LOCATION 'import_target_path']
Example
-- Export
EXPORT TABLE default.student TO '/user/hive/warehouse/export/student';
-- Import
-- If an error occurs, check Error & Exception -> Hive
IMPORT TABLE student2 FROM '/user/hive/warehouse/export/student';

6. Query
Syntax
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference                                        -- From which table to query
[WHERE where_condition]                                     -- Filtering
[GROUP BY col_list]                                         -- Grouping
[HAVING col_list]                                           -- Filtering after grouping
[ORDER BY col_list]                                         -- Sorting
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT number]                                              -- Limit the number of output rows
Data Preparation
- Create
dept.txt under /opt/module/datas and fill it with the following content.
10	Administration	1700
20	Finance	1800
30	Teaching	1900
40	Sales	1700
- Create
emp.txt under the /opt/module/datas path and fill it with the following content.
7369	Zhang San	R&D	800.00	30
7499	Li Si	Finance	1600.00	20
7521	Wang Wu	Admin	1250.00	10
7566	Zhao Liu	Sales	2975.00	40
7654	Hou Qi	R&D	1250.00	30
7698	Ma Ba	R&D	2850.00	30
7782	Jin Jiu	\N	2450.0	30
7788	Yin Shi	Admin	3000.00	10
7839	Xiao Fang	Sales	5000.00	40
7844	Xiao Ming	Sales	1500.00	40
7876	Xiao Li	Admin	1100.00	10
7900	Xiao Yuan	Lecturer	950.00	30
7902	Xiao Hai	Admin	3000.00	10
7934	Xiao Hongming	Lecturer	1300.00	30
- Create the department table.
create table if not exists dept(
    deptno int,
    dname string,
    loc int
)
row format delimited fields terminated by '\t';
- Create the employee table.
create table if not exists emp(
    empno int,
    ename string,
    job string,
    sal double,
    deptno int
)
row format delimited fields terminated by '\t';
- Import data
load data local inpath '/opt/module/datas/dept.txt' into table dept;
load data local inpath '/opt/module/datas/emp.txt' into table emp;
6.1 Basic query
6.1.1 Query the entire table and specific columns.
Query the entire table.
select * from emp;
Select specific columns for query.
Tips:
- SQL is case-insensitive.
- SQL statements can be written in a single line or across multiple lines.
- Keywords cannot be abbreviated or split across lines.
- Each clause is generally written on a separate line.
- Use indentation to improve the readability of the statements.
select empno, ename from emp;
6.1.2 Column alias
select ename AS name, deptno dn from emp;
6.1.3 Limit
-- Return the first 5 rows
SELECT * FROM emp LIMIT 5;
-- Skip the first 2 rows and return the next 3 (i.e., rows 3 through 5)
SELECT * FROM emp LIMIT 2, 3;
6.1.4 Where
-- Query employees with a salary greater than 1000
SELECT * FROM emp WHERE sal > 1000;
6.1.5 Relational operators/functions
The following operators are mainly used in
WHERE and HAVING clauses.
Operator | Support data type | Description |
A=B | Basic data types | Returns true if A equals B; otherwise, returns false. |
A<=>B | Basic data types | NULL-safe equality: returns true if both A and B are NULL; returns false if only one of them is NULL; otherwise behaves like the = operator. |
A<>B, A!=B | Basic data types | Returns NULL if either A or B is NULL; returns true if A is not equal to B, otherwise returns false. |
A<B | Basic data types | Returns NULL if either A or B is NULL; returns true if A is less than B, otherwise returns false. |
A<=B | Basic data types | Returns NULL if either A or B is NULL; returns true if A is less than or equal to B, otherwise returns false. |
A>B | Basic data types | Returns NULL if either A or B is NULL; returns true if A is greater than B, otherwise returns false. |
A>=B | Basic data types | Returns NULL if either A or B is NULL; returns true if A is greater than or equal to B, otherwise returns false. |
A [not] between B and C | Basic data types | If any of A, B, or C is NULL, the result is NULL.
If A is greater than or equal to B and less than or equal to C, the result is true; otherwise, it is false.
Using the NOT keyword produces the opposite result. |
A is null | All data types | Returns true if A is NULL; otherwise, returns false. |
A is not null | All data types | Returns true if A is not NULL; otherwise, returns false. |
in(value1, value2, ...) | All data types | Use the IN operator to check if a value exists in a list. |
A [not] like B | string | B is a simple SQL regular expression, also called a wildcard pattern.
If A matches B, it returns true; otherwise, it returns false. The pattern rules for B are as follows: 'x%' means A must start with the letter x; '%x' means A must end with the letter x; '%x%' means A contains the letter x, which can be at the beginning, end, or anywhere within the string. Using the NOT keyword produces the opposite result. |
A rlike B, A regexp B | string | B is a Java-based regular expression. If A matches B, it returns true; otherwise, it returns false. The matching uses the regular expression interface implemented in the JDK, so the rules follow Java regex conventions. Unlike LIKE, the pattern does not have to match the entire string A; matching any substring of A is enough. |
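A few hedged illustrations of the operators above, using the emp table prepared at the start of section 6 (results depend on that sample data):
-- between: closed interval [1000, 3000]
select * from emp where sal between 1000 and 3000;
-- is null: employees whose job field was loaded as NULL (\N in the data file)
select * from emp where job is null;
-- like: names beginning with 'Xiao'
select * from emp where ename like 'Xiao%';
-- rlike: the same condition written as a Java regular expression
select * from emp where ename rlike '^Xiao';
-- null-safe equality
select null <=> null;   -- true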
6.1.6 Logical operators/functions
Operator | Description |
and | Logical AND |
or | Logical OR |
not | Logical NOT |
Example
-- Query employees with salary greater than 1000 and department 30
SELECT * FROM emp WHERE sal > 1000 AND deptno = 30;
-- Query employees with salary greater than 1000 or department 30
SELECT * FROM emp WHERE sal > 1000 OR deptno = 30;
-- Query employees who are not in department 20 or 30
SELECT * FROM emp WHERE deptno NOT IN (20, 30);
6.1.7 Aggregate functions
Function Name | Description |
count(*) | Counts all rows, including NULL values. |
count([Column Name]) | Counts the number of non- NULL values in a column. |
max() | Returns the maximum value, excluding NULL (unless all values are NULL). |
min() | Returns the minimum value, excluding NULL (unless all values are NULL). |
sum() | Calculates the sum, excluding NULL. |
avg() | Calculates the average, excluding NULL. |
Example
- Get the total number of rows (in Hive,
COUNT(1) and COUNT(*) have exactly the same performance).
select count(*) cnt from emp;

- Get the maximum salary.
select max(sal) max_sal from emp;

- Get the minimum salary.
select min(sal) min_sal from emp;

- Get the total sum of salaries.
select sum(sal) sum_sal from emp;

- Get the average salary.
select avg(sal) avg_sal from emp;

6.2 Group & Having
6.2.1 Group By
The
GROUP BY statement is usually used with aggregate functions. It groups the results by one or more columns and then performs aggregate operations on each group.
Example
- Calculate the average salary for each department in the
emp table.
select t.deptno, avg(sal) avg_sal from emp t group by t.deptno;


- Calculate the highest salary for each job in each department in the
emp table.
select t.deptno, t.job, max(sal) max_sal from emp t group by t.deptno, t.job;


6.2.2 Having
Difference between
HAVING and WHERE:
- You cannot use aggregate functions after WHERE, but you can use them after HAVING.
- HAVING is used only in GROUP BY statements for grouped aggregation.
Example
- Get departments with an average salary greater than 2000.
select deptno from emp group by deptno having avg(sal) > 2000;
select deptno, avg(sal) avg_sal from emp group by deptno having avg_sal > 2000;

6.3 Join
6.3.1 Table alias
Example
select e.*, d.* from emp e join dept d on e.deptno = d.deptno;
6.3.2 inner join
Only rows that have matching data in both tables based on the join condition are retained. By default,
JOIN is an INNER JOIN.
select e.empno, e.ename, d.deptno from emp e join dept d on e.deptno = d.deptno;

6.3.3 left join
Returns all rows from the left table; matched rows from the right table show their values, and unmatched rows are filled with
NULL.
select e.empno, e.ename, d.deptno from emp e left join dept d on e.deptno = d.deptno;
6.3.4 right join
Returns all rows from the right table; matched rows from the left table show their values, and unmatched rows are filled with
NULL.
select e.empno, e.ename, d.deptno from emp e right join dept d on e.deptno = d.deptno;
6.3.5 full join
Returns all rows from both the left and right tables; unmatched rows are filled with
NULL.
select e.empno, e.ename, d.deptno from emp e full join dept d on e.deptno = d.deptno;
6.3.6 Multi-table join
To join
n tables, at least n-1 join conditions are required.
Data Preparation
- Create a file named
location.txt under /home/sylgg0918/datas/.
1700	Beijing
1800	Shanghai
1900	Shenzhen
- Create the table
create table if not exists location(
    loc int,
    loc_name string
)
row format delimited fields terminated by '\t';
- Import data
load data local inpath '/home/sylgg0918/datas/location.txt' into table location;
Example
select e.ename, d.dname, l.loc_name
from emp e
join dept d on d.deptno = e.deptno
join location l on d.loc = l.loc;
6.3.7 Cartesian product
Conditions for producing a Cartesian product:
- The join condition is omitted.
- The join condition is invalid.
- In these cases, every row from one table is joined with every row from the other table.
Example
select empno, dname from emp, dept;

6.3.8 union & union all
Tips:
- UNION can only combine SELECT statements.
- If the field names of the two SELECT statements are different, the result will use the field names from the first SELECT. (Even if the names differ, the union works as long as the column types match.)
- UNION and UNION ALL concatenate SQL results vertically, unlike JOIN, which combines tables horizontally.
- UNION removes duplicates.
- UNION ALL keeps all rows, including duplicates.
Requirements for using
UNION or UNION ALL:
- Both SQL results must have the same number of columns.
- The data types of corresponding columns must match.
Example
-- Concatenate and display employees from department 30 and department 40 using UNION
SELECT * FROM emp WHERE deptno = 30
UNION
SELECT * FROM emp WHERE deptno = 40;
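For comparison, a hedged UNION ALL example: because duplicates are kept, selecting the same department twice returns every matching row twice.
SELECT * FROM emp WHERE deptno = 30
UNION ALL
SELECT * FROM emp WHERE deptno = 30;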
6.4 Sort
6.4.1 order by (Global)
Global sorting uses only one reducer. Even if you set
mapreduce.job.reduces=3, the final number of reducers will still be one. This is because the purpose of ORDER BY is to achieve a globally sorted result.
In production, it's common to add a
LIMIT after ORDER BY to reduce the workload. If you truly need a fully globally sorted result, you must allocate more resources to the reducer.
Example
- Query employee information sorted by salary in ascending order.
select * from emp order by sal;

- Query employee information sorted by salary in descending order.
select * from emp order by sal desc;
- Sort employees based on twice their salary.
select ename, sal * 2 twosal from emp order by twosal;

- Query employee information sorted first by department number and then by salary:
select ename, deptno, sal from emp order by deptno, sal;

6.4.2 Sort By (Reduce End)
SORT BY is used to specify the sorting field from Map to Reduce, but it does not guarantee global ordering.
For large datasets,
ORDER BY is very inefficient. Often, global sorting is unnecessary, and in such cases, SORT BY can be used. SORT BY generates a sorted file for each reducer. The sorting is done within each reducer, so the overall global result is not fully sorted.
Example
-- Set the number of reducers
set mapreduce.job.reduces=3;
-- Check the configured number of reducers
set mapreduce.job.reduces;
-- View employee information sorted by department number in descending order
select * from emp sort by deptno desc;

-- You can verify that the data is partitioned and sorted by using the following commands.
-- Set 3 reducers (the number of reducers equals the number of final output files)
set mapreduce.job.reduces=3;
-- Output the results to a local directory
insert overwrite local directory '/home/sylgg0918/datas/sortby-result'
select * from emp sort by deptno desc;
Check the files in the target directory to see if they are sorted.
6.4.3 Distribute By
Tips:
- The partitioning rule of
distribute by is based on the hash code of the partition field modulo the number of reducers; rows with the same remainder are grouped into the same partition.
- Hive requires that the
distribute by clause be written before the sort by clause.
- After the demonstration, remember to reset the value of
mapreduce.job.reduces back to -1; otherwise, subsequent operations on partitioned or bucketed tables may cause errors during MapReduce jobs.
distribute by is used to specify the partition field; essentially, it distributes data to designated reducers according to a rule, after which sort by performs sorting.
In some cases, we need to control which specific rows go to which Reducer, typically for subsequent aggregation operations. The
distribute by clause accomplishes this. distribute by works similarly to partitioning in MapReduce (custom partitioning) and is often used in combination with sort by.
When testing
distribute by, be sure to allocate multiple reducers for processing; otherwise, the effect of distribute by may not be observable.
Example
-- Partition by department number first, then sort by employee salary within each partition
insert overwrite local directory '/home/sylgg0918/datas/distribute-result'
select * from emp distribute by deptno sort by sal desc;

6.4.4 Cluster By
When the
distribute by and sort by fields are the same, you can use cluster by. In addition to the functionality of distribute by, cluster by also includes the functionality of sort by. However, the sorting can only be in ascending order; you cannot specify the sorting rule as ASC or DESC.
Example
-- Partition and sort by deptno: the following two statements are exactly equivalent
select * from emp cluster by deptno;
select * from emp distribute by deptno sort by deptno;

7. Functions
-- View system built-in functions
show functions;
-- View usage of a built-in function
desc function upper;
-- View detailed information of a built-in function
desc function extended upper;
7.1 Single-row functions
7.1.1 Arithmetic operation functions
Operator | Description |
A+B | A plus B |
A-B | A minus B |
A*B | A multiplied by B |
A/B | A divided by B |
A%B | A modulo B |
A&B | A bitwise AND B |
A|B | A bitwise OR B |
A^B | A bitwise XOR B |
~A | Bitwise NOT of A |
select sal + 1 from emp;
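A few more hedged one-liners for the less common operators in the table:
select 7 % 3;   -- 1
select 4 & 6;   -- 4  (100 AND 110 = 100)
select 4 | 6;   -- 6  (100 OR  110 = 110)
select 4 ^ 6;   -- 2  (100 XOR 110 = 010)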
7.1.2 Numerical functions
- round
# Result: 3 select round(3.3);
- ceil
# Result: 4 select ceil(3.1);
- floor
# Result: 4 select floor(4.8);
7.1.3 String functions
substring
Syntax
Grammar | Return | Description |
substring(string A, int start) | string | Returns the substring of string A starting from position start to the end. Includes the character at position start, and note that string indices in Hive start from 1. |
substring(string A, int start, int len) | string | Returns a substring of string A starting from position start with a length of len. Includes the character at position start, and note that string indices in Hive start from 1. |
Example
-- Get all characters starting from the second character. Result: rtists
select substring("artists", 2);
-- Get all characters starting from the third-to-last character. Result: sts
select substring("artists", -3);
-- Starting from the third character, get the next 2 characters. Result: ti
select substring("artists", 3, 2);
replace
Syntax
Grammar | Return | Description |
replace(string A, string B, string C) | string | Replace substring B in string A with C. |
Example
# Replace 's' with 'S'. Result: artiStS select replace('artists', 's', 'S')
regexp_replace
Syntax
Grammar | Return | Description |
regexp_replace(string A, string regexp, string C) | string | Replace the part of string A that matches the Java regular expression B with C. Note: in some cases, escape characters may be required. |
Example
# Replace digits with 'num'. Result: num-num select regexp_replace('1000-200', '(\d+)', 'num')
regexp
Syntax
Grammar | Return | Description |
select 'strA' regexp 'strB' | boolean | If the string matches the regular expression, return true; otherwise, return false. |
Example
-- Result: true
select 'dfsaaaa' regexp 'dfsa+';
-- Result: false
select 'dfsaaaa' regexp 'dfsb+';
repeat
Syntax
Grammar | Return | Description |
repeat(string A, int n) | string | Repeat string A n times. |
Example
# Result: 123123123 select repeat('123', 3);
split
Syntax
Grammar | Return | Description |
split(string str, string pat) | array | Split the string str based on the content matched by the regular expression pat, and return the resulting substrings as an array. |
Example
# Result: ["a","b","c","d"] select split('a-b-c-d','-');
nvl
Syntax
Grammar | Return | Description |
nvl(A,B) | A or B | If the value of A is not null, return A; otherwise, return B. |
Example
# Result: 1 select nvl(null,1);
concat
Syntax
Grammar | Return | Description |
concat(string A, string B, string C, ……) | string | Concatenate characters A, B, C, ... into a single string. |
Example
# Result: beijing-shanghai-shenzhen select concat('beijing','-','shanghai','-','shenzhen');
concat_ws
Syntax
Grammar | Return | Description |
concat_ws(string A, string…| array(string)) | string | Concatenate multiple strings, or all elements of an array, using separator A. |
Example
-- Result: beijing-shanghai-shenzhen
select concat_ws('-','beijing','shanghai','shenzhen');
-- Result: aa-bb-cc
select concat_ws('-', array("aa","bb","cc"));
get_json_object
Syntax
Grammar | Return | Description |
get_json_object(string json_string, string path) | string | Parse the JSON string json_string and return the content specified by path. If the input JSON string is invalid, return NULL. |
Example
-- Result: artists
select get_json_object('[{"name":"artists","sex":"male","age":"25"},{"name":"small ben","sex":"male","age":"47"}]','$.[0].name');
-- Result: {"name":"artists","sex":"male","age":"25"}
select get_json_object('[{"name":"artists","sex":"male","age":"25"},{"name":"small ben","sex":"male","age":"47"}]','$.[0]');
7.1.4 Date functions
unix_timestamp: Returns the timestamp of the current or specified time.
Syntax
Grammar | Return | Description |
unix_timestamp() | bigint | Returns the number of seconds elapsed between the current or specified time and January 1, 1970, UTC. |
Example
-- Number of seconds in UTC, independent of the time zone of the hiveserver2 server
select unix_timestamp('2022/08/08 08-08-08','yyyy/MM/dd HH-mm-ss');
-- Returns the number of seconds elapsed between the current time and January 1, 1970
select unix_timestamp();
Result

from_unixtime: Convert a UNIX timestamp (seconds elapsed from 1970-01-01 00:00:00 UTC to the specified time) to the time format in the UTC+0 time zone.
Syntax
Grammar | Return | Description |
from_unixtime(bigint unixtime[,string format]) | string | Convert a UNIX timestamp (seconds elapsed from 1970-01-01 00:00:00 UTC to the specified time) to the time format in the UTC+0 time zone. |
Example
# Result:2022-08-08 08:08:08 select from_unixtime(1659946088);
from_utc_timestamp: Convert a UNIX timestamp (seconds elapsed from 1970-01-01 00:00:00 UTC to the specified time) to the time format in the specified time zone.
Syntax
Grammar | Return | Description |
from_utc_timestamp(bigint unixtime[,string timezone]) | string | Convert a UNIX timestamp (seconds elapsed from 1970-01-01 00:00:00 UTC to the specified time) to the time format in the specified time zone. |
Example
# Result: 2022-08-08 16:08:08.000000000 select from_utc_timestamp(cast(1659946088 as bigint)*1000, 'GMT+8');
current_date: Current date.
Syntax
Grammar | Return | Description |
current_date | ㅤ | Current date |
Example
# Result: 2023-03-12 select current_date;
current_timestamp: Current date and time, with precision down to milliseconds.
Syntax
Grammar | Return | Description |
current_timestamp | ㅤ | Current date + time (unlike unix_timestamp, this method takes into account the current time zone). |
Example
# Result: 2023-03-12 15:18:36.292000000 select current_timestamp;
month: Get the month from a date.
Syntax
Grammar | Return | Description |
month(string date) | int | Get the month from a date. |
Example
# Result: 8 select month('2022-08-08 08:08:08')
day: Get the day from a date.
Syntax
Grammar | Return | Description |
day(string date) | int | Get the day from a date. |
Example
# Result: 8 select day('2022-08-08 08:08:08')
hour: Get the hour from a date.
Syntax
Grammar | Return | Description |
hour(string date) | int | Get the hour from a date. |
Example
# Result: 8 select hour('2022-08-08 08:08:08');
datediff: The number of days between two dates (end date minus start date).
Syntax
Grammar | Return | Description |
datediff(string enddate, string startdate) | int | The number of days between two dates (end date minus start date). |
Example
# The difference in days between the earlier date and the later date. Result: -427 select datediff('2021-08-08','2022-10-09');
date_add: Date plus number of days.
Syntax
Grammar | Return | Description |
date_add(string startdate, int days) | string | Returns the date after adding days to the start date. |
Example
# What is the date 2 days after 2022-08-08? Result: 2022-08-10 select date_add('2022-08-08',2);
date_sub: Date minus number of days.
Syntax
Grammar | Return | Description |
date_sub(string startdate, int days) | string | Returns the date after subtracting days from the start date. |
Example
# What is the date 2 days before 2022-08-08? Result: 2022-08-06 select date_sub('2022-08-08',2);
date_format: Parse a standard date into a string of the specified format.
Syntax
Grammar | Return | Description |
date_format(string date, string formatPattern) | string | Format the date according to the formatPattern. |
Example
# Result: 2022年-08月-08日 select date_format('2022-08-08', 'yyyy年-MM月-dd日');
7.1.5 Flow control functions
case when: Conditional judgment functions.
Syntax
Grammar | Return | Description |
case when A then B [when C then D]* [else E] end | T | If A is true, then return B; if C is true, then return D; otherwise, return E. |
case A when B then C [when D then E]* [else F] end | T | If A equals B, then return C; if A equals D, then return E; otherwise, return F. |
Example
-- Syntax 1
select case when 1=2 then 'tom' when 2=2 then 'mary' else 'tim' end from tableName;
-- Syntax 2
select case 100 when 50 then 'tom' when 100 then 'mary' else 'tim' end from tableName;
if: Conditional judgment, similar to the ternary operator in Java.
Syntax
Grammar | Return | Description |
if (boolean testCondition, T valueTrue, T valueFalseOrNull) | T | When testCondition is true, return valueTrue; otherwise, return valueFalseOrNull. |
Example
-- Condition is satisfied, output 'Correct'
select if(10 > 5, 'Correct', 'False');
-- Condition is not satisfied, output 'False'
select if(10 < 5, 'Correct', 'False');
7.1.6 Collection functions
size: Number of elements in the collection.
# Result: 4 select size(array(1,2,3,4));
map: Create a map collection.
Syntax
Grammar | Return | Description |
map(key1, value1, key2, value2) | ㅤ | Construct a map type based on the input key-value pairs. |
Example
select map('xiaohai',1,'dahai',2);
map_keys: Return the keys from a map.
select map_keys(map('xiaohai',1,'dahai',2))
map_values: Return the values from a map (duplicate values are not removed).
select map_values(map('xiaohai',1,'dahai',2))
array: Declare an array collection.
Syntax
Grammar | Return | Description |
array(value1, value2) | ㅤ | Construct an array from the input values. |
Example
# Result: ["1","2","3","4"] select array('1','2','3','4');
array_contains: Determine whether an array contains a specific element.
# Result: true select array_contains(array('a','b','c','d'), 'a');
sort_array: Sort the elements in the array (cannot control the sorting rule, defaults to ascending order).
# Result: ["a","c","d"] select sort_array(array('a','d','c'));
struct: Declare each attribute in a struct.
Syntax
Grammar | Return | Description |
struct(value1, value2, value3,…) | ㅤ | Construct a struct from the input values. |
Example
# Result: {"col1":"name","col2":"age","col3":"weight"} select struct('name', 'age','weight');
named_struct: Declare the attributes and values of a struct.
select named_struct('name', 'xiaosong', 'age',19,'weight',80)
7.2 Advanced aggregate functions.
For common aggregate functions, please refer to section 6.1.7 Aggregate Functions.
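The two collect functions below are demonstrated on an employee table that is not created anywhere else in these notes; a minimal hypothetical version that makes the examples runnable could look like this:
-- Hypothetical table for the collect_list / collect_set examples
create table employee (
    name string,
    sex  string,
    job  string
);
insert into table employee values
    ('Zhang San', 'male',   'teacher'),
    ('Li Si',     'male',   'driver'),
    ('Wang Wu',   'female', 'teacher'),
    ('Zhao Liu',  'female', 'teacher');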
collect_list: Collect and form a list collection, without removing duplicates in the result.
select sex,collect_list(job) from employee group by sex;
collect_set: Collect and form a set collection, with duplicates removed from the result.
select sex,collect_set(job) from employee group by sex;
7.3 Explode function.
UDTF (User-defined Table Generating Function): User-defined table generating functions. One example is the explode function, which transforms a single row of data into multiple rows.
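Before the full example below, a minimal hedged illustration of explode on literal values:
-- One row in, three rows out
select explode(array('suspenseful', 'action', 'plot'));
-- explode on a map produces two columns (key and value)
select explode(map('xiaohai', 1, 'dahai', 2));
-- lateral view joins the exploded rows back to the original row
select t.id, c
from (select 1 as id, array('a', 'b') as arr) t
lateral view explode(arr) tmp as c;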

Example:
Requirement: Please count the number of movies in each category. Expected results are as follows.
plot | 1 |
action | 3 |
psychology | 1 |
suspenseful | 2 |
war | 1 |
disaster | 1 |
science fiction | 1 |
cops | 1 |
- Create table
create table movie_info(
    movie string,
    category string
)
row format delimited fields terminated by "\t";
- Insert
insert overwrite table movie_info values
("suspect tracking", "suspenseful,action,science fiction,plot"),
("lie to me", "suspenseful,cops,action,psychology,plot"),
("wolf warrior 2", "war,action,disaster");
- Implementation
select cate, count(*)
from (
    select movie, cate
    from (
        select movie, split(category, ',') cates
        from movie_info
    ) t1 lateral view explode(cates) tmp as cate
) t2
group by cate;

7.4 Window functions.
A window function can assign a window to each row of data, then perform calculations on the data within that window's range, and finally return the calculation result to the row of data. The syntax of window functions mainly consists of two parts: the "window" and the "function." The "window" defines the calculation scope, while the "function" defines the calculation logic.
Syntax:
select [column_name_1], [column_name_2], function([column_name]) over ([window specification]) [alias] from [table_name];
- Most aggregate functions can be used as the function part in window functions.
- The window scope is primarily divided into two types: one is row-based (e.g., requiring the window for each row to be from the previous row to the current row), and the other is value-based (e.g., requiring the window for each row to be from the current value minus 1 to the current value).

unbounded preceding | No lower bound; negative infinity (i.e., the window starts at the first row) |
[num] preceding | num rows before the current row |
current row | The current row |
[num] following | num rows after the current row |
unbounded following | No upper bound; positive infinity (i.e., the window ends at the last row) |

For value-based windows, the column after order by is the "value" referred to in the definition: [num] preceding then means the current value minus num, and [num] following means the current value plus num. Because this involves numeric addition and subtraction, the order by column must be an integer type such as int or smallint, not a decimal. If the frame is simply unbounded preceding to current row, this restriction does not apply.
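A hedged sketch of a value-based window on the same (assumed) order_info table, taking amount to be an integer type as the note above requires: for each order, sum this user's orders whose amount lies within 100 of the current amount.
select
    order_id,
    user_id,
    amount,
    sum(amount) over (
        partition by user_id
        order by amount
        range between 100 preceding and 100 following
    ) as nearby_amount_sum
from order_info;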

Example:
select
    order_id,
    order_date,
    amount,
    sum(amount) over (partition by user_id order by order_date rows between unbounded preceding and current row) total_amount
from order_info;

7.5 Cross-row value retrieval function.
lead & lag
Retrieve the value of a field from a row above or below the current row.
Lead & lag do not support custom window ranges, meaning
range between cannot be used with them.
Syntax:
select
    order_id,
    user_id,
    order_date,
    amount,
    -- lag(field_name, offset, default_value)
    lag(order_date, 1, '1970-01-01') over (partition by user_id order by order_date) last_date,
    -- lead(field_name, offset, default_value)
    lead(order_date, 1, '9999-12-31') over (partition by user_id order by order_date) next_date
from order_info;
first_value & last_value
Retrieve the first value or last value of a column within a window.
first_value & last_value allow custom window ranges, meaning
range between can be used with them.
Syntax:
select
    order_id,
    user_id,
    order_date,
    amount,
    -- first_value(field_name, ignore_nulls)
    first_value(order_date, false) over (partition by user_id order by order_date) as first_date,
    -- last_value(field_name, ignore_nulls)
    last_value(order_date, false) over (partition by user_id order by order_date) as last_date
from order_info;
7.6 Ranking functions.
Ranking functions do not support custom windows, meaning
range between cannot be used.
rank: Same values get the same rank, with gaps (e.g., two tied for first place results in ranks: 1, 1, 3).
dense_rank: Same values get the same rank, without gaps (e.g., two tied for first place results in ranks: 1, 1, 2).
row_number: Incremental row numbers.
Syntax:
select
    stu_id,
    course,
    score,
    rank() over (partition by course order by score desc) rk,
    dense_rank() over (partition by course order by score) dense_rk,
    row_number() over (partition by course order by score desc) rn
from score_info;

7.7 User-defined functions.
User-defined functions are categorized into the following three types:
- UDF (User-Defined Function): One input, one output
- UDAF (User-Defined Aggregation Function): Multiple inputs, one output (e.g., count/max/min)
- UDTF (User-Defined Table-generating Functions): One input, multiple outputs (e.g., lateral view explode)
Requirement:
Customize a UDF to calculate the length of a given primitive data type.
Example:
- Create a Maven project named Hive-UDF.
- Import dependencies.
<dependencies>
    <dependency>
        <groupId>org.apache.hive</groupId>
        <artifactId>hive-exec</artifactId>
        <version>3.1.3</version>
    </dependency>
</dependencies>
- Write the code.
package com.artists.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

/**
 * @Author: ArtistS
 * @Site: https://github.com/ArtistSu
 * @Create: 2023/5/21 17:45
 * @Description: Custom UDF to calculate the length of a string
 **/
public class MyLength extends GenericUDF {

    @Override
    public ObjectInspector initialize(ObjectInspector[] objectInspectors) throws UDFArgumentException {
        // Return int type inspector because our function calculates the length of a field
        if (objectInspectors.length != 1) {
            throw new UDFArgumentException("Only accept one argument");
        }
        ObjectInspector argument = objectInspectors[0];
        if (ObjectInspector.Category.PRIMITIVE != argument.getCategory()) {
            throw new UDFArgumentException("Only accept primitive argument");
        }
        PrimitiveObjectInspector primitiveObjectInspector = (PrimitiveObjectInspector) argument;
        if (primitiveObjectInspector.getPrimitiveCategory() != PrimitiveObjectInspector.PrimitiveCategory.STRING) {
            throw new UDFArgumentException("Only accept string argument");
        }
        return PrimitiveObjectInspectorFactory.javaIntObjectInspector;
    }

    @Override
    public Object evaluate(DeferredObject[] deferredObjects) throws HiveException {
        // evaluate() is called once per row of data
        DeferredObject deferredObject = deferredObjects[0];
        Object obj = deferredObject.get();
        if (obj == null) {
            return 0;
        }
        return obj.toString().length();
    }

    @Override
    public String getDisplayString(String[] strings) {
        // This method is called for display in execution plans (e.g., when using explain)
        return "my_length(" + (strings.length > 0 ? strings[0] : "") + ")";
    }
}
- Create a temporary function (only valid in the current session).
- Package it into a jar file and upload it to the server at /opt/module/hive-3.1.3/datas.
- Create a temporary function and associate it with the developed Java class.
- Test
0: jdbc:hive2://hadoop102:10000> add jar /opt/module/hive-3.1.3/datas/myudf.jar;
0: jdbc:hive2://hadoop102:10000> create temporary function my_len as "com.artists.hive.udf.MyLength";
0: jdbc:hive2://hadoop102:10000> show functions like "my_len";
0: jdbc:hive2://hadoop102:10000> select my_len("abc");
- Create a permanent function (exists forever).
- Upload the jar to HDFS.
- Create a permanent function.
- Test
Because ADD JAR itself is only temporarily effective, when creating a permanent function you need to specify a jar path (and, for metadata reasons, this path must be on HDFS).
[sylgg0918@hadoop102 datas]$ hdfs dfs -put /opt/module/hive-3.1.3/datas/myudf.jar /udf/
0: jdbc:hive2://hadoop102:10000> create function my_len as "com.artists.hive.udf.MyLength" using jar "hdfs://hadoop102:8020/udf/myudf.jar";
0: jdbc:hive2://hadoop102:10000> show functions like "*my_len*";
+-----------------+
|    tab_name     |
+-----------------+
| default.my_len  |
+-----------------+
0: jdbc:hive2://hadoop102:10000> select default.my_len("abc");
+------+
| _c0  |
+------+
| 3    |
+------+
8. Partitioned tables and bucketed tables.
8.1 Partitioned table
In Hive, partitioning refers to the process of dispersing the data of a large table into multiple directories based on business needs. Each directory becomes a partition of the table. During queries, the desired partitions are selected using expressions in the WHERE clause, which significantly improves query efficiency.
8.1.1 Syntax
Syntax:
create table dept_partition (
    deptno int,
    dname string,
    loc string
)
partitioned by (day string)
row format delimited fields terminated by '\t';
Write Data
load
- Data preparation
vim /opt/module/hive-3.1.3/datas/dept_20230521.log
# Enter the following content
10 Administration 1700
20 Finance 1800
- Load statement.
load data local inpath '/opt/module/hive-3.1.3/datas/dept_20230521.log' into table dept_partition partition(day='20230503');
- Test
select * from dept_partition;
insert
Requirement:
Insert data from the day='20230521' partition into the day='20230523' partition.
Example:
insert overwrite table dept_partition partition (day = '20230523') select deptno, dname, loc from dept_partition where day = '20230521';
Read Data
select deptno, dname, loc, day from dept_partition where day = '20230523';
View partitions.
show partitions [table_name];
# e.g.
show partitions dept_partition;
Add partition.
alter table dept_partition add partition([partition_field_name]=[partition_value]);
# e.g. Add a single partition
alter table dept_partition add partition(day='20230524');
# e.g. Add multiple partitions: no commas between the partition clauses
alter table dept_partition add partition(day='20230525') partition(day='20230526');
Delete partition.
Partitioned tables are also affected by whether they are managed (internal) tables or external tables. If it is an internal table, when a partition is deleted, both the HDFS data and the metadata will be removed. If it is an external table, only the metadata will be deleted.
alter table dept_partition drop partition([partition_field_name]=[partition_value]);
# e.g. Delete a single partition
alter table dept_partition drop partition (day='20230523');
# e.g. Delete multiple partitions, with commas between the partition clauses
alter table dept_partition drop partition (day='20230521'), partition(day='20230523');
Repair partition.
Hive stores all partition information of a partitioned table in its metadata. The partitioned table can only read and write data normally when the metadata is consistent with the partition paths on HDFS.
- If a user manually creates or deletes partition paths (i.e., operates on the HDFS directories directly rather than through Hive), Hive will not be aware of these changes, which leads to inconsistencies between Hive's metadata and the partition paths on HDFS.
- Similarly, if the partitioned table is an external table and the user executes the drop partition command, the partition metadata will be deleted, but the partition path on HDFS will not be removed. This also causes inconsistencies between Hive's metadata and the partition paths on HDFS.
If inconsistencies between metadata and HDFS paths occur, they can be resolved using the following methods:
add partition
If HDFS partition paths were created manually, Hive cannot recognize them; you can manually run the add partition command to add the missing partition metadata, thereby aligning the metadata with the partition paths.
drop partition
If HDFS partition paths were deleted manually while the partition metadata still exists, Hive will likewise be out of sync; you can run the drop partition command to delete the stale partition metadata, thereby aligning the metadata with the partition paths.
msck
If partition metadata is inconsistent with HDFS partition paths, you can also use the msck command to repair it. Below is an explanation of how to use this command.
Syntax:
# msck => metastore consistency check
msck repair table [table_name] [[add | drop | sync] partitions];
# e.g. Adds partition metadata for HDFS paths that exist but are missing from the metadata
msck repair table dept_partition add partitions;
# e.g. Deletes partition metadata for partitions whose HDFS paths have been removed
msck repair table dept_partition drop partitions;
# e.g. Synchronizes HDFS paths and partition metadata, equivalent to running both of the commands above
msck repair table dept_partition sync partitions;
# e.g. The following command is equivalent to msck repair table dept_partition add partitions
msck repair table dept_partition;
8.1.2 Secondary partitioned table
If the volume of log data within a single day is also substantial, how can the data be further split? The answer is a secondary partitioned table. For example, beyond daily partitioning, data within each day can be further partitioned by hour.
Syntax:
# Secondary partitioning means 2 partition fields; tertiary partitioning means 3 partition fields.
# Table creation statement
create table dept_partition2 (
    deptno int,
    dname string,
    loc string
)
partitioned by (day string, hour string)
row format delimited fields terminated by '\t';
# Data loading statement
load data local inpath '/opt/module/hive-3.1.3/datas/dept_20230521.log'
into table dept_partition2 partition (day = '20230523', hour = '12');
# Query partition data
select deptno, dname, loc, day from dept_partition2 where day = '20230523' and hour = '12';
8.1.3 Dynamic partitioning
Dynamic partitioning refers to when inserting data into a partitioned table, the target partition is not specified by the user but is dynamically determined by the value of the last field in each row of data. Using dynamic partitioning, data can be written to multiple partitions with just one INSERT statement.
Global dynamic partitioning function switch (default: true, enabled).
set hive.exec.dynamic.partition=true
Strict mode & Non-strict mode.
# The dynamic partition mode defaults to 'strict', which requires at least one partition field to be given a static (fixed) value.
# 'nonstrict' allows every partition field to be determined dynamically.
# For example, for a table partitioned by (id, age, name), strict mode requires at least one of the three partition fields to have a fixed value in the insert statement.
# For a table with a single partition field, strict mode cannot be satisfied when that field is dynamic, so the mode should be set to 'nonstrict'.
set hive.exec.dynamic.partition.mode=nonstrict;
The maximum number of partitions that can be created by a single INSERT statement is 1000 by default.
set hive.exec.max.dynamic.partitions=1000
The maximum number of partitions that a single Mapper or Reducer can create simultaneously is 100 by default.
set hive.exec.max.dynamic.partitions.pernode=100
The maximum number of files that a single INSERT statement can create is 100,000 by default.
set hive.exec.max.created.files=100000
Whether to throw an exception when the query result is empty and dynamic partitioning is performed; the default is false.
set hive.error.on.empty.partition=false
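Putting these parameters together, the following is a minimal sketch of a dynamic-partition insert into the dept_partition table defined earlier. It assumes a non-partitioned staging table dept_src(deptno, dname, loc, day); dept_src is hypothetical and used only for illustration.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- The last column of the select (day) determines the target partition of each row,
-- so one insert statement can write to many partitions at once.
insert into table dept_partition partition (day)
select deptno, dname, loc, day
from dept_src;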
8.2 Bucketed table
Partitioning provides a convenient way to isolate data and optimize queries. However, not all datasets can be reasonably partitioned. For a table or partition, Hive can further organize data into buckets, which is a more fine-grained division of data ranges. Partitioning deals with data storage paths, while bucketing deals with data files.
The basic principle of a bucketed table is as follows: First, calculate a hash value for the data in a specified field for each row. Then, take the modulus of this hash value with a specified number of buckets. Finally, rows with the same modulus result are written into the same file, which is called a bucket.
There is no inherent connection between partitioning and bucketing. A table can be both a partitioned table and a bucketed table.
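For example, a sketch of a table that is partitioned by day and bucketed by user_id; the table and column names here are illustrative, not part of the original examples.
create table order_part_buck (
    order_id int,
    user_id  int,
    amount   decimal(16, 2)
)
partitioned by (day string)
clustered by (user_id) into 4 buckets
row format delimited fields terminated by '\t';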
Syntax:
# Create table statement
create table stu_buck (
    id int,
    name string
)
clustered by (id) into 4 buckets
row format delimited fields terminated by '\t';
# Prepare data
vim /opt/module/hive-3.1.3/datas/student.txt
# Insert the following data
1001 student1
1002 student2
1003 student3
1004 student4
1005 student5
1006 student6
1007 student7
1008 student8
1009 student9
1010 student10
1011 student11
1012 student12
1013 student13
1014 student14
1015 student15
1016 student16
# Load data into the bucketed table
# Note: Newer versions of Hive can run MapReduce directly when loading data. Older versions of Hive require transferring the data to a table first and then loading it into the bucketed table via a query.
load data local inpath '/opt/module/hive-3.1.3/datas/student.txt' into table stu_buck;
8.3 Bucketed sorted table
# Table creation statement. The bucketing field and the sorting field can be different.
# Neither is limited to a single column; both can be multiple columns.
create table stu_buck_sort (
    id int,
    name string
)
clustered by (id) sorted by (id) into 4 buckets
row format delimited fields terminated by '\t';
# Prepare data
vim /opt/module/hive-3.1.3/datas/student.txt
# Insert the following data
1001 student1
1002 student2
1003 student3
1004 student4
1005 student5
1006 student6
1007 student7
1008 student8
1009 student9
1010 student10
1011 student11
1012 student12
1013 student13
1014 student14
1015 student15
1016 student16
# Load data into the bucketed sorted table
load data local inpath '/opt/module/hive-3.1.3/datas/student.txt' into table stu_buck_sort;
9. File formats and compression.
9.1 Overview of Hadoop Compression
Compression Format | Algorithm | File Extension | Is Splittable |
DEFLATE | DEFLATE | .deflate | N |
Gzip | DEFLATE | .gz | N |
bzip2 | bzip2 | .bz2 | Y |
LZO | LZO | .lzo | Y |
Snappy | Snappy | .snappy | N |
To support multiple compression/decompression algorithms, Hadoop introduced codecs, as shown below.
Hadoop checks for supported compression methods via hadoop checknative.
Hadoop sets compression on the driver side.
Compression Format | Corresponding codec |
DEFLATE | org.apache.hadoop.io.compress.DefaultCodec |
gzip | org.apache.hadoop.io.compress.GzipCodec |
bzip2 | org.apache.hadoop.io.compress.BZip2Codec |
LZO | com.hadoop.compression.lzo.LzopCodec |
Snappy | org.apache.hadoop.io.compress.SnappyCodec |
Compression performance comparison
Compression Algorithm | Original File Size | Compressed File Size | Compression Speed | Decompression Speed |
gzip | 8.3GB | 1.8GB | 17.5MB/s | 58MB/s |
bzip2 | 8.3GB | 1.1GB | 2.4MB/s | 9.5MB/s |
LZO | 8.3GB | 2.9GB | 49.3MB/s | 74.6MB/s |
On a single core of a Core i7 processor in 64-bit mode, Snappy compresses at about 250MB/s or more and decompresses at about 500MB/s or more.
9.2 Hive file formats
Choosing an appropriate file format for the data in Hive tables is highly beneficial for improving query performance. The storage format for Hive table data can be selected from options such as Text File, ORC, Parquet, Sequence File, etc.
9.2.1 Text File
Text File is the default file format used by Hive. Each line in a text file corresponds to a row in a Hive table. The file format can be specified as Text File through the following table creation statement. This format is row-based storage.
create table textfile_table (column_specs) stored as textfile;
9.2.2 ORC (Optimized Row Columnar)
The ORC (Optimized Row Columnar) file format is a columnar storage format introduced in Hive 0.11. ORC files can improve Hive's performance in reading, writing, and processing data. The opposite of columnar storage is row-based storage. The figure below compares the two (the left side is the logical table; the first layout on the right is row-based storage, and the second is columnar storage).

(1) Characteristics of row-based storage
When querying an entire row that meets certain conditions, columnar storage has to fetch the value of each field from a different column, whereas row-based storage only needs to locate one value, because the remaining values of that row are stored adjacently. For this kind of query, row-based storage therefore offers faster performance.
By way of SQL analogy, a full-row point lookup (selecting every field of a single row) is the case that favors row-based storage, whereas the query below reads only a few columns and has no WHERE clause, which is exactly the case where columnar storage significantly improves performance:
select id, name, age from stu_table;
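For contrast, a minimal sketch of the full-row lookup mentioned above (stu_table and the id value are illustrative):
-- Every field of a single row is needed; in a row-oriented file those fields sit next to each other on disk.
select * from stu_table where id = 1001;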
(2) Characteristics of columnar storage.
Since the data for each field is stored together, queries that require only a few fields can significantly reduce the amount of data read. The data type for each field is necessarily the same, allowing columnar storage to design more targeted and efficient compression algorithms.
As mentioned earlier, Text File and Sequence File are row-based storage formats, while ORC and Parquet are columnar storage formats.
The diagram below illustrates the logical and physical storage structure of ORC:

Component | Description |
Header | Content is ORC, used to denote the file type. |
Index Data | A lightweight index, created by default every 10,000 rows for each column. Each index entry records the position of that group of rows as well as the maximum and minimum values within those 10,000 rows. |
Row Data | Stores the actual data, organized by column. Each column is encoded and split into multiple streams for storage. |
Stripe Footer | Stores the positions of each stream and the encoding information for each column. |
Each stripe in the Body consists of Index Data, Row Data, and a Stripe Footer. The Tail is made up of the File Footer and the Postscript. The File Footer contains the actual position of each stripe, the index length, the data length, and per-column statistics, among other details. The Postscript records the compression type of the entire file and the length of the File Footer, etc.
When reading an ORC file, the process starts from the last byte to read the length of the Postscript, then reads the Postscript itself. From there, it parses the length of the File Footer, reads the File Footer, and finally extracts the information about each stripe. In other words, the file is read from the end backward.
Table creation statement:
create table orc_table (column_specs) stored as orc tblproperties (property_name=property_value, ...);
The ORC file format supports the following parameters:
Parameter | Default | Description |
orc.compress | ZLIB | Compression format, options: NONE, ZLIB, SNAPPY |
orc.compress.size | 262,144 | Size of each compression block (ORC files are compressed in blocks) |
orc.stripe.size | 67,108,864 | Size of each stripe (usually set to the same size as Hadoop's block, e.g., 128M) |
orc.row.index.stride | 10,000 | Index stride (interval at which an index is created, e.g., every n rows). |
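As a sketch of how these parameters are passed through tblproperties (the table and column names are illustrative; the stripe size and index stride simply restate the defaults, while the compression format is switched to Snappy):
create table orc_demo (
    id  int,
    msg string
)
stored as orc
tblproperties (
    "orc.compress" = "SNAPPY",
    "orc.stripe.size" = "67108864",
    "orc.row.index.stride" = "10000"
);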
9.2.3 Parquet
Parquet is a general-purpose file format in the Hadoop ecosystem, and it is also a columnar storage format.
The structure of a Parquet file is shown in the figure below:

The figure above shows the basic structure of a Parquet file. The file begins and ends with the file's Magic Code, which is used to verify whether it is a Parquet file.
Between the header and the trailer, the file consists of a number of Row Groups and one Footer (File Meta Data).
Each Row Group contains multiple Column Chunks, and each Column Chunk contains multiple Pages.
Component | Description |
Row Group | A row group corresponds to a number of rows in the logical table. |
Column Chunk | Each column within a row group is stored in a column chunk. |
Page | The data of a column chunk is divided into a number of pages. |
Footer (File Meta Data) | Stores the metadata of every Column Chunk in every Row Group, including the column's data type, its encoding, and the locations of its Data Pages. |
Table creation statement:
create table parquet_table (column_specs) stored as parquet tblproperties (property_name=property_value,...);
Parquet supports the following parameters:
Parameter | Default | Description |
parquet.compression | uncompressed | Compression format, options: uncompressed, snappy, gzip, lzo, brotli, lz4 |
parquet.block.size | 134217728 | Row group size, usually kept consistent with the HDFS block size |
parquet.page.size | 1048576 | Page size |
9.3 Compression
Maintaining data compression within Hive tables and during computation processes is highly beneficial for efficient disk space utilization and improving query performance.
9.3.1 Compression of Hive table data
In Hive, the way to enable data compression varies depending on the file type of the table.
(1) TextFile
If a table's file type is TextFile and the data in the table needs to be compressed, in most cases, no special declaration is required in the table creation statement. Simply import the compressed file directly into the table, and Hive will automatically recognize its compression format and decompress it when querying the data.
However, when executing the SQL statement to load data into the table, users need to set the following parameters to ensure that the data written to the table is compressed.
# Whether the final output of the SQL statement is compressed
set hive.exec.compress.output=true;
# Compression codec of the output (snappy in the example below)
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
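End to end, the write path might look like the following sketch, assuming a destination table dest_text stored as textfile and a source table src, both of which are hypothetical names used only for illustration:
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
-- The files written into dest_text will be Snappy-compressed text files;
-- later queries on dest_text decompress them transparently.
insert overwrite table dest_text
select * from src;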
(2) ORC
If the file type of a table is ORC and compression of the table data is required, the compression format must be declared in the table creation statement as follows:
create table orc_table (column_specs) stored as orc tblproperties ("orc.compress"="snappy");
(3) Parquet
If the file type of a table is Parquet and compression of the table data is required, the compression format must be declared in the table creation statement as follows:
create table parquet_table (column_specs) stored as parquet tblproperties ("parquet.compression"="snappy");
9.3.2 Using compression during computation.
(1) Compressing the intermediate results of a single MapReduce job
The intermediate results of a single MapReduce job refer to the data output by the Mapper. Compressing this data can reduce network I/O during the shuffle phase. This can be configured using the following parameters:
# Enable compression of MapReduce intermediate (map output) data
set mapreduce.map.output.compress=true;
# Set the compression codec for MapReduce intermediate data (snappy in the example below)
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
(2) Compressing intermediate results of a single SQL statement
The intermediate results of a single SQL statement refer to the temporary data between two MapReduce jobs (a SQL statement may require computation through MapReduce). This can be configured using the following parameters:
# Whether to compress the temporary data passed between two MapReduce jobs
set hive.exec.compress.intermediate=true;
# Compression codec (snappy in the example below)
set hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
10. Enterprise-level tuning.
10.1 Computing resource allocation.
10.1.1 Yarn Resource configuration
The YARN parameters that need to be tuned all relate to resources such as CPU and memory. The core configuration items are as follows:
Parameter | Property | Description |
yarn.nodemanager.resource.memory-mb | <property><name>yarn.nodemanager.resource.memory-mb</name><value>65536</value></property> | Memory that a NodeManager node makes available to Containers. The value depends on the total memory of the node where the NodeManager runs and on the other services running on that node (the previous configuration was 64G; typically 1/2 to 2/3 of the total memory is allocated). |
yarn.nodemanager.resource.cpu-vcores | <property><name>yarn.nodemanager.resource.cpu-vcores</name><value>16</value></property> | CPU cores that a NodeManager node makes available to Containers. The value likewise depends on the total CPU cores of the node and on the other services running on it (the previous configuration was 16 cores, roughly a 1:4 ratio with the parameter above, i.e., 1 core for every 4 GB of memory). |
yarn.scheduler.maximum-allocation-mb | <property><name>yarn.scheduler.maximum-allocation-mb</name><value>16384</value></property> | Maximum memory a single Container can use (the previous configuration was 16G). |
yarn.scheduler.minimum-allocation-mb | <property><name>yarn.scheduler.minimum-allocation-mb</name><value>512</value></property> | Minimum memory a single Container can use (the previous configuration was 512M). |
The parameters above need to be modified in $HADOOP_HOME/etc/hadoop/yarn-site.xml. After making the changes, remember to distribute the configuration to the other nodes and then restart YARN.
10.1.2 MapReduce Resource configuration
MapReduce resource configuration primarily includes the memory and CPU core allocation for Map Tasks, as well as the memory and CPU core allocation for Reduce Tasks.
Parameter | Description |
mapreduce.map.memory.mb | Memory size of the container requested by a single Map Task. Its default value is 1024 MB. This value must not exceed the range defined by yarn.scheduler.maximum-allocation-mb and yarn.scheduler.minimum-allocation-mb. This parameter needs to be configured separately for different computing tasks. In Hive, you can configure it per SQL statement as follows: set mapreduce.map.memory.mb=2048; |
mapreduce.map.cpu.vcores | CPU core count of the container requested by a single Map Task. Its default value is 1, and this value generally does not require adjustment. |
mapreduce.reduce.memory.mb | Memory size of the container requested by a single Reduce Task. Its default value is 1024 MB. This value also must not exceed the range defined by yarn.scheduler.maximum-allocation-mb and yarn.scheduler.minimum-allocation-mb. This parameter needs to be configured separately for different computing tasks. In Hive, you can configure it per SQL statement as follows: set mapreduce.reduce.memory.mb=2048; |
mapreduce.reduce.cpu.vcores | CPU core count of the container requested by a single Reduce Task. Its default value is 1, and this value generally does not require adjustment. |
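For instance, a per-query sketch that raises the task container memory before a heavy aggregation (the values are illustrative and must stay within the yarn.scheduler minimum/maximum range; order_detail is the table used in the EXPLAIN example below):
set mapreduce.map.memory.mb=2048;
set mapreduce.map.cpu.vcores=1;
set mapreduce.reduce.memory.mb=2048;
set mapreduce.reduce.cpu.vcores=1;
select user_id, count(*) from order_detail group by user_id;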
10.2 Explain - View execution plan.
10.2.1 Explain - Execution Plan Overview.
The execution plan presented by EXPLAIN consists of a series of Stages, which have dependencies on each other. Each Stage corresponds to a MapReduce Job, a file system operation, or another action.
If a Stage corresponds to a MapReduce Job, the computational logic of its Map side and Reduce side is described by the Map Operator Tree and the Reduce Operator Tree, respectively. An Operator Tree is composed of a series of Operators, where each Operator represents a single logical operation in the Map or Reduce phase, such as a TableScan Operator, Select Operator, or Join Operator.

10.2.2 Explain - Syntax
Syntax:
# FORMATTED: Outputs the execution plan as a JSON string.
# EXTENDED: Outputs additional information from the execution plan, typically including file names for reads and writes, etc.
# DEPENDENCY: Outputs the tables and partitions read by the execution plan.
EXPLAIN [FORMATTED | EXTENDED | DEPENDENCY] query-sql;
Example:
explain select user_id, count(*) from order_detail group by user_id;