# UserName & Password
sylgg0918
IamGeek0

# start Hadoop
/home/sylgg0918/bin/myhadoop.sh start

# start mysql
service mysqld start

# start hiveserver2
nohup hive --service hiveserver2 &

# start metastore
nohup hive --service metastore &

# connect hive
beeline
!connect jdbc:hive2://hadoop102:10000
sylgg0918
You can just press Enter without entering a password.

#########################################
# During the testing phase, if the execution is too slow, you can enable local mode.
# Enter the following command in the Hive command line (effective only for the current session).
set mapreduce.framework.name=local;

# Log location
# hiveserver2 log location:
/opt/module/hive-3.1.3/logs

1. Introduction to Hive
1.1 What is Hive
Hive was open-sourced by Facebook and is a Hadoop-based data warehouse tool. It can map structured data files to a table and provides SQL-like query functionality. So why does Hive exist? What problems was it created to solve? Let's quickly understand Hive through an example.
In Hadoop, we need to write Mapper, Reducer, and Driver classes and implement the corresponding logic, which is relatively cumbersome.
test table (a single column, id):
ArtistS
ArtistS
sss
sas
su lei
sda
su
If implemented using Hive SQL, it can be done in just one line.
select count(*) from test group by id;
1.2 The essence of Hive
The essence of Hive is that it is a Hadoop client used to convert HQL (Hive SQL) into MapReduce programs.
(1) The data of each Hive table is stored in HDFS.
(2) Hive performs data analysis using MapReduce as the underlying execution engine (and it can also be configured to use Spark or Tez).
(3) The execution program runs on YARN.
1.3 Hive architecture principles

Metastore: Provides an interface for accessing metadata. Metadata refers to the databases, tables, and fields created by users in Hive. The Metastore does not store actual data; it only provides an interface for accessing metadata.
By default, metadata is stored in the built-in Derby database, which only supports single-client access. In production environments, to support multi-user development, it is recommended to use MySQL to store the Metastore.
CLI: The command-line client; it can only be used locally, on the machine where Hive is installed.
Driver:
(1) Parser (SQLParser): Converts the SQL string into an Abstract Syntax Tree (AST).
(2) Semantic Analyzer: Further divides the AST into QueryBlocks.
(3) Logical Plan Generator: Generates the logical plan from the syntax tree.
(4) Logical Optimizer: Optimizes the logical plan.
(5) Physical Plan Generator: Generates the physical plan based on the optimized logical plan.
(6) Physical Optimizer: Optimizes the physical plan.
(7) Executor (Execution): Executes the plan, obtains the query result, and returns it to the client.
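The output of this compilation pipeline can be inspected with Hive's EXPLAIN statement, which prints the plan generated for a query. A minimal illustration (the emp table used here is created later, in section 6):
-- Show the plan Hive produces for a simple aggregation
explain
select deptno, count(*) from emp group by deptno;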

JDBC/ODBC: Used for remote access to Hive via JDBC or ODBC clients. These clients connect to the HiveServer2 service.
Difference between JDBC and ODBC:
(1) JDBC has better portability than ODBC:
Typically, after installing an ODBC driver, additional configuration is required before it can be used. Different configurations may not be compatible across different servers, so you need to configure it each time. In contrast, JDBC only requires selecting the appropriate JDBC database driver, with no extra configuration needed. During installation, the JDBC driver handles all necessary configuration automatically.
(2) Different programming languages:
JDBC is used in Java programming, while ODBC is generally used in C/C++ programming.
Example: Suppose a user executes a DDL (Data Definition Language) statement. Its execution sequence is as follows:
- The user creates a table using
CREATE TABLE.
- The corresponding table path is recorded in the Metastore.
- The table path recorded in the Metastore corresponds to a directory on HDFS, where the table's data will be stored.
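A hedged sketch of that sequence with a concrete (hypothetical) table name:
-- 1. The user issues the DDL statement
CREATE TABLE demo_tbl (id INT);
-- 2. The Metastore records the table's metadata, including its storage path,
--    by default ${hive.metastore.warehouse.dir}/demo_tbl
-- 3. That path corresponds to a directory on HDFS, where the table's data files will live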
2. Hive Installation
3. Hive Tips and Tricks
3.1 Common Hive Commands
hive -H can be used to view Hive-related commands.
Option | Description | Example |
-d, --define | Variable substitution to apply to Hive commands, e.g., -d A=B or --define A=B | ㅤ |
--database <database> | Specify a database to use | --database stu |
-e <quoted-query-string> | Non-interactive SQL usage: Normally, we interactively access Hive through bin/hive or HiveServer2. Now, we want to execute SQL directly using Linux commands. | hive -e "insert into stu values(1,'aa')" |
-f <filename> | If the SQL is too long, you can put it into a file and then execute it. This is also a form of non-interactive execution. | hive -f stu.sql |
-H,--help | Print help information | hive -H |
--hiveconf <property=value> | When starting a Hive task with special requirements and you do not want to use the configurations in the configuration file, you can use this parameter to add or override configurations from the file. However, note that it is effective only for this Hive session. | bin/hive -hiveconf mapreduce.job.reduces=10; |
--hivevar <key=value> | Variable substitution to apply to hive commands. e.g. --hivevar A=B | ㅤ |
-i <filename> | Initialization SQL file | ㅤ |
-S, --silent | Silent mode in interactive shell | ㅤ |
-v, --verbose | Can display the executed SQL statements | ㅤ |
3.2 Hive Configuration Files
The priority of the three configuration methods below increases in the following order: Configuration File < Command-line Parameters < Parameter Declaration. Note that some system-level parameters, such as log4j settings, must be set using the first two methods because they are read before the session is established.
- Configuration File Method
- Default configuration file: hive-default.xml (used for reference only; does not take effect)
- User-defined configuration file: hive-site.xml (the global configuration file; usually the one that is modified)
Tips: User-defined configurations override the default configurations. Additionally, Hive also reads Hadoop configurations since it is started as a Hadoop client. Hive configurations override Hadoop configurations. Configuration file settings are effective for all Hive processes started on the local machine.
- Command-line Parameter Method
bin/hive -hiveconf mapreduce.job.reduces=10;
Tips: Effective only for this Hive session.
- Parameter Declaration Method: If you have already entered the Hive interactive interface (started via
bin/hive) and want to modify configurations, you can use the following method.
hive(default)> set mapreduce.job.reduces=10;
Tips: Effective only for this Hive session.
3.3 Common Hive Configuration Properties
Hive client displays the current database and table headers
- Add the following two configurations in
hive-site.xml:
<!-- Display the current database and table headers on the client -->
<property>
    <name>hive.cli.print.header</name>
    <value>true</value>
    <description>Whether to print the names of the columns in query output.</description>
</property>
<property>
    <name>hive.cli.print.current.db</name>
    <value>true</value>
    <description>Whether to include the current database in the Hive prompt.</description>
</property>
After the configuration, the Hive client can display the current database and table headers during runtime.

Hive Runtime Log Path Configuration
- By default, Hive logs are stored at /tmp/sylgg0918/hive.log (under the current username).

- Change Hive log location to
/opt/module/hive/logs
# Rename conf/hive-log4j2.properties.template to hive-log4j2.properties
mv hive-log4j2.properties.template hive-log4j2.properties
# Open the configuration file
vim hive-log4j2.properties
# Modify the log storage location
property.hive.log.dir = /opt/module/hive-3.1.3/logs

Hive JVM Heap Memory Settings
In the new version of Hive, the default JVM heap memory at startup is 256M. If the heap size is too small, running complex SQL in local mode may often cause the error:
java.lang.OutOfMemoryError: Java heap space. Therefore, it is recommended to adjust the HADOOP_HEAPSIZE parameter in advance.
Q: What is Local Mode?
A: Local mode means that when the data volume is small, tasks are not submitted to YARN.
- If using the Hive client, "local" refers to the current process.
- If using HiveServer2, "local" refers to the HiveServer2 process.
- Rename conf/hive-env.sh.template to hive-env.sh
mv hive-env.sh.template hive-env.sh
- Modify the parameter export HADOOP_HEAPSIZE (line 40) in hive-env.sh to 2048
export HADOOP_HEAPSIZE=2048

- Restart Hive, otherwise the changes will not take effect
Disable Hadoop Virtual Memory Check
Disable virtual memory check in
yarn-site.xml (if it is already disabled, no need to configure)
- Stop Hadoop first
- Modify the configuration file
vim /opt/module/
# Add the following configuration
<!-- Disable virtual memory check -->
<property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
</property>
- Distribute the modified yarn-site.xml
xsync yarn-site.xml
- Restart YARN
4. DDL(Data Definition Language)
4.1 Database
4.1.1 Create Database
Syntax
CREATE DATABASE [IF NOT EXISTS] database_name
[COMMENT database_comment]
[LOCATION hdfs_path]    -- Specify the HDFS location for this database
[WITH DBPROPERTIES(property_name=property_value)]
Example
-- Create a database without specifying a location
-- If no location is specified, the default path is ${hive.metastore.warehouse.dir}/database_name.db on HDFS
CREATE DATABASE db_hive1;
-- Create a database with a specified location
CREATE DATABASE db_hive2 LOCATION '/db_hive2';
-- Create a database with DB properties (use DESC DATABASE EXTENDED database_name to view)
CREATE DATABASE db_hive3 WITH DBPROPERTIES('create_date'='2022-11-18');
4.1.2 Query Database
- View Databases
Syntax
SHOW DATABASES [LIKE 'identifier_with_wildcards']
Example
-- Show all databases
SHOW DATABASES;
-- Show all databases starting with 'db_hive'
SHOW DATABASES LIKE 'db_hive*';
-- Show databases starting with 'db_hive' and the 'default' database
SHOW DATABASES LIKE 'db_hive*|default';
- View Database Information
Syntax
DESCRIBE DATABASE [EXTENDED] db_name;
Example
-- View basic information
DESC DATABASE db_hive3;
-- View more detailed information about the database
DESC DATABASE EXTENDED db_hive3;
4.1.3 Update Database
Users can use the
ALTER DATABASE command to modify certain information about a database. The modifiable information includes DB properties, location, and owner user.
Note: Changing the database location does not affect the paths of existing tables; it only changes the default parent directory for newly created tables.
Syntax
- Modify properties
ALTER DATABASE database_name SET DBPROPERTIES(property_name=property_value);
- Modify location
ALTER DATABASE database_name SET LOCATION hdfs_path;
- Modify owner user
ALTER DATABASE database_name SET OWNER USER user_name;
Example
-- Modify dbproperties
ALTER DATABASE database_name SET DBPROPERTIES('create_date'='2022-11-01');
4.1.4 Delete Database
Syntax
DROP DATABASE [IF EXISTS] database_name [RESTRICT|CASCADE]
Tips:
- RESTRICT: Strict mode. If the database is not empty, the deletion will fail. This is the default mode.
- CASCADE: Cascade mode. If the database is not empty, all tables within the database will be deleted as well.
Example
-- Delete an empty database
DROP DATABASE db_hive2;
-- Delete a non-empty database
DROP DATABASE db_hive3 CASCADE;
Switching databases is the same as in MySQL; simply use
USE database_name.
4.2 Table
Data Types
Primitive Data Types
Type | Description | Definition |
tinyint | 1-byte signed integer | ㅤ |
smallint | 2 byte signed integer | ㅤ |
int | 4 byte signed integer | ㅤ |
bigint | 8 byte signed integer | ㅤ |
boolean | Boolean type, true or false | ㅤ |
float | Single-precision floating-point number | ㅤ |
double | Double-precision floating-point number | ㅤ |
decimal | Decimal precise numeric type (for example, DECIMAL(16,2) means the total number of digits including integer and fractional parts is 16, and the fractional part can have 2 digits). | decimal(16,2) |
varchar | Character sequence, maximum length must be specified. The valid range for the maximum length is [1, 65535]. | varchar(32) |
string | String, no maximum length needs to be specified. | ㅤ |
timestamp | Time Type | ㅤ |
binary | Binary Data | ㅤ |
Complex Data Types
Type | Description | Definition | How to get value |
array | An array is a collection of values of the same type. | array<string> | arr[0] |
map | A map is a collection of key-value pairs of the same type. | map<string, int> | map['key'] |
struct | A struct consists of multiple fields, each with its own name and data type. | struct<id:int, name:string> | struct.id |
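The access patterns in the last column can be tried directly on literal values, without creating a table (a minimal sketch; the values are arbitrary):
-- array: index starts at 0
select array('bingbing', 'lili')[0];                                -- bingbing
-- map: look up by key
select map('xiaohaihai', 18, 'xiaoyangyang', 16)['xiaohaihai'];     -- 18
-- struct: access a field through a column alias
select s.id from (select named_struct('id', 1, 'name', 'aa') as s) t;   -- 1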
Type Conversion
Hive's primitive data types support type conversion, which includes implicit conversion and explicit conversion.
- Implicit Conversion
Users do not need to perform any operations; Hive will automatically convert the data. The specific rules are as follows:

For example, if we execute the SQL statement:
SELECT '1' + 1;
What is the result type? The answer is DOUBLE.
Why? According to the conversion rules, a
STRING can be implicitly converted to DOUBLE, but not to INT or BIGINT, while an INT can be converted to DOUBLE. The smallest type to which both operands can be converted is therefore DOUBLE, so the result type is DOUBLE.
- Explicit Conversion
Syntax
cast(expr as <type>)
Example
-- Convert string to int (implicit conversion cannot achieve this, see the table above)
select cast('111' as int);
-- Convert string to int (cannot be achieved if the string is not a number, e.g., 'abc'; the result will be NULL)
select cast('abc' as int);
4.2.1 Create Table
Standard Table Creation Method
Syntax
/**
 * File: create_example_table.sql
 * Description: SQL file demonstrating the syntax for creating a large Hive/SQL table.
 * Note: Keywords in brackets [] are optional clauses.
 */
-- 1. Core Statement and Table Name
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
(
    -- 2. Column Definitions
    col_name_1 data_type [COMMENT col_comment],
    col_name_2 data_type [COMMENT col_comment],
    -- ... more columns
    col_name_n data_type
)
-- 3. Table-Level Comment (Optional)
[COMMENT table_comment]
-- 4. Partitioning Clause (Optional)
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
-- 5. Clustering and Sorting (Optional, for Bucket Optimization)
[CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
-- 6. Row Format (Optional, specifies record termination and field delimiters)
[ROW FORMAT row_format]
-- 7. Storage Format (Optional, specifies the underlying file type)
[STORED AS file_format]    -- Common formats: TEXTFILE, PARQUET, ORC, AVRO
-- 8. Data Location (Optional, specifies the HDFS path)
[LOCATION hdfs_path]
-- 9. Table Properties (Optional, key-value metadata)
[TBLPROPERTIES (property_name=property_value, ...)];
Partition columns cannot have the same name as existing table columns; otherwise, an error will occur.
Keyword Explanation
TEMPORARY | Temporary Table: This table is visible only in the current session. When the session ends, the table is deleted. Simply put, if you log in via a client, perform operations, and then exit the client, the table will no longer exist. |
EXTERNAL | External Table: In contrast to internal (managed) tables, a managed table means Hive fully controls the table, including metadata and data in HDFS. An external table means Hive only manages the metadata and does not fully control the data in HDFS. |
PARTITIONED BY | Create a Partitioned Table |
CLUSTERED BY… SORTED BY… INTO … BUCKETS | Create a Bucketed Table, where each bucket is sorted by the field(s) specified after SORTED BY. |
ROW FORMAT | Specify SERDE (Serializer and Deserializer): Hive uses a SerDe to serialize and deserialize each row of data. What is a SerDe? Read file flow: HDFS files → InputFileFormat → <key,value> → Deserializer → Row object. Write file flow: Row object → Serializer → <key,value> → OutputFileFormat → HDFS files. If no serializer or deserializer is provided, Hive will use the default SerDe. You can check it with SHOW CREATE TABLE.
Syntax 1:ROW FORMAT DELIMITED
[FIELDS TERMINATED BY char]
[COLLECTION ITEMS TERMINATED BY char]
[MAP KEYS TERMINATED BY char]
[LINES TERMINATED BY char]
[NULL DEFINED AS char]
DELIMITED: Indicates that each field in the file is separated by a specific delimiter; the default SerDe will be used to serialize and deserialize each row. FIELDS TERMINATED BY: column delimiter. COLLECTION ITEMS TERMINATED BY: delimiter between elements in a map, struct, or array. MAP KEYS TERMINATED BY: delimiter between keys and values in a map. LINES TERMINATED BY: line delimiter.
Syntax 2:ROW FORMAT SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]
serde_name must be the full class name (e.g., com.xxx.xxx.xxx). The SERDE keyword can specify other built-in SerDes or user-defined SerDes, such as JSON SerDe, which can be used to handle JSON strings. |
STORED AS | Example: STORED AS TEXTFILE |
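A hedged sketch that combines several of the clauses above (Syntax 1 delimiters for complex types plus bucketing); the table name and columns are made up for illustration:
create table person_info (
    name     string,
    friends  array<string>,
    children map<string, int>
)
clustered by (name) sorted by (name) into 4 buckets
row format delimited
    fields terminated by ','
    collection items terminated by '_'
    map keys terminated by ':'
    lines terminated by '\n'
stored as textfile;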
Create Table As Select (CTAS)
Syntax
CREATE [TEMPORARY] TABLE [IF NOT EXISTS] table_name [COMMENT table_comment] [ROW FORMAT row_format] [STORED AS file_format] [LOCATION hdfs_path] [TBLPROPERTIES (property_name=property_value, ...)] [AS select_statement]
This syntax allows users to create a table directly from the results of a SELECT query, with the table structure matching the structure of the query results.
Create Table Like
Syntax
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name [LIKE exist_table_name] [ROW FORMAT row_format] [STORED AS file_format] [LOCATION hdfs_path] [TBLPROPERTIES (property_name=property_value, ...)]
This syntax allows users to replicate the structure of an existing table. Unlike the CTAS (CREATE TABLE AS SELECT) syntax, the table created using this method does not contain any data.
Example
Internal Table
By default, tables created in Hive are managed (internal) tables.
create table if not exists student(
    id int,
    name string
)
row format delimited fields terminated by '\t'
location '/user/hive/warehouse/student';
Prepare test data
1001	student1
1002	student2
1003	student3
1004	student4
1005	student5
1006	student6
1007	student7
1008	student8
1009	student9
1010	student10
1011	student11
1012	student12
1013	student13
1014	student14
1015	student15
1016	student16
Upload data to
/user/hive/warehouse/student
hadoop fs -put student.txt /user/hive/warehouse/student
Delete the table, and you will find that the data in HDFS is gone as well.
drop table student;

External Table
create external table if not exists student(
    id int,
    name string
)
row format delimited fields terminated by '\t'
location '/user/hive/warehouse/student';
Prepare test data
1001	student1
1002	student2
1003	student3
1004	student4
1005	student5
1006	student6
1007	student7
1008	student8
1009	student9
1010	student10
1011	student11
1012	student12
1013	student13
1014	student14
1015	student15
1016	student16
Upload data to
/user/hive/warehouse/student
hadoop fs -put student.txt /user/hive/warehouse/student
Delete the table, and you will find that the data in HDFS is still there.
drop table student;

SERDE and Complex Data Types
If you have a JSON file in the following format that needs to be analyzed and processed by Hive, consider how to design the table.
{ "name": "dasongsong", "friends": [ "bingbing", "lili" ], "students": { "xiaohaihai": 18, "xiaoyangyang": 16 }, "address": { "street": "hui long guan", "city": "beijing", "postal_code": 10010 } }
When designing table fields, the table columns should correspond to the top-level fields in the JSON string. For JSON strings with nested structures, use complex data types (array, map, struct) for the corresponding columns, as in the table below.
create table teacher(
    name string,
    friends array<string>,
    students map<string, int>,
    address struct<city:string, street:string, postal_code:int>
)
row format serde 'org.apache.hadoop.hive.serde2.JsonSerDe'
location '/user/hive/warehouse/teacher';
Save the JSON above into teacher.txt as a single line (the JsonSerDe expects one JSON object per line, so do not keep the pretty-printed format shown above) and upload it to /user/hive/warehouse/teacher.
hadoop fs -put teacher.txt /user/hive/warehouse/teacher
Query the values within it
select friends[0], students['xiaohaihai'], address.city from teacher;

create table as select & create table like
create table teacher1 as select * from teacher;
It will include the data from the
teacher table.
create table teacher2 like teacher;
It won’t include the data from the
teacher table.
4.2.2 View Table
Show All Tables
Syntax
SHOW TABLES [IN database_name] LIKE ['identifier_with_wildcards'];
Tips: Explanation of
LIKE wildcard expressions:
- * represents any number of any characters
- | represents an OR relationship
Example
-- Show all tables in the 'default' database starting with 'stu'
show tables in default like 'stu*';
View Table Information
Syntax
DESCRIBE [EXTENDED | FORMATTED] [db_name.]table_name
Option | Description |
EXTENDED | Show detailed information |
FORMATTED | Format the detailed information for better readability |
Example
-- View basic information
DESC stu;
-- View more detailed information in a formatted way
DESC FORMATTED stu;
4.2.3 Update Table
Rename Table
Syntax
ALTER TABLE table_name RENAME TO new_table_name
Example
-- Rename the table 'stu' to 'stu1'
ALTER TABLE stu RENAME TO stu1;

Modify Column Information
Tips: This only modifies the table's metadata. If the table contains data in HDFS, the data on HDFS must be handled manually. For example, if the original columns are
a, b, c and you want to reorder them to c, a, b, the metadata will reflect the new order, but the actual HDFS data will not change. If you do not manually adjust the HDFS data, reading the table will return incorrect values (e.g., the c column will return what was originally in the a column).
Add Column
This statement allows users to add new columns, which will be appended at the end.
Syntax
ALTER TABLE table_name ADD COLUMNS (col_name data_type [COMMENT col_comment], ...)
Update Column
This statement allows users to modify a specific column's name, data type, comment, and position in the table.
Syntax
ALTER TABLE table_name CHANGE [COLUMN] col_old_name col_new_name column_type [COMMENT col_comment] [FIRST|AFTER column_name]
Replace Column
This statement also allows users to replace all existing columns in the table with new columns.
Syntax
ALTER TABLE table_name REPLACE COLUMNS (col_name data_type [COMMENT col_comment], ...)
Example
-- View table structure
DESC stu1;
-- Add a new column 'age' of type INT
ALTER TABLE stu1 ADD COLUMNS(age INT);
-- Add a new column 'gender'
ALTER TABLE stu1 ADD COLUMNS(gender STRING);
-- Change the type of 'gender' to INT and place it after 'id'
ALTER TABLE stu1 CHANGE COLUMN gender gender INT AFTER id;
-- Error while processing statement: FAILED: Execution Error, return code 1
-- from org.apache.hadoop.hive.ql.exec.DDLTask. Unable to alter table. The
-- following columns have types incompatible with the existing columns in
-- their respective positions:
-- If the above error occurs, execute the following:
SET hive.metastore.disallow.incompatible.col.type.changes=false;
-- View table structure
DESC stu1;
-- Update a column (rename 'age' to 'ages' and change type from INT to DOUBLE)
ALTER TABLE stu1 CHANGE COLUMN age ages DOUBLE;
-- Replace columns (replace existing columns with new ones)
ALTER TABLE stu1 REPLACE COLUMNS(id INT, name STRING);
4.2.4 Delete Table
Syntax
DROP TABLE [IF EXISTS] table_name;
Example
-- Delete the table 'stu'
DROP TABLE stu;
4.2.5 Truncate Table
Tips: TRUNCATE removes all rows from a managed table; it cannot be used to delete the data of an external table.
Syntax
TRUNCATE [TABLE] table_name
Example
-- Truncate the table (removes all rows from the managed table)
TRUNCATE TABLE student;
5. DML(Data Manipulation Language)
5.1 Load
The
LOAD statement can be used to import files into a Hive table.
Tips: The
LOCAL keyword is worth noting. If you are using the hive command to connect directly, LOCAL refers to the current server. If you are using the beeline command, which is the JDBC connection we commonly use, LOCAL refers to the path on the HiveServer2 machine.
Syntax
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)];
option | description |
local | Indicates loading data from the local system into a Hive table; otherwise, data is loaded from HDFS into the Hive table. |
overwrite | Indicates overwriting existing data in the table; otherwise, data is appended. |
partition | Indicates uploading to a specified partition; if the target is a partitioned table, the partition must be specified. |
Example
-- Create a table
CREATE TABLE student (
    id INT,
    name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
- Load local file to Hive
-- If the LOCAL keyword is used, it is equivalent to hadoop fs -put: the source file will not be removed.
load data local inpath '/opt/module/datas/student.txt' into table student;

-- Execute again. Since it uses INTO instead of OVERWRITE INTO, this is an append operation rather than an overwrite.
-- Therefore, the table should now contain two copies of the data.
LOAD DATA LOCAL INPATH '/opt/module/datas/student.txt' INTO TABLE student;

-- Since it uses OVERWRITE INTO, this is an overwrite operation. The original two copies of the data should now be reduced to one.
LOAD DATA LOCAL INPATH '/opt/module/datas/student.txt' OVERWRITE INTO TABLE student;
- Load HDFS file to Hive
# Upload file to HDFS
hadoop fs -put /home/sylgg0918/student.txt /user/sylgg0918
-- Load data from HDFS. After the import, check HDFS to see if the file still exists.
-- (Without the LOCAL keyword, this behaves like a move operation, and the source file will be removed.)
LOAD DATA INPATH '/user/sylgg0918/student.txt' INTO TABLE student;
5.2 Insert
Insert the query results into a table
Syntax
INSERT (INTO | OVERWRITE) TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement;
Example
-- Create a new table
CREATE TABLE student1 (
    id INT,
    name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
-- Insert data based on query results
INSERT OVERWRITE TABLE student1 SELECT id, name FROM student;
-- Append one more copy of the data
INSERT INTO TABLE student1 SELECT id, name FROM student;
Insert the given values into a table
Syntax
INSERT (INTO | OVERWRITE) TABLE tablename [PARTITION (partcol1[=val1], partcol2[=val2] ...)] VALUES values_row [, values_row ...]
Example
-- Insert fixed values into the table
INSERT INTO TABLE student1 VALUES (1, 'wangwu'), (2, 'zhaoliu');
-- Insert fixed values into a partitioned table (illustrative; assumes a table partitioned by the listed columns)
INSERT INTO TABLE student1 PARTITION(id, age) VALUES (1, 'wangwu'), (2, 'zhaoliu');
Write the query results to the target path
Tips: Make sure the path after
directory is completely correct. If there is any mistake, it will overwrite all files in the incorrect path you provided. The damage can be as severe as rm -rf.
Syntax
INSERT OVERWRITE [LOCAL] DIRECTORY directory [ROW FORMAT row_format] [STORED AS file_format] select_statement;
Example
-- Without the LOCAL keyword, the output would be written directly to an HDFS path
INSERT OVERWRITE LOCAL DIRECTORY '/home/sylgg0918/datas'
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.JsonSerDe'
SELECT id, name FROM student;

5.3 Export & Import
The EXPORT statement can export both the table data and its metadata to an HDFS path. The IMPORT statement can load the exported content back into Hive, restoring both the table data and the metadata. EXPORT and IMPORT can be used for data migration between two Hive instances.
Syntax
- Export
EXPORT TABLE tablename TO 'export_target_path'
- Import
IMPORT [EXTERNAL] TABLE new_or_original_tablename FROM 'source_path' [LOCATION 'import_target_path']
Example
-- Export
EXPORT TABLE default.student TO '/user/hive/warehouse/export/student';
-- Import
-- If an error occurs, check Error & Exception -> Hive
IMPORT TABLE student2 FROM '/user/hive/warehouse/export/student';

6. Query
Syntax
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference                                        -- From which table to query
[WHERE where_condition]                                     -- Filtering
[GROUP BY col_list]                                         -- Grouping
[HAVING col_list]                                           -- Filtering after grouping
[ORDER BY col_list]                                         -- Sorting
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT number]                                              -- Limit the number of output rows
Data Preparation
- Create
dept.txt under /opt/module/datas and fill it with the following content.
10	Administration	1700
20	Finance	1800
30	Teaching	1900
40	Sales	1700
- Create
emp.txt under the /opt/module/datas path and fill it with the following content.
7369	Zhang San	R&D	800.00	30
7499	Li Si	Finance	1600.00	20
7521	Wang Wu	Admin	1250.00	10
7566	Zhao Liu	Sales	2975.00	40
7654	Hou Qi	R&D	1250.00	30
7698	Ma Ba	R&D	2850.00	30
7782	Jin Jiu	\N	2450.0	30
7788	Yin Shi	Admin	3000.00	10
7839	Xiao Fang	Sales	5000.00	40
7844	Xiao Ming	Sales	1500.00	40
7876	Xiao Li	Admin	1100.00	10
7900	Xiao Yuan	Lecturer	950.00	30
7902	Xiao Hai	Admin	3000.00	10
7934	Xiao Hongming	Lecturer	1300.00	30
- Create the department table.
create table if not exists dept(
    deptno int,
    dname string,
    loc int
)
row format delimited fields terminated by '\t';
- Create the employee table.
create table if not exists emp(
    empno int,
    ename string,
    job string,
    sal double,
    deptno int
)
row format delimited fields terminated by '\t';
- Import data
load data local inpath '/opt/module/datas/dept.txt' into table dept;
load data local inpath '/opt/module/datas/emp.txt' into table emp;
6.1 Basic query
6.1.1 Query the entire table and specific columns.
Query the entire table.
select * from emp;
Select specific columns for query.
Tips:
- SQL is case-insensitive.
- SQL statements can be written in a single line or across multiple lines.
- Keywords cannot be abbreviated or split across lines.
- Each clause is generally written on a separate line.
- Use indentation to improve the readability of the statements.
select empno, ename from emp;
6.1.2 Column alias
select ename AS name, deptno dn from emp;
6.1.3 Limit
-- Return the first 5 rows
SELECT * FROM emp LIMIT 5;
-- Skip the first 2 rows and return the next 3 (i.e., rows 3 through 5)
SELECT * FROM emp LIMIT 2, 3;
6.1.4 Where
-- Query employees with a salary greater than 1000
SELECT * FROM emp WHERE sal > 1000;
6.1.5 Relational operators/functions
The following operators are mainly used in
WHERE and HAVING clauses.
Operator | Support data type | Description |
A=B | Basic data types | Returns true if A equals B; otherwise, returns false. |
A<=>B | Basic data types | NULL-safe equality: returns true if both A and B are NULL; returns false if only one of them is NULL; otherwise behaves like the = operator. |
A<>B, A!=B | Basic data types | Returns NULL if either A or B is NULL; returns true if A is not equal to B, otherwise returns false. |
A<B | Basic data types | Returns NULL if either A or B is NULL; returns true if A is less than B, otherwise returns false. |
A<=B | Basic data types | Returns NULL if either A or B is NULL; returns true if A is less than or equal to B, otherwise returns false. |
A>B | Basic data types | Returns NULL if either A or B is NULL; returns true if A is greater than B, otherwise returns false. |
A>=B | Basic data types | Returns NULL if either A or B is NULL; returns true if A is greater than or equal to B, otherwise returns false. |
A [not] between B and C | Basic data types | If any of A, B, or C is NULL, the result is NULL.
If A is greater than or equal to B and less than or equal to C, the result is true; otherwise, it is false.
Using the NOT keyword produces the opposite result. |
A is null | All data types | Returns true if A is NULL; otherwise, returns false. |
A is not null | All data types | Returns true if A is not NULL; otherwise, returns false. |
in(value1, value2, ...) | All data types | Use the IN operator to check if a value exists in a list. |
A [not] like B | string | B is a simple SQL regular expression, also called a wildcard pattern.
If A matches B, it returns true; otherwise, it returns false. The pattern rules for B are as follows: 'x%' means A must start with the letter x; '%x' means A must end with the letter x; '%x%' means A contains the letter x, which can be at the beginning, end, or anywhere within the string. Using the NOT keyword produces the opposite result. |
A rlike B, A regexp B | string | B is a Java-based regular expression. If A matches B, it returns true; otherwise, it returns false. The matching uses the regular expression interface implemented in the JDK, so the rules follow Java regex conventions. Unlike LIKE, the pattern does not have to match the entire string A; matching any substring of A is enough. |
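A few hedged illustrations of the operators above, using the emp table prepared at the start of section 6 (results depend on that sample data):
-- between: closed interval [1000, 3000]
select * from emp where sal between 1000 and 3000;
-- is null: employees whose job field was loaded as NULL (\N in the data file)
select * from emp where job is null;
-- like: names beginning with 'Xiao'
select * from emp where ename like 'Xiao%';
-- rlike: the same condition written as a Java regular expression
select * from emp where ename rlike '^Xiao';
-- null-safe equality
select null <=> null;   -- true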
6.1.6 Logical operators/functions
Operator | Description |
and | Logical AND |
or | Logical OR |
not | Logical NOT |
Example
-- Query employees with salary greater than 1000 and department 30
SELECT * FROM emp WHERE sal > 1000 AND deptno = 30;
-- Query employees with salary greater than 1000 or department 30
SELECT * FROM emp WHERE sal > 1000 OR deptno = 30;
-- Query employees who are not in department 20 or 30
SELECT * FROM emp WHERE deptno NOT IN (20, 30);
6.1.7 Aggregate functions
Function Name | Description |
count(*) | Counts all rows, including NULL values. |
count([Column Name]) | Counts the number of non- NULL values in a column. |
max() | Returns the maximum value, excluding NULL (unless all values are NULL). |
min() | Returns the minimum value, excluding NULL (unless all values are NULL). |
sum() | Calculates the sum, excluding NULL. |
avg() | Calculates the average, excluding NULL. |
Example
- Get the total number of rows (in Hive,
COUNT(1) and COUNT(*) have exactly the same performance).
select count(*) cnt from emp;

- Get the maximum salary.
select max(sal) max_sal from emp;

- Get the minimum salary.
select min(sal) min_sal from emp;

- Get the total sum of salaries.
select sum(sal) sum_sal from emp;

- Get the average salary.
select avg(sal) avg_sal from emp;

6.2 Group & Having
6.2.1 Group By
The
GROUP BY statement is usually used with aggregate functions. It groups the results by one or more columns and then performs aggregate operations on each group.
Example
- Calculate the average salary for each department in the
emp table.
select t.deptno, avg(sal) avg_sal from emp t group by t.deptno;


- Calculate the highest salary for each job in each department in the
emp table.
select t.deptno, t.job, max(sal) max_sal from emp t group by t.deptno, t.job;


6.2.2 Having
Difference between
HAVING and WHERE:
- You cannot use aggregate functions after WHERE, but you can use them after HAVING.
- HAVING is used only in GROUP BY statements for grouped aggregation.
Example
- Get departments with an average salary greater than 2000.
select deptno from emp group by deptno having avg(sal) > 2000;
select deptno, avg(sal) avg_sal from emp group by deptno having avg_sal > 2000;

6.3 Join
6.3.1 Table alias
Example
select e.*, d.* from emp e join dept d on e.deptno = d.deptno;
6.3.2 inner join
Only rows that have matching data in both tables based on the join condition are retained. By default,
JOIN is an INNER JOIN.
select e.empno, e.ename, d.deptno from emp e join dept d on e.deptno = d.deptno;

6.3.3 left join
Returns all rows from the left table; matched rows from the right table show their values, and unmatched rows are filled with
NULL.
select e.empno, e.ename, d.deptno from emp e left join dept d on e.deptno = d.deptno;
6.3.4 right join
Returns all rows from the right table; matched rows from the left table show their values, and unmatched rows are filled with
NULL.
select e.empno, e.ename, d.deptno from emp e right join dept d on e.deptno = d.deptno;
6.3.5 full join
Returns all rows from both the left and right tables; unmatched rows are filled with
NULL.
select e.empno, e.ename, d.deptno from emp e full join dept d on e.deptno = d.deptno;
6.3.6 Multi-table join
To join
n tables, at least n-1 join conditions are required.
Data Preparation
- Create a file named
location.txt under /home/sylgg0918/datas/.
1700	Beijing
1800	Shanghai
1900	Shenzhen
- Create the table
create table if not exists location(
    loc int,
    loc_name string
)
row format delimited fields terminated by '\t';
- Import data
load data local inpath '/home/sylgg0918/datas/location.txt' into table location;
Example
select e.ename, d.dname, l.loc_name
from emp e
join dept d on d.deptno = e.deptno
join location l on d.loc = l.loc;
6.3.7 Cartesian product
Conditions for producing a Cartesian product:
- The join condition is omitted.
- The join condition is invalid.
- In these cases, every row from one table is joined with every row from the other table.
Example
select empno, dname from emp, dept;

6.3.8 union & union all
Tips:
- UNION can only combine SELECT statements.
- If the field names of the two SELECT statements are different, the result will use the field names from the first SELECT. (Even if the names differ, the union works as long as the column types match.)
- UNION and UNION ALL concatenate SQL results vertically, unlike JOIN, which combines tables horizontally.
- UNION removes duplicates.
- UNION ALL keeps all rows, including duplicates.
Requirements for using
UNION or UNION ALL:
- Both SQL results must have the same number of columns.
- The data types of corresponding columns must match.
Example
-- Concatenate and display employees from department 30 and department 40 using UNION
SELECT * FROM emp WHERE deptno = 30
UNION
SELECT * FROM emp WHERE deptno = 40;
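For comparison, a hedged UNION ALL example: because duplicates are kept, selecting the same department twice returns every matching row twice.
SELECT * FROM emp WHERE deptno = 30
UNION ALL
SELECT * FROM emp WHERE deptno = 30;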
6.4 Sort
6.4.1 order by (Global)
Global sorting uses only one reducer. Even if you set
mapreduce.job.reduces=3, the final number of reducers will still be one. This is because the purpose of ORDER BY is to achieve a globally sorted result.
In production, it's common to add a
LIMIT after ORDER BY to reduce the workload. If you truly need a fully globally sorted result, you must allocate more resources to the reducer.
Example
- Query employee information sorted by salary in ascending order.
select * from emp order by sal;

- Query employee information sorted by salary in descending order.
select * from emp order by sal desc;
- Sort employees based on twice their salary.
select ename, sal * 2 twosal from emp order by twosal;

- Query employee information sorted first by department number and then by salary:
select ename, deptno, sal from emp order by deptno, sal;

6.4.2 Sort By (Reduce End)
SORT BY is used to specify the sorting field from Map to Reduce, but it does not guarantee global ordering.
For large datasets,
ORDER BY is very inefficient. Often, global sorting is unnecessary, and in such cases, SORT BY can be used. SORT BY generates a sorted file for each reducer. The sorting is done within each reducer, so the overall global result is not fully sorted.
Example
-- Set the number of reducers
set mapreduce.job.reduces=3;
-- Check the configured number of reducers
set mapreduce.job.reduces;
-- View employee information sorted by department number in descending order
select * from emp sort by deptno desc;

-- You can verify that the data is partitioned and sorted by using the following commands.
-- Set 3 reducers (the number of reducers equals the number of final output files)
set mapreduce.job.reduces=3;
-- Output the results to a local directory
insert overwrite local directory '/home/sylgg0918/datas/sortby-result'
select * from emp sort by deptno desc;
Check the files in the target directory to see if they are sorted.
6.4.3 Distribute By
Tips:
- The partitioning rule of
distribute by is based on the hash code of the partition field modulo the number of reducers; rows with the same remainder are grouped into the same partition.
- Hive requires that the
distribute by clause be written before the sort by clause.
- After the demonstration, remember to reset the value of
mapreduce.job.reduces back to -1; otherwise, subsequent operations on partitioned or bucketed tables may cause errors during MapReduce jobs.
distribute by is used to specify the partition field; essentially, it distributes data to designated reducers according to a rule, after which sort by performs sorting.
In some cases, we need to control which specific rows go to which Reducer, typically for subsequent aggregation operations. The
distribute by clause accomplishes this. distribute by works similarly to partitioning in MapReduce (custom partitioning) and is often used in combination with sort by.
When testing
distribute by, be sure to allocate multiple reducers for processing; otherwise, the effect of distribute by may not be observable.
Example
-- Partition by department number first, then sort by employee salary within each partition
insert overwrite local directory '/home/sylgg0918/datas/distribute-result'
select * from emp distribute by deptno sort by sal desc;

6.4.4 Cluster By
When the
distribute by and sort by fields are the same, you can use cluster by. In addition to the functionality of distribute by, cluster by also includes the functionality of sort by. However, the sorting can only be in ascending order; you cannot specify the sorting rule as ASC or DESC.
Example
-- Partition and sort by deptno: the following two statements are exactly equivalent
select * from emp cluster by deptno;
select * from emp distribute by deptno sort by deptno;

7. Functions
-- View system built-in functions
show functions;
-- View usage of a built-in function
desc function upper;
-- View detailed information of a built-in function
desc function extended upper;
7.1 Single-row functions
7.1.1 Arithmetic operation functions
Operator | Description |
A+B | A plus B |
A-B | A minus B |
A*B | A multiplied by B |
A/B | A divided by B |
A%B | A modulo B |
A&B | A bitwise AND B |
A|B | A bitwise OR B |
A^B | A bitwise XOR B |
~A | Bitwise NOT of A |
select sal + 1 from emp;
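A few more hedged one-liners for the less common operators in the table:
select 7 % 3;   -- 1
select 4 & 6;   -- 4  (100 AND 110 = 100)
select 4 | 6;   -- 6  (100 OR  110 = 110)
select 4 ^ 6;   -- 2  (100 XOR 110 = 010)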
7.1.2 Numerical functions
- round
# Result: 3 select round(3.3);
- ceil
# Result: 4 select ceil(3.1);
- floor
# Result: 4 select floor(4.8);
7.1.3 String functions
substring
Syntax
Grammar | Return | Description |
substring(string A, int start) | string | Returns the substring of string A starting from position start to the end. Includes the character at position start, and note that string indices in Hive start from 1. |
substring(string A, int start, int len) | string | Returns a substring of string A starting from position start with a length of len. Includes the character at position start, and note that string indices in Hive start from 1. |
Example
-- Get all characters starting from the second character. Result: rtists
select substring("artists", 2);
-- Get all characters starting from the third-to-last character. Result: sts
select substring("artists", -3);
-- Starting from the third character, get the next 2 characters. Result: ti
select substring("artists", 3, 2);
replace
Syntax
Grammar | Return | Description |
replace(string A, string B, string C) | string | Replace substring B in string A with C. |
Example
# Replace 's' with 'S'. Result: artiStS select replace('artists', 's', 'S')
regexp_replace
Syntax
Grammar | Return | Description |
regexp_replace(string A, string regexp, string C) | string | Replace the part of string A that matches the Java regular expression B with C. Note: in some cases, escape characters may be required. |
Example
# Replace digits with 'num'. Result: num-num select regexp_replace('1000-200', '(\d+)', 'num')
regexp
Syntax
Grammar | Return | Description |
select 'strA' regexp 'strB' | boolean | If the string matches the regular expression, return true; otherwise, return false. |
Example
-- Result: true
select 'dfsaaaa' regexp 'dfsa+';
-- Result: false
select 'dfsaaaa' regexp 'dfsb+';
repeat
Syntax
Grammar | Return | Description |
repeat(string A, int n) | string | Repeat string A n times. |
Example
# Result: 123123123 select repeat('123', 3);
split
Syntax
Grammar | Return | Description |
split(string str, string pat) | array | Split the string str based on the content matched by the regular expression pat, and return the resulting substrings as an array. |
Example
# Result: ["a","b","c","d"] select split('a-b-c-d','-');
nvl
Syntax
Grammar | Return | Description |
nvl(A,B) | A or B | If the value of A is not null, return A; otherwise, return B. |
Example
# Result: 1 select nvl(null,1);
concat
Syntax
Grammar | Return | Description |
concat(string A, string B, string C, ……) | string | Concatenate characters A, B, C, ... into a single string. |
Example
# Result: beijing-shanghai-shenzhen select concat('beijing','-','shanghai','-','shenzhen');
concat_ws
Syntax
Grammar | Return | Description |
concat_ws(string A, string…| array(string)) | string | Concatenate multiple strings, or all elements of an array, using separator A. |
Example
-- Result: beijing-shanghai-shenzhen
select concat_ws('-','beijing','shanghai','shenzhen');
-- Result: aa-bb-cc
select concat_ws('-', array("aa","bb","cc"));
get_json_object
Syntax
Grammar | Return | Description |
get_json_object(string json_string, string path) | string | Parse the JSON string json_string and return the content specified by path. If the input JSON string is invalid, return NULL. |
Example
-- Result: artists
select get_json_object('[{"name":"artists","sex":"male","age":"25"},{"name":"small ben","sex":"male","age":"47"}]','$.[0].name');
-- Result: {"name":"artists","sex":"male","age":"25"}
select get_json_object('[{"name":"artists","sex":"male","age":"25"},{"name":"small ben","sex":"male","age":"47"}]','$.[0]');
7.1.4 Date functions
unix_timestamp: Returns the timestamp of the current or specified time.
Syntax
Grammar | Return | Description |
unix_timestamp() | bigint | Returns the number of seconds elapsed between the current or specified time and January 1, 1970, UTC. |
Example
-- Number of seconds in UTC, independent of the time zone of the hiveserver2 server
select unix_timestamp('2022/08/08 08-08-08','yyyy/MM/dd HH-mm-ss');
-- Returns the number of seconds elapsed between the current time and January 1, 1970
select unix_timestamp();
Result

from_unixtime: Convert a UNIX timestamp (seconds elapsed from 1970-01-01 00:00:00 UTC to the specified time) to the time format in the UTC+0 time zone.
Syntax
Grammar | Return | Description |
from_unixtime(bigint unixtime[,string format]) | string | Convert a UNIX timestamp (seconds elapsed from 1970-01-01 00:00:00 UTC to the specified time) to the time format in the UTC+0 time zone. |
Example
# Result:2022-08-08 08:08:08 select from_unixtime(1659946088);
from_utc_timestamp: Convert a UNIX timestamp (seconds elapsed from 1970-01-01 00:00:00 UTC to the specified time) to the time format in the specified time zone.
Syntax
Grammar | Return | Description |
from_utc_timestamp(bigint unixtime[,string timezone]) | string | Convert a UNIX timestamp (seconds elapsed from 1970-01-01 00:00:00 UTC to the specified time) to the time format in the specified time zone. |
Example
# Result: 2022-08-08 16:08:08.000000000 select from_utc_timestamp(cast(1659946088 as bigint)*1000, 'GMT+8');
current_date: Current date.
Syntax
Grammar | Return | Description |
current_date | ㅤ | Current date |
Example
# Result: 2023-03-12 select current_date;
current_timestamp: Current date and time, with precision down to milliseconds.
Syntax
Grammar | Return | Description |
current_timestamp | ㅤ | Current date + time (unlike unix_timestamp, this method takes into account the current time zone). |
Example
# Result: 2023-03-12 15:18:36.292000000 select current_timestamp;
month: Get the month from a date.
Syntax
Grammar | Return | Description |
month(string date) | int | Get the month from a date. |
Example
# Result: 8 select month('2022-08-08 08:08:08')
day: Get the day from a date.
Syntax
Grammar | Return | Description |
day(string date) | int | Get the day from a date. |
Example
# Result: 8 select day('2022-08-08 08:08:08')
hour: Get the hour from a date.
Syntax
Grammar | Return | Description |
hour(string date) | int | Get the hour from a date. |
Example
# Result: 8 select hour('2022-08-08 08:08:08');
datediff: The number of days between two dates (end date minus start date).
Syntax
Grammar | Return | Description |
datediff(string enddate, string startdate) | int | The number of days between two dates (end date minus start date). |
Example
# The difference in days between the earlier date and the later date. Result: -427 select datediff('2021-08-08','2022-10-09');
date_add: Date plus number of days.
Syntax
Grammar | Return | Description |
date_add(string startdate, int days) | string | Returns the date after adding days to the start date. |
Example
# What is the date 2 days after 2022-08-08? Result: 2022-08-10 select date_add('2022-08-08',2);
date_sub: Date minus number of days.
Syntax
Grammar | Return | Description |
date_sub(string startdate, int days) | string | Returns the date after subtracting days from the start date. |
Example
# What is the date 2 days before 2022-08-08? Result: 2022-08-06 select date_sub('2022-08-08',2);
date_format: Parse a standard date into a string of the specified format.
Syntax
Grammar | Return | Description |
date_format(string date, string formatPattern) | string | Format the date according to the formatPattern. |
Example
# Result: 2022年-08月-08日 select date_format('2022-08-08', 'yyyy年-MM月-dd日');
7.1.5 Flow control functions
case when: Conditional judgment functions.
Syntax
Grammar | Return | Description |
case when A then B [when C then D]* [else E] end | T | If A is true, then return B; if C is true, then return D; otherwise, return E. |
case A when B then C [when D then E]* [else F] end | T | If A equals B, then return C; if A equals D, then return E; otherwise, return F. |
Example
-- Syntax 1
select case when 1=2 then 'tom' when 2=2 then 'mary' else 'tim' end from tableName;
-- Syntax 2
select case 100 when 50 then 'tom' when 100 then 'mary' else 'tim' end from tableName;
if: Conditional judgment, similar to the ternary operator in Java.
Syntax
Grammar | Return | Description |
if (boolean testCondition, T valueTrue, T valueFalseOrNull) | T | When testCondition is true, return valueTrue; otherwise, return valueFalseOrNull. |
Example
-- Condition is satisfied, output 'Correct'
select if(10 > 5, 'Correct', 'False');
-- Condition is not satisfied, output 'False'
select if(10 < 5, 'Correct', 'False');
7.1.6 Collection functions
size: Number of elements in the collection.
# Result: 4 select size(array(1,2,3,4));
map: Create a map collection.
Syntax
Grammar | Return | Description |
map(key1, value1, key2, value2) | ㅤ | Construct a map type based on the input key-value pairs. |
Example
select map('xiaohai',1,'dahai',2);
map_keys: Return the keys from a map.
select map_keys(map('xiaohai',1,'dahai',2))
map_values: Return the values from a map (duplicate values are not removed).
select map_values(map('xiaohai',1,'dahai',2))
array: Declare an array collection.
Syntax
Grammar | Return | Description |
array(value1, value2) | ㅤ | Construct an array from the input values. |
Example
# Result: ["1","2","3","4"] select array('1','2','3','4');
array_contains: Determine whether an array contains a specific element.
# Result: true select array_contains(array('a','b','c','d'), 'a');
sort_array: Sort the elements in the array (cannot control the sorting rule, defaults to ascending order).
# Result: ["a","c","d"] select sort_array(array('a','d','c'));
struct: Declare each attribute in a struct.
Syntax
Grammar | Return | Description |
struct(value1, value2, value3,…) | ㅤ | Construct a struct from the input values. |
Example
# Result: {"col1":"name","col2":"age","col3":"weight"} select struct('name', 'age','weight');
named_struct: Declare the attributes and values of a struct.
select named_struct('name', 'xiaosong', 'age',19,'weight',80)
7.2 Advanced aggregate functions.
For common aggregate functions, please refer to section 6.1.7 Aggregate Functions.
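The two collect functions below are demonstrated on an employee table that is not created anywhere else in these notes; a minimal hypothetical version that makes the examples runnable could look like this:
-- Hypothetical table for the collect_list / collect_set examples
create table employee (
    name string,
    sex  string,
    job  string
);
insert into table employee values
    ('Zhang San', 'male',   'teacher'),
    ('Li Si',     'male',   'driver'),
    ('Wang Wu',   'female', 'teacher'),
    ('Zhao Liu',  'female', 'teacher');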
collect_list: Collect and form a list collection, without removing duplicates in the result.
select sex,collect_list(job) from employee group by sex;
collect_set: Collect and form a set collection, with duplicates removed from the result.
select sex,collect_set(job) from employee group by sex;
7.3 Explode function.
UDTF (User-defined Table Generating Function): User-defined table generating functions. One example is the explode function, which transforms a single row of data into multiple rows.
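Before the full example below, a minimal hedged illustration of explode on literal values:
-- One row in, three rows out
select explode(array('suspenseful', 'action', 'plot'));
-- explode on a map produces two columns (key and value)
select explode(map('xiaohai', 1, 'dahai', 2));
-- lateral view joins the exploded rows back to the original row
select t.id, c
from (select 1 as id, array('a', 'b') as arr) t
lateral view explode(arr) tmp as c;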

Example:
Requirement: Please count the number of movies in each category. Expected results are as follows.
plot | 1 |
action | 3 |
psychology | 1 |
suspenseful | 2 |
war | 1 |
disaster | 1 |
science fiction | 1 |
cops | 1 |
- Create table
create table movie_info(
    movie string,
    category string
)
row format delimited fields terminated by "\t";
- Insert
insert overwrite table movie_info values
("suspect tracking", "suspenseful,action,science fiction,plot"),
("lie to me", "suspenseful,cops,action,psychology,plot"),
("wolf warrior 2", "war,action,disaster");
- Implementation
select cate, count(*)
from (
    select movie, cate
    from (
        select movie, split(category, ',') cates
        from movie_info
    ) t1 lateral view explode(cates) tmp as cate
) t2
group by cate;

7.4 Window functions.
A window function can assign a window to each row of data, then perform calculations on the data within that window's range, and finally return the calculation result to the row of data. The syntax of window functions mainly consists of two parts: the "window" and the "function." The "window" defines the calculation scope, while the "function" defines the calculation logic.
Syntax:
select [column_name_1], [column_name_2], function([column_name]) over ([window specification]) [alias] from [table_name];
- Most aggregate functions can be used as the function part in window functions.
- The window scope is primarily divided into two types: one is row-based (e.g., requiring the window for each row to be from the previous row to the current row), and the other is value-based (e.g., requiring the window for each row to be from the current value minus 1 to the current value).

unbounded preceding | No lower bound; negative infinity (i.e., the window starts at the first row) |
[num] preceding | num rows before the current row |
current row | The current row |
[num] following | num rows after the current row |
unbounded following | No upper bound; positive infinity (i.e., the window ends at the last row) |

For value-based windows, the column after order by is the "value" referred to in the definition: [num] preceding then means the current value minus num, and [num] following means the current value plus num. Because this involves numeric addition and subtraction, the order by column must be an integer type such as int or smallint, not a decimal. If the frame is simply unbounded preceding to current row, this restriction does not apply.
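A hedged sketch of a value-based window on the same (assumed) order_info table, taking amount to be an integer type as the note above requires: for each order, sum this user's orders whose amount lies within 100 of the current amount.
select
    order_id,
    user_id,
    amount,
    sum(amount) over (
        partition by user_id
        order by amount
        range between 100 preceding and 100 following
    ) as nearby_amount_sum
from order_info;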

Example:
select
    order_id,
    order_date,
    amount,
    sum(amount) over (partition by user_id order by order_date rows between unbounded preceding and current row) total_amount
from order_info;

7.5 Cross-row value retrieval function.
lead & lag
Retrieve the value of a field from a row above or below the current row.
Lead & lag do not support custom window ranges, meaning
range between cannot be used with them.
Syntax:
select
    order_id,
    user_id,
    order_date,
    amount,
    -- lag(field_name, offset, default_value)
    lag(order_date, 1, '1970-01-01') over (partition by user_id order by order_date) last_date,
    -- lead(field_name, offset, default_value)
    lead(order_date, 1, '9999-12-31') over (partition by user_id order by order_date) next_date
from order_info;
first_value & last_value
Retrieve the first value or last value of a column within a window.
first_value & last_value allow custom window ranges, meaning
range between can be used with them.
Syntax:
select
    order_id,
    user_id,
    order_date,
    amount,
    -- first_value(field_name, ignore_nulls)
    first_value(order_date, false) over (partition by user_id order by order_date) as first_date,
    -- last_value(field_name, ignore_nulls)
    last_value(order_date, false) over (partition by user_id order by order_date) as last_date
from order_info;
7.6 Ranking functions.
Ranking functions do not support custom windows, meaning
range between cannot be used.
rank: Same values get the same rank, with gaps (e.g., two tied for first place results in ranks: 1, 1, 3).
dense_rank: Same values get the same rank, without gaps (e.g., two tied for first place results in ranks: 1, 1, 2).
row_number: Incremental row numbers.
Syntax:
select
    stu_id,
    course,
    score,
    rank() over (partition by course order by score desc) rk,
    dense_rank() over (partition by course order by score) dense_rk,
    row_number() over (partition by course order by score desc) rn
from score_info;

7.7 User-defined functions.
User-defined functions are categorized into the following three types:
- UDF (User-Defined Function): One input, one output
- UDAF (User-Defined Aggregation Function): Multiple inputs, one output (e.g., count/max/min)
- UDTF (User-Defined Table-generating Functions): One input, multiple outputs (e.g., lateral view explode)
Requirement:
Customize a UDF to calculate the length of a given primitive data type.
Example:
- Create a Maven project named Hive-UDF.
- Import dependencies.
<dependencies>
    <dependency>
        <groupId>org.apache.hive</groupId>
        <artifactId>hive-exec</artifactId>
        <version>3.1.3</version>
    </dependency>
</dependencies>
- Write the code.
package com.artists.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

/**
 * @Author: ArtistS
 * @Site: https://github.com/ArtistSu
 * @Create: 2023/5/21 17:45
 * @Description: Custom UDF to calculate the length of a string
 **/
public class MyLength extends GenericUDF {

    @Override
    public ObjectInspector initialize(ObjectInspector[] objectInspectors) throws UDFArgumentException {
        // Return int type inspector because our function calculates the length of a field
        if (objectInspectors.length != 1) {
            throw new UDFArgumentException("Only accept one argument");
        }
        ObjectInspector argument = objectInspectors[0];
        if (ObjectInspector.Category.PRIMITIVE != argument.getCategory()) {
            throw new UDFArgumentException("Only accept primitive argument");
        }
        PrimitiveObjectInspector primitiveObjectInspector = (PrimitiveObjectInspector) argument;
        if (primitiveObjectInspector.getPrimitiveCategory() != PrimitiveObjectInspector.PrimitiveCategory.STRING) {
            throw new UDFArgumentException("Only accept string argument");
        }
        return PrimitiveObjectInspectorFactory.javaIntObjectInspector;
    }

    @Override
    public Object evaluate(DeferredObject[] deferredObjects) throws HiveException {
        // evaluate() is called once per row of data
        DeferredObject deferredObject = deferredObjects[0];
        Object obj = deferredObject.get();
        if (obj == null) {
            return 0;
        }
        return obj.toString().length();
    }

    @Override
    public String getDisplayString(String[] strings) {
        // This method is called for display in execution plans (e.g., when using explain)
        return "my_length(" + (strings.length > 0 ? strings[0] : "") + ")";
    }
}
- Create a temporary function (only valid in the current session).
- Package it into a jar file and upload it to the server at /opt/module/hive-3.1.3/datas.
- Create a temporary function and associate it with the developed Java class.
- Test
0: jdbc:hive2://hadoop102:10000> add jar /opt/module/hive-3.1.3/datas/myudf.jar;
0: jdbc:hive2://hadoop102:10000> create temporary function my_len as "com.artists.hive.udf.MyLength";
0: jdbc:hive2://hadoop102:10000> show functions like "my_len";
0: jdbc:hive2://hadoop102:10000> select my_len("abc");
- Create a permanent function (exists forever).
- Upload the jar to HDFS.
- Create a permanent function.
- Test
Because ADD JAR itself is only temporarily effective, when creating a permanent function you need to specify a jar path (and, for metadata reasons, this path must be on HDFS).
[sylgg0918@hadoop102 datas]$ hdfs dfs -put /opt/module/hive-3.1.3/datas/myudf.jar /udf/
0: jdbc:hive2://hadoop102:10000> create function my_len as "com.artists.hive.udf.MyLength" using jar "hdfs://hadoop102:8020/udf/myudf.jar";
0: jdbc:hive2://hadoop102:10000> show functions like "*my_len*";
+-----------------+
|    tab_name     |
+-----------------+
| default.my_len  |
+-----------------+
0: jdbc:hive2://hadoop102:10000> select default.my_len("abc");
+------+
| _c0  |
+------+
| 3    |
+------+
8. Partitioned tables and bucketed tables.
8.1 Partitioned table
In Hive, partitioning refers to the process of dispersing the data of a large table into multiple directories based on business needs. Each directory becomes a partition of the table. During queries, the desired partitions are selected using expressions in the WHERE clause, which significantly improves query efficiency.
8.1.1 Syntax
Syntax:
create table dept_partition (
    deptno int,
    dname string,
    loc string
)
partitioned by (day string)
row format delimited fields terminated by '\t';
Write Data
load
- Data preparation
vim /opt/module/hive-3.1.3/datas/dept_20230521.log
# Enter the following content
10 Administration 1700
20 Finance 1800
- Load statement.
load data local inpath '/opt/module/hive-3.1.3/datas/dept_20230521.log' into table dept_partition partition(day='20230503');
- Test
select * from dept_partition;
insert
Requirement:
Insert data from the day='20230521' partition into the day='20230523' partition.
Example:
insert overwrite table dept_partition partition (day = '20230523') select deptno, dname, loc from dept_partition where day = '20230521';
Read Data
select deptno, dname, loc, day from dept_partition where day = '20230523';
View partitions.
show partitions [table_name];
# e.g.
show partitions dept_partition;
Add partition.
alter table dept_partition add partition([partition_field_name]=[partition_value]);
# e.g. Add a single partition
alter table dept_partition add partition(day='20230524');
# e.g. Add multiple partitions: no commas between the partition clauses
alter table dept_partition add partition(day='20230525') partition(day='20230526');
Delete partition.
Partitioned tables are also affected by whether they are managed (internal) tables or external tables. If it is an internal table, when a partition is deleted, both the HDFS data and the metadata will be removed. If it is an external table, only the metadata will be deleted.
alter table dept_partition drop partition([partition_field_name]=[partition_value]);
# e.g. Delete a single partition
alter table dept_partition drop partition (day='20230523');
# e.g. Delete multiple partitions, with commas between the partition clauses
alter table dept_partition drop partition (day='20230521'), partition(day='20230523');
Repair partition.
Hive stores all partition information of a partitioned table in its metadata. The partitioned table can only read and write data normally when the metadata is consistent with the partition paths on HDFS.
- If a user manually creates or deletes partition paths (i.e., operates on the HDFS directories directly rather than through Hive), Hive will not be aware of these changes, which leads to inconsistencies between Hive's metadata and the partition paths on HDFS.
- Similarly, if the partitioned table is an external table and the user executes the drop partition command, the partition metadata will be deleted, but the partition path on HDFS will not be removed. This also causes inconsistencies between Hive's metadata and the partition paths on HDFS.
If inconsistencies between metadata and HDFS paths occur, they can be resolved using the following methods:
add partition
If HDFS partition paths were created manually, Hive cannot recognize them; you can manually run the add partition command to add the missing partition metadata, thereby aligning the metadata with the partition paths.
drop partition
If HDFS partition paths were deleted manually while the partition metadata still exists, Hive will likewise be out of sync; you can run the drop partition command to delete the stale partition metadata, thereby aligning the metadata with the partition paths.
msck
If partition metadata is inconsistent with HDFS partition paths, you can also use the msck command to repair it. Below is an explanation of how to use this command.
Syntax:
# msck => metastore consistency check
msck repair table [table_name] [[add | drop | sync] partitions];
# e.g. Adds partition metadata for HDFS paths that exist but are missing from the metadata
msck repair table dept_partition add partitions;
# e.g. Deletes partition metadata for partitions whose HDFS paths have been removed
msck repair table dept_partition drop partitions;
# e.g. Synchronizes HDFS paths and partition metadata, equivalent to running both of the commands above
msck repair table dept_partition sync partitions;
# e.g. The following command is equivalent to msck repair table dept_partition add partitions
msck repair table dept_partition;
8.1.2 Secondary partitioned table
If the volume of log data within a single day is also substantial, how can the data be further split? The answer is a secondary partitioned table. For example, beyond daily partitioning, data within each day can be further partitioned by hour.
Syntax:
# Secondary partitioning means 2 partition fields; tertiary partitioning means 3 partition fields.
# Table creation statement
create table dept_partition2 (
    deptno int,
    dname string,
    loc string
)
partitioned by (day string, hour string)
row format delimited fields terminated by '\t';
# Data loading statement
load data local inpath '/opt/module/hive-3.1.3/datas/dept_20230521.log'
into table dept_partition2 partition (day = '20230523', hour = '12');
# Query partition data
select deptno, dname, loc, day from dept_partition2 where day = '20230523' and hour = '12';
8.1.3 Dynamic partitioning
Dynamic partitioning refers to when inserting data into a partitioned table, the target partition is not specified by the user but is dynamically determined by the value of the last field in each row of data. Using dynamic partitioning, data can be written to multiple partitions with just one INSERT statement.
Global dynamic partitioning function switch (default: true, enabled).
set hive.exec.dynamic.partition=true
Strict mode & Non-strict mode.
# The dynamic partition mode defaults to 'strict', which requires at least one partition field to be given a static (fixed) value.
# 'nonstrict' allows every partition field to be determined dynamically.
# For example, for a table partitioned by (id, age, name), strict mode requires at least one of the three partition fields to have a fixed value in the insert statement.
# For a table with a single partition field, strict mode cannot be satisfied when that field is dynamic, so the mode should be set to 'nonstrict'.
set hive.exec.dynamic.partition.mode=nonstrict;
The maximum number of partitions that can be created by a single INSERT statement is 1000 by default.
set hive.exec.max.dynamic.partitions=1000
The maximum number of partitions that a single Mapper or Reducer can create simultaneously is 100 by default.
set hive.exec.max.dynamic.partitions.pernode=100
The maximum number of files that a single INSERT statement can create is 100,000 by default.
set hive.exec.max.created.files=100000
Whether to throw an exception when the query result is empty and dynamic partitioning is performed; the default is false.
set hive.error.on.empty.partition=false
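Putting these parameters together, the following is a minimal sketch of a dynamic-partition insert into the dept_partition table defined earlier. It assumes a non-partitioned staging table dept_src(deptno, dname, loc, day); dept_src is hypothetical and used only for illustration.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- The last column of the select (day) determines the target partition of each row,
-- so one insert statement can write to many partitions at once.
insert into table dept_partition partition (day)
select deptno, dname, loc, day
from dept_src;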
8.2 Bucketed table
Partitioning provides a convenient way to isolate data and optimize queries. However, not all datasets can be reasonably partitioned. For a table or partition, Hive can further organize data into buckets, which is a more fine-grained division of data ranges. Partitioning deals with data storage paths, while bucketing deals with data files.
The basic principle of a bucketed table is as follows: First, calculate a hash value for the data in a specified field for each row. Then, take the modulus of this hash value with a specified number of buckets. Finally, rows with the same modulus result are written into the same file, which is called a bucket.
There is no inherent connection between partitioning and bucketing. A table can be both a partitioned table and a bucketed table.
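For example, a sketch of a table that is partitioned by day and bucketed by user_id; the table and column names here are illustrative, not part of the original examples.
create table order_part_buck (
    order_id int,
    user_id  int,
    amount   decimal(16, 2)
)
partitioned by (day string)
clustered by (user_id) into 4 buckets
row format delimited fields terminated by '\t';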
Syntax:
# Create table statement
create table stu_buck (
    id int,
    name string
)
clustered by (id) into 4 buckets
row format delimited fields terminated by '\t';
# Prepare data
vim /opt/module/hive-3.1.3/datas/student.txt
# Insert the following data
1001 student1
1002 student2
1003 student3
1004 student4
1005 student5
1006 student6
1007 student7
1008 student8
1009 student9
1010 student10
1011 student11
1012 student12
1013 student13
1014 student14
1015 student15
1016 student16
# Load data into the bucketed table
# Note: Newer versions of Hive can run MapReduce directly when loading data. Older versions of Hive require transferring the data to a table first and then loading it into the bucketed table via a query.
load data local inpath '/opt/module/hive-3.1.3/datas/student.txt' into table stu_buck;
8.3 Bucketed sorted table
# Table creation statement. The bucketing field and the sorting field can be different.
# Neither is limited to a single column; both can be multiple columns.
create table stu_buck_sort (
    id int,
    name string
)
clustered by (id) sorted by (id) into 4 buckets
row format delimited fields terminated by '\t';
# Prepare data
vim /opt/module/hive-3.1.3/datas/student.txt
# Insert the following data
1001 student1
1002 student2
1003 student3
1004 student4
1005 student5
1006 student6
1007 student7
1008 student8
1009 student9
1010 student10
1011 student11
1012 student12
1013 student13
1014 student14
1015 student15
1016 student16
# Load data into the bucketed sorted table
load data local inpath '/opt/module/hive-3.1.3/datas/student.txt' into table stu_buck_sort;
9. File formats and compression.
9.1 Overview of Hadoop Compression
Compression Format | Algorithm | File Extension | Is Splittable |
DEFLATE | DEFLATE | .deflate | N |
Gzip | DEFLATE | .gz | N |
bzip2 | bzip2 | .bz2 | Y |
LZO | LZO | .lzo | Y |
Snappy | Snappy | .snappy | N |
To support multiple compression/decompression algorithms, Hadoop introduced codecs, as shown below.
Hadoop checks for supported compression methods via hadoop checknative.
Hadoop sets compression on the driver side.
Compression Format | Corresponding codec |
DEFLATE | org.apache.hadoop.io.compress.DefaultCodec |
gzip | org.apache.hadoop.io.compress.GzipCodec |
bzip2 | org.apache.hadoop.io.compress.BZip2Codec |
LZO | com.hadoop.compression.lzo.LzopCodec |
Snappy | org.apache.hadoop.io.compress.SnappyCodec |
Compression performance comparison
Compression Algorithm | Original File Size | Compressed File Size | Compression Speed | Decompression Speed |
gzip | 8.3GB | 1.8GB | 17.5MB/s | 58MB/s |
bzip2 | 8.3GB | 1.1GB | 2.4MB/s | 9.5MB/s |
LZO | 8.3GB | 2.9GB | 49.3MB/s | 74.6MB/s |
On a single core of a Core i7 processor in 64-bit mode, Snappy compresses at about 250MB/s or more and decompresses at about 500MB/s or more.
9.2 Hive file formats
Choosing an appropriate file format for the data in Hive tables is highly beneficial for improving query performance. The storage format for Hive table data can be selected from options such as Text File, ORC, Parquet, Sequence File, etc.
9.2.1 Text File
Text File is the default file format used by Hive. Each line in a text file corresponds to a row in a Hive table. The file format can be specified as Text File through the following table creation statement. This format is row-based storage.
create table textfile_table (column_specs) stored as textfile;
9.2.2 ORC (Optimized Row Columnar)
The ORC (Optimized Row Columnar) file format is a columnar storage format introduced in Hive 0.11. ORC files can improve Hive's performance in reading, writing, and processing data. The opposite of columnar storage is row-based storage. The figure below compares the two (the left side is the logical table; the first layout on the right is row-based storage, and the second is columnar storage).

(1) Characteristics of row-based storage
When querying an entire row that meets certain conditions, columnar storage has to fetch the value of each field from a different column, whereas row-based storage only needs to locate one value, because the remaining values of that row are stored adjacently. For this kind of query, row-based storage therefore offers faster performance.
By way of SQL analogy, a full-row point lookup (selecting every field of a single row) is the case that favors row-based storage, whereas the query below reads only a few columns and has no WHERE clause, which is exactly the case where columnar storage significantly improves performance:
select id, name, age from stu_table;
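For contrast, a minimal sketch of the full-row lookup mentioned above (stu_table and the id value are illustrative):
-- Every field of a single row is needed; in a row-oriented file those fields sit next to each other on disk.
select * from stu_table where id = 1001;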
(2) Characteristics of columnar storage.
Since the data for each field is stored together, queries that require only a few fields can significantly reduce the amount of data read. The data type for each field is necessarily the same, allowing columnar storage to design more targeted and efficient compression algorithms.
As mentioned earlier, Text File and Sequence File are row-based storage formats, while ORC and Parquet are columnar storage formats.
The diagram below illustrates the logical and physical storage structure of ORC:

Component | Description |
Header | Content is ORC, used to denote the file type. |
Index Data | A lightweight index, created by default every 10,000 rows for each column. Each index entry records the position of that group of rows as well as the maximum and minimum values within those 10,000 rows. |
Row Data | Stores the actual data, organized by column. Each column is encoded and split into multiple streams for storage. |
Stripe Footer | Stores the positions of each stream and the encoding information for each column. |
Each stripe in the Body consists of Index Data, Row Data, and a Stripe Footer. The Tail is made up of the File Footer and the Postscript. The File Footer contains the actual position of each stripe, the index length, the data length, and per-column statistics, among other details. The Postscript records the compression type of the entire file and the length of the File Footer, etc.
When reading an ORC file, the process starts from the last byte to read the length of the Postscript, then reads the Postscript itself. From there, it parses the length of the File Footer, reads the File Footer, and finally extracts the information about each stripe. In other words, the file is read from the end backward.
Table creation statement:
create table orc_table (column_specs) stored as orc tblproperties (property_name=property_value, ...);
The ORC file format supports the following parameters:
Parameter | Default | Description |
orc.compress | ZLIB | Compression format, options: NONE, ZLIB, SNAPPY |
orc.compress.size | 262,144 | Size of each compression block (ORC files are compressed in blocks) |
orc.stripe.size | 67,108,864 | Size of each stripe (usually set to the same size as Hadoop's block, e.g., 128M) |
orc.row.index.stride | 10,000 | Index stride (interval at which an index is created, e.g., every n rows). |
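As a sketch of how these parameters are passed through tblproperties (the table and column names are illustrative; the stripe size and index stride simply restate the defaults, while the compression format is switched to Snappy):
create table orc_demo (
    id  int,
    msg string
)
stored as orc
tblproperties (
    "orc.compress" = "SNAPPY",
    "orc.stripe.size" = "67108864",
    "orc.row.index.stride" = "10000"
);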
9.2.3 Parquet
Parquet is a general-purpose file format in the Hadoop ecosystem, and it is also a columnar storage format.
The structure of a Parquet file is shown in the figure below:

The figure above shows the basic structure of a Parquet file. The file begins and ends with the file's Magic Code, which is used to verify whether it is a Parquet file.
Between the header and the trailer, the file consists of a number of Row Groups and one Footer (File Meta Data).
Each Row Group contains multiple Column Chunks, and each Column Chunk contains multiple Pages.
Component | Description |
Row Group | A row group corresponds to a number of rows in the logical table. |
Column Chunk | Each column within a row group is stored in a column chunk. |
Page | The data of a column chunk is divided into a number of pages. |
Footer (File Meta Data) | Stores the metadata of every Column Chunk in every Row Group, including the column's data type, its encoding, and the locations of its Data Pages. |
Table creation statement:
create table parquet_table (column_specs) stored as parquet tblproperties (property_name=property_value,...);
Parquet supports the following parameters:
Parameter | Default | Description |
parquet.compression | uncompressed | Compression format, options: uncompressed, snappy, gzip, lzo, brotli, lz4 |
parquet.block.size | 134217728 | Row group size, usually kept consistent with the HDFS block size |
parquet.page.size | 1048576 | Page size |
9.3 Compression
Maintaining data compression within Hive tables and during computation processes is highly beneficial for efficient disk space utilization and improving query performance.
9.3.1 Compression of Hive table data
In Hive, the way to enable data compression varies depending on the file type of the table.
(1) TextFile
If a table's file type is TextFile and the data in the table needs to be compressed, in most cases, no special declaration is required in the table creation statement. Simply import the compressed file directly into the table, and Hive will automatically recognize its compression format and decompress it when querying the data.
However, when executing the SQL statement to load data into the table, users need to set the following parameters to ensure that the data written to the table is compressed.
# Whether the final output of the SQL statement is compressed
set hive.exec.compress.output=true;
# Compression codec of the output (snappy in the example below)
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
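End to end, the write path might look like the following sketch, assuming a destination table dest_text stored as textfile and a source table src, both of which are hypothetical names used only for illustration:
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
-- The files written into dest_text will be Snappy-compressed text files;
-- later queries on dest_text decompress them transparently.
insert overwrite table dest_text
select * from src;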
(2) ORC
If the file type of a table is ORC and compression of the table data is required, the compression format must be declared in the table creation statement as follows:
create table orc_table (column_specs) stored as orc tblproperties ("orc.compress"="snappy");
(3) Parquet
If the file type of a table is Parquet and compression of the table data is required, the compression format must be declared in the table creation statement as follows:
create table parquet_table (column_specs) stored as parquet tblproperties ("parquet.compression"="snappy");
9.3.2 Using compression during computation.
(1) Compressing the intermediate results of a single MapReduce job
The intermediate results of a single MapReduce job refer to the data output by the Mapper. Compressing this data can reduce network I/O during the shuffle phase. This can be configured using the following parameters:
# Enable compression of MapReduce intermediate (map output) data
set mapreduce.map.output.compress=true;
# Set the compression codec for MapReduce intermediate data (snappy in the example below)
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
(2) Compressing intermediate results of a single SQL statement
The intermediate results of a single SQL statement refer to the temporary data between two MapReduce jobs (a SQL statement may require computation through MapReduce). This can be configured using the following parameters:
# Whether to compress the temporary data passed between two MapReduce jobs
set hive.exec.compress.intermediate=true;
# Compression codec (snappy in the example below)
set hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
10. Enterprise-level tuning.
10.1 Computing resource allocation.
10.1.1 Yarn Resource configuration
The YARN parameters that need to be tuned all relate to resources such as CPU and memory. The core configuration items are as follows:
Parameter | Property | Description |
yarn.nodemanager.resource.memory-mb | <property><name>yarn.nodemanager.resource.memory-mb</name><value>65536</value></property> | Memory that a NodeManager node makes available to Containers. The value depends on the total memory of the node where the NodeManager runs and on the other services running on that node (the previous configuration was 64G; typically 1/2 to 2/3 of the total memory is allocated). |
yarn.nodemanager.resource.cpu-vcores | <property><name>yarn.nodemanager.resource.cpu-vcores</name><value>16</value></property> | CPU cores that a NodeManager node makes available to Containers. The value likewise depends on the total CPU cores of the node and on the other services running on it (the previous configuration was 16 cores, roughly a 1:4 ratio with the parameter above, i.e., 1 core for every 4 GB of memory). |
yarn.scheduler.maximum-allocation-mb | <property><name>yarn.scheduler.maximum-allocation-mb</name><value>16384</value></property> | Maximum memory a single Container can use (the previous configuration was 16G). |
yarn.scheduler.minimum-allocation-mb | <property><name>yarn.scheduler.minimum-allocation-mb</name><value>512</value></property> | Minimum memory a single Container can use (the previous configuration was 512M). |
The parameters above need to be modified in $HADOOP_HOME/etc/hadoop/yarn-site.xml. After making the changes, remember to distribute the configuration to the other nodes and then restart YARN.
10.1.2 MapReduce Resource configuration
MapReduce resource configuration primarily includes the memory and CPU core allocation for Map Tasks, as well as the memory and CPU core allocation for Reduce Tasks.
Parameter | Description |
mapreduce.map.memory.mb | Memory size of the container requested by a single Map Task. Its default value is 1024 MB. This value must not exceed the range defined by yarn.scheduler.maximum-allocation-mb and yarn.scheduler.minimum-allocation-mb. This parameter needs to be configured separately for different computing tasks. In Hive, you can configure it per SQL statement as follows: set mapreduce.map.memory.mb=2048; |
mapreduce.map.cpu.vcores | CPU core count of the container requested by a single Map Task. Its default value is 1, and this value generally does not require adjustment. |
mapreduce.reduce.memory.mb | Memory size of the container requested by a single Reduce Task. Its default value is 1024 MB. This value also must not exceed the range defined by yarn.scheduler.maximum-allocation-mb and yarn.scheduler.minimum-allocation-mb. This parameter needs to be configured separately for different computing tasks. In Hive, you can configure it per SQL statement as follows: set mapreduce.reduce.memory.mb=2048; |
mapreduce.reduce.cpu.vcores | CPU core count of the container requested by a single Reduce Task. Its default value is 1, and this value generally does not require adjustment. |
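For instance, a per-query sketch that raises the task container memory before a heavy aggregation (the values are illustrative and must stay within the yarn.scheduler minimum/maximum range; order_detail is the table used in the EXPLAIN example below):
set mapreduce.map.memory.mb=2048;
set mapreduce.map.cpu.vcores=1;
set mapreduce.reduce.memory.mb=2048;
set mapreduce.reduce.cpu.vcores=1;
select user_id, count(*) from order_detail group by user_id;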
10.2 Explain - View execution plan.
10.2.1 Explain - Execution Plan Overview.
The execution plan presented by EXPLAIN consists of a series of Stages, which have dependencies on each other. Each Stage corresponds to a MapReduce Job, a file system operation, or another action.
If a Stage corresponds to a MapReduce Job, the computational logic of its Map side and Reduce side is described by the Map Operator Tree and the Reduce Operator Tree, respectively. An Operator Tree is composed of a series of Operators, where each Operator represents a single logical operation in the Map or Reduce phase, such as a TableScan Operator, Select Operator, or Join Operator.

10.2.2 Explain - Syntax
Syntax:
# FORMATTED: Outputs the execution plan as a JSON string.
# EXTENDED: Outputs additional information from the execution plan, typically including file names for reads and writes, etc.
# DEPENDENCY: Outputs the tables and partitions read by the execution plan.
EXPLAIN [FORMATTED | EXTENDED | DEPENDENCY] query-sql;
Example:
explain select user_id, count(*) from order_detail group by user_id;