APACHE HIVE: Hive Queries

Hive queries provides the following features:

Data modeling such as Creation of databases, tables, etc.
ETL functionalities such as Extraction, Transformation, and Loading data into tables
Joins to merge different data tables
User specific custom scripts for ease of code
Faster querying tool on top of Hadoop

create table "employees_manma" with 6 columns

create table employees_manma(Id INT, Name, String, Age Int, Address String, salary float, department string)

row format delimited

fields terminated by ',';

Now load data into table employees_manma.

load data local inpath '/path_location/' into table employees_manma;

Order by query:

The ORDER BY syntax in HiveQL is similar to the syntax of ORDER BY in SQL language.

Order by is the clause we use with "SELECT" statement in Hive queries, which guarantees total ordering of data. Order by clause use columns on Hive tables for grouping particular column values mentioned with Order by. For whatever the column name we are defining the order by clause the query will selects and display results by ascending or descending order the particular column values.

If the mentioned order by field is a string, then it will display the result in lexicographical order. At the back end, it has to be passed on to a single reducer

Query:

SELECT * FROM employees_manma ORDER BY Department;

Group by query:

This clause use columns on Hive tables for grouping particular column values mentioned with the group by. For whatever the column name we are defining a "groupby" clause the query will selects and display results by grouping the particular column values.

Query that is performed on the "employees_manma" table with the GROUP BY clause with Department as defined GROUP BY column name.

Output showing here is the department name, and the employees count in different departments. Here all the employees belong to the specific department is grouped by and displayed in the results. So the result is department name with the total number of employees present in each department.

Query:

SELECT Department, count(*) FROM employees_manma GROUP BY Department;

Sort by:

Sort by clause performs on column names of Hive tables to sort the output. We can mention DESC for sorting the order in descending order and mention ASC for Ascending order of the sort.

In this sort by it will sort the rows before feeding to the reducer. Always sort by depends on column types.

For instance, if column types are numeric it will sort in numeric order if the columns types are string it will sort in lexicographical order.

query that performing on the table "employees_manma" with the SORT BY clause with "id" as define SORT BY column name. We used keyword DESC.

So the output displayed will be in descending order of "id".

Query:

SELECT * from employees_manma SORT BY Id DESC;

Cluster By:

Cluster By used as an alternative for both Distribute BY and Sort BY clauses in Hive-QL.

Cluster BY clause used on tables present in Hive. Hive uses the columns in Cluster by to distribute the rows among reducers. Cluster BY columns will go to the multiple reducers.

It ensures sorting orders of values present in multiple reducers

For example, Cluster By clause mentioned on the Id column name of the table employees_manma table. The output when executing this query will give results to multiple reducers at the back end. But as front end it is an alternative clause for both Sort By and Distribute By.

This is actually back end process when we perform a query with sort by, group by, and cluster by in terms of Map reduce framework. So if we want to store results into multiple reducers, we go with Cluster By.

Query that performs CLUSTER BY clause on Id field value. Here it's going to get a sort on Id values.

It displays the Id and Names present in the employees_manma sort ordered by

Query:

SELECT Id, Name from employees_manma CLUSTER BY Id;

Distribute By:

Distribute BY clause used on tables present in Hive. Hive uses the columns in Distribute by to distribute the rows among reducers. All Distribute BY columns will go to the same reducer.

It ensures each of N reducers gets non-overlapping ranges of column

It doesn't sort the output of each reducer

DISTRIBUTE BY Clause performing on Id of "empoloyees_manma" table

Output showing Id, Name. At back end, it will go to the same reducer

Query:

SELECT Id, Name from employees_manma DISTRIBUTE BY Id;

Join queries:

Join queries can perform on two tables present in Hive. For understanding Join Concepts in clear here we are creating two tables overhere,

Sample_joins( Related to Customers Details )

Sample_joins1( Related to orders details done by Employees)

Step 1) Creation of table "sample_joins" with Column names ID, Name, Age, address and salary of the employees

create table sample_joins(Id INT, Name STRING, Age INT, Address STRING, Salary FLOAT)

row format delimited

fields terminated by ',';

Step 2) Loading and Displaying Data

load data local inpath '/local_path/Customers.txt' into TABLE sample_joins;

Loading data into sample_joins from Customers.txt

Step 3) Displaying sample_joins table contents

select * from sample_joins;

Step 4) Creation of sample_joins1 table and loading, displaying data

create table sample_joins1(OrderId INT, Date1 TIMESTAMP, Id INT, Amount FLOAT)

row format delimited

fields terminated by ',';

Step 5) Loading and Displaying Data

load data local inpath '/local_path/Orders.txt' into TABLE sample_joins1;

Step 6) Displaying sample_joins1 table contents

select * from sample_joins1;

Some points to observe in Joins:

Only Equality joins are allowed In Joins
More than two tables can be joined in the same query
LEFT, RIGHT, FULL OUTER joins exist in order to provide more control over ON Clause for which there is no match
Joins are not Commutative
Joins are left-associative irrespective of whether they are LEFT or RIGHT joins

Different type of joins. Joins are of 4 types, these are

Inner join
Left outer Join
Right Outer Join
Full Outer Join

Inner Join:

The Records common to the both tables will be retrieved by this Inner Join.

we can observe the following

Here we are performing join query using JOIN keyword between the tables sample_joins and sample_joins1 with matching condition as (c.Id= o.Id).
The output displaying common records present in both the table by checking the condition mentioned in the query

Query:

SELECT c.Id, c.Name, c.Age, o.Amount FROM sample_joins c JOIN sample_joins1 o ON(c.Id=o.Id);

Left Outer Join:

Hive query language LEFT OUTER JOIN returns all the rows from the left table even though there are no matches in right table
If ON Clause matches zero records in the right table, the joins still return a record in the result with NULL in each column from the right table

We can observe the following :

Here we are performing join query using "LEFT OUTER JOIN" keyword between the tables sample_joins and sample_joins1 with matching condition as (c.Id= o.Id). For example here we are using employee id as a reference, it checks whether id is common in right as well as left the table or not. It acts as matching condition.
The output displaying common records present in both the table by checking the condition mentioned in the query.

NULL values in the above output are columns with no values from Right table that is sample_joins1

Query:

SELECT c.Id, c.Name, o.Amount, o.Date1 FROM sample_joins c LEFT OUTER JOIN sample_joins1 o ON(c.Id=o.Id)

Right outer Join:

Hive query language RIGHT OUTER JOIN returns all the rows from the Right table even though there are no matches in left table
If ON Clause matches zero records in the left table, the joins still return a record in the result with NULL in each column from the left table
RIGHT joins always return records from a Right table and matched records from the left table. If the left table is having no values corresponding to the column, it will return NULL values in that place.

We can observe the following

Here we are performing join query using "RIGHT OUTER JOIN" keyword between the tables sample_joins and sample_joins1 with matching condition as (c.Id= o.Id).
The output displaying common records present in both the table by checking the condition mentioned in the query

Query:

SELECT c.Id, c.Name, o.Amount, o.Date1 FROM sample_joins c RIGHT OUTER JOIN sample_joins1 o ON(c.Id=o.Id)

Full outer join:

It combines records of both the tablessample_joins and sample_joins1 based on the JOIN Condition given in query.
It returns all the records from both tables and fills in NULL Values for the columns missing values matched on either side

We can observe the following:

Here we are performing join query using "FULL OUTER JOIN" keyword between the tables sample_joins and sample_joins1 with matching condition as (c.Id= o.Id).
The output displaying all the records present in both the table by checking the condition mentioned in the query. Null values in output here indicates the missing values from the columns of both tables.

Query

SELECT c.Id, c.Name, o.Amount, o.Date1 FROM sample_joins c FULL OUTER JOIN sample_joins1 o ON(c.Id=o.Id)

Sub queries:

A Query present within a Query is known as a sub query. The main query will depend on the values returned by the subqueries.

Subqueries can be classified into two types

Subqueries in FROM clause

Subqueries in WHERE clause

When to use:

To get a particular value combined from two column values from different tables
Dependency of one table values on other tables
Comparative checking of one column values from other tables

Syntax:

Subquery in FROM clause

SELECT <column names 1, 2…n>From (SubQuery) <TableName_Main>

Subquery in WHERE clause

SELECT <column names 1, 2…n> From<TableName_Main>WHERE col1 IN (SubQuery);

Example:

SELECT col1 FROM (SELECT a+b AS col1 FROM t1) t2

Here t1 and t2 are table names. The colored one is Subquery performed on table t1. Here a and b are columns that are added in a subquery and assigned to col1. Col1 is the column value present in Main table. This column "col1" present in the subquery is equivalent to the main table query in column col1.

What is HiveQL(Hive Query Language)?

Hive provides a CLI to write Hive queries using Hive Query Language (HiveQL). Generally HQL syntax is similar to the SQL syntax that most data analysts are familiar with.

Hive's SQL-inspired language separates the user from the complexity of Map Reduce programming. It reuses familiar concepts from the relational database world, such as tables, rows, columns and schema, to ease learning.

Most interactions tend to take place over a command line interface (CLI). Hive provides a CLI to write Hive queries using Hive Query Language (Hive-QL).

Generally, HiveQL syntax is similar to the SQL syntax that most data analysts are familiar with. Hive supports four file formats those are TEXTFILE, SEQUENCEFILE, ORC and RCFILE (Record Columnar File).

For single user metadata storage Hive uses derby database and
For multiple user Metadata or shared Metadata case Hive uses MYSQL

Built-in operators

Hive provides Built-in operators for Data operations to be implemented on the tables present inside Hive warehouse.

These operators are used for mathematical operations on operands, and it will return specific value as per the logic applied.

Types of Built-in Operators in HIVE are:

Relational Operators
Arithmetic Operators
Logical Operators
Operators on Complex types
Complex type Constructors

Relational Operators:

We use Relational operators for relationship comparisons between two operands.

Operators such as equals, Not equals, less than, greater than …etc
The operand types are all number types in these Operators.

The following Table will give us details about Relational operators and its usage.

Built-in Operator	Description	Operand
X = Y	TRUE if expression X is equivalent to expression Y Otherwise FALSE.	It takes all primitive types
X != Y	TRUE if expression X is not equivalent to expression Y Otherwise FALSE.	It takes all primitive types
X < Y	TRUE if expression X is less than expression Y Otherwise FALSE.	It takes all primitive types
X <= Y	TRUE if expression X is less than or equal to expression Y Otherwise FALSE.	It takes all primitive types
X>Y	TRUE if expression X is greater than expression Y Otherwise FALSE.	It takes all primitive types
X>= Y	TRUE if expression X is greater than or equal to expression Y Otherwise FALSE.	It takes all primitive types
X IS NULL	TRUE if expression X evaluates to NULL otherwise FALSE.	It takes all types
X IS NOT NULL	FALSE If expression X evaluates to NULL otherwise TRUE.	It takes all types
X LIKE Y	TRUE If string pattern X matches to Y otherwise FALSE.	Takes only Strings
X RLIKE Y	NULL if X or Y is NULL, TRUE if any substring of X matches the Java regular expression Y, otherwise FALSE.	Takes only Strings
X REGEXP Y	Same as RLIKE.	Takes only Strings

Arithmetic Operators:

We use Arithmetic operators for performing arithmetic operations on operands

Arithmetic operations such as addition, subtraction, multiplication and division between operands we use these Operators.
The operand types all are number types in these Operators

Sample Example:

2 + 3 gives result 5.

In this example, '+' is theoperator and 2 and 3 are operands. The return value is 5

The following Table will give us details about Arithmetic operators

Built-in Operator	Description	Operand
X + Y	It will return the output of adding X and Y value.	It takes all number types
X - Y	It will return the output of subtracting Y from X value.	It takes all number types
X * Y	It will return the output of multiplying X and Y values.	It takes all number types
X / Y	It will return the output of dividing Y from X.	It takes all number types
X % Y	It will return the remainder resulting from dividing X by Y.	It takes all number types
X & Y	It will return the output of bitwise AND of X and Y.	It takes all number types
X \| Y	It will return the output of bitwise OR of X and Y.	It takes all number types
X ^ Y	It will return the output of bitwise XOR of X and Y.	It takes all number types
~X	It will return the output of bitwise NOT of X.	It takes all number types

Logical Operators:

We use Logical operators for performing Logical operations on operands

Logical operations such as AND, OR, NOT between operands we use these Operators.
The operand types all are BOOLEAN type in these Operators

The following Table will give us details about Logical operators

Operators	Description	Operands
X AND Y	TRUE if both X and Y are TRUE, otherwise FALSE.	Boolean types only
X && Y	Same as X AND Y but here we using && symbol	Boolean types only
X OR Y	TRUE if either X or Y or both are TRUE, otherwise FALSE.	Boolean types only
X \|\| Y	Same as X OR Y but here we using \|\| symbol	Boolean types only
NOT X	TRUE if X is FALSE, otherwise FALSE.	Boolean types only
!X	Same as NOT X but here we using! symbol	Boolean types only

Operators on Complex types:

The following Table will give us details about Complex Type Operators . These are operators which will provide a different mechanism to access elements in complex types.

Operators	Operands	Description
A[n]	A is an Array and n is an integer type	It will return nth element in the array A. The first element has index of 0
M[key]	M is a Map<K, V> and key has type K	It will return the values belongs to the key in the map

Complex type Constructors:

The following Table will give us details about Complex type Constructors. It will construct instances on complex data types. These are of complex data types such as Array, Map and Struct types in Hive.

In this section, we are going to see the operations performed on Complex type Constructors.

Operators	Operands	Description
array	(val1, val2, ...)	It will create an array with the given elements as mentioned like val1, val2
Create_ union	(tag, val1, val2, ...)	It will create a union type with the values that is being mentioned to by the tag parameter
map	(key1, value1, key2, value2, ...)	It will create a map with the given key/value pairs mentioned in operands
Named_struct	(name1, val1, name2, val2, ...)	It will create a Struct with the given field names and values mentioned in operands
STRUCT	(val1, val2, val3, ...)	Creates a Struct with the given field values. Struct field names will be col1, col2, .

Built-in functions

These are functions that already available in Hive. First, we have to check the application requirement, and then we can use this built in functions in our applications. We can call these functions directly in our application.

The syntax and types are mentioned in the following section.

Types of Built-in Functions in HIVE

Collection Functions
Date Functions
Mathematical Functions
Conditional Functions
String Functions
Misc. Functions

Collection Functions:

These functions are used for collections. Collections mean the grouping of elements and returning single or array of elements depends on return type mentioned in function name.

Return Type	Function Name	Description
INT	size(Map<K.V>)	It will fetch and give the components number in the map type
INT	size(Array<T>)	It will fetch and give the elements number in the array type
Array<K>	Map_keys(Map<K.V>)	It will fetch and gives an array containing the keys of the input map. Here array is in unordered
Array<V>	Map_values(Map<K.V>)	It will fetch and gives an array containing the values of the input map. Here array is in unordered
Array<t>	Sort_array(Array<T>)	sorts the input array in ascending order of array and elements and returns it

Date Functions:

These are used to perform Date Manipulations and Conversion of Date types from one type to another type:

Function Name	Return Type	Description
Unix_Timestamp()	BigInt	We will get current Unix timestamp in seconds
To_date(string timestamp)	string	It will fetch and give the date part of a timestamp string:
year(string date)	INT	It will fetch and give the year part of a date or a timestamp string
quarter(date/timestamp/string)	INT	It will fetch and give the quarter of the year for a date, timestamp, or string in the range 1 to 4
month(string date)	INT	It will give the month part of a date or a timestamp string
hour(string date)	INT	It will fetch and gives the hour of the timestamp
minute(string date)	INT	It will fetch and gives the minute of the timestamp
Date_sub(string starting date, int days)	string	It will fetch and gives Subtraction of number of days to starting date
Current_date	date	It will fetch and gives the current date at the start of query evaluation
LAST _day(string date)	string	It will fetch and gives the last day of the month which the date belongs to
trunc(string date, string format)	string	It will fetch and gives date truncated to the unit specified by the format. Supported formats in this : MONTH/MON/MM, YEAR/YYYY/YY.

Mathematical Functions:

These functions are used for Mathematical Operations. Instead of creatingUDFs , we have some inbuilt mathematical functions in Hive.

Function Name	Return Type	Description
round(DOUBLE X)	DOUBLE	It will fetch and returns the rounded BIGINT value of X
round(DOUBLE X, INT d)	DOUBLE	It will fetch and returns X rounded to d decimal places
bround(DOUBLE X)	DOUBLE	It will fetch and returns the rounded BIGINT value of X using HALF_EVEN rounding mode
floor(DOUBLE X)	BIGINT	It will fetch and returns the maximum BIGINT value that is equal to or less than X value
ceil(DOUBLE a), ceiling(DOUBLE a)	BIGINT	It will fetch and returns the minimum BIGINT value that is equal to or greater than X value
rand(), rand(INT seed)	DOUBLE	It will fetch and returns a random number that is distributed uniformly from 0 to 1

Conditional Functions:

These functions used for conditional values checks.

Function Name	Return Type	Description
if(Boolean testCondition, T valueTrue, T valueFalseOrNull)	T	It will fetch and gives value True when Test Condition is of true, gives value False Or Null otherwise.
ISNULL( X)	Boolean	It will fetch and gives true if X is NULL and false otherwise.
ISNOTNULL(X )	Boolean	It will fetch and gives true if X is not NULL and false otherwise.

String Functions:

String manipulations and string operations these functions can be called.

Function Name	Return Type	Description
reverse(string X)	string	It will give the reversed string of X
rpad(string str, int length, string pad)	string	It will fetch and gives str, which is right-padded with pad to a length of length(integer value)
rtrim(string X)	string	It will fetch and returns the string resulting from trimming spaces from the end (right hand side) of X For example, rtrim(' results ') results in ' results'
space(INT n)	string	It will fetch and gives a string of n spaces.
split(STRING str, STRING pat)	array	Splits str around pat (pat is a regular expression).
Str_to_map(text[, delimiter1, delimiter2])	map<String ,String>	It will split text into key-value pairs using two delimiters.

APACHE HIVE

Thursday, 1 August 2019

Hive Queries

No comments:

Post a Comment

Hive with Semi structured data

Search This Blog