Hive queries provides the following features:
- Data modeling such as Creation of databases, tables, etc.
- ETL functionalities such as Extraction, Transformation, and Loading data into tables
- Joins to merge different data tables
- User specific custom scripts for ease of code
- Faster querying tool on top of Hadoop
create table "employees_manma" with 6 columns
create table employees_manma(Id INT, Name, String, Age Int, Address String, salary float, department string)
row format delimited
fields terminated by ',';
Now load data into table employees_manma.
load data local inpath '/path_location/' into table employees_manma;
Order by query:
The ORDER BY syntax in HiveQL is similar to the syntax of ORDER BY in SQL language.
Order by is the clause we use with "SELECT" statement in Hive queries, which guarantees total ordering of data. Order by clause use columns on Hive tables for grouping particular column values mentioned with Order by. For whatever the column name we are defining the order by clause the query will selects and display results by ascending or descending order the particular column values.
If the mentioned order by field is a string, then it will display the result in lexicographical order. At the back end, it has to be passed on to a single reducer
Query:
SELECT * FROM employees_manma ORDER BY Department;
Group by query:
This clause use columns on Hive tables for grouping particular column values mentioned with the group by. For whatever the column name we are defining a "groupby" clause the query will selects and display results by grouping the particular column values.
Query that is performed on the "employees_manma" table with the GROUP BY clause with Department as defined GROUP BY column name.
Output showing here is the department name, and the employees count in different departments. Here all the employees belong to the specific department is grouped by and displayed in the results. So the result is department name with the total number of employees present in each department.
Query:
SELECT Department, count(*) FROM employees_manma GROUP BY Department;
Sort by:
Sort by clause performs on column names of Hive tables to sort the output. We can mention DESC for sorting the order in descending order and mention ASC for Ascending order of the sort.
In this sort by it will sort the rows before feeding to the reducer. Always sort by depends on column types.
For instance, if column types are numeric it will sort in numeric order if the columns types are string it will sort in lexicographical order.
query that performing on the table "employees_manma" with the SORT BY clause with "id" as define SORT BY column name. We used keyword DESC.
So the output displayed will be in descending order of "id".
Query:
SELECT * from employees_manma SORT BY Id DESC;
Cluster By:
Cluster By used as an alternative for both Distribute BY and Sort BY clauses in Hive-QL.
Cluster BY clause used on tables present in Hive. Hive uses the columns in Cluster by to distribute the rows among reducers. Cluster BY columns will go to the multiple reducers.
It ensures sorting orders of values present in multiple reducers
For example, Cluster By clause mentioned on the Id column name of the table employees_manma table. The output when executing this query will give results to multiple reducers at the back end. But as front end it is an alternative clause for both Sort By and Distribute By.
This is actually back end process when we perform a query with sort by, group by, and cluster by in terms of Map reduce framework. So if we want to store results into multiple reducers, we go with Cluster By.
Query that performs CLUSTER BY clause on Id field value. Here it's going to get a sort on Id values.
It displays the Id and Names present in the employees_manma sort ordered by
Query:
SELECT Id, Name from employees_manma CLUSTER BY Id;
Distribute By:
Distribute BY clause used on tables present in Hive. Hive uses the columns in Distribute by to distribute the rows among reducers. All Distribute BY columns will go to the same reducer.
It ensures each of N reducers gets non-overlapping ranges of column
It doesn't sort the output of each reducer
DISTRIBUTE BY Clause performing on Id of "empoloyees_manma" table
Output showing Id, Name. At back end, it will go to the same reducer
Query:
SELECT Id, Name from employees_manma DISTRIBUTE BY Id;
Join queries:
Join queries can perform on two tables present in Hive. For
understanding Join Concepts in clear here we are creating two tables
overhere,
Sample_joins( Related to Customers Details )
Sample_joins1( Related to orders details done by Employees)
Step 1) Creation of table "sample_joins" with Column names ID,
Name, Age, address and salary of the employees
create table sample_joins(Id INT, Name STRING, Age INT, Address STRING,
Salary FLOAT)
row format delimited
fields terminated by ',';
Step 2) Loading and Displaying Data
load data local inpath '/local_path/Customers.txt' into TABLE
sample_joins;
Loading data into sample_joins from Customers.txt
Step 3) Displaying sample_joins table contents
select * from sample_joins;
Step 4) Creation of sample_joins1 table and loading, displaying
data
create table sample_joins1(OrderId INT, Date1 TIMESTAMP, Id INT, Amount
FLOAT)
row format delimited
fields terminated by ',';
Step 5) Loading and Displaying Data
load data local inpath '/local_path/Orders.txt' into TABLE
sample_joins1;
Step 6) Displaying sample_joins1 table contents
select * from sample_joins1;
Some points to observe in Joins:
- Only Equality joins are allowed In Joins
- More than two tables can be joined in the same query
- LEFT, RIGHT, FULL OUTER joins exist in order to provide more control over ON Clause for which there is no match
- Joins are not Commutative
- Joins are left-associative irrespective of whether they are LEFT or RIGHT joins
Different type of joins. Joins are of 4 types, these are
- Inner join
- Left outer Join
- Right Outer Join
- Full Outer Join
Inner Join:
The Records common to the both tables will be retrieved by this Inner
Join.
we can observe the following
- Here we are performing join query using JOIN keyword between the tables sample_joins and sample_joins1 with matching condition as (c.Id= o.Id).
- The output displaying common records present in both the table by checking the condition mentioned in the query
Query:
SELECT c.Id, c.Name, c.Age, o.Amount FROM sample_joins c JOIN sample_joins1 o ON(c.Id=o.Id);
Left Outer Join:
- Hive query language LEFT OUTER JOIN returns all the rows from the left table even though there are no matches in right table
- If ON Clause matches zero records in the right table, the joins still return a record in the result with NULL in each column from the right table
We can observe the following :
- Here we are performing join query using "LEFT OUTER JOIN" keyword between the tables sample_joins and sample_joins1 with matching condition as (c.Id= o.Id). For example here we are using employee id as a reference, it checks whether id is common in right as well as left the table or not. It acts as matching condition.
- The output displaying common records present in both the table by checking the condition mentioned in the query.
NULL values in the above output are columns with no values from Right table that is sample_joins1
Query:
SELECT c.Id, c.Name, o.Amount, o.Date1 FROM sample_joins c LEFT OUTER JOIN sample_joins1 o ON(c.Id=o.Id)
Right outer Join:
- Hive query language RIGHT OUTER JOIN returns all the rows from the Right table even though there are no matches in left table
- If ON Clause matches zero records in the left table, the joins still return a record in the result with NULL in each column from the left table
- RIGHT joins always return records from a Right table and matched records from the left table. If the left table is having no values corresponding to the column, it will return NULL values in that place.
We can observe the following
- Here we are performing join query using "RIGHT OUTER JOIN" keyword between the tables sample_joins and sample_joins1 with matching condition as (c.Id= o.Id).
- The output displaying common records present in both the table by checking the condition mentioned in the query
Query:
SELECT c.Id, c.Name, o.Amount, o.Date1 FROM sample_joins c RIGHT OUTER JOIN sample_joins1 o ON(c.Id=o.Id)
Full outer join:
- It combines records of both the tablessample_joins and sample_joins1 based on the JOIN Condition given in query.
- It returns all the records from both tables and fills in NULL Values for the columns missing values matched on either side
We can observe the following:
- Here we are performing join query using "FULL OUTER JOIN" keyword between the tables sample_joins and sample_joins1 with matching condition as (c.Id= o.Id).
- The output displaying all the records present in both the table by checking the condition mentioned in the query. Null values in output here indicates the missing values from the columns of both tables.
Query
SELECT c.Id, c.Name, o.Amount, o.Date1 FROM sample_joins c FULL OUTER JOIN sample_joins1 o ON(c.Id=o.Id)
Sub queries:
A Query present within a Query is known as a sub query. The main query will depend on the values returned by the subqueries.
Subqueries can be classified into two types
Subqueries in FROM clause
Subqueries in WHERE clause
When to use:
- To get a particular value combined from two column values from different tables
- Dependency of one table values on other tables
- Comparative checking of one column values from other tables
Syntax:
Subquery in FROM clause
SELECT <column names 1, 2…n>From (SubQuery) <TableName_Main>
Subquery in WHERE clause
SELECT <column names 1, 2…n> From<TableName_Main>WHERE col1 IN (SubQuery);
Example:
SELECT col1 FROM (SELECT a+b AS col1 FROM t1) t2
Here t1 and t2 are table names. The colored one is Subquery performed on table t1. Here a and b are columns that are added in a subquery and assigned to col1. Col1 is the column value present in Main table. This column "col1" present in the subquery is equivalent to the main table query in column col1.
What is HiveQL(Hive Query Language)?
Hive provides a CLI to write Hive queries using Hive Query Language (HiveQL). Generally HQL syntax is similar to the SQL syntax that most data analysts are familiar with.
Hive's SQL-inspired language separates the user from the complexity of Map Reduce programming. It reuses familiar concepts from the relational database world, such as tables, rows, columns and schema, to ease learning.
Most interactions tend to take place over a command line interface (CLI). Hive provides a CLI to write Hive queries using Hive Query Language (Hive-QL).
Generally, HiveQL syntax is similar to the SQL syntax that most data analysts are familiar with. Hive supports four file formats those are TEXTFILE, SEQUENCEFILE, ORC and RCFILE (Record Columnar File).
- For single user metadata storage Hive uses derby database and
- For multiple user Metadata or shared Metadata case Hive uses MYSQL
Built-in operators
Hive provides Built-in operators for Data operations to be implemented on the tables present inside Hive warehouse.
These operators are used for mathematical operations on operands, and it will return specific value as per the logic applied.
Types of Built-in Operators in HIVE are:
- Relational Operators
- Arithmetic Operators
- Logical Operators
- Operators on Complex types
- Complex type Constructors
Relational Operators:
We use Relational operators for relationship comparisons between two
operands.
- Operators such as equals, Not equals, less than, greater than …etc
- The operand types are all number types in these Operators.
The following Table will give us details about Relational operators and
its usage.
|
Built-in Operator
|
Description
|
Operand
|
|
X = Y
|
TRUE
if expression X is equivalent to expression Y Otherwise FALSE. |
It takes all primitive types
|
|
X != Y
|
TRUE
if expression X is not equivalent to expression Y Otherwise FALSE. |
It takes all primitive types
|
|
X < Y
|
TRUE
if expression X is less than expression Y Otherwise FALSE. |
It takes all primitive types
|
|
X <= Y
|
TRUE
if expression X is less than or equal to expression Y Otherwise FALSE. |
It takes all primitive types
|
|
X>Y
|
TRUE
if expression X is greater than expression Y Otherwise FALSE. |
It takes all primitive types
|
|
X>= Y
|
TRUE
if expression X is greater than or equal to expression Y Otherwise FALSE. |
It takes all primitive types
|
|
X IS NULL
|
TRUE if expression X evaluates to NULL otherwise FALSE.
|
It takes all types
|
|
X IS NOT NULL
|
FALSE
If expression X evaluates to NULL otherwise TRUE. |
It takes all types
|
|
X LIKE Y
|
TRUE
If string pattern X matches to Y otherwise FALSE. |
Takes only Strings
|
|
X RLIKE Y
|
NULL if X or Y is NULL, TRUE if any substring of X matches the Java regular expression Y, otherwise FALSE.
|
Takes only Strings
|
|
X REGEXP Y
|
Same as RLIKE.
|
Takes only Strings
|
Arithmetic Operators:
We use Arithmetic operators for performing arithmetic operations on
operands
- Arithmetic operations such as addition, subtraction, multiplication and division between operands we use these Operators.
- The operand types all are number types in these Operators
Sample Example:
2 + 3 gives result 5.
In this example, '+' is theoperator and 2 and 3 are operands. The return
value is 5
The following Table will give us details about Arithmetic operators
|
Built-in Operator
|
Description
|
Operand
|
|
X + Y
|
It will return the output of adding X and Y value.
|
It takes all number types
|
|
X - Y
|
It will return the output of subtracting Y from X value.
|
It takes all number types
|
|
X * Y
|
It will return the output of multiplying X and Y values.
|
It takes all number types
|
|
X / Y
|
It will return the output of dividing Y from X.
|
It takes all number types
|
|
X % Y
|
It will return the remainder resulting from dividing X by Y.
|
It takes all number types
|
|
X & Y
|
It will return the output of bitwise AND of X and Y.
|
It takes all number types
|
|
X | Y
|
It will return the output of bitwise OR of X and Y.
|
It takes all number types
|
|
X ^ Y
|
It will return the output of bitwise XOR of X and Y.
|
It takes all number types
|
|
~X
|
It will return the output of bitwise NOT of X.
|
It takes all number types
|
Logical Operators:
We use Logical operators for performing Logical operations on operands
- Logical operations such as AND, OR, NOT between operands we use these Operators.
- The operand types all are BOOLEAN type in these Operators
The following Table will give us details about Logical operators
|
Operators
|
Description
|
Operands
|
|
X AND Y
|
TRUE if both X and Y are TRUE, otherwise FALSE.
|
Boolean types only
|
|
X && Y
|
Same as X AND Y but here we using && symbol
|
Boolean types only
|
|
X OR Y
|
TRUE if either X or Y or both are TRUE, otherwise FALSE.
|
Boolean types only
|
|
X || Y
|
Same as X OR Y but here we using || symbol
|
Boolean types only
|
|
NOT X
|
TRUE if X is FALSE, otherwise FALSE.
|
Boolean types only
|
|
!X
|
Same as NOT X but here we using! symbol
|
Boolean types only
|
Operators on Complex types:
The following Table will give us details about Complex Type Operators .
These are operators which will provide a different mechanism to access elements
in complex types.
|
Operators
|
Operands
|
Description
|
|
A[n]
|
A is an Array and n is an integer type
|
It will return nth element in the array A. The first element has index
of 0
|
|
M[key]
|
M is a Map<K, V> and key has type K
|
It will return the values belongs to the key in the map
|
Complex type Constructors:
The following Table will give us details about Complex type
Constructors. It will construct instances on complex data types. These are of
complex data types such as Array, Map and Struct types in Hive.
In this section, we are going to see the operations performed on Complex
type Constructors.
|
Operators
|
Operands
|
Description
|
|
array
|
(val1, val2, ...)
|
It will create an array with the given elements as mentioned like
val1, val2
|
|
Create_ union
|
(tag, val1, val2, ...)
|
It will create a union type with the values that is being mentioned to
by the tag parameter
|
|
map
|
(key1, value1, key2, value2, ...)
|
It will create a map with the given key/value pairs mentioned in
operands
|
|
Named_struct
|
(name1, val1, name2, val2, ...)
|
It will create a Struct with the given field names and values
mentioned in operands
|
|
STRUCT
|
(val1, val2, val3, ...)
|
Creates a Struct with the given field values. Struct field names will
be col1, col2, .
|
Built-in functions
These are functions that already available in Hive. First, we have to
check the application requirement, and then we can use this built in functions
in our applications. We can call these functions directly in our application.
The syntax and types are mentioned in the following section.
Types of Built-in Functions in HIVE
- Collection Functions
- Date Functions
- Mathematical Functions
- Conditional Functions
- String Functions
- Misc. Functions
Collection Functions:
These functions are used for collections. Collections mean the grouping
of elements and returning single or array of elements depends on return type
mentioned in function name.
|
Return Type
|
Function Name
|
Description
|
|
INT
|
size(Map<K.V>)
|
It will fetch and give the components number in the map type
|
|
INT
|
size(Array<T>)
|
It will fetch and give the elements number in the array type
|
|
Array<K>
|
Map_keys(Map<K.V>)
|
It will fetch and gives an array containing the keys of the input map.
Here array is in unordered
|
|
Array<V>
|
Map_values(Map<K.V>)
|
It will fetch and gives an array containing the values of the input
map. Here array is in unordered
|
|
Array<t>
|
Sort_array(Array<T>)
|
sorts the input array in ascending order of array and elements and
returns it
|
Date Functions:
These are used to perform Date Manipulations and Conversion of Date
types from one type to another type:
|
Function Name
|
Return Type
|
Description
|
|
Unix_Timestamp()
|
BigInt
|
|
|
To_date(string timestamp)
|
string
|
It will fetch and give the date part of a timestamp string:
|
|
year(string date)
|
INT
|
It will fetch and give the year part of a date or a timestamp string
|
|
quarter(date/timestamp/string)
|
INT
|
It will fetch and give the quarter of the year for a date, timestamp,
or string in the range 1 to 4
|
|
month(string date)
|
INT
|
It will give the month part of a date or a timestamp string
|
|
hour(string date)
|
INT
|
It will fetch and gives the hour of the timestamp
|
|
minute(string date)
|
INT
|
It will fetch and gives the minute of the timestamp
|
|
Date_sub(string starting date, int days)
|
string
|
It will fetch and gives Subtraction of number of days to starting date
|
|
Current_date
|
date
|
It will fetch and gives the current date at the start of query
evaluation
|
|
LAST _day(string date)
|
string
|
It will fetch and gives the last day of the month which the date
belongs to
|
|
trunc(string date, string format)
|
string
|
It will fetch and gives date truncated to the unit specified by the
format.
Supported formats in this : MONTH/MON/MM, YEAR/YYYY/YY. |
Mathematical Functions:
These functions are used for Mathematical Operations. Instead of
creatingUDFs , we have some inbuilt mathematical functions in Hive.
|
Function Name
|
Return Type
|
Description
|
|
round(DOUBLE X)
|
DOUBLE
|
It will fetch and returns the rounded BIGINT value of X
|
|
round(DOUBLE X, INT d)
|
DOUBLE
|
It will fetch and returns X rounded to d decimal places
|
|
bround(DOUBLE X)
|
DOUBLE
|
It will fetch and returns the rounded BIGINT value of X using
HALF_EVEN rounding mode
|
|
floor(DOUBLE X)
|
BIGINT
|
It will fetch and returns the maximum BIGINT value that is equal to or
less than X value
|
|
ceil(DOUBLE a), ceiling(DOUBLE a)
|
BIGINT
|
It will fetch and returns the minimum BIGINT value that is equal to or
greater than X value
|
|
rand(), rand(INT seed)
|
DOUBLE
|
It will fetch and returns a random number that is distributed
uniformly from 0 to 1
|
Conditional Functions:
These functions used for
conditional values checks.
|
Function Name
|
Return Type
|
Description
|
|
if(Boolean testCondition, T valueTrue, T valueFalseOrNull)
|
T
|
It will fetch and gives value True when Test Condition is of
true, gives value False Or Null otherwise.
|
|
ISNULL( X)
|
Boolean
|
It will fetch and gives true if X is NULL and false otherwise.
|
|
ISNOTNULL(X )
|
Boolean
|
It will fetch and gives true if X is not NULL and false
otherwise.
|
String Functions:
String manipulations and string
operations these functions can be called.
|
Function Name
|
Return Type
|
Description
|
|
reverse(string X)
|
string
|
It will give the reversed string of X
|
|
rpad(string str, int length, string pad)
|
string
|
It will fetch and gives str, which is right-padded with pad to
a length of length(integer value)
|
|
rtrim(string X)
|
string
|
It will fetch and returns the string resulting from trimming
spaces from the end (right hand side) of X For example, rtrim(' results ') results
in ' results'
|
|
space(INT n)
|
string
|
It will fetch and gives a string of n spaces.
|
|
split(STRING str, STRING pat)
|
array
|
Splits str around pat (pat is a regular expression).
|
|
Str_to_map(text[, delimiter1, delimiter2])
|
map<String ,String>
|
It will split text into key-value pairs using two delimiters.
|
No comments:
Post a Comment