Wednesday, 31 July 2019

Hive Commands

Creation and dropping of a database in Hive:
Create Database:
To create a database in the Hive shell, use the command shown in the syntax below:
Syntax:
Create database <DatabaseName>;
Example: Create database MANMA;
Displaying existing databases using the "show" command:
Example: Show databases;
Drop Database: 
To drop a database in the Hive shell, use the "drop" command as shown in the syntax below:
Syntax:
Drop database <DatabaseName>;
Example:
Drop database MANMA;
Table operations such as creating, altering, and dropping tables in Hive are covered in this tutorial.
In the examples below, we create a table with columns and then alter the table name.
1. Creating table manma_sample with two columns, "empid" and "empname"
Example: create table manma_sample(empid int, empname string);
2. Displaying tables in databases
Example: show tables;
3. Renaming table "manma_sample" to "manma_sampleNew"
Example: ALTER TABLE manma_sample RENAME TO manma_sampleNew;
4. Dropping table manma_sampleNew.
Example: drop table manma_sampleNew;

Table types and their usage:
Tables in Hive are created much the same way as in traditional relational databases. Functionality such as filtering and joins can be performed on these tables.
Hive deals with two types of table structures, internal and external tables, depending on how the data is loaded and how the schema is designed.
Internal tables
  • An internal table is tightly coupled in nature. With this type of table, we first create the table and then load the data.
  • We can call this "data on schema".
  • Dropping this table removes both the data and the schema.
  • The table is stored at /user/hive/warehouse.
When to Choose Internal Table:
  • If the data to be processed is available in the local file system
  • If we want Hive to manage the complete lifecycle of the data, including deletion
Sample code snippet for an internal table:
1. To create the internal table 
CREATE TABLE manmahive_internaltable (id INT, Name STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
2. Load the data into internal table 
LOAD DATA INPATH '/user/manma99hive/data.txt' INTO TABLE manmahive_internaltable;
3. Display the content of the table
select * from manmahive_internaltable;
4. To drop the internal table
DROP TABLE manmahive_internaltable;
If you drop manmahive_internaltable, both its metadata and its data will be deleted from Hive.
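Before dropping, you can confirm that Hive manages a table (and will therefore delete its data on DROP); a minimal check, noting that the exact output layout varies by Hive version:

DESCRIBE FORMATTED manmahive_internaltable;
-- look for "Table Type: MANAGED_TABLE" in the output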
External tables
  • An external table is loosely coupled in nature. The data is already available in HDFS, and the table is created on top of that HDFS data.
  • In other words, it creates a "schema on data".
  • Dropping the table drops only the schema; the data remains in HDFS as before.
  • External tables make it possible to define multiple schemas for the data stored in HDFS, instead of deleting the data every time the schema changes.
When to Choose External Table:
  • If the data to be processed is already available in HDFS
  • Useful when the files are being used outside of Hive 
Sample code snippet for an external table:
1. Create External table 
CREATE EXTERNAL TABLE manmahive_external (id INT, Name STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION '/user/manma99hive/manmahive_external';
2. If we did not specify a location at table creation time, we can load the data manually
LOAD DATA INPATH '/user/manma99hive/data.txt' INTO TABLE manmahive_external;
3. Display the content of the table 
select * from manmahive_external;
4. To drop the external table
DROP TABLE manmahive_external;
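Because the data survives the DROP, the same HDFS files can be exposed again simply by re-creating the external table over the existing location. A sketch reusing the path from the example above:

-- re-create the external table over the data that survived the DROP
CREATE EXTERNAL TABLE manmahive_external (id INT, Name STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION '/user/manma99hive/manmahive_external';
-- the old rows are immediately queryable again
SELECT * FROM manmahive_external;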

Difference between Internal vs External tables

Feature           | Internal table                 | External table
Schema            | Data on schema                 | Schema on data
Storage location  | /user/hive/warehouse           | HDFS location specified by the user
Data on DROP      | Deleted along with the schema  | Remains in HDFS
What are Partitions?
Hive partitions are a way of organizing tables by dividing them into parts based on partition keys.
Partitioning is helpful when a table has one or more partition keys. Partition keys are the basic elements that determine how the data is stored in the table.
For example:
Suppose a client has e-commerce data for its India operations, with each of the 29 states recorded together as a whole. If we take the state column as the partition key and partition the India data on it, we get 29 partitions, one per state, so that each state's data can be viewed separately in the partitioned table.
Sample Code Snippet for partitions 
1. Creation of table allstates
create table allstates(state string, district string, enrolments string)
row format delimited
fields terminated by ',';
2. Loading data into the created table allstates
Load data local inpath '/home/hduser/Desktop/AllStates.csv' into table allstates;
3. Creation of partition table 
create table state_part(District string,Enrolments string) PARTITIONED BY(state string);
4. For dynamic partitioning, we have to set this property:
set hive.exec.dynamic.partition.mode=nonstrict;
5. Loading data into partition table 
INSERT OVERWRITE TABLE state_part PARTITION(state) SELECT district, enrolments, state FROM allstates;
The above code does the following:
1. Creates table allstates with 3 columns: state, district, and enrolments
2. Loads data into table allstates
3. Creates the partition table state_part with state as the partition key
4. Sets the partition mode to non-strict (this activates dynamic partitioning)
5. Loads data into the partition table state_part
6. Performs the actual processing and formation of the partitions based on state as the partition key
7. Produces 29 partition outputs in HDFS, one directory per state, each named after its state
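A quick way to verify the partitions from the Hive shell is shown below; a minimal sketch, where the HDFS path assumes the default warehouse location:

SHOW PARTITIONS state_part;
-- each partition is a separate HDFS directory, e.g.
-- /user/hive/warehouse/state_part/state=Maharashtra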

What are Buckets?
Buckets in Hive are used to segregate table data into multiple files or directories, which makes querying more efficient.
  • The data present in partitions can be divided further into buckets
  • The division is performed based on a hash of particular columns selected in the table
  • A hashing algorithm is used behind the scenes to read each record and place it into a bucket
  • In Hive, we have to enable bucketing with set hive.enforce.bucketing=true;
Step 1) Create Bucket 
create table samplebucket (first_name string, job_id int, department string, salary string, country string)
clustered by (country) into 4 buckets
row format delimited
fields terminated by ',';
Step 2) Loading data into table samplebucket
Assuming that the "employees" table has already been created in the Hive system, in this step we will see the loading of data from the employees table into table samplebucket.
Before we start moving the employees data into buckets, make sure it consists of the columns first_name, job_id, department, salary, and country.
Here we load data into samplebucket from the employees table:
Assuming that"Employees table" already created in Hive system. In this step, we will see the loading of Data from employees table into table sample bucket. 
Before we start moving employees data into buckets, make sure that it consist of column names such as first_name, job_id, department, salary and country. 
Here we are loading data into sample bucket from employees table.
from employees
insert overwrite table samplebucket
select first_name, job_id, department, salary, country;
Once the insert operation completes, the data is distributed into the 4 buckets we specified at table creation time.
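One benefit of bucketing is efficient sampling: TABLESAMPLE can read a single bucket instead of scanning the whole table. A minimal sketch against the samplebucket table created above:

-- read only the first of the 4 buckets, hashed on country
SELECT first_name, country
FROM samplebucket
TABLESAMPLE(BUCKET 1 OUT OF 4 ON country);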

What is a View?
Views are similar to tables and are generated based on requirements.

  • We can save any result-set data as a view in Hive
  • Usage is similar to views in SQL
  • Hive views are read-only; DML operations such as INSERT, UPDATE, and DELETE cannot be performed on a view

Creation of View: 
Syntax: 
Create VIEW <VIEWNAME> AS SELECT ...
Example:
hive> Create VIEW Sample_View AS SELECT * FROM employees WHERE salary > 25000;
In this example, we create the view Sample_View, which displays all rows with a salary greater than 25000.
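Once created, a view is queried like any table; a minimal usage sketch:

-- query the view just like a table
SELECT * FROM Sample_View;
-- drop the view when it is no longer needed; the underlying employees table is untouched
DROP VIEW Sample_View;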

What is an Index?
An index is a pointer to a particular column of a table.
The user has to define the index manually.
Creating an index means creating a pointer to a particular column of the table.
Any changes made to the indexed column are tracked using the index value created on that column.
Syntax: 
Create INDEX <INDEX_NAME> ON TABLE <TABLE_NAME> (<column names>) AS <index handler> WITH DEFERRED REBUILD;
Example: 
Create INDEX sample_Index ON TABLE manmahive_internaltable(id) AS 'COMPACT' WITH DEFERRED REBUILD;
Here we create an index on table manmahive_internaltable for the column id, using the built-in compact index handler.
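With DEFERRED REBUILD, the index must be built explicitly before use and rebuilt after data changes; a sketch for older Hive releases (the index feature was removed in Hive 3.0):

-- build (or refresh) the index
ALTER INDEX sample_Index ON manmahive_internaltable REBUILD;
-- list the indexes defined on the table
SHOW INDEX ON manmahive_internaltable;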

Introduction to Hive


Hive is an ETL and data warehousing tool developed on top of the Hadoop Distributed File System (HDFS). Hive makes it easy to perform operations like
  • Data encapsulation
  • Ad-hoc queries
  • Analysis of huge datasets
Important characteristics of Hive
  • In Hive, tables and databases are created first and then data is loaded into these tables.
  • Hive is a data warehouse designed for managing and querying only structured data that is stored in tables.
  • While dealing with structured data, Map Reduce doesn't have optimization and usability features like UDFs but Hive framework does. Query optimization refers to an effective way of query execution in terms of performance.
  • Hive's SQL-inspired language separates the user from the complexity of Map Reduce programming. It reuses familiar concepts from the relational database world, such as tables, rows, columns and schema, etc. for ease of learning.
  • Hadoop's programming works on flat files. So, Hive can use directory structures to "partition" data to improve performance on certain queries.
  • An important component of Hive is the Metastore, used for storing schema information. The Metastore typically resides in a relational database. We can interact with Hive using methods like a web GUI or the Java Database Connectivity (JDBC) interface.
  • Most interactions tend to take place over a command line interface (CLI). Hive provides a CLI for writing Hive queries using the Hive Query Language (HQL).
  • Generally, HQL syntax is similar to the SQL syntax that most data analysts are familiar with. The sample query below displays all records present in the mentioned table. Sample query: Select * from <TableName>;
  • Hive supports four file formats: TEXTFILE, SEQUENCEFILE, ORC, and RCFILE (Record Columnar File).
  • For single-user metadata storage, Hive uses the Derby database; for multi-user or shared metadata, Hive uses MySQL.
Some of the key points about Hive:
  • The major difference between HQL and SQL is that Hive query executes on Hadoop's infrastructure rather than the traditional database.
  • Hive query execution takes the form of a series of automatically generated MapReduce jobs (see the EXPLAIN sketch after this list).
  • Hive supports partition and buckets concepts for easy retrieval of data when the client executes the query.
  • Hive supports custom UDFs (User Defined Functions) for data cleansing, filtering, etc. Programmers can define Hive UDFs according to their requirements.
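Because each query compiles into generated MapReduce jobs, the EXPLAIN statement is a handy way to inspect that plan without running the query; a minimal sketch against the employees table used elsewhere in this post:

-- show the plan (stages and map/reduce operators) without executing the query
EXPLAIN SELECT department, count(*) FROM employees GROUP BY department;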
Hive Vs Relational Databases:
By using Hive, we can achieve functionality that is not easily achieved in relational databases. For huge amounts of data, running into petabytes, it is important to be able to query it and get results quickly, and Hive does this quite efficiently: it processes queries fast and produces results in a short time.

Some key differences between Hive and relational databases are the following:
Relational databases follow "schema on write": a table is created first, and then data is inserted into it. On relational database tables, operations like inserts, updates, and modifications can be performed.
Hive follows "schema on read". Functions like update and modification do not work with it, because a Hive query in a typical cluster runs on multiple data nodes, making it impossible to update and modify data across nodes (for Hive versions below 0.13).
Also, Hive follows the "write once, read many" pattern: data is typically loaded once and queried many times, rather than updated in place.
NOTE: However, newer versions of Hive come with updated features; Hive 0.14 and later provide UPDATE and DELETE options as new features, sketched below.
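A hedged sketch of how UPDATE and DELETE work from Hive 0.14 onward: the table must be bucketed, stored as ORC, and declared transactional, and a few session settings are required (emp_txn is an illustrative table name, and exact property requirements can vary by version):

-- session settings typically required for ACID operations
SET hive.support.concurrency=true;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
-- transactional tables must be bucketed and stored as ORC
CREATE TABLE emp_txn (id INT, name STRING, salary INT)
CLUSTERED BY (id) INTO 2 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');
UPDATE emp_txn SET salary = 30000 WHERE id = 1;
DELETE FROM emp_txn WHERE id = 2;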
HIVE ARCHITECTURE
This section explains the Apache Hive architecture in detail.
Hive consists mainly of 3 core parts:
  • Hive Clients
  • Hive Services
  • Hive Storage and Computing
Hive Clients:
Hive provides different drivers for communication with different types of applications. For Thrift-based applications, it provides a Thrift client for communication.
For Java-related applications, it provides JDBC drivers, and for other types of applications, ODBC drivers. These clients and drivers in turn communicate with the Hive server in the Hive services.
Hive Services:
Client interactions with Hive are performed through Hive services. If a client wants to perform any query-related operation in Hive, it has to communicate through Hive services.
The CLI (command line interface) acts as the Hive service for DDL (Data Definition Language) operations. All drivers communicate with the Hive server and then with the main driver in Hive services, as described in this architecture.
The driver present in Hive services is the main driver; it communicates with all types of JDBC, ODBC, and other client-specific applications, and passes their requests on to the metastore and file systems for further processing.
Hive Storage and Computing: 
Hive services such as the metastore, the file system, and the Job Client in turn communicate with Hive storage and perform the following actions:
  • Metadata information of tables created in Hive is stored in Hive "Meta storage database".
  • Query results and data loaded in the tables are going to be stored in Hadoop cluster on HDFS.
Job Execution flow

The job execution flow in Hive with Hadoop can be understood through the steps below.
The data flow in Hive behaves in the following pattern:
  1. Executing the query from the UI (User Interface)
  2. The driver interacts with the compiler to get the plan. (Here "plan" refers to the query execution process and the gathering of its related metadata information.)
  3. The compiler creates the plan for the job to be executed, and communicates with the metastore to request metadata
  4. The metastore sends metadata information back to the compiler
  5. The compiler communicates the proposed query execution plan to the driver
  6. The driver sends the execution plan to the execution engine
  7. The Execution Engine (EE) acts as a bridge between Hive and Hadoop to process the query. For DFS operations, the EE first contacts the Name Node and then the Data Nodes to get the values stored in tables.
    The EE fetches the desired records from the Data Nodes. The actual table data resides only in the data nodes; from the Name Node it fetches only the metadata for the query.
    It collects the actual data from the data nodes related to the mentioned query.
    The Execution Engine (EE) communicates bi-directionally with the metastore present in Hive to perform DDL (Data Definition Language) operations. Here DDL operations like CREATE, DROP, and ALTER on tables and databases are done. The metastore stores only information about database names, table names, and column names, while the EE fetches the data related to the mentioned query.
    The Execution Engine (EE) in turn communicates with Hadoop daemons such as the Name Node, Data Nodes, and Job Tracker to execute the query on top of the Hadoop file system.
  8. The driver fetches the results
  9. Results are sent to the execution engine: once the results are fetched from the data nodes to the EE, it sends them back to the driver and then to the UI (front end)
Hive stays continuously in contact with the Hadoop file system and its daemons via the execution engine.
Different modes of Hive
Hive can operate in two modes depending on the number of data nodes in Hadoop.
These modes are, 
  • Local mode
  • Map reduce mode 
When to use Local mode:
If Hadoop is installed in pseudo-distributed mode with a single data node, we use Hive in this mode
If the data size is small enough to be limited to a single local machine, we can use this mode
Processing will be very fast on smaller data sets present in the local machine
When to use Map reduce mode:
If Hadoop has multiple data nodes and the data is distributed across different nodes, we use Hive in this mode
It works on large data sets, and queries execute in parallel
Processing of large data sets with better performance can be achieved through this mode 
In Hive, we can set a property to specify which mode Hive works in. By default, it works in MapReduce mode; for local mode, you can use the following setting.
To make Hive work in local mode, set:
SET mapred.job.tracker=local;
From Hive version 0.7 onward, Hive supports a mode that runs MapReduce jobs in local mode automatically, as sketched below.
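That automatic behavior is controlled by a property; a hedged sketch (the threshold properties and their defaults may differ between versions):

-- let Hive decide on its own to run small jobs locally
SET hive.exec.mode.local.auto=true;
-- optional thresholds a job must stay under to qualify for local execution
SET hive.exec.mode.local.auto.inputbytes.max=134217728;
SET hive.exec.mode.local.auto.tasks.max=4;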

What is HiveServer2 (HS2)?
HiveServer2 (HS2) is a server interface that performs following functions: 
Enables remote clients to execute queries against Hive
Retrieves the results of the mentioned queries
The latest versions have some advanced features, based on Thrift RPC, such as:
  • Multi-client concurrency
  • Authentication 
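Remote clients usually reach HS2 through the Beeline CLI over JDBC; a minimal connection sketch, where host, port, and user name are placeholders (10000 is the conventional default port):

beeline -u jdbc:hive2://localhost:10000 -n hiveuser

Once connected, HQL queries run exactly as in the Hive CLI.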
Summary:
Hive is an ETL and data warehouse tool on top of the Hadoop ecosystem, used for processing structured and semi-structured data.
  • Hive is a database in the Hadoop ecosystem that performs DDL and DML operations, and it provides a flexible query language, HQL, for better querying and processing of data.
  • It provides many features compared to an RDBMS, which has certain limitations.
  • It provides the option of writing and deploying custom scripts and user-defined functions for user-specific logic that meets client requirements.
  • In addition, it provides partitions and buckets for storage-specific logic.
Data types in Hive
Data types are very important elements of the Hive query language and of data modeling. To define table column types, we must know the data types and their usage.
The following gives a brief overview of some data types present in Hive:
  • Numeric Types
  • String Types
  • Date/Time Types
  • Complex Types
    Numeric Types:
Type     | Memory allocation
TINYINT  | 1-byte signed integer (-128 to 127)
SMALLINT | 2-byte signed integer (-32,768 to 32,767)
INT      | 4-byte signed integer (-2,147,483,648 to 2,147,483,647)
BIGINT   | 8-byte signed integer
FLOAT    | 4-byte single-precision floating point number
DOUBLE   | 8-byte double-precision floating point number
DECIMAL  | User-defined precision and scale
    String Types:
Type    | Length
CHAR    | Fixed length, up to 255
VARCHAR | 1 to 65535
STRING  | Variable length, no explicit limit

    Date/Time Types:
Type      | Description
TIMESTAMP | Supports traditional Unix timestamps with optional nanosecond precision
DATE      | YYYY-MM-DD format; the supported range is 0000-01-01 to 9999-12-31, dependent on support by the primitive Java Date type

    Complex Types:
Type   | Declaration
ARRAY  | ARRAY<data_type>; negative values and non-constant expressions are not allowed
MAP    | MAP<primitive_type, data_type>; negative values and non-constant expressions are not allowed
STRUCT | STRUCT<col_name : data_type, ...>
UNION  | UNIONTYPE<data_type, data_type, ...>
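A hedged sketch showing how these complex types can be declared in a table, and how collection delimiters are specified for a delimited text file (table and column names here are illustrative):

CREATE TABLE employee_complex (
  name    STRING,
  phones  ARRAY<STRING>,
  skills  MAP<STRING, INT>,
  address STRUCT<city:STRING, pin:INT>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '|'
MAP KEYS TERMINATED BY ':';
-- arrays and maps are accessed with [], struct fields with dot notation
SELECT phones[0], skills['hive'], address.city FROM employee_complex;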

