Wednesday, 31 July 2019

Hive Commands

Creation and Dropping of a Database in Hive:
Create Database:
To create a database in the Hive shell, we use the command shown in the syntax below:
Syntax:
CREATE DATABASE <DatabaseName>;
Example: CREATE DATABASE MANMA;
Displaying existing databases using the "show" command:
Example: SHOW DATABASES;
Drop Database:
To drop a database in the Hive shell, we use the "drop" command as shown in the syntax below:
Syntax:
DROP DATABASE <DatabaseName>;
Example:
DROP DATABASE MANMA;
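Note that, by default, Hive refuses to drop a database that still contains tables. A short sketch of the commonly used options:

```sql
-- DROP DATABASE fails if the database still contains tables.
-- CASCADE drops the contained tables first; IF EXISTS suppresses
-- the error when the database does not exist.
DROP DATABASE IF EXISTS MANMA CASCADE;
```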
Table operations such as creating, altering, and dropping tables in Hive are covered in this tutorial.
In the steps below, we create a table with columns and then rename the table.
1. Creating table manma_sample with two columns, "empid" and "empname":
Example: CREATE TABLE manma_sample(empid INT, empname STRING);
2. Displaying the tables in the database:
Example: SHOW TABLES;
3. Renaming table "manma_sample" to "manma_sampleNew":
Example: ALTER TABLE manma_sample RENAME TO manma_sampleNew;
4. Dropping table manma_sampleNew:
Example: DROP TABLE manma_sampleNew;

Table Types and Their Usage:
Tables in Hive work much like those in traditional relational databases, and operations such as filtering and joins can be performed on them.
Hive supports two table structures, internal and external tables, depending on how the data is loaded and how the schema is designed.
Internal tables
  • An internal (managed) table is tightly coupled to its data. With this type of table, we first create the table and then load the data into it.
  • This approach can be described as "data on schema".
  • Dropping the table removes both the data and the schema.
  • By default, the table's data is stored under /user/hive/warehouse.
When to Choose an Internal Table:
  • If the data to be processed is available in the local file system
  • If we want Hive to manage the complete lifecycle of the data, including deletion
Sample code snippet for an internal table:
1. To create the internal table:
CREATE TABLE manmahive_internaltable (id INT, Name STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
2. Load the data into the internal table:
LOAD DATA INPATH '/user/manma99hive/data.txt' INTO TABLE manmahive_internaltable;
3. Display the content of the table:
SELECT * FROM manmahive_internaltable;
4. To drop the internal table:
DROP TABLE manmahive_internaltable;
If you drop manmahive_internaltable, both its metadata and its data are deleted from Hive.
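To confirm whether a table is managed (internal) and where its data lives, you can inspect its metadata:

```sql
-- For an internal table, the output includes a line such as
-- "Table Type: MANAGED_TABLE" along with the warehouse location.
DESCRIBE FORMATTED manmahive_internaltable;
```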
External tables
  • An external table is loosely coupled to its data. The data lives in HDFS, and the table is created on top of that HDFS data.
  • In other words, this is "schema on data".
  • Dropping the table removes only the schema; the data remains in HDFS as before.
  • External tables make it possible to define multiple schemas over the same data in HDFS, instead of deleting and reloading the data every time the schema changes.
When to Choose an External Table:
  • If the data to be processed is already available in HDFS
  • If the files are also being used by tools outside of Hive
Sample code snippet for an external table:
1. Create the external table:
CREATE EXTERNAL TABLE manmahive_external (id INT, Name STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION '/user/manma99hive/manmahive_external';
2. If no location is specified at table creation time, the data can be loaded manually:
LOAD DATA INPATH '/user/manma99hive/data.txt' INTO TABLE manmahive_external;
3. Display the content of the table:
SELECT * FROM manmahive_external;
4. To drop the external table:
DROP TABLE manmahive_external;
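Because dropping an external table removes only its schema, the files are still in HDFS afterwards, and a new table can be defined over the same location. A small sketch (the name manmahive_external2 is illustrative):

```sql
-- The files under /user/manma99hive/manmahive_external survive the
-- DROP TABLE above, so a new schema over the same location makes the
-- same data queryable again.
CREATE EXTERNAL TABLE manmahive_external2 (id INT, Name STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION '/user/manma99hive/manmahive_external';

SELECT * FROM manmahive_external2;
```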

Difference between Internal vs External Tables

Feature            | Internal                         | External
-------------------|----------------------------------|------------------------------
Schema             | Data on schema                   | Schema on data
Storage location   | /user/hive/warehouse             | User-specified HDFS location
Dropping the table | Removes both data and schema     | Removes only the schema
What are Partitions?
Hive partitioning organizes a table by dividing it into parts based on the values of one or more partition keys.
Partitioning is helpful when the table has one or more partition keys; these keys determine how the data is physically stored in the table.
For example:
A client has e-commerce data for its India operations, with all 29 states recorded in a single table. If we take the state column as the partition key and partition the India data on it, we get 29 partitions, one for each state, so that each state's data can be viewed separately in the partitioned table.
Sample code snippet for partitions:
1. Creation of table allstates:
CREATE TABLE allstates(state STRING, District STRING, Enrolments STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
2. Loading data into table allstates:
LOAD DATA LOCAL INPATH '/home/hduser/Desktop/AllStates.csv' INTO TABLE allstates;
3. Creation of the partitioned table:
CREATE TABLE state_part(District STRING, Enrolments STRING) PARTITIONED BY (state STRING);
4. For dynamic partitioning we have to set this property:
set hive.exec.dynamic.partition.mode=nonstrict;
5. Loading data into the partitioned table:
INSERT OVERWRITE TABLE state_part PARTITION(state) SELECT district, enrolments, state FROM allstates;
The above code does the following:
1. Creates table allstates with three columns: state, district, and enrolments
2. Loads data into table allstates
3. Creates the partitioned table with state as the partition key
4. Sets the partition mode to non-strict (this enables dynamic partitioning)
5. Loads data into the partitioned table state_part
6. Hive then processes the data and forms the partitions based on state as the partition key
7. This produces 29 partition directories in HDFS, one named after each state
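The partitions can then be verified and queried from the Hive shell; the state name below is only an example value:

```sql
-- Lists one entry per partition, e.g. state=Karnataka, state=Kerala, ...
SHOW PARTITIONS state_part;

-- Filtering on the partition key lets Hive read only that
-- partition's directory instead of scanning the whole table.
SELECT district, enrolments FROM state_part WHERE state = 'Karnataka';
```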

What are Buckets?
Bucketing in Hive segregates table data into multiple files or directories and is used for efficient querying.
  • The data present in a partition can be divided further into buckets
  • The division is performed based on a hash of a particular column that we select in the table
  • Hive applies a hashing algorithm to each record's bucketing column to decide which bucket the record is placed into
  • In Hive, we have to enable bucketing with set hive.enforce.bucketing=true;
Step 1) Create the bucketed table:
CREATE TABLE samplebucket (first_name STRING, job_id INT, department STRING, salary STRING, country STRING)
CLUSTERED BY (country) INTO 4 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
Step 2) Loading data into table samplebucket:
Assuming that the "employees" table has already been created in Hive, in this step we load data from the employees table into samplebucket.
Before moving the employees data into buckets, make sure the table has the columns first_name, job_id, department, salary, and country.
Here we load data into samplebucket from the employees table:
FROM employees
INSERT OVERWRITE TABLE samplebucket
SELECT first_name, job_id, department, salary, country;
Once the insert operation completes, the data is written into the 4 buckets that we specified at table creation time.
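One benefit of bucketing is efficient sampling: TABLESAMPLE can read just the requested bucket files instead of scanning the whole table. A sketch against the samplebucket table:

```sql
-- Reads only bucket 1 of the 4 country buckets, i.e. roughly a
-- quarter of the table's files.
SELECT first_name, country
FROM samplebucket
TABLESAMPLE(BUCKET 1 OUT OF 4 ON country);
```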

What is a View?
Views are similar to tables and are generated based on requirements.

  • We can save any result set as a view in Hive
  • Usage is similar to views in SQL
  • In Hive, views are purely logical: they can be queried like tables, but data cannot be inserted into or updated through them

Creation of a View:
Syntax:
CREATE VIEW <ViewName> AS SELECT <query>;
Example:
hive> CREATE VIEW Sample_View AS SELECT * FROM employees WHERE salary > 25000;
In this example, we create the view Sample_View, which displays all rows whose salary field is greater than 25000.
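A view is then used like any other table, and Hive expands its definition into the enclosing query at execution time:

```sql
-- Queried exactly like a table.
SELECT * FROM Sample_View;

-- Views are removed with DROP VIEW, not DROP TABLE.
DROP VIEW Sample_View;
```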

What is an Index?
An index is a pointer to a particular column of a table.
The user has to define the index manually.
Wherever we create an index, we are creating a pointer to a particular column of the table.
Any changes made to the indexed column are tracked through the index values created on that column.
Syntax:
CREATE INDEX <IndexName> ON TABLE <TableName>(<ColumnName>) AS '<index.handler>';
Example:
CREATE INDEX sample_Index ON TABLE manmahive_internaltable(id) AS 'COMPACT' WITH DEFERRED REBUILD;
Here we create an index named sample_Index on table manmahive_internaltable for the column id, using the built-in compact index handler.
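When an index is created WITH DEFERRED REBUILD it starts out empty and must be built explicitly. The maintenance commands are sketched below; note that Hive's built-in indexing feature was removed in Hive 3.0, so these apply to older versions:

```sql
-- Populate (or refresh) the index after the base table changes.
ALTER INDEX sample_Index ON manmahive_internaltable REBUILD;

-- List the indexes defined on the table.
SHOW INDEX ON manmahive_internaltable;

-- Remove the index.
DROP INDEX sample_Index ON manmahive_internaltable;
```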
