Follow the steps below to move data from Hive to DynamoDB.
Source link: https://docs.amazonaws.cn/en_us/amazondynamodb/latest/developerguide/EMRforDynamoDB.html
Below is a quick summary of the documentation linked above.
1. Create a Hive table. (Skip this step if the Hive table already exists.)
CREATE TABLE hive_features
(feature_id BIGINT,
feature_name STRING,
feature_class STRING,
state_alpha STRING,
prim_lat_dec DOUBLE,
prim_long_dec DOUBLE,
elev_in_ft BIGINT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
LINES TERMINATED BY '\n';
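If hive_features is still empty, one way to populate it is to load a '|'-delimited file from S3 or HDFS. This is just a sketch; the bucket and path below are placeholders, so point it at wherever your data actually lives.
-- Placeholder path: replace with the actual location of your '|'-delimited data
LOAD DATA INPATH 's3://your-bucket/path/to/features/' OVERWRITE INTO TABLE hive_features;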
2. Now create an external Hive table that maps to the DynamoDB table (it does not hold data itself; it provides the column mapping between Hive and DynamoDB).
CREATE EXTERNAL TABLE ddb_features
(feature_id BIGINT,
feature_name STRING,
feature_class STRING,
state_alpha STRING,
prim_lat_dec DOUBLE,
prim_long_dec DOUBLE,
elev_in_ft BIGINT)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES(
"dynamodb.table.name" = "Features",
"dynamodb.column.mapping"="feature_id:Id,feature_name:Name,feature_class:Class,state_alpha:State,prim_lat_dec:Latitude,prim_long_dec:Longitude,elev_in_ft:Elevation"
);
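Optionally, before running the copy, you can control how much of the DynamoDB table's provisioned write capacity the job is allowed to consume using the EMR DynamoDB connector's throughput setting. The value 1.0 below (use up to 100% of provisioned write throughput) is only an example; pick a value that suits your workload.
-- Let the Hive job use up to 100% of the table's provisioned write capacity
SET dynamodb.throughput.write.percent = 1.0;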
3. Now insert data into the DynamoDB table from the Hive table (Hive internally runs a MapReduce job to perform the copy).
INSERT OVERWRITE TABLE ddb_features
SELECT
feature_id,
feature_name,
feature_class,
state_alpha,
prim_lat_dec,
prim_long_dec,
elev_in_ft
FROM hive_features;
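Once the MapReduce job finishes, a quick sanity check is to compare row counts on both sides. Note this is just a rough check: counting through ddb_features scans the DynamoDB table, so it will consume read capacity.
SELECT COUNT(*) FROM hive_features;
SELECT COUNT(*) FROM ddb_features;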
4. We can also use HiveQL to query the DynamoDB table.
SELECT DISTINCT feature_class
FROM ddb_features
ORDER BY feature_class;
SELECT feature_name, state_alpha
FROM ddb_features
WHERE feature_class = 'Lake'
AND feature_name LIKE 'M%'
ORDER BY feature_name;
SELECT state_alpha, feature_class, COUNT(*)
FROM ddb_features
WHERE elev_in_ft > 5280
GROUP BY state_alpha, feature_class
HAVING COUNT(*) >= 3
ORDER BY state_alpha, feature_class;