Apache Tajo - Управление таблицами

Таблица - это логическое представление одного источника данных. Он состоит из логической схемы, разделов, URL-адреса и различных свойств. Таблица Tajo может быть каталогом в HDFS, отдельным файлом, одной таблицей HBase или таблицей РСУБД.

Tajo поддерживает следующие два типа таблиц:

  • внешний стол
  • внутренний стол

Внешний стол

При создании таблицы внешней таблице требуется свойство местоположения. Например, если ваши данные уже существуют в виде файлов Text / JSON или таблицы HBase, вы можете зарегистрировать их как внешнюю таблицу Tajo.

Следующий запрос - это пример создания внешней таблицы.

create external table sample(col1 int,col2 text,col3 int) location ‘hdfs://path/to/table';

Вот,

  • External keyword- Используется для создания внешней таблицы. Это помогает создать таблицу в указанном месте.

  • Образец ссылается на название таблицы.

  • Location- Это каталог для HDFS, Amazon S3, HBase или локальной файловой системы. Чтобы назначить свойство местоположения для каталогов, используйте приведенные ниже примеры URI:

    • HDFS - hdfs: // localhost: порт / путь / к / таблице

    • Amazon S3 - s3: // имя-ведра / таблица

    • local file system - файл: /// путь / к / таблице

    • Openstack Swift - swift: // имя-ведра / таблица

Table Properties

An external table has the following properties −

  • TimeZone − Users can specify a time zone for reading or writing a table.

  • Compression format − Used to make data size compact. For example, the text/json file uses compression.codec property.

Internal Table

A Internal table is also called an Managed Table. It is created in a pre-defined physical location called the Tablespace.

Syntax

create table table1(col1 int,col2 text);

By default, Tajo uses “tajo.warehouse.directory” located in “conf/tajo-site.xml” . To assign new location for the table, you can use Tablespace configuration.

Tablespace

Tablespace is used to define locations in the storage system. It is supported for only internal tables. You can access the tablespaces by their names. Each tablespace can use a different storage type. If you don’t specify tablespaces then, Tajo uses the default tablespace in the root directory.

Tablespace Configuration

You have “conf/tajo-site.xml.template” in Tajo. Copy the file and rename it to “storagesite.json”. This file will act as a configuration for Tablespaces. Tajo data formats uses the following configuration −

HDFS Configuration

$ vi conf/storage-site.json { 
   "spaces": {  
      "${tablespace_name}": {  
         "uri": “hdfs://localhost:9000/path/to/Tajo"  
      } 
   } 
}

HBase Configuration

$ vi conf/storage-site.json { 
   "spaces": {  
      "${tablespace_name}": {  
         "uri": “hbase:zk://quorum1:port,quorum2:port/"  
      } 
   } 
}

Text File Configuration

$ vi conf/storage-site.json { 
   "spaces": {  
      "${tablespace_name}": {  
         “uri”: “hdfs://localhost:9000/path/to/Tajo” 
      } 
   } 
}

Tablespace Creation

Tajo’s internal table records can be accessed from another table only. You can configure it with tablespace.

Syntax

CREATE TABLE [IF NOT EXISTS] <table_name> [(column_list)] [TABLESPACE tablespace_name] 
[using <storage_type> [with (<key> = <value>, ...)]] [AS <select_statement>]

Here,

  • IF NOT EXISTS − This avoids an error if the same table has not been created already.

  • TABLESPACE − This clause is used to assign the tablespace name.

  • Storage type − Tajo data supports formats like text,JSON,HBase,Parquet,Sequencefile and ORC.

  • AS select statement − Select records from another table.

Configure Tablespace

Start your Hadoop services and open the file “conf/storage-site.json”, then add the following changes −

$ vi conf/storage-site.json { 
   "spaces": {  
      “space1”: {  
         "uri": “hdfs://localhost:9000/path/to/Tajo" 
      } 
   } 
}

Here, Tajo will refer to the data from HDFS location and space1 is the tablespace name. If you do not start Hadoop services, you can’t register tablespace.

Query

default> create table table1(num1 int,num2 text,num3 float) tablespace space1;

The above query creates a table named “table1” and “space1” refers to the tablespace name.

Data formats

Tajo supports data formats. Let’s go through each of the formats one by one in detail.

Text

A character-separated values’ plain text file represents a tabular data set consisting of rows and columns. Each row is a plain text line.

Creating Table

default> create external table customer(id int,name text,address text,age int) 
   using text with('text.delimiter'=',') location ‘file:/Users/workspace/Tajo/customers.csv’;

Here, “customers.csv” file refers to a comma separated value file located in the Tajo installation directory.

To create internal table using text format, use the following query −

default> create table customer(id int,name text,address text,age int) using text;

In the above query, you have not assigned any tablespace so it will take Tajo’s default tablespace.

Properties

A text file format has the following properties −

  • text.delimiter − This is a delimiter character. Default is ‘|’.

  • compression.codec − This is a compression format. By default, it is disabled. you can change the settings using specified algorithm.

  • timezone − The table used for reading or writing.

  • text.error-tolerance.max-num − The maximum number of tolerance levels.

  • text.skip.headerlines − The number of header lines per skipped.

  • text.serde − This is serialization property.

JSON

Apache Tajo supports JSON format for querying data. Tajo treats a JSON object as SQL record. One object equals one row in a Tajo table. Let’s consider “array.json” as follows −

$ hdfs dfs -cat /json/array.json { 
   "num1" : 10, 
   "num2" : "simple json array", 
   "num3" : 50.5 
}

After you create this file, switch to the Tajo shell and type the following query to create a table using the JSON format.

Query

default> create external table sample (num1 int,num2 text,num3 float) 
   using json location ‘json/array.json’;

Always remember that the file data must match with the table schema. Otherwise, you can omit the column names and use * which doesn’t require columns list.

To create an internal table, use the following query −

default> create table sample (num1 int,num2 text,num3 float) using json;

Parquet

Parquet is a columnar storage format. Tajo uses Parquet format for easy, fast and efficient access.

Table creation

The following query is an example for table creation −

CREATE TABLE parquet (num1 int,num2 text,num3 float) USING PARQUET;

Parquet file format has the following properties −

  • parquet.block.size − size of a row group being buffered in memory.

  • parquet.page.size − The page size is for compression.

  • parquet.compression − The compression algorithm used to compress pages.

  • parquet.enable.dictionary − The boolean value is to enable/disable dictionary encoding.

RCFile

RCFile is the Record Columnar File. It consists of binary key/value pairs.

Table creation

The following query is an example for table creation −

CREATE TABLE Record(num1 int,num2 text,num3 float) USING RCFILE;

RCFile has the following properties −

  • rcfile.serde − custom deserializer class.

  • compression.codec − compression algorithm.

  • rcfile.null − NULL character.

SequenceFile

SequenceFile is a basic file format in Hadoop which consists of key/value pairs.

Table creation

The following query is an example for table creation −

CREATE TABLE seq(num1 int,num2 text,num3 float) USING sequencefile;

This sequence file has Hive compatibility. This can be written in Hive as,

CREATE TABLE table1 (id int, name string, score float, type string) 
STORED AS sequencefile;

ORC

ORC (Optimized Row Columnar) is a columnar storage format from Hive.

Table creation

The following query is an example for table creation −

CREATE TABLE optimized(num1 int,num2 text,num3 float) USING ORC;

The ORC format has the following properties −

  • orc.max.merge.distance − ORC file is read, it merges when the distance is lower.

  • orc.stripe.size − This is the size of each stripe.

  • orc.buffer.size − The default is 256KB.

  • orc.rowindex.stride − This is the ORC index stride in number of rows.