Apache Pig - Installation

This chapter explains how to download, install, and set up Apache Pig on your system.

Prerequisites

Hadoop and Java must be installed on your system before you work with Apache Pig. Install both first by following the steps given at the link below:

http://www.tutorialspoint.com/hadoop/hadoop_enviornment_setup.htm
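
Before moving on, it can help to confirm that both prerequisites are actually reachable from the shell. The following is a minimal sketch, not part of the original tutorial: check_prereqs is a hypothetical helper name, and it assumes that Java and Hadoop expose the commands java and hadoop on your PATH.

```shell
# Report whether each named command is available on PATH.
# The function's exit status is non-zero if at least one command is missing.
check_prereqs() (
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    [ "$missing" -eq 0 ]
)

# Assumed command names for the two prerequisites.
check_prereqs java hadoop || echo "Install the missing tools before continuing."
```

The function body runs in a subshell, so its working variables do not leak into your session.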

Download Apache Pig

First of all, download the latest version of Apache Pig from the following website: https://pig.apache.org/

Step 1

Open the homepage of the Apache Pig website. Under the News section, click the release page link.

Step 2

After clicking that link, you will be redirected to the Apache Pig Releases page. On this page, under the Download section, there are two links: Pig 0.8 and later and Pig 0.7 and before. Click the Pig 0.8 and later link; you will then be redirected to a page that lists a set of mirrors.

Step 3

Choose and click any one of these mirrors.

Step 4

The mirror takes you to the Pig Releases page, which lists the various versions of Apache Pig. Click the latest version among them.

Step 5

Within these folders, you will find the source and binary files of Apache Pig in various distributions. Download the tar files of the source and binary distributions of Apache Pig 0.15: pig-0.15.0-src.tar.gz and pig-0.15.0.tar.gz.
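
If you prefer the command line to the browser, the same two files can be fetched with a download tool. The sketch below is only a dry run that prints the commands. The base URL (the Apache archive) is an assumption that is not taken from this tutorial, pig_tarballs is a hypothetical helper name, and wget is assumed to be installed.

```shell
# Assumption: old Pig releases are kept under this base URL on the Apache archive.
PIG_VERSION=0.15.0
BASE_URL="https://archive.apache.org/dist/pig/pig-${PIG_VERSION}"

# Print the names of the source and binary tar files for a given version.
pig_tarballs() {
    echo "pig-$1-src.tar.gz"
    echo "pig-$1.tar.gz"
}

# Dry run: print the download commands instead of executing them.
for f in $(pig_tarballs "$PIG_VERSION"); do
    echo wget "${BASE_URL}/${f}"
done
```

Remove the echo in front of wget to perform the download, ideally from inside the Downloads directory used in the following steps.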

Install Apache Pig

After downloading the Apache Pig software, install it in your Linux environment by following the steps given below.

Step 1

Create a directory named Pig in the same directory where Hadoop, Java, and your other software are installed. (In our tutorial, we have created the Pig directory in the home directory of the user named Hadoop.)

$ mkdir Pig

Step 2

Extract the downloaded tar files as shown below.

$ cd Downloads/ 
$ tar zxvf pig-0.15.0-src.tar.gz 
$ tar zxvf pig-0.15.0.tar.gz

Step 3

Move the contents of the extracted pig-0.15.0-src directory to the Pig directory created earlier, as shown below.

$ mv pig-0.15.0-src/* /home/Hadoop/Pig/
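
The move step fails in a confusing way if the tar file has not been extracted yet, so a small guard can be useful. The following is a minimal sketch under stated assumptions: move_pig_tree is a hypothetical helper, and the demonstration runs in a scratch directory with a stand-in for the extracted tree rather than touching /home/Hadoop.

```shell
# Move the contents of an extracted Pig directory into the install directory.
# Usage: move_pig_tree <extracted_dir> <install_dir>
move_pig_tree() (
    src=$1
    dest=$2
    if [ ! -d "$src" ]; then
        echo "error: $src is not a directory (extract the tar file first)" >&2
        exit 1
    fi
    mkdir -p "$dest"
    mv "$src"/* "$dest"/
)

# Demonstration in a scratch directory with a stand-in for the extracted tree.
work=$(mktemp -d)
mkdir -p "$work/pig-0.15.0-src/bin"
touch "$work/pig-0.15.0-src/bin/pig"
move_pig_tree "$work/pig-0.15.0-src" "$work/Pig"
ls "$work/Pig"
```

Like the plain mv command, the glob does not match hidden files, so any dot-files in the extracted directory stay behind.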

Configure Apache Pig

After installing Apache Pig, we have to configure it. To do so, we need to edit two files: .bashrc and pig.properties.

.bashrc file

In the .bashrc file, set the following variables:

PIG_HOME to the installation folder of Apache Pig,
PATH to include the bin folder of that installation, and
PIG_CLASSPATH to the etc (configuration) folder of your Hadoop installation (the directory that contains the core-site.xml, hdfs-site.xml, and mapred-site.xml files). This is $HADOOP_HOME/conf on Hadoop 1.x and $HADOOP_HOME/etc/hadoop on Hadoop 2.x and later.

export PIG_HOME=/home/Hadoop/Pig
export PATH=$PATH:/home/Hadoop/Pig/bin
export PIG_CLASSPATH=$HADOOP_HOME/conf
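
Editing .bashrc by hand works, but re-running a setup script tends to add the same lines twice. The sketch below shows one way to append each export only once. append_once is a hypothetical helper, and a temporary file stands in for ~/.bashrc so the example is safe to run. Note that shell assignments take no spaces around the = sign.

```shell
# Append a line to a file only if that exact line is not already present.
append_once() {
    line=$1
    file=$2
    touch "$file"
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

rc_file=$(mktemp)   # stand-in for ~/.bashrc
append_once 'export PIG_HOME=/home/Hadoop/Pig' "$rc_file"
append_once 'export PATH=$PATH:/home/Hadoop/Pig/bin' "$rc_file"
append_once 'export PIG_CLASSPATH=$HADOOP_HOME/conf' "$rc_file"
append_once 'export PIG_HOME=/home/Hadoop/Pig' "$rc_file"   # no-op: already present
cat "$rc_file"
```

To apply this for real, pass "$HOME/.bashrc" instead of the temporary file and then run source ~/.bashrc (or open a new terminal) so that the variables take effect.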

pig.properties file

In the conf folder of Pig, there is a file named pig.properties, in which you can set various parameters. To list the supported properties, run:

$ pig -h properties

The following properties are supported:

Logging:
       verbose=true|false; default is false. This property is the same as -v switch
       brief=true|false; default is false. This property is the same as -b switch
       debug=OFF|ERROR|WARN|INFO|DEBUG; default is INFO. This property is the same as -d switch
       aggregate.warning=true|false; default is true.
           If true, prints count of warnings of each type rather than logging each warning.

Performance tuning:
       pig.cachedbag.memusage=<mem fraction>; default is 0.2 (20% of all memory).
           Note that this memory is shared across all large bags used by the application.
       pig.skewedjoin.reduce.memusagea=<mem fraction>; default is 0.3 (30% of all memory).
           Specifies the fraction of heap available for the reducer to perform the join.
       pig.exec.nocombiner=true|false; default is false.
           Only disable combiner as a temporary workaround for problems.
       opt.multiquery=true|false; multiquery is on by default.
           Only disable multiquery as a temporary workaround for problems.
       opt.fetch=true|false; fetch is on by default.
           Scripts containing Filter, Foreach, Limit, Stream, and Union can be dumped without MR jobs.
       pig.tmpfilecompression=true|false; compression is off by default.
           Determines whether output of intermediate jobs is compressed.
       pig.tmpfilecompression.codec=lzo|gzip; default is gzip.
           Used in conjunction with pig.tmpfilecompression. Defines compression type.
       pig.noSplitCombination=true|false. Split combination is on by default.
           Determines if multiple small files are combined into a single map.
       pig.exec.mapPartAgg=true|false. Default is false.
           Determines if partial aggregation is done within map phase, before records are sent to combiner.
       pig.exec.mapPartAgg.minReduction=<min aggregation factor>. Default is 10.
           If the in-map partial aggregation does not reduce the output num records by this factor, it gets disabled.

Miscellaneous:
       exectype=mapreduce|tez|local; default is mapreduce. This property is the same as -x switch
       pig.additional.jars.uris=<comma seperated list of jars>. Used in place of register command.
       udf.import.list=<comma seperated list of imports>. Used to avoid package names in UDF.
       stop.on.failure=true|false; default is false. Set to true to terminate on the first error.
       pig.datetime.default.tz=<UTC time offset>. e.g. +08:00. Default is the default timezone of the host.
           Determines the timezone used to handle datetime datatype and UDFs.
Additionally, any Hadoop property can be specified.
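
As an illustration of how such properties are written down, the sketch below creates a small properties file with three overrides taken from the list above and reads one of them back. It writes to a scratch file rather than to the real conf/pig.properties, and get_prop is a hypothetical helper, not a Pig command.

```shell
# Write a few overrides to a scratch file; in a real setup these lines
# would go into conf/pig.properties under the Pig installation.
props_file=$(mktemp)
cat > "$props_file" <<'EOF'
# Terminate on the first error
stop.on.failure=true
# Compress the output of intermediate jobs with gzip
pig.tmpfilecompression=true
pig.tmpfilecompression.codec=gzip
EOF

# Print the value of one property, matching the key exactly.
get_prop() {
    awk -F= -v k="$1" '$1 == k { print substr($0, length(k) + 2) }' "$2"
}

get_prop pig.tmpfilecompression.codec "$props_file"
```

The lookup compares the whole key, so asking for pig.tmpfilecompression does not accidentally return the codec line.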

Verifying the Installation

Verify the installation of Apache Pig by typing the version command. If the installation is successful, you will get the version of Apache Pig as shown below.

$ pig -version
 
Apache Pig version 0.15.0 (r1682971)  
compiled Jun 01 2015, 11:44:35
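
In a setup script, the same check can be automated by extracting the version number from that output. The following is a minimal sketch: parse_pig_version is a hypothetical helper, and it assumes the version line has the form shown above and is written to standard output.

```shell
# Extract the version number from the "Apache Pig version ..." line on stdin.
parse_pig_version() {
    sed -n 's/^Apache Pig version \([0-9][0-9.]*\).*/\1/p' | head -n 1
}

if command -v pig >/dev/null 2>&1; then
    installed=$(pig -version 2>/dev/null | parse_pig_version)
    echo "Apache Pig ${installed:-unknown} is on PATH"
else
    echo "pig not found on PATH; check PIG_HOME and PATH in .bashrc"
fi
```

If the command is not found even though the files are in place, the usual cause is that .bashrc has not been reloaded in the current terminal.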