Java Code to Upload File to Hdfs

5. Working with the Hadoop File System

A common task in Hadoop is interacting with its file system, whether for provisioning, adding new files to be processed, parsing results, or performing cleanup. Hadoop offers several ways to achieve that: one can use its Java API (namely FileSystem) or use the hadoop command line, in particular the file system shell. However there is no middle ground: one either has to use the (somewhat verbose, full of checked exceptions) API or fall back to the command line, outside the application. SHDP addresses this issue by bridging the two worlds, exposing both the FileSystem and the fs shell through an intuitive, easy-to-use Java API. Add your favorite JVM scripting language right inside your Spring for Apache Hadoop application and you have a powerful combination.

5.1 Configuring the file-system

The Hadoop file-system, HDFS, can be accessed in various ways - this section covers the most popular protocols for interacting with HDFS and their pros and cons. SHDP does not enforce any specific protocol - in fact, as described in this section, any FileSystem implementation can be used, allowing even implementations other than HDFS.

The table below describes the common HDFS APIs in use:

Table 5.1. HDFS APIs

File System | Comm. Method | Scheme / Prefix | Read / Write | Cross Version
HDFS | RPC | hdfs:// | Read / Write | Same HDFS version only
HFTP | HTTP | hftp:// | Read only | Version independent
WebHDFS | HTTP (REST) | webhdfs:// | Read / Write | Version independent

This chapter focuses on the core file-system protocols supported by Hadoop. S3, FTP and the rest of the other FileSystem implementations are supported as well - Spring for Apache Hadoop has no dependency on the underlying system, only on the public Hadoop API.

The hdfs:// protocol should be familiar to most readers - most docs (and in fact the previous chapter as well) mention it. It works out of the box and it's fairly efficient. However, because it is RPC based, it requires both the client and the Hadoop cluster to share the same version. Upgrading one without the other causes serialization errors, meaning the client cannot interact with the cluster. As an alternative one can use hftp://, which is HTTP-based, or its more secure brother hsftp:// (based on SSL), which gives you a version-independent protocol, meaning you can use it to interact with clusters of an unknown or different version than that of the client. hftp is read only (write operations will fail right away) and it is typically used with distcp for reading data. webhdfs:// is one of the additions in Hadoop 1.0 and is a mixture between the hdfs and hftp protocols - it provides a version-independent, read-write, REST-based protocol, which means that you can read from and write to Hadoop clusters no matter their version. Furthermore, since webhdfs:// is backed by a REST API, clients in other languages can use it with minimal effort.
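The trade-offs described above can be captured in a small lookup keyed by URI scheme. The following is only a sketch using the JDK: the class HdfsSchemes and its Traits holder are hypothetical names introduced for illustration and are not part of Hadoop or SHDP. The values simply restate Table 5.1 plus the hsftp variant mentioned in the text.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Illustrative lookup of the protocol traits from Table 5.1; not an SHDP or Hadoop class. */
final class HdfsSchemes {

    /** Hypothetical value holder for one row of the table. */
    static final class Traits {
        final String transport;
        final boolean writable;
        final boolean versionIndependent;

        Traits(String transport, boolean writable, boolean versionIndependent) {
            this.transport = transport;
            this.writable = writable;
            this.versionIndependent = versionIndependent;
        }

        @Override
        public String toString() {
            return transport + ", " + (writable ? "read/write" : "read only")
                    + ", " + (versionIndependent ? "version independent" : "same version only");
        }
    }

    private static final Map<String, Traits> TABLE = new HashMap<>();
    static {
        TABLE.put("hdfs", new Traits("RPC", true, false));
        TABLE.put("hftp", new Traits("HTTP", false, true));
        TABLE.put("hsftp", new Traits("HTTP over SSL", false, true));
        TABLE.put("webhdfs", new Traits("HTTP (REST)", true, true));
    }

    /** Returns the traits for the scheme of the given URI, or throws if the scheme is not listed. */
    static Traits describe(String uri) {
        String scheme = URI.create(uri).getScheme();
        Traits traits = scheme == null ? null : TABLE.get(scheme.toLowerCase(Locale.ROOT));
        if (traits == null) {
            throw new IllegalArgumentException("No table entry for scheme: " + scheme);
        }
        return traits;
    }

    public static void main(String[] args) {
        String[] uris = {"hdfs://localhost:8020", "hftp://old-cluster/", "webhdfs://localhost"};
        for (String uri : uris) {
            System.out.println(uri + " -> " + describe(uri));
        }
    }
}
```

A check such as describe(uri).writable could be used to reject a write attempt against an hftp:// location before any cluster call is made.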

[Note] Note

Not all file systems work out of the box. For example WebHDFS needs to be enabled first in the cluster (through the dfs.webhdfs.enabled property, see this document for more information) while the secure hftp, hsftp, requires the SSL configuration (such as certificates) to be specified. More about this (and how to use hftp/hsftp for proxying) in this page.

Once the scheme has been decided upon, one can specify it through the standard Hadoop configuration, either through the Hadoop configuration files or its properties:

    <hdp:configuration>
      fs.defaultFS=webhdfs://localhost
      ...
    </hdp:configuration>

This instructs Hadoop (and automatically SHDP) what the default, implied file-system is. In SHDP, one can create additional file-systems (potentially to connect to other clusters) and specify a different scheme:

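    <!-- the default file-system bean (no id specified) -->
    <hdp:file-system uri="webhdfs://localhost"/>

    <!-- an additional FileSystem instance pointing at another cluster -->
    <hdp:file-system id="old-cluster" uri="hftp://old-cluster/"/>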

As with the rest of the components, the file systems can be injected where needed - such as the file shell or inside scripts (see the next section).

5.2 Using HDFS Resource Loader

In Spring the ResourceLoader interface is meant to be implemented by objects that can return (i.e. load) Resource instances.

    public interface ResourceLoader {
      Resource getResource(String location);
    }

All application contexts implement the ResourceLoader interface, and therefore all application contexts may be used to obtain Resource instances.

When you call getResource() on a specific application context, and the location path specified doesn't have a specific prefix, you will get back a Resource type that is appropriate to that particular application context. For example, assume the following snippet of code was executed against a ClassPathXmlApplicationContext instance:

Resource template = ctx.getResource("some/resource/path/myTemplate.txt");

What would be returned would be a ClassPathResource; if the same method was executed against a FileSystemXmlApplicationContext instance, you'd get back a FileSystemResource. For a WebApplicationContext, you'd get back a ServletContextResource, and so on.

As such, you can load resources in a fashion appropriate to the particular application context.

On the other hand, you may also force ClassPathResource to be used, regardless of the application context type, by specifying the special classpath: prefix:

Resource template = ctx.getResource("classpath:some/resource/path/myTemplate.txt");
[Note] Note

For more information about the generic usage of resource loading, check the Spring Framework Documentation.

Spring Hadoop adds its own functionality to the generic concept of resource loading. The Resource abstraction in Spring has always been a way to ease resource access, removing the need to know where a resource is and how it is accessed. This abstraction also goes beyond a single resource by allowing patterns to be used to access multiple resources.

Let's first see how HdfsResourceLoader is used manually.

    <hdp:file-system />
    <hdp:resource-loader id="loader" file-system-ref="hadoopFs" />
    <hdp:resource-loader id="loaderWithUser" user="myuser" uri="hdfs://localhost:8020" />

In the above configuration we created two beans, one with a reference to an existing Hadoop FileSystem bean and one with an impersonated user.

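    // absolute path, resolved with the default loader and with user impersonation
    Resource tmpFile = loader.getResource("/tmp/file.txt");
    Resource tmpFileAsUser = loaderWithUser.getResource("/tmp/file.txt");

    // relative path, resolved against the home directory of the effective user
    Resource homeFile = loader.getResource("file.txt");
    Resource homeFileAsUser = loaderWithUser.getResource("file.txt");

    // pattern lookups: direct children, recursive, and single-character wildcards
    Resource[] children = loader.getResources("/tmp/*");
    Resource[] recursive = loader.getResources("/tmp/**/*");
    Resource[] wildcard = loader.getResources("/tmp/?ile?.txt");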

What would be returned in the above examples would be instances of HdfsResource.
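To get a feel for what the three pattern shapes used above select, the following sketch applies the same strings to a handful of hypothetical paths using the JDK's own glob matcher. This is a stand-in only: it does not touch HDFS or HdfsResourceLoader, the class name PatternDemo is made up for the example, and the JDK's glob rules are not guaranteed to agree with the loader's matching in every case.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: uses the JDK glob matcher as a stand-in for HDFS pattern resolution. */
final class PatternDemo {

    /** Returns the candidates whose path matches the glob, in input order. */
    static List<String> select(String glob, List<String> candidates) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        List<String> hits = new ArrayList<>();
        for (String candidate : candidates) {
            if (matcher.matches(Paths.get(candidate))) {
                hits.add(candidate);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical paths; nothing is read from disk or from HDFS.
        List<String> paths = Arrays.asList("/tmp/file1.txt", "/tmp/data.csv", "/tmp/sub/file2.txt");
        for (String glob : new String[] {"/tmp/*", "/tmp/**/*", "/tmp/?ile?.txt"}) {
            System.out.println(glob + " -> " + select(glob, paths));
        }
    }
}
```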

If there is a need for the Spring Application Context to be aware of HdfsResourceLoader, it needs to be registered using the hdp:resource-loader-registrar namespace tag.

    <hdp:file-system />
    <hdp:resource-loader file-system-ref="hadoopFs" handle-noprefix="false" />
    <hdp:resource-loader-registrar />
[Note] Note

By default the HdfsResourceLoader will handle all resource paths without a prefix. The handle-noprefix attribute can be used to control this behaviour. If this attribute is set to false, non-prefixed resource URIs will be handled by the Spring Application Context.

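    // 'default.txt' resolved through the hdfs: prefix
    Resource[] defaults = context.getResources("hdfs:default.txt");
    // all files directly under the HDFS root
    Resource[] rootFiles = context.getResources("hdfs:/*");
    // the classpath: prefix is left to the application context
    Resource[] configs = context.getResources("classpath:cfg*properties");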

What would be returned in the above examples would be instances of HdfsResource, and a ClassPathResource for the last one. Requests for resource paths without a prefix would, in this example, fall back to the Spring Application Context. It may be advisable to let HdfsResourceLoader handle paths without a prefix if your application doesn't rely on loading resources from the underlying context without prefixes.
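The routing rules just described (hdfs: goes to the HDFS loader, other prefixes stay with the context, and unprefixed paths depend on handle-noprefix) can be summarised in a few lines. This is a simplified sketch of the decision only, not SHDP's implementation; the class PrefixRouting and the returned labels are hypothetical.

```java
/** Simplified model of which component resolves a resource location; not SHDP code. */
final class PrefixRouting {

    static final String HDFS_LOADER = "hdfs-loader";
    static final String APP_CONTEXT = "application-context";

    /**
     * Decides who handles the location.
     *
     * @param location       resource location such as "hdfs:/tmp/a.txt" or "file.txt"
     * @param handleNoPrefix mirrors the handle-noprefix attribute of hdp:resource-loader
     */
    static String route(String location, boolean handleNoPrefix) {
        int colon = location.indexOf(':');
        // Treat the text before ':' as a prefix only if it contains no path separator.
        boolean hasPrefix = colon > 0 && location.lastIndexOf('/', colon) < 0;
        if (!hasPrefix) {
            return handleNoPrefix ? HDFS_LOADER : APP_CONTEXT;
        }
        String prefix = location.substring(0, colon);
        return prefix.equals("hdfs") ? HDFS_LOADER : APP_CONTEXT;
    }

    public static void main(String[] args) {
        System.out.println(route("hdfs:default.txt", false));
        System.out.println(route("classpath:cfg*properties", false));
        System.out.println(route("default.txt", false));
        System.out.println(route("default.txt", true));
    }
}
```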

Table 5.2. hdp:resource-loader attributes

Name | Values | Description
file-system-ref | Bean Reference | Reference to existing Hadoop FileSystem bean
use-codecs | Boolean (defaults to true) | Indicates whether to use (or not) the codecs found inside the Hadoop configuration when accessing the resource input stream.
user | String | The security user (ugi) to use for impersonation at runtime.
uri | String | The underlying HDFS system URI.
handle-noprefix | Boolean (defaults to true) | Indicates if loader should handle resource paths without prefix.


Table 5.3. hdp:resource-loader-registrar attributes

Name | Values | Description
loader-ref | Bean Reference | Reference to existing Hdfs resource loader bean. Default value is 'hadoopResourceLoader'.


5.3 Scripting the Hadoop API

SHDP scripting supports any JSR-223 (also known as javax.script) compliant scripting engine. Simply add the engine jar to the classpath and the application should be able to find it. Most languages (such as Groovy or JRuby) provide JSR-223 support out of the box; for those that do not, see the scripting project that provides various adapters.

Since Hadoop is written in Java, accessing its APIs in a native way provides maximum control and flexibility over the interaction with Hadoop. This holds true for working with its file systems; in fact all the other tools that one might use are built upon these. The main entry point is the org.apache.hadoop.fs.FileSystem abstract class which provides the foundation of most (if not all) of the actual file system implementations out there. Whether one is using a local, remote or distributed store, through the FileSystem API she can query and manipulate the available resources or create new ones. To do so however, one needs to write Java code, compile the classes and configure them, which is somewhat cumbersome especially when performing simple, straightforward operations (like copying a file or deleting a directory).

JVM scripting languages (such as Groovy, JRuby, Jython or Rhino to name just a few) provide a nice alternative to the Java language; they run on the JVM, can interact with Java code with no or few changes or restrictions and have a nicer, simpler, less ceremonial syntax; that is, there is no need to define a class or a method - simply write the code that you want to execute and you are done. SHDP combines the two, taking care of the configuration and the infrastructure so one can interact with the Hadoop environment from her language of choice.

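Let us take a look at a JavaScript example using Rhino (which is part of JDK 6 or higher, meaning one does not need any extra libraries; note that JDK 15 and later no longer bundle a JavaScript engine, so one has to be added to the classpath there):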

    <beans xmlns="http://www.springframework.org/schema/beans" ...>
      <hdp:configuration .../>

      <hdp:script id="inlined-js" language="javascript" run-at-startup="true">
        try {load("nashorn:mozilla_compat.js");} catch (e) {} // for Java 8
        importPackage(java.util);

        name = UUID.randomUUID().toString()
        scriptName = "src/test/resources/test.properties"
        // fs - FileSystem instance based on 'hadoopConfiguration' bean
        // call FileSystem#copyFromLocal(Path, Path)
        fs.copyFromLocalFile(scriptName, name)
        // return the file length
        fs.getLength(name)
      </hdp:script>

    </beans>

The script element, part of the SHDP namespace, builds on top of the scripting support in Spring, permitting script declarations to be evaluated and declared as normal bean definitions. Furthermore it automatically exposes Hadoop-specific objects, based on the existing configuration, to the script, such as the FileSystem (more on that in the next section). As one can see, the script is fairly obvious: it generates a random name (using the UUID class from the java.util package) and then copies a local file into HDFS under the random name. The last line returns the length of the copied file, which becomes the value of the declaring bean (in this case inlined-js) - note that this might vary based on the scripting engine used.

[Note] Note

The attentive reader might have noticed that the arguments passed to the FileSystem object are not of type Path but rather String. To avoid the creation of Path objects, SHDP uses a wrapper class, SimplerFileSystem, which automatically does the conversion so you don't have to. For more information see the implicit variables section.

Note that for inlined scripts, one can use Spring's property placeholder configurer to automatically expand variables at runtime. Using one of the examples seen before:

    <beans ...>
      <context:property-placeholder location="classpath:hadoop.properties" />

      <hdp:script language="javascript" run-at-startup="true">
        ...
        tracker=${hd.fs}
        ...
      </hdp:script>
    </beans>

Notice how the script above relies on the property placeholder to expand ${hd.fs} with the values from the hadoop.properties file available in the classpath.

As you might have noticed, the script element defines a runner for JVM scripts. And just like the rest of the SHDP runners, it allows one or multiple pre and post actions to be specified, to be executed before and after each run. Typically other runners (such as other jobs or scripts) can be specified but any JDK Callable can be passed in. Do note that the runner will not run unless triggered manually or if run-at-startup is set to true. For more information on runners, see the dedicated chapter.
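The pre/post mechanism described above boils down to invoking a list of Callable instances around the main action. The sketch below shows that ordering with plain JDK types; SimpleRunner is a hypothetical class written for this illustration and does not reflect how SHDP's runners are actually implemented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;

/** Hypothetical runner that executes pre actions, the main action, then post actions. */
final class SimpleRunner<T> {

    private final List<Callable<?>> pre = new ArrayList<>();
    private final List<Callable<?>> post = new ArrayList<>();
    private final Callable<T> main;

    SimpleRunner(Callable<T> main) {
        this.main = main;
    }

    /** Registers an action to run before the main action. */
    SimpleRunner<T> before(Callable<?> action) {
        pre.add(action);
        return this;
    }

    /** Registers an action to run after the main action. */
    SimpleRunner<T> after(Callable<?> action) {
        post.add(action);
        return this;
    }

    /** Runs everything in order; any failure aborts the run and is rethrown unchecked. */
    T run() {
        try {
            for (Callable<?> action : pre) {
                action.call();
            }
            T result = main.call();
            for (Callable<?> action : post) {
                action.call();
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Run failed", e);
        }
    }

    public static void main(String[] args) {
        SimpleRunner<String> runner = new SimpleRunner<String>(() -> "script result")
                .before(() -> { System.out.println("pre action"); return null; })
                .after(() -> { System.out.println("post action"); return null; });
        System.out.println(runner.run());
    }
}
```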

5.3.1 Using scripts

Inlined scripting is quite handy for doing simple operations and, coupled with the property expansion, is quite a powerful tool that can handle a variety of use cases. However when more logic is required or the script is affected by XML formatting, encoding or syntax restrictions (such as Jython/Python for which white-spaces are important) one should consider externalization. That is, rather than declaring the script directly inside the XML, one can declare it in its own file. And speaking of Python, consider the variation of the previous example:

    <hdp:script location="org/company/basic-script.py" run-at-startup="true"/>

The definition does not bring any surprises but do notice there is no need to specify the language (as in the case of an inlined declaration) since the script extension (py) already provides that information. Just for completeness, the basic-script.py looks as follows:

    from java.util import UUID
    from org.apache.hadoop.fs import Path

    print "Home dir is " + str(fs.homeDirectory)
    print "Work dir is " + str(fs.workingDirectory)
    print "/user exists " + str(fs.exists("/user"))

    name = UUID.randomUUID().toString()
    scriptName = "src/test/resources/test.properties"
    fs.copyFromLocalFile(scriptName, name)
    print Path(name).makeQualified(fs)

5.4 Scripting implicit variables

To ease the interaction of the script with its enclosing context, SHDP binds by default the so-called implicit variables. These are:

Table 5.4. Implicit variables

Name | Type | Description
cfg | Configuration | Hadoop Configuration (relies on hadoopConfiguration bean or singleton type match)
cl | ClassLoader | ClassLoader used for executing the script
ctx | ApplicationContext | Enclosing application context
ctxRL | ResourcePatternResolver | Enclosing application context ResourceLoader
distcp | DistCp | Programmatic access to DistCp
fs | FileSystem | Hadoop File System (relies on 'hadoop-fs' bean or singleton type match, falls back to creating one based on 'cfg')
fsh | FsShell | File System shell, exposing hadoop 'fs' commands as an API
hdfsRL | HdfsResourceLoader | Hdfs resource loader (relies on 'hadoop-resource-loader' or singleton type match, falls back to creating one automatically based on 'cfg')


[Note] Note

If no Hadoop Configuration can be detected (either by name hadoopConfiguration or by type), several log warnings will be made and none of the Hadoop-based variables (namely cfg, distcp, fs, fsh or hdfsRL) will be bound.

As mentioned in the Description column, the variables are first looked up (either by name or by type) in the application context and, in case they are missing, created on the spot based on the existing configuration. Note that it is possible to override or add new variables to the scripts through the property sub-element that can set values or references to other beans:

    <hdp:script location="org/company/basic-script.js" run-at-startup="true">
      <hdp:property name="foo" value="bar"/>
      <hdp:property name="ref" ref="some-bean"/>
    </hdp:script>
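Conceptually, implicit variables and property overrides both end up as entries in the bindings handed to the JSR-223 engine. The sketch below shows that idea with the standard javax.script API only: the ScriptVariables class is hypothetical, plain strings stand in for the real Hadoop objects, and the actual SHDP wiring may differ. The script is evaluated only if an engine for the requested language happens to be installed.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import javax.script.Bindings;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;
import javax.script.ScriptException;
import javax.script.SimpleBindings;

/** Hypothetical helper showing how default and user-supplied script variables could be combined. */
final class ScriptVariables {

    /** Merges implicit variables with user properties; a property wins on a name clash. */
    static Bindings merge(Map<String, Object> implicit, Map<String, Object> properties) {
        Bindings bindings = new SimpleBindings(new LinkedHashMap<>(implicit));
        bindings.putAll(properties);
        return bindings;
    }

    /** Evaluates the script if an engine for the language is available, otherwise returns null. */
    static Object evalIfAvailable(String language, String script, Bindings bindings) {
        ScriptEngine engine = new ScriptEngineManager().getEngineByName(language);
        if (engine == null) {
            return null; // no such engine on the classpath
        }
        try {
            return engine.eval(script, bindings);
        } catch (ScriptException e) {
            throw new IllegalStateException("Script evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        Map<String, Object> implicit = new LinkedHashMap<>();
        implicit.put("fs", "stand-in for a FileSystem");
        implicit.put("cfg", "stand-in for a Configuration");

        Map<String, Object> properties = new LinkedHashMap<>();
        properties.put("foo", "bar");

        Bindings bindings = merge(implicit, properties);
        System.out.println("bound names: " + bindings.keySet());
        System.out.println("script result: " + evalIfAvailable("javascript", "foo", bindings));
    }
}
```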

5.4.1 Running scripts

The script namespace provides various options to adjust its behaviour depending on the script content. By default the script is only declared - that is, no execution occurs. One however can change that so that the script gets evaluated at startup (as all the examples in this section do) through the run-at-startup flag (which is by default false) or when invoked manually (through the Callable). Similarly, by default the script gets evaluated on each run. However, for scripts that are expensive and return the same value every time, one has various caching options, so the evaluation occurs only when needed, through the evaluate attribute:

Table 5.5. script attributes

Name | Values | Description
run-at-startup | false (default), true | Whether the script is executed at startup or not
evaluate | ALWAYS (default), IF_MODIFIED, ONCE | Whether to actually evaluate the script when invoked or use a previous value. ALWAYS means evaluate every time, IF_MODIFIED evaluate if the backing resource (such as a file) has been modified in the meantime, and ONCE only once.
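The three evaluate policies can be pictured as a small caching wrapper around the script. The sketch below is an illustration under assumptions: CachingEvaluator is a hypothetical class, the script is modelled as a Supplier, and the "last modified" stamp of the backing resource is modelled as a LongSupplier. It is not SHDP's implementation.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical wrapper that re-evaluates a script according to an evaluation policy. */
final class CachingEvaluator<T> {

    enum Policy { ALWAYS, IF_MODIFIED, ONCE }

    private final Policy policy;
    private final Supplier<T> script;
    private final LongSupplier lastModified;

    private boolean evaluated;
    private long seenModified;
    private T cached;

    CachingEvaluator(Policy policy, Supplier<T> script, LongSupplier lastModified) {
        this.policy = policy;
        this.script = script;
        this.lastModified = lastModified;
    }

    /** Returns the script value, evaluating the script only when the policy requires it. */
    synchronized T get() {
        long modified = lastModified.getAsLong();
        boolean run;
        switch (policy) {
            case ALWAYS:
                run = true;
                break;
            case IF_MODIFIED:
                run = !evaluated || modified != seenModified;
                break;
            default: // ONCE
                run = !evaluated;
        }
        if (run) {
            cached = script.get();
            seenModified = modified;
            evaluated = true;
        }
        return cached;
    }

    public static void main(String[] args) {
        int[] runs = {0};
        CachingEvaluator<Integer> once =
                new CachingEvaluator<>(Policy.ONCE, () -> ++runs[0], () -> 0L);
        once.get();
        once.get();
        System.out.println("evaluations with ONCE: " + runs[0]);
    }
}
```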


5.4.2 Using the Scripting tasklet

For Spring Batch environments, SHDP provides a dedicated tasklet to execute scripts.

    <script-tasklet id="script-tasklet">
      <script language="groovy">
        inputPath = "/user/gutenberg/input/word/"
        outputPath = "/user/gutenberg/output/word/"
        if (fsh.test(inputPath)) {
          fsh.rmr(inputPath)
        }
        if (fsh.test(outputPath)) {
          fsh.rmr(outputPath)
        }
        inputFile = "src/main/resources/data/nietzsche-chapter-1.txt"
        fsh.put(inputFile, inputPath)
      </script>
    </script-tasklet>

The tasklet above embeds the script as a nested element. You can also declare a reference to another script definition, using the script-ref attribute, which allows you to externalize the scripting code to an external resource.

    <script-tasklet id="script-tasklet" script-ref="clean-up"/>

    <hdp:script id="clean-up" location="org/company/myapp/clean-up-wordcount.groovy"/>

5.5 File System Shell (FsShell)

A handy utility provided by the Hadoop distribution is the file system shell which allows UNIX-like commands to be executed against HDFS. One can check for the existence of files, delete, move, copy directories or files or set up permissions. However the utility is only available from the command-line which makes it hard to use from/inside a Java application. To address this problem, SHDP provides a lightweight, fully embeddable shell, called FsShell, which mimics most of the commands available from the command line: rather than dealing with System.in or System.out, one deals with objects.

Let us take a look at using FsShell by building on the previous scripting examples:

    <hdp:script location="org/company/basic-script.groovy" run-at-startup="true"/>

    name = UUID.randomUUID().toString()
    scriptName = "src/test/resources/test.properties"
    fs.copyFromLocalFile(scriptName, name)

    // use the shell made available under the variable fsh
    dir = "script-dir"
    if (!fsh.test(dir)) {
       fsh.mkdir(dir); fsh.cp(name, dir); fsh.chmodr(700, dir)
       println "File content is " + fsh.cat(dir + "/" + name).toString()
    }
    println fsh.ls(dir).toString()
    fsh.rmr(dir)

As mentioned in the previous section, a FsShell instance is automatically created and configured for scripts, under the name fsh. Notice how the entire block relies on the usual commands: test, mkdir, cp and so on. Their semantics are exactly the same as in the command-line version, however one has access to a native Java API that returns actual objects (rather than Strings), making it easy to use them programmatically whether in Java or another language. Furthermore, the class offers enhanced methods (such as chmodr, which stands for recursive chmod) and multiple overloaded methods taking advantage of varargs so that multiple parameters can be specified. Consult the API for more information.

To be as close as possible to the command-line shell, FsShell mimics even the messages being displayed. Take a look at the println statement that prints the result of fsh.cat(). The method returns a Collection of Hadoop Path objects (which one can use programmatically). However when invoking toString on the collection, the same printout as from the command-line shell is displayed:

File content is

The same goes for the rest of the methods, such as ls. The same script in JRuby would look something like this:

    require 'java'
    name = java.util.UUID.randomUUID().to_s
    scriptName = "src/test/resources/test.properties"
    $fs.copyFromLocalFile(scriptName, name)

    dir = "script-dir/"
    ...
    print $fsh.ls(dir).to_s

which prints out something like this:

    drwx------   - user     supergroup          0 2012-01-26 14:08 /user/user/script-dir
    -rw-r--r--   3 user     supergroup        344 2012-01-26 14:08 /user/user/script-dir/520cf2f6-a0b6-427e-a232-2d5426c2bc4e

As you can see, not only can you reuse the existing tools and commands with Hadoop inside SHDP, but you can also code against them in various scripting languages. And as you might have noticed, there is no special configuration required - this is automatically inferred from the enclosing application context.
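The idea of returning real objects whose toString still reads like shell output can be shown with a tiny collection subclass. This is a sketch of the concept only: PrettyPrintList is a hypothetical class for this example, it works on plain strings rather than Hadoop Path objects, and it is not the type FsShell actually returns.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.stream.Collectors;

/** Hypothetical helper: a list that remains usable as data but prints like a shell listing. */
final class PrettyPrintList<E> extends ArrayList<E> {

    private static final long serialVersionUID = 1L;

    PrettyPrintList(Collection<? extends E> items) {
        super(items);
    }

    /** Renders one element per line, similar to command-line output. */
    @Override
    public String toString() {
        return stream().map(item -> String.valueOf(item)).collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        PrettyPrintList<String> listing = new PrettyPrintList<>(
                Arrays.asList("/user/user/script-dir", "/user/user/script-dir/data"));
        System.out.println(listing);         // text form, one entry per line
        System.out.println(listing.size());  // still a regular collection
    }
}
```

Callers that want the data iterate over the collection as usual, while callers that only want to log or print the result rely on toString.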

[Note] Note

The careful reader might have noticed that besides the syntax, there are some minor differences in how the various languages interact with the Java objects. For example the automatic toString call made in Java for automatic String conversion is not necessarily supported (hence the to_s in Ruby or str in Python). This is to be expected as each language has its own semantics - for the most part these are easy to pick up but do pay attention to details.

5.5.1 DistCp API

Similar to the FsShell, SHDP provides a lightweight, fully embeddable DistCp version that builds on top of the distcp from the Hadoop distro. The semantics and configuration options are the same, however one can use it from within a Java application without having to use the command-line. See the API for more information:

    <hdp:script language="groovy">distcp.copy("${distcp.src}", "${distcp.dst}")</hdp:script>

The bean above triggers a distributed copy relying again on Spring's property placeholder variable expansion for its source and destination.

diazmustrien.blogspot.com

Source: https://docs.spring.io/spring-hadoop/docs/current/reference/html/springandhadoop-fs.html
