Tag Archives: hadoop

Wordcount execution on the Hadoop Cluster

Questions: I am following a tutorial to learn Hadoop with Java. I wrote the Wordcount program in IntelliJ, the job succeeded, and I can see the proper output file. Now I would like to run the app on the Hadoop cluster, but that fails. The Hadoop setup itself is fine and starts properly.… Read More »
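For reference, a minimal, self-contained driver in the spirit of the official WordCount tutorial is sketched below, under the assumption that the tutorial code looks similar. The detail that most often makes a job run in the IDE (local mode) yet fail on a real cluster is a missing job.setJarByClass(...), which tells Hadoop which jar to ship to the nodes.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Emits (word, 1) for every token in the input line.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Sums the counts for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    // Without this, the cluster nodes have no jar to load the
    // mapper/reducer from, and the job fails with ClassNotFoundException.
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged into a jar, it would be submitted with hadoop jar wordcount.jar WordCount <input> <output>, where the output directory must not already exist on HDFS.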

Implementing Apriori Algorithm on Hadoop

Questions: I am attempting to implement the Apriori algorithm using Hadoop. I have already implemented a non-distributed version of the Apriori algorithm, but my lack of familiarity with Hadoop and MapReduce has raised a number of concerns. My question is therefore how, in terms of tips or methods, to implement a part of the… Read More »
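One way in: the first Apriori pass (counting frequent 1-itemsets) maps almost directly onto MapReduce, and later passes repeat the same count-and-filter shape with the candidate k-itemsets shipped to every mapper (e.g. via the distributed cache). A minimal sketch of pass 1 follows; the input layout (one whitespace-separated transaction per line) and the apriori.min.support key are assumptions for illustration.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AprioriPass1 {

  // Emits (item, 1) for every item in a transaction line.
  public static class ItemMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text item = new Text();

    @Override
    protected void map(LongWritable key, Text transaction, Context ctx)
        throws IOException, InterruptedException {
      // Assumes one transaction per line, items separated by whitespace.
      for (String token : transaction.toString().trim().split("\\s+")) {
        if (!token.isEmpty()) {
          item.set(token);
          ctx.write(item, ONE);
        }
      }
    }
  }

  // Sums the counts and keeps only items that meet minimum support.
  public static class SupportReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private int minSupport;

    @Override
    protected void setup(Context ctx) {
      // Hypothetical config key, set on the job before submission.
      minSupport = ctx.getConfiguration().getInt("apriori.min.support", 2);
    }

    @Override
    protected void reduce(Text item, Iterable<IntWritable> counts, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) {
        sum += c.get();
      }
      if (sum >= minSupport) {
        ctx.write(item, new IntWritable(sum)); // a frequent 1-itemset
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setInt("apriori.min.support", Integer.parseInt(args[2]));
    Job job = Job.getInstance(conf, "apriori pass 1");
    job.setJarByClass(AprioriPass1.class);
    job.setMapperClass(ItemMapper.class);
    job.setReducerClass(SupportReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

No combiner is set here on purpose: SupportReducer filters by the global count, and applying that filter to partial sums in a combiner would drop items too early.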

Which is more efficient for copying a folder with many files in Amazon S3 using the Hadoop API in Java: FileUtil.copy() or DistCp.run()?

Questions: I’m trying to create a copy of a folder with a lot of files in Amazon S3. Both the source path and the target path are in an S3 bucket, but I don’t really know which option is more efficient: the FileUtil.copy() option or the DistCp.run() option. Both options are easy to implement; I’m just worried… Read More »
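For comparison, the two options look roughly like the sketch below (bucket and paths are hypothetical, and DistCp needs the hadoop-distcp artifact on the classpath). The practical difference: FileUtil.copy() runs inside the single client JVM and copies file by file, while DistCp submits a MapReduce job that spreads the copy across the cluster, which usually wins for folders with many or large files.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.tools.DistCp;
import org.apache.hadoop.util.ToolRunner;

public class S3FolderCopy {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path src = new Path("s3a://my-bucket/source/"); // hypothetical
    Path dst = new Path("s3a://my-bucket/target/"); // hypothetical

    // Option 1: FileUtil.copy - recursive, but serial and client-side.
    FileSystem srcFs = src.getFileSystem(conf);
    FileSystem dstFs = dst.getFileSystem(conf);
    FileUtil.copy(srcFs, src, dstFs, dst, false /* deleteSource */, conf);

    // Option 2: DistCp - runs as a MapReduce job across the cluster.
    int rc = ToolRunner.run(new DistCp(conf, null),
        new String[] { src.toString(), dst.toString() });
    System.exit(rc);
  }
}
```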

Maven exception when compiling Hadoop for 64-bit

Questions: I’m compiling Hadoop 2.4.0 on 64-bit Ubuntu using Maven. Here is the information about the Maven and JDK versions:

$ mvn -version
Apache Maven 3.5.3 (3383c37e1f9e9b3bc3df5050c29c8aff9f295297; 2018-02-24T19:49:05Z)
Maven home: /opt/apache-maven
Java version: 1.8.0_181, vendor: Oracle Corporation
Java home: /usr/lib/jvm/java-8-oracle/jre
Default locale: fr_FR, platform encoding: UTF-8
OS name: "linux", version: "4.15.0-29-generic", arch: "amd64", family: "unix"… Read More »
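For reference, Hadoop's own BUILDING.txt drives the 64-bit native build as shown below. Note that Hadoop 2.x additionally requires protoc 2.5.0 on the PATH, and a missing or mismatched protobuf compiler is one of the more common reasons this Maven build fails:

```sh
$ mvn package -Pdist,native -DskipTests -Dtar
```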

Apache Spark error using Hadoop to offload data to AWS S3

Questions: I’m using Apache Spark v2.3.1 and trying to offload data to AWS S3 after processing it, with something like:

data.write().parquet("s3a://" + bucketName + "/" + location);

The configuration seems to be fine:

String region = System.getenv("AWS_REGION");
String accessKeyId = System.getenv("AWS_ACCESS_KEY_ID");
String secretAccessKey = System.getenv("AWS_SECRET_ACCESS_KEY");
spark.sparkContext().hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");
spark.sparkContext().hadoopConfiguration().set("fs.s3a.awsRegion", region);
spark.sparkContext().hadoopConfiguration().set("fs.s3a.awsAccessKeyId", accessKeyId);
spark.sparkContext().hadoopConfiguration().set("fs.s3a.awsSecretAccessKey", secretAccessKey);

%HADOOP_HOME% leads to… Read More »
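One thing stands out in the excerpt: fs.s3a.awsRegion, fs.s3a.awsAccessKeyId, and fs.s3a.awsSecretAccessKey are not keys the S3A connector actually reads. The documented credential keys are fs.s3a.access.key and fs.s3a.secret.key, with the region normally expressed through the endpoint. A fragment reusing the variables from the question (the endpoint string built from the region is an assumption; check it against the bucket's region):

```java
// Reuses spark, region, accessKeyId, and secretAccessKey from the question.
Configuration hc = spark.sparkContext().hadoopConfiguration();
hc.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");
hc.set("fs.s3a.access.key", accessKeyId);     // documented S3A key
hc.set("fs.s3a.secret.key", secretAccessKey); // documented S3A key
// Assumed endpoint form, e.g. s3.eu-west-1.amazonaws.com for eu-west-1.
hc.set("fs.s3a.endpoint", "s3." + region + ".amazonaws.com");
```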

Maven can't find target/class directory using code from Hadoop: The Definitive Guide, 4th edition

Questions: Recently I have been learning Hadoop: The Definitive Guide, 4th edition, and trying to run the code (available at https://github.com/tomwhite/hadoop-book). But after installing Hadoop and Maven and typing % mvn package -DskipTests, I encounter the following problem:

[INFO] Hadoop: The Definitive Guide, Project ………….. SUCCESS [ 1.336 s]
[INFO] Common Code …………………………………. FAILURE [ 3.347… Read More »
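When one module of a multi-module build fails, Maven can rebuild just that module together with whatever it depends on, which isolates the real error. The module selector below is illustrative; substitute the artifactId or directory of the failing "Common Code" module:

```sh
$ mvn package -DskipTests -pl common -am
```

Here -pl picks the module and -am (also-make) builds its dependencies first.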

spark java.io.IOException: Failed to create local dir in /data5/hadoop/tmp [on hold]

Questions:

org.apache.spark.shuffle.FetchFailedException: java.io.FileNotFoundException: /data4/hadoop/tmp/nm-local-dir/usercache/hadoop/appcache/application_1530162545944_82254/blockmgr-6e73b913-6c19-4f28-844b-23b783738a14/2e/shuffle_0_25_0.index
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:146)
    at org.apache.spark.shuffle.IndexShuffleBlockResolver.getBlockData(IndexShuffleBlockResolver.scala:198)
    at org.apache.spark.storage.BlockManager.getBlockData(BlockManager.scala:278)
    at org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$2.apply(NettyBlockRpcServer.scala:60)
    at org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$2.apply(NettyBlockRpcServer.scala:60)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
    at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
    at org.apache.spark.network.netty.NettyBlockRpcServer.receive(NettyBlockRpcServer.scala:60)
    at org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:159)
    at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:107)
    at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:119)
    at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
    at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
    at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
    at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)… Read More »
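A FileNotFoundException on a shuffle index file under the NodeManager's local dirs usually points at one of the configured local directories being full, removed, or on a failed disk. One knob on the Spark side is spark.local.dir, sketched below with hypothetical paths; note that when running on YARN this setting is overridden by yarn.nodemanager.local-dirs from yarn-site.xml, which is where the /data4/hadoop/tmp path in the trace comes from.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class LocalDirsSketch {
  public static void main(String[] args) {
    // Hypothetical paths: point shuffle/spill files at disks known to be
    // healthy and writable. On YARN this is ignored in favour of
    // yarn.nodemanager.local-dirs, so the fix there lives in yarn-site.xml.
    SparkConf conf = new SparkConf()
        .setAppName("local-dirs-sketch")
        .set("spark.local.dir", "/data1/spark/tmp,/data2/spark/tmp");
    // Master and deploy mode are assumed to come from spark-submit.
    try (JavaSparkContext sc = new JavaSparkContext(conf)) {
      sc.parallelize(Arrays.asList(1, 2, 3)).count(); // trivial sanity job
    }
  }
}
```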