Tag Archives: hadoop

How to convert .txt file to Hadoop's sequence file format

Questions: To use MapReduce jobs effectively in Hadoop, I need my data stored in Hadoop’s sequence file format. However, the data is currently only in flat .txt format. Can anyone suggest a way to convert a .txt file to a sequence file? Answers: The simplest answer is just an “identity” job that…

Reading file as single record in hadoop

Questions: I have a huge number of small files, and I want to use CombineFileInputFormat to merge them so that each file's data arrives as a single record in my MR job. I followed http://yaseminavcular.blogspot.in/2011/03/many-small-input-files.html and tried to convert it to the new API. I am facing two problems: a) I am just testing it…

Hadoop DistributedCache is deprecated – what is the preferred API?

Questions: My map tasks need some configuration data, which I would like to distribute via the Distributed Cache. The Hadoop MapReduce Tutorial shows the usage of the DistributedCache class, roughly as follows:

    // In the driver
    JobConf conf = new JobConf(getConf(), WordCount.class);
    …
    DistributedCache.addCacheFile(new Path(filename).toUri(), conf);

    // In the mapper
    Path[] myCacheFiles = DistributedCache.getLocalCacheFiles(job);
    …

…

Why did IdentityMapper disappear from the org.apache.hadoop.mapreduce library?

Questions: In the older version of the Hadoop library (i.e., org.apache.hadoop.mapred.lib), there is a basic Mapper implementation called IdentityMapper, which essentially passes all key-value pairs through to a Reducer. However, the newer version of the library (org.apache.hadoop.mapreduce.lib) has no class called IdentityMapper (all the subclasses of Mapper can be…

How to import the package org.apache.hadoop.mapreduce.lib.chain in a Hadoop 0.20.2 project?

Questions: I’m trying to chain map and reduce phases in one job. The problem is that I’m running Hadoop 0.20.2, where the package org.apache.hadoop.mapred.lib.Chain seems to be deprecated and replaced by the package org.apache.hadoop.mapreduce.lib.chain, which is not available in version 0.20.2 (only in 0.21.0). My question is: what should I do to import…

Eclipse Map and Reduce Plugin & Hadoop Tutorial

Questions: I’m brand new to Hadoop and I’m following this Yahoo Tutorial (http://developer.yahoo.com/hadoop/tutorial/). I’m currently trying to configure Eclipse and the Map and Reduce plugin to connect to the virtual machine. One of the settings I need to configure is hadoop.job.ugi. It does not appear under the Advanced Settings tab of the plugin. Without…

Hadoop: No FileSystem for scheme: file

Questions: I am trying to run a simple NaiveBayesClassifer using Hadoop and am getting this error:

Exception in thread "main" java.io.IOException: No FileSystem for scheme: file
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1375)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:180)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
    at org.apache.mahout.classifier.naivebayes.NaiveBayesModel.materialize(NaiveBayesModel.java:100)

Code:

    Configuration configuration = new Configuration();
    NaiveBayesModel model = NaiveBayesModel.materialize(new Path(modelPath), configuration); // error in…