1.
现有某电商关于商品点击情况的数据文件,表名为goods_click,包含两个字段(商品分类,商品点击次数),分隔符“ ”,由于数据很大,所以为了方便统计我们只截取它的一部分数据,内容如下
52127 5 52120 93 52092 93 52132 38 52006 462 52109 28 52109 43 52132 0 52132 34 52132 9 52132 30 52132 45 52132 24 52009 2615 52132 25 52090 13 52132 6 52136 0 52090 10 52024 347
要求使用mapreduce统计出每类商品的平均点击次数。
源代码:
package mapreduce; import java.io.IOException; import java.util.StringTokenizer; import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.fs.Path; import org.apache.hadoop.io.IntWritable; import org.apache.hadoop.io.NullWritable; import org.apache.hadoop.io.Text; import org.apache.hadoop.mapreduce.Job; import org.apache.hadoop.mapreduce.Mapper; import org.apache.hadoop.mapreduce.Reducer; import org.apache.hadoop.mapreduce.Reducer.Context; import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; import org.apache.hadoop.mapreduce.lib.input.TextInputFormat; import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat; import mapreduce.WordCount.MyMapper; import mapreduce.WordCount.MyReducer; public class MyAverage { public static class Map extends Mapper<Object, Text, Text, IntWritable> { private static Text newKey = new Text(); public void map(Object key, Text value, Context context) throws IOException, InterruptedException { StringTokenizer itr = new StringTokenizer(value.toString()); while (itr.hasMoreTokens()) { String line = itr.nextToken(); String arr[]=line.split(" "); newKey.set(arr[0]); int click=Integer.parseInt((arr[1].trim())); context.write(newKey, new IntWritable(click)); } } } public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> { public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException { int num=0; int count=0; for(IntWritable val:values) { num+=val.get(); count++; } int avg=num/count; context.write(key, new IntWritable(avg)); } } public static void main(String[] args) throws Exception { Configuration conf = new Configuration(); System.out.println("start"); Job job = new Job(conf, "MyAverage"); job.setJarByClass(MyAverage.class); job.setMapperClass(Map.class); job.setReducerClass(Reduce.class); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); job.setInputFormatClass(TextInputFormat.class); job.setOutputFormatClass(TextOutputFormat.class); Path in = new Path("hdfs://localhost:9000/mymapreduce3/in/goods_click"); Path out = new Path("hdfs://localhost:9000/mymapreduce3/out"); FileInputFormat.addInputPath(job, in); FileOutputFormat.setOutputPath(job, out); System.exit(job.waitForCompletion(true) ? 0 : 1); } }
统计数据:
52006 462 52009 2615 52024 347 52090 11 52092 93 52109 35 52120 93 52127 5 52132 23 52136 0
遇到的问题:
1.原本给到的数据中第一列是“商家ID 点击次数”,但是在程序中无法将点击次数从“字符串”转换成“int"型。后来在元数据中去掉了这一行。
2.无法将”点击次数“的数据从String型转化成int型。刚开始发现获得的”点击次数“数据周围包含空格,然后用String.trim()去空格但是不管用。然后源数据中去掉空格。
猜想应该是从数据库导出时数据库中就保存的数据加空格吧。