1. Principle
Applies when one of the tables in the join is small enough to fit in memory.
The small table can be distributed to every map node; each map task then joins the large-table records it reads against its local in-memory copy and emits the final result directly, so the join itself needs no shuffle or reduce phase. This greatly increases the parallelism of the join and speeds up processing.
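To make the idea concrete, here is a minimal sketch of the same broadcast-hash-join pattern in plain Java, independent of Hadoop: load the small table into a HashMap once, then stream the large table and probe the map for each record. The file names small.csv / big.csv and the column layouts are hypothetical.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class LocalHashJoinSketch {
    public static void main(String[] args) throws IOException {
        // Build an in-memory index of the small table (hypothetical layout: key,colA,colB)
        Map<String, String> small = new HashMap<>();
        for (String line : Files.readAllLines(Paths.get("small.csv"))) {
            String[] f = line.split(",");
            small.put(f[0], f[1] + " " + f[2]);
        }
        // Stream the big table and join each record against the in-memory index
        // (hypothetical layout: id,key,amount; the join key is column 1)
        Files.lines(Paths.get("big.csv")).forEach(line -> {
            String[] f = line.split(",");
            String matched = small.get(f[1]);
            if (matched != null) {
                System.out.println(f[1] + "\t" + matched + " " + f[0] + " " + f[2]);
            }
        });
    }
}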
2. Implementation example
–First load the small table inside the Mapper class ahead of time, then perform the join there
–Solution for real scenarios: load the lookup data once (for example from a database), or distribute it with the DistributedCache, as the sample layout and code below show
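For illustration, assume the following hypothetical CSV layouts (the field names are made up, but they match what the Mapper in step 1 expects: the small table carries the join key in column 0, the large table carries it in column 2):

small table (the cached file), columns pid,pname,price,category:
p0001,apple,3.5,fruit

large table (the job input), columns oid,date,pid,amount:
o1001,20240101,p0001,2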
Step 1: Define the map-join Mapper
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.util.HashMap;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class Map extends Mapper<LongWritable, Text, Text, Text> {

    // In-memory copy of the small table: join key -> remaining columns
    private HashMap<String, String> map = new HashMap<>();
    private String line = "";

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Read the cached small-table file (registered in the Driver) and load it into the HashMap
        URI[] cacheFiles = DistributedCache.getCacheFiles(context.getConfiguration());
        FileSystem fileSystem = FileSystem.get(cacheFiles[0], context.getConfiguration());
        FSDataInputStream open = fileSystem.open(new Path(cacheFiles[0]));
        BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(open));
        while ((line = bufferedReader.readLine()) != null) {
            String[] split = line.split(",");
            // column 0 is the join key; keep the remaining columns as the value
            map.put(split[0], split[1] + " " + split[2] + " " + split[3]);
        }
        bufferedReader.close();
        // Do not close fileSystem here: FileSystem instances are cached and shared,
        // and closing the shared instance can break reading of the input split.
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Each large-table record carries the join key in column 2
        String[] split = value.toString().split(",");
        String s = map.get(split[2]);
        if (s == null) {
            return; // no matching small-table record: inner join, skip the row
        }
        context.write(new Text(split[2]), new Text(s + " " + split[0] + " " + split[1] + " " + split[3]));
    }
}
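With the hypothetical sample rows above, setup() caches the entry p0001 -> "apple 3.5 fruit", and map() emits one record per matching large-table line:

p0001	apple 3.5 fruit o1001 20240101 2

(key = the join key taken from column 2 of the large-table row; value = the cached small-table columns followed by the remaining large-table columns).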
Step 2: Define the Driver main method that runs the job
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Driver {
    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        // Note: the cache file must live on HDFS so that every map task can fetch it
        DistributedCache.addCacheFile(new URI("hdfs://<namenode-host>:8020/<path-to-small-table>"), configuration);

        Job job = Job.getInstance(configuration, "<job name>");
        job.setJarByClass(Driver.class);
        job.setMapperClass(Map.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        // Map-side join: no reduce phase is needed
        job.setNumReduceTasks(0);

        FileInputFormat.addInputPath(job, new Path("<input path of the large table>"));
        FileOutputFormat.setOutputPath(job, new Path("<output path>"));

        boolean b = job.waitForCompletion(true);
        System.exit(b ? 0 : 1);
    }
}
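Note that the DistributedCache class used above is deprecated in Hadoop 2.x and later; the same wiring can be done with the Job and Context methods that replaced it. A minimal sketch of only the calls that change (the path is a placeholder):

// In the Driver, after Job.getInstance(...), instead of DistributedCache.addCacheFile(...):
job.addCacheFile(new URI("hdfs://<namenode-host>:8020/<path-to-small-table>"));

// In the Mapper's setup(), instead of DistributedCache.getCacheFiles(...):
URI[] cacheFiles = context.getCacheFiles();
FileSystem fileSystem = FileSystem.get(cacheFiles[0], context.getConfiguration());
FSDataInputStream open = fileSystem.open(new Path(cacheFiles[0]));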