Spark SQL supports schema merging for Parquet files.
Let's go straight to the official example code.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// sqlContext from the previous example is used in this example.
// This is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._

// Create a simple DataFrame, stored into a partition directory
val df1 = sc.makeRDD(1 to 5).map(i => (i, i * 2)).toDF("single", "double")
df1.saveAsParquetFile("data/test_table/key=1")

// Create another DataFrame in a new partition directory,
// adding a new column and dropping an existing column
val df2 = sc.makeRDD(6 to 10).map(i => (i, i * 3)).toDF("single", "triple")
df2.saveAsParquetFile("data/test_table/key=2")

// Read the partitioned table
val df3 = sqlContext.parquetFile("data/test_table")
df3.printSchema()

// The final schema consists of all 3 columns in the Parquet files together
// with the partitioning column that appears in the partition directory paths.
// root
// |-- single: int (nullable = true)
// |-- double: int (nullable = true)
// |-- triple: int (nullable = true)
// |-- key: int (nullable = true)
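A side note: saveAsParquetFile and parquetFile are the Spark 1.3-era APIs. From Spark 1.4/1.5 onward they were deprecated in favor of the DataFrameReader/Writer interface, and schema merging has to be requested explicitly. A rough equivalent of the same example under that assumption (same paths and column names as above) would look like this:

// Sketch only, assuming Spark 1.5+ with the DataFrameReader/Writer API
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Write each DataFrame into its own partition directory
val df1 = sc.makeRDD(1 to 5).map(i => (i, i * 2)).toDF("single", "double")
df1.write.parquet("data/test_table/key=1")

val df2 = sc.makeRDD(6 to 10).map(i => (i, i * 3)).toDF("single", "triple")
df2.write.parquet("data/test_table/key=2")

// Schema merging is off by default since Spark 1.5, so turn it on for this read
val df3 = sqlContext.read.option("mergeSchema", "true").parquet("data/test_table")
df3.printSchema()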
In other words, both df1 and df2 are saved under the data/test_table directory.
df1 contributes the columns single and double, plus the partition column key (from the directory path key=1).
df2 contributes the columns single and triple, plus the partition column key (from key=2).
When df3 reads test_table directly, the columns of df1 and df2 are merged, so df3 ends up with the columns single, double, triple, and key.
Calling df3.show() then gives the following result:
single double triple key
3      6      null   1
4      8      null   1
5      10     null   1
1      2      null   1
2      4      null   1
8      null   24     2
9      null   27     2
10     null   30     2
6      null   18     2
7      null   21     2
As you can see, this is df1 and df2 combined into a single result (no join needed); where a column is missing from one side, it simply shows up as null.
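As a quick check, the merged DataFrame can be queried like an ordinary table, and the partition column key behaves like any other column. A small sketch (the temp-table name test_merged is just an arbitrary name for this example):

// Register the merged result and query it with SQL; no join between df1 and df2 is needed
df3.registerTempTable("test_merged")
sqlContext.sql("SELECT single, triple FROM test_merged WHERE key = 2").show()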