1. joinby 命令:多对多的匹配
*输入数据 clear input group str3 x1 1 "A" 1 "B" 1 "C" 1 "D" end save file1.dta,replace clear input group str3 x2 1 "M" 1 "N" end save file2.dta,replace
*merge 多对多匹配
use file1.dta,clear merge m:m group using file2.dta list, clean noobs
*joinby 多对多匹配
use file1.dta,clear joinby group using file2.dta list, clean noobs
merge
命令多对多匹配结果:
group x1 x2 _merge 1 A M matched (3) 1 B N matched (3) 1 C N matched (3) 1 D N matched (3)
可以看出,merge
命令的多对多合并是有问题的,其会以较少数据文件的最后一行值 (比如这里的 file2.dta 的最后一行数据「group1,x2=N」) 进行重复合并。
joinby
命令多对多匹配结果:
group x1 x2 1 A N 1 A M 1 B M 1 B N 1 C N 1 C M 1 D N 1 D M
可以看出,joinby
命令显然更符合我们的要求。关于 joinby
命令更多详细介绍,请查看帮助文件 help joinby
。
2. nearmrg 命令:相似值的匹配
*生成一份数据
sysuse auto.dta, clear keep make price mpg keep if make == "Toyota Celica" | /// make == "BMW 320i" | /// make == "Cad. Seville" | /// make == "Pont. Grand Prix" | /// make == "Datsun 210" rename make make2 save "using.dta", replace list, clean noobs
列出数据:
make2 price mpg Cad. Seville 15,906 21 Pont. Grand Prix 5,222 19 BMW 320i 9,735 25 Datsun 210 4,589 35 Toyota Celica 5,899 18
然后,我们将该数据与 auto.dta 进行合并,并找出 using.dta 数据中价格浮动在 $50 上下的数据。
sysuse auto.dta, clear nearmrg using "using.dta", upper nearvar(price) genmatch(usingmatch) limit(50) keep make price mpg make2 _merge usingmatch list, clean noobs
make price mpg make2 _merge usingm~h Datsun 210 4,589 35 Datsun 210 matched (3) 4,589 Buick Regal 5,189 20 Pont. Grand Prix matched (3) 5,222 Pont. Grand Prix 5,222 19 Pont. Grand Prix matched (3) 5,222 Olds Cutl Supr 5,172 19 Pont. Grand Prix matched (3) 5,222 Dodge Magnum 5,886 16 Toyota Celica matched (3) 5,899 Toyota Celica 5,899 18 Toyota Celica matched (3) 5,899 BMW 320i 9,735 25 BMW 320i matched (3) 9,735 Audi 5000 9,690 17 BMW 320i matched (3) 9,735 Cad. Seville 15,906 21 Cad. Seville matched (3) 15,906
可以看出,using data 中原有 5 行数据,合并后变成了 9 行数据。之所以如此,是因为 auto.dta 中价格浮动在 50 之内的数据都被保留了下来。