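Crawler Notes: Cracking the Slider CAPTCHA on the w3cschool Registration Page (very simple slider position recognition, no simulated mouse trajectory)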

I. Background

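When I first got into CAPTCHA cracking, my first target was the CAPTCHA on w3cschool's recover-password-by-phone page; see "CAPTCHA Recognition: w3cschool Character Image CAPTCHA (easy level)". This time I'm cracking the slider CAPTCHA on their registration page. I'm a little nervous: I hope I won't get beaten up for picking on them like this...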

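Please note before reading: this article only covers locating the slider position in a slider CAPTCHA. It is mostly about image processing and does not cover simulating mouse trajectories.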

II. Analysis

First open the page https://www.w3cschool.cn/register and see what the slider CAPTCHA looks like:

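Slider CAPTCHAs of this kind generally send a request to the server each time I release the mouse after dragging, to check whether the drag succeeded. Open the developer tools with F12, fail one drag and succeed at one, and look at the network requests and their responses: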

Verification failed:

Request：

Response：

Verification succeeded:

Request：

Response：

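The request has to carry a point parameter. From the second request we can roughly infer that this value must equal data in the response for the check to pass. Experimenting shows that the value is the distance from the small dark block to the left edge: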

So all we need is to find, in the background image, the x coordinate of the leftmost column of the dark block.

Next, how do we download this background image? Press Ctrl+Shift+C and pick the image:

Take a look at its background-image:

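Anyone who has used CSS sprites will recognize this at once: many narrow strips are stitched into one large image, and each div on the page with the gt_cut_fullbg_slice class corresponds to one strip. Each strip is 13px*58px. Each div uses a CSS background offset to select its own strip, and the background-position value gives the coordinates (x, y) of that strip's top-left corner in the large image. So the next step is to parse the URL of the large image out of the page, download it, and reassemble it according to the CSS offsets.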
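To make the offset arithmetic concrete, here is a minimal sketch of pulling a strip's origin out of a style attribute. The class name SlicePosition and the sample style string are made up for illustration and are not taken from the page; the sketch assumes offsets are written in px with an optional minus sign, and it simply drops the sign to get the strip's position inside the large image.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only
class SlicePosition {

    // Matches e.g. "background-position:-157px -58px" and captures the absolute offsets
    private static final Pattern POSITION =
            Pattern.compile("background-position:\\s*-?(\\d+)px\\s+-?(\\d+)px");

    // Returns {x, y} of the strip's top-left corner in the large image, or null if nothing matches
    static int[] parse(String style) {
        Matcher m = POSITION.matcher(style);
        if (!m.find()) {
            return null;
        }
        return new int[]{Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2))};
    }

    public static void main(String[] args) {
        // sample value, assumed for the demo
        int[] p = parse("background-position:-157px -58px;");
        System.out.println(p[0] + "," + p[1]); // prints 157,58
    }
}
```

With the origin in hand, the strip is the 13px*58px rectangle starting at that point.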

Searching the HTML for .gt_cut_fullbg_slice locates the image:

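Extract the URL with a regular expression and download the image. Then parse the DOM, read the background-position of every div with the gt_cut_fullbg_slice class, cut the 13px*58px region at that position out of the large image, and write the pieces one after another into a new image. The reassembled result: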


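Now comes the harder part: how to locate the dark block with a reasonably high hit rate. I first tried several approaches, such as brightness, saturation and hue, but each misjudged some images and the results were not good enough. Then I noticed that although the background keeps changing, it seems to cycle through just a few images, with only the position of the dark block differing. Download a few more images and look for the pattern: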

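So I only need to restore the background images (without the dark block) and diff an image that has the block against the one that does not. That locates the block exactly, and in theory the accuracy can reach 100%.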

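This idea rests on the images that look identical really being identical. Some images look the same to the eye yet differ slightly in brightness or saturation, so to back it up I first ran an experiment: take two images that look the same, diff them, and see how the differing pixels are distributed. The diff result: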

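The third image is the diff of the first and second. White marks pixels that are the same and black marks pixels that differ. Only the dark block regions differ, which shows that the two were originally the same image with a dark block added at a random position.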

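The rest is fairly simple. First, recover the original image without the dark block from the images that have one. The approach is to start by downloading 1000 images: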

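Then take a few pixels at fixed positions in each image as its feature. Here I used the four corner pixels, which are unlikely to be covered by the dark block, and concatenated their hexadecimal values. Then group the 1000 images by this feature: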
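As a small illustration of the corner feature, the sketch below builds a tiny image in memory and prints its fingerprint. The class name CornerFingerprint and the sample colors are assumptions for the demo. One deliberate difference from the full listing: this version zero-pads each color to six hex digits with %06x, so that a dark corner such as 0x000a0b cannot blur the boundary between two values; the listing itself uses unpadded Integer.toString(..., 16).

```java
import java.awt.image.BufferedImage;

// Hypothetical demo class, not part of the original code
class CornerFingerprint {

    // Concatenates the RGB values of the four corner pixels as zero-padded hex
    static String of(BufferedImage img) {
        int w = img.getWidth();
        int h = img.getHeight();
        int[][] corners = {{0, 0}, {w - 1, 0}, {0, h - 1}, {w - 1, h - 1}};
        StringBuilder sb = new StringBuilder();
        for (int[] c : corners) {
            // mask off the alpha byte, keep only the 24-bit RGB part
            sb.append(String.format("%06x", img.getRGB(c[0], c[1]) & 0xFFFFFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 4x3 image with arbitrary sample colors in the corners
        BufferedImage img = new BufferedImage(4, 3, BufferedImage.TYPE_INT_RGB);
        img.setRGB(0, 0, 0xFF6161);
        img.setRGB(3, 0, 0xFF4B4B);
        img.setRGB(0, 2, 0xFC5252);
        img.setRGB(3, 2, 0xFC5858);
        System.out.println(of(img));
    }
}
```

Images that share a fingerprint string then go into the same group.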

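As expected, there are only four background images in rotation. Each group corresponds to a folder, and all images in a folder share the same background; only the position of the dark block differs: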

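Look at the images above. Suppose position (100,100) of the background is covered by the dark block in the first image. It is not necessarily covered in the second, and the chance that (100,100) is covered in every image is tiny. So just collect the pixel at (100,100) from all images and run select rgb from all_image_100_100_rgb_value group by rgb order by count(1) desc limit 1. Doing the same for every pixel yields the original image.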
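The SQL above is only a way of saying "take the most frequent value". A minimal Java sketch of that vote for a single pixel position might look like this; the class name PixelVote and the sample values are invented for the demo, and the method assumes at least one sample is passed in.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class, not part of the original code
class PixelVote {

    // Returns the most frequent value among the samples taken at one (x, y) position.
    // Assumes samples is non-empty.
    static int mostCommon(int[] samples) {
        Map<Integer, Integer> counts = new HashMap<>();
        int best = samples[0];
        int bestCount = 0;
        for (int rgb : samples) {
            // merge returns the updated count for this value
            int count = counts.merge(rgb, 1, Integer::sum);
            if (count > bestCount) {
                bestCount = count;
                best = rgb;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // five images sampled at the same position; one of them is covered by the dark block
        int[] samples = {0xFF6161, 0xFF6161, 0x202020, 0xFF6161, 0xFF6161};
        System.out.println(Integer.toHexString(mostCommon(samples))); // prints ff6161
    }
}
```

The vote only works if the block covers a given position in a minority of the images in the group, which is why a reasonably large sample per group helps.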

Processing every group this way gives the original image for each group:

Ha, it looks like magic, but the principle really is simple. Next, take an image with the dark block:


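Extract its feature (the hexadecimal RGB values of the four corners, concatenated) and look up the original image for that feature, i.e. the image without the dark block: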

ff6161ff4b4bfc5252fc5858

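Diff the two, scanning column by column from left to right, and take the index of the first column containing a pixel whose RGB value differs as the offset. That value is the point.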
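Here is a small worked example of that column scan on synthetic images, so it can be run without touching the site. The class name FirstDiffDemo, the image size and the colors are all assumptions made for the demo.

```java
import java.awt.image.BufferedImage;

// Hypothetical demo class, not part of the original code
class FirstDiffDemo {

    // Scans columns left to right and returns the first x where any pixel differs, or -1
    static int firstDiffColumn(BufferedImage a, BufferedImage b) {
        int w = Math.min(a.getWidth(), b.getWidth());
        int h = Math.min(a.getHeight(), b.getHeight());
        for (int x = 0; x < w; x++) {
            for (int y = 0; y < h; y++) {
                if (a.getRGB(x, y) != b.getRGB(x, y)) {
                    return x;
                }
            }
        }
        return -1;
    }

    // Creates an image filled with a single color
    static BufferedImage solid(int w, int h, int rgb) {
        BufferedImage img = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        for (int x = 0; x < w; x++) {
            for (int y = 0; y < h; y++) {
                img.setRGB(x, y, rgb);
            }
        }
        return img;
    }

    // Paints a rectangle onto img in place and returns it, imitating the dark block
    static BufferedImage withBlock(BufferedImage img, int bx, int by, int bw, int bh, int rgb) {
        for (int x = bx; x < bx + bw; x++) {
            for (int y = by; y < by + bh; y++) {
                img.setRGB(x, y, rgb);
            }
        }
        return img;
    }

    public static void main(String[] args) {
        BufferedImage original = solid(20, 10, 0xFF6161);
        BufferedImage captcha = withBlock(solid(20, 10, 0xFF6161), 7, 3, 5, 4, 0x202020);
        System.out.println(firstDiffColumn(captcha, original)); // prints 7
    }
}
```

A return value of -1 means the two images are identical within the compared area, so a caller should treat it as "no block found" and not submit it as the point.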

III. Implementation

The code implementing the analysis above:

package cc11001100.crawler.w3cschool;

import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONObject;
import org.apache.commons.io.FilenameUtils;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;

import static java.lang.Integer.parseInt;
import static java.util.Collections.emptyMap;
import static java.util.Comparator.comparingInt;
import static java.util.stream.Collectors.groupingBy;
import static java.util.stream.Collectors.toList;
import static jodd.io.FileNameUtil.concat;

/**
 * Cracks the simple slider CAPTCHA of w3cschool
 *
 * <a>https://www.w3cschool.cn/register</a>
 *
 * @author CC11001100
 */
public class W3cSchoolRegisterCaptcha {

	private static final Logger log = LoggerFactory.getLogger(W3cSchoolRegisterCaptcha.class);

	private static final Pattern extractBgImgUrlPattern = Pattern.compile("\\.gt_cut_fullbg_slice[\\s\\S]+?background-image: url\\(\"(.+)\"\\)");
	private static final Pattern cssStyleBackgroundPositionPattern = Pattern.compile("background-position:-?(\\d+)px -?(\\d+)px;");

	private Map<String, BufferedImage> fingerprintToPerfectBackgroundImageMap = new HashMap<>();

	// Loads the mapping from fingerprint to restored (block-free) background image
	public boolean load(String perfectImageDir) {
		File[] perfectFiles = new File(perfectImageDir).listFiles();
		if (perfectFiles == null) {
			log.error("load fingerprint mapping failed, dir {} empty", perfectImageDir);
			return false;
		}
		for (File file : perfectFiles) {
			try {
				BufferedImage img = ImageIO.read(file);
				String fingerprint = extractBackgroundImageFingerprint(img);
				fingerprintToPerfectBackgroundImageMap.put(fingerprint, img);
			} catch (IOException e) {
				log.error("IOException", e);
			}
		}
		// report success only if at least one mapping was loaded
		return !fingerprintToPerfectBackgroundImageMap.isEmpty();
	}

	// Marks the session currently held by HttpUtil as having passed the slider check
	public boolean touch() {
		String htmlContent = HttpUtil.downloadText("https://www.w3cschool.cn/register", emptyMap(), null, null);
		Document document = Jsoup.parse(htmlContent);
		String imgUrl = extractBgImgUrl(htmlContent);
		if (imgUrl == null) {
			throw new RuntimeException("cant find background image url");
		}
		imgUrl = "https://www.w3cschool.cn/" + imgUrl;
		BufferedImage garbledImage = HttpUtil.downloadImage(imgUrl);
		BufferedImage normalImage = splice(document, garbledImage);
		String fingerprint = extractBackgroundImageFingerprint(normalImage);
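		BufferedImage perfectImage = fingerprintToPerfectBackgroundImageMap.get(fingerprint);
		if (perfectImage == null) {
			throw new RuntimeException("no restored background image for fingerprint " + fingerprint);
		}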
		int firstDiffColumnIndex = firstDiffColumnIndex(normalImage, perfectImage);
		return tellServerIamOk(firstDiffColumnIndex);
	}

	private boolean tellServerIamOk(int offset) {
		Map<String, String> params = new HashMap<>();
		params.put("point", Integer.toString(offset));
		String responseContent = HttpUtil.downloadText("https://www.w3cschool.cn/dragcheck", emptyMap(),
				connection -> connection.method(Connection.Method.POST).data(params), null);
		if (responseContent == null) {
			throw new RuntimeException("drag check response null");
		}
		JSONObject o = JSON.parseObject(responseContent);
		if (o.getIntValue("statusCode") != 200) {
			throw new RuntimeException("drag check failed, response=" + responseContent);
		}
		int data = o.getIntValue("data");
		// logged so the scanned offset can be compared with the server value and tuned
		log.info("offset={}, data={}", offset, data);
		return offset == data;
	}

	// Extracts the CAPTCHA image URL from the HTML page
	private String extractBgImgUrl(String htmlContent) {
		Matcher matcher = extractBgImgUrlPattern.matcher(htmlContent);
		if (matcher.find()) {
			return matcher.group(1);
		}
		return null;
	}

	// Reassembles the scrambled background image into the correct order
	private BufferedImage splice(Document document, BufferedImage img) {
		Elements blockElts = document.select(".gt_cut_fullbg_slice");
		if (blockElts.isEmpty()) {
			throw new RuntimeException("cannot find captcha elements, ensure in register page.");
		}
		// the CAPTCHA background image is 260px*116px
		SpliceImage spliceImage = new SpliceImage(260, 116);
		blockElts.forEach(elt -> {
			String style = elt.attr("style");
			Matcher matcher = cssStyleBackgroundPositionPattern.matcher(style);
			if (matcher.find()) {
				int x = parseInt(matcher.group(1));
				int y = parseInt(matcher.group(2));
				// each tile that makes up the background is 13px*58px
				BufferedImage block = img.getSubimage(x, y, 13, 58);
				spliceImage.append(block);
			} else {
				log.info("style:{}, cannot extract background-position", style);
			}
		});
		return spliceImage.getBufferedImage();
	}

	// Scans left to right and returns the index of the first column that differs
	public int firstDiffColumnIndex(BufferedImage src, BufferedImage dest) {
		int w = src.getWidth();
		int h = src.getHeight();
		for (int x = 0; x < w; x++) {
			for (int y = 0; y < h; y++) {
				if (src.getRGB(x, y) != dest.getRGB(x, y)) {
					return x;
				}
			}
		}
		return -1;
	}

	public class SpliceImage {
		private BufferedImage bufferedImage;
		private int nextX;
		private int nextY;

		public SpliceImage(int width, int height) {
			bufferedImage = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
		}

		public void append(BufferedImage img) {
			bufferedImage.getGraphics().drawImage(img, nextX, nextY, null);
			nextX += img.getWidth();
			// new line
			if (nextX >= bufferedImage.getWidth()) {
				nextX = 0;
				nextY += img.getHeight();
			}
		}

		public BufferedImage getBufferedImage() {
			return bufferedImage;
		}

	}

	/*----------------------------------------- preprocessing code below ----------------------------------------------------*/

	// Downloads a number of background images to local disk
	public void downloadBackgroundImage(String saveBaseDir, int num) {
		log.info("download prepare");
		ExecutorService executorService = Executors.newFixedThreadPool(3);
		for (int i = 0; i < num; i++) {
			executorService.execute(() -> {
				long threadId = Thread.currentThread().getId();
				log.info("prepare " + threadId);
				String htmlContent = HttpUtil.downloadText("https://www.w3cschool.cn/register", emptyMap(), null, null);
				Document document = Jsoup.parse(htmlContent);
				String imgUrl = extractBgImgUrl(htmlContent);
				if (imgUrl == null) {
					throw new RuntimeException("cant find bg img url");
				}
				imgUrl = "https://www.w3cschool.cn/" + imgUrl;
				BufferedImage bgImg = HttpUtil.downloadImage(imgUrl);
				BufferedImage perfectImg = splice(document, bgImg);
				String filePath = FilenameUtils.concat(saveBaseDir, System.currentTimeMillis() + ".png");
				try {
					ImageIO.write(perfectImg, "png", new File(filePath));
				} catch (IOException e) {
					log.error("download background image failed", e);
				}
				log.info("end " + threadId);
			});
		}
		executorService.shutdown();
		try {
			executorService.awaitTermination(10, TimeUnit.DAYS);
		} catch (InterruptedException e) {
			log.error("InterruptedException", e);
		}
		log.info("download done.");
	}

	// Diffs two background images, to check whether the restore-then-compare approach is feasible
	public void diff(String srcImagePath, String destImagePath, String resultSavePath) throws IOException {
		BufferedImage srcImage = ImageIO.read(new FileInputStream(srcImagePath));
		BufferedImage destImage = ImageIO.read(new FileInputStream(destImagePath));
		int w = srcImage.getWidth();
		int h = srcImage.getHeight();

		BufferedImage resultImage = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
		for (int x = 0; x < w; x++) {
			for (int y = 0; y < h; y++) {
				// identical pixels become white, differing pixels become black
				if (srcImage.getRGB(x, y) == destImage.getRGB(x, y)) {
					resultImage.setRGB(x, y, 0X00FFFFFF);
				} else {
					resultImage.setRGB(x, y, 0X00000000);
				}
			}
		}
		ImageIO.write(resultImage, "png", new File(resultSavePath));
	}

	// Groups images by fingerprint
	public void groupBy(String srcImageDir, String groupByResultDir) {
		File[] imageFiles = new File(srcImageDir).listFiles();
		if (imageFiles == null) {
			log.error("no image file in " + srcImageDir);
			return;
		}
		Stream.of(imageFiles).map(f -> {
			try {
				return ImageIO.read(f);
			} catch (IOException e) {
				log.error("IOException", e);
			}
			return null;
		}).filter(Objects::nonNull)
				// group the images by fingerprint
				.collect(groupingBy(this::extractBackgroundImageFingerprint))
				.forEach((key, value) -> {
					String basePath = concat(groupByResultDir, key);
					File basePathDirFile = new File(basePath);
					if (basePathDirFile.exists()) {
						basePathDirFile.delete();
					}
					basePathDirFile.mkdirs();
					value.forEach(img -> {
						String imgPath = concat(basePath, System.currentTimeMillis() + ".png");
						try {
							ImageIO.write(img, "png", new File(imgPath));
						} catch (IOException e) {
							log.error("IOException", e);
						}
					});
				});
	}

	// Extracts the background fingerprint, using the colors of a few points as the feature
	private String extractBackgroundImageFingerprint(BufferedImage img) {
		int w = img.getWidth();
		int h = img.getHeight();
		// for now use the four corner pixels as the feature and see how well it works
		int[][] points = {
				{0, 0},
				{w - 1, 0},
				{0, h - 1},
				{w - 1, h - 1}
		};
		StringBuilder sb = new StringBuilder();
		for (int[] point : points) {
			sb.append(Integer.toString(img.getRGB(point[0], point[1]) & 0X00FFFFFF, 16));
		}
		return sb.toString();
	}

	// Merges the images in each group into one original image with no missing block
	public void merge(String groupByDir, String mergeResultSaveDir) throws IOException {
		File[] groups = new File(groupByDir).listFiles();
		if (groups == null) {
			log.error(groupByDir + " empty");
			return;
		}
		for (File group : groups) {
			mergeSingleGroup(group, mergeResultSaveDir);
		}
	}

	private void mergeSingleGroup(File group, String mergeResultSaveDir) {
		String[] imgs = group.list();
		if (imgs == null) {
			log.warn("group {} empty", group.getName());
			return;
		}
		List<BufferedImage> imgList = Stream.of(imgs).limit(100).map(imgPath -> {
			try {
				return ImageIO.read(new File(concat(group.getPath(), imgPath)));
			} catch (IOException e) {
				log.error("IOException", e);
			}
			return null;
		}).filter(Objects::nonNull)
				.collect(toList());
		int w = imgList.get(0).getWidth();
		int h = imgList.get(0).getHeight();
		ImageMerge imageMerge = new ImageMerge(w, h);
		for (int x = 0; x < w; x++) {
			for (int y = 0; y < h; y++) {
				imageMerge.prepare(x, y);
				for (BufferedImage img : imgList) {
					imageMerge.vote(img.getRGB(x, y));
				}
				imageMerge.declareTheResult();
			}
		}
		File mergeResultSaveDirFile = new File(mergeResultSaveDir);
		if (mergeResultSaveDirFile.exists()) {
			mergeResultSaveDirFile.delete();
		}
		mergeResultSaveDirFile.mkdirs();

		String path = concat(mergeResultSaveDir, group.getName() + ".png");
		try {
			ImageIO.write(imageMerge.getBufferedImage(), "png", new File(path));
		} catch (IOException e) {
			log.error("IOException", e);
		}
	}

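	// Merges several images from the same group, each with a missing block, into one complete original image
	// Per-pixel election: each image submits its RGB value at (x,y) as a ballot, and the most frequent value wins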
	public class ImageMerge {
		private BufferedImage bufferedImage;
		private int x;
		private int y;
		private Map<Integer, Integer> voteCountMap = new HashMap<>();

		public ImageMerge(int width, int height) {
			this.bufferedImage = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
		}

		public void prepare(int x, int y) {
			this.x = x;
			this.y = y;
			this.voteCountMap.clear();
		}

		public void vote(int voteValue) {
			int count = voteCountMap.getOrDefault(voteValue, 0);
			voteCountMap.put(voteValue, count + 1);
		}

		public void declareTheResult() {
			int theWinnerVoteValue = voteCountMap.entrySet().stream().max(comparingInt(Map.Entry::getValue)).get().getKey();
			bufferedImage.setRGB(x, y, theWinnerVoteValue);
		}

		public BufferedImage getBufferedImage() {
			return this.bufferedImage;
		}

	}

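	// To avoid creating lots of junk accounts for them (the product manager would be thrilled to see the user count shoot up, haha), the registration API is not called here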
	private void test() {
		int totalTimes = 100;
		int successTimes = 0;
		for (int i = 0; i < totalTimes; i++) {
			if (touch()) {
				successTimes++;
			}
			HttpUtil.clearCookie();
		}
		System.out.println("success rate " + (100.0 * successTimes / totalTimes) + "%");
	}

	public static void main(String[] args) throws IOException {

		W3cSchoolRegisterCaptcha captchaBackgroundImage = new W3cSchoolRegisterCaptcha();

		// first download some background images and look for a pattern
//		captchaBackgroundImage.downloadBackgroundImage("data/w3c/raw", 1000);

		// check whether everything except the slider block is identical
//		captchaBackgroundImage.diff("data/w3c/diff/1545119857099.png", "data/w3c/diff/1545119857138.png", "data/w3c/diff/diff-result.png");

		// group the downloaded images, putting identical backgrounds into one group
//		captchaBackgroundImage.groupBy("data/w3c/raw", "data/w3c/groupBy");

		// each group holds images with a missing block; use them to synthesize the original image without one
//		captchaBackgroundImage.merge("data/w3c/groupBy", "data/w3c/merge");

		// see how well it works
		captchaBackgroundImage.load("data/w3c/merge");
		captchaBackgroundImage.test();

	}

}

The HttpUtil utility class it uses:

package cc11001100.crawler.w3cschool;


import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/**
 * @author CC11001100
 */
public class HttpUtil {

	private static final Logger log = LoggerFactory.getLogger(HttpUtil.class);

	// keeps cookies so the session persists across requests
	private static Map<String, String> cookieMap = new HashMap<>();

	public static byte[] downloadBytes(String url, Map<String, String> params, ConnectionSetting connectionSetting, ResponseCheck responseCheck) {
		for (int i = 1; i <= 5; i++) {
			long start = System.currentTimeMillis();
			try {
				Connection connection = Jsoup.connect(url)
						.ignoreContentType(true)
						.ignoreHttpErrors(true)
						.userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36")
						.data(params)
						.cookies(cookieMap);
				if (connectionSetting != null) {
					connectionSetting.setting(connection);
				}
				Connection.Response response = connection.execute();
				byte[] responseBody = response.bodyAsBytes();
				if (responseCheck != null && !responseCheck.check(response, responseBody)) {
					throw new RuntimeException();
				}
				cookieMap.putAll(response.cookies());
				long cost = System.currentTimeMillis() - start;
				log.info("request ok, tryTimes={}, url={}, cost={}", i, url, cost);
				return responseBody;
			} catch (IOException e) {
				long cost = System.currentTimeMillis() - start;
				log.info("request failed, tryTimes={}, url={}, cost={}", i, url, cost);
			}
		}
		return null;
	}

	public static String downloadText(String url, Map<String, String> params, ConnectionSetting connectionSetting, ResponseCheck responseCheck) {
		byte[] responseContent = downloadBytes(url, params, connectionSetting, responseCheck);
		if (responseContent == null) {
			return null;
		}
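		// decode explicitly as UTF-8 (assumed page encoding) instead of relying on the platform default charset
		return new String(responseContent, java.nio.charset.StandardCharsets.UTF_8);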
	}

	public static BufferedImage downloadImage(String url) {
		byte[] imgBytes = downloadBytes(url, Collections.emptyMap(), null, null);
		if (imgBytes == null) {
			return null;
		}
		try {
			return ImageIO.read(new ByteArrayInputStream(imgBytes));
		} catch (IOException e) {
			log.error("download image error, img url=" + url, e);
		}
		return null;
	}

	public static void clearCookie(){
		cookieMap.clear();
	}

	@FunctionalInterface
	public static interface ResponseCheck {
		boolean check(Connection.Response response, byte[] responseBody);
	}

	@FunctionalInterface
	public static interface ConnectionSetting {
		void setting(Connection connection);
	}

}

Run it and test how well the recognition works:

As expected, the recognition rate reaches 100%.

Original article: https://www.cnblogs.com/cc11001100/p/10140949.html