[HttpClient+Jsoup+Selenium] Scraping Watermark-Free Images from Xiaohongshu

Author: 唐宋丶
Published: September 28, 2021
Categories: Development Projects

<div class="tip share">Please note: this article was written 1365 days ago and last modified 1038 days ago, so some of its information may be outdated.</div>

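Update, 2022-08-22 14:14:13: a while ago Xiaohongshu changed what it sends to the client. The original image is no longer delivered, and I can't find a pattern in the data that is delivered. If I get the chance, I'll try to recover it through JS reverse engineering.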

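## Preface

Images saved from Xiaohongshu come with a watermark, and opening them in a browser shows the watermarked version too, which makes it hard to build a decent collection of pretty-girl photos (just kidding).
On top of that, the various mini-programs that do this are stingy and only give you two or three free uses, so I decided to build my own.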

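## Approach

Xiaohongshu applies its watermark at upload time: when a user uploads an image, both the original and a watermarked copy are stored on the cloud image host. By default, the watermarked copy is the one that gets served.
Fortunately, the msgIds of both copies can be scraped from `window.__INITIAL_SSR_STATE__` (the code below reads them from the `traceId` field), so all we need is the original image's msgId: swap it in and we get the full path to the original Xiaohongshu image directly.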
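
As a rough illustration of the idea above, here is a minimal Python sketch that pulls the IDs out of the `window.__INITIAL_SSR_STATE__` script and builds image URLs from them. The key path (`NoteView` → `noteInfo` → `imageList` → `traceId`) and the `http://ci.xiaohongshu.com/` host are taken from the Java code later in this post. The function name and the assumption that the payload is strict JSON are mine. The real page may differ, and per the update note at the top it no longer delivers the original image.

```python
import json
import re
from typing import List

STATE_PREFIX = "window.__INITIAL_SSR_STATE__="
IMAGE_HOST = "http://ci.xiaohongshu.com/"


def extract_image_urls(html: str) -> List[str]:
    """Return image URLs built from the traceIds found in the SSR state.

    Hypothetical helper: assumes the state object is valid JSON.
    """
    # Locate the <script> body that assigns the SSR state object.
    match = re.search(
        r"<script[^>]*>\s*" + re.escape(STATE_PREFIX) + r"(.*?)</script>",
        html,
        flags=re.DOTALL,
    )
    if match is None:
        return []
    state = json.loads(match.group(1).strip().rstrip(";"))
    # Walk the same key path the Java implementation uses.
    images = state.get("NoteView", {}).get("noteInfo", {}).get("imageList", [])
    return [IMAGE_HOST + image["traceId"] for image in images if image.get("traceId")]
```

A page without the state script simply yields an empty list, which mirrors how the Java service returns an empty result when parsing fails.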

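In the first version, the hard-coded `cookie` is valid for 7 days, and because the cookie lives in the `Request Header`, it can't be obtained directly through Jsoup. So `selenium` is used to simulate a Chrome browser visit and grab the cookie.
When the cookie expires, the latest cookie is fetched automatically from the given URL and stored. (Because the cookie is only refreshed once every 7 days, I haven't dealt with the Shumei (数美) slider CAPTCHA that is triggered by fetching cookies too often. The same applies to scraping too much in one go. I haven't measured the exact threshold yet.)
**I recommend `Python + Selenium` rather than Java! There are far too many pitfalls (covered below)!**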
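
The 7-day lifetime mentioned above could also be handled proactively rather than waiting for a request to fail. The sketch below is not what the original code does (it refreshes only after a failure). It is a small time-based cache, where `CookieCache`, `fetch_cookie`, and `clock` are hypothetical names introduced for illustration.

```python
import time
from typing import Callable, Optional

SEVEN_DAYS_SECONDS = 7 * 24 * 60 * 60


class CookieCache:
    """Caches a cookie string and refetches it once it is older than the TTL."""

    def __init__(
        self,
        fetch_cookie: Callable[[], str],
        ttl_seconds: float = SEVEN_DAYS_SECONDS,
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._fetch_cookie = fetch_cookie  # e.g. a wrapper around the Selenium script
        self._ttl_seconds = ttl_seconds
        self._clock = clock  # injectable so the expiry logic can be tested
        self._cookie: Optional[str] = None
        self._fetched_at = 0.0

    def get(self) -> str:
        """Return the cached cookie, refreshing it first if it has expired."""
        now = self._clock()
        expired = self._cookie is None or now - self._fetched_at >= self._ttl_seconds
        if expired:
            self._cookie = self._fetch_cookie()
            self._fetched_at = now
        return self._cookie
```

Refreshing on a timer keeps the number of Selenium launches low, which matters given the slider CAPTCHA mentioned above, but it cannot detect a cookie that was invalidated early.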

### **`ImageController`**

Parse the text produced by **“Copy link”** in the Xiaohongshu app (strip the text and emoji) to get the short link.

```java
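import cn.hutool.core.util.StrUtil;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Controller;
import org.springframework.ui.Model;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestParam;

import java.util.List;

/**
 * Description: parses Xiaohongshu images and removes their watermarks
 *
 * @author CaiTianXin
 * @date 2021/9/9 15:55
 */
@Controller
@RequestMapping(value = "/api", params = "service=image")
public class ImageController {

    /** Literal text the app appends after the short link; it must match the share text exactly. */
    private static final String SHARE_SUFFIX = "，复制本条信息，打开【小红书】App查看精彩内容！";

    @Autowired
    private ImageService imageService;

    /**
     * Resolves the watermark-free images behind a URL
     *
     * @param url the share text (complete), e.g.
     *            44奈奈无鱼发布了一篇小红书笔记，快来看吧！[laughing]WsaHfdrcQ6mC8Ir[laughing]http://xhslink.com/jbhS5d，复制本条信息，打开【小红书】App查看精彩内容！
     * @return {@link String} the name of the view to render
     */
    @RequestMapping(params = "method=getImageList")
    public String getImageList(@RequestParam("url") String url, Model model) {
        String shareUrl;
        List<String> imageList;
        if (StrUtil.startWith(url, "http")) {
            // Already a concrete link
            shareUrl = url;
        } else {
            try {
                // Share text copied on desktop
                shareUrl = StrUtil.removeSuffix(url.split("  ")[2], SHARE_SUFFIX);
            } catch (Exception e) {
                // Handle the emoji variants
                try {
                    // Share text copied on Android
                    shareUrl = StrUtil.removeSuffix(url.split("\\[laughing\\]")[2], SHARE_SUFFIX);
                } catch (Exception ex) {
                    // Share text copied on iOS
                    shareUrl = StrUtil.removeSuffix(url.split("\uD83D\uDE06 ")[2], SHARE_SUFFIX);
                }
            }
        }
        imageList = imageService.getImageList(shareUrl);
        if (imageList.isEmpty()) {
            // Nothing parsed (for example, the cookie was just refreshed): show the "update" view
            return "update";
        }
        model.addAttribute("imageList", imageList);
        return "list";
    }
}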
```
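
The controller above tries three split patterns in turn, one per platform. As an alternative (a different technique, not the author's), a single regular expression can pull the link out of any of the share formats. The sketch below is a minimal Python version. `extract_share_url` and `URL_PATTERN` are hypothetical names, and the pattern assumes the link contains only ASCII URL characters.

```python
import re
from typing import Optional

# Matches an http(s) URL made of ASCII URL characters. It stops at the first
# character outside that set, such as the full-width comma in the share text.
URL_PATTERN = re.compile(r"https?://[A-Za-z0-9._~:/?#@!$&'*+;=%-]+")


def extract_share_url(share_text: str) -> Optional[str]:
    """Return the first URL found in the share text, or None if there is none."""
    match = URL_PATTERN.search(share_text)
    return match.group(0) if match else None
```

Because the pattern does not depend on `[laughing]` markers or a particular emoji, it would not need updating if the app changed the decoration around the link, as long as the link is still followed by a non-URL character.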

### **`ImageServiceImpl`**

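1. Follow the short link's redirect to obtain the concrete realUrl.
2. Using realUrl, assemble the request headers (including the cookie) and fetch the page with HttpClient.
3. Use jsoup to turn the response into a Document, parse the `<script></script>` tag, and build the watermark-free image URLs.
4. If the cookie has expired, run the Python script from the system command line, read the cookie from its output stream, and update the stored cookie.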
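
Steps 2 to 4 above can be sketched as a small "fetch, and on failure refresh the cookie and try once more" function. Note one difference from the Java code below: the original refreshes the cookie and returns an empty list, so the user has to submit the link again, whereas this sketch retries once. `CookieExpired`, `fetch_images`, `fetch`, and `refresh_cookie` are all hypothetical names, and the two callables stand in for the HttpClient request and the Selenium script.

```python
from typing import Callable, List, Tuple


class CookieExpired(Exception):
    """Raised by the fetcher when the page cannot be parsed with the current cookie."""


def fetch_images(
    real_url: str,
    cookie: str,
    fetch: Callable[[str, str], List[str]],
    refresh_cookie: Callable[[str], str],
) -> Tuple[List[str], str]:
    """Return (image URLs, cookie that worked), refreshing the cookie at most once."""
    try:
        return fetch(real_url, cookie), cookie
    except CookieExpired:
        # Step 4: obtain a fresh cookie, then repeat steps 2 and 3 once.
        fresh_cookie = refresh_cookie(real_url)
        return fetch(real_url, fresh_cookie), fresh_cookie
```

Returning the working cookie alongside the result lets the caller store it for later requests. A second failure propagates to the caller instead of looping, which avoids hammering the site and triggering the slider CAPTCHA.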

```java
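import cn.hutool.core.util.StrUtil;
import cn.hutool.json.JSONUtil;
import lombok.extern.log4j.Log4j2;
import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.protocol.HttpClientContext;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.client.RedirectLocations;
import org.apache.http.protocol.BasicHttpContext;
import org.apache.http.protocol.HttpContext;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.springframework.stereotype.Service;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

@Service
@Log4j2
public class ImageServiceImpl implements ImageService {

    // Hard-coded initial cookie; replaced at runtime once it expires
    private String cookie = "xhsTracker=url=noteDetail&xhsshare=CopyLink; xhsTrackerId=096eb980-05dd-4302-c333-b7b47527ba31; extra_exp_ids=gif_exp1,ques_clt2; timestamp2=202109151bf286650fd065d7ba66395f; timestamp2.sig=6xpJcjGtCWzhM-jZ0lCZTNIv7ndcYzoC-7n0YdCvYwM";

    /**
     * Resolves the watermark-free images behind a URL
     *
     * @param shareUrl share URL / web page URL
     * @return {@link List<String>}
     */
    @Override
    public List<String> getImageList(String shareUrl) {
        String realUrl = this.getRealUrl(shareUrl);
        List<String> result = new ArrayList<>(16);

        HttpContext httpContext = new BasicHttpContext();
        CloseableHttpClient httpClient = HttpClients.createDefault();

        HttpGet httpGet = new HttpGet(realUrl);
        httpGet.setHeader("accept", " text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8");
        httpGet.setHeader("accept-encoding", " gzip, deflate, br");
        httpGet.setHeader("accept-language", " zh-CN,zh;q=0.9");
        httpGet.setHeader("cookie", cookie);
        httpGet.setHeader("upgrade-insecure-requests", "1");
        httpGet.setHeader("user-agent", " Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3876.400 QQBrowser/10.8.4503.400");

        HttpEntity entity;
        CloseableHttpResponse response = null;
        try {
            // Send the HTTP request
            response = httpClient.execute(httpGet, httpContext);
            if (response.getStatusLine().getStatusCode() == 200) {
                // Read the response body
                entity = response.getEntity();
                // Convert the response body to HTML source
                String html = EntityUtils.toString(entity, "UTF-8");
                Document doc = Jsoup.parse(html);

                // Parse the last <script> tag, which holds the SSR state
                Element script = doc.getElementsByTag("script").last();
                String value = StrUtil.removePrefix(script.html(), "window.__INITIAL_SSR_STATE__=");
                Map<String, Object> map = JSONUtil.toBean(value, Map.class, true);
                Map<String, Object> noteView = (Map<String, Object>) map.get("NoteView");
                Map<String, Object> noteInfo = (Map<String, Object>) noteView.get("noteInfo");
                List<Map<String, String>> imageList = (List<Map<String, String>>) noteInfo.get("imageList");
                imageList.forEach(item -> result.add("http://ci.xiaohongshu.com/" + item.get("traceId")));
            }

        } catch (Exception e) {
            // Any failure here is treated as an expired cookie
            this.updateCookie(this.getCookie(realUrl));
            log.info("Cookie expired; refreshed automatically.");
        } finally {
            try {
                if (response != null) {
                    response.close();
                }
                if (httpClient != null) {
                    httpClient.close();
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        return result;
    }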

    /**
     * Gets the real address after following the redirect
     *
     * @param shareUrl the URL from the share link
     * @return {@link String}
     */
    private String getRealUrl(String shareUrl) {
        CloseableHttpClient httpClient = HttpClients.createDefault();
        HttpGet httpGet = new HttpGet(shareUrl);
        HttpContext httpContext = new BasicHttpContext();
        CloseableHttpResponse response = null;
        RedirectLocations redirectLocations = null;
        try {
            response = httpClient.execute(httpGet, httpContext);
            // Read the URIs that the request was redirected to
            redirectLocations = (RedirectLocations) httpContext.getAttribute(HttpClientContext.REDIRECT_LOCATIONS);
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                if (response != null) {
                    response.close();
                }
                if (httpClient != null) {
                    httpClient.close();
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        // No redirect was recorded (or the request failed): fall back to the original URL
        if (redirectLocations == null) {
            return shareUrl;
        }
        return redirectLocations.get(0).toString();
    }

    @Override
    public void updateCookie(String cookie) {
        this.cookie = cookie;
    }

    /**
     * Runs the Python script and returns the cookie string it prints
     */
    private String getCookie(String realUrl) {
        if (StrUtil.isBlank(realUrl)) {
            realUrl = "https://www.xiaohongshu.com/discovery/item/6151406b00000000210395ce";
        }
        String execString = "python /xxx/xhs-cookie.py " + realUrl;
        StringBuilder success = new StringBuilder();
        StringBuilder error = new StringBuilder();
        try {
            Process pop = Runtime.getRuntime()
                    .exec(execString);

            // Read the process's standard output
            InputStream inputStream = pop.getInputStream();
            InputStreamReader inputStreamReader = new InputStreamReader(inputStream);
            BufferedReader br = new BufferedReader(inputStreamReader);
            String line;
            while ((line = br.readLine()) != null) {
                success.append(line);
            }

            // Read the process's error output
            InputStream errorStream = pop.getErrorStream();
            InputStreamReader errorStreamReader = new InputStreamReader(errorStream);
            BufferedReader errorBr = new BufferedReader(errorStreamReader);
            String errorLine;
            while ((errorLine = errorBr.readLine()) != null) {
                error.append(errorLine);
            }
            if (StrUtil.isNotBlank(error.toString())) {
                log.error(error.toString());
            }

            pop.waitFor();
        } catch (IOException | InterruptedException e) {
            e.printStackTrace();
        }
        return success.toString();
    }
}
```

[Jsoup basics](https://www.cnblogs.com/zhangyinhua/p/8037599.html)
[Reading JavaScript variable values with Jsoup](https://blog.csdn.net/weinichendian/article/details/51490503)

### **`xhs-cookie.py`**

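1. Install a Python environment, then follow [[python] Using selenium on Linux (environment setup)](https://www.jianshu.com/p/cbc01d32c7b0) to install `Chrome` and `ChromeDriver`.
2. Install the script's dependency: `pip install selenium`

(The individual arguments are explained in the Java implementation.)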

```python
from time import sleep
from selenium import webdriver
import sys

ch_options = webdriver.ChromeOptions()

ch_options.add_argument('--no-sandbox')
ch_options.add_argument('--disable-dev-shm-usage')
ch_options.add_argument('--headless')
ch_options.add_argument('blink-settings=imagesEnabled=false')
ch_options.add_argument('--disable-gpu')

ch_options.add_argument('window-size=1920x1080')  # set the browser resolution

# Developer mode: drop the enable-automation switch
ch_options.add_experimental_option('excludeSwitches', ['enable-automation'])
ch_options.add_experimental_option('useAutomationExtension', False)
browser = webdriver.Chrome(options=ch_options)
browser.execute_cdp_cmd('Page.addScriptToEvaluateOnNewDocument', {
   'source': 'Object.defineProperty(navigator, "webdriver", {get: () => undefined})'
})
browser.get(sys.argv[1])
sleep(1)
cookies = browser.get_cookies()
res = []
for cookie in cookies:
    res.append("%s=%s" % (cookie["name"], cookie["value"]))
print(";".join(res))
browser.close()
```

### **`CookieServiceImpl`**

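At the time of writing, `org.seleniumhq.selenium selenium-java` appears to have removed `executeCdpCommand()`, and none of the versions I tried worked. This part took far too long, so with the Java approach it is left unhandled for now. Calling the page too frequently may get you caught by anti-scraping measures. You could also try implementing the method yourself.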

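1. As before, follow [[python] Using selenium on Linux (environment setup)](https://www.jianshu.com/p/cbc01d32c7b0) to install `Chrome` and `ChromeDriver`.
2. With Java, the code must call `System.setProperty("webdriver.chrome.driver", "url.../chromedriver.exe")` to specify the driver path.
3. Configure the various Chrome options.
4. Fetch the cookie list and join it into the final format.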

[Counter-anti-scraping with Python](https://blog.csdn.net/pythonauto/article/details/104744743)
[Custom implementation of executeCdpCommand for ChromeDriver](https://blog.csdn.net/yang_wen_wu/article/details/106292896)

```xml
<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-java</artifactId>
    <version>4.0.0-alpha-5</version>
</dependency>
```

```java
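import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.springframework.stereotype.Service;

import java.util.Collections;
import java.util.stream.Collectors;

@Service
public class CookieServiceImpl implements CookieService {

    @Override
    public String getCookie(String realUrl) {
        // Point Selenium at the ChromeDriver executable
        System.setProperty("webdriver.chrome.driver", "F://chromedriver/chromedriver.exe");

        ChromeOptions options = new ChromeOptions();
        // Developer mode: drop the enable-automation switch
        options.setExperimentalOption("excludeSwitches", Collections.singletonList("enable-automation"));
        // Start maximized
        options.addArguments("--start-maximized");
        // Hide scrollbars, to cope with some unusual pages
        options.addArguments("--hide-scrollbars");
        // Disable the sandbox; fixes the "DevToolsActivePort file doesn't exist" error
        options.addArguments("--no-sandbox");
        // Ignore certificate errors
        options.addArguments("--ignore-certificate-errors");
        // Skip the default-browser check
        options.addArguments("no-default-browser-check");
        // Disable extensions
        options.addArguments("disable-extensions");
        // Disable popup blocking
        // options.addArguments("--disable-popup-blocking");
        // Don't use the browser's shared resources on Linux (caches and the like)
        options.addArguments("--disable-dev-shm-usage");
        // Headless mode, so Linux hosts without a display don't raise errors
        options.addArguments("--headless");
        // Don't load images
        options.addArguments("blink-settings=imagesEnabled=false");
        // The Google docs mention adding this flag to work around a bug
        options.addArguments("--disable-gpu");
        // Set the resolution
        options.addArguments("window-size=1920x1080");
        ChromeDriver driver = new ChromeDriver(options);

        try {
            /*
             * TODO:
             * org.seleniumhq.selenium selenium-java has currently removed executeCdpCommand(),
             * while the method still exists in org.openqa.selenium.chromium (too niche).
             * Calling too frequently may get caught by anti-scraping measures.
             */
            /*
            Map<String,Object> command = Maps.newHashMap();
            command.put("source","Object.defineProperty(navigator, 'webdriver', {get: () => undefined})");
            driver.executeCdpCommand("Page.addScriptToEvaluateOnNewDocument",command);
            */

            driver.get(realUrl);
            try {
                // Give the page a moment to set its cookies
                Thread.sleep(1000);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }

            // Join the cookies into "name=value;name=value" form
            return driver.manage().getCookies()
                    .stream()
                    .map(cookie -> cookie.getName() + "=" + cookie.getValue())
                    .collect(Collectors.joining(";"));
        } finally {
            // Always shut the browser down so Chrome processes don't pile up
            driver.quit();
        }
    }
}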
```

Last modified: August 22, 2022

3 comments

唐宋丶
October 29th, 2021 at 03:27 pm

Looking up a post that has already been deleted will fail to parse.

唐宋丶
October 27th, 2021 at 02:06 pm

Request header update:
Python:
agent = browser.execute_script("return navigator.userAgent")
Java:
httpGet.setHeader("user-agent", "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/94.0.4606.54 Safari/537.36");

唐宋丶
October 8th, 2021 at 05:12 pm

After the National Day holiday, Xiaohongshu started splitting cookies by browser, so the user-agent needs to be updated to "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36"
