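<div class="tip share">Note: this article was written 1319 days ago and last modified 991 days ago; some of its information may be out of date.</div>

Update, 2022-08-22 14:14:13: Xiaohongshu recently changed how it delivers images. The original image is no longer sent down, and I can't find a pattern in the data that is. If I get the chance, I'll try to recover it by reverse-engineering the JS.

## Preface

Images saved from Xiaohongshu come with a watermark, and opening them in a browser shows the watermarked version too, which makes it hard to collect pictures of pretty girls (just kidding).
On top of that, the various mini-programs are stingy and only give you two or three free uses, so I decided to build my own.

<!--more-->

## Approach

When a user uploads an image, Xiaohongshu stores two copies in cloud storage: the original and a watermarked one. By default, the response shows the watermarked copy.
Fortunately, the msgId of both copies can be scraped from `window.__INITIAL_SSR_STATE__` (the code below reads it from the `traceId` field), so all you need is the original's msgId and a quick swap to get the full path of the image.

In the first version, the hard-coded `cookie` is valid for 7 days, and because the cookie lives in the `Request Header` it can't be obtained directly through Jsoup. So I use `selenium` to simulate a Chrome visit and grab the cookie.
When the cookie expires, a fresh one is fetched automatically for the requested URL and the stored value is updated. (Because the cookie is refreshed only once every 7 days, I have not dealt with the Shumei slider CAPTCHA that is triggered by fetching cookies too often. The same goes for scraping too much in one go; I haven't measured the exact threshold.)

**I recommend `Python + Selenium`; don't use Java! There are far too many pitfalls (covered below)!**

### **`ImageController`**

Parses the text produced by the Xiaohongshu app's **"Copy link"** action (stripping the text and emoji) to get the short link.

```java
// StrUtil is assumed to be Hutool's cn.hutool.core.util.StrUtil.
import cn.hutool.core.util.StrUtil;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Controller;
import org.springframework.ui.Model;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestParam;

import java.util.List;

/**
 * Description: watermark-free image parsing for Xiaohongshu
 *
 * @author CaiTianXin
 * @date 2021/9/9 15:55
 */
@Controller
@RequestMapping(value = "/api", params = "service=image")
public class ImageController {

    @Autowired
    private ImageService imageService;

    /**
     * Resolves the watermark-free images behind a url
     *
     * @param url share text (complete), for example:
     *            44奈奈无鱼发布了一篇小红书笔记,快来看吧![laughing]WsaHfdrcQ6mC8Ir[laughing]http://xhslink.com/jbhS5d,复制本条信息,打开【小红书】App查看精彩内容!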
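     * @return {@link String} name of the view to render
     */
    @RequestMapping(params = "method=getImageList")
    public String getImageList(@RequestParam("url") String url, Model model) {
        String shareUrl;
        List<String> imageList;
        if (StrUtil.startWith(url, "http")) {
            // Already a plain link
            shareUrl = url;
        } else {
            try {
                // Share text copied from the desktop client
                shareUrl = StrUtil.removeSuffix(url.split(" ")[2], ",复制本条信息,打开【小红书】App查看精彩内容!");
            } catch (Exception e) {
                // Emoji handling
                try {
                    // Share text copied on Android
                    shareUrl = StrUtil.removeSuffix(url.split("\\[laughing\\]")[2], ",复制本条信息,打开【小红书】App查看精彩内容!");
                } catch (Exception ex) {
                    // Share text copied on iOS
                    shareUrl = StrUtil.removeSuffix(url.split("\uD83D\uDE06 ")[2], ",复制本条信息,打开【小红书】App查看精彩内容!");
                }
            }
        }
        imageList = imageService.getImageList(shareUrl);
        if (imageList.isEmpty()) {
            return "update";
        }
        model.addAttribute("imageList", imageList);
        return "list";
    }
}
```

### **`ImageServiceImpl`**

1. Follow the short link's redirect to get the actual realUrl.
2. Build the request headers for realUrl, cookie included, and fetch the page with HttpClient.
3. Use jsoup to turn the response into a Document, parse the `<script></script>` tag, and assemble the watermark-free image URLs.
4. If the cookie has expired, run the Python script from the system command line, read the new cookie from its output stream, and update the stored value.

```java
// StrUtil and JSONUtil are assumed to be Hutool's; the HTTP classes are Apache HttpClient 4.x.
import cn.hutool.core.util.StrUtil;
import cn.hutool.json.JSONUtil;
import lombok.extern.log4j.Log4j2;
import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.protocol.HttpClientContext;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.client.RedirectLocations;
import org.apache.http.protocol.BasicHttpContext;
import org.apache.http.protocol.HttpContext;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.springframework.stereotype.Service;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

@Service
@Log4j2
public class ImageServiceImpl implements ImageService {

    private String cookie = "xhsTracker=url=noteDetail&xhsshare=CopyLink; xhsTrackerId=096eb980-05dd-4302-c333-b7b47527ba31; extra_exp_ids=gif_exp1,ques_clt2; timestamp2=202109151bf286650fd065d7ba66395f; timestamp2.sig=6xpJcjGtCWzhM-jZ0lCZTNIv7ndcYzoC-7n0YdCvYwM";

    /**
     * Resolves the watermark-free images behind a url
     *
     * @param shareUrl share url / web page url
     * @return {@link List<String>}
     */
    @Override
    public List<String> getImageList(String shareUrl) {
        String realUrl = this.getRealUrl(shareUrl);
        List<String> result = new ArrayList<>(16);
        HttpContext httpContext = new BasicHttpContext();
        CloseableHttpClient httpClient = HttpClients.createDefault();
        HttpGet httpGet = new HttpGet(realUrl);
        httpGet.setHeader("accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8");
        httpGet.setHeader("accept-encoding", "gzip, deflate, br");
        httpGet.setHeader("accept-language", "zh-CN,zh;q=0.9");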
        httpGet.setHeader("cookie", cookie);
        httpGet.setHeader("upgrade-insecure-requests", "1");
        httpGet.setHeader("user-agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3876.400 QQBrowser/10.8.4503.400");
        HttpEntity entity;
        CloseableHttpResponse response = null;
        try {
            // Send the HTTP request
            response = httpClient.execute(httpGet, httpContext);
            if (response.getStatusLine().getStatusCode() == 200) {
                // Read the response body
                entity = response.getEntity();
                // Convert the response body to HTML source
                String html = EntityUtils.toString(entity, "UTF-8");
                Document doc = Jsoup.parse(html);
                // Parse by tag: the state lives in the last <script> element
                Element script = doc.getElementsByTag("script").last();
                String value = StrUtil.removePrefix(script.html(), "window.__INITIAL_SSR_STATE__=");
                Map<String, Object> map = JSONUtil.toBean(value, Map.class, true);
                Map<String, Object> noteView = (Map<String, Object>) map.get("NoteView");
                Map<String, Object> noteInfo = (Map<String, Object>) noteView.get("noteInfo");
                List<Map<String, String>> imageList = (List<Map<String, String>>) noteInfo.get("imageList");
                imageList.forEach(item -> result.add("http://ci.xiaohongshu.com/" + item.get("traceId")));
            }
        } catch (Exception e) {
            // Any failure here is treated as an expired cookie
            this.updateCookie(this.getCookie(realUrl));
            log.info("Cookie expired; updated automatically.");
        } finally {
            try {
                if (response != null) {
                    response.close();
                }
                if (httpClient != null) {
                    httpClient.close();
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        return result;
    }

    /**
     * Gets the real address after redirection
     *
     * @param shareUrl url of the share link
     * @return {@link String}
     */
    private String getRealUrl(String shareUrl) {
        CloseableHttpClient httpClient = HttpClients.createDefault();
        HttpGet httpGet = new HttpGet(shareUrl);
        HttpContext httpContext = new BasicHttpContext();
        CloseableHttpResponse response = null;
        RedirectLocations redirectLocations = null;
        try {
            response = httpClient.execute(httpGet, httpContext);
            // Get the URI of the request that was actually made
            redirectLocations = (RedirectLocations) httpContext.getAttribute(HttpClientContext.REDIRECT_LOCATIONS);
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                if (response != null) {
                    response.close();
                }
                if (httpClient != null) {
                    httpClient.close();
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        // No redirect was recorded (or the request failed): fall back to the input url
        if (redirectLocations == null || redirectLocations.size() == 0) {
            return shareUrl;
        }
        return redirectLocations.get(0).toString();
    }

    @Override
    public void updateCookie(String cookie) {
        this.cookie = cookie;
    }

    private String getCookie(String realUrl) {
        if (StrUtil.isBlank(realUrl)) {
            realUrl = "https://www.xiaohongshu.com/discovery/item/6151406b00000000210395ce";
        }
        String execString = "python /xxx/xhs-cookie.py " + realUrl;
        StringBuilder success = new StringBuilder();
        StringBuilder error = new StringBuilder();
        try {
            Process pop = Runtime.getRuntime()
                    .exec(execString);
            // Read the process's standard output
            InputStream inputStream = pop.getInputStream();
            InputStreamReader inputStreamReader = new InputStreamReader(inputStream);
            BufferedReader br = new BufferedReader(inputStreamReader);
            String line;
            while ((line = br.readLine()) != null) {
                success.append(line);
            }
            // Read the process's error output
            InputStream errorStream = pop.getErrorStream();
            InputStreamReader errorStreamReader = new InputStreamReader(errorStream);
            BufferedReader errorBr = new BufferedReader(errorStreamReader);
            String errorLine;
            while ((errorLine = errorBr.readLine()) != null) {
                error.append(errorLine);
            }
            if (StrUtil.isNotBlank(error.toString())) {
                log.error(error.toString());
            }
            pop.waitFor();
        } catch (IOException | InterruptedException e) {
            e.printStackTrace();
        }
        return success.toString();
    }
}
```

[Jsoup basics](https://www.cnblogs.com/zhangyinhua/p/8037599.html)

[Reading JS variable values with Jsoup](https://blog.csdn.net/weinichendian/article/details/51490503)

### **`xhs-cookie.py`**

1. Install a Python environment, then install `Chrome` and `ChromeDriver` following [Python: using selenium on Linux (environment setup)](https://www.jianshu.com/p/cbc01d32c7b0).
2. Install the script's dependency: `pip install selenium`

(The individual options are explained in the Java implementation.)

```python
from time import sleep
from selenium import webdriver
import sys

ch_options = webdriver.ChromeOptions()
ch_options.add_argument('--no-sandbox')
ch_options.add_argument('--disable-dev-shm-usage')
ch_options.add_argument('--headless')
ch_options.add_argument('blink-settings=imagesEnabled=false')
ch_options.add_argument('--disable-gpu')
ch_options.add_argument('--window-size=1920,1080')  # Browser resolution (Chrome expects width,height)

# "Developer mode": exclude the enable-automation switch
ch_options.add_experimental_option('excludeSwitches', ['enable-automation'])
ch_options.add_experimental_option('useAutomationExtension', False)

browser = webdriver.Chrome(options=ch_options)
# Hide navigator.webdriver before any page script runs
browser.execute_cdp_cmd('Page.addScriptToEvaluateOnNewDocument', {
    'source': 'Object.defineProperty(navigator, "webdriver", {get: () => undefined})'
})

browser.get(sys.argv[1])
sleep(1)
cookies = browser.get_cookies()

# Join the cookies into a single header value and print it for the Java caller
res = []
for cookie in cookies:
    res.append("%s=%s" % (cookie["name"], cookie["value"]))
print(";".join(res))
# quit() also stops the chromedriver process; close() only closes the window
browser.quit()
```

### **`CookieServiceImpl`**

At the moment `org.seleniumhq.selenium selenium-java` has removed `executeCdpCommand()`, and none of the versions I tried worked. This part took far too long, so if you go the Java route it is left unhandled for now. Calling too frequently may get you caught by the anti-scraping checks. You can also try implementing the method yourself.

1. As before, install `Chrome` and `ChromeDriver` following [Python: using selenium on Linux (environment setup)](https://www.jianshu.com/p/cbc01d32c7b0).
2. With Java, the driver path has to be set in code: `System.setProperty("webdriver.chrome.driver", "url.../chromedriver.exe")`.
3. Configure the various Chrome options.
4. Fetch the cookie list and join it into the final format.

[Python-based anti-anti-scraping](https://blog.csdn.net/pythonauto/article/details/104744743)

[Custom implementation of executeCdpCommand for ChromeDriver](https://blog.csdn.net/yang_wen_wu/article/details/106292896)

```xml
<!-- https://mvnrepository.com/artifact/org.seleniumhq.selenium/selenium-java -->
<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-java</artifactId>
    <version>4.0.0-alpha-5</version>
</dependency>
```

```java
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.springframework.stereotype.Service;

import java.util.Collections;
import java.util.stream.Collectors;

@Service
public class CookieServiceImpl implements CookieService {

    @Override
    public String getCookie(String realUrl) {
        // Path to the ChromeDriver binary
        System.setProperty("webdriver.chrome.driver", "F://chromedriver/chromedriver.exe");
        ChromeOptions options = new ChromeOptions();
        // "Developer mode": exclude the enable-automation switch
        options.setExperimentalOption("excludeSwitches", Collections.singletonList("enable-automation"));
        // Start maximized
        options.addArguments("--start-maximized");
        // Hide scrollbars,
        // to cope with some special pages
        options.addArguments("--hide-scrollbars");
        // Disable the sandbox; fixes the "DevToolsActivePort file doesn't exist" error
        options.addArguments("--no-sandbox");
        // Ignore certificate errors
        options.addArguments("--ignore-certificate-errors");
        // Skip the default-browser check
        options.addArguments("no-default-browser-check");
        // Disable extensions
        options.addArguments("disable-extensions");
        // Disable popup blocking
        // options.addArguments("--disable-popup-blocking");
        // Disable the browser's shared resources on Linux (caches and the like)
        options.addArguments("--disable-dev-shm-usage");
        // Headless mode, to avoid errors on Linux hosts without a display
        options.addArguments("--headless");
        // No-image mode
        options.addArguments("blink-settings=imagesEnabled=false");
        // Google's documentation says this flag is needed to work around a bug
        options.addArguments("--disable-gpu");
        // Resolution (Chrome expects width,height)
        options.addArguments("--window-size=1920,1080");

        ChromeDriver driver = new ChromeDriver(options);
        try {
            /**
             * TODO:
             * org.seleniumhq.selenium selenium-java has currently removed executeCdpCommand(),
             * while the method still exists in org.openqa.selenium.chromium (too niche).
             * Calling too frequently may get caught by the anti-scraping checks.
             */
            /*
            Map<String,Object> command = Maps.newHashMap();
            command.put("source","Object.defineProperty(navigator, 'webdriver', {get: () => undefined})");
            driver.executeCdpCommand("Page.addScriptToEvaluateOnNewDocument",command);
            */

            driver.get(realUrl);
            try {
                Thread.sleep(1000);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
            return driver.manage().getCookies()
                    .stream()
                    .map(cookie -> cookie.getName() + "=" + cookie.getValue())
                    .collect(Collectors.joining(";"));
        } finally {
            // Always shut the browser and driver process down
            driver.quit();
        }
    }
}
```

Last modified: August 22, 2022
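
Since the article recommends Python over Java, here is a minimal Python sketch of the two parsing steps described above: pulling the short link out of the copied share text, and reading the image list out of `window.__INITIAL_SSR_STATE__`. The field names (`NoteView`, `noteInfo`, `imageList`, `traceId`) and the image host are taken from the Java code in this article. The function names, the single-regex approach to the share text, and the assumption that the embedded state is valid JSON are mine, not the original implementation. Given the update note at the top, the page format may no longer match.

```python
import json
import re

# Hypothetical helper names. The state prefix, field names and image host
# are the ones used in the article's Java code.
SHORT_LINK = re.compile(r"https?://xhslink\.com/[A-Za-z0-9]+")
STATE_PREFIX = "window.__INITIAL_SSR_STATE__="
IMAGE_HOST = "http://ci.xiaohongshu.com/"


def extract_short_link(share_text):
    """Return the first xhslink.com short link in the share text, or None."""
    match = SHORT_LINK.search(share_text)
    return match.group(0) if match else None


def image_urls_from_html(html):
    """Build image URLs from the SSR state embedded in the page HTML.

    Assumption: the text between the prefix and the closing </script>
    is valid JSON. Returns an empty list when the state or the note
    data is missing (for example, for a deleted post).
    """
    start = html.find(STATE_PREFIX)
    if start == -1:
        return []
    start += len(STATE_PREFIX)
    end = html.find("</script>", start)
    if end == -1:
        end = len(html)
    raw = html[start:end].strip().rstrip(";")
    state = json.loads(raw)
    note_info = (state.get("NoteView") or {}).get("noteInfo") or {}
    images = note_info.get("imageList") or []
    return [IMAGE_HOST + image["traceId"] for image in images]
```

A caller would pass the copied share text to `extract_short_link`, follow the redirect to the real page, fetch it with the cookie header, and hand the HTML to `image_urls_from_html`. An empty result corresponds to the `"update"` branch in `ImageController`.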
3 comments
Looking up a post that has already been deleted makes the parsing fail.
Request header update:
Python:
agent = browser.execute_script("return navigator.userAgent")
Java:
httpGet.setHeader("user-agent", "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/94.0.4606.54 Safari/537.36");
After the National Day holiday, Xiaohongshu started splitting cookies by browser, so the user-agent needs to be updated to "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36"