Zombie.js in node.js fails to scrape certain websites

JavaScript 2017-12-14

The simple script below prints a bunch of rubbish instead of the page's HTML. It works for most websites, but not for William Hill:

var Browser = require("zombie");
var assert = require("assert");

// Load the William Hill football betting page
var browser = new Browser();
browser.visit("http://sports.williamhill.com/bet/en-gb/betting/y/5/et/Football.html", function () {
  // Wait for pending events to finish, then print the resulting HTML
  browser.wait(function () {
    console.log(browser.html());
  });
});

Run with node.

Output:

S锟斤拷锟斤拷J锟斤拷锟斤拷戟�RU�锟�kf锟�6锟斤拷锟�Efr2锟�Riz锟斤拷锟斤拷锟�^锟斤拷0锟�X锟� 锟斤拷{锟�^锟�a锟�yp锟斤拷p锟斤拷锟斤拷锟轿�锟斤拷`锟斤拷(锟斤拷锟�S]-锟斤拷'N锟�8q锟斤拷锟斤拷锟�/锟斤拷锟�?锟捷伙拷锟�u;锟捷�锟阶�锟�Ei俨>锟斤拷-锟斤拷锟�3锟桔�G锟�Ee锟�,锟斤拷mF锟斤拷锟�MI锟斤拷Q锟桔诧拷锟斤拷锟斤拷锟节�锟�ZG锟斤拷O锟�J锟�^S锟�C~g锟斤拷JO锟界饭锟�O�锟斤拷锟�P锟斤拷锟斤拷ET锟�n;v锟斤拷锟斤拷锟斤拷v锟斤拷锟�D锟�tvJn锟斤拷J锟�8'锟斤拷��r锟�v:锟斤拷m锟斤拷J锟斤拷Z锟�nh锟�]锟斤拷 锟斤拷锟斤拷Z锟斤拷锟斤拷.{Z锟斤拷硬l锟�B'锟�.露D锟�~$n锟�/锟斤拷u"锟�z锟斤拷锟斤拷锟�Ni锟斤拷"�锟斤拷\00_I\00\锟斤拷S锟斤拷O锟�E8{"锟�m;锟�h锟斤拷,o锟斤拷Q锟�y锟斤拷;锟斤拷a[锟斤拷锟斤拷锟斤拷c锟斤拷q锟�D锟诫��?锟斤拷/|?:锟�;锟斤拷Z!}锟斤拷/锟�w�锟�h锟�<锟斤拷锟斤拷锟斤拷锟�%锟斤拷锟斤拷锟斤拷A锟�K=-a锟斤拷~'

(actual output is much longer)

Does anyone know why this happens, and specifically why it happens on the only site I actually want to scrape?

Thanks

Problem courtesy of: Hugh M Halford-Thompson

Solution

I abandoned this method long ago, but in case anyone is interested, I got a reply from one of the zombie.js devs:

https://github.com/assaf/zombie/issues/251#issuecomment-5969175

He says: "Zombie will now send accept-encoding header to indicate it does not support gzip."
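
The reply suggests that the rubbish was a gzip-compressed response body printed as if it were text. Below is a minimal sketch of that failure mode and of the decoding step that avoids it. It uses only Node's built-in zlib module, not Zombie itself. The helper name decodeBody and the sample HTML are hypothetical, chosen for illustration; this is not how Zombie handles responses internally.

```javascript
var zlib = require("zlib");

// Hypothetical helper: turn a raw response body into text according to
// the value of the response's Content-Encoding header.
function decodeBody(buffer, contentEncoding) {
  var encoding = (contentEncoding || "identity").trim().toLowerCase();
  if (encoding === "gzip" || encoding === "x-gzip") {
    // gzip-compressed body: decompress before decoding as text
    return zlib.gunzipSync(buffer).toString("utf8");
  }
  if (encoding === "deflate") {
    // zlib-wrapped deflate body
    return zlib.inflateSync(buffer).toString("utf8");
  }
  // identity (uncompressed) body
  return buffer.toString("utf8");
}

// Simulate a server that answers with a gzip-compressed page.
var html = "<html><body>Football</body></html>";
var compressed = zlib.gzipSync(Buffer.from(html, "utf8"));

// Treating the compressed bytes as text yields unreadable output.
var garbled = compressed.toString("utf8");

// Decompressing first recovers the original markup.
var decoded = decodeBody(compressed, "gzip");
console.log(decoded);
```

The other way around the problem, and the one described in the reply, is for the client to tell the server it cannot handle gzip (for example with an Accept-Encoding: identity request header), so that the body arrives uncompressed in the first place.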

Thank you to all who looked into this.

Solution courtesy of: Hugh M Halford-Thompson
