The simple script below returns garbled output. It works for most websites, but not William Hill:
var Browser = require("zombie");
var assert = require("assert");

// Load the page
var browser = new Browser();
browser.visit("http://sports.williamhill.com/bet/en-gb/betting/y/5/et/Football.html", function () {
  // Wait for the page to finish loading, then dump the HTML
  browser.wait(function () {
    console.log(browser.html());
  });
});
Run with node.
Output:
S锟斤拷锟斤拷J锟斤拷锟斤拷戟�RU�锟�kf锟�6锟斤拷锟�Efr2锟�Riz锟斤拷锟斤拷锟�^锟斤拷0锟�X锟� 锟斤拷{锟�^锟�a锟�yp锟斤拷p锟斤拷锟斤拷锟轿�锟斤拷`锟斤拷(锟斤拷锟�S]-锟斤拷'N锟�8q锟斤拷锟斤拷锟�/锟斤拷锟�?锟捷伙拷锟�u;锟捷�锟阶�锟�Ei俨>锟斤拷-锟斤拷锟�3锟桔�G锟�Ee锟�,锟斤拷mF锟斤拷锟�MI锟斤拷Q锟桔诧拷锟斤拷锟斤拷锟节�锟�ZG锟斤拷O锟�J锟�^S锟�C~g锟斤拷JO锟界饭锟�O�锟斤拷锟�P锟斤拷锟斤拷ET锟�n;v锟斤拷锟斤拷锟斤拷v锟斤拷锟�D锟�tvJn锟斤拷J锟�8'锟斤拷��r锟�v:锟斤拷m锟斤拷J锟斤拷Z锟�nh锟�]锟斤拷 锟斤拷锟斤拷Z锟斤拷锟斤拷.{Z锟斤拷硬l锟�B'锟�.露D锟�~$n锟�/锟斤拷u"锟�z锟斤拷锟斤拷锟�Ni锟斤拷"�锟斤拷\00_I\00\锟斤拷S锟斤拷O锟�E8{"锟�m;锟�h锟斤拷,o锟斤拷Q锟�y锟斤拷;锟斤拷a[锟斤拷锟斤拷锟斤拷c锟斤拷q锟�D锟诫��?锟斤拷/|?:锟�;锟斤拷Z!}锟斤拷/锟�w�锟�h锟�<锟斤拷锟斤拷锟斤拷锟�%锟斤拷锟斤拷锟斤拷A锟�K=-a锟斤拷~'
(actual output is much longer)
Does anyone know why this happens, and specifically why it happens on the only site I actually want to scrape?
Thanks
Problem courtesy of: Hugh M Halford-Thompson
Solution
I abandoned this method long ago, but in case anyone is interested, I got a reply from one of the zombie.js devs.
https://github.com/assaf/zombie/issues/251#issuecomment-5969175
He says: "Zombie will now send accept-encoding header to indicate it does not support gzip."
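That reply suggests the rubbish was a gzip-compressed response body being printed without being decompressed first. Here is a minimal sketch of the effect, using only Node's built-in `zlib` module rather than Zombie itself. The sample HTML string and the variable names are made up for illustration, and this is not how Zombie handles responses internally.

```javascript
var zlib = require("zlib");

// Hypothetical page body, standing in for the real response.
var html = "<html><body><h1>Football</h1></body></html>";

// Roughly what a server sends when it answers with Content-Encoding: gzip.
var compressed = zlib.gzipSync(html);

// Treating the compressed bytes as text gives unreadable output,
// similar in spirit to the dump in the question.
var garbled = compressed.toString("utf8");

// Decompressing before decoding restores the original markup.
var restored = zlib.gunzipSync(compressed).toString("utf8");

console.log("garbled matches original: " + (garbled === html));
console.log("restored matches original: " + (restored === html));
```

If the client cannot decompress gzip, the other way out is the one described in the quote: tell the server so in the `Accept-Encoding` request header, so that it sends the body uncompressed.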
Thank you to all who looked into this.
Solution courtesy of: Hugh M Halford-Thompson