mirroring with wget

Posted: Sat Sep 22, 2007 11:13 pm
by wesyah234
I'm interested in mirroring my CMSMS site using wget. I'm planning to schedule regular wget mirrors from my main server, which runs PHP and CMSMS, to another server that serves only static content. If the main site becomes unavailable, I can switch DNS to the static site and serve the site statically until I resolve the issue.

I'm trying this command:

wget --mirror -w 2 -p --html-extension --convert-links -P .

and I'm having a bit of trouble with how it gets the CSS... the generated mirror works pretty well but doesn't get the CSS properly.

I'm just wondering if anyone's ever done this successfully.

Thanks!

Re: mirroring with wget

Posted: Sun Sep 23, 2007 9:36 am
by KO
The CSS most likely comes from a stylesheet stored in the database and linked by the {stylesheet} tag in the page template. If you look at the source of your mirror, you can probably see the link to the stylesheet. What does it look like? Is it accessible from the mirror?

br, K

Re: mirroring with wget

Posted: Sun Sep 23, 2007 1:38 pm
by Pierre M.
Hello,
wesyah234 wrote: and I'm having a bit of trouble with how it gets the CSS... the generated mirror works pretty well but doesn't get the CSS properly.
Yes, wget is a (wonderful) network tool but not an xHTML parser.
But you seem skilled. Maybe you can combine wget with a parser to find the remaining URLs to fetch?
And don't forget to fetch script.js too ;-)

If this is too boring (or not enough command line fun, your choice), have a look at httrack. But I'd rather see a solution with wget :-)

Pierre M.

Re: mirroring with wget

Posted: Sun Sep 23, 2007 2:31 pm
by tsw
because I'm lazy I just use wget -r http://domain.tld/

by default -r stays in the same domain, might not be what you want tho...

Re: mirroring with wget

Posted: Sun Sep 23, 2007 2:41 pm
by Pierre M.
Does wget -r parse the xHTML to discover and fetch the CSS too ?
Pierre

Re: mirroring with wget

Posted: Sun Sep 23, 2007 3:14 pm
by tsw
It will find all images, links, CSS and whatever else is referenced from the start page and fetch them, then check the newly fetched HTML pages for more images, links and CSS and fetch those too (as long as they are in the same domain).

man wget will tell more ;)

Re: mirroring with wget (proposed solution)

Posted: Mon Sep 24, 2007 3:51 pm
by wesyah234
Sorry, I should have popped back into this discussion sooner.

I've done more research. wget does fetch the CSS and save it locally, but when the mirror serves it, it arrives with a Content-Type of text/plain. Firefox then doesn't apply the CSS to the page, because it's not text/css.

When I check the real page, via CMSMS, it gets the proper content type set to text/css (because stylesheet.php sets this)

(Interestingly, IE displays the mirror correctly with the CSS, presumably because it's more forgiving, and will apply css even if it has the wrong content type.)

If you'd like to see for yourself, here are the urls (use firebug, or livehttpheaders in firefox to inspect the headers)

main site via CMSMS:
http://www.yawebhost.com
stylesheet (that has correct contenttype)
http://www.yawebhost.com/stylesheet.php?templateid=12

mirror site:
http://mirror.yawebhost.com/www.yawebhost.com/
mirrored stylesheet (that has wrong contenttype of text/plain)
http://mirror.yawebhost.com/www.yawebho ... plateid=12
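
For anyone who prefers the command line to Firebug, the header check described above can be sketched in shell. This is only an illustration: the helper name content_type is made up here, and curl is assumed to be installed (it is not mentioned in the thread). The function just picks the Content-Type line out of whatever response headers it is given.

```shell
#!/bin/sh
# Print the Content-Type value found in HTTP response headers read from stdin.
# content_type is a hypothetical helper name, not part of wget or curl.
content_type() {
    # Strip carriage returns, then match the header name case-insensitively.
    tr -d '\r' | awk -F': *' 'tolower($1) == "content-type" { print $2; exit }'
}

# Example (needs network access and curl):
#   curl -sI 'http://www.yawebhost.com/stylesheet.php?templateid=12' | content_type
```

Running it against both the main site and the mirror should show the text/css versus text/plain difference, assuming both hosts are still reachable.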

I haven't found a way to force wget to look at the content-type and append an appropriate extension to the file (in this case, a css extension would do the trick)
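
A note for readers on newer wget releases: according to the wget manual, from version 1.12 the --html-extension option was renamed --adjust-extension, and it also appends .css to downloaded files served as text/css. If that applies to your version, a mirror command along these lines may avoid the problem without touching CMSMS. The function name mirror_site and its two arguments are assumptions made for this sketch.

```shell
#!/bin/sh
# Sketch: mirror a site with a wget that understands --adjust-extension
# (wget 1.12 or later). mirror_site is a hypothetical helper name.
mirror_site() {
    url=$1     # site to mirror, e.g. http://www.yawebhost.com/
    dest=$2    # directory that receives the mirror

    wget --mirror --wait=2 --page-requisites \
         --adjust-extension --convert-links \
         --directory-prefix="$dest" "$url"
}

# Example (needs network access):
#   mirror_site 'http://www.yawebhost.com/' .
```

The long option names are used instead of -w, -p and -P only for readability; they are the same options as in the original command.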

So, I propose a solution inside CMSMS... here it is:

Change the creation of the stylesheet URLs so that they have an extra dummy parameter (at the end, of course), something like filename=stylesheet.css.  This would cause wget to save the local file with a .css extension, and thus Apache, when serving up the mirror, would set the content type correctly.

I tested this by manually changing the mirrored stylesheet filename, and also changing the stylesheet reference in the index.html, the results were a success, and you can see them here:

http://mirror.yawebhost.com/fixedcss/
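
The manual fix described above (rename the mirrored stylesheet, then update the reference in the HTML) could be scripted roughly as follows. This is a sketch under assumptions: the function name fix_css_names is made up, the stylesheet files are assumed to be saved as "stylesheet.php?templateid=N", the pages are assumed to end in .html with double-quoted href attributes, and wget --convert-links is assumed to write the "?" in local links as "%3F".

```shell
#!/bin/sh
# Sketch: give mirrored CMSMS stylesheets a .css suffix so a static web
# server can choose the MIME type from the file name.
# fix_css_names is a hypothetical helper; the naming scheme is an assumption.
fix_css_names() {
    mirror_dir=$1

    # 1. Rename the saved stylesheet files (skip ones already renamed).
    find "$mirror_dir" -type f -name 'stylesheet.php[?]templateid=*' ! -name '*.css' |
    while IFS= read -r path; do
        mv "$path" "$path.css"
    done

    # 2. Point the HTML pages at the renamed files. Both the plain "?" and
    #    the "%3F" spelling are rewritten to the "%3F" form plus ".css".
    find "$mirror_dir" -type f -name '*.html' |
    while IFS= read -r page; do
        sed -e 's/stylesheet\.php?templateid=\([0-9][0-9]*\)"/stylesheet.php%3Ftemplateid=\1.css"/g' \
            -e 's/stylesheet\.php%3Ftemplateid=\([0-9][0-9]*\)"/stylesheet.php%3Ftemplateid=\1.css"/g' \
            "$page" > "$page.tmp" && mv "$page.tmp" "$page"
    done
}

# Example: fix_css_names ./www.yawebhost.com
```

Because the patterns require the closing quote right after the template id, running the function a second time on the same mirror leaves already fixed links alone.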

It should be relatively simple to add this dummy parameter to the codebase... what do you think?

Re: mirroring with wget

Posted: Mon Sep 24, 2007 7:29 pm
by KO
One way could be to hardcode the link to the stylesheet in your template header the "old fashioned" way, as a file. Editing would not happen through the Admin tools then. Some like it - some don't.

Re: mirroring with wget

Posted: Mon Sep 24, 2007 8:33 pm
by wesyah234
KO wrote: One way could be to hardcode the link to the stylesheet in your template header the "old fashioned" way, as a file. Editing would not happen through the Admin tools then. Some like it - some don't.
I thought of that... it's definitely a way to force the .css extension, but I much prefer altering CMSMS to generate stylesheet links with that dummy parameter.  It would help so many more people who'd like to mirror a static snapshot of their site.

I tried modifying function.stylesheet.php.

I changed this line:
                        $stylesheet .= "\" />\n";

to:
                        $stylesheet .= "&dummy=f.css\" />\n";

I ran wget again and tried the resulting mirror, and it worked: the CSS was saved with a .css extension and was therefore served by Apache as text/css.

Could this modification be considered in the core product?

Re: mirroring with wget

Posted: Tue Sep 25, 2007 5:41 pm
by Pierre M.
Hello,

1°)My mistake, wget is even more powerful than I had thought, because it parses xHTML to fetch needed images, CSS...
2°)I like your patch, because :
-it is simple,
-it should work as you describe, although I haven't tested,
-it beautifies CSS URLs a bit,
-it simplifies mirroring with wget (it is in fact automagic reliable static mirroring)

I hope a dev will merge it. But it may be too late for 1.2.

Pierre M.

Re: mirroring with wget

Posted: Tue Sep 25, 2007 7:39 pm
by tsw
maybe stylesheets could have their own rewrite rules just like content pages have.

that way you can use cleaner css includes (e.g. stylesheet_$templateid.css) and it will still be processed through php. (of course this needs the rewrite rule)

my 2cents

Re: mirroring with wget

Posted: Tue Sep 25, 2007 7:45 pm
by wesyah234
Not exactly sure what you mean, but here's another idea along the same lines:

the {stylesheet} tag could take an optional parameter of the "extension", like

{stylesheet extension=.css}

then the stylesheet function would use this param to add to the end of the stylesheet url link.  Similar to what I recommended above, however more flexible as one can make use of it only if they desire.  (downside is that people trying to mirror will still be frustrated until they discover this, whereas putting it in by default would save people the hassle in the first place, with no negative effect on others)

Re: mirroring with wget

Posted: Tue Sep 25, 2007 8:16 pm
by tsw
with mod_rewrite (pretty urls) a page url can be changed from domain.tld/index.php?page=foo to domain.tld/foo.html. The same method could be applied to stylesheets as well.

Re: mirroring with wget

Posted: Tue Sep 25, 2007 8:23 pm
by wesyah234
I understand... though the configuration would then need to know that you're using mod_rewrite and generate stylesheet links correctly to match the rewriting... either way, a change to the way stylesheet links are created would be required. 

Re: mirroring with wget

Posted: Tue Sep 25, 2007 9:01 pm
by tsw
code change is required anyway.

we already have a config flag for mod_rewrite ;)