How to import lots of static content?

Post by jelle »

Hi Forum,

The site I am trying to convert is fairly big: ~4000 HTML pages in all (and ~300 PDFs).
That's no fun to convert by hand, so I'll have to find a way to automate at least part of it. Is there a standard way to do this, or are there shortcuts I could take?

It's all on a new 0.11.2 install, running on Debian Linux. 

Any pointers on how to accomplish this from within PHP are welcome too.

Post by roman »

For now, you can only look at the table structure used for pages and write your own migration script to import the data directly into CMSMS.

Post by nils73 »

jelle wrote: The site I am trying to convert is fairly big: ~4000 HTML pages in all (and ~300 PDFs).
That's no fun to convert by hand, so I'll have to find a way to automate at least part of it. Is there a standard way to do this, or are there shortcuts I could take?
jelle, this is a fair amount of content ... and I assume that it is formatted content, right? How much of the "old" HTML can you re-use? Do you have a new design? Is it perhaps the "worst case", i.e. an old website built with tables and a new website without?

The reason why I ask is that I might have a solution for you ... but before I recommend just anything I would like to know what kind of tools match your request best.

Regards,
Nils

Post by jelle »

That is correct, it comes from a frame-based site with moderate table tag soup. It is HTML, so it is machine-parsable. There is some navigation cruft that needs to be trimmed, and maybe the tables need to be ripped out (not necessarily), but I don't think it is too hard to script. The biggest problem as I see it now is getting the structure right.

What did you have in mind?
Last edited by jelle on Mon Feb 06, 2006 11:48 pm, edited 1 time in total.

Post by nils73 »

Okay, that sounds reasonable. I would try running some HTML test files through "HTML Stripper", a very lightweight yet powerful tool that lets you specify which tags should be filtered. The default language is German, but you can switch to English. To run the program in batch mode, you need to configure it first (important!!!) and then drag and drop all files onto the right pane of the program window. That gets you more or less what you want. Then all you need to do is to write all files into one (not too difficult) or create a SQL statement.

This is probably the easiest way, and it works pretty well for simple static websites with little tag soup. If things are more complex, you might like to try some more advanced tools (no longer free) or write your own regexp-based PHP code ... which might be the most time-consuming option.

Regards,
Nils

Post by jelle »

Alas, I was hoping for some PHP script that imports pages directly into cmsms, but I fear I'll have to write that one myself... I don't have any Windows machines to run it on, and suffering through German and Wine may be a little too much to ask.

But in this remark:
Then all you need to do is to write all files into one (not too difficult) or create a SQL statement.
That last part sounds interesting. Are you suggesting importing the content directly via SQL?

My vague idea was this: somehow find a way to map all pages into a sitemap with meaningful titles instead of abbreviated filenames. Then, according to that list, read each file, strip out unneeded stuff like ^>* and everything else I don't need, and rewrite the img URLs (they need /uploads in front of them).
Then instantiate a new page object, fill in the blanks and write it to the db with its own methods (assuming it is a proper object and can be used that way).
That is probably a lot more work than it sounds, hence my question here. In the meantime I think I'll leave this until my client has decided to hire my services  ;)
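
A minimal sketch of the "rewrite img urls" step from the idea above, in Python rather than PHP. The function name `rewrite_img_urls`, the regex-based approach and the decision to flatten leading `../` segments are assumptions made for illustration; only the `/uploads` prefix comes from the post.

```python
import re

# Assumption: every image ends up somewhere under /uploads, as the post suggests.
UPLOAD_PREFIX = "/uploads"

# Matches the src attribute of an <img> tag, with single or double quotes.
IMG_SRC = re.compile(r'''(<img\b[^>]*?\bsrc\s*=\s*)(["'])(.*?)\2''',
                     re.IGNORECASE | re.DOTALL)
# URLs with a scheme (http:, mailto:, ...) or protocol-relative URLs (//host/...).
ABSOLUTE = re.compile(r'^(?:[a-z][a-z0-9+.-]*:|//)', re.IGNORECASE)
# Leading "./", "../" or "/" segments, which are dropped (a simplification).
LEADING = re.compile(r'^(?:\.{1,2}/)+|^/+')


def rewrite_img_urls(html, prefix=UPLOAD_PREFIX):
    """Put `prefix` in front of local <img src="..."> URLs."""
    def fix(match):
        head, quote, url = match.group(1), match.group(2), match.group(3)
        # Leave external and already-rewritten URLs untouched.
        if ABSOLUTE.match(url) or url.startswith(prefix + "/"):
            return match.group(0)
        return f"{head}{quote}{prefix}/{LEADING.sub('', url)}{quote}"
    return IMG_SRC.sub(fix, html)


if __name__ == "__main__":
    print(rewrite_img_urls('<img src="images/logo.gif" alt="logo">'))
```

A regex is enough here because only one attribute of one tag is touched. For anything that has to understand nesting, a real HTML parser would be the safer choice on tag soup.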

Post by kermit »

there is a somewhat-related feature request submitted (to automatically create a site structure).

http://dev.cmsmadesimple.org/tracker/in ... 6&atid=104

i'm not familiar with the typo3 feature the above refers to, but i did come up with a couple of ideas...

unfortunately i am not a programmer, but perhaps this little brainstorm will give someone a place to start:

first idea:

a module to import a zip file with pre-arranged content.

filenames of zip contents would include the menu tree position and page alias. optional special tags at the beginning of each file could specify things like the page's template, title, menu text, etc. (or other page settings), and it would use pre-defined defaults if any are not present.

you'd have to strip down files to import to just the 'content' part of the pages (or create them new), (re)name them accordingly, and add extra tags as described if desired..

zip contents could look something like this:

1.home
1.1.about
1.2.contact
2.services
2.1.hosting
2.2.design
3.sales
3.1.desktops
3.2.laptops
3.3.servers
3.4.network
3.5.components
3.5.1.video
3.5.2.audio
3.5.3.storage
3.5.3.1.flash
3.5.3.2.harddrives
3.5.3.3.raid
3.5.3.4.nas
4.support
4.1.faq
5.downloads

etc.  (with or without an htm or html extension. having those on the files would make them easily browsable on a local system before creating the zip file for import).

along with an included images directory
/uploads/images/
containing any images linked in the pages to be imported.

i can also see where a 'replace all' (with suitable confirmation) or 'append to end of tree' (where imported pages are added after the last existing root menu item instead of as defined in the first integer of the filename) options would come in handy.

it could also be written to handle special tags that denote where content blocks start and end, so that it could support multiple blocks on a single page if they're present.
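
To make the filename scheme from this first idea concrete, here is a rough Python sketch of how an importer might turn names such as `3.5.1.video.html` into a menu position, a page alias and a parent position. `PageEntry`, `parse_entry` and `order_entries` are hypothetical names, and the rule that every child needs an existing parent is an assumption, not something the proposal spells out.

```python
from typing import NamedTuple, Optional


class PageEntry(NamedTuple):
    position: tuple            # menu tree position, e.g. (3, 5, 1)
    alias: str                 # page alias, e.g. "video"
    parent: Optional[tuple]    # position of the parent page, None for root pages


def parse_entry(filename):
    """Split '3.5.1.video.html' into position (3, 5, 1) and alias 'video'."""
    name = filename
    for ext in (".html", ".htm"):          # the extension is optional
        if name.lower().endswith(ext):
            name = name[: -len(ext)]
            break
    parts = name.split(".")
    numbers = []
    while parts and parts[0].isdigit():    # leading integers form the position
        numbers.append(int(parts.pop(0)))
    if not numbers or not parts:
        raise ValueError(f"expected '<position>.<alias>', got {filename!r}")
    position = tuple(numbers)
    return PageEntry(position, ".".join(parts), position[:-1] or None)


def order_entries(filenames):
    """Return entries in menu order, rejecting duplicates and orphans."""
    entries = sorted((parse_entry(f) for f in filenames), key=lambda e: e.position)
    seen = set()
    for entry in entries:
        if entry.position in seen:
            raise ValueError(f"duplicate menu position: {entry.position}")
        if entry.parent is not None and entry.parent not in seen:
            raise ValueError(f"no parent page for {entry.alias!r}: {entry.parent}")
        seen.add(entry.position)
    return entries
```

Sorting on the integer tuple rather than on the raw filename keeps `3.10.x` after `3.9.x`, which a plain string sort would get wrong.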


second idea:

module to import a single html file (between the body tags), where site structure is defined by heading tags, each is a page or child page. every <h1> is a root page; an <h2> underneath an <h1> and before the next <h1> is a child of that root page; an <h3> underneath an <h2> starts a child of the <h2>; while an <h3> or <h4> directly under an <h1> (and before the next <h2> or <h1>) would just be headings and content on that page.

to illustrate:

<body>

<h1>Home</h1>
	<p>This is the home page</p>

<h1>Another Root Page</h1>
	<!-- {page_alias="second"} -->
	<p>This is another page at the root level</p>

	<h2>Child page</h2>
		<p>This is a child of "Another Root Page"</p>

<h1>Third Root</h1>
	<!-- {page_alias="third"} -->
	<!-- {page_template="Alternate 2 column layout"} -->
	<p>This is the third root menu item.</p>

	<!-- {content1} -->
	<p>This is a second content block on the Third Root page</p>
	<!-- {/content1} -->

	<h3>This is just a page heading</h3>
	<p>Since there isn't an H2 preceding this, and after the last H1, it's just a heading and content on 
	the page "Third Root"</p>

	<h2>First child of "Third Root"</h2>
		<p>This is a child page of "Third Root"</p>
<h1>This is an external Link</h1>
	<!-- {external_link="http://www.cmsmadesimple.org"} -->

<h1>Fourth Root</h1>
	<!-- {page_template="Alternate 1 column layout"} -->
	<p>The next root page</p>

<h1>This loads some specified module</h1>
	<!-- {cms_module="modulename" parameter1="something" parameter2="something else"} -->

</body>

reasonable defaults would be used if special tags aren't present, such as auto-generating page aliases, the site default template, etc.  additional tags could be used for other specific page settings (such as menu name, keywords & other meta data, etc).

this second method would enable someone to create/edit an entire site's content from a single html file using their own favorite editor on their own system. again, a 'replace all' option (with appropriate warnings & confirmation) would be a useful feature.
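
A rough Python sketch of how this second idea could be parsed, using only the standard library's `html.parser`. It is deliberately simplified: only h1 (root page) and h2 (child page) start a new page, everything else is copied into the current page's content, and `{key="value"}` comments become page settings. `HeadingSplitter` and `split_pages` are hypothetical names, and the page dictionary layout is an assumption for illustration, not the CMSMS page object.

```python
from html.parser import HTMLParser
import re

PAIR = re.compile(r'(\w+)="([^"]*)"')   # key="value" pairs inside a {...} comment
PAGE_TAGS = {"h1": 1, "h2": 2}          # headings that start a new page
SKIPPED = {"html", "body"}              # wrapper tags that are not page content


class HeadingSplitter(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=False)   # keep &amp; etc. untouched
        self.pages = []          # flat list of page dicts, in document order
        self._in_title = False   # True while inside a page-starting heading
        self._root = None        # index of the most recent <h1> page

    def _emit(self, text):
        if self.pages:           # anything before the first heading is dropped
            self.pages[-1]["title" if self._in_title else "content"] += text

    def handle_starttag(self, tag, attrs):
        if tag in PAGE_TAGS:
            is_root = PAGE_TAGS[tag] == 1
            self.pages.append({"title": "", "content": "", "settings": {},
                               "parent": None if is_root else self._root})
            if is_root:
                self._root = len(self.pages) - 1
            self._in_title = True
        elif tag not in SKIPPED:
            self._emit(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if tag not in SKIPPED:
            self._emit(self.get_starttag_text())    # e.g. <br/>

    def handle_endtag(self, tag):
        if tag in PAGE_TAGS:
            self._in_title = False
        elif tag not in SKIPPED:
            self._emit(f"</{tag}>")

    def handle_data(self, data):
        self._emit(data)

    def handle_entityref(self, name):
        self._emit(f"&{name};")

    def handle_charref(self, name):
        self._emit(f"&#{name};")

    def handle_comment(self, data):
        text = data.strip()
        is_tag = text.startswith("{") and text.endswith("}")
        pairs = PAIR.findall(text) if is_tag else []
        if pairs and self.pages:
            self.pages[-1]["settings"].update(pairs)   # e.g. page_alias
        else:
            self._emit(f"<!--{data}-->")   # {content1} markers stay in the content


def split_pages(html):
    """Return one dict per page: title, content, settings, parent index."""
    parser = HeadingSplitter()
    parser.feed(html)
    parser.close()
    for page in parser.pages:
        page["title"] = page["title"].strip()
        page["content"] = page["content"].strip()
    return parser.pages
```

Each page's `parent` is the list index of its root page (or `None`), so an importer could walk the list in order and create parents before children. Supporting deeper levels would mean tracking a stack of headings instead of a single `_root`.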


either of these would satisfy the request for an easy way to create an entire site structure in one go, and also to facilitate the import of existing (or newly created) static content. they could also be used as a way to update an entire existing site at once, or to create editable (and re/importable) exports/backups of an entire site's pages.
eternity (n); 1. infinite time, 2. a seemingly long or endless time, 3. the length of time it takes a frozen pizza to cook when you're starving.
4,930,000,000 (n); 1. a very large number, 2. the approximate world population in 1986 when Microsoft Corp issued its IPO. 3. Microsoft's net profit (USD) for the quarter (3 months) ending 31 March 2007.
CMSMS migration and setup services | Hosting with CMSMS installed and ready to go | PM me for Info

Post by jelle »

Before you can import on a big scale, you'd have to work out what to do with the code. Replacing old hrefs with their new ones, naming the pages and figuring out an actual hierarchy are other issues I can think of. Especially the name of a page can be very ambiguous. The scheme that Kermit proposes assumes readable filenames, which may not be the case. A suitable title may also be found in the <title> of the page, in the biggest heading on the page, or in the link leading to the page.

Just to elaborate on Kermit's proposal:

1) A zip file is nice, but if wget is installed, only the old URL would be needed. In the end, the immediate goal is getting a temporary structure onto the filesystem somewhere.
2) Parse the tree to extract links, titles and filenames, and fill a table with them.
3) Present the table to the user so he can select which title is used and where non-HTML files should be placed.
4) Check the user-corrected table for inconsistencies and let the user resolve them.
5) Let the user decide where the content begins, and which tags should be removed, modified or left alone.
6) Start the actual import, but remember which pages were added so that this step can be undone if necessary.
7) Allow the user to look smug.
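
Step 2 of the list above could be sketched roughly like this in Python. It assumes the old site has already been mirrored to a local directory (for example with wget); `PageScanner`, `scan_page` and `scan_tree` are hypothetical names, and falling back to the filename when a page has no title is an assumption.

```python
from html.parser import HTMLParser
from pathlib import Path


class PageScanner(HTMLParser):
    """Collects the <title> text and all <a href="..."> targets of one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:                      # skips named anchors without href
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def scan_page(html):
    """Return (title, links) for a single HTML document."""
    scanner = PageScanner()
    scanner.feed(html)
    scanner.close()
    return scanner.title.strip(), scanner.links


def scan_tree(root):
    """Build one table row per .htm/.html file found under `root`."""
    rows = []
    for path in sorted(Path(root).rglob("*.htm*")):
        if not path.is_file():
            continue
        title, links = scan_page(path.read_text(encoding="utf-8", errors="replace"))
        rows.append({"file": path.relative_to(root).as_posix(),
                     "title": title or path.stem,   # fall back to the filename
                     "links": links})
    return rows
```

The resulting rows are plain dictionaries, so they could be shown to the user for correction (steps 3 and 4) or written to a database table before the actual import starts.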

I would like to help develop such a module, but I don't think doing so all by myself has much chance of completion...

Post by kermit »

my ideas don't make any assumptions on what's actually in the content.

because every case would be different, and quite different at that, AND validation/reformatting of imported content would infinitely complicate an import module, i'd leave it up to whoever's prepping content for import to ensure that its format is consistent with what the module expects. at least until such time that someone does have the time and energy to come up with a robust content conversion & importing routine.

inconsistencies from page to page are bound to be present as well, especially in static, non-template-driven sites. it would be hard to parse the code to separate the various content areas if they are not always wrapped in the same (and unique on the page) tags, text or comments.

even if you had to load each individual page in notepad++ or gedit, make a few quick edits and re-save, it's still faster than generating an entirely new site structure and copy/pasting content into the admin interface for every page. search/replace is tedious and boring, yes, but no more so than doing everything manually. shell scripts or editor macros could be used to expedite the reformatting of the data to import. a large, auto-imported site is still going to have to be manually checked and edited anyway to ensure everything is in its proper place and linked properly.

so, i think that the first step in creating an import module would be to start with importing a consistent and documented format, similar to one of the suggestions i offered. something that's both easy to create and contains enough special 'tags' to set various page settings automatically. it should be based on standard (x)html, using comments for tags and instructions for the import routines. once that's working, it can be expanded upon to perhaps handle alternate import formats or user-specified rules for what goes where and how.


for handling existing links that won't remain valid after import: mod_rewrite rules to translate old urls to new; or an expanded 'friendly' url feature in cmsms core so you can specify a page's exact url (there is a similar free addon for mambo that allows that. basically it auto-creates the mod_rewrite rules for you).
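
As an illustration of the mod_rewrite suggestion, the following Python sketch generates one redirect rule per old page from a mapping of old to new URLs. The function name `rewrite_rules`, the example URLs and the choice of permanent (301) redirects are assumptions. The rules are written for a per-directory `.htaccess` file, where Apache matches the pattern without a leading slash, and the sketch assumes the old paths contain no whitespace.

```python
import re


def rewrite_rules(url_map):
    """Build .htaccess lines that redirect each old path to its new URL."""
    lines = ["RewriteEngine On"]
    for old, new in sorted(url_map.items()):
        # Escape regex metacharacters such as "." so the rule matches literally.
        pattern = re.escape(old.lstrip("/"))
        lines.append(f"RewriteRule ^{pattern}$ {new} [R=301,L]")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical mapping; the real one would come out of the import step.
    print(rewrite_rules({
        "/about/us.htm": "/index.php?page=about",
        "/pubs/index.htm": "/index.php?page=publications",
    }))
```

With several thousand pages, one rule per page becomes a long file. If the old and new URLs follow a regular pattern, a handful of pattern-based rules would be easier to maintain.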


i'm more interested in creating content to import from scratch than importing a 5,000 page static site (i'm curious as to how well cmsms scales to sites that large though). having a site's content in separate and editable file(s) facilitates the content editing and approval process when you're dealing with someone else's site (aka. clients). the largest project i'm working on securing now is ~20 static pages and 3 addons (events calendar, business directory, 1-2 newsletters). not that tough to either copy/paste into an admin area and start from scratch, or to modify in order to match an import routine's required format. in this case, i'm more concerned about having suitable code to implement the site's extra features than getting the content itself in a back-end interface.

Post by jelle »

True, 5000 pages is quite a lot, and too much to do by copy/pasting. But that is where an import module would be really useful; in your 20+ pages case it would only be a convenience. If you can prepare each page by hand, there is little reason why you can't cut & paste each page into cmsms directly.
My 4000 pages case actually is all static .htm files, created over 6+ years by a diligent editor (they are a political think tank, so they produce lots of texts). So you can be sure that there are inconsistencies in the HTML. HTML Tidy may improve that a bit (I had lots of cases of improper nesting with anchors).

The mod_rewrite idea is a good one, but maybe there is a better way to implement it later on. Basically it does a regular expression replacement on the incoming URLs. What if you could apply a regular expression search & replace to a selection of pages? You'd have a lot of opportunities to wreck your site, but also to fix links, typos and other stuff in one go. 
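
A small Python sketch of what such a search & replace over a selection of pages might look like. The pages are modelled as a plain dictionary of alias to content, which is an assumption and not how CMSMS stores them. `bulk_replace` is a hypothetical name, and the dry-run default is there precisely because of the "opportunities to wreck your site".

```python
import re


def bulk_replace(pages, pattern, replacement, dry_run=True):
    """Apply a regex replacement to every page; return {alias: match count}.

    With dry_run=True (the default) nothing is modified, so the report can
    be reviewed before the change is applied for real.
    """
    regex = re.compile(pattern)
    report, changed = {}, {}
    for alias, content in pages.items():
        new_content, count = regex.subn(replacement, content)
        if count:
            report[alias] = count
            changed[alias] = new_content
    if not dry_run:
        pages.update(changed)
    return report


if __name__ == "__main__":
    site = {"home": '<a href="old/about.htm">About</a>', "about": "No links here"}
    # Preview first, then apply.
    print(bulk_replace(site, r'href="old/(\w+)\.htm"', r'href="/new/\1"'))
    bulk_replace(site, r'href="old/(\w+)\.htm"', r'href="/new/\1"', dry_run=False)
    print(site["home"])
```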

I totally agree with you on starting small with a limited feature set. But I don't think aiming only for trivial sizes is the right way. After all, when it works correctly, it should be no more effort to import a large site than a small one. 

Anyway, this might not be the best place to plan such a module; I'll go and find a more appropriate forum.