
Indexing PDFs

Posted: Thu Aug 23, 2007 5:07 am
by davidlanier
I would like to add PDF files to the list of things I can search for on my CMSMS site.  In other words, I'd like to be able to upload a bunch of PDFs, index them into the normal search index tables, and have them show up in search results, labeled as PDFs.

I figure the hardest part would simply be to get the PDFs indexed.
And the other hard part would be to keep the index current, to reflect new PDFs and deleted PDFs.
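
For the "keep the index current" part, one possible approach is to compare the PDFs currently on disk with what was indexed last time, and work out what to add, re-index, or remove. The sketch below is in Python purely for illustration; it is not CMSMS code. The function names, and the idea of storing a modification time for each indexed file, are assumptions.

```python
import os


def scan_pdfs(root):
    """Return {relative_path: modification_time} for every .pdf under root."""
    found = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if name.lower().endswith(".pdf"):
                full = os.path.join(dirpath, name)
                found[os.path.relpath(full, root)] = os.path.getmtime(full)
    return found


def plan_index_update(on_disk, indexed):
    """Compare files on disk with what the index already holds.

    Both arguments map path -> modification time (hypothetical storage
    format; the real search tables would need a column for this).
    Returns (to_add, to_reindex, to_remove) as sorted lists of paths.
    """
    to_add = sorted(p for p in on_disk if p not in indexed)
    to_remove = sorted(p for p in indexed if p not in on_disk)
    to_reindex = sorted(
        p for p in on_disk if p in indexed and on_disk[p] > indexed[p]
    )
    return to_add, to_reindex, to_remove
```

A periodic job would call `scan_pdfs()` on the upload directory, pass the result to `plan_index_update()` along with the stored state, and then extract text only for the files in the first two lists.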

Where can I look in the existing Search plugin for examples of how it does the indexing part?
Does a similar module already exist that I can look at as an example?
Should I look at creating a new module to implement this feature, or would you recommend a different method?
Are cron jobs implemented as a part of CMSMS anywhere? (for keeping the index up to date)

Re: Indexing PDFs

Posted: Thu Aug 23, 2007 6:03 am
by cyberman
There's no module for indexing PDFs or other documents. It would be nice if you could create one or enhance the Search module :).
davidlanier wrote: Are cron jobs implemented as a part of CMSMS anywhere? (for keeping the index up to date)
It's not a real cron job but there's something you can use for it. It's named EventManager (admin panel, Extensions > EventManager).

CMSMS has a set of registered events, which modules can extend. When an event fires (e.g. content is edited), you can run an action, which must be defined as a user-defined tag (or by a module). The current Search module has such an action.

Re: Indexing PDFs

Posted: Thu Aug 23, 2007 2:11 pm
by calguy1000
1) Search will not index PDFs; some code would have to be written for this.
2) The event manager is not a cron thing.
3) It is possible to put a cron job in that uses wget on a CMS URL.
    However, you either have to modify code to put that action on the frontend (not using any authentication)
    or figure out how to create a session with wget so that it can automatically log in to the admin section
    and trigger the action.

    I haven't looked into this yet.

Re: Indexing PDFs

Posted: Thu Aug 23, 2007 2:14 pm
by calguy1000
Actually, after reading the wget manual and FAQ, it should be possible to log in to CMS Made Simple with multiple wget commands and then issue admin commands.
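
The same two-step idea (log in once, keep the session cookie, then request the admin action) can be sketched in Python instead of wget. This is only an illustration: the `/admin/login.php` path, the `username` and `password` form field names, and the `action_path` argument are assumptions, not confirmed against the CMSMS source, and a real login form may require extra fields.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_session():
    """Build an opener that remembers cookies between requests."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar


def build_login_request(base_url, username, password):
    """Create the POST request for the (assumed) admin login form."""
    data = urllib.parse.urlencode(
        {"username": username, "password": password}  # hypothetical field names
    ).encode("ascii")
    login_url = base_url.rstrip("/") + "/admin/login.php"  # assumed path
    return urllib.request.Request(login_url, data=data)


def run_admin_action(opener, base_url, username, password, action_path):
    """Log in, then request an admin URL using the same cookie session."""
    opener.open(build_login_request(base_url, username, password)).close()
    with opener.open(base_url.rstrip("/") + action_path) as response:
        return response.status
```

From a cron job this would be called as something like `run_admin_action(opener, "http://example.com", user, password, "/admin/...")`, with the last argument being whatever admin URL triggers the re-index.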

Re: Indexing PDFs

Posted: Thu Aug 23, 2007 4:50 pm
by davidlanier
Great thoughts.  Thanks for the replies.  I don't know if I'll be able to dig that deep.  But in case I do, where is the code in the search module that actually DOES the indexing?  (so I can see a model to follow)

(This is probably more of a feature request.) If there is interest in enabling a cron mechanism, one way to implement it might be:
1- create a cron.php script
2- execute it either from the command line ("/usr/bin/php cron.php") or via URL, using appropriate wget coolness.
3- add some sort of api hooks, so that cron.php knows which things to execute.

This avoids having to add a separate cron entry for each task that needs to be done; CMSMS would handle that itself.
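
To make the "api hooks" idea a bit more concrete, here is a rough sketch of a single dispatcher that tasks register with, so the system crontab needs only one entry. It is written in Python for illustration only (the real thing would be PHP inside CMSMS), and `CronRegistry`, `register`, and `run_due` are made-up names, not an existing API.

```python
import time


class CronRegistry:
    """Holds periodic tasks and runs whichever ones are due."""

    def __init__(self):
        # name -> [interval_seconds, callback, last_run_time or None]
        self._tasks = {}

    def register(self, name, interval_seconds, callback):
        """Hook for modules: ask to be called every interval_seconds."""
        self._tasks[name] = [interval_seconds, callback, None]

    def run_due(self, now=None):
        """Run every task whose interval has elapsed; return their names."""
        now = time.time() if now is None else now
        ran = []
        for name, task in self._tasks.items():
            interval, callback, last_run = task
            if last_run is None or now - last_run >= interval:
                callback()
                task[2] = now  # remember when this task last ran
                ran.append(name)
        return ran
```

A PDF indexer would call `register("pdf_index", 3600, reindex_pdfs)` once, and the cron script would just call `run_due()` each time it is invoked. This version keeps the last-run times in memory, so a script that starts fresh on every invocation would need to store them somewhere (for example in the database).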

Re: Indexing PDFs

Posted: Mon Dec 07, 2009 8:06 pm
by jtcreate
Has anyone made any progress on this? I need to do the exact same thing. Thanks!

Jeff T.