Re: Proposed changes to pp2html.pm and XLtoHTML.pm

From: Nick <newsgroups(at)not-real.2thebatcave.com>
Date: Tue May 10 2005 - 22:00:11 GMT
Should the (.+) be a (.*)?  What if you had a file in the root dir like so:

/my document.doc

That would be rare but possible, and in this case I don't think it would
strip the /.
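
Here is a quick sketch of what I mean, comparing the two quantifiers on a
root-level file.  The helper names strip_plus and strip_star are made up for
this test and are not part of either module; the substitution is the one
quoted below, with only the quantifier changed in the second helper.

```perl
use strict;
use warnings;

# Hypothetical helper: the substitution from the quoted message, with (.+)
sub strip_plus {
    my ($html) = @_;
    $html =~ s,<title>(.+)[\\/]([^<]+)</title>,<title>$2</title>,i;
    return $html;
}

# Hypothetical helper: same substitution, but (.*) allows an empty prefix
sub strip_star {
    my ($html) = @_;
    $html =~ s,<title>(.*)[\\/]([^<]+)</title>,<title>$2</title>,i;
    return $html;
}

# Root-level file: (.+) needs at least one character before the separator,
# so nothing matches and the title is left alone.
print strip_plus('<title>/my document.doc</title>'), "\n";
# prints <title>/my document.doc</title>

# (.*) can match the empty string before the leading slash.
print strip_star('<title>/my document.doc</title>'), "\n";
# prints <title>my document.doc</title>

# Longer paths behave the same with either quantifier.
print strip_plus('<title>/tmp/foo1234</title>'), "\n";
# prints <title>foo1234</title>
```

So if my reading is right, switching to (.*) only changes the root-dir case
and leaves longer paths as they were.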

What do you think about the possibility of a '/' or a '\' in the name of
the file?  That seems highly unlikely (I don't even think Windows will let
you do it), but as far as I know it is possible under *nix.
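
Taking the '\' case as an example, this is roughly what I would expect the
substitution quoted below to do with such a name.  It is only a sketch: the
helper name title_basename and the sample paths are invented for the test.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the substitution from the quoted message
sub title_basename {
    my ($html) = @_;
    $html =~ s,<title>(.+)[\\/]([^<]+)</title>,<title>$2</title>,i;
    return $html;
}

# A *nix file literally named 'odd\name.doc' inside /tmp.
# ('\\' in a single-quoted Perl string is one backslash.)
# The greedy (.+) runs up to the LAST separator, which here is the
# backslash inside the file name, so part of the name is lost.
print title_basename('<title>/tmp/odd\\name.doc</title>'), "\n";
# prints <title>name.doc</title>

# A Windows-style path, where the backslash really is the separator.
print title_basename('<title>C:\\docs\\report.xls</title>'), "\n";
# prints <title>report.xls</title>
```

Since the filter only hands us a string, I don't see how the regex could tell
those two cases apart, so truncating an odd name like that may just be the
price of handling both separators.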

>
>
> Nick scribbled on 5/10/05 4:43 PM:
>> I was thinking that, but I didn't know how to do it right.  I'm not that
>> familiar with the perl regex, what is this doing to split it?  My concern
>> was that the filename might contain a '/' or a '\' char in it and I didn't
>> know how to reliably split it.
>
> I see in a test that my original doesn't work for long path names.
>
> I don't have a Windows box to test it on, so I don't know which path separator
> the filter uses under Windows: '\' or '/'.
>
> But this should catch either, I would think:
>
>    $content =~ s,<title>(.+)[\\/]([^<]+)</title>,<title>$2</title>,i;
>
> that says, in English:
>
> match '<title>' followed by one or more characters, till you find a / or a \
> (escaping the \ since it is a special char), followed by one or more 'not <'
> characters, followed by '</title>'
>
> the .+ is greedy, so it should swallow every leading directory and leave
> [\\/] matching the last separator in the path name.
>
> try it out and see if it works for you. If it does, I'll make the change and
> check it in.
>
>
>
>
>>
>>>how about retaining at least the file name without the leading path?
>>>
>>>     my $content = $self->run_ppthtml( $doc->fetch_filename ) || return;
>>>+   $content =~ s,<title>(.+?)/([^<]+)</title>,<title>$2</title>,i;
>>>
>>>
>>>
>>>
>>>Nick scribbled on 5/10/05 9:12 AM:
>>>
>>>>These two modules create titles inconsistent with the other ones.  This is
>>>>due to the filtering programs using the full path as the title.
>>>>
>>>>Obviously it would be best to have a "real" document title, but if we
>>>>can't have that I think that it would be better to use only the name of
>>>>the file itself, not the full path.  This way it would be consistent
>>>>between all the modules.
>>>>
>>>>I see this comment in pp2html.pm so I don't think I'm too off base here:
>>>>
>>>>Currently produces document titles like /tmp/foo1234.  Need to alter
>>>>to pass actual document title.
>>>>
>>>>
>>>>Below are diffs for both modules.  I realize that this isn't best (it
>>>>would be nice to have a "real" title), but I think it is better than it
>>>>was before.
>>>>
>>>>
>>>>--- XLtoHTML.pm 2004-10-02 18:09:14.000000000 -0500
>>>>+++ XLtoHTML.pm.patched 2005-05-10 09:08:18.000000000 -0500
>>>>@@ -37,6 +37,9 @@
>>>>     # update the document's content type
>>>>     $doc->set_content_type( 'text/html' );
>>>>
>>>>+    # remove the full path in the title
>>>>+    $content_ref =~ s/<title>.*<\/title>/<title><\/title>/i;
>>>>+
>>>>     # If filtered must return either a reference to the doc or a pathname.
>>>>     return \$content_ref;
>>>>
>>>>
>>>>--- pp2html.pm  2005-03-23 23:55:06.000000000 -0600
>>>>+++ pp2html.pm.patched  2005-05-10 09:08:11.000000000 -0500
>>>>@@ -15,6 +15,10 @@
>>>>    my $content = $self->run_ppthtml( $doc->fetch_filename ) || return;
>>>>    # update the document's content type
>>>>    $doc->set_content_type( 'text/html' );
>>>>+
>>>>+   # remove the full path in the title
>>>>+   $content =~ s/<title>.*<\/title>/<title><\/title>/i;
>>>>+
>>>>    return \$content;
>>>> }
>>>>
>>>
>>>--
>>>Peter Karman  .  http://peknet.com/  .  peter(at)not-real.peknet.com
>>>
>
> --
> Peter Karman  .  http://peknet.com/  .  peter(at)not-real.peknet.com
>
Received on Tue May 10 15:00:11 2005