Re: Proposed changes to pp2html.pm and XLtoHTML.pm

From: Peter Karman <peter(at)not-real.peknet.com>
Date: Tue May 10 2005 - 21:51:45 GMT
Nick scribbled on 5/10/05 4:43 PM:
> I was thinking that, but I didn't know how to do it right.  I'm not that
> familiar with the perl regex, what is this doing to split it?  My concern
> was that the filename might contain a '/' or a '\' char in it and I didn't
> know how to reliably split it.

I see in a test that my original doesn't work for long path names: the 
non-greedy (.+?) stops at the first '/' it can, so only the leading directory 
gets stripped.

I don't have a Windows box to test it on, so I don't know which path separator 
the filter uses under Windows: '\' or '/'.

But this should catch either, I would think:

   $content =~ s,<title>(.+)[\\/]([^<]+)</title>,<title>$2</title>,i;

That says, in English:

match '<title>' followed by one or more characters, up to a / or a \ 
(the \ is escaped since it is a special char), followed by one or more 'not <' 
characters, followed by '</title>'

The .+ is greedy, so it grabs as much as it can and then backs off to the last 
/ or \ in the path name, which leaves only the file name in $2.
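
A quick way to check this without a Windows box is a small script along the 
lines of the sketch below. The sub names and the sample paths are made up for 
the test; only the two substitutions come from this thread.

```perl
use strict;
use warnings;

# Hypothetical helper: the original substitution (non-greedy, '/' only).
sub strip_path_nongreedy {
    my ($content) = @_;
    $content =~ s,<title>(.+?)/([^<]+)</title>,<title>$2</title>,i;
    return $content;
}

# Hypothetical helper: the revised substitution (greedy, '/' or '\').
sub strip_path_greedy {
    my ($content) = @_;
    $content =~ s,<title>(.+)[\\/]([^<]+)</title>,<title>$2</title>,i;
    return $content;
}

# Sample paths are assumptions, not real filter output.
for my $path ( '/tmp/foo1234', '/var/tmp/docs/foo1234', 'C:\\temp\\docs\\foo1234' ) {
    my $html = "<title>$path</title>";
    print "path:       $path\n";
    print "non-greedy: ", strip_path_nongreedy($html), "\n";
    print "greedy:     ", strip_path_greedy($html),    "\n\n";
}
```

With the short path both versions leave just foo1234. With the longer Unix 
path the non-greedy one leaves tmp/docs/foo1234, and with the backslash path 
it does not match at all; the greedy one leaves foo1234 in all three cases.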

try it out and see if it works for you. If it does, I'll make the change and 
check it in.




> 
>>how about retaining at least the file name without the leading path?
>>
>>     my $content = $self->run_ppthtml( $doc->fetch_filename ) || return;
>>+   $content =~ s,<title>(.+?)/([^<]+)</title>,<title>$2</title>,i;
>>
>>
>>
>>
>>Nick scribbled on 5/10/05 9:12 AM:
>>
>>>These two modules create titles inconsistent with the other ones.  This is
>>>due to the filtering programs using the full path as the title.
>>>
>>>Obviously it would be best to have a "real" document title, but if we
>>>can't have that I think that it would be better to use only the name of
>>>the file itself, not the full path.  This way it would be consistent
>>>between all the modules.
>>>
>>>I see this comment in pp2html.pm so I don't think I'm too off base here:
>>>
>>>Currently produces document titles like /tmp/foo1234.  Need to alter
>>>to pass actual document title.
>>>
>>>
>>>Below are diffs for both modules.  I realize that this isn't best (it
>>>would be nice to have a "real" title), but I think it is better than it
>>>was before.
>>>
>>>
>>>--- XLtoHTML.pm 2004-10-02 18:09:14.000000000 -0500
>>>+++ XLtoHTML.pm.patched 2005-05-10 09:08:18.000000000 -0500
>>>@@ -37,6 +37,9 @@
>>>     # update the document's content type
>>>     $doc->set_content_type( 'text/html' );
>>>
>>>+    # remove the full path in the title
>>>+    $content_ref =~ s/<title>.*<\/title>/<title><\/title>/i;
>>>+
>>>     # If filtered must return either a reference to the doc or a pathname.
>>>     return \$content_ref;
>>>
>>>
>>>--- pp2html.pm  2005-03-23 23:55:06.000000000 -0600
>>>+++ pp2html.pm.patched  2005-05-10 09:08:11.000000000 -0500
>>>@@ -15,6 +15,10 @@
>>>    my $content = $self->run_ppthtml( $doc->fetch_filename ) || return;
>>>    # update the document's content type
>>>    $doc->set_content_type( 'text/html' );
>>>+
>>>+   # remove the full path in the title
>>>+   $content =~ s/<title>.*<\/title>/<title><\/title>/i;
>>>+
>>>    return \$content;
>>> }
>>>
>>
>>--
>>Peter Karman  .  http://peknet.com/  .  peter(at)not-real.peknet.com
>>

-- 
Peter Karman  .  http://peknet.com/  .  peter(at)not-real.peknet.com
Received on Tue May 10 14:51:46 2005