Re: Proposed changes to pp2html.pm and XLtoHTML.pm

From: Peter Karman <peter(at)not-real.peknet.com>
Date: Tue May 10 2005 - 22:08:16 GMT
Nick scribbled on 5/10/05 4:58 PM:
> Should the (.+) be a (.*)?  What if you had a file in the root dir like so:
> 
> /my document.doc


that's true. then .* is better.
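
here's a quick sketch of the difference, using Nick's root-dir example. strip_with is a made-up helper just for this comparison; it isn't in either filter module.

```perl
use strict;
use warnings;

# Hypothetical helper: apply the title substitution with either .+ or .*
# in front of the path separator.
sub strip_with {
    my ( $quantifier, $content ) = @_;
    if ( $quantifier eq '+' ) {
        $content =~ s,<title>(.+)[\\/]([^<]+)</title>,<title>$2</title>,i;
    }
    else {
        $content =~ s,<title>(.*)[\\/]([^<]+)</title>,<title>$2</title>,i;
    }
    return $content;
}

my $root = '<title>/my document.doc</title>';

# .+ needs at least one character before the slash, so nothing matches
print strip_with( '+', $root ), "\n";    # <title>/my document.doc</title>

# .* accepts an empty prefix, so the leading slash is stripped
print strip_with( '*', $root ), "\n";    # <title>my document.doc</title>
```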


> What do you think about the possibility of a '/' or a '\' in the name of
> the file?  That seems highly unlikely (I don't even think Windows will let
> you do it), but as far as I know it is possible under *nix.
> 

what do I think of the possibility? I think it gives me cold chills to imagine 
people putting slashes of either persuasion in their file names. but people are 
people.

karpet@cartermac 175% touch \foo
karpet@cartermac 176% ls -l *foo
-rw-r--r--  1 karpet  admin  0 10 May 16:59 foo
karpet@cartermac 177% touch \\foo
karpet@cartermac 178% ls -l *foo
-rw-r--r--  1 karpet  admin  0 10 May 17:00 \foo
-rw-r--r--  1 karpet  admin  0 10 May 16:59 foo
karpet@cartermac 179%


yep, possible under OS X, anyway.

What the regexp would do is match up until the last slash (forward or back). So if your file name was:

/path/to/\foo.ppt

it would leave you with:

foo.ppt

or:
/path/to/somename\foo.ppt => foo.ppt

so it wouldn't break, per se, since the title is only for display. the full path 
name is kept in a different property.
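
a minimal sketch of those cases, assuming the .* version of the pattern. title_basename is a hypothetical wrapper for illustration, and the sample paths are made up.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution, with .* in place of .+
sub title_basename {
    my ($content) = @_;
    # greedy .* runs to the last / or \ before the closing tag
    $content =~ s,<title>(.*)[\\/]([^<]+)</title>,<title>$2</title>,i;
    return $content;
}

# single quotes keep the backslashes literal
my @paths = (
    '/path/to/foo.ppt',
    '/path/to/\foo.ppt',
    '/path/to/somename\foo.ppt',
);

for my $path (@paths) {
    # each line prints <title>foo.ppt</title>
    print title_basename("<title>$path</title>"), "\n";
}
```

so a backslash in the file name costs you part of the displayed title, but nothing else.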




> 
>>
>>Nick scribbled on 5/10/05 4:43 PM:
>>
>>>I was thinking that, but I didn't know how to do it right.  I'm not that
>>>familiar with the perl regex, what is this doing to split it?  My concern
>>>was that the filename might contain a '/' or a '\' char in it and I didn't
>>>know how to reliably split it.
>>
>>I see in a test that my original doesn't work for long path names.
>>
>>I don't have a Windows box to test it on, so I don't know which path separator
>>the filter uses under Windows: '\' or '/'.
>>
>>But this should catch either, I would think:
>>
>>   $content =~ s,<title>(.+)[\\/]([^<]+)</title>,<title>$2</title>,i;
>>
>>that says, in english:
>>
>>match '<title>' followed by one or more characters, till you find a / or a \
>>(escaping the \ since it is a special char), followed by one or more 'not <'
>>characters, followed by '</title>'
>>
>>the .+ is greedy, so it grabs as much as it can and the [\\/] ends up
>>matching the last separator in the path name.
>>
>>try it out and see if it works for you. If it does, I'll make the change and
>>check it in.
>>
>>
>>
>>
>>
>>>>how about retaining at least the file name without the leading path?
>>>>
>>>>    my $content = $self->run_ppthtml( $doc->fetch_filename ) || return;
>>>>+   $content =~ s,<title>(.+?)/([^<]+)</title>,<title>$2</title>,i;
>>>>
>>>>
>>>>
>>>>
>>>>Nick scribbled on 5/10/05 9:12 AM:
>>>>
>>>>
>>>>>These two modules create titles inconsistent with the other ones.  This is
>>>>>due to the filtering programs using the full path as the title.
>>>>>
>>>>>Obviously it would be best to have a "real" document title, but if we
>>>>>can't have that I think that it would be better to use only the name of
>>>>>the file itself, not the full path.  This way it would be consistent
>>>>>between all the modules.
>>>>>
>>>>>I see this comment in pp2html.pm so I don't think I'm too off base here:
>>>>>
>>>>>Currently produces document titles like /tmp/foo1234.  Need to alter
>>>>>to pass actual document title.
>>>>>
>>>>>
>>>>>Below are diffs for both modules.  I realize that this isn't best (it
>>>>>would be nice to have a "real" title), but I think it is better than it
>>>>>was before.
>>>>>
>>>>>
>>>>>--- XLtoHTML.pm 2004-10-02 18:09:14.000000000 -0500
>>>>>+++ XLtoHTML.pm.patched 2005-05-10 09:08:18.000000000 -0500
>>>>>@@ -37,6 +37,9 @@
>>>>>    # update the document's content type
>>>>>    $doc->set_content_type( 'text/html' );
>>>>>
>>>>>+    # remove the full path in the title
>>>>>+    $content_ref =~ s/<title>.*<\/title>/<title><\/title>/i;
>>>>>+
>>>>>    # If filtered must return either a reference to the doc or a pathname.
>>>>>    return \$content_ref;
>>>>>
>>>>>
>>>>>--- pp2html.pm  2005-03-23 23:55:06.000000000 -0600
>>>>>+++ pp2html.pm.patched  2005-05-10 09:08:11.000000000 -0500
>>>>>@@ -15,6 +15,10 @@
>>>>>   my $content = $self->run_ppthtml( $doc->fetch_filename ) || return;
>>>>>   # update the document's content type
>>>>>   $doc->set_content_type( 'text/html' );
>>>>>+
>>>>>+   # remove the full path in the title
>>>>>+   $content =~ s/<title>.*<\/title>/<title><\/title>/i;
>>>>>+
>>>>>   return \$content;
>>>>>}
>>>>>
>>>>
>>>>--
>>>>Peter Karman  .  http://peknet.com/  .  peter(at)not-real.peknet.com
>>>>
>>
>>--
>>Peter Karman  .  http://peknet.com/  .  peter(at)not-real.peknet.com
>>

-- 
Peter Karman  .  http://peknet.com/  .  peter(at)not-real.peknet.com
Received on Tue May 10 15:08:17 2005