Re: Searching only a specific div class

From: Thomas Sewell <tsewell(at)not-real.chelseainteractive.com>
Date: Sat Mar 13 2004 - 15:32:35 GMT
I reproduced your example and got it working with my setup and the t.html below (the error went away once I made the case of the div tags consistent).
 
When I tried it with my actual HTML pages, I ran into some further errors. I've traced at least the first of them to the meta tag not being closed the way XML requires.
 
Example:
<html>
<head>
<title>
Test1
</title>
<meta name="keywords" content="John and Jane">
</head>
<body>
<div class="content">
<div class="product-details">
<div class="product-authors">
John Doe
</div>
</div>
<div class="product-details">
<div class="product-authors">
Jane Doe
</div>
</div>
</div>
</body>
</html>

If I add </meta> (or close the tag with />), then it works, which indicates that the parser doesn't accept the regular HTML form of an unclosed meta tag.
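
The same symptom can be reproduced outside the indexer. The sketch below is only an illustration: it uses the expat-based XML parser in Python's standard library, not swish-e or libxml2, and the `is_well_formed` helper is a made-up name. It shows that a strict XML parser rejects the unclosed meta tag and accepts the `/>` form.

```python
import xml.etree.ElementTree as ET

# The meta tag as written in ordinary HTML (never closed).
HTML_STYLE = (
    "<html><head><title>Test1</title>"
    '<meta name="keywords" content="John and Jane">'
    "</head><body></body></html>"
)

# The same document with the meta tag closed XML-style.
XHTML_STYLE = HTML_STYLE.replace(
    'content="John and Jane">', 'content="John and Jane" />'
)


def is_well_formed(markup: str) -> bool:
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        # An unclosed <meta> makes the following </head> a mismatched tag.
        return False


print(is_well_formed(HTML_STYLE))
print(is_well_formed(XHTML_STYLE))
```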
 
Am I stuck here with having to convert all pages over to strict XHTML in order to be able to use the XML2 parser and grab the class attribute, or is an external program the only way? Since I'm indexing a few million pages averaging 50 KB each, I'd like to avoid the extra overhead of running them all through an external program each time to reformat them for the indexer.
 
Of course, the more I mess with trying to make one of the pages strict xhtml so that it will process, the better writing an external program sounds...
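
For reference, an external extractor of this kind does not have to be large. The following is a minimal sketch, assuming Python and its tolerant standard-library `html.parser` module; the names `ClassTextExtractor` and `extract_class_text` are hypothetical, and nothing here is part of swish-e. It collects the text inside every div carrying a given class, and it is not bothered by the unclosed meta tag or by mixed-case tag names.

```python
from html.parser import HTMLParser


class ClassTextExtractor(HTMLParser):
    """Collect the text inside every <div> that carries a given class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0      # div nesting depth inside a match; 0 means outside
        self.chunks = []    # text pieces of the match being collected
        self.results = []   # one cleaned-up string per matching div

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <DIV> and <div> look the same.
        if tag != "div":
            return
        if self.depth:
            self.depth += 1
        elif self.target_class in (dict(attrs).get("class") or "").split():
            self.depth = 1
            self.chunks = []

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                # Collapse runs of whitespace and newlines into single spaces.
                self.results.append(" ".join("".join(self.chunks).split()))

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_class_text(html, target_class):
    parser = ClassTextExtractor(target_class)
    parser.feed(html)
    parser.close()
    return parser.results


# Cut-down copy of the example page, with the meta tag left unclosed.
SAMPLE = """<html><head><title>Test1</title>
<meta name="keywords" content="John and Jane">
</head><body>
<div class="content">
<div class="product-details"><div class="product-authors">
John Doe
</div></div>
<DIV class="product-details"><DIV class="product-authors">
Jane Doe
</DIV></DIV>
</div>
</body></html>"""

print(extract_class_text(SAMPLE, "product-authors"))
```

Whether the per-page cost of something like this is acceptable across a few million pages would have to be measured; the sketch only shows that the extraction logic itself is short.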
 
Thanks,
 
Thomas

	-----Original Message----- 
	From: Bill Moseley [mailto:moseley@hank.org] 
	Sent: Fri 3/12/2004 4:18 PM 
	To: Multiple recipients of list 
	Cc: 
	Subject: [SWISH-E] Re: Searching only a specific div class
	
	

	[...]
	Yes, that's a feature of the XML parser.  They are the same parser,
	really, but there's just a check to see if parsing HTML and if so skip
	the part that deals with XML attributes.  Might be able to modify
	parser.c to make it work with HTML, too -- there's just a lot of
	attributes in normal html.
	
	I think libxml2 is more forgiving when parsing HTML, for one thing.  But
	I'm not really clear on the differences in the parsers internal to
	libxml2.
	
	Now the other problem is the UndefinedMetaTags ignore is a bit too
	aggressive.  It ignores everything until the closing tag -- even if you
	have a tag defined in between.  That behavior is questionable.
	
	My suggestion is to use a program to extract the data you want
	indexed.
	
	Anyway, here's your example:
	[...]
Received on Sat Mar 13 07:32:36 2004