Can't limit site index to specific folders

Hi

I'm having trouble limiting my crawl to specific directories.

 

Here's what I want to do:

  • Start at www.site.com/ (not a real url, obviously)
  • Only index content found within www.site.com/courses

Here's what happens:

  • The crawl only finds two URLs: www.site.com and www.site.com/courses

I know that the contents of the /courses directory are valid and crawlable, because:

  • When I set the crawl to include all content within www.site.com, all the content within the /courses sub-directory was included.

HELP!!!

 

thanks

mark

 

Hi Mark -

 

This seems pretty straightforward:

 

- Set your seed URL to be the top-level courses page (e.g. "http://www.site.com/courses/")

- Set your include pattern to be the same (e.g. "http://www.site.com/courses/")

- Ensure your exclude patterns aren't knocking any content out

 

The Admin UI has a built-in tool to assist with debugging, along with the collection's crawl log files: Administer > Collection Tools > Check URL

You'd have thought so, yes, and that's how we started.

 

However, when I run the 'Check URL' tool on that /courses directory, I get an 'all ok' message: 117 live URLs found.

 

But when I change the collection's settings to start at /courses and include only /courses, the update process fails with:

Swap Views: This data set has only 0.8% (1/123) as many docs as the last one.

...which I take to mean the indexer found only a single URL in that directory - and that's what happens every time we start with these settings.
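One common cause of this symptom is a trailing-slash mismatch between the seed and the include pattern: many crawlers apply include patterns as simple substring (or prefix) matches, so a seed written without the trailing slash can fail its own include check, leaving nothing to crawl onward from. This toy sketch illustrates the mechanism only - the function name and substring semantics are illustrative assumptions, not the crawler's actual matching code:

```python
def allowed(url, include_patterns):
    """Keep a URL only if some include pattern is a substring of it
    (a common, though not universal, include-pattern semantic)."""
    return any(pat in url for pat in include_patterns)

include = ["http://www.site.com/courses/"]

# The seed itself, written without a trailing slash, fails the check...
print(allowed("http://www.site.com/courses", include))             # False
# ...while pages under the directory would pass it.
print(allowed("http://www.site.com/courses/unit1.html", include))  # True
```

If this is what's happening, making the seed and the include pattern byte-for-byte consistent (both with the trailing slash) is worth trying before digging into the logs.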

 

There's no change in the 'exclude' settings, by the way.

 

A bit flummoxed...

 

mark

Any more detail you've been able to extract from crawl.*.log or stored.log?