Hi
I'm having trouble limiting my crawl to specific directories
Here's what I want to do:
- Start at www.site.com/ (not a real URL, obviously)
- Only index content found within www.site.com/courses
Here's what happens:
- The crawl only finds two URLs - www.site.com and www.site.com/courses
I know that the contents of the /courses directory are valid and crawlable, because:
- When I set the crawl to include all content within www.site.com, all the content within the /courses sub-directory was included.
HELP!!!
thanks
mark
Hi Mark -
This seems pretty straightforward:
- Set your seed URL to be the top-level courses page (e.g. "http://www.site.com/courses/")
- Set your include pattern to be the same (e.g. "http://www.site.com/courses/")
- Ensure your exclude patterns aren't knocking any content out
The Admin UI has a built-in tool to assist with debugging (Administer > Collection Tools > Check URL), along with the collection's crawl log files.
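One gotcha worth checking with prefix-style include patterns is a trailing-slash mismatch: if the pattern ends in "/" but the seed (or the links on the page) omit it, nothing beyond the first page matches. This is a minimal sketch of that failure mode, assuming the crawler treats include patterns as simple URL prefixes (an illustration, not the crawler's actual implementation):

```python
# Sketch of prefix-style include matching (an assumption about how
# include patterns are applied, not the crawler's real logic).

def included(url: str, include_patterns: list[str]) -> bool:
    """Keep a URL if it starts with any include pattern."""
    return any(url.startswith(p) for p in include_patterns)

patterns = ["http://www.site.com/courses/"]

# Links discovered under /courses/ match the pattern:
print(included("http://www.site.com/courses/intro101.html", patterns))  # True

# But a seed or link written without the trailing slash does not,
# which can leave the crawl stuck after a single page:
print(included("http://www.site.com/courses", patterns))  # False
```

If that mismatch is the culprit, dropping the trailing slash from the include pattern (or adding it to the seed) usually resolves it.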
You'd have thought so, yes, and that's how we started.
However, when I run the 'check url' tool on that /courses directory, I get an 'all ok' message - 117 live URLs found.
But when I change the settings of the collection to start at /courses and include only /courses, the update process fails because:
Swap Views: This data set has only 0.8% (1/123) as many docs as the last one.
...which means, I guess, that the indexer has found only a single URL in that directory - and that's what happens whenever we start with these settings.
There's no change in the 'exclude' settings, by the way.
A bit flummoxed....
mark
Any more detail you've been able to extract from crawl.*.log or stored.log?
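A quick way to triage those logs is to count how often /courses URLs appear versus how often lines look like rejections. The file names below match the ones mentioned above, but the log line wording is an assumption - adjust the substring tests to whatever your crawler actually writes:

```python
# Rough crawl-log triage. "crawl.*.log" matches the files named in the
# thread; the "Rejected" wording is a guess at the log format.
import glob

def count_matches(pattern: str, needle: str) -> int:
    """Count log lines containing `needle` across all files matching `pattern`."""
    total = 0
    for path in glob.glob(pattern):
        with open(path, errors="replace") as f:
            total += sum(needle in line for line in f)
    return total

seen = count_matches("crawl.*.log", "/courses")
rejected = count_matches("crawl.*.log", "Rejected")  # wording is a guess
print(f"/courses mentions: {seen}, rejected lines: {rejected}")
```

If /courses URLs show up in the logs but are being rejected, the rejection reason on those lines should point at whichever pattern is knocking them out.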