Hi
I'm having trouble limiting my crawl to specific directories
Here's what I want to do:
- Start at www.site.com/ (not a real URL, obviously)
- Only index content found within www.site.com/courses
Here's what happens:
- The crawl only finds two URLs - www.site.com and www.site.com/courses
I know that the contents of the /courses directory are valid and crawlable, because:
- When I set the crawl to include all content within www.site.com, all the content within the /courses sub-directory was included.
HELP!!!
thanks
mark
Hi Mark -
This seems pretty straightforward:
- Set your seed URL to be the top-level courses page (e.g. "http://www.site.com/courses/")
- Set your include pattern to be the same (e.g. "http://www.site.com/courses/")
- Ensure your exclude patterns aren't knocking any content out
The Admin UI has a built-in tool to assist with debugging (Administer > Collection Tools > Check URL), along with the collection's crawl log files.
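One gotcha worth checking with prefix-style include patterns is a trailing-slash mismatch: if the pattern ends in "/" but the seed (or the links on the page) omit it, nothing beyond the first page matches. This is a minimal sketch of that failure mode, assuming the crawler treats include patterns as simple URL prefixes (an illustration, not the crawler's actual implementation):

```python
# Sketch of prefix-style include matching (an assumption about how
# include patterns are applied, not the crawler's real logic).

def included(url: str, include_patterns: list[str]) -> bool:
    """Keep a URL if it starts with any include pattern."""
    return any(url.startswith(p) for p in include_patterns)

patterns = ["http://www.site.com/courses/"]

# Links discovered under /courses/ match the pattern:
print(included("http://www.site.com/courses/intro101.html", patterns))  # True

# But a seed or link written without the trailing slash does not,
# which can leave the crawl stuck after a single page:
print(included("http://www.site.com/courses", patterns))  # False
```

If that mismatch is the culprit, dropping the trailing slash from the include pattern (or adding it to the seed) usually resolves it.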
You'd have thought so, yes, and that's how we started.
However, when I run the 'check url' tool on that /courses directory, I get an 'all ok' message - 117 live URLs found.
But when I change the settings of the collection to start at /courses and include only /courses, the update process fails because:
Swap Views: This data set has only 0.8% (1/123) as many docs as the last one.
...which means, I guess, that the indexer has found only a single URL in that directory - and that's what happens whenever we start with these settings.
There's no change in the 'exclude' settings, by the way.
A bit flummoxed....
mark
Any more detail you've been able to extract from crawl.*.log or stored.log?
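A quick way to triage those logs is to count how often /courses URLs appear versus how often lines look like rejections. The file names below match the ones mentioned above, but the log line wording is an assumption - adjust the substring tests to whatever your crawler actually writes:

```python
# Rough crawl-log triage. "crawl.*.log" matches the files named in the
# thread; the "Rejected" wording is a guess at the log format.
import glob

def count_matches(pattern: str, needle: str) -> int:
    """Count log lines containing `needle` across all files matching `pattern`."""
    total = 0
    for path in glob.glob(pattern):
        with open(path, errors="replace") as f:
            total += sum(needle in line for line in f)
    return total

seen = count_matches("crawl.*.log", "/courses")
rejected = count_matches("crawl.*.log", "Rejected")  # wording is a guess
print(f"/courses mentions: {seen}, rejected lines: {rejected}")
```

If /courses URLs show up in the logs but are being rejected, the rejection reason on those lines should point at whichever pattern is knocking them out.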