Rank: Newbie Groups: Member
Joined: 4/5/2007 Posts: 1 Points: 3 Location: US
|
I am running a trial version of the software and had to stop it because the page count was climbing into the hundreds of thousands, whereas we only have ~10,000 pages. Looking at the pages being returned, it was giving some very strange URLs. You would see a normal URL (actually a redirect that we use to handle some of our dynamic pages) followed, it seemed, by the paths of the rest of the pages on the site. For example, http://www.csrees.usda.gov/fo/plantsappliedgenomicscapnri.html is a redirect for us to a dynamic page, but among the pages the tool is bringing back are (notice the two ".html"):
http://www.csrees.usda.gov/fo/plantsappliedgenomicscapnri.html/about/about.html
http://www.csrees.usda.gov/fo/plantsappliedgenomicscapnri.html/about/background.html
http://www.csrees.usda.gov/fo/plantsappliedgenomicscapnri.html/about/leadership.html
etc., until it looks as if it is doing this for our whole site (and then again for the next redirect). Also, we have a redirect for one string that goes to a different site, and the tool then tries to spider that entire site too. Help?
|
Rank: Administration Groups: Administration
Joined: 1/31/2007 Posts: 409 Points: 541 Location: Chicago, IL
|
I will run through your site later today. FYI: if the spider finds a URL with a different domain name, it will not keep it or try to spider it. Also, are any of your pages automatically redirecting to other pages? If so, how is it coded? META tag? JavaScript? The spider will navigate to such a page and, since it's a real page that returns a 200 HTTP status, it will save it as a webpage. It doesn't matter whether it's redirecting or not...the program cannot determine that.
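To illustrate the "different domain name" rule mentioned above, here is a minimal sketch of how a crawler might decide whether to keep a discovered URL. It is not the spider's actual code: the function name `same_host` and the external address `http://www.example.org/page.html` are assumptions made up for this example.

```python
from urllib.parse import urlparse


def same_host(start_url: str, candidate_url: str) -> bool:
    """Return True when both URLs have the same host name.

    Hypothetical helper for illustration only; the real spider's
    domain check may work differently.
    """
    start_host = urlparse(start_url).hostname  # urlparse lowercases the host
    return start_host is not None and start_host == urlparse(candidate_url).hostname


START = "http://www.csrees.usda.gov/"

# A link on the same host would be kept and crawled.
print(same_host(START, "http://www.csrees.usda.gov/fo/plantsappliedgenomicscapnri.html"))

# A link to another host (hypothetical address) would be dropped.
print(same_host(START, "http://www.example.org/page.html"))
```

A check like this only sees the URL that was found in the HTML. A page that answers with a 200 status and then redirects through a META tag or JavaScript would still look like an ordinary page to it.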
|
Rank: Administration Groups: Administration
Joined: 1/31/2007 Posts: 409 Points: 541 Location: Chicago, IL
|
I ran the spider against your entire website and, just as I suspected, your HTML is the culprit. For example, go to: http://www.csrees.usda.gov/newsroom/news/2006news/water_quality.html
Then view the source code and search for "/plantsappliedgenomicscapnri.html/". You will find the following code: <a href="http://www.csrees.usda.gov/fo/plantsappliedgenomicscapnri.html/"><strong>CSREES National Research Initiative: Applied Plant Genomics Coordinated Agricultural Project</strong></a> That URL (note the trailing slash after ".html") is incorrect and needs to be fixed by you.
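For anyone wondering why one stray trailing slash can multiply the page count, the sketch below shows standard relative-URL resolution using Python's `urllib.parse.urljoin`. It assumes the page served at that address contains relative links such as `about/about.html`; that href is a hypothetical example, not taken from the site's actual source, and the spider's own resolution code may differ.

```python
from urllib.parse import urljoin

# The link as it appears in the HTML quoted above (with trailing slash)
BAD_BASE = "http://www.csrees.usda.gov/fo/plantsappliedgenomicscapnri.html/"
# The same link without the trailing slash
GOOD_BASE = "http://www.csrees.usda.gov/fo/plantsappliedgenomicscapnri.html"

# Hypothetical relative href assumed to be on the page behind the redirect
RELATIVE_LINK = "about/about.html"


def resolve(base: str, href: str) -> str:
    """Resolve a relative href against the URL of the page it was found on."""
    return urljoin(base, href)


# With the trailing slash, ".html/" is treated as a directory,
# so the relative link is appended underneath it.
print(resolve(BAD_BASE, RELATIVE_LINK))
# http://www.csrees.usda.gov/fo/plantsappliedgenomicscapnri.html/about/about.html

# Without the trailing slash, the last path segment is replaced instead.
print(resolve(GOOD_BASE, RELATIVE_LINK))
# http://www.csrees.usda.gov/fo/about/about.html
```

If the server answers every such path with a 200 status, each of those generated URLs is saved as a page and its own relative links are resolved the same way, which would explain the count growing well past the real size of the site.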
|
Rank: Administration Groups: Administration
Joined: 1/31/2007 Posts: 409 Points: 541 Location: Chicago, IL
|
Locking topic as issue has been explained.
|