Problems with spidering massive website
onlineshop
Posted: Monday, December 31, 2007 8:54:50 PM
Rank: Newbie
Groups: Member

Joined: 12/31/2007
Posts: 2
Points: 6
Hi, I purchased the iArchitect site map generator some months ago and have found it to be one of the best site map generators.

I have run into a problem, though: one of my websites has grown fairly massive, with hundreds of thousands of pages.

I am using a free affiliate store builder script called affilistore (http://www.affilistore.com), which works in conjunction with a MySQL database.
I use merchants' data feeds to create webpages and content. So far I am using about 30 data feeds with a total of at least 200,000 products, probably a lot more, each producing its own unique webpage.
However, when I went to spider the website with iArchitect, it got up to about 17,500 unique URLs quite quickly and then pretty much seemed to jam up.

Any suggestions on what's happening here? I have heard that websites that use scripts and a MySQL database to generate content can sometimes be a little complicated to make site maps for. Are there any queries, URL strings, etc. that I might need to filter so that the site map maker can spider more than 17,500 pages before bogging down?

The site I am having difficulty with is http://www.online-shop.com.au. The starting folder where the affilistore script is used is http://www.online-shop.com.au/catalogues/

Any help or suggestions would be much appreciated.

Our Websites: Online shopping * Online shop catalogues * Cell Phone * fishing tackle store
KeyLimeTie
Posted: Wednesday, January 02, 2008 12:13:30 PM
Rank: Administration
Groups: Administration

Joined: 1/31/2007
Posts: 380
Points: 539
Location: Chicago, IL
Hi onlineshop,

We have successfully spidered sites with over 4 million pages.
These sites have been built with PHP, ASP, .NET and Java, backed by MySQL, SQL Server and DB2 databases.
Sometimes the software can seem hung up for a minute (usually less) when it's spidering a very large page and has to check if all of the links have been added to the queue already.
Next time you run it, can you see if the processor/memory is going up and down? If so, the application is working...and not just looping.
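
To illustrate the "has this link been queued already?" check mentioned above, here is a minimal sketch in Python. It is not iArchitect's actual code; the function name `crawl_order` and the tiny in-memory `site` graph are assumptions made up for illustration. The point is that every link on every page has to be compared against everything seen so far, so pages with many links mean a lot of checks per page.

```python
from collections import deque

def crawl_order(link_graph, start):
    """Breadth-first traversal over an in-memory link graph.

    link_graph maps each URL to the list of links found on that page.
    (Hypothetical stand-in for fetching and parsing real pages.)
    """
    seen = {start}          # every URL ever queued, for fast duplicate checks
    queue = deque([start])  # URLs still waiting to be visited
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in link_graph.get(url, []):
            if link not in seen:   # the "already in the queue?" check
                seen.add(link)
                queue.append(link)
    return order

# Hypothetical three-page site where every page links to the other two.
site = {
    "/": ["/a", "/b"],
    "/a": ["/", "/b"],
    "/b": ["/", "/a"],
}
print(crawl_order(site, "/"))  # ['/', '/a', '/b']
```

With a set, each check is cheap, but the number of checks still grows with the number of links per page, which is one plausible reason a crawl appears to pause on very large pages.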

Thanks,
iArchitect
onlineshop
Posted: Wednesday, January 02, 2008 1:26:43 PM
Hi, thanks for the response. I have had the application open and the spidering process running for 24 hours now, and it does seem to be working, as we have more unique pages. However, it found the first 17,500 or so pages like lightning, within minutes, but then turned into a snail.

Within a 24 hour period the results went from about 17,500 pages to about 20,500 (so only an extra 3000 pages in 24 hours)
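
Put as rough arithmetic, and assuming the rate stays constant and the site has only the 200,000 pages mentioned as a lower bound in the first post, the figures above work out as follows (a back-of-the-envelope sketch, not a measurement):

```python
pages_before = 17_500    # unique pages found in the first few minutes
pages_after = 20_500     # unique pages found 24 hours later
hours_elapsed = 24
total_pages = 200_000    # assumed lower bound on the site's size

pages_per_hour = (pages_after - pages_before) / hours_elapsed
remaining_pages = total_pages - pages_after
days_left = remaining_pages / pages_per_hour / 24

print(f"{pages_per_hour:.0f} pages/hour, about {days_left:.0f} more days")
# 125 pages/hour, about 60 more days
```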

Am I right in saying that it is not supposed to be this slow? I need to work out what's causing the software to crawl at this pace, and to be honest I'm not sure what to look for or where to start.

I think some of the redirects may have something to do with the problem. The script uses affiliate data feeds: once a user clicks on a "visit store" button, it goes through a page with a URL similar to this: http://www.online-shop.com.au/catalogues/go.php?proddb=2&l=2056

I will try to block this from the crawl. What would be the correct way to enter the filter, and where?
1. /go.php?proddb=
2. /go.php?
3. /go.php
4. Something else?

KeyLimeTie
Posted: Thursday, January 03, 2008 1:58:04 PM
Sounds like you're on the right track.
I'm certain the software is working fast. The website must have a ton of links on each page and/or the HTTP redirects are causing issues.
Also, as each page is visited, it has to fully render the HTML before it can move onto the next page.
If you have external HTML/links that need to contact a 3rd party before fully rendering, you're at the mercy of their servers.
Hope this helps.
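
On the filter question, the thread does not say how the sitemap generator matches its filters, so the following is only a sketch under the assumption that a filter is a plain substring match against the URL's path and query string. The helper `is_excluded` and the `product.php?id=1` URL are hypothetical and exist only to compare the three candidate patterns.

```python
from urllib.parse import urlsplit

def is_excluded(url, patterns):
    """Return True if any pattern occurs as a substring of the URL's path plus query."""
    parts = urlsplit(url)
    target = parts.path + ("?" + parts.query if parts.query else "")
    return any(pattern in target for pattern in patterns)

# Redirect URL quoted in the thread.
redirect_url = "http://www.online-shop.com.au/catalogues/go.php?proddb=2&l=2056"
# Hypothetical product page that should stay in the crawl.
product_url = "http://www.online-shop.com.au/catalogues/product.php?id=1"

for pattern in ["/go.php?proddb=", "/go.php?", "/go.php"]:
    print(pattern, is_excluded(redirect_url, [pattern]), is_excluded(product_url, [pattern]))
```

Under that assumption all three candidates exclude the redirect URL and leave the product page alone. The shortest one, /go.php, is the least fragile, because it still matches if the query parameters appear in a different order.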