The PWP@att.net Search Spider
The search spider is a program that indexes pages at home.att.net for use in your site search feature. From what we understand, the search spider will start at whatever default page is seen at http://home.att.net/~emailid and will follow all HREFs off that page that are in the home.att.net domain. Then, it will follow all HREFs to pages on home.att.net on each of those pages and so on until it exhausts all of them.
The default page that is highest in the priority will be the one that is used as the starting point. If a page isn't linked from the starting point, it won't be indexed.
The current list of default pages (in order of priority) is:
- index.html
- index.htm
- home.html
- home.htm
- personal.html
- resume.html
- my_business.html
- my_assoc.html
- wsb.html
- store.html
For example, if you have both an index.html and a home.html page, index.html would be the spider's starting point, and home.html would not be spidered unless there is a link to it from a page that is spidered. Also, if you don't use one of the default pages shown above, your site won't be spidered unless the spider follows a link from another site.
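The priority rule can be sketched in a few lines of Python. This is only an illustration of the selection logic as described above, not the spider's actual code. The function name `pick_start_page`, the idea of passing in a list of file names, and the exact (case-sensitive) name matching are all assumptions.

```python
# Default pages in priority order, as listed above.
DEFAULT_PAGES = [
    "index.html", "index.htm", "home.html", "home.htm", "personal.html",
    "resume.html", "my_business.html", "my_assoc.html", "wsb.html", "store.html",
]

def pick_start_page(files_in_site):
    """Return the highest-priority default page present, or None (hypothetical helper)."""
    present = set(files_in_site)
    for name in DEFAULT_PAGES:
        if name in present:
            return name
    return None  # no default page, so the spider has no starting point

print(pick_start_page(["home.html", "index.html", "pets.html"]))  # index.html
print(pick_start_page(["jims-stuff.htm"]))                        # None
```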
Some additional info on the search spider:
- It obeys password protection. It appears as just another client to the server and can't authenticate.
- It will follow HREFs in HTML comments. Your browser won't see them but the spider will.
- It won't follow HREFs that are generated by JavaScript. It doesn't have a JavaScript interpreter. An example of this is a page linked by using the writeDoc() command.
If you have a subset of pages that aren't linked from the starting point page, consider linking to the main page of that subset using an HREF in a comment. If you have a lot of individual pages that you want spidered but that aren't linked from any page, make a page with all of those links on it (a site map) and link to THAT page in an HTML comment. You could do the same with pages that are otherwise linked only via JavaScript. We've provided a sample link within an HTML comment below:
<!-- This is a comment and won't be displayed.
<a href="http://www.wurd.com">WURD</a>
-->
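To see why a link inside a comment is still found, here is a rough Python sketch of a link extractor that scans the raw page text instead of rendering it. We don't know how the real spider parses pages; the regular expression, the function name `extract_hrefs`, and the file names `pets.html`, `sitemap.html`, and `hidden.html` are assumptions made for illustration.

```python
import re

# Assumed pattern: any quoted href="..." or href='...' in the raw text.
HREF_RE = re.compile(r"""href\s*=\s*["']([^"']+)["']""", re.IGNORECASE)

def extract_hrefs(raw_html):
    """Return every quoted href value in the raw page text, comments included."""
    return HREF_RE.findall(raw_html)

page = """
<a href="pets.html">Pets</a>
<!-- This is a comment and won't be displayed.
<a href="sitemap.html">Site map</a>
-->
<script>window.open("hidden" + ".html");</script>
"""

# The commented-out link is found; the URL assembled by the script is not,
# because nothing here executes JavaScript.
print(extract_hrefs(page))  # ['pets.html', 'sitemap.html']
```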
Here are some additional points about PWP Search:
- The list of pages to be searched is a list of the starting URLs, of the form: http://home.att.net/~emailid/
- For all those users who have search turned on, there will be a line in the start file, as described above. For those users who have never turned on search, or for those who have turned it off, there isn't a line in the file.
- Every day, the search engine checks all the sites in the start file starting at the URL in each line in the start file:
  a. It retrieves http://home.att.net/~emailid/
  b. It checks the timestamp on the file, and if it is newer than the timestamp in the database, re-indexes it. If it isn't in the database, it adds it.
  c. It pulls all the href links out of the file, and then retrieves each of the files that have URLs within home.att.net.
  d. It repeats, starting with step b, for each href pulled out of an indexable file.
- When a user enables site search on the publishing server, it sets a bit in their profile. Every morning, the users who have either added or removed search get their appropriate line updated in the start file.
- The search engine then begins to spider all the sites.
- The search engine never indexes off-site; i.e., it only follows links relative to home.att.net and *.home.att.net.
- Even if a site's preference is set to not be indexed, if someone else's site has a link to it, and their site is indexed, then your site ends up in the index starting from the page that was linked to.
- All of home.att.net ends up in one collection. When you use the SMS (Search My Site) syntax, the results are filtered to those in the user's home subdirectory structure. Thus, SMS is by subscription, not by account. To put it another way, you can't do SMS on aggregated sites (i.e., using two or more of your secondary e-mails to get more than 25MB of disk space), because the scope of a search is either all of PWP or a single subscription's (e-mail, user ID) space.
- It takes 24 to 48 hours (best case/worst case) for the results of the search to show up.
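The daily pass over one site, as described in the points above, can be sketched as follows. This is a simplified model under stated assumptions, not the search engine's real code: `crawl`, `on_site`, the `fetch` callback, and the in-memory `SITE` table are hypothetical, and we assume links are still followed from a page whose timestamp hasn't changed.

```python
from urllib.parse import urljoin, urlparse

def on_site(url):
    """True for home.att.net and *.home.att.net, the only hosts that are followed."""
    host = urlparse(url).hostname or ""
    return host == "home.att.net" or host.endswith(".home.att.net")

def crawl(start_url, fetch, index):
    """Walk one site from its start URL and return the URLs (re)indexed.

    fetch(url) returns (timestamp, [hrefs]) or None if the page is missing.
    index maps url -> timestamp of the copy already in the database.
    """
    updated, seen, queue = [], {start_url}, [start_url]
    while queue:
        url = queue.pop(0)
        page = fetch(url)
        if page is None:
            continue
        timestamp, hrefs = page
        # Index the page if it is new or newer than the stored copy.
        if url not in index or timestamp > index[url]:
            index[url] = timestamp
            updated.append(url)
        # Follow only links that stay within home.att.net.
        for href in hrefs:
            target = urljoin(url, href)
            if on_site(target) and target not in seen:
                seen.add(target)
                queue.append(target)
    return updated

# Stand-in for the web server: url -> (timestamp, links on that page).
SITE = {
    "http://home.att.net/~emailid/": (100, ["pets.html", "http://www.example.com/"]),
    "http://home.att.net/~emailid/pets.html": (90, ["./"]),
    "http://home.att.net/~emailid/orphan.html": (80, []),
}
index = {"http://home.att.net/~emailid/pets.html": 90}  # already indexed, unchanged
print(crawl("http://home.att.net/~emailid/", SITE.get, index))
```

In this run only the start page is newly indexed: pets.html is unchanged, the off-site link is skipped, and orphan.html is never fetched because no page links to it.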
If you fear that your site isn't being indexed:
- Make sure that you have search enabled in your profile: http://publish.att.net/cgi-bin/profile
- Make sure that the first indexable page in your site (i.e., the page that is retrieved from the URL http://home.att.net/~emailid/) contains links to the other pages in your site.
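If you keep a list of your own pages and the links on each one, a short script can tell you which pages the spider could never reach from your first indexable page. The sketch below is a hypothetical helper for site owners, not a tool we provide; the function name `unreachable_pages` and the sample file names are assumptions.

```python
def unreachable_pages(start, links, all_pages):
    """Return the pages in all_pages that can't be reached from start by following links."""
    seen, stack = {start}, [start]
    while stack:
        for target in links.get(stack.pop(), []):
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return sorted(set(all_pages) - seen)

# page -> pages it links to (sample data for illustration only)
links = {
    "index.html": ["pets.html"],
    "pets.html": ["index.html"],
    "jims-stuff.htm": ["index.html"],  # links out, but nothing links to it
}
print(unreachable_pages("index.html", links, links.keys()))  # ['jims-stuff.htm']
```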
Some common misconceptions on site indexing:
- The spider retrieves the first indexable file, not whatever page a user gives out as their homepage URL. For example, if a user sends out the link http://home.att.net/~jims.stuff/jims-stuff.htm as their homepage in e-mail, but there isn't an indexable file at http://home.att.net/~jims.stuff/, or that file doesn't contain links to the rest of the site (especially a link to /~jims.stuff/jims-stuff.htm), the site won't be indexed as they might like.
- The spider doesn't crawl off-site. Ever. It doesn't index other hosting providers' content. If users have links going off-site, or use a domain redirector to stealthily display their PWP content, it won't work.
- The spider won't index files in the user's PWP space that aren't linked to from their first indexable page. It's an HTTP spider, not a disk spider.
- There is no automated process that updates the indexing preference in a user's PWP profile. If site search is turned on, then the user explicitly did it. If site search is turned off, then either they never turned it on, or they explicitly turned it off.
- It is theoretically possible for the user to cause the state of the search to be out of sync between the publishing server preferences and the search server start files, although it would be very difficult to do, in practice. If the user's site really, truly isn't being indexed, as verified by checking the points in 1 and 2 above, then the user should check the search preference in the PWP profile:
- If it isn't what they desire (the nasty double clicker or simply forgetful user), then they should change the preference to what they want.
- If the state is what they think it should be, then they should toggle it once, wait for the form to submit, count to ten, and toggle it back to what they desire.
Need Additional Help?
If you can't find the answers you need, please try:
- The help file for the application you are using.
- Our FAQs.
- The AT&T Worldnet Help Newsgroups.
