Parsing of web sites
- When SiteTruth looks for the address of a business on a web site, it processes no more than twenty pages. Pages longer than 1 megabyte are truncated. Text in images, JavaScript, or Flash is not recognized. At most 1000 links per page are examined.
- Not all links are followed. The text associated with the link must contain a word that indicates likelihood of finding a street address. The current word list includes "About", "Site Map", "Contact", "Location", "English" (for non-English sites with an English language page), "Order", "Return", "Checkout", "Cart", and similar words. Capitalization does not matter. We suggest a "Contact" or "Contact Us" page.
- Only on-site links (interpreted broadly; subdomains are properly understood) are followed.
- Street addresses outside the United States must end with a country name or country code to be recognized. Only street addresses in Roman character sets are currently recognized. Sites are checked for a link containing "English", and if there's an English-language page on the site, it will be examined, even if the rest of the site is in another language or character set.
- The SiteTruth parser attempts to parse incorrectly formatted web pages, but may not find addresses or links on them. We don't insist on perfect HTML or XHTML, but if the W3C validator rejects a page as completely unparseable, do not expect SiteTruth to parse it.
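The link-following rules above can be sketched in a few lines of Python. This is only an illustration, not SiteTruth's actual code: the function names are hypothetical, the keyword list contains only the words quoted above (the "similar words" are not published), and the "same site" test is a crude last-two-labels comparison that stands in for whatever SiteTruth really does with subdomains.

```python
from urllib.parse import urljoin, urlparse

# Words quoted in the text above. The real list also has "similar words"
# that are not published, so this list is incomplete by construction.
LINK_KEYWORDS = ("about", "site map", "contact", "location", "english",
                 "order", "return", "checkout", "cart")
MAX_LINKS_PER_PAGE = 1000  # per-page limit stated in the text


def base_domain(host):
    """Crude stand-in for 'same site': the last two labels of the host name.
    Assumption for illustration only; it is wrong for suffixes like .co.uk."""
    return ".".join(host.lower().split(".")[-2:])


def links_to_follow(page_url, links):
    """Pick the links worth following.

    links is an iterable of (href, link_text) pairs found on page_url.
    """
    site = base_domain(urlparse(page_url).hostname or "")
    selected = []
    for href, text in list(links)[:MAX_LINKS_PER_PAGE]:
        target = urljoin(page_url, href)
        parts = urlparse(target)
        if parts.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar
        on_site = base_domain(parts.hostname or "") == site
        # Case-insensitive substring match on the link text.
        wanted = any(word in text.lower() for word in LINK_KEYWORDS)
        if on_site and wanted:
            selected.append(target)
    return selected
```

With this sketch, a link labeled "Contact Us" on the same site or on a subdomain is selected, while a link labeled "News", or an "About" link pointing to another domain, is skipped.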
- What SiteTruth is looking for in a street address is a standard mailing label, like this:
<p>Example, Inc.<br>
1234 Example St.<br>
Example, IL 12345
</p>
or
<br><br>Example, Ltd.<br>
24 Grosvenor Square<br>
London, W1A 1AE<br>
United Kingdom
<br><br>
The street address can be inside a <table> or <div>, or decorated with <font> and <span> tags. It should be surrounded by some white space, using either <p>, <br>, or some other enclosing tag such as <li>, <td>, or <div>. This handles most real world cases.
- Street addresses within an <a> tag will not be recognized. Watch for unclosed <a> tags. Also, <li> tags outside a <ul> or <ol> cause problems. Street addresses with each line in a separate table cell will not be recognized, nor will addresses displayed only as images, even if the image has an "alt" attribute.
- Addresses where the lines are separate bullet list items (<li> tags) or table rows (<tr> tags) are not currently recognized.
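Two of the markup problems named above, unclosed <a> tags and <li> tags outside a list, are easy to check for before publishing a page. The following is a minimal sketch using Python's standard html.parser module. The class and function names are hypothetical, and the check says nothing about how SiteTruth's own parser behaves.

```python
from html.parser import HTMLParser


class AddressMarkupCheck(HTMLParser):
    """Collects two of the markup problems named above.
    Hypothetical helper, not part of SiteTruth."""

    def __init__(self):
        super().__init__()
        self.open_anchors = 0  # <a> tags opened and not yet closed
        self.list_depth = 0    # nesting depth of <ul>/<ol>
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.open_anchors += 1
        elif tag in ("ul", "ol"):
            self.list_depth += 1
        elif tag == "li" and self.list_depth == 0:
            self.problems.append("<li> outside <ul> or <ol>")

    def handle_endtag(self, tag):
        if tag == "a" and self.open_anchors > 0:
            self.open_anchors -= 1
        elif tag in ("ul", "ol") and self.list_depth > 0:
            self.list_depth -= 1


def check_markup(html):
    """Return a list of problem descriptions; an empty list means none found."""
    checker = AddressMarkupCheck()
    checker.feed(html)
    checker.close()
    if checker.open_anchors:
        checker.problems.append("unclosed <a> tag")
    return checker.problems
```

Running check_markup on the first mailing-label example above returns an empty list. A page with an <a> tag that is never closed, or a bare <li>, returns a description of the problem.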
WHOIS
"Phishing" sites
- We are currently checking sites against the PhishTank database.
SSL certificates
- SiteTruth looks for an SSL certificate associated with the domain. This check is performed even if the site has no visible secure pages.
- The certificate chain is verified; unverifiable certificates are ignored.
- SiteTruth accepts the same list of root certification authorities as does Firefox.
- Only certificates with business name and location information are considered in ranking. "Domain only" certificates have no value to SiteTruth.
- "High Assurance" certificates are recognized, but the designation gives a certificate no extra value.
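To see whether a certificate carries business name and location information or is "domain only", one can inspect its subject fields. The sketch below assumes a certificate dictionary in the format returned by Python's ssl.SSLSocket.getpeercert(), which is populated only when the certificate chain was validated. The function names and the choice of which subject fields count as "location" are assumptions made for illustration, and Python's default trust store is not necessarily the Firefox root list.

```python
def subject_fields(cert):
    """Flatten the 'subject' entry of a certificate dict, in the format
    returned by ssl.SSLSocket.getpeercert(), into a plain dict."""
    fields = {}
    for rdn in cert.get("subject", ()):
        for name, value in rdn:
            fields[name] = value
    return fields


def has_business_identity(cert):
    """True if the subject names an organization and some location.
    Which fields count is an assumption made for this sketch."""
    fields = subject_fields(cert)
    has_name = bool(fields.get("organizationName"))
    has_location = any(
        fields.get(key)
        for key in ("localityName", "stateOrProvinceName", "countryName")
    )
    return has_name and has_location
```

A certificate whose subject contains only a commonName (the domain) fails this test, which corresponds to the "domain only" case above.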
Seals of approval
- Seals of approval from the Online Better Business Bureau are checked and, if valid, result in a good rating for a site. The seal must have a direct link (not JavaScript) to the "bbbonline.org" site, and that link must yield information for the site being examined. False claims of BBBonline certification downgrade a site to a very low rating, and we may report these to the BBB.
- No other seals are currently accepted by SiteTruth. The ones from certificate authorities duplicate information in SSL certificates, so recognizing them is unnecessary.
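The "direct link" requirement for the seal can be illustrated with a short check on the link's href. This sketch covers only the first half of the rule, that the link is a plain http or https link to the bbbonline.org site rather than a JavaScript handler. It does not verify that the target page yields information for the site being examined. The function name is hypothetical, and accepting subdomains of bbbonline.org is an assumption.

```python
from urllib.parse import urlparse

SEAL_HOST = "bbbonline.org"  # host named in the text


def is_direct_seal_link(href):
    """True for a plain http(s) link to bbbonline.org or a subdomain of it.
    Accepting subdomains is an assumption made for this sketch."""
    parts = urlparse(href.strip())
    host = (parts.hostname or "").lower()
    return (parts.scheme in ("http", "https")
            and (host == SEAL_HOST or host.endswith("." + SEAL_HOST)))
```

An href beginning with "javascript:" fails the scheme test, and a look-alike host that merely contains "bbbonline.org" as a prefix fails the host test.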
Business information checking
- In the current test version, we're focusing on identifying the business. We will be checking out business information more thoroughly in later releases.
You can check your rating on these items with our web-based tools. You, the web site operator, are in control of all this information, so it's entirely up to you.