Here’s the situation: you want your Google Search Appliance (GSA) to crawl a website that requires or sets cookies. After you crawl the site with the GSA and go to the cached version of a page, you see an error message that says “Cookies must be enabled on your browser to view this page”.
How do you fix errors like this? Fortunately, there are a few solutions.
Solution #1: Set the application to not require a cookie when the user-agent is gsa-crawler.
This solution works if you have control over the web server/app server setup: when the User-Agent header matches gsa-crawler, serve the website without requiring or setting any cookies. If achieving this kind of setup would mean rewriting many lines of code or making extensive changes, it may not be practical for your environment. Fortunately, there are two other solutions that worked for our setup.
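If the application stack can run middleware, the idea above can be sketched at the WSGI layer. This is a minimal illustration, not GSA-specific code: the `strip_cookies_for_gsa` wrapper and the demo app are hypothetical names, and only the `gsa-crawler` user-agent string comes from the setup described here.

```python
# Sketch: suppress cookies for the GSA crawler at the WSGI layer.
# Assumption: matching "gsa-crawler" in the User-Agent is sufficient
# to identify the crawler in your environment.

def strip_cookies_for_gsa(app):
    """Wrap a WSGI app so responses to gsa-crawler carry no Set-Cookie header."""
    def middleware(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "")
        is_gsa = "gsa-crawler" in ua.lower()

        def filtered_start_response(status, headers, exc_info=None):
            if is_gsa:
                # Drop all Set-Cookie headers for the crawler only;
                # regular browsers still receive their cookies.
                headers = [(k, v) for k, v in headers
                           if k.lower() != "set-cookie"]
            return start_response(status, headers, exc_info)

        return app(environ, filtered_start_response)
    return middleware
```

A server that enforces cookies in application code (rather than at the framework layer) would instead branch on the same User-Agent check before issuing its cookie-required redirect.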
Solution #2: Use HTTP headers
This is the easiest solution for situations described in the problem statement.
Log in to the GSA and click on Crawl and Index -> HTTP Headers. In the text box under Additional HTTP headers for crawler, add the cookies required by the environment.
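The entry takes the form of a standard HTTP request header. For example (the cookie names and values below are hypothetical placeholders, not values from any real environment):

```
Cookie: JSESSIONID=0000abc123; portal_session=xyz789
```

The crawler then sends this header with every request, exactly as a browser that already holds those cookies would.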
However, this approach fails when the cookie you have set expires after its TTL: you have to log in to the GSA and update the cookie frequently. It also fails when you have a portal setup with multiple websites under the same sub-domain, since a cookie that is valid for one context/website might be invalid for the remaining websites under that domain. This could also cause the crawler to fail for other websites that use a similar cookie-based setup.
Solution #3: Use contexts and rules
For our environment, we set up contexts (one per website under the sub-domain) and created GSA rules to deal with those contexts.
For each context, apply the following steps:
- Step 1: Log in to the GSA and click on Crawl and Index -> Forms authentication. Enter the URL of the site that sets/requires a cookie and select the appropriate URL pattern. In our case, we set up one rule for the root context of every website under the same sub-domain, because of the way the environment is configured.
- Step 2: Under Crawl and Index -> Crawler access, enter the user ID and password of a user that has access to the site. Even if no user ID is required to access the site itself, if the portal has a login page, enter the user ID of a user that has access to the portal.
- Step 3: Go to the Serving -> Universal Login Auth Mechanism -> Cookie tab. Enter a mechanism name and the sample URL. Also, check the box next to “When Sample URL check fails, expect the sample page to redirect to a form, and login to that form”. Finally, set the Trust Duration to be far less than the TTL of the cookies in the environment you are trying to crawl.
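To pick a safe Trust Duration in Step 3, it helps to know the actual cookie TTL. A quick way to find it is to inspect the Set-Cookie header the site returns. The sketch below is an illustrative helper, not part of the GSA: the header value and the halve-the-TTL rule of thumb are assumptions for the example.

```python
# Sketch: read a Set-Cookie header value to find the cookie's TTL,
# so the GSA Trust Duration can be set well below it.
from http.cookies import SimpleCookie

def cookie_max_age(set_cookie_value):
    """Return the Max-Age (in seconds) of the first cookie that has one, or None."""
    jar = SimpleCookie()
    jar.load(set_cookie_value)
    for morsel in jar.values():
        if morsel["max-age"]:
            return int(morsel["max-age"])
    return None

def suggested_trust_duration(ttl_seconds):
    """Pick a Trust Duration comfortably below the cookie TTL (here, half)."""
    return ttl_seconds // 2
```

If the site uses an Expires attribute instead of Max-Age, the same idea applies: compute the remaining lifetime and keep the Trust Duration well under it, so the GSA re-authenticates before the cookie goes stale.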
This approach worked, and we were able to successfully crawl all the websites with the GSA crawler.
So in conclusion, take a detailed look at your environment and use one of these approaches to configure GSA. Good luck!
For further information
For expert help on your projects, contact us at https://www.xtivia.com/contact-us.