We saw an issue recently where the database connector for Google Search Appliance (GSA) wasn’t indexing all the records in the database table.
Using GSA database connector, we were trying to index a database view that had about 200,000 rows. Checking the status of the database crawler was always returning a number much lower (sometimes zero) than the 200,000 records we were expecting to see in the crawler/indexer status.
Before we look further into troubleshooting these issues, let us look at the options the Google Search Appliance provides to crawl a DB table/view.
There are two options to crawl a database using Google Search Appliance.
One – Configure database crawl on the appliance, From the appliance Crawl and Index – Databases menu.
Two – Configure crawl and index using a database connector, format the rows as XML metadata feeds and push the feeds to the Google Search Appliance.
We took the second approach because it allows us to customize the results and provides more options for crawling and indexing databases such as
1) Control over the jdbc interface/jar used
2) Ability to schedule recrawls
3) Ability to push in ACLs (aka secure results)
4) Ability to use the Connector Manager filter
So let us see how database crawler works:
Step 1 – The GSA DB crawler works by connecting to the database and fetching the records using a SQL query.
Step 2 – The query results are converted to metadata using a custom XSLT stylesheet.
Step 3 – The crawler feeds the data as XML to the Google Search Appliance using XML feeds.
Step 4 – The XML feeds are then stored into the index by the GSA.
So if this works as expected, there would be no problems. Based on the SQL query, we should have all the records in the collection. However, as it turns out this was not the case.
Cause of the issue
The issue was with the way rows are returned when the GSA sends a query to the database. Let us assume there are 2000 rows in the table that you would like to index using GSA database connector. You have set the connector to fetch 200 documents every minute. The database returns 200 documents and they are formatted by the connector and sent as an XML feed to the GSA and then indexed. The DB crawler finishes fetching 2000 rows in 10 runs and starts another traversal. When the GSA database connector starts another traversal, by design, it expects the exact 200 rows returned in the first run to be returned back for the query. However, queries executed on the database don’t always return the same results if there is no clause for ordering the result set. When this happens, the crawler assumes that the initial 200 rows, indexed in the first crawl, no longer exist and sends a delete record XML feed for the unmatched rows. This causes the index to never be up-to-date.
So how did we fix it?
The solution is really simple and comes in the form of an ‘order by’ clause in the SQL query.
If there is a Primary Key for your table, you can change your query to the following in the connector configuration:
Select * from schema.Table order by PRIMARYKEYCOLUMNNAME ASC;
Adding an order by clause to the query would ensure that the query always returns the results in the exact same order to the GSA thus ending the arbitrary deletes from the index.
Since we were querying a view, the view had no primary key column. We fixed the issue by adding an ‘order by’ clause to every column in the view.
Select * from SCHEMA.VIEW ORDER BY COLUMN1 ASC,COLUMN2 ASC,COLUMN3 ASC,COLUMN4 ASC, ...
and so on.
Google support has acknowledged the issue. Their support personnel have mentioned that this is a known bug and will be fixed in a future release.
For further information
For expert help on your projects, Contact Us Today!.