For science researchers seeking published information, it was a major step forward when BIOSIS, the gateway to international medical and biological research, moved to relational indexing for its Biological Abstracts. However, implementing the new indexing methodology was like stepping through a minefield.
I had three big challenges: processing the sheer volume of index data in a reasonable time, correctly sorting unusual characters and dealing with files larger than two gigabytes.
Biological Abstracts, the leading reference publication for life sciences material, is published by BIOSIS, a not-for-profit organization established in 1926. With the world's largest collection of abstracts and bibliographic references to biological and medical literature, BIOSIS processes approximately 550,000 items each year from primary research and review journals, books, monographs and conference proceedings. This has helped us amass over 13 million citations in our information systems. Obviously, indexes are of key importance in organizing and finding this information. Thus, it was with a dramatic increase in usability for our clientele in mind that we moved to relational indexing for our Biological Abstracts.
In the past, our indexes were organized according to key words in an article's title and initial text. Now we use a system of relational indexing that makes searching easier and retrieval of records more accurate. Relational indexing follows a logical, consistent set of rules, which categorize information using a hierarchical chain. If a researcher seeks information on a new medicine for arthritis, for example, but doesn't know the medicine's name, the relational index will lead that researcher to the right information. Under the old keyword method, the researcher might not find the medicine's name in the index.
When we implemented our new indexing system, BIOSIS also moved from a mainframe to a Windows platform. That meant changing the utilities and other tools that we had used on the mainframe. We needed a product that would sort and order data from the mainframe correctly, and I soon realized that our current product was inadequate for the task. When we learned that SyncSort, a high-performance sorting and data manipulation product, had become available for Windows NT, we became one of its first users. We had long used SyncSort in the mainframe environment, and when we tested SyncSort on Windows, we discovered we could reduce processing time dramatically. We substituted SyncSort for a weeklong multistep procedure that processed approximately three gigabytes of raw data. SyncSort completed the same job in a single step in just two hours and 12 minutes. (With more memory available on our server, we estimate that SyncSort would complete the job in less than two hours.) But beyond sorting speed, we needed a workable solution to the problem of defining collating sequences.
Specialized Collating Services
The profusion of chemical names in the BIOSIS database posed a challenge because chemical symbols are written in both upper and lower case letters and include many special characters. For example, the chemical names ABC, A,B,C and A(B)C should sort together, but unless an exception list is created for special characters, the records will end up being separated during a sort. This is not a major problem in a small index, but with 2,600,376 records published on 9,356 pages, correct and consistent sequencing became crucial since similar names could appear several pages apart. We were able to use SyncSort to define our own collating sequence without having to program it. This was all the more impressive because it is not a simple sort: it's a four-key sort with three of the fields having specialized collating sequences and lengths of up to 250 characters. Despite the complexity of the task, SyncSort gave us exactly what we needed. Using SyncSort saved programming hours, processing hours and overtime costs.
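To make the collating problem concrete, here is a minimal Python sketch of the idea. It is not SyncSort syntax and not our production configuration. The exception list (every character other than a letter or digit) and the function name `collation_key` are assumptions made for illustration only.

```python
import re

# Hypothetical exception list: any character that is not a letter or digit
# is ignored when names are compared.
IGNORED = re.compile(r"[^0-9A-Za-z]")

def collation_key(name):
    """Return a sort key: the name without ignored characters, case-folded,
    followed by the raw name so that ties are broken deterministically."""
    primary = IGNORED.sub("", name).casefold()
    return (primary, name)

names = ["A(B)D", "ABC", "abd", "A,B,C", "A(B)C"]

# A plain code-point sort separates the three spellings of "ABC".
plain = sorted(names)

# Sorting on the custom key keeps them adjacent.
collated = sorted(names, key=collation_key)

print(plain)
print(collated)
```

With a plain sort, A(B)D lands between A(B)C and A,B,C. With the custom key, the three ABC variants stay together, which is the behavior the index needs.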
Beating the Two-Gig Limit
Unfortunately, this now well-indexed file, at three gigabytes, was too big to process in many of our standard Windows applications. The applications would fail at the two-gigabyte mark. Although the problem wasn't with their product, the technical people at Syncsort Incorporated (the maker of SyncSort) helped us find a solution. We ended up using a special SyncSort feature that splits a single input file into multiple output files. Using SyncSort's copy function, we could generate the dual output without programming and without the overhead associated with sorting.
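The splitting step can be illustrated with a short Python sketch. This is not how SyncSort implements its copy function; it only shows the general technique of copying an already sorted file into several smaller files without re-sorting. The function name `split_by_size`, the numbered-suffix naming scheme and the assumption that records are newline-terminated are all hypothetical.

```python
def split_by_size(src_path, out_prefix, max_bytes):
    """Copy src_path into numbered part files of at most max_bytes each.

    Records are assumed to be newline-terminated, and a part only ever ends
    on a record boundary. A single record longer than max_bytes is written
    to a part of its own, which will then exceed the limit.
    """
    parts = []
    out = None
    written = 0
    with open(src_path, "rb") as src:
        for line in src:
            # Start a new part when the next record would not fit.
            if out is None or written + len(line) > max_bytes:
                if out is not None:
                    out.close()
                part_path = f"{out_prefix}.{len(parts) + 1:03d}"
                out = open(part_path, "wb")
                parts.append(part_path)
                written = 0
            out.write(line)
            written += len(line)
    if out is not None:
        out.close()
    return parts
```

Calling `split_by_size("index.dat", "index.part", 2 * 1024**3 - 1)` (file names hypothetical) would keep every part under the two-gigabyte mark while preserving record order, because the input is copied sequentially rather than sorted again.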
Not only did SyncSort help us with this particular challenge, but we have also expanded our usage to other areas. SyncSort is an integral part of the entire BIOSIS publishing product line and is also widely employed in the production of two other BIOSIS electronic product lines. Overall, we currently use SyncSort applications in 88 separate product-generation procedures, where they improve our efficiency, accuracy and cost-effectiveness.
SyncSort is a high-performance sort and data manipulation product that speeds extract, transform and load (ETL) applications by up to 90 percent and facilitates data mining and clickstream processing. Available for Windows, mainframe and UNIX, it sorts, merges, aggregates, converts data and resizes records to produce multiple load files, and its parallel processing enables greatly accelerated performance.