Jardic Pro Implementation Details
Dictionary File Structure
Dictionary File Logical Structure
A Jardic Pro dictionary file consists of a set of article records that can be referenced from different indexes. The size of an article record is unlimited. The format of an article record can vary depending on its origin (Jardic, EDICT, DSL, Eijiro and so on).
Indexes store lists of keys. From the user's point of view they are lists of words. Index entries point to records. Keys can be complex and can consist of subkeys, for example: <reading><word><priority>.
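A composite key of this kind can be pictured as a tuple whose fields are compared in order. The following is a minimal sketch, not the actual Jardic Pro key layout: the struct name IndexKey, the field types and the record field are assumptions made for illustration, and plain byte-wise string comparison stands in for the collation that the DMS uses.

```cpp
#include <string>
#include <tuple>

// Hypothetical composite key of the form <reading><word><priority>.
struct IndexKey {
    std::string reading;   // first subkey
    std::string word;      // second subkey
    int priority;          // third subkey
    long record;           // assumed pointer to the article record, not part of the ordering
};

// Keys are ordered subkey by subkey: reading first, then word, then priority.
inline bool operator<(const IndexKey& a, const IndexKey& b) {
    return std::tie(a.reading, a.word, a.priority) <
           std::tie(b.reading, b.word, b.priority);
}
```

With such an ordering, all entries that share a reading are adjacent in the index, and entries with the same reading and word are ordered by priority.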
Jardic Pro supports the following indexes:
Low-level Organization of Dictionary File
A Jardic Pro dictionary file consists of pages. The following types of pages are supported:
Data Management System
The Jardic Pro Data Management System (DMS) performs functions that are typical of index management systems.
The Jardic Pro DMS is implemented as a 2-tier system. The low-level tier provides access to pages by page number. The high-level tier serves external requests coming from the application program.
The low-level tier supports the following functions:
DMS External functions
Jardic Pro DMS external functions support dictionary data search and data update. DMS functions can be divided into the following groups:
Management: Connect, Disconnect, CreateCursor, DestroyCursor, UseRecord, UseIndex, Compress, SetProgressProc and some others.
Search in indexes: FindEQ, FindGE, FindLE, FindLT, FindGT, FindFirst, FindLast, FindNext, FindPrevious, FindByCounter, FindByPercentage. A search function returns a record pointer (article pointer) and the sequence number of the found key value in the index.
Reading of data: GetRecord.
Key insertion: InsertRecord, InsertKey, BeginBulkInsertKey, EndBulkInsertKey, StopBulkInsertKey.
Key comparison: CompareKeys, GetCollator, GetCollatorVersion, CreateCollator0, CreateCollator1. Key comparison functions are exported from the DMS to the application program because the application program must use the same string comparison functions as the DMS.
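The meaning of the search functions listed above (FindEQ, FindGE, FindLE, FindLT, FindGT) can be illustrated on a plain sorted array. This is only a sketch of the semantics under assumptions: the lowercase function names, the Keys type and the use of std::optional are invented here, a sorted std::vector stands in for a paged index, and the returned value corresponds only to the sequence number of the found key, not to the record pointer.

```cpp
#include <algorithm>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

using Keys = std::vector<std::string>;            // sorted ascending; stands in for an index
using Position = std::optional<std::size_t>;      // sequence number of the found key, if any

// First key that is greater than or equal to k.
Position find_ge(const Keys& keys, const std::string& k) {
    auto it = std::lower_bound(keys.begin(), keys.end(), k);
    if (it == keys.end()) return std::nullopt;
    return static_cast<std::size_t>(it - keys.begin());
}

// First key that is strictly greater than k.
Position find_gt(const Keys& keys, const std::string& k) {
    auto it = std::upper_bound(keys.begin(), keys.end(), k);
    if (it == keys.end()) return std::nullopt;
    return static_cast<std::size_t>(it - keys.begin());
}

// Last key that is less than or equal to k.
Position find_le(const Keys& keys, const std::string& k) {
    auto it = std::upper_bound(keys.begin(), keys.end(), k);
    if (it == keys.begin()) return std::nullopt;
    return static_cast<std::size_t>(it - keys.begin()) - 1;
}

// Last key that is strictly less than k.
Position find_lt(const Keys& keys, const std::string& k) {
    auto it = std::lower_bound(keys.begin(), keys.end(), k);
    if (it == keys.begin()) return std::nullopt;
    return static_cast<std::size_t>(it - keys.begin()) - 1;
}

// Key exactly equal to k.
Position find_eq(const Keys& keys, const std::string& k) {
    Position p = find_ge(keys, k);
    if (p && keys[*p] == k) return p;
    return std::nullopt;
}
```

A prefix search in a word list, for example, maps naturally to a "greater or equal" lookup followed by stepping forward through the index.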
Jardic Pro 4.3 does not support update functions. They are not currently implemented because there is no user demand and because updating a full-text search index is complex. A future version of Jardic Pro could implement this set of functions for some types of dictionaries.
Jardic Pro full-text search indexes are stored in dictionary files. Depending on dictionary size, such indexes can contain up to several million entries. To build large dictionary indexes Jardic Pro uses optimized bulk insert functions. These bulk insert functions do not sort the data beforehand (as is done, for example, in Microsoft SQL Server), because the time needed for preliminary sorting can be comparable to the time of the key insertion itself. Instead, Jardic Pro uses an optimization based on on-the-fly calculation of a key distribution histogram. The histogram intervals are chosen so that the volume of key values falling into one interval is comparable to the DMS memory cache size. Jardic Pro bulk insert functions insert about 0.7 million keys per minute on a 1.6 GHz CPU. This figure includes index filling, index page compression, and writing index pages to the HDD.
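The idea of grouping keys into cache-sized intervals can be sketched as follows. This is a simplified illustration, not the Jardic Pro algorithm: the histogram is computed in a separate pass over the first byte of each key rather than on the fly, the function names make_intervals and bulk_insert and the cache_keys parameter are assumptions, and the "index" is just a sorted vector.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <string>
#include <vector>

// First byte of a key, used as the histogram cell (empty keys go to cell 0).
static int first_byte(const std::string& k) {
    return k.empty() ? 0 : static_cast<unsigned char>(k[0]);
}

// Merge adjacent histogram cells into intervals that hold at most about
// cache_keys keys each. Returns the exclusive upper bound of every interval.
std::vector<int> make_intervals(const std::vector<std::string>& keys,
                                std::size_t cache_keys) {
    std::array<std::size_t, 256> hist{};
    for (const auto& k : keys) ++hist[first_byte(k)];

    std::vector<int> bounds;
    std::size_t in_interval = 0;
    for (int b = 0; b < 256; ++b) {
        if (in_interval > 0 && in_interval + hist[b] > cache_keys) {
            bounds.push_back(b);   // close the current interval before byte b
            in_interval = 0;
        }
        in_interval += hist[b];
    }
    bounds.push_back(256);
    return bounds;
}

// Insert keys interval by interval, so that each batch touches only a
// cache-sized part of the index. The repeated scan of all keys is for brevity.
std::vector<std::string> bulk_insert(const std::vector<std::string>& keys,
                                     std::size_t cache_keys) {
    std::vector<std::string> index;
    index.reserve(keys.size());
    int lo = 0;
    for (int hi : make_intervals(keys, cache_keys)) {
        std::vector<std::string> batch;
        for (const auto& k : keys) {
            int b = first_byte(k);
            if (b >= lo && b < hi) batch.push_back(k);
        }
        std::sort(batch.begin(), batch.end());   // small batch, ordered in memory
        index.insert(index.end(), batch.begin(), batch.end());
        lo = hi;
    }
    return index;
}
```

Because the intervals do not overlap and are processed in ascending order, every batch lands in its own region of the index, which is what keeps the working set close to the cache size.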
DMS functions that work with indexes make intensive use of key comparison and key sorting functions. Sorting of keys that contain text strings is based on an internal implementation of the Unicode Collation Algorithm (UCA) and the Default Unicode Collation Element Table (DUCET). This implementation supports correct "dictionary" sorting of words with lowercase and capital letters, diacritics, and internal punctuation (apostrophes, dashes and so on).
Comparison of text strings according to the UCA is implemented in the following steps:
3-level Collation Elements (CEs) are built using the Default Unicode Collation Element Table (DUCET).
For correct sorting of text strings with blanks, dashes, apostrophes and other punctuation characters, the program uses CEs with variable weights.
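The two steps above can be illustrated with a toy multi-level comparison. The sketch below is not the Jardic Pro collator and does not read DUCET: it uses a hand-made table for ASCII only (an assumption), treats every non-letter as a variable element, and handles variable elements in the "shifted" style of the UCA, where they are ignored on the first three levels and compared only on a fourth level. The names CE, collation_element, sort_key and collate_less are invented for the example.

```cpp
#include <cstdint>
#include <initializer_list>
#include <string>
#include <vector>

// One collation element with three weight levels.
struct CE {
    std::uint16_t primary;    // base letter
    std::uint16_t secondary;  // accents (constant in this toy table)
    std::uint16_t tertiary;   // case: 1 = lowercase, 2 = uppercase
    bool variable;            // blanks, dashes, apostrophes and similar
};

// Toy replacement for a DUCET lookup, ASCII only.
CE collation_element(char c) {
    if (c >= 'a' && c <= 'z') return {static_cast<std::uint16_t>(c - 'a' + 1), 1, 1, false};
    if (c >= 'A' && c <= 'Z') return {static_cast<std::uint16_t>(c - 'A' + 1), 1, 2, false};
    return {static_cast<std::uint16_t>(static_cast<unsigned char>(c)), 1, 1, true};
}

// Build a sort key: all primary weights, then all secondary weights, then all
// tertiary weights, then the fourth level used for variable elements.
std::vector<std::uint16_t> sort_key(const std::string& s) {
    std::vector<std::uint16_t> l1, l2, l3, l4;
    for (char c : s) {
        CE ce = collation_element(c);
        if (ce.variable) {            // ignored on levels 1-3, kept on level 4
            l4.push_back(ce.primary);
            continue;
        }
        l1.push_back(ce.primary);
        l2.push_back(ce.secondary);
        l3.push_back(ce.tertiary);
        l4.push_back(0xFFFF);
    }
    std::vector<std::uint16_t> key;
    for (const auto* level : {&l1, &l2, &l3, &l4}) {
        key.insert(key.end(), level->begin(), level->end());
        key.push_back(0);             // level separator, lower than any weight
    }
    return key;
}

// Strings are compared by comparing their sort keys element by element.
bool collate_less(const std::string& a, const std::string& b) {
    return sort_key(a) < sort_key(b);
}
```

With this scheme "cola" sorts before "co-op" (the dash does not take part in the primary comparison), and "apple" sorts before "Apple" and before "Zebra", whereas a byte-wise comparison would give the opposite result in all three cases.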
By "dictionary shell" we mean the program that interacts with the user. It is the dictionary shell that the user perceives as "Jardic Pro". The dictionary shell performs the following basic functions:
All of the following word lists displayed in Jardic Pro are virtual:
In one of its windows Jardic Pro displays the dictionary article for the current list item. Jardic Pro can work with imported dictionaries that contain articles in different formats. To display articles from dictionaries with different formats, the program uses separate formatting functions (e.g., for EDICT, DSL, Eijiro and so on).
The main function of any electronic dictionary is searching for entered words. Jardic Pro implements this function by searching in virtual lists that are common to all opened dictionaries.
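One way to search a list that is common to several dictionaries without materializing it is to query each dictionary's sorted index and take the smallest result. The sketch below only illustrates that idea; the WordList type and the function name find_ge_all are assumptions, and sorted vectors with byte-wise ordering stand in for the real indexes and collation.

```cpp
#include <algorithm>
#include <optional>
#include <string>
#include <vector>

using WordList = std::vector<std::string>;   // sorted; one per opened dictionary

// Smallest word >= query across all dictionaries, found without building
// a merged list in memory.
std::optional<std::string> find_ge_all(const std::vector<WordList>& dicts,
                                       const std::string& query) {
    std::optional<std::string> best;
    for (const auto& words : dicts) {
        auto it = std::lower_bound(words.begin(), words.end(), query);
        if (it != words.end() && (!best || *it < *best)) best = *it;
    }
    return best;
}
```

The same per-dictionary lookup can be repeated to fetch the following items, so the visible part of the list is produced on demand.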
Jardic Pro can look up translations of words under the cursor on-the-fly in Microsoft Word, Internet Explorer, HTML Help and some other programs. Access to MS Word objects is implemented using COM Automation, specifically through the Running Object Table (ROT). Access to Internet Explorer and HTML objects is implemented using the IHTMLDocument2 interface obtained through Microsoft Active Accessibility (MSAA).
When translating words on-the-fly, Jardic Pro analyzes the type of text under the cursor. If the text contains kanji, the program tries to find translations for words that start at the kanji under the cursor and extend over some of the following characters. If the text under the cursor contains kana words, the program converts word endings to their "normal" form. The program then repeats these steps starting one character to the left, then one more character to the left, and so on. Search results (found word translations) are accumulated. Jardic Pro displays the longest word found, but it can also show all intermediate results, including translations of individual kanji characters.
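The leftward scan with accumulation of results can be sketched as follows. This is an illustration under assumptions, not the Jardic Pro code: the function names, the max_len and max_back limits and the use of std::set as the dictionary are invented here, candidates are required to cover the cursor position, and the conversion of kana endings to their "normal" form is left out.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Collect all dictionary words that cover the character under the cursor.
// The scan starts at the cursor, then moves the start one character to the
// left at a time; for each start it tries progressively longer candidates.
std::vector<std::u32string> matches_at_cursor(const std::u32string& text,
                                              std::size_t cursor,
                                              const std::set<std::u32string>& dict,
                                              std::size_t max_len = 8,
                                              std::size_t max_back = 4) {
    std::vector<std::u32string> found;
    if (cursor >= text.size()) return found;
    for (std::size_t back = 0; back <= max_back && back <= cursor; ++back) {
        std::size_t start = cursor - back;
        // len starts at back + 1 so that the candidate reaches the cursor.
        for (std::size_t len = back + 1;
             len <= max_len && start + len <= text.size(); ++len) {
            std::u32string candidate = text.substr(start, len);
            if (dict.count(candidate) != 0) found.push_back(candidate);
        }
    }
    return found;
}

// The word shown to the user: the longest of the accumulated matches.
std::u32string longest_match(const std::u32string& text, std::size_t cursor,
                             const std::set<std::u32string>& dict) {
    std::u32string best;
    for (const auto& w : matches_at_cursor(text, cursor, dict)) {
        if (w.size() > best.size()) best = w;
    }
    return best;
}
```

Keeping the full list of matches, and not only the longest one, is what allows the intermediate results to be shown as well.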
If the text under the cursor does not contain kanji, the program looks for word boundaries to the left and to the right of the cursor position. The search for word boundaries is implemented using the Unicode default boundary search algorithm. The program then searches for a translation of the word delimited by the boundaries found.
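The selection of a word around the cursor can be sketched as shown below. Note that this is a deliberately crude stand-in and not the Unicode default boundary algorithm: it works on ASCII only and treats letters, digits and the apostrophe as word characters. The names is_word_char and word_at are assumptions.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Simplified word-character test (ASCII letters, digits and apostrophe).
bool is_word_char(char c) {
    return std::isalnum(static_cast<unsigned char>(c)) != 0 || c == '\'';
}

// Return the word that contains the cursor position, or an empty string
// if the cursor is not on a word character.
std::string word_at(const std::string& text, std::size_t cursor) {
    if (cursor >= text.size() || !is_word_char(text[cursor])) return {};
    std::size_t begin = cursor;
    std::size_t end = cursor + 1;
    while (begin > 0 && is_word_char(text[begin - 1])) --begin;    // left boundary
    while (end < text.size() && is_word_char(text[end])) ++end;    // right boundary
    return text.substr(begin, end - begin);
}
```

The returned word is what would then be passed to the index search.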
Import of Dictionaries
When importing dictionaries Jardic Pro performs the following steps:
Jardic Pro builds dictionary indexes using the DMS bulk insert functions: BeginBulkInsertKey, InsertKey, EndBulkInsertKey. The program extracts words from articles for full-text search indexes using the Unicode default boundary search algorithm.
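The order of calls during index building can be sketched with a mock object. Only the three function names come from the description above; the MockDms class, all parameter lists, and the in-memory map used as the index are assumptions. Word extraction is again simplified to ASCII letters and digits and is not the Unicode boundary algorithm.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for the DMS. The signatures below are invented.
class MockDms {
public:
    void BeginBulkInsertKey() { bulk_ = true; }
    void InsertKey(const std::string& key, std::size_t record) {
        if (bulk_) index_[key].insert(record);   // keys are accepted only in bulk mode
    }
    void EndBulkInsertKey() { bulk_ = false; }
    const std::map<std::string, std::set<std::size_t>>& index() const { return index_; }

private:
    bool bulk_ = false;
    std::map<std::string, std::set<std::size_t>> index_;
};

// Extract lowercase words from every article and insert them as keys that
// point back to the article number.
void build_full_text_index(const std::vector<std::string>& articles, MockDms& dms) {
    dms.BeginBulkInsertKey();
    for (std::size_t rec = 0; rec < articles.size(); ++rec) {
        std::string word;
        for (char c : articles[rec] + ' ') {     // trailing blank flushes the last word
            unsigned char u = static_cast<unsigned char>(c);
            if (std::isalnum(u) != 0) {
                word += static_cast<char>(std::tolower(u));
            } else if (!word.empty()) {
                dms.InsertKey(word, rec);
                word.clear();
            }
        }
    }
    dms.EndBulkInsertKey();
}
```

After the call, each word maps to the set of articles that contain it, which is the basic shape of a full-text search index.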
Jardic Pro is written in C++. The source code consists of more than 160,000 lines.
© Vitaly Zagrebelny, 2007-2013