Some raw specifications


Short summary:


* 34 hours to crawl/index a corpus of 6 billion documents
* No limit in scaling content size and/or performance
* One (1) person to manage/operate a 3000-computer system
* EQR: Combines 6 algorithms/technologies when ranking results
* All search features of existing search engines and much more
* Full-indexing of 22 document types. More are easily added
* Link-indexing of 62 document types
* Distributable over an arbitrary number of computers
* Several patentable technologies
* Private siteSearch: encrypted documents can be stored & searched for
* 2 years of research and development
* State-of-the-art technologies
* Installs on Windows 2000/2003/XP Pro



Runtime system:


Windows 2000/2003/XP Pro. Soon on Linux & Unix.


Software:


MUNAX can be installed on a single computer or an arbitrary number of computers on a local area network, or over computers on the internet.

MUNAX runs as a single process / multithreaded program. MUNAX execution is logically distributed over subsystems, not machines. One machine can host several subsystems and/or one subsystem can span several machines. This way, MUNAX can be distributed freely and each machine's processing power can be used to its maximum. For instance, a very powerful machine can run three full subsystems while another machine runs only a dictionary server. All this is easily set up in the MUNAX parameter files.
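As a rough illustration of the placement idea described above, the sketch below models a layout as a plain Python mapping. The real MUNAX parameter file format is not shown on this page, so the machine names, subsystem names, component names and the `components_by_machine` helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical layout: (subsystem, component) -> machines hosting it.
# This is NOT the MUNAX parameter file syntax, only an illustration.
PLACEMENT = {
    ("sub1", "crawler"): ["big-box"],
    ("sub1", "indexer"): ["big-box"],
    ("sub2", "crawler"): ["big-box"],
    ("sub3", "crawler"): ["big-box"],
    ("sub1", "dictionary-server"): ["small-box"],
    # One component spanning two machines.
    ("sub2", "index-server"): ["big-box", "mid-box"],
}


def components_by_machine(placement):
    """Invert the layout: machine -> sorted list of (subsystem, component)."""
    result = defaultdict(list)
    for key, machines in placement.items():
        for machine in machines:
            result[machine].append(key)
    return {machine: sorted(keys) for machine, keys in result.items()}


for machine, keys in sorted(components_by_machine(PLACEMENT).items()):
    print(machine, keys)
```

Keeping the layout keyed by subsystem rather than by machine mirrors the statement that execution is distributed over subsystems, not machines; the per-machine view is derived from it.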

Besides storing data in files, MUNAX stores information in several smaller databases. These databases are provided by the service layers, and one specific type of database may combine one or several technologies, such as hash structures and different types of tree structures. MUNAX also introduces a new type of database, called GAX, where the knowledge of the existence of any type of object is represented by a single bit.
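The internals of GAX are not described here. As a minimal sketch of the general idea of recording existence with one bit per object, the following uses an ordinary fixed-size bitmap, assuming each object can be mapped to a non-negative integer ID. The `BitSet` class and its methods are illustrative names, not part of MUNAX.

```python
class BitSet:
    """Fixed-capacity set of non-negative integer IDs, one bit per possible ID."""

    def __init__(self, capacity):
        self.capacity = capacity
        # Round up to whole bytes; every bit starts at 0 ("does not exist").
        self._bits = bytearray((capacity + 7) // 8)

    def _check(self, obj_id):
        if not 0 <= obj_id < self.capacity:
            raise IndexError(f"id {obj_id} outside 0..{self.capacity - 1}")

    def add(self, obj_id):
        """Mark the object as existing by setting its bit."""
        self._check(obj_id)
        self._bits[obj_id >> 3] |= 1 << (obj_id & 7)

    def __contains__(self, obj_id):
        """Test the object's bit."""
        self._check(obj_id)
        return bool(self._bits[obj_id >> 3] & (1 << (obj_id & 7)))


seen = BitSet(1_000_000)
seen.add(42)
print(42 in seen, 43 in seen)  # True False
```

With this layout, one million possible IDs occupy 125,000 bytes regardless of how many are set.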

The entities that carry out the duties in MUNAX are called Worker and Server processes. A process is an instance (or a clone) of a MUNAX component. Examples of worker components are crawlers, indexers and sorters; examples of server components are the Index, Rank, Dictionary and doc repository components. All in all, MUNAX consists of 13 types of server components and 3 types of worker components.

The workers communicate with the servers over the MUNAX Mchannel interface, which is a session layer on top of TCP/IP. Using Mchannel is faster than using sockets directly and, just as important, helps to overcome system limitations (number of open sockets). The workers may use the Mchannel interface directly or the MUNAX Unified File System (ufile) to open any type of communications channel or file. ufile, in turn, uses the MUNAX Remote File system to open files on the LAN or anywhere on the internet. From a server's point of view, it is unknown where a file is located. Its location and definition are set in the MUNAX parameter files.

MUNAX has no limitations when it comes to scaling, neither concerning the speed nor the size of the database content. For instance, MUNAX does not have the famous "max 4 billion docs" bug.

The MUNAX software has several patentable technologies.


Reporting & ratings:


MUNAX has a very rich reporting system. Reports are levelled as TRACE, LOG, ERROR and PANIC. Great effort has been spent on ensuring that the MUNAX system reports any abnormality instead of, or before, letting a computer crash or hang.

When it comes to performance, we have measured ratings using old PCs (typically Pentium II, 300 MHz, 256 MB RAM) and tested the system both as a single-computer installation and distributed over several machines. As an example, on a single-computer system with an ADSL line, the crawlers reach the full bandwidth of that line, i.e. 2.5 Mbps, when running 120 crawlers simultaneously.

Indexing is about 60 times faster than crawling. Sorting is about 80 times faster than crawling.


Management/Control:


One or several MUNAX systems, each consisting of thousands of subsystems and machines, can be managed from one single console (one person!). From the MUNAX Management Console program, the operator issues commands targeted at the whole MUNAX system, a subsystem, a machine, a component or an individual process.

Besides the common commands for starting/stopping activities like crawling, indexing and/or sorting, great care has been taken to ensure that administrative duties like backup, restore and reset can be carried out with minimum effort. For instance, backing up a MUNAX database consisting of billions of indexed documents is done in minutes simply by issuing the 'backup' command.


Crawling:


The total bandwidth utilization is the sum of the bandwidths of all machines. For instance, for a MUNAX system set up on 3000 computers on the internet, each using a 2.5 Mbps ADSL line, the total bandwidth is 7.5 Gbps. For a corpus of 6 billion docs (average size 15 Kbytes), this means a (theoretical) total time of 34 hours to crawl & index the total corpus.
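The arithmetic behind the bandwidth example can be checked with a few lines of Python. This is a back-of-the-envelope sketch that assumes 1 Kbyte = 1000 bytes and counts only raw document transfer. Under those assumptions the transfer alone takes roughly 26.7 hours, so the 34-hour figure presumably also covers overhead that is not itemized here (for example protocol overhead, indexing and sorting).

```python
MACHINES = 3000
LINE_BPS = 2.5e6          # 2.5 Mbps per ADSL line
DOCS = 6e9                # 6 billion documents
AVG_DOC_BYTES = 15_000    # assumption: 15 Kbytes with 1 Kbyte = 1000 bytes


def raw_transfer_hours(machines, line_bps, docs, avg_doc_bytes):
    """Hours needed to move the corpus at the aggregate line rate."""
    total_bps = machines * line_bps
    total_bits = docs * avg_doc_bytes * 8
    return total_bits / total_bps / 3600


total_gbps = MACHINES * LINE_BPS / 1e9
hours = raw_transfer_hours(MACHINES, LINE_BPS, DOCS, AVG_DOC_BYTES)
print(f"{total_gbps:.1f} Gbps aggregate, {hours:.1f} hours raw transfer")
```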

To overcome the DNS bottleneck, MUNAX uses the MUNAX DNS database servers. There might be any number of these servers and the DNS knowledge is distributed over all the servers. These servers have more intelligence than just the DNS functionality. For instance, they keep track of the access frequency of crawlers accessing the web servers to avoid "attacking" a web server too often.
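How the MUNAX DNS database servers track access frequency is not specified. A minimal sketch of per-host rate limiting, assuming a fixed minimum interval between fetches from the same host, could look like the following. The `HostAccessTracker` class and the 5-second default are hypothetical placeholders.

```python
import time


class HostAccessTracker:
    """Remembers the last fetch time per host and enforces a minimum gap."""

    def __init__(self, min_interval_s=5.0, clock=time.monotonic):
        self.min_interval_s = min_interval_s
        self._clock = clock          # injectable so the logic can be tested
        self._last_access = {}       # host -> time of last granted fetch

    def try_acquire(self, host):
        """Return True and record the access if the host may be fetched now."""
        now = self._clock()
        last = self._last_access.get(host)
        if last is not None and now - last < self.min_interval_s:
            return False             # too soon, the crawler should back off
        self._last_access[host] = now
        return True


tracker = HostAccessTracker(min_interval_s=5.0)
print(tracker.try_acquire("example.com"))  # True
print(tracker.try_acquire("example.com"))  # False, called again immediately
```

A crawler would call `try_acquire` before each fetch and move on to another host when it returns False.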

MUNAX stores all documents in the repository, either in their original format or in a converted format, encrypted or non-encrypted. The compression ratio is about 5:1. Typically, 1 million documents, together with the indexing information, require about 3 GB of disk space.
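As a quick sanity check on the storage figure, the sketch below applies the 5:1 ratio to one million documents, assuming the 15 Kbyte average document size from the crawling example and 1 Kbyte = 1000 bytes. It covers the documents only; the share taken by the indexing information is not broken out in the text.

```python
DOCS = 1_000_000
AVG_DOC_BYTES = 15_000     # assumption: average size from the crawling example
COMPRESSION_RATIO = 5      # "about 5:1"

raw_gb = DOCS * AVG_DOC_BYTES / 1e9
compressed_gb = raw_gb / COMPRESSION_RATIO
print(f"raw: {raw_gb:.0f} GB, compressed: {compressed_gb:.0f} GB")
```

Under these assumptions the compressed documents alone come to about 3 GB, which is in line with the quoted order of magnitude.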


Indexing:


Currently, MUNAX full-indexes the following types of documents: htm html shtm shtml jhtm asp php php3 pdf ps doc xls ppt rtf wp wp5 wp6 wpd txt c cpp h. More document types for full-indexing are easily added by adding the name of the convert/fixup program for that doc type to the MUNAX parameter files.

Currently, MUNAX link-indexes the following types of documents: gif, jpg, tga, bmp, iff, img, jif, mac, msp, pcx, pic, tif, ico, jpe, mp3, wav, ram, snd, mp4, aif, mid, vqf, la1, lav, mp2, avi, mpg, mpeg, rm, qt, asx, mov, fli, flc, eps, wri, asc, fmk, for, zip, gzip, tar, arc, lzh, sit, rar, arj, dd, tgz, lha, exe, hqx, dll, vbs, vxd, bat, cmd, class, jar, java, jav and email addresses.

MUNAX knows what type of files these are and groups them accordingly.
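How MUNAX groups file types internally is not documented here. A simple sketch of grouping links by file extension, using a hypothetical category table that covers only a handful of the extensions listed above, might look like this:

```python
from collections import defaultdict
from posixpath import splitext
from urllib.parse import urlsplit

# Hypothetical category table for a subset of the link-indexed extensions.
CATEGORY_BY_EXT = {
    "gif": "image", "jpg": "image", "bmp": "image", "tif": "image",
    "mp3": "audio", "wav": "audio", "mid": "audio",
    "avi": "video", "mpg": "video", "mov": "video",
    "zip": "archive", "tar": "archive", "rar": "archive",
    "exe": "executable", "dll": "executable", "bat": "executable",
}


def group_links(urls):
    """Group URLs by the category of their path's file extension."""
    groups = defaultdict(list)
    for url in urls:
        # Only the path is inspected, so query strings do not confuse the lookup.
        ext = splitext(urlsplit(url).path)[1].lstrip(".").lower()
        groups[CATEGORY_BY_EXT.get(ext, "other")].append(url)
    return dict(groups)


links = [
    "http://example.com/a/pic.GIF?x=1",
    "http://example.com/song.mp3",
    "http://example.com/page.html",
]
print(group_links(links))
```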

Link-indexing in MUNAX is more complex than just indexing the anchor text and the URL. Amongst other things, MUNAX relates each link to the other links on the page and to the text of the page itself.

MUNAX allows information to be encrypted, i.e. crawlers fetch encrypted docs from web sites. The indexer decrypts each doc before indexing it and encrypts it again before storing it in the doc repository. This enables corporations to store their sensitive information in a safe and nonvolatile way and at the same time have it searchable with the highest precision.

MUNAX is language independent. For those docs where "words" are other than just bit streams, i.e. where words are built up of ASCII or ISO characters, 255 languages are recognized.

MUNAX can index fragments to produce one (or several) "frame documents" from an HTML page. This makes it ideal for setting up a News Search Engine. Such a search engine may co-exist in the same database as the "normal" search engine.


Sorting:


Sorting involves the sorting of several parts of the MUNAX data, like: doc location, link analysis, dictionary and of course the sorting of the forward index to produce the inverted index.
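To illustrate the last step, here is a minimal in-memory sketch of turning a forward index (document to words) into an inverted index (word to documents) by sorting (word, doc) pairs. It shows the general technique only; the data layout and the `invert` function are assumptions, not the MUNAX implementation, which works on far larger, disk-based data.

```python
from itertools import groupby
from operator import itemgetter


def invert(forward_index):
    """forward_index: doc_id -> list of words. Returns word -> sorted doc_ids."""
    # Flatten to unique (word, doc_id) pairs, then sort by word and doc_id.
    postings = sorted(
        {(word, doc_id) for doc_id, words in forward_index.items() for word in words}
    )
    # After sorting, all pairs for one word are adjacent and can be grouped.
    return {
        word: [doc_id for _, doc_id in group]
        for word, group in groupby(postings, key=itemgetter(0))
    }


forward = {1: ["search", "engine"], 2: ["search", "index", "search"]}
print(invert(forward))
# {'engine': [1], 'index': [2], 'search': [1, 2]}
```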


Querying and Viewing results:


MUNAX supports the same query parameters as many other search engines, such as phrases, word subtraction, language, file format, date, top domain, site name, porn filtering etc. However, MUNAX introduces new ones that cannot be found elsewhere, such as Compose Search, Customer code, Access code, User Scalar, Rich content (multimedia) and Robots inclusion/exclusion (yes, at query time).

To give the search engine visitor an even greater search experience, MUNAX lets the visitor choose which ranking algorithms to combine, or simply sort the results by date.

Docs can be fetched directly from the MUNAX repository and viewed in the browser with the query words highlighted, either in the original format or with the HTML tags stripped away.
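Query-word highlighting on tag-stripped text can be sketched with a regular expression. This is an illustrative example only: the `highlight` function and the `<b>` markers are assumptions, and highlighting inside a document's original HTML would need a real HTML parser rather than this plain-text approach.

```python
import re


def highlight(text, query_words, open_tag="<b>", close_tag="</b>"):
    """Wrap whole-word, case-insensitive matches of the query words in tags."""
    words = [re.escape(w) for w in query_words if w]
    if not words:
        return text
    pattern = re.compile(r"\b(" + "|".join(words) + r")\b", re.IGNORECASE)
    # Keep the matched text as written in the document (original casing).
    return pattern.sub(lambda m: f"{open_tag}{m.group(0)}{close_tag}", text)


print(highlight("Search engines index the web", ["search", "web"]))
# <b>Search</b> engines index the <b>web</b>
```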


Ranking results, AIRelevance and Equalizer ranking (EQR):


Today, when making a query on the most popular public search engine on the internet, you may get results that are not related to what you are querying for. Or, even worse, when you click on a query result, the destination document does not even contain any of the words in your query.

It's clear that it is not enough to rely on one or two ranking algorithms to produce the best results. In MUNAX, the goal is to give the most informative documents the highest rank. To decide which documents are most informative, we invented the AIRelevance algorithm. It tries to "look" at a document very much the same way as a web surfer does and asks itself: "Is this a good and informative document in relation to the query?". However, realizing that relying on a single ranking algorithm is not enough, we combine AIRelevance with five (5) others to produce a final ranking that we call Equalizer Ranking (EQR).

EQR combines the following ranking algorithms: AIRelevance, docVote (link analysis), Proximity, VSM (Vector Space Model), mmRank (Multimedia ranking) and USR (User Scalar Ranking). Each algorithm can, of course, be parameterized to decide how much influence it should have on the final ranking.
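The actual formula EQR uses to combine the six algorithms is not given. One simple way to let each algorithm have a configurable influence is a weighted average, sketched below under the assumption that every algorithm yields a score between 0 and 1. The `combine_scores` function and the example weights are hypothetical.

```python
ALGORITHMS = ("AIRelevance", "docVote", "Proximity", "VSM", "mmRank", "USR")


def combine_scores(scores, weights):
    """Weighted average of per-algorithm scores, each assumed to lie in [0, 1].

    Algorithms missing from `weights` get weight 0 (switched off);
    algorithms missing from `scores` contribute a score of 0.
    """
    total_weight = sum(weights.get(name, 0.0) for name in ALGORITHMS)
    if total_weight <= 0:
        raise ValueError("at least one algorithm needs a positive weight")
    weighted = sum(
        weights.get(name, 0.0) * scores.get(name, 0.0) for name in ALGORITHMS
    )
    return weighted / total_weight


weights = {"AIRelevance": 2.0, "VSM": 1.0, "Proximity": 1.0}
doc_scores = {"AIRelevance": 1.0, "VSM": 0.5, "Proximity": 0.0}
print(combine_scores(doc_scores, weights))  # 0.625
```

Setting a weight to zero removes an algorithm from the mix, which also covers the case where the visitor chooses which algorithms to combine.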
 

The information above is published on the web for anyone to read and is thereby publicly known. Thus, neither the whole nor any part of it can be patented. The technologies are owned by Munax AB under the copyright laws of Sweden (SFS 1960:72).

© 2003 - 2005, Nexyne Systems
© 2005 - 2007, Munax LLC
© 2007 - , Munax AB