Short summary:
* 34 hours to crawl/index a corpus of 6 billion documents
* No limit in scaling content size and/or performance
* One (1) person to manage/operate a 3000-computer system
* EQR: Combines 6 algorithms/technologies when ranking results
* All search features of existing search engines and much more
* Full-indexing of 22 document types. More are easily added
* Link-indexing of 62 document types
* Distributable over an arbitrary number of computers
* Several patentable technologies
* Private siteSearch: encrypted documents can be stored & searched for
* 2 years of research and development
* State-of-the-art technologies
* Installs on Windows 2000/2003/XP Pro
Runtime system:
Windows 2000/2003/XP Pro. Soon on Linux & Unix.
Software:
MUNAX can be installed on a single computer or an arbitrary number of computers on a local
area network, or over computers on the internet.
MUNAX runs as a single process / multithreaded program. MUNAX execution is logically
distributed over subsystems, not machines. One machine can host several subsystems and/or
one subsystem can span over several machines. This way, MUNAX can be distributed freely
and each machine's execution power can be utilized to its maximum. For instance, one of
the very powerful machines can have three full subsystems running while another machine
may have only a dictionary server running. All this is easily set up in the MUNAX parameter
files.
Besides storing data in files, MUNAX stores information in several smaller databases.
These databases are provided by the service layers and one specific type of database may
combine one or several technologies like hash structures and different types of tree
structures. MUNAX also introduces a new type of database, called GAX, where the knowledge
of the existence of any type of object is represented by a single bit.
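The text gives no detail on GAX beyond the one-bit-per-object idea. As a rough illustration only, the sketch below keeps one bit per object in a fixed-size bit array and addresses it by a hash of the object's key. The class name `BitExistenceMap`, the hashing scheme and the sizing are assumptions made for this example and are not taken from MUNAX. Because keys are hashed, two different keys can share a bit, so a set bit here means "possibly seen".

```python
import hashlib


class BitExistenceMap:
    """Hypothetical one-bit-per-object existence map (not the real GAX)."""

    def __init__(self, num_bits: int) -> None:
        if num_bits <= 0:
            raise ValueError("num_bits must be positive")
        self.num_bits = num_bits
        self.bits = bytearray((num_bits + 7) // 8)  # 8 objects per byte

    def _position(self, key: str) -> int:
        # Assumed addressing scheme: hash the key, fold it into the bit range.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        pos = self._position(key)
        self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key: str) -> bool:
        pos = self._position(key)
        return bool(self.bits[pos // 8] & (1 << (pos % 8)))


seen = BitExistenceMap(num_bits=1 << 20)  # about one million bits, 128 KiB
seen.add("http://example.com/index.html")
```

A crawler could consult such a map with `url in seen` before fetching, at a memory cost of one bit per slot.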
The entities that carry out the duties in MUNAX are called Worker and Server processes.
A process is an instance (or a clone) of a MUNAX component. Examples of worker components
are crawlers, indexers and sorters, and examples of server components are Index, Rank,
Dictionary and doc repository components. All in all, MUNAX consists of 13 types of server
components and 3 types of worker components.
The workers communicate with the servers over the MUNAX Mchannel interface, which is a
session layer on top of TCP/IP. Using Mchannel is faster than using sockets directly and,
just as importantly, helps to overcome system limitations (number of open sockets). The workers
may use the Mchannel interface directly or the MUNAX Unified File System (ufile) to open
any type of communications channel or file. ufile, in turn, uses the MUNAX Remote File
system to open files on the LAN or anywhere on the internet. From a server's point of view,
it is unknown where a file is located. Its location and definition are set in the MUNAX
parameter files.
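The Mchannel wire format is not described here. One common way for a session layer to reduce the number of open sockets is to multiplex many logical channels over a single TCP connection, and the sketch below shows that general technique only. The frame layout (4-byte channel id, 4-byte payload length) and the function names are assumptions for illustration.

```python
import struct

# Hypothetical frame header: 4-byte channel id + 4-byte payload length,
# both unsigned and in network byte order.
HEADER = struct.Struct("!II")


def encode_frame(channel_id: int, payload: bytes) -> bytes:
    """Prefix a payload with its channel id and length."""
    return HEADER.pack(channel_id, len(payload)) + payload


def decode_frames(stream: bytes):
    """Yield (channel_id, payload) for each complete frame in the buffer."""
    offset = 0
    while offset + HEADER.size <= len(stream):
        channel_id, length = HEADER.unpack_from(stream, offset)
        start = offset + HEADER.size
        end = start + length
        if end > len(stream):
            break  # incomplete frame; a real reader would wait for more bytes
        yield channel_id, stream[start:end]
        offset = end


# Two logical channels sharing one byte stream (i.e. one socket).
wire = encode_frame(1, b"GET doc 42") + encode_frame(2, b"PUT word 'munax'")
frames = list(decode_frames(wire))
```

With framing of this kind, the number of logical channels a worker holds open is independent of the number of operating-system sockets.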
MUNAX has no limitations when it comes to scaling, neither concerning the speed nor the
size of the database content. For instance, MUNAX does not have the famous "max 4
billion docs" bug.
The MUNAX software has several patentable technologies.
Reporting & ratings:
MUNAX has a very rich reporting system. Reports are levelled as TRACE, LOG, ERROR and
PANIC. Great effort has been spent on ensuring that the MUNAX system reports any
abnormalities before, or instead of, letting a computer crash or hang.
When it comes to performance, we have measured ratings using old PCs (typically Pentium
II, 300 MHz, 256 MB RAM) and tested the system both as a single-computer installation and
when distributed over several machines. As an example, on a single-computer system with
an ADSL line, the crawlers reach the full bandwidth of that line, i.e. 2.5 Mbps, when
running 120 crawlers simultaneously.
Indexing is about 60 times faster than crawling. Sorting is about 80 times faster than
crawling.
Management/Control:
One or several MUNAX systems, each consisting of thousands of subsystems and machines,
can be managed from a single console (by one person!). From the MUNAX
Management Console program, the operator issues commands targeted either to the whole
MUNAX system, a subsystem, a machine, a component or to an individual process.
Besides the common commands for starting/stopping activities like crawling, indexing
and/or sorting, great care has been taken to ensure that administrative duties like
backup, restore and reset can be carried out with minimum effort. For instance, backing up a
MUNAX database consisting of billions of indexed documents is done in minutes simply by
issuing the 'backup' command.
Crawling:
The total bandwidth is the sum of the bandwidths of all machines. For instance, for a MUNAX
system set up on 3000 computers on the internet, each using a 2.5 Mbps ADSL line, the total
bandwidth is 7.5 Gbps. For a corpus of 6 billion docs (average size 15 Kbytes) this
means a (theoretical) total time of 34 hours to crawl & index the total corpus.
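The estimate can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in this document and assumes 1 Kbyte = 1000 bytes, and that "60 times faster" and "80 times faster" mean indexing and sorting take 1/60 and 1/80 of the crawl time. Under those assumptions the raw transfer takes about 26.7 hours and indexing plus sorting add roughly 0.8 hours. The difference up to the stated 34 hours is not itemized in the text and presumably covers overheads such as protocol traffic and imperfect line utilization.

```python
# Figures taken from the text.
machines = 3000
line_mbps = 2.5                  # ADSL bandwidth per machine, megabits/s
docs = 6_000_000_000
avg_doc_bytes = 15 * 1000        # assumption: 1 Kbyte = 1000 bytes

total_bps = machines * line_mbps * 1_000_000   # aggregate bits per second
corpus_bits = docs * avg_doc_bytes * 8
crawl_hours = corpus_bits / total_bps / 3600

# Ratios quoted under "Reporting & ratings" (interpretation is an assumption).
index_hours = crawl_hours / 60
sort_hours = crawl_hours / 80

print(f"aggregate bandwidth: {total_bps / 1e9:.1f} Gbps")
print(f"raw crawl time:      {crawl_hours:.1f} h")
print(f"indexing + sorting:  {index_hours + sort_hours:.1f} h")
```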
To overcome the DNS bottleneck, MUNAX uses the MUNAX DNS database servers. There may be
any number of these servers, and the DNS knowledge is distributed over all of them.
These servers do more than plain DNS lookups. For instance, they
keep track of how often crawlers access each web server, to avoid
"attacking" a web server too often.
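How the DNS servers track access frequency is not specified. A minimal sketch of the general idea, assuming a fixed minimum interval between two fetches from the same host, could look like this. The class name `AccessTracker`, the method `may_fetch` and the 5-second interval are hypothetical.

```python
import time
from typing import Dict, Optional


class AccessTracker:
    """Hypothetical per-host politeness check (names are illustrative)."""

    def __init__(self, min_interval_s: float) -> None:
        self.min_interval_s = min_interval_s
        self.last_access: Dict[str, float] = {}

    def may_fetch(self, host: str, now: Optional[float] = None) -> bool:
        """Return True and record the access if the host has rested enough."""
        now = time.monotonic() if now is None else now
        last = self.last_access.get(host)
        if last is not None and now - last < self.min_interval_s:
            return False  # too soon; the crawler should try another host
        self.last_access[host] = now
        return True


tracker = AccessTracker(min_interval_s=5.0)
first = tracker.may_fetch("www.example.com", now=100.0)     # allowed
too_soon = tracker.may_fetch("www.example.com", now=102.0)  # refused
later = tracker.may_fetch("www.example.com", now=105.0)     # allowed again
```

Keeping this state next to the DNS data means every crawler that resolves a host also learns whether that host may be contacted yet.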
MUNAX stores all documents in the repository, either in their original format or in a
converted format, and either encrypted or unencrypted. The compression ratio is about 5:1.
Typically, 1 million documents, together with the indexing information, require about 3 GB
of disk space.
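As a rough sizing aid, the storage figure can be scaled to the 6-billion-document corpus used in the crawl estimate. The sketch assumes decimal units (1 GB = 1,000,000 Kbytes, 1 TB = 1000 GB) and linear scaling, neither of which the text confirms. It also shows that a 15 Kbyte average document compressed 5:1 already accounts for about 3 GB per million documents, so the quoted figure should be read as approximate.

```python
gb_per_million_docs = 3.0       # from the text, includes indexing information
corpus_docs = 6_000_000_000     # corpus size used in the crawl estimate

# Assumption: disk usage scales linearly with the number of documents.
corpus_tb = corpus_docs / 1_000_000 * gb_per_million_docs / 1000

avg_doc_kb = 15                 # average document size from the crawl estimate
compression_ratio = 5           # "about 5:1"
compressed_kb_per_doc = avg_doc_kb / compression_ratio
compressed_gb_per_million = compressed_kb_per_doc * 1_000_000 / 1_000_000

print(f"estimated disk space for the corpus: {corpus_tb:.0f} TB")
print(f"compressed docs alone, per million:  {compressed_gb_per_million:.1f} GB")
```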
Indexing:
Currently, MUNAX full-indexes the following types of documents: htm html shtm shtml jhtm
asp php php3 pdf ps doc xls ppt rtf wp wp5 wp6 wpd txt c cpp h. More types of documents
for full-indexing are easily added by adding the name of the convert/fixup program for that
doc type to the MUNAX parameter files.
Currently, MUNAX link-indexes the following types of documents: gif, jpg, tga, bmp, iff,
img, jif, mac, msp, pcx, pic, tif, ico, jpe, mp3, wav, ram, snd, mp4, aif, mid, vqf, la1,
lav, mp2, avi, mpg, mpeg, rm, qt, asx, mov, fli, flc, eps, wri, asc, fmk, for, zip, gzip,
tar, arc, lzh, sit, rar, arj, dd, tgz, lha, exe, hqx, dll, vbs, vxd, bat, cmd, class, jar,
java, jav and email addresses.
MUNAX knows what type of files these are and groups them accordingly.
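A simple way to picture this is a lookup from file extension to handling. In the sketch below the full-index set is the list given above, while the group names and the selection of link-indexed extensions in each group are assumptions, since the text does not say which groups MUNAX uses. The function name `classify` is likewise illustrative.

```python
import posixpath
from urllib.parse import urlparse

# Full-indexed extensions, as listed in the text.
FULL_INDEX = set(
    "htm html shtm shtml jhtm asp php php3 pdf ps doc xls ppt rtf "
    "wp wp5 wp6 wpd txt c cpp h".split()
)

# Assumed grouping of a few link-indexed extensions (illustrative only).
LINK_GROUPS = {
    "image": {"gif", "jpg", "jpe", "bmp", "tif", "ico", "pcx", "tga"},
    "audio": {"mp3", "wav", "mid", "aif", "snd", "mp2"},
    "video": {"avi", "mpg", "mpeg", "mov", "qt"},
    "archive": {"zip", "gzip", "tar", "rar", "arj", "lzh", "tgz", "lha"},
}


def classify(url: str) -> str:
    """Return 'full-index', 'link-index:<group>' or 'unknown' for a URL."""
    path = urlparse(url).path  # drops any query string or fragment
    ext = posixpath.splitext(path)[1].lstrip(".").lower()
    if ext in FULL_INDEX:
        return "full-index"
    for group, extensions in LINK_GROUPS.items():
        if ext in extensions:
            return f"link-index:{group}"
    return "unknown"
```

For example, `classify("http://example.com/a/report.PDF")` yields `"full-index"`, and a link to `pic.gif` falls into the assumed `image` group.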
Link-indexing in MUNAX is more complex than just indexing the
anchor text and the URL. Among other things, MUNAX relates each link to the
other links on the page and to the text of the page itself.
MUNAX allows information to be encrypted, i.e. crawlers can fetch encrypted docs from web
sites. The indexer decrypts the doc before indexing it and encrypts it again before storing
it in the doc repository. This enables corporations to store their sensitive information in
a safe and nonvolatile way and at the same time have it searchable with the highest
precision.
MUNAX is language independent. For those docs where "words" are more than just
bit streams, i.e. where words are built up of ASCII or ISO characters, 255 languages are
recognized.
MUNAX can index fragments to produce one (or several) "frame documents" from an
HTML page. This makes it ideal for setting up a News Search Engine. Such a search engine
may co-exist in the same database as the "normal" search engine.
Sorting:
Sorting covers several parts of the MUNAX data, such as doc location, link analysis, the
dictionary and, of course, the forward index, which is sorted to produce the inverted
index.
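The last step is the textbook inversion of a forward index. The sketch below shows the general technique on a toy data set: flatten the forward index into (word, doc id) pairs, sort them, and group by word. The data and variable names are made up, and nothing here reflects MUNAX's actual file formats or distributed sorters.

```python
from itertools import groupby

# Toy forward index: doc id -> words found in that doc.
forward_index = {
    1: ["search", "engine", "crawler"],
    2: ["crawler", "index"],
    3: ["search", "index"],
}

# Flatten to (word, doc_id) postings, then sort by word and doc id.
postings = sorted(
    (word, doc_id) for doc_id, words in forward_index.items() for word in words
)

# Group the sorted postings by word to obtain the inverted index.
inverted_index = {
    word: [doc_id for _, doc_id in group]
    for word, group in groupby(postings, key=lambda posting: posting[0])
}
```

After sorting, all postings for a word are adjacent, which is what makes a single pass sufficient to build each posting list.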
Querying and Viewing results:
MUNAX supports the same query parameters as many other search engines, such as phrases, word
subtraction, language, file format, date, top domain, site name, porn filtering etc.
However, MUNAX introduces new ones that cannot be found elsewhere, such as Compose Search,
Customer code, Access code, User Scalar, Rich content (multimedia) and Robots inclusion/
exclusion (yes, at query time).
To give the search engine visitor an even greater search experience, MUNAX lets the visitor
choose which ranking algorithms to combine, or simply sort the results by date if that is
what is wanted.
Docs can be fetched directly from the MUNAX repository and viewed in the browser with the
query words highlighted, either in the original format or with the HTML tags stripped
away.
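Highlighting on tag-stripped text can be illustrated with a small regular-expression helper. This is a generic sketch, not the MUNAX viewer: the function name `highlight` and the `<b>` markers are assumptions, and highlighting inside original HTML would additionally need to avoid matching within tags.

```python
import re
from typing import Iterable


def highlight(text: str, query_words: Iterable[str],
              before: str = "<b>", after: str = "</b>") -> str:
    """Wrap whole-word, case-insensitive matches of the query words."""
    words = [w for w in query_words if w]
    if not words:
        return text  # nothing to highlight
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(w) for w in words) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: f"{before}{m.group(0)}{after}", text)


marked = highlight("Munax crawls the web", ["munax", "web"])
```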
Ranking results, AIRelevance and Equalizer ranking (EQR):
Today, when making a query on the most popular public search engine on the internet, you
may get results that are not related to what you are querying for. Or, even worse, when you
click on a query result, the destination document does not even contain any of the words in
your query.
It's clear that it is not enough to rely on one or two ranking algorithms to produce the
best results. In MUNAX, the goal is to give the most informative documents the highest
rank. To decide which documents are most informative, we invented the AIRelevance
algorithm. It tries to "look" at a document in much the same way as a web surfer would
and asks itself: "Is this a good and informative document in relation to the
query?". However, realizing that relying on a single ranking algorithm is not enough, we
combine AIRelevance with five (5) others to produce a final ranking that we call Equalizer
Ranking (EQR).
EQR combines the following ranking algorithms: AIRelevance, docVote (link analysis),
Proximity, VSM (Vector Space Model), mmRank (Multimedia ranking) and USR (User Scalar
Ranking). Each algorithm can, of course, be parameterized to decide how much influence it
should have on the final ranking.
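The text does not state how EQR combines the six scores. One plausible reading of "how much influence" is a weighted average, and the sketch below shows that technique under the assumption that each algorithm returns a score between 0 and 1. The function name `eqr_score`, the weights and the sample scores are all made up for illustration.

```python
from typing import Dict

# The six algorithms named in the text.
ALGORITHMS = ("AIRelevance", "docVote", "Proximity", "VSM", "mmRank", "USR")


def eqr_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of per-algorithm scores (assumed to lie in [0, 1])."""
    total_weight = sum(weights[name] for name in ALGORITHMS)
    if total_weight <= 0:
        raise ValueError("at least one weight must be positive")
    weighted = sum(weights[name] * scores.get(name, 0.0) for name in ALGORITHMS)
    return weighted / total_weight


doc_a = {"AIRelevance": 0.9, "docVote": 0.2, "Proximity": 0.5, "VSM": 0.4}
doc_b = {"AIRelevance": 0.3, "docVote": 0.9, "Proximity": 0.5, "VSM": 0.4}

equal = {name: 1.0 for name in ALGORITHMS}
favor_ai = dict(equal, AIRelevance=3.0)  # give AIRelevance three times the say
```

With equal weights `doc_b` scores slightly higher than `doc_a`, while `favor_ai` reverses the order, which is the kind of tuning the parameterization makes possible.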