Since a parallel computer or network of workstations presumably has more total memory than a single processor, the amount of memory consumed during matching is less of a concern. In fact, if the machine can be used as a backend processor (a continuously-running server that receives queries from a client, processes them, and sends the results back to the client) and has a large enough collective memory, each processor can cache term and document vectors in main memory, saving the cost of loading the vectors from secondary storage for each query. Depending on the machine, the cost of starting a process on a parallel computer or network of workstations might be very high. Because of this, we consider only the distributed implementation executing as a backend processor.
Computing the similarity between the pseudo-document and each term or document is a highly parallel operation. When applying the similarity measure, the scaling and multiplication of one term or document vector is completely independent of the operations performed on another term or document vector. Even a simple blocked data distribution [18] provides a high degree of parallelism while balancing the computational load among the processors. If each processor loads its respective blocks of term and document vectors upon initialization, the costs associated with loading term and document vectors from secondary storage will only have to be paid once rather than each time a query is performed.
Processing a query in parallel is very similar to processing a query on a serial machine. One processor, the root processor, is responsible for receiving a query from the client, building the pseudo-document, and performing the bookkeeping tasks associated with the query. When the root processor has finished building the pseudo-document, it broadcasts it to the other processors. It also must broadcast the number of factors to use, whether to return terms, documents, or both, and the number of terms or documents to return. Each processor then computes the similarity between the pseudo-document and each local term or document vector, recording the results. At the end of computation, a global sort returns the ranked similarity measures to the root processor. The root processor then returns the results to the client.
For the price of a few broadcasts before computation begins and a single global merge sort after computation ends, a query can be processed very quickly without accessing secondary storage or causing network contention. In addition, since there is very little difference between the serial algorithm and the parallel algorithm, much of the code from the serial implementation can be used by the distributed implementation.