MoinMoin recommendation system

News

The Goal

The goal of this system is to recommend pages based on the content of the page the user is currently viewing. The system should be completely unsupervised.

Requirements

The only requirement for this feature is NumPy. NumPy can be downloaded from http://numpy.scipy.org/.

Pattern representation

Currently pages are represented using the bag-of-words approach. Features are selected based on their "document frequency" in the document space.

The number of selected features is determined by a user/administrator-provided value (numFeatures). This number also equals the number of inputs to the classifier.

The vector representation of a page is built from the computed weights of the selected input features. Feature weights are computed using the formula:

w = tf * idf = (tf / tf_max) * log(N / n)

where:

  * tf is the frequency of the feature (term) in the page
  * tf_max is the frequency of the most frequent term in the page
  * N is the total number of pages in the document space
  * n is the number of pages containing the term
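As a minimal sketch of how such a weight vector might be built (the function and variable names below are illustrative, not the actual recommender code):

    import math

    def tfidf_vector(page_tokens, features, doc_freq, num_pages):
        """Compute the TF-IDF weight vector of a page over the selected features.

        page_tokens -- list of tokens extracted from the page
        features    -- list of the numFeatures selected terms (fixed order)
        doc_freq    -- dict mapping each feature to the number of pages containing it
        num_pages   -- total number of pages (N)
        """
        # Term frequencies within this page.
        tf = {}
        for token in page_tokens:
            tf[token] = tf.get(token, 0) + 1
        tf_max = max(tf.values()) if tf else 1

        vector = []
        for term in features:
            n = doc_freq.get(term, 0)
            if n == 0 or term not in tf:
                vector.append(0.0)
            else:
                # w = (tf / tf_max) * log(N / n)
                vector.append((tf[term] / float(tf_max)) * math.log(float(num_pages) / n))
        return vector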

Similarity measure

The similarity between input vectors (page vectors) is computed using the cosine similarity measure. This similarity measure was selected based on experimental results. A better similarity measure might yield better results.
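For reference, cosine similarity between two page vectors can be computed with NumPy roughly like this (a sketch, not necessarily the exact code used by the action):

    import numpy

    def cosine_similarity(a, b):
        """Cosine of the angle between two page vectors a and b."""
        a = numpy.asarray(a, dtype=float)
        b = numpy.asarray(b, dtype=float)
        norm = numpy.linalg.norm(a) * numpy.linalg.norm(b)
        if norm == 0:
            return 0.0
        return float(numpy.dot(a, b) / norm)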

Clustering algorithm

The clustering algorithm is based on the ART (Adaptive Resonance Theory) neural network and provides unsupervised, incremental learning, allowing the system to evolve as new content or user-selected pages arrive.

System optimizations

Feature Selection

Currently the system uses all words that are longer than 3 characters and selects the most frequent ones, skipping common words such as "while", "about", "again", etc. Common words are defined in so-called "stop word" lists. Currently we provide stop word lists for English, German, Russian, Spanish, Portuguese, Norwegian, Finnish and Dutch.
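A rough sketch of this selection step (names are illustrative; in the actual code the stop word lists are separate per-language files):

    def select_features(pages, stop_words, num_features):
        """Pick the num_features tokens with the highest document frequency.

        pages      -- iterable of token lists, one per page
        stop_words -- set of common words to skip
        """
        doc_freq = {}
        for tokens in pages:
            for token in set(tokens):
                if len(token) > 3 and token.lower() not in stop_words:
                    doc_freq[token] = doc_freq.get(token, 0) + 1
        # Keep the terms that occur in the most pages.
        ranked = sorted(doc_freq, key=doc_freq.get, reverse=True)
        return ranked[:num_features]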

Another idea is to use stemming in order to reduce words to their root, e.g. "stemmer" and "stemming" would both be reduced to "stem". Unfortunately this brings some problems: stemming algorithms are optimized for a specific language, and there is no easy way to implement this for all languages. If implemented, we might need a secondary classifier for detecting the main language of a page :).
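Purely as an illustration of the idea (this is not implemented; a real solution would use a proper per-language stemmer such as the Porter algorithm for English), a naive suffix-stripping stemmer could look like:

    def naive_stem(word):
        """Very crude English-only suffix stripping, for illustration only."""
        for suffix in ("ing", "ers", "er", "ed", "s"):
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                word = word[:-len(suffix)]
                # Collapse a trailing double consonant ("stemm" -> "stem").
                if len(word) > 3 and word[-1] == word[-2] and word[-1] not in "aeiou":
                    word = word[:-1]
                return word
        return word

    # naive_stem("stemmer") == naive_stem("stemming") == "stem"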

Classifier

The ART network has some very sensitive parameters, described below.

Experimentally we used 1000 inputs for the ANN. A lower number (100) yielded poor clustering performance, while larger values yielded poor classification speed. One way to reduce the speed impact of a larger number of inputs would be to introduce an external dependency, NumPy, which allows replacing some slow list comprehension operations with vectorized array operations. The number of inputs is a critical part of the ART network and can make the difference between good and poor clustering performance.
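To illustrate the kind of speed-up meant here (an assumption about where the bottleneck lies, not a measurement), compare a weighted sum over 1000 inputs written as a list comprehension with the same operation written as a NumPy dot product:

    import numpy

    weights = numpy.random.random(1000)
    inputs = numpy.random.random(1000)

    # Pure-Python style: builds an intermediate list, slow when repeated for many categories.
    activation_slow = sum([w * x for w, x in zip(weights, inputs)])

    # NumPy style: a single vectorized dot product.
    activation_fast = float(numpy.dot(weights, inputs))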

The vigilance parameter, simply put, makes the system create a new category whenever an input vector's similarity to every existing category falls below the vigilance threshold. Its value is critical for good clustering performance.
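A minimal sketch of how the vigilance test drives ART-style incremental clustering (function and variable names are illustrative, and a real ART network adapts its category weights differently):

    def art_cluster(vectors, vigilance, similarity):
        """Assign each input vector to a category, creating new categories as needed.

        vectors    -- iterable of page vectors
        vigilance  -- similarity threshold in [0, 1]
        similarity -- callable returning the similarity of two vectors (e.g. cosine)
        """
        categories = []   # one prototype vector per category
        assignments = []
        for vec in vectors:
            # Find the best matching existing category.
            best, best_sim = None, -1.0
            for idx, proto in enumerate(categories):
                sim = similarity(vec, proto)
                if sim > best_sim:
                    best, best_sim = idx, sim
            if best is None or best_sim < vigilance:
                # Vigilance test failed for every category: create a new one.
                categories.append(list(vec))
                assignments.append(len(categories) - 1)
            else:
                # Resonance: move the winning prototype towards the input.
                proto = categories[best]
                categories[best] = [(p + v) / 2.0 for p, v in zip(proto, vec)]
                assignments.append(best)
        return assignments, categories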

Usage

In order to test the current implementation (17 Aug 2007) you need to:

  1. Index pages in order to find the best tokens for the ANN:

    moin.py ... recommender index

    Then do one of the following:
    1. For testing, you can pass the option "--underlay yes" to the script above; this way the script selects features from the entire wiki, including underlay. OR
    2. Select a few pages for training and put them in a wiki page using MoinMoin Group syntax (see the example group page after this list), then run the training script:

      moin.py ... recommender train --wikigroup GroupPage # Replace GroupPage with the page holding the training set

      OR
    3. Use the "Star Page" action to train the classifier with selected pages.
  2. Run the "cluster" script to update the cache:

    moin.py ... recommender cluster
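For reference, a training group page (the GroupPage above) would use the usual MoinMoin group syntax, i.e. each member page listed as a first-level list item. The page names below are only placeholders:

     * FrontPage
     * HelpOnEditing
     * AnotherInterestingPage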

To view the computed clusters use the WikiClusters macro. [[WikiClusters]]

Wishes

Bugs
