About

Welcome to my “Programming and music” blog. No personal stuff here! 🙂

2 Responses “About” →
  1. Hello,

    I read your blog about serializing to AMF. I am not an AS3 programmer but I need a programmer to help me save and load project files from my Flash app. http://www.imageeditor.net Save and Load project files to local HDD and to the server and retain layers, clipart, text, everything… Please contact me and let me know if your up for the task. I need someone soon and I have the money ready to pay if you can confidently do the job.

    Thanks,
    Andy Garza

    Reply
  2. One nice bit of search query functionality, particularly in boolean
    systems, is the wildcard match. If you aren’t sure whether the title
    you’re trying to remember contains the word academy, academic,
    academically, or academics then you might be well served by trying all
    four: academ*.

    We can treat this problem as a kind of query expansion – as if the
    user had performed a query for all words that begin with ‘academ’, and
    we would get the union of the posting lists for each matching word as
    normal. To do this, we need to be able to efficiently retrieve all
    words that match the wildcard, which requires a change to the
    dictionary the terms are stored in.
    Tries

    Earlier we used a hash map (in the form of a normal PHP array) to
    store the dictionary. This works fine for quick lookups, but doesn’t
    allow us to find similar words, as even words that vary only by an
    accent, such as cliche and cliché will generate very different hashes.

    Instead, we will use a tree structure that allows us to look for a
    word a letter at a time, called a trie, from retrieval – though
    usually pronounced ‘try’. A trie is structured so there is a node for
    each letter in a word. A trie containing the words ‘acts’, ‘actor’ and
    ‘ant’ would look like:

    a
    / \
    c n
    | |
    t t
    / \
    s o
    |
    r

    We can store a value with each word, to allow us to hold the document
    count and lookup for the posting list. General retrieval can be
    handled by just walking through the tree one node at a time, and the
    string one character at a time looking for matches at each level. If
    we get to the end of the string, we can then test whether that’s a
    word node.

    Traditionally the different letters in a node would have been
    represented with a linked list, and the existence of a word determined
    by adding an extra non-used character, like $, to distinguish between
    the word ‘add’ and the letters a-d-d as part of the word ‘adder’.
    However, we can stick the whole lot in a series of nested php arrays,
    which while not as efficient are rather straightforward! This also
    lets us have a ‘value’ key on the node that represents the end of a
    word.

    Prefix matching, with the wildcard at the end of the query term, can
    be handled by looking up the specified characters, e.g. looking for
    the node in the trie that represents ‘academ’ (if it exists), then
    traversing through all nodes below it, returning all the words found.
    Here’s a class that handles this, which uses references to traverse
    the tree, though recursion or an iterator would probably be a bit
    cleaner!

    array());
    }

    public function add($key, $value = null) {
    $trieLevel =& $this->getTrieForKey($key, true);
    $trieLevel[‘value’] = $value;
    }

    public function isMember($key) {
    $trieLevel = $this->getTrieForKey($key);
    if($trieLevel != false && array_key_exists(‘value’,
    $trieLevel)) {
    return true;
    }
    return false;
    }

    public function prefixSearch($prefix) {
    $trieLevel = $this->getTrieForKey($prefix);
    if(false == $trieLevel) {
    return false;
    } else {
    return $this->getAllChildren($trieLevel, $prefix);
    }
    }

    private function& getTrieForKey($key, $create = false) {
    $trieLevel =& $this->trie;
    for($i = 0; $i array());
    } else {
    return false;
    }
    }
    $trieLevel =& $trieLevel[‘children’][$character];
    }

    return $trieLevel;
    }

    private function getAllChildren($trieLevel, $prefix) {
    $return = array();
    if(array_key_exists(‘value’, $trieLevel)) {
    $return[$prefix] = $trieLevel[‘value’];
    }

    if(isset($trieLevel[‘children’])) {
    foreach($trieLevel[‘children’] as $character => $trie) {
    $return = array_merge($return,
    $this->getAllChildren($trie,
    $prefix . $character));
    }
    }
    return $return;
    }

    }
    ?>

    Using this for a wildcard at the end of the string is then straightforward:

    add($word);
    }
    var_dump($trie->prefixSearch(‘add’));
    ?>

    Gives:

    array
    (2) {
    [“adder”]=>
    NULL
    [“addled”]=>
    NULL
    }
    Other Wildcards

    The next trick we can use is to compute another dictionary, but this
    time in reverse. We store each term reversed, so the path to the term
    academic would be c-i-m-e-d-a-c-a. Using the same technique as before
    we can now search for all words that end with a certain phrase,
    allowing us to handle wildcards at the beginning of query terms, e.g.
    *cademically.

    By maintaining both we can handle wildcards in the middle of terms, by
    simply splitting the term into one that ends with a wildcard, and one
    that begins with a wildcard. The two resulting expansions can then by
    AND’d together, to give the proper results. For example hap*ily would
    become the query hap* AND *ily, and be treated as above. If we wrap up
    the logic in a couple of functions, we can process a single wildcard
    expansion anywhere in the string in a fairly straightforward manner:

    add($word);
    $rtrie->add(strrev($word));
    }
    return array(‘trie’ => $trie, ‘rtrie’ => $rtrie);
    }

    function searchTries($search, $tries) {
    $terms = explode(‘*’, $search);
    if(count($terms) > 2) {
    return false;
    }

    if(strlen($terms[0]) && strlen($terms[0])) {
    // middle wildcard
    $straight = $tries[‘trie’]->prefixSearch($terms[0]);
    $rev = $tries[‘rtrie’]->prefixSearch(strrev($terms[1]));
    return array_intersect($straight, reverseArray($rev));
    } else if(strlen($terms[1]) ) {
    // leading wildcard
    return
    reverseArray($tries[‘rtrie’]->prefixSearch(strrev($terms[1])));
    } else {
    // trailing wildcard
    return $tries[‘trie’]->prefixSearch($terms[0]);
    }
    }

    function reverseArray($keys) {
    $return = array();
    foreach($keys as $key => $value) {
    $return[strrev($key)] = $value;
    }
    return $return;
    }

    /* Do some searches */

    $words = array(
    ‘adder’,
    ‘addled’,
    ‘abject’,
    ‘agreement’,
    ‘astronaut’,
    ‘handily’,
    ‘happily’,
    ‘helpfully’
    );
    $tries = buildTries($words);

    $return = searchTries(‘h*ly’, $tries);
    var_dump($return);

    $return = searchTries(‘ha*ly’, $tries);
    var_dump($return);
    ?>

    The results from the two var dumps look like this:

    array
    (3) {
    [“handily”]=>
    NULL
    [“happily”]=>
    NULL
    [“helpfully”]=>
    NULL
    }

    array(2) {
    [“handily”]=>
    NULL
    [“happily”]=>
    NULL
    }

    There is another technique, the permuterm index, that can allow these
    kind of queries to be expanded within a single tree lookup, which
    works by adding a character to represent the end of the string, then
    rotating that string through each position. This can be quicker, but
    rather space intensive. Another technique is to create a k-gram index,
    which will store sequences of letters – for example a the 2-grams for
    ‘example$’ (with dollar denoting the end of the string would be):
    ‘ex’, ‘xa’, ‘am’, ‘mp’, ‘pl’, ‘le’, ‘e$’, which can be queried
    similarly.

    The biggest weakness of the trie is the amount of space it wastes –
    chances are there will be runs of characters with a word only at the
    end, meaning a bunch of extra levels of the tree that aren’t needed.
    The PATRICIA trie, or radix trie, attempts to address this by allowing
    the ‘decision’ at each note to be on something other than the first
    character, so each node stores an offset and the branch options (e.g.
    if the letter in position 5 is A, go down this branch).

    For real world systems, this technique is likely to be combined with a
    B-Tree into an String B-Tree, which is structured to allow the
    dictionary to remain mostly on disk, and minimise the number of
    accesses needed to answer any given query – important when the size of
    the dictionary is greater than can be practically stored in memory.
    That said, t

    Reply

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: