Algorithm for string similarity – better than php Levenshtein and similar_text

Posted on March 25, 2011

6


In a current pedagogical project, the students are supposed to answer questions/solving tasks using a drag-and-drop interface. This interface has to be general and loadable with different kind of pedagogical stuff. Examples could sorting things in order depending on let’s say weight, or sorting months in the right order onto a circular board with 12 “slots”, one for each month.

For checking and verifying the students answers, I use a solution where each sortable item carries a single character tag. In the months example, january has the tag character “a”, february has the tag character “b” etc. This means that a correct answer would be “abcdefghijkl” – all the 12 months in the correct order.

My problem is the following: What algorithm to use to get a “pedagogically usable and fair” result value when the student’s answer pattern is partly correct, but differs from the correct answer in a way that php Levenshtein and similar_text algorithms doesn’t handle well. Example:

similar_text('jonas', 'xxjon', $similar); echo $similar; // returns 60
similar_text('jonas', 'asjon', $similar); echo $similar; // returns 60 <- !
echo levenshtein('jonas', 'xxjon'); // returns 4
echo levenshtein('jonas', 'asjon'); // returns 4  <- !

As seen in these examples, the similar_text and the Levenshtein functions returns the same value for both strings, in spite of the fact that the “asjon” string is much closer to the “jonas” when it comes to the character content.

I’ve been lurking around for some more elaborated solution that would suite my needs better, but not yet found one. So, I had to give it a go myself. And here’s what I’ve come up with:

static public function string_compare($str_a, $str_b)
{
    $length = strlen($str_a);
    $length_b = strlen($str_b);

<code>    $i = 0;
    $segmentcount = 0;
    $segmentsinfo = array();
    $segment = '';
    while ($i < $length)
    {
        $char = substr($str_a, $i, 1);
        if (strpos($str_b, $char) !== FALSE)
        {
            $segment = $segment.$char;
            if (strpos($str_b, $segment) !== FALSE)
            {
                $segmentpos_a = $i - strlen($segment) + 1;
                $segmentpos_b = strpos($str_b, $segment);
                $positiondiff = abs($segmentpos_a - $segmentpos_b);
                $posfactor = ($length - $positiondiff) / $length_b; // <-- ?
                $lengthfactor = strlen($segment)/$length;
                $segmentsinfo[$segmentcount] = array( 'segment' => $segment, 'score' => ($posfactor * $lengthfactor));
            }
            else
            {
                 $segment = '';
                 $i--;
                 $segmentcount++;
             }
         }
         else
         {
             $segment = '';
            $segmentcount++;
         }
         $i++;
     }

     // PHP 5.3 lambda in array_map
     $totalscore = array_sum(array_map(function($v) { return $v['score'];  }, $segmentsinfo));
     return $totalscore;
}

Some results:

  • jonas / jonax : 0.8
  • jonas / sjona : 0.68
  • jonas / sjonas : 0.66
  • jonas / asjon : 0.52
  • jonas / xxjon : 0.36

I’m sure i isn’t perfect, and that it could be optimized, but nevertheless it seems to produce the results that I’m after… One weak spot is that when strings have different length, it produces different result when the values are swapped. But, will do for a start!

Advertisements
Posted in: PHP