
Simple algorithm to calculate distance between two vectors


Sjaak

Member
Joined
Apr 26, 2004
Location
The Netherlands
Heyo!

I know I could probably ask this question on a maths-related forum, or browse through a long list of papers and books to condense what I need, but it's a really simple problem and I just like these forums a lot.

I'm doing research on proteins at the moment and have written a script that outputs a vector for every sequence, filled with values that depend on certain aspects of that sequence:

Sequence1 = [0.2,0.4,0,0,1,1,0.2..etc]

Each vector will contain 344 values ranging between 0 and 1, to about 3 decimals (unsure at the moment, so better to be flexible). What I need to do is compare each number in one vector with the corresponding number in the other and calculate the 'distance' between them, where equal numbers give the smallest distance and one being 1 and the other 0 gives the largest distance.

The goal would be to input two sequences and receive a single number between 0 and 1 indicating how far apart they are, 0 being identical and 1 being very different. Language to be used: Python.

I was thinking of looping through the sequences first, removing any values that are 0 in both (these have no biological value). Then award points based on the difference between the remaining values. But how can I translate this into a single 'score' in the end?

Tia

- Sjaak
 
I'll give it a shot

Pseudocode given, I'm not very familiar with Python
Code:
'Takes two comma-separated inputs of your vectors
'Make sure they are the same length before passing them
Function GetScore(byval sequence1 as string, byval sequence2 as string) as double

     'double for decimal precision
     dim sequence1exploded() as double = explode(sequence1,",")
     dim sequence2exploded() as double = explode(sequence2,",")
     dim counter as integer
     dim runningTotal as double = 0

     'Add up the absolute difference of each pair of values
     for counter = 0 to sequence1exploded.length - 1
          runningTotal += absolutevalue(sequence1exploded(counter) - sequence2exploded(counter))
     next counter

     'Average difference: 0 = identical, 1 = as different as possible
     GetScore = runningTotal / sequence1exploded.length
End Function

That's a quick interpretation. You may have to convert some strings to numbers somewhere depending on how you do it.
 
I'm not much into statistics, but there are a *lot* of ways that you can measure "distance". You could find average distance, root-mean-square distance, Euclidean distance, and so on. The "correct" metric to use will depend on the problem itself, and what makes most sense. You might want to consult Wikipedia for information on the different distance metrics.

Some Java-esque pseudocode:

Average Distance
Computes the average of all the differences.
Code:
double computeAverageDistance(double[] sequence1, double[] sequence2) {
     double sum = 0.0;
     
     // Add up the absolute difference at each position
     for (int index = 0; index < sequence1.length; index++) {
          sum = sum + Math.abs(sequence1[index] - sequence2[index]);
     }
     
     return sum / sequence1.length;
}

Root-mean-square Distance
Computes the square root of the mean of the squared differences.
Code:
double computeRMSDistance(double[] sequence1, double[] sequence2) {
     double sum = 0.0;
     
     // Add up the squared difference at each position
     for (int index = 0; index < sequence1.length; index++) {
          double difference = sequence1[index] - sequence2[index];
          sum = sum + difference * difference;
     }
     
     return Math.sqrt(sum / sequence1.length);
}


Euclidean Distance
Computes the "distance in space" between the two vectors.
Code:
double computeEuclideanDistance(double[] sequence1, double[] sequence2) {
     double sum = 0.0;
     
     // Add up the squared difference at each position
     for (int index = 0; index < sequence1.length; index++) {
          double difference = sequence1[index] - sequence2[index];
          sum = sum + difference * difference;
     }
     
     return Math.sqrt(sum);
}

EDIT: Note that none of the algorithms above remove "zeros" from the input vectors. You could create a loop which does this fairly easily, and then pass the resulting vectors into one of these functions.
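Since the original question asked for Python, here is a rough sketch of that zero-removal step followed by the three metrics above. The function names and the sample vectors are made up for illustration, and it assumes both lists have the same length and all values lie between 0 and 1.

```python
import math

def drop_shared_zeros(v1, v2):
    """Keep only the positions where at least one vector is non-zero."""
    pairs = [(a, b) for a, b in zip(v1, v2) if a != 0 or b != 0]
    return [a for a, _ in pairs], [b for _, b in pairs]

def average_distance(v1, v2):
    """Mean absolute difference; stays within 0..1 for inputs within 0..1."""
    return sum(abs(a - b) for a, b in zip(v1, v2)) / len(v1)

def rms_distance(v1, v2):
    """Root-mean-square difference; also stays within 0..1."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)) / len(v1))

def euclidean_distance(v1, v2):
    """Straight-line distance; equals rms_distance * sqrt(n), so it can exceed 1."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

# Hypothetical sample data, not taken from the real protein vectors
seq1 = [0.2, 0.4, 0.0, 0.0, 1.0, 1.0]
seq2 = [0.2, 0.0, 0.0, 1.0, 0.0, 1.0]

filtered1, filtered2 = drop_shared_zeros(seq1, seq2)
print(average_distance(filtered1, filtered2))
print(rms_distance(filtered1, filtered2))
print(euclidean_distance(filtered1, filtered2))
```

If a single score between 0 and 1 is the goal, the average or RMS version fits directly, while the Euclidean one would need dividing by the square root of the vector length. Two vectors that are zero everywhere leave nothing after filtering, so that case would need its own check before dividing.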

JigPu
 
Unsure if it will interest anyone, but this is the result. If you want to run it, use the attached file :) Thanks again!


Code:
import math

#fileName = raw_input("enter a filename that contains two vectors of equal length: ?")
fileName = "vectoren.txt"

#Loads file and vectors
fileContent = open(fileName, "r")
data = fileContent.readlines()

#Makes vectors into lists (previously read as strings)
Vector1 = data[0].split(",")
Vector2 = data[2].split(",")

#Check whether both vectors have the same length
if len(Vector1) != len(Vector2):
    print("Error: unequal lengths, distance calculation not possible")
    #A bare 'quit' does nothing; raise SystemExit to actually stop the script
    raise SystemExit(1)

#Loop through them, remove instances that are '0' in both
limit = len(Vector1)
index = 0
tempVector1 = []
tempVector2 = []

while index < limit:
    temp1 = Vector1[index]
    temp2 = Vector2[index]
    
    if (float(temp1) != 0 or float(temp2) != 0):
            tempVector1.append(temp1)
            tempVector2.append(temp2)
     
    index = index + 1

    
#Rename the temp ones
Vector1 = tempVector1
Vector2 = tempVector2

#Loop through the remaining instances, calculate root-mean-square distance
index = 0
limit = len(Vector1)
avgScore = 0

while index < limit:
    value1 = float(Vector1[index])
    value2 = float(Vector2[index])
    avgScore = avgScore + pow(float(value1-value2), 2)
    index = index + 1

print(pow((avgScore/len(Vector1)), 0.5))
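
For anyone adapting the loading part: the script reads `data[0]` and `data[2]`, which suggests the attached file keeps a blank line between the two vectors, but that is a guess. A small sketch that skips blank lines and converts to floats up front could look like this (`read_vectors` and the sample text are hypothetical, and it assumes no trailing commas):

```python
def read_vectors(text):
    """Parse each non-blank line of comma-separated numbers into a list of floats."""
    vectors = []
    for line in text.splitlines():
        line = line.strip()
        if line:  # skip blank separator lines
            vectors.append([float(value) for value in line.split(",")])
    return vectors

# Hypothetical file content: two vectors separated by a blank line
sample = "0.2,0.4,0\n\n0,1,0.5\n"
vector1, vector2 = read_vectors(sample)
print(vector1, vector2)
```

With a real file, the text could come from `open(fileName).read()` inside a `with` block so the file gets closed afterwards.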
 

Attachments

  • vectoren.txt
    4.7 KB · Views: 209
Hi! I need to calculate the Euclidean distance between two vectors in Java, in a function that can be called from another function.
 