Programmers: how to recognize coordinates?

To send you correct alerts for ISS passes, our system has to be good at finding out where you are. And it does work well, with one exception. Read on if you know what regular expressions are and if you want to help Twisst a little.

The problem

Most followers of @twisst have a location in their Twitter profile like 'Mumbai, India' or 'Atlanta, USA'. Our system takes that location and asks Yahoo or Google to geocode it so we get geographical coordinates.

Some people help us a little by listing their coordinates instead of a text location. In those cases, all we need to do is check if the coordinates are correct and find out when ISS will pass over that location.

This fails, however, when people use degrees, minutes and seconds (or hemisphere letters) instead of signed decimal notation. Our script understands '40.641205, -8.654413', but when it encounters a location like 'N 52°35' 0'' / W 2°1' 0''', everything goes wrong. It either treats it as a text location, which it feeds to the geocoding services, or it uses the numbers while ignoring the N, S, E or W, resulting in ISS alerts for the wrong location.



The script

Here is the very simple PHP function we currently use:

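function geoCodeiPhone($location) {

    // see if there are coordinates in the location, e.g. "iPhone: 52.043900,5.555832". if so, use those coordinates as lat & lng
    // [.,] accepts either a dot or a comma as the decimal separator
    preg_match_all("/-?\d+[.,]\d+/", $location, $coords, PREG_SET_ORDER);

    // fewer than two numbers found: no coordinates, so the caller falls back to geocoding
    if (count($coords) < 2) {
        return array(null, null);
    }

    $lat = str_replace(',', '.', $coords[0][0]);
    $lng = str_replace(',', '.', $coords[1][0]);

    // note the order: longitude first, then latitude
    return array($lng, $lat);
}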

After this function we check if this resulted in valid coordinates. If no coordinates were found, we proceed with the geocoding.
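To make the failure concrete, here is a small sketch that runs essentially the same pattern as geoCodeiPhone() on three kinds of input. The helper name findDecimalTokens is made up for this illustration and is not part of the Twisst code.

```php
<?php
// Hypothetical helper: returns the raw decimal tokens that a pattern like
// the one in geoCodeiPhone() finds in a location string.
function findDecimalTokens($location) {
    preg_match_all("/-?\d+[.,]\d+/", $location, $coords, PREG_SET_ORDER);
    $tokens = array();
    foreach ($coords as $match) {
        $tokens[] = $match[0];
    }
    return $tokens;
}

// Plain decimal notation: both numbers are found.
print_r(findDecimalTokens('40.641205, -8.654413'));
// Hemisphere letters: the numbers are found, but the S and E are lost.
print_r(findDecimalTokens('34.3330°S, 150.9170°E'));
// Degrees, minutes, seconds: nothing is found at all.
print_r(findDecimalTokens("N 52°35' 0'' / W 2°1' 0''"));
```

The second case is the dangerous one: the result looks like a valid coordinate pair, but it is in the wrong hemisphere.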

A solution?

I would be very grateful if someone could make an alternative function which does two things:

  1. besides the coordinates in decimal format, also find coordinates like 34.3330°S, 150.9170°E; N5.431781 E51.362089; N 52°35' 0'' / W 2°1' 0''

  2. convert those coordinates into decimal format.

You are very welcome to post an all-in-one function, but a solution to either task would be a great help too. Post your code below!
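As a starting point, here is a rough sketch of what such an all-in-one function could look like for the three example notations above. It is not the code that was eventually adopted. The names parseCoordinates and dmsToDecimal are invented here, and the sketch assumes UTF-8 input and a dot as the decimal separator. It returns longitude first, like the existing function, or null when nothing usable is found.

```php
<?php
// Sketch only: parseCoordinates() and dmsToDecimal() are hypothetical names.
// Assumes UTF-8 input and '.' as the decimal separator.

// Degrees + minutes/60 + seconds/3600; negative for a '-' sign or for S/W.
function dmsToDecimal($sign, $deg, $min, $sec, $hemi) {
    $value = (float)$deg + (float)$min / 60 + (float)$sec / 3600;
    $negative = ($sign === '-' || $hemi === 'S' || $hemi === 'W');
    return $negative ? -$value : $value;
}

function parseCoordinates($location) {
    $num  = '(\d+(?:\.\d+)?)';
    // sign, degrees, optional degree symbol, optional minutes ('),
    // optional seconds ('' or ")
    $core = '([+-])?' . $num . '\s*\x{00B0}?\s*(?:' . $num . '\s*\'\s*(?:'
          . $num . '\s*(?:\'\'|"))?)?';
    $sep  = '[\s,\/]+';

    // Leading hemisphere letters: "N 52°35' 0'' / W 2°1' 0''", "N5.43 E51.36"
    $prefix = '/\b([NS])\s*' . $core . $sep . '([EW])\s*' . $core . '/u';
    // Trailing or no hemisphere letters: "34.3330°S, 150.9170°E", "40.64, -8.65"
    $suffix = '/(?<![\d.])' . $core . '\s*([NS])?' . $sep . $core . '\s*([EW])?/u';

    if (preg_match($prefix, $location, $m)) {
        $m += array_fill(0, 11, ''); // make sure unmatched groups exist
        $lat = dmsToDecimal($m[2], $m[3], $m[4], $m[5], $m[1]);
        $lng = dmsToDecimal($m[7], $m[8], $m[9], $m[10], $m[6]);
    } elseif (preg_match($suffix, $location, $m)) {
        $m += array_fill(0, 11, '');
        $lat = dmsToDecimal($m[1], $m[2], $m[3], $m[4], $m[5]);
        $lng = dmsToDecimal($m[6], $m[7], $m[8], $m[9], $m[10]);
    } else {
        return null; // no coordinate pair found
    }

    // Reject values that cannot be a latitude or longitude.
    if (abs($lat) > 90 || abs($lng) > 180) {
        return null;
    }
    return array($lng, $lat);
}
```

Trying the leading-letter pattern first avoids reading the E in "N5.431781 E51.362089" as a trailing letter of the first number. Two plain numbers in an address can still be mistaken for a pair, so the result needs the same sanity check as before.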

Comments

Great, thanks for that update! I tested it and it also works well with the previous test strings.

I will e-mail you about our wish list. Thanks for the suggestion, I will respond in my e-mail.

I'd be more than happy to help with anything else, you have my email.

If I might make a suggestion (or a feature request, maybe?). One thing I thought of when signing up for this service, was that I would like to keep my public location a bit vague (and geeky) like "Orion Arm" or something like that. I think many other users would also like to keep their positions a little less exact than supplying their coordinates.

It would be nice if it were possible to follow @twisst and then send @twisst a direct message (or @reply, since DMs only work for people who are following you as well) with my coordinates (or city name), and have the service store that value in the database. Then, if I move, I can just send @twisst another message with my new location. All the while keeping my vague bio location of "Orion Arm".

Just a thought.

As for the zipcode issue... if the location contains just a zipcode, no matter where it is, it will fail. But if the location contains a full address like "45 Anytown St., Reseda, CA 01335", it will return a false positive with the current function (matching the 45, and the 013 portion of the zipcode, which returns a valid coordinate pair).

So I've added a few tests in the regex (called lookarounds) that make sure we are not starting or stopping a match in the middle of a number and that should take care of zipcodes altogether. I've also cleaned it up a bit, and removed some redundant regex and code.
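For readers who have not used lookarounds before, the idea can be sketched like this. It is a simplified illustration, not the regex from the pastebin, and findDegreeCandidates is a made-up name. A lookbehind and a lookahead refuse any match that starts or stops next to another digit.

```php
<?php
// Hypothetical helper: finds groups of 1 to 3 digits, the way a pattern for
// whole degrees might, with or without lookaround guards.
function findDegreeCandidates($location, $useLookarounds) {
    $pattern = $useLookarounds
        // (?<![\d.]) : not preceded by a digit or dot
        // (?![\d.])  : not followed by a digit or dot
        ? '/(?<![\d.])\d{1,3}(?![\d.])/'
        : '/\d{1,3}/';
    preg_match_all($pattern, $location, $m);
    return $m[0];
}

$address = '45 Anytown St., Reseda, CA 01335';
print_r(findDegreeCandidates($address, false)); // 45, 013, 35
print_r(findDegreeCandidates($address, true));  // 45
```

With the guards in place only one number is left, so the address no longer yields a coordinate pair.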

Here is that new code:
http://twisst.pastebin.com/XJxuxCeh

Please test it and make sure the values returned are the same as before, because I did remove some code that now appeared to be redundant, but it may not have been.

Just found one wrong result: 51°14'0.93N, 0°10'2.60W led to: 51.2335916667, 51.2335916667. Apparently 0 is not a good longitude to be :)

The code is running and it's working great! This guy for instance, @bouwhelm with location "52°15' NB 6°9' OL", got alerts for 37.911694, -87.944977 which is near Evansville, a city in the US.

Thanks to the new code, from now on he will get alerts for his true location in the Netherlands, 52.250000, 6.150000 :-)

@ foster63 yes I think we should somehow ask people to check if we got their location right. I was thinking to give their location a more prominent place on their passes info page, with a map, so they see if something is wrong at a glance. Will add that soon.

Sorry to keep you waiting. I agree Benjam, your code is very good. It's efficient and the translation array is a nice touch.

Don't you think all the zipcodes could get filtered out, by checking if the string contains more than one number?

I am going to replace the existing function with this one right now. We will do a better job at locating new followers, but I will also have it take a new look at all coordinates in our followers list to undo previous errors.

Thank you so much for helping out! From what I read on your blog, your marriage is based on geekiness anyway, so I guess more geekiness is good ;-)

There are a lot of fun features we would like to add to Twisst. Let me know if you would like me to e-mail you some of those ideas to see if you'd like to help with those too!

Ok, here is an updated version that seems to be working for all the new examples.

http://twisst.pastebin.com/3fGwQ4gF

The + was not a problem. I also added a translate array that you can easily add to. The array key is the original-language cardinal point ('NB', 'OL', etc.), and the array value is the English equivalent (I guessed at the ones that were in there, so those might be incorrect).
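A translate array along these lines could look like the sketch below. This is an illustration of the idea, not the pastebin code. The function name translateCardinals is invented, and the table only holds the Dutch abbreviations and the English words that appear in the test strings.

```php
<?php
// Hypothetical sketch of a translation table for cardinal points.
function translateCardinals($location) {
    $translate = array(
        'NB'    => 'N', // noorderbreedte (north latitude)
        'ZB'    => 'S', // zuiderbreedte (south latitude)
        'OL'    => 'E', // oosterlengte (east longitude)
        'WL'    => 'W', // westerlengte (west longitude)
        'North' => 'N',
        'South' => 'S',
        'East'  => 'E',
        'West'  => 'W',
    );

    // Build one alternation from the keys and only replace whole words,
    // so that a place name like "Northampton" is left alone.
    $keys = array_map(function ($key) {
        return preg_quote($key, '/');
    }, array_keys($translate));
    $pattern = '/\b(' . implode('|', $keys) . ')\b/';

    return preg_replace_callback($pattern, function ($m) use ($translate) {
        return $translate[$m[1]];
    }, $location);
}

echo translateCardinals("52°15' NB 6°9' OL"), "\n";
```

A single letter such as 'O' is deliberately left out of this sketch, because it means east in Dutch and German but west in French and Spanish.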

I also added a test to make sure the values are within reason (lat between -90 and 90, and long between -180 and 180), which will catch most of the zipcodes, but some may still get through (mostly US east coast ones). One thing that can be done is to check that the value has a decimal point (how many people live at exactly 40N 112W?).
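Both checks fit in a few lines. The sketch below uses a made-up function name, looksLikeCoordinates, and takes the values as the strings found in the location, so that the decimal-point test can look at what the user actually typed.

```php
<?php
// Hypothetical helper: range check plus an optional "has a decimal point" check.
function looksLikeCoordinates($lat, $lng, $requireDecimalPoint = false) {
    if (!is_numeric($lat) || !is_numeric($lng)) {
        return false;
    }
    // Latitude must be within -90..90, longitude within -180..180.
    if (abs((float)$lat) > 90 || abs((float)$lng) > 180) {
        return false;
    }
    // Optional stricter test: whole numbers are suspicious (zipcodes, house numbers).
    if ($requireDecimalPoint) {
        return strpos((string)$lat, '.') !== false
            && strpos((string)$lng, '.') !== false;
    }
    return true;
}

var_dump(looksLikeCoordinates('52.043900', '5.555832')); // bool(true)
var_dump(looksLikeCoordinates('91335', '45'));           // bool(false)
var_dump(looksLikeCoordinates('40', '-112', true));      // bool(false)
```

The decimal-point test only makes sense for values written in decimal notation. Degrees and minutes without seconds will often convert to round-looking numbers.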

But there you go. Nothing will ever be perfect for all cases, the most you can hope for is a sizable chunk.

Let me know if you have any further questions. I'm glad to help. Love the service. Have already seen a few passes thanks to you. (Although my wife is further convinced of my geekiness because of it ;) )

Those extra test strings as function calls:

var_dump(geoCodeiPhone('iPhone: -52.043900,-5.555832'));
var_dump(geoCodeiPhone('34.3330°S, 150.9170°E'));
var_dump(geoCodeiPhone('-34.3330°, 150.9170°'));
var_dump(geoCodeiPhone('N 52°35\' 0" / W 2°1\' 0"'));
var_dump(geoCodeiPhone('N 35 32 4.678 W 124 23 18.234'));
var_dump(geoCodeiPhone('- 35 32.678 - 124 23.234'));

var_dump(geoCodeiPhone('Reseda, Ca. 91335'));
var_dump(geoCodeiPhone('38d 33m 06.32sN 121d 29m 04.75'));
var_dump(geoCodeiPhone('Pre: 39.8714500,-105.01943'));
var_dump(geoCodeiPhone('51° 50\' North 006°50\' East'));
var_dump(geoCodeiPhone('35°29\'18.15′N 76°37\'06.18W'));
var_dump(geoCodeiPhone('52°15\' NB 6°9\' OL'));

var_dump(geoCodeiPhone('53°15\'10.69"N 5°15\'6.87"O'));
var_dump(geoCodeiPhone('+29° 35\' 3.6, -95° 13\' 45.1'));
var_dump(geoCodeiPhone('40° 0\' N 105° 16\' W'));
var_dump(geoCodeiPhone('Hengelo, 52.2670°N, 6.8000°E'));
var_dump(geoCodeiPhone('35°57\'17.10 N, 84°06\'03.79 W'));
var_dump(geoCodeiPhone('30°19′10″N 81°39′36″'));

var_dump(geoCodeiPhone('1.3819056°N 103.8448167°E'));
var_dump(geoCodeiPhone('1°16\'S & 36°48\'E'));
var_dump(geoCodeiPhone('ÜT: 52.07764,4.35581'));

@ Redshift42 Glad you had a fun learning day :-) And thanks! Unfortunately I am getting "Warning: preg_match(): Compilation failed: invalid UTF-8 string at offset 30" on both lines 6 and 30 when using the code you supplied. Tried to get it to work, but no luck I'm afraid.

@ Benjam & Simon & Matt Thanks for your trouble! Here's what I got when running the code:

TEST STRINGS

iPhone: -52.043900,-5.555832
34.3330°S, 150.9170°E
-34.3330°, 150.9170°
N 52°35' 0" / W 2°1' 0"
N 35 32 4.678 W 124 23 18.234
- 35 32.678 - 124 23.234

MATT

-5.555832, -52.043900
150.9170, -34.3330
150.9170°, -34.3330°
-2.01666666667, 52.5833333333
18.234, 4.678
23.234, 32.678

BENJAM

-5.555832, -52.0439
150.917, -34.333
150.917, -34.333
-2.01666666667, 52.5833333333
-124.388398333, 35.5346327778
-124.387233333, -35.5446333333

So Matt's code leaves a few degree characters in, and it doesn't compute the last two correctly. Benjam's code however seems to be perfect!

To make sure this is working well in the real world, I got some more exotic cases from the live database for us to play with:

Reseda, Ca. 91335 (should not result in a false positive)
38d 33m 06.32sN 121d 29m 04.75 (I'm inclined to say we can't win them all)
Pre: 39.8714500,-105.01943
51° 50' North 006°50' East
35°29'18.15′N 76°37'06.18W
52°15' NB 6°9' OL (NB and OL are Dutch abbreviations. In these cases the only thing we could try to do is guess I suppose.)
53°15'10.69"N 5°15'6.87"O
+29° 35' 3.6, -95° 13' 45.1 (is that plus a problem?)
40° 0' N 105° 16' W
Hengelo, 52.2670°N, 6.8000°E
35°57'17.10 N, 84°06'03.79 W
30°19′10″N 81°39′36″
1.3819056°N 103.8448167°E (despite the N and E, the coordinates already are decimal)
1°16'S & 36°48'E
ÜT: 52.07764,4.35581

When we test these and it goes well, do you guys agree that Benjam's code is ready to be implemented?

Okay, to be honest I'm an American expat living over here and I wasn't sure so I looked it up. Both seemed to be acceptable, but back on topic. Benjam's solution looks more complete. It is nicer having a single regex. More efficient. A regex, no matter how clever, would never be enough to solve the problem as there is logic involved in deciding how to recombine the terms and there are calculations when the degree formats are used. The solution looks good.

Not really sure how you might look for the character encoding, my text editor just tells me which file encoding it's using.

Often, a UTF-encoded file has a byte order mark (BOM) at the beginning of the file, but you'll need to view the hexadecimal code of the file to see it, as most editors ignore it.
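If the editor won't tell you, PHP itself can. A minimal sketch, with made-up helper names hasUtf8Bom and isValidUtf8: the first looks for the three BOM bytes, the second uses the fact that preg_match() with the /u modifier returns false for a subject that is not valid UTF-8.

```php
<?php
// Hypothetical helper: true when the bytes start with the UTF-8 byte order mark.
function hasUtf8Bom($bytes) {
    return strncmp($bytes, "\xEF\xBB\xBF", 3) === 0;
}

// Hypothetical helper: true when the bytes are valid UTF-8.
function isValidUtf8($bytes) {
    // An empty pattern matches any valid subject; with /u, preg_match()
    // returns false instead of 1 when the subject is not valid UTF-8.
    return preg_match('//u', $bytes) === 1;
}

// The degree sign is the two bytes C2 B0 in UTF-8, but the single byte B0
// in ISO-8859-1, which is not valid UTF-8 on its own.
var_dump(isValidUtf8("52\xC2\xB035'")); // bool(true)
var_dump(isValidUtf8("52\xB035'"));     // bool(false)
```

Running both helpers on file_get_contents() of the script file would show whether the file was saved as UTF-8. That matters because a pattern containing a literal ° from an ISO-8859-1 file is not valid UTF-8 once the /u modifier is used.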

Anywho... give this revised regex a try and see if it helps:
http://twisst.pastebin.com/RSsa9nx8

I made the portion of the regex that is looking for the in-between character allow more than one character. It may allow for more false matches, but it seemed to work fine with the examples in the file.

I probably should have checked the comments again sooner, but I had a fun day of learning a bit of PHP. ;-)

Here's my version, which should be fairly flexible with regard to formatting (whitespace, double ' vs " as the seconds mark, etc.)

function geoCodeAll($location) {

    preg_match("/(?P<hemi_lat1>[NS]?)\s*(?P<lat>-?\d+[\.|,]\d+)°?\s*(?P<hemi_lat2>[NS]?)[,\/\s]+(?P<hemi_lng1>[EW]?)\s*(?P<lng>-?\d+[\.|,]\d+)°?\s*(?P<hemi_lng2>[EW]?)/u", $location, $elements);

    // If we found digital coordinates, with or without degree symbol and/or NSEW
    if (isset($elements["lat"]) && isset($elements["lng"]))
    {
        //print "Digital\n";

        $lat = str_replace(',', '.', $elements["lat"]);
        $lng = str_replace(',', '.', $elements["lng"]);

        if ($elements["hemi_lat1"] == 'S' || $elements["hemi_lat2"] == 'S' )
            $lat *= -1;
        if ($elements["hemi_lng1"] == 'W' || $elements["hemi_lng2"] == 'W' )
            $lng *= -1;

        return array($lng,$lat);

    }
    else
    {
        $elements = NULL;

        preg_match("/(?P<hemi_lat1>[NS]?)\s*(?P<deg_lat>\d+)°[,\s]*(?P<min_lat>\d+)\'[,\s]*(?P<sec_lat>\d+(?:[\.|,]\d+)?)[\'\"]+\s*(?P<hemi_lat2>[NS]?)[,\/\s]*(?P<hemi_lng1>[EW]?)\s*(?P<deg_lng>\d+)°[,\s]*(?P<min_lng>\d+)\'[,\s]*(?P<sec_lng>\d+(?:[\.|,]\d+)?)[\'\"]+\s*(?P<hemi_lng2>[EW]?)/u", $location, $elements);

        if (isset($elements["deg_lat"]) && isset($elements["deg_lng"]))
        {
            //print "Degrees\n";

            $elements["sec_lat"] = str_replace(',', '.', $elements["sec_lat"]);
            $elements["sec_lng"] = str_replace(',', '.', $elements["sec_lng"]);

            $lat = $elements["deg_lat"] + $elements["min_lat"]/60 + $elements["sec_lat"]/3600;
            $lng = $elements["deg_lng"] + $elements["min_lng"]/60 + $elements["sec_lng"]/3600;

            if ($elements["hemi_lat1"] == 'S' || $elements["hemi_lat2"] == 'S' )
                $lat *= -1;
            if ($elements["hemi_lng1"] == 'W' || $elements["hemi_lng2"] == 'W' )
                $lng *= -1;

            return array($lng,$lat);

        }

    }

    //print "Neither\n";

    return array("Parse","failure");
}

very strange indeed.

It looks like it's missing most of the regex match in the first match group.

The match[0][0] should be the full coordinate like: "N 52°35' 0"

What character encoding are you using for the file? I'm using ISO-8859-1

It might be converting the character to a two-byte UTF-16 character and getting lost in the second byte of the "strange character"? Not really sure.

hmmm... strange.
Comment out all the debugging lines except for the 7th entry (line 97), and then add the following after line 31:

var_dump($coords);

and get me the new output and I'll see what's going on.

p.s.- I love my eee =) very compact, a bit small for development work though.

Please note that I know next to nothing about PHP, I'm just doing some monkey work and yes, I know the bit at the end is for debugging only.

simon@eee:~$ php -v
PHP 5.3.2-1ubuntu4.2 with Suhosin-Patch (cli) (built: May 13 2010 20:01:00)
Copyright (c) 1997-2009 The PHP Group
Zend Engine v2.3.0, Copyright (c) 1998-2010 Zend Technologies
simon@eee:~$

Also, that output is merely a debugging output that should be ignored in production use.

All lines after 89 should be deleted from the code in production use.

@Simon- hmmm... my 7th result is the following:

array(2) {
[0]=>
float(-2.0166666666667)
[1]=>
float(52.583333333333)
}

based on the input of
N 52°35' 0" / W 2°1' 0"

so it looks correct to me.

What version of PHP are you running? (I'm on 5.3.2)

Also, it should be known that my script is based on the few examples given on this post, in addition to the few permutations that I could think up. It is by no means fully exhaustive, and there are more than likely a few edge cases I missed, but it works for most permutations (all of them I tested). And if some come in it doesn't catch, the regex and code can be easily modified to grab those as well.

Test it with a few different syntaxes and it should work for most of them, let me know if it doesn't. You have my email address.

Just ran the script (I think) and got this output:

array(2) {
[0]=>
float(5.555832)
[1]=>
float(52.0439)
}
array(2) {
[0]=>
float(5.555832)
[1]=>
float(-52.0439)
}
array(2) {
[0]=>
float(-5.555832)
[1]=>
float(52.0439)
}
array(2) {
[0]=>
float(-5.555832)
[1]=>
float(-52.0439)
}
array(2) {
[0]=>
float(150.917)
[1]=>
float(34.333)
}
array(2) {
[0]=>
float(150.917)
[1]=>
float(-34.333)
}
array(2) {
[0]=>
float(1)
[1]=>
float(52)
}
array(2) {
[0]=>
float(-124.38839833333)
[1]=>
float(35.534632777778)
}
array(2) {
[0]=>
float(-124.38723333333)
[1]=>
float(-35.544633333333)
}

Two questions:

1. Does this look like the correct output, i.e. am I running the script correctly,

2. Do the results look correct? To my eye the seventh set of results look suspicious. (Hey that's my location!)

Jaap,

I'm liking the look of benjam's script. What do you think? I'm afraid I'm strictly a Python man when it comes to scripting.

1. Sometimes you can't force the problem into a regexp and it's better to program it.

2. stackoverflow.com is a great place to get help with problems like this.

3. There is a php command-line executable. It's included with the download from the php site.

I don't know of any online service that will run your PHP for you (would be a fairly large security risk for that service to do so). But there are places like pastebin that allow you to paste your code and others can edit.

See my pastebin entry in my comment below.

Jaap,

Agreed, If there was somewhere to run this script I'd be glad to help with the debugging. Anyone know whether I can run PHP from the command line?

Simon, Matt, let's keep it on-topic please! Code is more important than spelling, I'm sure the Queen will agree ;-)

I am sure I saw an online script debugger somewhere... Would be nice to share the same workspace to try different solutions together. Cannot find it now though, so I will run the functions locally and report back. Hope you will do the same!

Matt,

Your queen, my queen, she still spells it *rigorously*. Don't assume everyone you meet on the net is an American.

Give this solution a try, it has a pretty comprehensive regex and matches many permutations of the various inputs.

If you have any questions, please feel free to email me.

Love the service, and am happy to help any way I can.

Solution code: http://twisst.pastebin.com/5mnbgE0y

All joking aside, that is the accepted British spelling my friend :) Queen's English. See my followup comment. The first version was wrong.

Matt, please spell check *rigorously*. Only joking :-)

PS. That solution looks promising.

Feeding coordinates into the geocoding services of Yahoo and Google will sometimes work, but it is not very reliable. Even for well formatted coordinate points they usually return an error.

@ hsmade thanks for the regexp, looks good!
Would be nice to have one unified test for all kinds of coordinates, but I suppose it's easier to first check what kind of notation we're dealing with anyway.

Sorry, I misinterpreted an algorithm for converting from degs to decimal. New version of cleanLatLong:

function cleanLatLong($token) {
    $token = str_replace(' ', '', $token);
    $token = str_replace(',', '.', $token);
    if (preg_match("/°([NSEW])/", $token, $m)) {
        $token = $m[1] . substr($token,0,strlen($token)-2);
    }
    switch (substr($token,0,1)) {
        case 'S':
        case 'W':
            $token = '-'.substr($token,1);
            break;
        case 'N':
        case 'E':
            $token = substr($token,1);
    }
    if (preg_match("/(-?)(\d+)[°:](\d+)[':](\d+)/", $token, $m)) {
        $token = $m[1] . ($m[2] + $m[3]/60 + $m[4]/3600);
    }
    return $token;
}

I think I have a solution. It may not be the cleanest but should work. Please test rigourously:

function geoCodeiPhone2($location) {

    // see if there are coordinates in the location, e.g. "iPhone: 52.043900,5.555832". if so, use those coordinates as lat & lng

    preg_match_all("/[-NSEW]?\d+[.,]\d+°?[NSEW]?/", $location, $coords, PREG_SET_ORDER);
    $lat = $lng = 0;

    if (count($coords) < 2) {
        // probably using degree, minute, second format
        preg_match_all("/[NSEW]+\s*\d+°?\s*\d+[':\.]\s*\d+/", $location, $coords, PREG_SET_ORDER);
    }

    // only use the result when both a latitude and a longitude were found
    if (count($coords) >= 2) {
        $lat = cleanLatLong($coords[0][0]);
        $lng = cleanLatLong($coords[1][0]);
    }

    return array($lng, $lat);
}

function cleanLatLong($token) {
    $token = str_replace(' ', '', $token);
    $token = str_replace(',', '.', $token);
    if (preg_match("/°([NSEW])/", $token, $m)) {
        $token = $m[1] . substr($token,0,strlen($token)-2);
    }
    switch (substr($token,0,1)) {
        case 'S':
        case 'W':
            $token = '-'.substr($token,1);
            break;
        case 'N':
        case 'E':
            $token = substr($token,1);
    }
    if (preg_match("/(-?)(\d+)[°:](\d+)[':](\d+)/", $token, $m)) {
        $token = $m[1] . $m[2] . '.' . ($m[3]/60 + $m[4]/3600);
    }
    return $token;
}

Brilliant idea! Let's get them all to use the same format! I'm not holding my breath.

Chris raises an interesting question. I'm guessing that Google will successfully decode most of these formats, though I don't know which API you're using. Can you not use the raw data, perhaps with any prefix (eg. iPhone) stripped off?

Posted this originally as a Tweet but posting it here as requested to 'help the discussion along'

I am a programmer (not PHP). My first guess would be to replace "N" or "E" with "+" and replace "S" or "W" with "-" then continue.

When I entered this into the Google Maps web interface it correctly located it - can't you send it to the geocoder by default?

34.3330°S, 150.9170°E => /\d+\.\d+°[SN], +\d+\.\d+°[EW]/
N5.431781 E51.362089 => /[NS]\d+\.\d+ +[EW]\d+\.\d+/
N 52°35' 0'' / W 2°1' 0'' => /[NS] *\d+° *\d+' *\d+(?:''|")[ \/]*[WE] *\d+° *\d+' *\d+(?:''|")/

I've been looking into this before.. I can't help you with the regexp, but the converting (after you figured out the minutes (and maybe seconds)) is fairly easy:

Since you know that 15' is '15 minutes', divide the minutes by 60 to get a fraction of a degree:

N50 15.000' becomes 50.250 degrees decimal. The seconds (") work the same way, except that you divide them by 3600.

N50 15' -> 50 + ($mins / 60) = 50,25
N50 15' 15" -> 50 + ($mins / 60) + ($secs / 3600) = 50,2542

Not sure if my calculations are correct, but you'll hopefully catch my drift :)

Remco
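For reference, the degrees, minutes and seconds conversion discussed in this comment boils down to degrees + minutes/60 + seconds/3600. A minimal sketch, with degMinSecToDecimal as a made-up name:

```php
<?php
// Hypothetical helper: decimal degrees = degrees + minutes/60 + seconds/3600.
function degMinSecToDecimal($deg, $min = 0.0, $sec = 0.0) {
    return $deg + $min / 60 + $sec / 3600;
}

printf("%.4f\n", degMinSecToDecimal(50, 15));     // N50 15'     -> 50.2500
printf("%.4f\n", degMinSecToDecimal(50, 15, 15)); // N50 15' 15" -> 50.2542
```

The sign for S and W still has to be applied afterwards, to the whole value rather than to the degrees alone.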