Topic: How to extract server/domain from a URL in char[]

Hello All,

I have a question on how to extract the hostname/server/IP address from a URL stored in a char[]. I cannot seem to find a way that works robustly.

For example, in my code I have this:
Code:
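char completeURL[256];
int length = 0;

while (Serial.available()) {
   char c = Serial.read();
   // store the character only while there is room, keeping one slot for '\0'
   // (the original wrapped at 512, which is past the end of a 256-byte buffer)
   if (length < (int)sizeof(completeURL) - 1) {
      completeURL[length++] = c;
   }
   delay(50);
}
completeURL[length] = '\0';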
So, a user can enter a URL via the serial terminal. This may be the content of completeURL for example:
Code:
char completeURL[] = "http://www.google.com/search?q=arduino";
What I am trying to do is to extract the server from this and the rest of the URL so I have those in two separate char[]'s, like this:

Code:
char serverAddress[] = "www.google.com";
char restOfURL[] = "/search?q=arduino";

Now, the contents of completeURL may or may not contain the 'http://' part, or the 'www.'. Also, the top-level domain is not guaranteed to be one specific one. In addition, if the user only enters a server (e.g. www.google.com), it may not contain a trailing /.

I have tried to come up with a way to separate the server address and the rest, but it quickly becomes a spaghetti of if...then and loops, not resulting in the correct separation in all cases.

How would the more experienced programmers here tackle such a challenge? I understand regular expressions are not an option with Arduino (not that I am an expert in those). I hope someone can point me in a helpful direction.

Regards,
Arno
Logged


The problem here is that people are used to software fixing their URLs for them.  Check out Wikipedia's article on URLs:  http://en.wikipedia.org/wiki/Uniform_Resource_Locator#Syntax

Depending on the context of the URL, programs attempt to fix the "scheme" parameter if it is missing.  This works well in contexts where you know the usual method of retrieving that data type and which port that usually happens on.  For example, web browsers usually look on HTTP port 80 for web page data, so browsers automatically prepend http:// to the domain name and see if they get a response.  This doesn't always work, though, for example if you're really looking for an FTP site.  So really, unless you know the context, a missing "scheme" parameter indicates a malformed URL.

Check out http://en.wikipedia.org/wiki/URL_normalization.  It's a good list of things you should do too, if you want it to work at least as well (poorly?) as web browsers do.
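
One of the simpler items on that normalization list is lowercasing the scheme and host, since both are case-insensitive. A minimal sketch of that step on a plain char[], assuming the only things to handle are an optional "scheme://" prefix and an optional path (the function name is made up for illustration, and ports, user info and the like are not considered):

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical helper: lowercases the scheme and host of url in place.
   The path and query are left untouched because they can be case-sensitive. */
void lowercase_scheme_and_host(char *url)
{
    char *p = url;
    char *first_slash = strchr(url, '/');

    /* "scheme://": the first '/' follows a ':' and is followed by another '/' */
    int has_scheme = first_slash != NULL && first_slash > url &&
                     first_slash[-1] == ':' && first_slash[1] == '/';
    char *host_start = has_scheme ? first_slash + 2 : url;
    char *host_end = strchr(host_start, '/');  /* NULL when there is no path */

    /* Stop at the first '/' after the host, or at the end of the string. */
    for (; *p != '\0' && p != host_end; p++)
        *p = (char)tolower((unsigned char)*p);
}
```

Called on a buffer holding "HTTP://WWW.Google.COM/Search?Q=Arduino", this should leave "/Search?Q=Arduino" as it was and only change the part before it.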

That being said, your spaghetti code is probably on the right track.


You'd probably have the best luck using the String library.  Otherwise, checking individual array locations gets to be a pain.  The String library also gives you a substring function, so you'll easily be able to search for the "http://" and the "www." strings and remove them from the beginning of your complete URL.  There are quite a few protocols that you'd need to check for and remove from the beginning, but getting all the standard ones might be good enough to do the job.

Here's a list of the scheme parameters you may need to parse out:  http://en.wikipedia.org/wiki/URI_scheme

Once those are removed from the front, simply search for the first "/" you find.  Everything before that will be the domain.  Everything else (including the "/") will be the rest of the URL, obviously.
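
If you would rather stay with char arrays than String, the two steps above can be sketched with strchr() from <string.h>. This is only a rough sketch under a few assumptions: instead of checking a list of known schemes it treats anything of the form "something://" at the front as a scheme, and the helper names are made up for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns a pointer just past "scheme://" if the URL
   starts with one, otherwise the start of the URL. */
const char *skip_scheme(const char *url)
{
    const char *first_slash = strchr(url, '/');

    /* A scheme is present when the first '/' follows ':' and precedes '/'. */
    if (first_slash != NULL && first_slash > url &&
        first_slash[-1] == ':' && first_slash[1] == '/')
        return first_slash + 2;
    return url;
}

/* Hypothetical helper: returns a pointer to the first '/' after the host,
   or to the terminating '\0' when the URL has no path. */
const char *find_path(const char *host_start)
{
    const char *slash = strchr(host_start, '/');
    return slash != NULL ? slash : host_start + strlen(host_start);
}
```

The domain is then the run of characters from skip_scheme(url) up to (not including) find_path(...), and the rest of the URL starts at the pointer find_path() returns.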

Good Luck, parsing sucks.
« Last Edit: July 01, 2011, 04:37:24 pm by BKnight760 » Logged


Thanks for your reply BKnight760,

I really want to avoid using Strings for two reasons. First, I am going to do a DNS lookup with that server address to resolve its IP address. But most importantly (to me, that is): until now, whenever I saw something with char arrays I steered away and used Strings instead, as they seem easier to work with. I feel like I should wrap my head around this structure as well :-)

The approach I am taking is:
I first get all the data the user enters, then add '\0' to terminate the char[].

Next, I traverse through the char[] looking for a '/'. If I find one, I'll look for another one directly before or after the found /. If those are not there, then I know I am not dealing with the 'http://' part and everything before the found / is the server address part. Everything after it should be the rest of the url.

If no / is found (apart from the ones in http://, maybe), then the address the user entered is only a site's domain name.

Once I manage to get something robust, I will post here for others to use as well.
Logged


All,

It took a bit of fiddling with the code but I managed to come up with a piece of code that does this:
Code:
      // Read the input from the terminal
      // (assumes requestedURL is declared as a char array and length starts at 0)
      while (Serial.available()) {
        char c = Serial.read();
        // store the character only while there is room, keeping one slot for '\0'
        if (length < (int)sizeof(requestedURL) - 1) {
          requestedURL[length++] = c;
        }
        delay(50);
      }
      // terminate the input
      requestedURL[length] = '\0';
     
      // Now, find the spot where the domain/server ends and the rest of the URL starts
      int splitLocation = 0;
      boolean foundSplit = false;
     
      // Go through the terminal input, one character at a time and check for a forward-slash
      // if found, make sure the character before or after is not a forward-slash also, because
      // in that case we are dealing with the http:// part of a url.
      // if we find a single forward-slash, remember the position where that is in the char[]
      for (int i=0; i < length; i++) {
        if (requestedURL[i] == '/') {
          // only look at requestedURL[i-1] when i > 0, to stay inside the array
          boolean slashBefore = (i > 0 && requestedURL[i-1] == '/');
          // requestedURL[i+1] is always valid here: at worst it is the terminating '\0'
          boolean slashAfter = (requestedURL[i+1] == '/');
          if (!slashBefore && !slashAfter) {
            foundSplit = true;
            splitLocation = i;
            break;  // the first single forward-slash is the one we want
          }
        }
      }
     
      // if foundSplit is true, then requestedURL can be split, if false, there is only a domain/server and no page
      // now store the domain/server part in hostName[]
      // (hostName and pagePartOfURL must be at least as large as requestedURL)
      if (foundSplit == true) {
        for (int j=0; j < splitLocation; j++) {
          hostName[j] = requestedURL[j];
        }
       
        // and the rest in pagePartOfURL[]
        for (int k=splitLocation; k < length; k++) {
          pagePartOfURL[k-splitLocation] = requestedURL[k];
        }
        hostName[splitLocation] = '\0';
        pagePartOfURL[length-splitLocation] = '\0';
      } else {
        // no page part: the whole input is the domain/server
        strcpy(hostName, requestedURL);
        pagePartOfURL[0] = '\0';
      }

Hopefully this may help others. And if anyone reading this has tips, tricks, improvements, etc, please let me know as I am sure there is room for improvement here!
Logged


I would start by getting any possible "http://" out of the way, something like

Code:
int string_index = 0;  // assume there's no http://

if (memcmp ( "http://", requestedURL, 7) == 0)
    string_index = 7;

// now we can start with a clean string
// do stuff starting at requestedURL[string_index];
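
To show where that can lead, here is a rough sketch that starts from the same "skip http:// first" idea and then splits at the first '/' using strchr(). A few things in it are assumptions rather than anything from this thread: the function name and its return convention are made up, strncmp() is swapped in for memcmp() so a very short input is never read past its terminator, only a lowercase "http://" prefix is recognised, and the page part is simply left empty when there is no '/'.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies the server part of url into host and the
   remainder (starting at the first '/') into rest.
   Returns 0 on success, -1 if either piece does not fit its buffer. */
int split_http_url(const char *url,
                   char *host, size_t host_size,
                   char *rest, size_t rest_size)
{
    size_t start = 0;  /* assume there's no http:// */

    if (strncmp(url, "http://", 7) == 0)
        start = 7;

    const char *host_start = url + start;
    const char *slash = strchr(host_start, '/');
    size_t host_len = slash ? (size_t)(slash - host_start) : strlen(host_start);
    size_t rest_len = slash ? strlen(slash) : 0;

    /* Each buffer needs room for its text plus the terminating '\0'. */
    if (host_len + 1 > host_size || rest_len + 1 > rest_size)
        return -1;

    memcpy(host, host_start, host_len);
    host[host_len] = '\0';
    if (slash != NULL)
        memcpy(rest, slash, rest_len);
    rest[rest_len] = '\0';
    return 0;
}
```

In a sketch it could be called as split_http_url(requestedURL, hostName, sizeof(hostName), pagePartOfURL, sizeof(pagePartOfURL)), assuming hostName and pagePartOfURL are declared as char arrays. Passing the buffer sizes in means an overlong URL is rejected instead of overflowing either array.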

______
Rob
Logged

Rob Gray aka the GRAYnomad www.robgray.com


Hi Rob,

Thanks for the valuable addition! It eliminates the need for the logic that looks for a double forward-slash (which occurs in the protocol identifier http://).

Arno
« Last Edit: July 04, 2011, 01:20:39 pm by aetjansen » Logged
