Monday, November 28, 2011

Build your own Whois Lookup with ASP.NET and jQuery


IP addresses can reveal a lot about your web visitors. For an ecommerce site, the owner or registrant of the visitor's IP address can be very useful information. You could paste each visitor IP address into one of the many free IP Lookup sites available, but that can be time-consuming and tedious. Or you could build your own lookup. This article shows how this can be done pretty quickly.

Visitor IP addresses are stored in the site log file. If your IIS is set up to log using the W3C Extended Log File Format, the fields written to the log file are as follows:
date
time
s-sitename
s-ip
cs-method
cs-uri-stem
cs-uri-query
s-port
cs-username
c-ip
cs(User-Agent)
sc-status
sc-substatus
sc-win32-status
Each request is logged on a separate line, with a space as the delimiter. The visitor's (client) IP address is stored against c-ip, and the server IP address is stored as s-ip. It's the c-ip address we are interested in. A log file is typically generated for each day's activity, and will contain multiple entries for each visitor, even if they only visited one page on the site. There will be an entry for every file requested, which includes each page and every image, JavaScript file and CSS file associated with it. One way to query the contents of the log file is to read it into a database and then use SQL. However, this example will use Regular Expressions to parse the file for IP addresses and then work from the results.
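Because the log is space-delimited, a regular expression is not the only way to get at c-ip. As a rough sketch of the alternative, the following reads one named field from each request line by its position. It assumes the log file contains the #Fields: directive line that names the columns, and that field values contain no spaces. The class and method names (LogFieldSketch, ReadField) are made up for illustration and are not part of the application built in this article.

```csharp
using System;
using System.Collections.Generic;

public static class LogFieldSketch
{
    // Returns the value of the named field (for example "c-ip") from every
    // request line. Assumption: a "#Fields:" directive precedes the entries.
    public static List<string> ReadField(IEnumerable<string> lines, string fieldName)
    {
        var values = new List<string>();
        int index = -1; // column position of the field, unknown until #Fields: is seen

        foreach (string line in lines)
        {
            if (line.StartsWith("#Fields:", StringComparison.Ordinal))
            {
                string[] names = line.Substring("#Fields:".Length)
                    .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
                index = Array.IndexOf(names, fieldName);
                continue;
            }

            // Skip blank lines, other directives, and anything before #Fields:
            if (line.Length == 0 || line[0] == '#' || index < 0)
                continue;

            string[] parts = line.Split(' ');
            if (parts.Length > index)
                values.Add(parts[index]);
        }
        return values;
    }
}
```

Calling LogFieldSketch.ReadField(File.ReadLines(path), "c-ip") would give the client address for every request, duplicates included, without needing to filter out the server address afterwards.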
The "application" (it's just an aspx file and its code-behind) needs to do 4 things:
  1. It needs to read a given log file and get IP addresses from it
  2. It needs to identify distinct IP addresses and display them
  3. It needs to provide a mechanism by which IP addresses are resolved
  4. It needs to display the results
First we need a web form:

<%@ Page Language="C#" AutoEventWireup="true" CodeFile="IPLookup.aspx.cs" Inherits="IPLookup" %>

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml"> 
<head runat="server"> 
    <title>Untitled Page</title>
    <style type="text/css">
        body { font-family:Verdana;font-size:76%; }
        pre { font-size:10pt; }
        span { cursor:pointer; }
        #dns { float:left; }
        #calendar { float:left;width:350px; }
        #loading { left:300px;z-index:100;position:absolute; }
    </style>
</head> 
<body> 
    <form id="form1" runat="server">

    <div id="calendar">
      <asp:Calendar ID="Calendar1" runat="server"
        onselectionchanged="Calendar1_SelectionChanged" />
      <asp:Literal ID="ip_addresses" runat="server" />
    </div>
    <div id="dns"></div>
    <div id="loading"></div>
    </form>
</body> 
</html> 

There are 3 divs. One contains a Calendar control and a Literal, while the other 2 are empty. You can see from the CSS in the head section that the Calendar div is floated to the left and has a set width. This means that the dns div will sit to the right of it. The final div is absolutely positioned and stacked quite high relative to the page through its z-index. The Calendar control has its onselectionchanged event wired to an event handler in the code-behind.
The event handler will perform the first 2 functions of the application. It will parse the nominated log file for IP addresses and deliver them to the Literal control, having got rid of duplicates.

protected void Calendar1_SelectionChanged(object sender, EventArgs e)
{
    ip_addresses.Text = "";
    List<string> IPs = new List<string>();
    StringBuilder sb = new StringBuilder();
    DateTime date = Calendar1.SelectedDate;
    string filedate = string.Format("{0:yyMMdd}", date);
    string path = @"D:\Logs\ex" + filedate + ".log";

    if (File.Exists(path))
    {
        string content;
        using (StreamReader sr = new StreamReader(path))
        {
            content = sr.ReadToEnd();
        }

        Regex re = new Regex(@"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b");
        MatchCollection mc = re.Matches(content);
        foreach (Match mt in mc)
        {
            if (mt.ToString() != "xxx.xxx.xxx.xxx")
                IPs.Add(mt.ToString());
        }

        var result = IPs.Select(i => i).Distinct().ToList();

        foreach (string ip in result)
        {
            sb.Append("<span>" + ip + "</span>\n");
        }

        ip_addresses.Text = "<pre>" + sb.ToString() + "</pre>";

    }
    else
    {
        ip_addresses.Text = "No logs available for that date";
    }
}

There are 40-odd lines of code here, so let's see what's going on in there. Some variables are set at the top, including the text for the Literal control, a generic List<string> and the format for the Calendar's selected date. This is then used to locate the log file that matches the selected date. All IIS log files are saved (very conveniently) with file names in the following format: ex + yyMMdd + .log. So today's (April 13th 2009) will be ex090413.log. If a file exists that matches the selected date, its contents are read into a string through a StreamReader object. Once there, the contents are subjected to a Regular Expression matching exercise, which looks for anything matching the pattern of an IP address. All IPv4 addresses consist of 4 groups of numbers separated by dots. Each group can contain 1, 2 or 3 digits.
Now, if you remember, the server's IP address is logged for every request (s-ip). We don't want to spend time resolving that or using memory adding it to the List<string> for every instance of it in the file, so the code checks to see if the current match from the Regex is NOT the server IP address (replace the xxx.xxx.xxx.xxx with your server IP address), and if so, it adds it to the List<string>. This results in a list potentially full of duplicates, so a LINQ query with a Lambda Expression is used to extract just the Distinct values. Each of the distinct IP addresses is then placed in a <span> element and then the whole thing is set as the Text property of the Literal control.
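To make the file naming and the matching steps easier to try in isolation, here is a minimal sketch that pulls them out of the event handler. It is not the code from the page above: the class and method names are invented for the example, the pattern uses \b word boundaries around the four groups of digits, and the server address is passed in as a parameter rather than hard-coded.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Text.RegularExpressions;

public static class DistinctIpSketch
{
    // Four groups of 1 to 3 digits separated by dots, bounded by word breaks.
    private static readonly Regex IpPattern =
        new Regex(@"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b");

    // Builds the file name described above: ex + yyMMdd + .log
    public static string LogFileName(DateTime date)
    {
        return string.Format(CultureInfo.InvariantCulture, "ex{0:yyMMdd}.log", date);
    }

    // Returns each address found in the text once, leaving out the server's own.
    public static List<string> Extract(string content, string serverIp)
    {
        return IpPattern.Matches(content)
            .Cast<Match>()
            .Select(m => m.Value)
            .Where(ip => ip != serverIp)
            .Distinct()
            .ToList();
    }
}
```

DistinctIpSketch.LogFileName(new DateTime(2009, 4, 13)) gives ex090413.log, matching the example above. Bear in mind that the pattern only checks the shape of the text: it will also accept groups above 255, and anything else in the log made up of four dot-separated numbers, such as some browser version strings in the user agent.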
Now we need to create a mechanism whereby the IP addresses can be resolved. As I said at the start of the piece, there are numerous free web sites where you can paste in an IP address, and get information about who it is registered to. A typical report will look like this:
What we need is a way to generate these registration reports reliably on demand. It might be possible to find a web service that provides this, but I couldn't find one. It might also be possible to hook into one of these free sites, but most of them prevent automated querying, and those that currently don't may do at some stage. So the best thing to do is go to the source of the reports, and that is the Regional Internet Registries, or RIRs. There are five of these organisations, one for each region of the globe. ARIN looks after North America and some of the Caribbean, RIPE looks after Europe, the Middle East and Central Asia, APNIC is responsible for the Asia-Pacific region, LACNIC takes care of Latin America and the rest of the Caribbean, leaving AFRINIC responsible for Africa.
Each of the RIR sites has its own querying form. ARIN and RIPE support GET requests, whereas APNIC, LACNIC and AFRINIC take POSTs. I will only be looking at the first 4 RIRs. The main reason for this is that the AFRINIC site did not support remote requests at the time I tested it, and I get so little traffic from that region that I did not pursue any kind of resolution. To query the other four, two utility methods are needed that make use of the HttpWebRequest class to perform HTTP requests with the remote servers. These are placed in the code-behind and look like this:

private static string GetHtmlPage(string url)
{
    String result;
    WebResponse response;
    WebRequest request = HttpWebRequest.Create(url);
    response = request.GetResponse();
    using (StreamReader sr = new StreamReader(response.GetResponseStream()))
    {
        result = sr.ReadToEnd();
        sr.Close();
    }
    return result;
}

private static string PostHtmlPage(string url, string post)
{
    ASCIIEncoding enc = new ASCIIEncoding();
    byte[] data = enc.GetBytes(post);
    WebRequest request = HttpWebRequest.Create(url);
    request.Method = "POST";
    request.ContentType = "application/x-www-form-urlencoded";
    request.ContentLength = data.Length;
    Stream stream = request.GetRequestStream();
    stream.Write(data, 0, data.Length);
    stream.Close();
    WebResponse response = request.GetResponse();
    string result;
    using (StreamReader sr = new StreamReader(response.GetResponseStream()))
    {
        result = sr.ReadToEnd();
        sr.Close();
    }
    return result;
}

If you have used HttpWebRequest before, then these should be familiar. The first method makes the HTTP GET request, while the second handles HTTP POST requests. The first method just requires a url complete with querystring passed into it, while the second requires the url that the form is at, together with a string that looks like a querystring being passed in separately. Both of them retrieve the HTML provided by the Response and return that.
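The string passed to the POST method has to be in the same name=value&name=value form as a querystring. IP addresses pulled from the log contain nothing that needs escaping, but if you ever pass other values, they should be URL-encoded first. A small helper along the following lines could build the string. This is only a sketch: FormBodySketch and Build are hypothetical names, and it relies on Uri.EscapeDataString, which encodes a space as %20 rather than +.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FormBodySketch
{
    // Joins name/value pairs into a form-encoded string suitable for the
    // "post" argument, escaping reserved characters in names and values.
    public static string Build(IEnumerable<KeyValuePair<string, string>> fields)
    {
        return string.Join("&", fields
            .Select(f => Uri.EscapeDataString(f.Key) + "=" + Uri.EscapeDataString(f.Value))
            .ToArray());
    }
}
```

A list of pairs is used instead of a dictionary because a form can legitimately repeat a field name.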
What is needed now is a method that makes use of the above utility functions. In previous jQuery articles, where asynchronous calls have been made to server-side resources, I have tended to use a Web Service. In this example, I am going to use a Page Method. The principle is exactly the same except that the Page Method appears in the code-behind of the aspx page or in a <script runat="server"> block at the top. Dave Ward at www.encosia.com covers this pretty extensively, but be sure to follow the links in his article to other useful nuggets of information about this area.

[WebMethod]
public static string GetWhois(string ip)
{
    string response = "";
    string arin = "https://ws.arin.net/whois/?queryinput=" + ip;
    string ripe = "http://www.db.ripe.net/whois?form_type=simple&full_query_string=&searchtext=" + ip + "&do_search=Search";
    string apnic = "http://wq.apnic.net/apnic-bin/whois.pl";
    string lacnic = "http://lacnic.net/cgi-bin/lacnic/whois?lg=EN";
    string lacnicFields = "query=" + ip;
    string apnicFields = ".cgifields=object_type&.cgifields=reverse_delegation_domains&do_search=Search&" +
                         "form_type=advanced&full_query_string=&inverse_attributes=None&object_type=All&searchtext=" + ip;

    response = GetHtmlPage(arin);
    Regex pre = new Regex(@"<pre>[.\n\W\w]*</pre>", RegexOptions.IgnoreCase);
    Match m = pre.Match(response);
    if (pre.IsMatch(response))
    {
        if (m.Value.IndexOf("OrgName:    RIPE Network Coordination Centre") > 0)
        {
            response = GetHtmlPage(ripe);
            m = pre.Match(response);
        }
        else if (m.Value.IndexOf("OrgName:    Asia Pacific Network Information Centre") > 0)
        {
            response = PostHtmlPage(apnic, apnicFields);
            m = pre.Match(response);
        }
        else if (m.Value.IndexOf("OrgName:    Latin American and Caribbean IP address Regional Registry") > 0)
        {
            response = PostHtmlPage(lacnic, lacnicFields);
            m = pre.Match(response);
        }
        return m.Value;
    }
    else
    {
        return "<pre>No Data</pre>";
    }
}

There's not a lot of explanation needed for the method. It takes the string returned from the HttpWebRequest call, starting with a call to the ARIN service, and checks to see if the response contains a reference to another RIR. If it does, it calls the appropriate one. I chose ARIN first because it reliably returns the alternative RIR where the IP address falls outside of the ARIN region. The Regex only grabs the content within the <pre> block in the response, which is the formatted registry record. This is what is returned from the Page Method to the calling code, which follows next.
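The referral check is the part most likely to break if the spacing in the ARIN record changes, since the method above looks for the exact text including the run of spaces after OrgName:. One possible way to make it less brittle, shown here purely as a sketch, is to find the OrgName line and compare its trimmed value. The three organisation names are the ones used in the method above. The class, enum and method names are invented for the example.

```csharp
using System;

public static class RirReferralSketch
{
    public enum Registry { Arin, Ripe, Apnic, Lacnic }

    // Decides which registry should answer for an address, based on the
    // OrgName line of the record returned by ARIN.
    public static Registry Detect(string arinRecord)
    {
        foreach (string rawLine in arinRecord.Split('\n'))
        {
            string line = rawLine.Trim(); // also removes a trailing \r
            if (!line.StartsWith("OrgName:", StringComparison.OrdinalIgnoreCase))
                continue;

            string org = line.Substring("OrgName:".Length).Trim();
            if (org == "RIPE Network Coordination Centre") return Registry.Ripe;
            if (org == "Asia Pacific Network Information Centre") return Registry.Apnic;
            if (org == "Latin American and Caribbean IP address Regional Registry") return Registry.Lacnic;
        }
        // No referral found: treat the ARIN record as the final answer.
        return Registry.Arin;
    }
}
```

The result could then drive a switch that calls GetHtmlPage or PostHtmlPage with the matching URL, in place of the chain of IndexOf tests.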
Turning back to the aspx file, Javascript is added which is all that is needed to manage the request, response and the display of the result.

<script type="text/javascript" src="script/jquery-1.3.2.min.js"></script> 
<script type="text/javascript"> 
  $(document).ready(function() {
    $("span").each(function() {
      $(this).click(function() {
        $("#loading").html("<img src=\"images/loading.gif\" />");
        $("#dns").empty();
        $.ajax({
          type: "POST",
          contentType: "application/json; charset=utf-8",
          data: "{ip: '" + $(this).html() + "'}",
          url: "IPLookup.aspx/GetWhois",
          dataType: "json",
          success: function(response) {
            $("#loading").empty();
            $("#dns").html(response.d);
          }
        });
      });
    });
  });
</script> 

And that takes care of all of it. It starts off by iterating over all the <span> elements on the page (these contain the IP addresses) and adds a click event handler to each one. The event handler function first loads a gif image into the loading div, which is a version of the familiar animation used to indicate activity on an AJAX page. Then it clears the content of the dns div, which may hold details of previous results. Once that has been accomplished, the AJAX call to the Page Method is made and, if successful, the loading gif is cleared and the results displayed.
This article again shows how little code is required when using jQuery to perform AJAX calls and manipulate the DOM. It also looks at some other very useful aspects of .NET including Page Methods, using lambda expressions on the results of a text file, Regular Expressions and the HttpWebRequest class to perform both POST and GET requests to "scrape" remote web pages. If you wanted to build on this example, there's nothing to stop you examining the structure of the Registration records that get returned from the RIRs to extract just the key information you want, and then scheduling the parsing of them with the results emailed each week or so.
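As a starting point for that kind of extraction, the following sketch splits a registration record into name and value pairs, assuming the record consists of lines in the form Name: value, as the OrgName lines tested for earlier do. The class and method names are hypothetical, and any field name other than OrgName that you look up is an assumption about what a particular registry returns.

```csharp
using System;
using System.Collections.Generic;

public static class RecordFieldSketch
{
    // Splits "Name: value" lines into a dictionary, keeping the first value
    // seen for each name. Lines starting with # or % are treated as comments.
    public static Dictionary<string, string> Parse(string record)
    {
        var fields = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        foreach (string rawLine in record.Split('\n'))
        {
            string line = rawLine.Trim();
            int colon = line.IndexOf(':');
            if (colon <= 0 || line[0] == '#' || line[0] == '%')
                continue;

            string key = line.Substring(0, colon).Trim();
            string value = line.Substring(colon + 1).Trim();
            if (value.Length > 0 && !fields.ContainsKey(key))
                fields[key] = value;
        }
        return fields;
    }
}
```

The input is expected to be the text inside the <pre> block. Any HTML left in the record, such as links, would need stripping first, because a URL also contains a colon.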

