Programmatic web page login with cookies

November 12th 2021 .NET

.NET makes it really easy to scrape some data from a public website. You can use HttpClient to download the web page.

public async Task<string> GetHtmlAsync(string username)
{
    // Escape the username so special characters cannot break the query string.
    var uri = new Uri(
        "https://backloggery.com/ajax_moregames.php?alpha=1"
        + $"&user={Uri.EscapeDataString(username)}");
    return await httpClient.GetStringAsync(uri).ConfigureAwait(false);
}

The best library for parsing the HTML is probably AngleSharp, but that's not the topic of this post. Instead, I'll focus on what to do if the web page is not public and you need to log in first. Typically, you will then need to submit the login form programmatically and use the cookies from the response in future requests.

You should do the login in a private browser window first, opening the Network tab of Developer Tools to see the interaction required.

Figure: Web page login request in the browser Developer Tools

I have highlighted the parts you should pay attention to (and blurred out the sensitive data). Let us go through them in order:

  • Request URL is the URL you should send the credentials to.
  • Request Method is the HTTP verb to use. In most cases, it should be POST.
  • Status Code is often 3xx (for example, 302), which means a redirect. If this is the case, you will need to disable automatic redirection in HttpClient to access the actual login response.
  • set-cookie headers contain the cookies that the server expects to receive in future requests from the logged-in client (in this case, your application).
  • Form Data contains the data that must be submitted to the login URL. This includes the values of all input fields (with the value of their name attribute as the key), and it's not uncommon for it to also include several hidden input fields from the login page.

Based on all this information, you should be able to implement a login method similar to the following:

private async Task<IEnumerable<KeyValuePair<string, string>>> LoginAsync(
    string username, string password)
{
    var loginUri = new Uri("https://backloggery.com/login.php");

    var loginParams = new KeyValuePair<string?, string?>[]
    {
        new ("username", username),
        new ("password", password),
    };

    using var content = new FormUrlEncodedContent(loginParams);

    // Dispose the response once the cookies have been copied out of it.
    using var response = await httpClient.PostAsync(loginUri, content)
        .ConfigureAwait(false);
    var loginCookies = response.GetCookies();

    if (!loginCookies.Any())
    {
        throw new ArgumentException("Login failed. Invalid username or password.");
    }

    return loginCookies;
}

The data that needs to be entered by the user is passed in via parameters. The rest is hard-coded. A few details to keep in mind:

  • The form data is specified as a sequence of KeyValuePair<string?, string?> instances. In my case, only user-supplied data is required. If a site requires you to include hidden fields from the login page, you need to request that page first and parse the HTML using AngleSharp or some other method.
  • The resulting sequence is then passed to the FormUrlEncodedContent constructor. The keys and values in the sequence are nullable strings, because that's what FormUrlEncodedContent expects. The created FormUrlEncodedContent instance is used as the content of the POST request in the PostAsync call.
  • The next step is to parse the set-cookie headers from the HTTP response. HttpResponseMessage has no helper method that returns them as key-value pairs. The base class library can parse such headers with CookieContainer.SetCookies, but that ties you to a CookieContainer. Since I could not find any NuGet packages either, I used a regular expression to do the parsing. That worked well enough for me:

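    public static IDictionary<string, string> GetCookies(
        this HttpResponseMessage response)
    {
        var cookies = new Dictionary<string, string>();

        // GetValues throws if the header is missing, so use TryGetValues
        // to return an empty dictionary when no cookies were set.
        if (response == null ||
            !response.Headers.TryGetValues("Set-Cookie", out var setCookieStrings))
        {
            return cookies;
        }

        foreach (var setCookieString in setCookieStrings)
        {
            // Only the leading name=value pair is read; attributes are ignored.
            var match = Regex.Match(
                setCookieString,
                @"^\s*(?<key>[^=;\s]+)=(?<value>[^;]*)");
            if (match.Success)
            {
                // The indexer overwrites duplicates instead of throwing.
                cookies[match.Groups["key"].Value] = match.Groups["value"].Value;
            }
        }

        return cookies;
    }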
    
  • To detect a failed login, I only needed to check if there were cookies in the response. For other sites, you might need to look for a specific cookie, check the status code of the response, or even parse the returned HTML content.
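
For sites that do require hidden fields from the login page, the "some other method" mentioned above could be as simple as another regular expression. The following is only a minimal sketch under several assumptions: the LoginFormHelper class and GetHiddenFields method are hypothetical names, the attributes are assumed to be double-quoted, and the HTML is assumed to be simple enough for regular expressions. For anything less regular, a real HTML parser such as AngleSharp is the safer choice.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original code.
public static class LoginFormHelper
{
    // Naively collects name/value pairs of <input type="hidden"> elements.
    // Assumes double-quoted attribute values.
    public static List<KeyValuePair<string?, string?>> GetHiddenFields(string html)
    {
        var fields = new List<KeyValuePair<string?, string?>>();

        foreach (Match input in Regex.Matches(
            html, @"<input\b[^>]*>", RegexOptions.IgnoreCase))
        {
            var tag = input.Value;
            if (!Regex.IsMatch(
                tag, @"\stype\s*=\s*""hidden""", RegexOptions.IgnoreCase))
            {
                continue;
            }

            var name = Regex.Match(
                tag, @"\sname\s*=\s*""(?<v>[^""]*)""", RegexOptions.IgnoreCase);
            var value = Regex.Match(
                tag, @"\svalue\s*=\s*""(?<v>[^""]*)""", RegexOptions.IgnoreCase);

            if (name.Success)
            {
                // Attribute values are HTML-encoded in the page source.
                fields.Add(new(
                    WebUtility.HtmlDecode(name.Groups["v"].Value),
                    value.Success
                        ? WebUtility.HtmlDecode(value.Groups["v"].Value)
                        : string.Empty));
            }
        }

        return fields;
    }
}
```

The returned pairs have the same type as the login parameters, so they can be combined with the username and password before creating the FormUrlEncodedContent instance.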

The LoginAsync method needs to be called before downloading the requested web page, so that the returned login cookies can be used for the request:

public async Task<string> GetHtmlAsync(
    string username, string? password = null)
{
    IEnumerable<KeyValuePair<string, string>>? loginCookies = null;

    if (!string.IsNullOrEmpty(password))
    {
        loginCookies = await LoginAsync(username, password)
            .ConfigureAwait(false);
    }

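    // Escape the username so special characters cannot break the query string.
    var uri = new Uri(
        "https://backloggery.com/ajax_moregames.php?alpha=1"
        + $"&user={Uri.EscapeDataString(username)}");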
    return await httpClient.GetStringAsync(uri, loginCookies)
        .ConfigureAwait(false);
}

Unfortunately, there is no built-in GetStringAsync method overload with a parameter for cookies. Therefore, I had to create an extension method myself:

public static async Task<string> GetStringAsync(
    this HttpClient httpClient,
    Uri uri,
    IEnumerable<KeyValuePair<string, string>>? cookies)
{
    using var request = new HttpRequestMessage(HttpMethod.Get, uri);

    if (cookies != null)
    {
        var cookieValue = string.Join(
            "; ",
            cookies.Select(cookie => $"{cookie.Key}={cookie.Value}"));
        request.Headers.Add("cookie", cookieValue);
    }

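    using var response = await httpClient.SendAsync(request).ConfigureAwait(false);
    // Match the built-in GetStringAsync, which throws on non-success status codes.
    response.EnsureSuccessStatusCode();
    return await response.Content.ReadAsStringAsync().ConfigureAwait(false);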
}

I had to combine all the cookies myself into a single cookie header and then use the SendAsync method to send the generated HTTP request.
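
As an alternative to combining the cookies by hand, the CookieContainer class from System.Net can both parse set-cookie values and build the matching cookie header. This is not what my code does. It is only a minimal sketch, assuming the set-cookie values are simple enough for CookieContainer to parse; the CookieHeaderBuilder class and BuildCookieHeader method are hypothetical names used for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Net;

// Hypothetical helper, not part of the original code.
public static class CookieHeaderBuilder
{
    // Feeds raw Set-Cookie header values into a CookieContainer and
    // returns the value to send in the Cookie header for the same URI.
    public static string BuildCookieHeader(
        Uri uri, IEnumerable<string> setCookieValues)
    {
        var container = new CookieContainer();

        foreach (var setCookieValue in setCookieValues)
        {
            // Throws CookieException if a value cannot be parsed.
            container.SetCookies(uri, setCookieValue);
        }

        // Only cookies whose domain and path match the URI are included.
        return container.GetCookieHeader(uri);
    }
}
```

The values from response.Headers.TryGetValues("Set-Cookie", ...) could be passed in as setCookieValues, and the result could be added to the request as the cookie header instead of the manually joined string.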

There is one last code change needed. If the login response has a redirect status code, the automatic redirection feature of HttpClient must be disabled. Otherwise, HttpClient follows the redirect and returns the final response instead of the login response with the set-cookie headers. This must be configured when the HttpClient instance is created. Since HttpClient instances should be created through IHttpClientFactory, you can pass the option when registering the HttpClient for dependency injection:

serviceCollection.AddHttpClient<BackloggeryClient>()
    .ConfigurePrimaryHttpMessageHandler(() => new SocketsHttpHandler
    {
        AllowAutoRedirect = false,
    });
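
For a quick experiment without dependency injection, the same handler can be passed directly to the HttpClient constructor. The sketch below also sets UseCookies to false, which is my assumption rather than something the registration above does: because the cookies are sent manually via the cookie header, it seems safer to keep the handler's own cookie container out of the way. The HandlerFactory class and CreateHandler method are hypothetical names.

```csharp
using System.Net.Http;

// Hypothetical helper, not part of the original code.
public static class HandlerFactory
{
    public static SocketsHttpHandler CreateHandler()
    {
        return new SocketsHttpHandler
        {
            // Return 3xx login responses as-is instead of following them.
            AllowAutoRedirect = false,
            // Assumption: cookies are handled manually, so disable the
            // handler's built-in cookie container.
            UseCookies = false,
        };
    }
}
```

An HttpClient created with new HttpClient(HandlerFactory.CreateHandler()) should then behave like the registered one with respect to redirects.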

If you have problems with the above, you can check out the full code in my GitHub repository.

There is no simple built-in functionality in HttpClient or .NET in general to programmatically log in to websites and download web pages using that login. You have to perform all the steps manually: submit the login form data, parse the returned cookies, and include those cookies in future requests. The basic building blocks for all these steps are available.
