How to Crawl a Password Protected Website

Crawling Private Pages of Password Protected Websites

Last Edited September 30, 2017 by Garenne Bigby in Sitemap Generator

What is a password protected website? Why would a website be password protected? Why would I want to crawl a password protected website? These are common questions that everyone asks themselves at some point, especially if you work in website design, development, or SEO. Let’s take each question one at a time, before looking at a few others as well.

What is Password Protection?

Password protection is used by many websites as a security process in order to secure information that should not be made available to the public at large. This information is otherwise accessible from any computer, but password protection makes it unavailable to the general public and requires a password (usually with an authorized user name as well) before being accessible.

Some websites are entirely password protected, whereas others are only partially so—meaning there’s a staff or membership area.


Why use Password Protection?

It may seem confusing to come across a password-protected site, especially if the reason is not immediately clear. If you’re a website owner or manager, you may be asking the same question—“Why should I password protect my site?”

After all, password protection prevents ordinary visitors from viewing your site, which restricts traffic. The whole point of having a site up is to get as much traffic as possible, with as much visibility as you can generate, right?

Actually, there are three major reasons why you as a website manager may want—or even need—to password protect your website. If you come across one as a user, you can be sure that it’s one of the three following reasons for the website being password protected:

  • Development. The website may still be in the developmental and design phase, and therefore not ready for the public to view it. Alternatively, the site may be undergoing some major changes in terms of its design. While undergoing development or reconstruction, it’s necessary to allow employees to access the site without allowing clients to do the same. The best way to do so is to set up password protection.

  • Private Content. This reason typically results in a major portion of the website being password protected, rather than the site as a whole. The reason for this is to encourage visitors to sign up for membership, whether it’s a free subscription or a paid account. By password protecting some of the best content, information sites (and many blogs) entice visitors to become active members. A good example would be a bank’s website: there are areas available to the general public, such as the different account types, branches, and other financial services offered by the bank. But in order to access your online banking account, you’ll need to log in, as it is password protected.

  • Security. This is something of an extension of the membership reason. By setting up password protection, site owners and managers can restrict access to certain administrative functions and/or files. A good example would be a “staff only” section, or an intranet, where important information can be easily shared with staff members in a large corporation. This works better than emails, which can get lost in the spam folder or accidentally deleted, as it forms a permanent online resource library. As new employees join the company, they can immediately access all the necessary information by using their username and password. The password protected intranet system also allows staff to access the administrative back-end of the site, where they can manage client accounts on a web service site, for example.


Why Crawl Password Protected Websites?

Crawling a password protected website may sound illegal—after all, it’s password protected for a reason—but there are a number of fully legitimate reasons for wanting to crawl your protected site. Besides which, in order to crawl a password protected website, you’re going to need access to the administrative back-end, so you’ll only be accessing information that you already have access to as an admin.

Let’s take a look at why you would want to crawl your website while it’s under password protection.

  • Developing a content inventory. This is especially important when the site is still being developed and/or redesigned, as you want to have a list of all the contents on the site. The reasons for this could easily take up another article, but the most common one would be to make sure everything that needs to be there has been uploaded.

  • Performing a content audit. Once you have your content inventory, you’ll want to go through that content and make a list of any issues that need to be fixed. These could include missing content that still needs to be uploaded, partial or corrupted files that need to be replaced, or even content that hasn’t been tagged in a way that allows website crawlers and search engines to read it properly.

  • Creating a visual sitemap. Sitemaps are used by website crawlers and human users alike, and help both to navigate through your site with ease. By crawling through your password protected site before launching, you can draw up your visual sitemap in advance and immediately see where information needs to be better organized. This also helps when you’re doing a major redesign, as you can plan your project ahead of time, make comparisons to the previous design, and implement necessary changes in real time before launching.

  • SEO. The SEO related reasons for wanting to crawl any website are extensive enough to warrant an article dedicated to the subject, and all of them apply to crawling a password protected website too. In brief, crawling your site under password protection allows you to see what the search engines see. As you’re going to be relying rather heavily on them to help attract traffic to your site, you want to make sure that your site isn’t only well optimized for human users, but for search engines as well. In part, this includes your usability experience, the effectiveness of your sitemap, and the meta tags allowing search engines to crawl your site content thoroughly for inclusion in their search engine indexes (which they use to generate search query results so fast). Making sure that your site is well optimized before launching can and will save you a lot of difficulties, effort, and frustration in the long run.

  • Accessibility. We’ve already covered optimizing the accessibility of your site for search engine website crawlers and general users, but there’s another important aspect of your site’s accessibility score that you certainly don’t want to overlook. Whether your website is going to become available to the general public once you launch or a section will remain password protected, you want to make sure that the pages are easily accessible to persons with disabilities. This is especially true of private website pages that will continue to be password protected.


Best Practices for Crawling Password Protected Websites

Before we get into the details of how to crawl a password protected website, there are some practices that you want to bear in mind and adhere to.

  • Use a read-only account when crawling. When you’re crawling through your protected site, you’re going to be picking up on things that need to be changed. You may even be tempted to fix them yourself as you go along, but you have to remember the reasons for crawling the website in the first place. One of them is to make a list for the developers to sort out. After all, they’re the professionals, and you’ve hired them to deal with these kinds of issues. If you lack the expertise, you may end up doing more harm than good. By using a full administrative username and password when crawling your website, you’re also giving the crawler full access to the entire admin section. While the best crawlers won’t do anything of the like, with full administrative access you risk your crawler making changes to the site’s themes and plugins, or even deleting posts. This is why it is imperative that you set up a read-only account for crawling your website.

  • Always exclude your admin pages. This carries on from what we’ve said in the above point: the read-only account you’ve set up for crawling your website must exclude the administrative back-end pages and folders to help you avoid making those undesirable changes to your website. For WordPress, you’ll want to exclude the entire /wp-admin/ section; for Joomla, the same applies for the /administrator/ section.
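As a rough illustration of the exclusion rule above, here is a minimal Python sketch of a path filter. The two prefixes come from the examples in this article; the function names and the example.com URLs are hypothetical, and this is not how any particular crawler implements exclusions.

```python
from urllib.parse import urlparse

# Path prefixes named in this article; extend the tuple for other platforms.
EXCLUDED_PREFIXES = ("/wp-admin/", "/administrator/")


def is_excluded(url, prefixes=EXCLUDED_PREFIXES):
    """Return True if the URL's path falls under an excluded admin section."""
    path = urlparse(url).path
    # Treat "/wp-admin" and "/wp-admin/" alike by normalising the trailing slash.
    if not path.endswith("/"):
        path += "/"
    return any(path.startswith(prefix) for prefix in prefixes)


def filter_crawl_queue(urls):
    """Keep only the URLs a crawler should be allowed to visit."""
    return [url for url in urls if not is_excluded(url)]


if __name__ == "__main__":
    queue = [
        "https://example.com/members/welcome",
        "https://example.com/wp-admin/post.php?post=1",
        "https://example.com/administrator/index.php",
        "https://example.com/blog/wp-admin-tips",
    ]
    for url in filter_crawl_queue(queue):
        print(url)
```

Matching on the path prefix rather than on a substring means a page such as /blog/wp-admin-tips is still crawled, while everything beneath the admin folders is skipped.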


What Else Do I Need to Know?

Learn about CSS selectors and HTML

When building a Custom System Login, you’re going to need some basic knowledge of both CSS and HTML, the stylesheet and markup languages used to build web pages. We’ll give you a breakdown of the four best sites for learning CSS and HTML to help you get started.

Theoretically, you could ask your developers to set this up for you. But you want to have full control over your Custom System Login, so for security reasons, it’s always going to be better for you to do it yourself. But don’t think you’ll have to enroll yourself in a college or university course to learn how!

Here are the four best online tutorials and courses, which are available 100% free of charge.

  • Udacity - Intro to HTML and CSS. Udacity have some fantastic online courses made freely available, and their Intro to HTML and CSS is certainly one of the best.

    The tagline for this course is “Not Your Typical Intro,” and we wholeheartedly agree. While most HTML and CSS tutorials focus on the language syntax, Udacity recognizes that syntax alone won’t help you design a webpage or, in our case, set up a Custom System Login. This course is structured in such a way as to help you think like a front-end developer, showing you where to start and how the task should be approached.

    When you complete the course, you’ll have all the knowledge you need to implement HTML and CSS in all of your page elements too—meaning you’ll be able to make minor changes to your site without having to call in a developer every time.

  • Khan Academy - Intro to HTML/CSS: Making Webpages. Khan Academy has an extensive list of topics that are wonderfully broken down into a variety of modules. You won’t have to complete the full Computer Programming subject to partake in their Intro to HTML/CSS: Making Webpages course.

    This is another fantastic, extensive exploration of HTML and CSS, and teaches you how to use both for developing web pages. The same principles will apply when you set up your Custom System Login, and understanding the difference between HTML (a markup language) and CSS (a stylesheet language) can make or break your ability to do so efficiently.

  • Codecademy - Learn HTML and Learn CSS. Codecademy focuses purely on teaching users the different coding languages and applications. Both the Learn HTML and Learn CSS tutorials are brilliant crash courses designed to help novices learn the ropes quickly and efficiently.

    While others have an estimated duration of 3 weeks, you’ll learn everything you need to know to set up your Custom System Login in as little as 12 hours with these two tutorial courses. As part of the course, you’ll also be given an opportunity to build a fashion blog and a wine festival schedule (HTML), after which you can tackle the CSS tutorials, where you’ll complete a few extra projects. These projects aren’t necessary for the immediate need of crawling your password protected site, but having this knowledge will prove invaluable when you’re looking for any bugs that need to be fixed or making minor changes later on without having to hire a developer each time.

  • W3Schools - HTML5 Tutorial and CSS Tutorial. W3Schools is a popular source for learning website development tools. They rightly market themselves as the world’s largest tutorial and reference site for the industry.

    You’ll want to start off with their extensive HTML5 Tutorial, which you can easily finish in a day and which already includes some basic CSS implementation. There are hundreds of examples to help make the learning process that much easier, and W3Schools have also included on-page HTML editors that allow you to practice each lesson and view the results in real time.

    To properly supplement your CSS knowledge—whether you plan on using it further, or only want to be able to set up your Custom System Login—you should also run through their CSS Tutorial. Once again, there are hundreds of examples provided to help you learn quickly, and on-page editing sections where you can practice and view your coding skills immediately.


Learn how to use a browser inspector tool

You’re going to need a browser inspector tool to find the information in your site’s code that is necessary for building your Custom System Login.

We’ve found the six best browser inspector tools, specifically designed and developed for the most popular browsers in use.

  • Chrome DevTools. Possibly the most popular browser in use, Google Chrome features a range of tools to help webmasters get the most out of their experience. Chrome DevTools are built into Google Chrome, and the authoring, debugging, and profiling tools that make up the set are easily accessible. To open them, navigate to Chrome’s main menu and select “More Tools” > “Developer Tools”. Right click on any page element and choose “Inspect”, and you’ll receive a full breakdown of that element.

  • Firebug. Firebug is a long-standing favorite for Firefox users, who closely rival Chrome users in terms of numbers. While the Firebug extension itself is no longer being developed or maintained, it remains one of the most powerful web development tool sets available online. Firebug can still be comfortably and efficiently used at present, but you’ll certainly want to follow Mozilla’s recommendation of migrating to the next tool on our list, Firefox Developer Edition.

  • Firefox Developer Edition. Replacing Firebug as the go-to option for developers and website managers using Mozilla Firefox, the Firefox Developer Edition comes with an extensive set of tools. For the express purpose of using the browser inspection tool in developing your Custom System Login, you should take a look at the Page Inspector, their number 1 core tool.

  • Safari Web Inspector. For those of you using a Mac with the default browser, Safari Web Inspector is the solution for you. Apple has outdone itself in creating a command center that gives users quick and easy access to one of the richest sets of development tools available. Safari Web Inspector is fully compatible with macOS, iOS, and tvOS.

  • Microsoft Edge Developer Tools. Not many people still use the default Windows browser, but for those of you who do, there are the Microsoft Edge Developer Tools. The F12 developer tools range has been recently updated to include features requested by users. For anyone using Microsoft Edge, the seven tools included in the F12 range are well-developed and are sure to meet your needs.

  • Opera Dragonfly. Last but not least is the Opera Dragonfly developers’ tool set. While most Android users will be familiar with Opera as the default browser for many smartphones, the browser is also very adept when used on your desktop or laptop. Recognizing the need for Opera-oriented developer tools, the team have released Dragonfly, a fully-featured suite that’s built into the browser itself. To make use of the Dragonfly feature, right-click on any element and select “inspect”. It’s as easy as that!
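Whichever inspector you use, the details you are looking for are the login form itself, its action attribute, and its username and password fields. The Python sketch below pulls the same details out of a page's HTML using only the standard library. The sample markup, class name, and function names are hypothetical and purely illustrative; your own login page will differ, and this is not how DYNO Mapper reads the form.

```python
from html.parser import HTMLParser

# Hypothetical login page markup, loosely modelled on a WordPress-style form.
SAMPLE_LOGIN_HTML = """
<form id="loginform" action="wp-login.php" method="post">
  <input type="text" name="log" id="user_login">
  <input type="password" name="pwd" id="user_pass">
  <input type="submit" value="Log In">
</form>
"""


class LoginFormFinder(HTMLParser):
    """Collects the attributes of the first <form> and the inputs inside it."""

    def __init__(self):
        super().__init__()
        self.form = None      # attributes of the first <form> seen
        self.inputs = []      # attributes of each <input> inside that form
        self._inside = False

    def handle_starttag(self, tag, attrs):
        if tag == "form" and self.form is None:
            self.form = dict(attrs)
            self._inside = True
        elif tag == "input" and self._inside:
            self.inputs.append(dict(attrs))

    def handle_endtag(self, tag):
        if tag == "form":
            self._inside = False


def selector_for(tag, attrs):
    """Prefer an id selector, then a name attribute selector, then the bare tag."""
    if attrs.get("id"):
        return "#" + attrs["id"]
    if attrs.get("name"):
        return '{}[name="{}"]'.format(tag, attrs["name"])
    return tag


def describe_login_form(html):
    """Return the form details an inspector would show, or None if no form exists."""
    finder = LoginFormFinder()
    finder.feed(html)
    if finder.form is None:
        return None

    def input_type(attrs):
        # An <input> without a type attribute behaves as a text field.
        return (attrs.get("type") or "text").lower()

    username = next((a for a in finder.inputs if input_type(a) in ("text", "email")), None)
    password = next((a for a in finder.inputs if input_type(a) == "password"), None)
    return {
        "form_selector": selector_for("form", finder.form),
        "submit_url": finder.form.get("action", ""),
        "username_selector": selector_for("input", username) if username else None,
        "password_selector": selector_for("input", password) if password else None,
    }


if __name__ == "__main__":
    for field, value in describe_login_form(SAMPLE_LOGIN_HTML).items():
        print(f"{field}: {value}")
```

The sketch prefers id selectors because they are normally unique on a page; when a field has no id, it falls back to an attribute selector on the field's name.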


Build a Custom System Login

Create a DYNO Mapper account if you do not already have one. If your website or application is under 100 pages, you can use the free version. Tiered pricing is available based on the page count of your project. After you have logged into DYNO Mapper, follow these instructions.

  1. Select Create from URL in the main menu and click Advanced Options.

  2. In the advanced options, under Website Authentication, click the Add/Edit Custom System button.

  3. Click Add System

  4. Title your Custom System Login.

  5. Enter Login Page URL - This is the page where the login form can be found. It can be a relative or absolute URL.

  6. Enter Form Selector - This is the login form element itself, i.e. the form tag (<form>).

  7. Enter Form Submit URL - This is where the form will be submitted, usually the value of the form tag's action attribute. For example, in <form action="wp-login.php"> the submit URL is wp-login.php.

  8. Enter Form Username Selector - This is the username field or input tag inside the login form tag.

  9. Enter Form Password Selector - This is the password field or usually the input tag with type="password" attribute inside the login form tag.

  10. Enter Default Username - Login username. This is used if you leave the login username empty under Website Authentication in the advanced options.

  11. Enter Default Password - Login password. This is used if you leave the login password empty under Website Authentication in the advanced options.

  12. Enter Exclude URLs - An absolute URL, or part of a URL, to be excluded from crawling. This is useful for excluding the logout link so that the crawler won't be logged out if it finds a logout URL. Separate multiple entries with a new line.

  13. Enter Login Success Criteria - This must be set so that the crawler knows whether it has logged in successfully. Usually only the CSS selector criteria are needed. The other criteria require knowledge of HTTP.

  14. Enter Criteria:
  • CSS selector match - The selector has a match after the login form is submitted. Examples are elements that can be seen only while logged in (#dashboard-wrapper, .login-success-alert).
  • CSS selector does not match - The selector has no match after the login form is submitted. Examples are elements that show when login fails (.login-failed-alert).
  • HTTP code match - The HTTP status code of a successful login.
  • HTTP code does not match - The opposite of HTTP code match.
  • Header name match - An HTTP header name found on successful login.
  • Header name does not match - The opposite of Header name match.
  • Header name/value match - An HTTP header name and value found on successful login.
  • Header name/value does not match - The opposite of Header name/value match.
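To make the criteria above more concrete, here is a minimal Python sketch of how such checks could be evaluated against a login response. It makes several assumptions that do not come from DYNO Mapper: the criterion names (css, http_code, header_name, header_value) are invented for illustration, every criterion must hold for the login to count as successful, and only simple #id and .class selectors (optionally comma-separated) are supported.

```python
from html.parser import HTMLParser


class IdClassCollector(HTMLParser):
    """Records every id and class value that appears in a page."""

    def __init__(self):
        super().__init__()
        self.ids = set()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if not value:
                continue
            if name == "id":
                self.ids.add(value)
            elif name == "class":
                self.classes.update(value.split())


def selector_matches(html, selector):
    """True if any comma-separated '#id' or '.class' part matches the page.

    Only these two simple selector forms are handled in this sketch.
    """
    found = IdClassCollector()
    found.feed(html)
    for part in (p.strip() for p in selector.split(",")):
        if part.startswith("#") and part[1:] in found.ids:
            return True
        if part.startswith(".") and part[1:] in found.classes:
            return True
    return False


def login_succeeded(status, headers, html, criteria):
    """Evaluate (kind, value, should_match) criteria; all must hold (an assumption)."""
    # HTTP header names are case-insensitive, so compare them in lower case.
    lowered = {name.lower(): value for name, value in headers.items()}
    checks = {
        "css": lambda v: selector_matches(html, v),
        "http_code": lambda v: status == v,
        "header_name": lambda v: v.lower() in lowered,
        "header_value": lambda v: lowered.get(v[0].lower()) == v[1],
    }
    for kind, value, should_match in criteria:
        if checks[kind](value) != should_match:
            return False
    return True


if __name__ == "__main__":
    criteria = [
        ("css", "#dashboard-wrapper, .login-success-alert", True),
        ("css", ".login-failed-alert", False),
        ("http_code", 200, True),
    ]
    page = '<div id="dashboard-wrapper" class="panel">Welcome back</div>'
    print(login_succeeded(200, {"Set-Cookie": "session=abc"}, page, criteria))
```

Pairing a "match" criterion with a "does not match" criterion, as in the usage example, guards against pages that return a success status code even when the login fails.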

 

Author: Garenne Bigby
Website: http://garennebigby.com
Founder @dynomapper
Garenne Bigby is a freelance Chicago developer and founder of DYNO Mapper with over 10 years of experience in both agency and freelance roles in design, development, user experience, SEO, and information architecture.
