Friday, 28 February 2014

iGoogle Gadgets for Webmaster Tools



Update: The described feature is no longer available.

When you plan to do something, are you a minimalist, or are you prepared for every potential scenario? For example, would you hike out into the Alaskan wilderness during inclement weather with only a wool overcoat and a sandwich in your pocket - like the naturalist John Muir (and you thought Steve McQueen was tough)?

Or are you more the type of person where even on a day hike, you bring a few changes of clothes, 3 dehydrated meals, a couple of kitchen appliances, a power inverter, and a foot-powered generator, because, well, you never know when the urge will arise to make toast?

The Webmaster Tools team strives to serve all types of webmasters, from the minimalist to those who use every tool they can find. If you're reading this blog, you've probably had the opportunity to use the current version of Webmaster Tools, which offers as many features as possible just shy of the kitchen sink. Now there's something for those of you who would prefer to access only the features of Webmaster Tools that you need: we've just released Webmaster Tools Gadgets for iGoogle.

Here's the simple process to start using these Gadgets right away. (Note: this assumes you've already got a Webmaster Tools account and have verified at least one site.)

1. Visit Webmaster Tools and select any site that you've validated from the dashboard.
2. Click on the Tools section.
3. Click on Gadgets sub-section.
4. Click on the big "Add an iGoogle Webmaster Tools homepage" button.
5. Click the "Add to Google" button on the following confirmation page to add the new tab to iGoogle.
6. Now you're in iGoogle, where you should see your new Google Webmaster Tools tab with a number of Gadgets. Enjoy!

You'll notice that each Gadget has a drop-down menu at the top that lets you select any of the sites you have validated and see that Gadget's information for the site you select. A few of the Gadgets that we're currently offering are:

Crawl errors - Does Googlebot encounter issues when crawling your site?



Top search queries - What are people searching for to find your site?



External links - What websites are linking to yours?




We plan to add more Gadgets in the future and improve their quality, so if there's a feature that you'd really like to see which is not included in one of the Gadgets currently available, let us know. As you can see, it's a cinch to get started.

It looks like rain clouds are forming over here in Seattle, so I'm off for a hike.

Thursday, 27 February 2014

3 tips to find hacking on your site, and ways to prevent and fix it



Google shows this message in search results for sites that we believe may have been compromised.

You might not think your site is a target for hackers, but it's surprisingly common. Hackers target large numbers of sites all over the web in order to exploit the sites' users or reputation.

One common way hackers take advantage of vulnerable sites is by adding spammy pages. These spammy pages are then used for various purposes, such as redirecting users to undesired or harmful destinations. For example, we’ve recently seen an increase in hacked sites redirecting users to fake online shopping sites.

Once you recognize that your website may have been hacked, it’s important to diagnose and fix the problem as soon as possible. We want webmasters to keep their sites secure in order to protect users from spammy or harmful content.

3 tips to help you find hacked content on your site

  1. Check your site for suspicious URLs or directories
    Keep an eye out for any suspicious activity on your site by performing a “site:” search of your site in Google, such as [site:example.com]. Are there any suspicious URLs or directories that you do not recognize?

    You can also set up a Google Alert for your site. For example, if you set a Google Alert for [site:example.com (viagra|cialis|casino|payday loans)], you’ll receive an email when these keywords are detected on your site.

  2. Look for unnatural queries on the Search Queries page in Webmaster Tools
    The Search Queries page shows Google Web Search queries that have returned URLs from your site. Look for unexpected queries, as they can be an indication of hacked content on your site.

    Don’t be quick to dismiss queries in different languages. This may be the result of spammy pages in other languages placed on your website.


    Example of an English site hacked with Japanese content.
  3. Enable email forwarding in Webmaster Tools
    Google will send you a message if we detect that your site may be compromised. Messages appear in Webmaster Tools’ Message Center but it's a best practice to also forward these messages to your email. Keep in mind that Google won’t be able to detect all kinds of hacked content, but we hope our notifications will help you catch things you may have missed.

Tips to fix and prevent hacking

  • Stay informed
    The Security Issues section in Webmaster Tools will show you hacked pages that we detected on your site. We also provide detailed information to help you fix your hacked site. Make sure to read through this documentation so you can quickly and effectively fix your site.

  • Protect your site from potential attacks
    It's better to prevent sites from being hacked than to clean up hacked content. Hackers will often take advantage of security vulnerabilities in commonly used website management software. Here are some tips to keep your site safe from hackers:

    • Always keep the software that runs your website up-to-date.
    • If your website management software tools offer security announcements, sign up to get the latest updates.
    • If the software for your website is managed by your hosting provider, try to choose a provider that you can trust to maintain the security of your site.

We hope this post makes it easier for you to identify, fix, and prevent hacked spam on your site. If you have any questions, feel free to post in the comments, or drop by the Google Webmaster Help Forum.

If you find suspicious sites in Google search results, please report them using the Spam Report tool.

Cross-submissions via robots.txt on Sitemaps.org


Last spring, the Sitemaps protocol was expanded to include the autodiscovery of Sitemaps using robots.txt to let us and other search engines supporting the protocol know about your Sitemaps. We subsequently also announced support for Sitemap cross-submissions using Google Webmaster Tools, making it possible to submit Sitemaps for multiple hosts on a single dedicated host. So it was only a matter of time before we took the next logical step of marrying the two and allowing Sitemap cross-submissions using robots.txt. And today we're doing just that.

We're making it easier for webmasters to place Sitemaps for multiple hosts on a single host and then letting us know by including the location of these Sitemaps in the appropriate robots.txt.

How would this work? Say for example you want to submit a Sitemap for each of the two hosts you own, www.example.com and host2.google.com. For simplicity's sake, you may want to host the Sitemaps on one of the hosts, www.example.com. For example, if you have a Content Management System (CMS), it might be easier for you to change your robots.txt files than to change content in a directory.

You can now exercise the cross-submission support via robots.txt (by letting us know the location of the Sitemaps):

a) The robots.txt for www.example.com would include:
Sitemap: http://www.example.com/sitemap-www-example.xml

b) And similarly, the robots.txt for host2.google.com would include:
Sitemap: http://www.example.com/sitemap-host2-google.xml

By indicating in each individual host's robots.txt file where that host's Sitemap lives, you are in essence proving that you own the host for which you are specifying the Sitemap. And by choosing to host all of the Sitemaps on a single host, it becomes simpler to manage your Sitemaps.
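For instance, host2.google.com's complete robots.txt might look something like this (a minimal sketch; the User-agent rules are illustrative, as the Sitemap line is independent of them):

User-agent: *
Disallow:

Sitemap: http://www.example.com/sitemap-host2-google.xml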

We are making this announcement today on Sitemaps.org as a joint effort. To see what our colleagues have to say, you can also check out the blog posts published by Yahoo! and Microsoft.

Wednesday, 26 February 2014

Leap day hackathon for Google Gadgets, Maps, and more



If you've got JavaScript skills and you'd like to implement such things as Google Gadgets or Maps on your site, bring your laptops and come hang out with us in Mountain View.

This Friday, my team (Google Developer Programs) is hosting a hackathon to get you started with our JavaScript APIs. There will be plenty of our engineers around to answer questions. We'll start with short introductions of the APIs and then break into groups for coding and camaraderie. There'll be food, and prizes too.

The featured JavaScript APIs include Google Gadgets and Google Maps.

When: Friday, February 29 - two sessions (you're welcome to attend both)
  • 2-5:30 PM
  • 6-10 PM
Where: The Googleplex
Building 40
1600 Amphitheatre Pkwy
Mountain View, CA 94043
Room: Seville Tech Talk, 2nd floor

See our map for parking locations and where to check in. (Soon you, too, will be making maps like this! :)

Just say yes and RSVP!

And no worries if you're busy this Friday; future hackathons will feature other APIs and more languages. Check out the Developer Events Calendar for future listings. Hope to see you soon.

Tuesday, 25 February 2014

Canonical Link Element: presentation from SMX West

A little while ago, Google and other search engines announced support for a canonical link element that can help site owners with duplicate content issues. I recreated my presentation from SMX West and you can watch it below:



You can access the slides directly or follow along here:



By the way, Ask just announced that they will support the canonical link element. Read all about it in the Ask.com blog entry.

Thanks again to Wysz for turning this into a great video.

In fact, you might not have seen it, but we recently created a webmaster videos channel on YouTube. If you're interested, you can watch the new webmaster channel. If you subscribe to that channel, you'll always find out about new webmaster-related videos from Google.

Sunday, 23 February 2014

Introducing the Google Webmaster Central YouTube Channel

In his State of the Index presentation, Matt Cutts said that one of the things to look for from Google in 2009 is continued communication with webmasters. On the Webmaster Central team, we've found that using video is a great way to reach people. We've shown step-by-step instructions on how to use features of Webmaster Tools, shared our presentations with folks who were unable to make it to conferences, and even taken you through a day in the life of our very own Maile Ohye as she meets with many Googlers involved in webmaster support.

We plan on releasing more videos like these in the future, so we've opened up our own channel on YouTube to host webmaster-related videos. Our first video is already up, and we'll have more to share with you soon. If you want to be the first to know when we release something new, you can subscribe to us using your YouTube account, or grab this RSS feed if you'd like to keep track in your feed reader. Please let us know how you like the channel, and use the comments in this post to share your ideas for future videos.

And while we'll all do our best to make sure Matt Cutts understands that Rick Rolling is so last year, be careful where you click on April 1st.

Thursday, 20 February 2014

Best practices against hacking

These days, the majority of websites are built around applications that provide services to their users. In particular, content management systems (CMSs) are widely used to create, edit, and administer content. Due to the interactive nature of these systems, where user input is fundamental, it's important to think about security in order to avoid exploits by malicious third parties and to ensure the best user experience.

Some types of hacking attempts and how to prevent them

There are many different types of attacks hackers can conduct in order to take partial or total control of a website. In general, the most common and dangerous ones are SQL injection and cross-site scripting (XSS).

SQL injection is a technique for injecting malicious code into a web application by exploiting a security vulnerability at the database level to change its behavior. It is a really powerful technique, considering that an attacker can manipulate URLs (query strings) or any form (search, login, email registration) to inject malicious code. You can find some examples of SQL injection at the Web Application Security Consortium.

There are definitely some precautions that can be taken to avoid this kind of attack. For example, it's good practice to add a layer between a form on the front end and the database in the back end. In PHP, the PDO extension is often used to work with parameters (sometimes called placeholders or bind variables) instead of embedding user input in the statement. Another really easy technique is character escaping, where all the dangerous characters that can have a direct effect on the database structure are escaped. For instance, every occurrence of a single quote ['] in a parameter must be replaced by two single quotes [''] to form a valid SQL string literal. These are only two of the most common actions you can take to improve the security of a site and avoid SQL injections; online you can find many other resources specific to your needs (programming language, web application, and so on).
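To make the parameterized approach concrete, here is a minimal PHP sketch using PDO; the connection details and the "users" table are illustrative assumptions, not part of any real application:

<?php
// A parameterized query with PDO: the bound value is sent separately
// from the SQL text, so quotes in user input can't change the query.
// The DSN, credentials, and "users" table are hypothetical.
$pdo = new PDO('mysql:host=localhost;dbname=example', 'dbuser', 'dbpass');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$stmt = $pdo->prepare('SELECT id, name FROM users WHERE name = :name');
$stmt->execute([':name' => $_GET['name'] ?? '']);
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);
?>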

The other technique that we're going to introduce here is cross-site scripting (XSS). XSS is a technique used to inject malicious code into a webpage by exploiting security vulnerabilities in web applications. This kind of attack is possible when a web application processes data obtained through user input without any further check or validation before returning it to the final user. You can find some examples of cross-site scripting at the Web Application Security Consortium.

There are many ways of securing a web application against this technique. Some easy actions that can be taken include (a short sketch follows this list):
  • Stripping the input that can be inserted in a form (for example, see the strip_tags function in PHP);
  • Using data encoding to avoid direct injection of potentially malicious characters (for example, see the htmlspecialchars function in PHP);
  • Creating a layer between data input and the back end to avoid direct injection of code in the application.
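As a minimal sketch of the first two points (the "comment" field is an illustrative assumption):

<?php
// Encode user input on output so characters like < > " ' are rendered
// as text instead of being interpreted as HTML or script.
// The "comment" parameter is hypothetical.
$comment = strip_tags($_POST['comment'] ?? '');   // strip markup on input
echo '<p>' . htmlspecialchars($comment, ENT_QUOTES, 'UTF-8') . '</p>';
?>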
Some resources about CMS security

SQL injection and cross-site scripting are only two of the many techniques used by hackers to attack and exploit innocent sites. As a general security guideline, it's important to always stay updated on security issues and, in particular when using third party software, to make sure you've installed the latest available version. Many web applications are built around big communities, offering constant support and updates.
To give a few examples, four of the biggest Open Source content management system communities (Joomla, WordPress, PHP-Nuke, and Drupal) offer useful guidelines on security on their websites and host big community-driven forums where users can escalate issues and ask for support:
  • WordPress: in the Hardening WordPress section of its website, WordPress offers comprehensive documentation on how to strengthen the security of its CMS.
  • Joomla: offers many resources regarding security, in particular a Security Checklist with a comprehensive list of actions webmasters should take to improve the security of a website based on Joomla.
  • Drupal: you can access information about security issues in the Security section of their site, and you can subscribe to their security mailing list to be constantly updated on ongoing issues.
  • PHP-Nuke: offers some documentation about security in chapter 23 of their How to section, dedicated to the system management of this CMS platform, and a section called Hacked - Now what? that offers guidelines to solve issues related to hacking.

Some ways to identify the hacking of your site

As mentioned above, there are many different types of attacks hackers can perform on a site, and there are different methods of exploiting an innocent site. When hackers are able to take complete control of a site, they can deface it (changing the homepage), erase all the content (dropping the tables of your database), or insert malware or cookie stealers. They can also exploit a site for spamming, such as by hiding links pointing to spammy resources or creating pages that redirect to malware sites. When these changes in your application are evident (like defacing), you can easily spot the hacking activity; but for other types of exploits, in particular those with spammy intent, it won't be so obvious.

Google, through some of its products, offers webmasters some ways of spotting if a site has been hacked or modified by a third party without permission. For example, by using Google Search you can spot typical keywords added by hackers to your website and identify the pages that have been compromised. Just open google.com and run a site: search query on your website, looking for commercial keywords that hackers commonly use for spammy purposes (such as viagra, porn, mp3, gambling, etc.):

[site:example.com viagra]

If you're not already familiar with the site: search operator, it's a way to query Google by restricting your search to a specific site. For example, the search site:googleblog.blogspot.com will only return results from the Official Google Blog. When adding spammy keywords to this type of query, Google will return all the indexed pages of your website that contain those spammy keywords and that are, with high probability, hacked. To check these suspicious pages, just open the cached version proposed by Google and you will be able to spot the hacked behavior, if any. You could then clean up your compromised pages and also check for any anomalies in the configuration files of your server (for example on Apache web servers: .htaccess and httpd.conf).
If your site doesn't show up in Google's search results anymore, it could mean that Google has already spotted bad practices on your site as a result of the hacking and may have temporarily removed it from our index, due to infringement of our webmaster quality guidelines.

In order to constantly keep an eye on the presence of suspicious keywords on your website, you could also use Google Alerts to monitor queries like:

site:example.com viagra OR casino OR porn OR ringtones

You will receive an email alert whenever these keywords are found in the content of your site.

You can also use Google's Webmaster Tools to spot any hacking activity on your site. Webmaster Tools provides statistics about top search queries for your site. This data will help you monitor whether your site is ranking for suspicious, unrelated spammy keywords. The 'What Googlebot sees' data is also useful, since you'll see whether Google is detecting any unusual keywords on your site, regardless of whether you're ranking for them.

If you have a Webmaster Tools account and Google believes that your site has been hacked, often you will be notified according to the type of exploit on your site:
  • If a malicious third party is using your site for spammy behaviors (such as hiding links or creating spammy pages) and it has been detected by our crawler, often you will be notified in the Message Center with detailed information (a sample of hacked URLs or anchor text of the hidden links);
  • If your site is exploited to place malicious software such as malware, you will see a malware warning on the 'Overview' page of your Webmaster Tools account.
Hacked behavior removed, now what?

Has your site been hacked, or is it serving malware? First, clean up the mess and then do one of the following:
  • If your site was hacked for spammy purposes, please visit our reconsideration request page through Webmaster Tools to request reconsideration of your site;
  • If your site was serving malware to users, please submit a malware review request on the 'Overview' page of Webmaster Tools.
We hope that you'll find these tips helpful. If you'd like to share your own advice or experience, we encourage you to leave a comment to this blog post. Thanks!

Tuesday, 18 February 2014

State of the Index: my presentation from PubCon Vegas

It seems like people enjoyed when I recreated my Virtual Blight talk from the Web 2.0 Summit late last year, so we decided to post another video. This video recreates the "State of the Index" talk that I did at PubCon in Las Vegas late last year as well.

Here's the video of the presentation:



and if you'd like to follow along, here are the slides:



You can also access the presentation directly. Thanks again to Wysz for recording this video and splicing the slides into the video.

Thursday, 13 February 2014

MT6516 flashing tutorial

Here you can find the instructions on how to flash your MT6516 based phone. Although example images shown here refer to a specific phone, you can flash other phones based on this MediaTek chipset, with the correct ROM files, of course.

Make sure you read everything carefully and know what you are doing. Don't blame me for any damage to your phone.


What's needed:
  • Flashing cable (USB to UART cable with PL2303 chip)
         (images: USB to UART cable, and pinout on the phone side)
  • USB data cable
  • Prolific PL2303 drivers
  • MediaTek USB VCOM drivers
  • SP Flash Tool (v1.1110 or higher)
  • SN Write Tool (alternatively Maui META or WriteCode can be used as well)

And now the tutorial...
  • Make sure that you have already installed the Prolific PL2303 drivers needed for the USB to 3.5 mm headphone jack cable.
  • Open SP Flash Tool for MT6516 and choose which system you want to flash to your HD9. Under Project you can choose Android or Windows Mobile.
  • Make sure that the right COM port is selected and that the baud rate is set to 921600 bps.
  • Click on Format to format the NAND flash of your MT6516 device.
  • Then turn off your device, remove the battery, plug the serial cable into the PC, and plug the jack into the phone.
  • Click Start, replace the battery, and press the power button for a few seconds until you see the red progress bar along with the message Format All is Processing.
  • After the red progress bar is complete, the real format will begin and a green progress bar will appear.
  • When the process is finished, a new window will pop up; just press OK to continue.

  • Now remove the battery once again, keeping the serial cable connected to the PC and the phone.
  • Click Download and make sure that every file needed (again, example for the specific phone used for this tutorial) to flash Android / Windows Mobile is selected:
    • Android
      • BK Modem DB / RS Modem DB - BPLGUInfoCustomApp_MT6516_S01_MAUI_10A_W10_48
      • BK AP DB / RS Modem DB - APDB_MT6516_S00_2010_20
      • PRELOADER - preloader_bird16_a10y.bin
      • UBOOT - uboot_bird16_a10y.bin
      • BOOTIMG - boot.img
      • RECOVERY - recovery.img
      • SEC_RO - secro.img
      • ANDROID - system.img
      • LOGO - logo.bin
      • USERDATA - userdata.img
    • Windows Mobile
      • BK Modem DB / RS Modem DB - BPLGUInfoCustomApp_MT6516_S01_MAUI_09B_W10_16_MP_V5
      • BK AP DB / RS Modem DB - APDB_MT6516_S00_2010_20
      • FLASH BIN file - flash.bin
      • XLDR - MT6516_mldrnandforMTK.nb0
      • EBOOT - MT6516_EBOOTNAND.nb0

  • Click Start, replace the battery, and press the power button for a few seconds until you see the red progress bar along with the message Download is Processing.
  • After the red progress bar is complete, a purple progress bar will appear.
  • Right after the purple progress bar is complete, you'll have to connect the normal USB data cable. A message will appear under the progress bar: “Please insert USB cable in x seconds”.
  • After plugging the USB data cable into the phone, the download will start. The first time you connect the cable, your computer will detect new hardware and you'll have to install the MT6516 USB VCOM drivers.
  • There will be one yellow progress bar for every part of the ROM (preloader to userdata / xldr to eboot).
  • After everything is complete, a new window with a report of the download should pop up:

  • You have now completed the process of flashing the ROM onto your device. Because the NAND flash was formatted at the beginning of the flashing process, you now have to re-write your phone's IMEI1 and IMEI2.
  • Open the SN Write Tool and make sure that the correct COM port is selected and the baud rate is set to 115200 bps.
  • While keeping only the serial cable connected to the phone, remove the battery and wait 10 seconds.
  • Click Start and you will be asked to enter IMEI1 and IMEI2. Enter the correct numbers in the correct fields.
  • After entering the correct IMEI numbers and clicking OK, replace the battery and press the power button on the phone. If you have flashed Android, “=> Meta mode” will appear in the bottom left of the screen.

Attention: Please follow the instructions carefully. I will not take any responsibility for whatever may happen to your phone.

Note: All the needed tools and drivers can be downloaded from my MT6516 Tools 4shared folder. The password to log in is bm-smartphone-reviews.blogspot.com. Have fun.

Update: Here's a trick if you want to flash just one part of the ROM, in this case the recovery: skip the Format step and deselect all parts except the one you want to flash.


After that, you just have to click Start and turn on the phone with the USB data cable connected (in this case you don't need the USB serial cable).

Infinite scroll search-friendly recommendations

Webmaster Level: Advanced

Your site’s news feed or pinboard might use infinite scroll—much to your users’ delight! When it comes to delighting Googlebot, however, that can be another story. With infinite scroll, crawlers cannot always emulate manual user behavior--like scrolling or clicking a button to load more items--so they don't always access all individual items in the feed or gallery. If crawlers can’t access your content, it’s unlikely to surface in search results.

To make sure that search engines can crawl individual items linked from an infinite scroll page, make sure that you or your content management system produces a paginated series (component pages) to go along with your infinite scroll.


Infinite scroll page is made “search-friendly” when converted to a paginated series -- each component page has a similar <title> with rel=next/prev values declared in the <head>.

You can see this type of behavior in action in the infinite scroll with pagination demo created by Webmaster Trends Analyst, John Mueller. The demo illustrates some key search-engine friendly points:
  • Coverage: All individual items are accessible. With traditional infinite scroll, individual items displayed after the initial page load aren’t discoverable to crawlers.
  • No overlap: Each item is listed only once in the paginated series (i.e., no duplication of items).
Search-friendly recommendations for infinite scroll
  1. Before you start:
    • Chunk your infinite-scroll page content into component pages that can be accessed when JavaScript is disabled.
    • Determine how much content to include on each page.
      • Be sure that if a searcher came directly to this page, they could easily find the exact item they wanted (e.g., without lots of scrolling before locating the desired content).
      • Maintain reasonable page load time.
    • Divide content so that there’s no overlap between component pages in the series (with the exception of buffering).


    The example on the left is search-friendly; the example on the right isn't -- it would cause crawling and indexing of duplicative content.

  2. Structure URLs for infinite scroll search engine processing.
    • Each component page contains a full URL. We recommend full URLs in this situation to minimize potential for configuration error.
      • Good: example.com/category?name=fun-items&page=1
      • Good: example.com/fun-items?lastid=567
      • Less optimal: example.com/fun-items#1
      • Test that each component page (the URL) takes anyone directly to the content and is accessible/referenceable in a browser without requiring the same cookies or user history.
    • Any key/value URL parameters should follow these recommendations:
      • Be sure the URL shows conceptually the same content two weeks from now.
        • Avoid relative-time based URL parameters:
          example.com/category/page.php?name=fun-items&days-ago=3
      • Create parameters that can surface valuable content to searchers.
        • Avoid non-searcher valuable parameters as the primary method to access content:
          example.com/fun-places?radius=5&lat=40.71&long=-73.40

  3. Configure pagination with each component page containing rel=next and rel=prev values in the <head> (a markup sketch follows this list). Pagination values in the <body> will be ignored for Google indexing purposes because they could be created with user-generated content (not intended by the webmaster).

  4. Implement replaceState/pushState on the infinite scroll page. (The decision to use one or both is up to you, based on your site's user behavior.) That said, we recommend including pushState (by itself, or in conjunction with replaceState) for the following:
    • Any user action that resembles a click or actively turning a page.
    • To provide users with the ability to serially backup through the most recently paginated content.

  5. Test!
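As a minimal markup sketch for the pagination step above, the <head> of the second component page in a hypothetical three-page series (URLs follow the demo's style and are illustrative) might contain:

<title>Fun items - page 2</title>
<link rel="prev" href="http://example.com/category?name=fun-items&page=1">
<link rel="next" href="http://example.com/category?name=fun-items&page=3">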

Wednesday, 12 February 2014

7 must-read Webmaster Central blog posts

Our search quality and Webmaster Central teams love helping webmasters solve problems. But since we can't be in all places at all times answering all questions, we also try hard to show you how to help yourself. We put a lot of work into providing documentation and blog posts to answer your questions and guide you through the data and tools we provide, and we're constantly looking for ways to improve the visibility of that information.

While I always encourage people to search our Help Center and blog for answers, there are a few articles in particular to which I'm constantly referring people. Some are recent and some are buried in years' worth of archives, but each is worth a read:

  1. Googlebot can't access my website
    Web hosters seem to be getting more aggressive about blocking spam bots and aggressive crawlers from their servers, which is generally a good thing; however, sometimes they also block Googlebot without knowing it. If you or your hoster are "allowing" Googlebot through by whitelisting Googlebot IP addresses, you may still be blocking some of our IPs inadvertently (since our full IP list isn't public, for reasons explained in the post). In order to be sure you're allowing Googlebot access to your site, use the method in this blog post to verify whether a crawler is Googlebot (a minimal sketch of that check appears after this list).
  2. URL blocked by robots.txt
    Sometimes the web crawl section of Webmaster Tools reports a URL as "blocked by robots.txt", but your robots.txt file doesn't seem to block crawling of that URL. Check out this list of troubleshooting tips, especially the part about redirects. This thread from our Help Group also explains why you may see discrepancies between our web crawl error reports and our robots.txt analysis tool.
  3. Why was my URL removal request denied?
    (Okay, I'm cheating a little: this one is a Help Center article and not a blog post.) In order to remove a URL from Google search results you need to first put something in place that will prevent Googlebot from simply picking that URL up again the next time it crawls your site. This may be a 404 (or 410) status code, a noindex meta tag, or a robots.txt file, depending on what type of removal request you're submitting. Follow the directions in this article and you should be good to go.
  4. Flash best practices
    Flash continues to be a hot topic for webmasters interested in making visually complex content accessible to search engines. In this post Bergy, our resident Flash expert, outlines best practices for working with Flash.
  5. The supplemental index
    The "supplemental index" was a big topic of conversation in 2007, and it seems some webmasters are still worried about it. Instead of worrying, point your browser to this post on how we now search our entire index for every query.
  6. Duplicate content
    Duplicate content—another perennial concern of webmasters. This post talks in detail about duplicate content caused by URL parameters, and also references Adam's previous post on deftly dealing with duplicate content, which gives lots of good suggestions on how to avoid or mitigate problems caused by duplicate content.
  7. Sitemaps FAQs
    This post answers the most frequent questions we get about Sitemaps. And I'm not just saying it's great because I posted it. :-)
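Following up on #1, here's a minimal PHP sketch of the reverse-then-forward DNS check for verifying Googlebot; the function name and the IP address shown are illustrative:

<?php
// Verify a crawler claiming to be Googlebot:
// 1) do a reverse DNS lookup of the client IP,
// 2) check that the hostname is under googlebot.com or google.com,
// 3) forward-confirm that the hostname resolves back to the same IP.
function isGooglebot(string $ip): bool {
    $host = gethostbyaddr($ip);                      // reverse lookup
    if ($host === false || $host === $ip) {
        return false;                                 // no PTR record
    }
    if (!preg_match('/\.(googlebot|google)\.com$/', $host)) {
        return false;                                 // not a Google host
    }
    return gethostbyname($host) === $ip;              // forward-confirm
}

var_dump(isGooglebot('66.249.66.1'));                 // illustrative IP
?>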

Sometimes, knowing how to find existing information is the biggest barrier to getting a question answered. So try searching our blog, Help Center and Help Group next time you have a question, and please let us know if you can't find a piece of information that you think should be there!

Specify your canonical

Carpe diem on any duplicate content worries: we now support a format that allows you to publicly specify your preferred version of a URL. If your site has identical or vastly similar content that's accessible through multiple URLs, this format provides you with more control over the URL returned in search results. It also helps to make sure that properties such as link popularity are consolidated to your preferred version.

Let's take our old example of a site selling Swedish fish. Imagine that your preferred version of the URL and its content looks like this:

http://www.example.com/product.php?item=swedish-fish


However, users (and Googlebot) can access Swedish fish through multiple (not as simple) URLs. Even if the key information on these URLs is the same as your preferred version, they may show slight content variations due to things like sort parameters or category navigation:

http://www.example.com/product.php?item=swedish-fish&category=gummy-candy

Or they have completely identical content, but with different URLs due to things such as a tracking parameters or a session ID:

http://www.example.com/product.php?item=swedish-fish&trackingid=1234&sessionid=5678

Now, you can simply add this <link> tag to specify your preferred version:

<link rel="canonical" href="http://www.example.com/product.php?item=swedish-fish" />

inside the <head> section of the duplicate content URLs:

http://www.example.com/product.php?item=swedish-fish&category=gummy-candy
http://www.example.com/product.php?item=swedish-fish&trackingid=1234&sessionid=5678


and Google will understand that the duplicates all refer to the canonical URL: http://www.example.com/product.php?item=swedish-fish. Additional URL properties, like PageRank and related signals, are transferred as well.
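If your pages are generated dynamically, your templates can emit the tag for you. Here's a minimal PHP sketch (the parameter names and URL are illustrative; in this sketch only "item" defines the canonical version):

<?php
// Build the canonical URL by keeping only the parameters that define
// the content, dropping things like tracking and session IDs.
// Parameter names here are hypothetical.
parse_str($_SERVER['QUERY_STRING'] ?? '', $params);
$keep = array_intersect_key($params, array_flip(['item']));
$canonical = 'http://www.example.com/product.php'
    . ($keep ? '?' . http_build_query($keep) : '');
echo '<link rel="canonical" href="'
    . htmlspecialchars($canonical, ENT_QUOTES) . '" />';
?>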

This standard can be adopted by any search engine when crawling and indexing your site.

Of course you may have more questions. Joachim Kupke, an engineer from our Indexing Team, is here to provide us with the answers:

Is rel="canonical" a hint or a directive?
It's a hint that we honor strongly. We'll take your preference into account, in conjunction with other signals, when calculating the most relevant page to display in search results.

Can I use a relative path to specify the canonical, such as <link rel="canonical" href="product.php?item=swedish-fish" />?
Yes, relative paths are recognized as expected with the <link> tag. Also, if you include a <base> link in your document, relative paths will resolve according to the base URL.

Is it okay if the canonical is not an exact duplicate of the content?
We allow slight differences, e.g., in the sort order of a table of products. We also recognize that we may crawl the canonical and the duplicate pages at different points in time, so we may occasionally see different versions of your content. All of that is okay with us.

What if the rel="canonical" returns a 404?
We'll continue to index your content and use a heuristic to find a canonical, but we recommend that you specify existing URLs as canonicals.

What if the rel="canonical" hasn't yet been indexed?
As with all public content on the web, we strive to discover and crawl a designated canonical URL quickly. As soon as we index it, we'll immediately reconsider the rel="canonical" hint.

Can rel="canonical" be a redirect?
Yes, you can specify a URL that redirects as a canonical URL. Google will then process the redirect as usual and try to index it.

What if I have contradictory rel="canonical" designations?
Our algorithm is lenient: We can follow canonical chains, but we strongly recommend that you update links to point to a single canonical page to ensure optimal canonicalization results.

Can this link tag be used to suggest a canonical URL on a completely different domain?
**Update on 12/17/2009: The answer is yes! We now support a cross-domain rel="canonical" link element.**

Previous answer below:
No. To migrate to a completely different domain, permanent (301) redirects are more appropriate. Google currently will take canonicalization suggestions into account across subdomains (or within a domain), but not across domains. So site owners can suggest www.example.com vs. example.com vs. help.example.com, but not example.com vs. example-widgets.com.

Sounds great—can I see a live example?
Yes, wikia.com helped us as a trusted tester. For example, you'll notice that the source code on the URL http://starwars.wikia.com/wiki/Nelvana_Limited specifies its rel="canonical" as: http://starwars.wikia.com/wiki/Nelvana.

The two URLs are nearly identical to each other, except that Nelvana_Limited, the first URL, contains a brief message near its heading. It's a good example of using this feature. With rel="canonical", properties of the two URLs are consolidated in our index and search results display wikia.com's intended version.

Feel free to ask additional questions in our comments below. And if you're unable to implement a canonical designation link, no worries; we'll still do our best to select a preferred version of your duplicate content URLs, and transfer linking properties, just as we did before.

Update: this link tag is currently also supported by Ask.com, Microsoft Live Search, and Yahoo!.

Update: for more information, please see our Help Center articles on canonicalization and rel=canonical.

Faceted navigation best (and 5 of the worst) practices

Webmaster Level: Advanced

Faceted navigation, such as filtering by color or price range, can be helpful for your visitors, but it's often not search-friendly since it creates many combinations of URLs with duplicative content. With duplicative URLs, search engines may not crawl new or updated unique content as quickly, and/or they may not index a page accurately because indexing signals are diluted between the duplicate versions. To reduce these issues and help faceted navigation sites become as search-friendly as possible, we'd like to share some background along with the worst (and best) practices for faceted navigation.



Selecting filters with faceted navigation can cause many URL combinations, such as http://www.example.com/category.php?category=gummy-candies&price=5-10&price=over-10

Background

In an ideal state, unique content -- whether an individual product/article or a category of products/articles --  would have only one accessible URL. This URL would have a clear click path, or route to the content from within the site, accessible by clicking from the homepage or a category page.

Ideal for searchers and Google Search
  • Clear path that reaches all individual product/article pages


    On the left is potential user navigation on the site (i.e., the click path), on the right are the pages accessed.

  • One representative URL for category page
    http://www.example.com/category.php?category=gummy-candies


    Category page for gummy candies

  • One representative URL for individual product page
    http://www.example.com/product.php?item=swedish-fish

    Product page for swedish fish
Undesirable duplication caused with faceted navigation
  • Numerous URLs for the same article/product

    Canonical: example.com/product.php?item=swedish-fish
    Duplicate: example.com/product.php?item=swedish-fish&category=gummy-candies&price=5-10
    The same product page for swedish fish can be available on multiple URLs.

  • Numerous category pages that provide little or no value to searchers and search engines

    URLs:
    example.com/category.php?category=gummy-candies&taste=sour&price=5-10
    example.com/category.php?category=gummy-candies&taste=sour&price=over-10
    Issues
    • No added value to Google searchers given users rarely search for [sour gummy candy price five to ten dollars].
    • No added value for search engine crawlers that discover the same item (“fruit salad”) from parent category pages (either “gummy candies” or “sour gummy candies”).
    • Negative value to site owner who may have indexing signals diluted between numerous versions of the same category.
    • Negative value to site owner with respect to serving bandwidth and losing crawler capacity to duplicative content rather than new or updated pages.
    • No value for search engines (should have 404 response code).
    • Negative value to searchers.

Worst (search un-friendly) practices for faceted navigation

Worst practice #1: Non-standard URL encoding for parameters, like commas or brackets, instead of “key=value&” pairs.
Worst practices:
  • example.com/category?[category:gummy-candy][sort:price-low-to-high][sid:789]
    • key=value pairs marked with : rather than =
    • multiple parameters appended with [ ] rather than &
  • example.com/category?category,gummy-candy,,sort,lowtohigh,,sid,789
    • key=value pairs marked with a , rather than =
    • multiple parameters appended with ,, rather than &
Best practice:
example.com/category?category=gummy-candy&sort=low-to-high&sid=789

While humans may be able to decode odd URL parameters, such as “,,”, crawlers have difficulty interpreting URL parameters when they’re implemented in a non-standard fashion. Mehmet Aktuna, a software engineer on Google’s Crawling Team, says “Using non-standard encoding is just asking for trouble.” Instead, connect key=value pairs with an equal sign (=) and append multiple parameters with an ampersand (&).
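If you generate URLs in PHP, for instance, standard encoding comes for free with http_build_query (a minimal sketch using the values from the example above):

<?php
// Standard key=value pairs joined with & -- nothing exotic to decode.
echo 'example.com/category?' . http_build_query([
    'category' => 'gummy-candy',
    'sort'     => 'low-to-high',
    'sid'      => '789',
]);
// => example.com/category?category=gummy-candy&sort=low-to-high&sid=789
?>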

Worst practice #2: Using directories or file paths rather than parameters to list values that don’t change page content.
Worst practice:
example.com/c123/s789/product?swedish-fish
(where /c123/ is a category, /s789/ is a sessionID that doesn’t change page content)
Good practice:
example.com/gummy-candy/product?item=swedish-fish&sid=789 (the directory, /gummy-candy/, changes the page content in a meaningful way)

Best practice:

example.com/product?item=swedish-fish&category=gummy-candy&sid=789 (URL parameters allow more flexibility for search engines to determine how to crawl efficiently)

It’s difficult for automated programs, like search engine crawlers, to differentiate useful values (e.g., “gummy-candy”) from useless ones (e.g., “sessionID”) when values are placed directly in the path. URL parameters, on the other hand, give search engines the flexibility to quickly test and determine when a given value doesn’t require the crawler to access all variations.

Common values that don’t change page content and should be listed as URL parameters include:

  • Session IDs
  • Tracking IDs
  • Referrer IDs
  • Timestamp
Worst practice #3: Converting user-generated values into (possibly infinite) URL parameters that are crawlable and indexable, but not useful in search results.

Worst practices (e.g., user-generated values like longitude/latitude or “days ago” as crawlable and indexable URLs):

  • example.com/find-a-doctor?radius=15&latitude=40.7565068&longitude=-73.9668408

  • example.com/article?category=health&days-ago=7

Best practices:

  • example.com/find-a-doctor?city=san-francisco&neighborhood=soma

  • example.com/articles?category=health&date=january-10-2014

Rather than allow user-generated values to create crawlable URLs  -- which leads to infinite possibilities with very little value to searchers -- perhaps publish category pages for the most popular values, then include additional information so the page provides more value than an ordinary search results page. Alternatively, consider placing user-generated values in a separate directory and then robots.txt disallow crawling of that directory.

  • example.com/filtering/find-a-doctor?radius=15&latitude=40.7565068&longitude=-73.9668408
  • example.com/filtering/articles?category=health&days-ago=7

with robots.txt:

User-agent: *
Disallow: /filtering/
Worst practice #4: Appending URL parameters without logic.

Worst practices:

  • example.com/gummy-candy/lollipops/gummy-candy/gummy-candy/product?swedish-fish
  • example.com/product?cat=gummy-candy&cat=lollipops&cat=gummy-candy&cat=gummy-candy&item=swedish-fish

Better practice:

example.com/gummy-candy/product?item=swedish-fish

Best practice:

example.com/product?item=swedish-fish&category=gummy-candy

Extraneous URL parameters only increase duplication, causing less efficient crawling and indexing. Therefore, consider stripping unnecessary URL parameters and performing your site’s “internal housekeeping” before generating the URL. If many parameters are required for the user session, perhaps hide the information in a cookie rather than continually append values like cat=gummy-candy&cat=lollipops&cat=gummy-candy& ...
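Here's a minimal PHP sketch of that kind of "internal housekeeping"; the function name, parameter list, and ordering are illustrative assumptions:

<?php
// Drop unnecessary or duplicated parameters and emit the rest in a
// fixed order before generating a URL. Names and order are hypothetical.
function buildCleanUrl(string $path, array $params): string {
    $order = ['item', 'category'];        // searcher-valuable params only
    $clean = [];
    foreach ($order as $key) {
        if (isset($params[$key])) {
            $clean[$key] = is_array($params[$key])
                ? end($params[$key])      // repeated values collapse to one
                : $params[$key];
        }
    }
    return $path . '?' . http_build_query($clean);
}

// Repeated and extraneous "cat" values are dropped entirely:
echo buildCleanUrl('/product', [
    'cat'      => ['gummy-candy', 'lollipops', 'gummy-candy'],
    'item'     => 'swedish-fish',
    'category' => 'gummy-candy',
]);
// => /product?item=swedish-fish&category=gummy-candy
?>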

Worst practice #5: Offering further refinement (filtering) when there are zero results.

Worst practice:

Allowing users to select filters when zero items exist for the refinement.

Refinement to a page with zero results (e.g., price=over-10) is allowed even though it frustrates users and causes unnecessary issues for search engines.

Best practice

Only create links/URLs when it’s a valid user-selection (items exist). With zero items, grey out filtering options. To further improve usability, consider adding item counts next to each filter.


Refinement to a page with zero results (e.g., price=over-10) isn’t allowed, preventing users from making an unnecessary click and search engine crawlers from accessing a non-useful page.

Prevent useless URLs and minimize the crawl space by only creating URLs when products exist. This helps users to stay engaged on your site (fewer clicks on the back button when no products exist), and helps minimize potential URLs known to crawlers. Furthermore, if a page isn’t just temporarily out-of-stock, but is unlikely to ever contain useful content, consider returning a 404 status code. With the 404 response, you can include a helpful message to users with more navigation options or a search box to find related products.

Best practices for new faceted navigation implementations or redesigns

New sites that are considering implementing faceted navigation have several options to optimize the “crawl space” (the totality of URLs on your site known to Googlebot) for unique content pages, reduce crawling of duplicative pages, and consolidate indexing signals.

  • Determine which URL parameters are required for search engines to crawl every individual content page (i.e., determine what parameters are required to create at least one click-path to each item). Required parameters may include item-id, category-id, page, etc.
  • Determine which parameters would be valuable to searchers and their queries, and which would likely only cause duplication with unnecessary crawling or indexing. In the candy store example, I may find the URL parameter “taste” to be valuable to searchers for queries like [sour gummy candies] which could show the result example.com/category.php?category=gummy-candies&taste=sour. However, I may consider the parameter “price” to only cause duplication, such as category=gummy-candies&taste=sour&price=over-10. Other common examples:
    • Valuable parameters to searchers: item-id, category-id, name, brand...
    • Unnecessary parameters: session-id, price-range...
  • Consider implementing one of several configuration options for URLs that contain unnecessary parameters. Just make sure that the unnecessary URL parameters are never required in a crawler or user’s click path to reach each individual product!

    • Option 1: rel="nofollow" internal links
      Make all links to unnecessary URLs rel="nofollow". This option minimizes the crawler's discovery of unnecessary URLs and therefore reduces the potentially explosive crawl space (URLs known to the crawler) that can occur with faceted navigation. rel="nofollow" doesn't prevent the unnecessary URLs from being crawled (only a robots.txt disallow prevents crawling). By allowing them to be crawled, however, you can consolidate indexing signals from the unnecessary URLs with a searcher-valuable URL by adding rel="canonical" from the unnecessary URL to a superset URL (e.g. example.com/category.php?category=gummy-candies&taste=sour&price=5-10 can specify a rel="canonical" to the superset sour gummy candies view-all page at example.com/category.php?category=gummy-candies&taste=sour&page=all).
    • Option 2: Robots.txt disallow
      For URLs with unnecessary parameters, include a /filtering/ directory that will be robots.txt disallow'd. This lets all search engines freely crawl good content but prevents crawling of the unwanted URLs. For instance, if my valuable parameters were item, category, and taste, and my unnecessary parameters were session-id and price, I may have the URL:
      example.com/category.php?category=gummy-candies
      which could link to another URL with a valuable parameter, such as taste:
      example.com/category.php?category=gummy-candies&taste=sour
      but for unnecessary parameters, such as price, the URL includes a predefined directory, /filtering/:
      example.com/filtering/category.php?category=gummy-candies&price=5-10
      which is then disallowed in robots.txt:
      User-agent: *
      Disallow: /filtering/
    • Option 3: Separate hosts
      If you’re not using a CDN (sites using CDNs don’t have this flexibility easily available in Webmaster Tools), consider placing any URLs with unnecessary parameters on a separate host -- for example, creating main host www.example.com and secondary host, www2.example.com. On the secondary host (www2), set the Crawl rate in Webmaster Tools to “low” while keeping the main host’s crawl rate as high as possible. This would allow for more full crawling of the main host URLs and reduces Googlebot’s focus on your unnecessary URLs.
      • Be sure there remains at least one click path to all items on the main host.
      • If you’d like to consolidate indexing signals, consider adding rel=”canonical” from the secondary host to a superset URL on the main host (e.g. www2.example.com/category.php?category=gummy-candies&taste=sour&price=5-10 may specify a rel=”canonical” to the superset “sour gummy candies” view-all page, www.example.com/category.php?category=gummy-candies&taste=sour&page=all).
  • Prevent clickable links when no products exist for the category/filter.
  • Add logic to the display of URL parameters.
    • Remove unnecessary parameters rather than continuously append values.
      • Avoid
        example.com/product?cat=gummy-candy&cat=lollipops&cat=gummy-candy&cat=gummy-candy&item=swedish-fish
    • Help the searcher experience by keeping a consistent parameter order, with searcher-valuable parameters listed first (as the URL may be visible in search results) and searcher-irrelevant parameters last (e.g., session ID).
      • Avoid
        example.com/category.php?session-id=123&tracking-id=456&category=gummy-candies&taste=sour
  • Improve indexing of individual content pages with rel=”canonical” to the preferred version of a page. rel=”canonical” can be used across hostnames or domains.
  • Improve indexing of paginated content (such as page=1 and page=2 of the category “gummy candies”) by either:
    • Adding rel=”canonical” from individual component pages in the series to the category’s “view-all” page (e.g. page=1, page=2, and page=3 of “gummy candies” with rel=”canonical” to category=gummy-candies&page=all) while making sure that it’s still a good searcher experience (e.g., the page loads quickly).
    • Using pagination markup with rel=”next” and rel=”prev” to consolidate indexing properties, such as links, from the component pages/URLs to the series as a whole.
  • Be sure that if you use JavaScript to dynamically sort/filter/hide content without updating the URL, there still exist URLs on your site that searchers would find valuable, such as main category and product pages that can be crawled and indexed. For instance, avoid using only the homepage (i.e., one URL) for your entire site with JavaScript to dynamically change content with user navigation -- this would unfortunately provide searchers with only one URL to reach all of your content. Also, check that performance isn't negatively affected by dynamic filtering, as this could undermine the user experience.
  • Include only canonical URLs in Sitemaps.

Best practices for existing sites with faceted navigation

First, know that the best practices listed above (e.g., rel=”nofollow” for unnecessary URLs) still apply if/when you’re able to implement a larger redesign. Otherwise, with existing faceted navigation, it’s likely that a large crawl space was already discovered by search engines. Therefore, focus on reducing further growth of unnecessary pages crawled by Googlebot and consolidating indexing signals.

  • Use parameters (when possible) with standard encoding and key=value pairs.
  • Verify that values that don’t change page content, such as session IDs, are implemented as standard key=value pairs, not directories
  • Prevent clickable anchors when no products exist for the category/filter (i.e., don't allow clicks or URLs to be created when no items exist for the filter)
  • Add logic to the display of URL parameters
    • Remove unnecessary parameters rather than continuously append values (e.g., avoid example.com/product?cat=gummy-candy&cat=lollipops&cat=gummy-candy&item=swedish-fish)
  • Help the searcher experience by keeping a consistent parameter order, with searcher-valuable parameters listed first (as the URL may be visible in search results) and searcher-irrelevant parameters last (e.g., avoid example.com/category?session-id=123&tracking-id=456&category=gummy-candies&taste=sour in favor of example.com/category.php?category=gummy-candies&taste=sour&session-id=123&tracking-id=456)
  • Configure Webmaster Tools URL Parameters if you have a strong understanding of the URL parameter behavior on your site (make sure that there is still a clear click path to each individual item/article). For instance, with URL Parameters in Webmaster Tools, you can list the parameter name, the parameter's effect on the page content, and how you'd like Googlebot to crawl URLs containing the parameter.


    URL Parameters in Webmaster Tools allows the site owner to provide information about the site’s parameters and recommendations for Googlebot’s behavior.

  • Be sure that if you use JavaScript to dynamically sort/filter/hide content without updating the URL, there still exist URLs on your site that searchers would find valuable, such as main category and product pages that can be crawled and indexed. For instance, avoid using only the homepage (i.e., one URL) for your entire site with JavaScript to dynamically change content with user navigation -- this would unfortunately provide searchers with only one URL to reach all of your content. Also, check that performance isn't negatively affected by dynamic filtering, as this could undermine the user experience.
  • Improve indexing of individual content pages with rel=”canonical” to the preferred version of a page. rel=”canonical” can be used across hostnames or domains.
  • Improve indexing of paginated content (such as page=1 and page=2 of the category “gummy candies”) by either:
    • Adding rel=”canonical” from individual component pages in the series to the category’s “view-all” page (e.g. page=1, page=2, and page=3 of “gummy candies” with rel=”canonical” to category=gummy-candies&page=all) while making sure that it’s still a good searcher experience (e.g., the page loads quickly).
    • Using pagination markup with rel=”next” and rel=”prev” to consolidate indexing properties, such as links, from the component pages/URLs to the series as a whole.
  • Include only canonical URLs in Sitemaps.

Remember that commonly, the simpler you can keep it, the better. Questions? Please ask in our Webmaster discussion forum.