How to fix canonical URLs and links in your pre-fork posts

@holger80

As you can see here, some of my blog posts written before the Hive fork still contain links pointing to steemit.com. It's time to replace them all.

Almost all blog posts written before the fork were written with apps that are not included in hivescript, which leads to problems with canonical URLs:

As can be seen here, my old post "update for beem: first release for HF 21" results in different canonical URLs on different front-ends. Search engines then treat this as duplicate content. The post was written through palnet. As there is no entry for palnet in hivescript, the front-ends do not know how to build a proper canonical URL:

Fixing the mess

Fixing means:

  • replacing all steemit, steempeak, ... links with relative links
  • setting canonical_url for each post written before 2020-03-20, to fix canonical URLs.

Small update

The script now uses relative links: when a link to steemit.com, ... is found, it is replaced by a relative link. A relative link looks like: [holger80](/@holger80) and [this post](/hive-139531/@holger80/how-to-fix-canonical-urls-and-links-in-your-pre-fork-posts)

Small update 2

There are now three boolean parameters, which can be used to set the following:

  • replace_steemit_links: when True, steemit, ... links will be replaced
  • use_relative_links: when True, relative links will be used (starting with /)
  • add_canonical_url: When True, a canonical_url is added to the metadata
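
As a rough illustration of how use_relative_links affects the generated links, the following sketch mirrors the string concatenation used in the script. The helper name build_link and its signature are hypothetical and not part of beem; in the script itself, the author and category come from an Account or Comment lookup on the chain.

```python
def build_link(author, permlink=None, category=None,
               use_relative_links=True, canonical_url="https://hive.blog"):
    """Build a link to a user or a post (hypothetical helper, not part of beem)."""
    # Relative links start with "/", absolute links with the front-end URL
    prefix = "" if use_relative_links else canonical_url.rstrip("/")
    if permlink is None:
        # Link to a user profile: /@<author>
        return prefix + "/@" + author
    # Link to a post: /<category>/@<author>/<permlink>
    return prefix + "/" + category + "/@" + author + "/" + permlink


print(build_link("holger80"))
print(build_link("holger80",
                 "how-to-fix-canonical-urls-and-links-in-your-pre-fork-posts",
                 category="hive-139531", use_relative_links=False))
```

With use_relative_links set to True, each front-end resolves the link against its own domain, so readers stay on the front-end they are currently using.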

Small update 3

It is now possible to use the same script to fix the canonical links on STEEM for all posts written before the fork. When you want to use the script on STEEM:

  • set target_blockchain = "steem"

When you want to use the script on HIVE:

  • set target_blockchain = "hive"

Python code

The following script uses beem and does exactly this.

beem can be installed by

pip install beem 

or

conda install beem 

Store the following as fix_canonical_urls_hive.py:

#!/usr/bin/python 
from beem import Hive, Steem 
from beem.utils import addTzInfo 
from beem.account import Account 
from beem.comment import Comment 
from beem.nodelist import NodeList 
import time 
from datetime import datetime 
import getpass 
 
 
if __name__ == "__main__": 
    # Parameter 
    canonical_url = "https://hive.blog" 
    replace_steemit_links = True 
    use_relative_links = True 
    add_canonical_url = True 
    target_blockchain = "hive" # can be hive or steem 
    # ---- 
    # at least one option must be true 
    assert replace_steemit_links or add_canonical_url 
    assert target_blockchain in ["hive", "steem"] 
    # Canonical url must not end with / 
    if canonical_url[-1] == "/": 
        canonical_url = canonical_url[:-1] 
    nodelist = NodeList() 
    nodelist.update_nodes() 
    test_run_answer = input("Do a test run? [y/n]") 
    if test_run_answer in ["y", "Y", "yes"]: 
        test_run = True 
        print("Doing a test run on %s!" % target_blockchain) 
    else: 
        test_run = False 
    if test_run: 
        if target_blockchain == "hive": 
            blockchain_instance= Hive(node=nodelist.get_hive_nodes()) 
        else: 
            blockchain_instance= Steem(node="https://api.steemit.com") 
    else: 
        wif = getpass.getpass(prompt='Enter your posting key for %s.' % target_blockchain) 
        if target_blockchain == "hive": 
            blockchain_instance = Hive(node=nodelist.get_hive_nodes(), keys=[wif]) 
        else: 
            blockchain_instance = Steem(node="https://api.steemit.com", keys=[wif]) 
    if target_blockchain == "hive": 
        assert blockchain_instance.is_hive 
    else: 
        assert blockchain_instance.is_steem 
 
    account = input("Account name =") 
    account = Account(account, blockchain_instance=blockchain_instance) 
    if add_canonical_url: 
        print("Start to fix canonical_url on %s for %s" % (target_blockchain, account["name"])) 
    if replace_steemit_links: 
        print("Start to replace steemit links on %s for %s" % (target_blockchain, account["name"])) 
     
    apps_with_cannonical_url = ["hiveblog", "peakd", "esteem", "steempress", "actifit", 
                                "travelfeed", "3speak", "steemstem", "leofinance", "clicktrackprofit", 
                                "dtube"] 
    hive_fork_date = addTzInfo(datetime(2020, 3, 20, 14, 0, 0)) 
    blog_count = 0 
    expected_count = 100 
    # Keep fetching while every batch so far was full (100 entries); a shorter batch ends the loop 
    while expected_count - blog_count == 100: 
         
        for blog in account.get_blog_entries(start_entry_id=blog_count, raw_data=False): 
            blog_count += 1 
            if blog["parent_author"] != "": 
                continue 
            if blog["author"] != account["name"]: 
                continue 
            if "canonical_url" in blog.json_metadata and canonical_url in blog.json_metadata["canonical_url"]: 
                continue 
            if "app" in blog.json_metadata and blog.json_metadata["app"].split("/")[0] in apps_with_cannonical_url and target_blockchain == "hive": 
                continue 
            if blog["created"] > hive_fork_date: 
                continue 
            body = blog.body 
            if "links" in blog.json_metadata: 
                links = blog.json_metadata["links"] 
            else: 
                links = None 
            if "links" in blog.json_metadata and replace_steemit_links: 
                for link in blog.json_metadata["links"]: 
                    if "steemit.com" in link or "steempeak.com" in link or "busy.org" in link or "partiko.app" in link: 
                        authorperm = link.split("@") 
                        acc = None 
                        post = None 
                        new_link = "" 
                        if len(authorperm) == 1: 
                            continue 
                        authorperm = authorperm[1] 
                        if authorperm.find("/") == -1: 
                            try: 
                                acc = Account(authorperm, blockchain_instance=blockchain_instance) 
                                if use_relative_links: 
                                    new_link = "/@" + acc["name"] 
                                else: 
                                    new_link = canonical_url + "/@" + acc["name"] 
                            except Exception: 
                                # Not a valid account on this chain, keep the link 
                                continue 
                        else: 
                            try: 
                                post = Comment(authorperm, blockchain_instance=blockchain_instance) 
                                if use_relative_links: 
                                    new_link =  "/" + post.category + "/" + post.authorperm 
                                else: 
                                    new_link =  canonical_url + "/" + post.category + "/" + post.authorperm 
                            except Exception: 
                                # Not a valid post on this chain, keep the link 
                                continue 
                        if new_link != "": 
                            for i in range(len(links)): 
                                if links[i] == link: 
                                    links[i] = new_link 
                            body = body.replace(link, new_link) 
                            print("Replace %s with %s" % (link, new_link)) 
                             
            json_metadata = blog.json_metadata or {} 
            if links is not None and replace_steemit_links: 
                json_metadata["links"] = links             
            if add_canonical_url: 
                json_metadata["canonical_url"] = canonical_url + "/" + blog["category"] + "/@" + blog["author"] + "/" + blog["permlink"] 
                print("Edit post nr %d with canonical_url=%s" % (blog_count, json_metadata["canonical_url"])) 
            print("---") 
            if not test_run: 
                try: 
                    blog.edit(body, meta=json_metadata, replace=True) 
                except Exception as e: 
                    print("Skipping %s due to error: %s" % (blog.authorperm, e)) 
                time.sleep(6) 
 
        expected_count += 100 
     
     

You can now start the script with:

python fix_canonical_urls_hive.py 

If you are on Linux, you may need to replace pip with pip3 and python with python3.

How does it work

The script goes through all blog posts written before the fork on 2020-03-20. Whenever a post was written by an app that is not properly handled by hivescript, a new canonical_url is set.
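
The selection logic can be sketched without any chain access as follows. This is a simplified illustration, not the script itself: the function name needs_fix is hypothetical, the app list is shortened, and the fork date is assumed to be in UTC.

```python
from datetime import datetime, timezone

# Fork date as used in the script (assumed here to be UTC)
HIVE_FORK_DATE = datetime(2020, 3, 20, 14, 0, 0, tzinfo=timezone.utc)
# Shortened example list; the script contains the full list
APPS_WITH_CANONICAL_URL = ["hiveblog", "peakd", "esteem"]


def needs_fix(created, json_metadata, canonical_url="https://hive.blog",
              target_blockchain="hive"):
    """Return True when a post would be edited (hypothetical helper)."""
    # Skip posts that already point to the preferred front-end
    if canonical_url in json_metadata.get("canonical_url", ""):
        return False
    # "app" has the form "name/version"; only the name is compared
    app = json_metadata.get("app", "").split("/")[0]
    if target_blockchain == "hive" and app in APPS_WITH_CANONICAL_URL:
        return False
    # Posts created after the fork are left untouched
    return created <= HIVE_FORK_DATE


old_post = datetime(2019, 8, 1, tzinfo=timezone.utc)
print(needs_fix(old_post, {"app": "palnet/1.0"}))
print(needs_fix(old_post, {"app": "peakd/1.0"}))
```

The version strings in the example calls are made up for illustration.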

You can define your preferred front-end here:

canonical_url = "https://hive.blog" 

If you like other front-ends, you can replace this line by

  • canonical_url = "https://peakd.com"
  • canonical_url = "https://leofinance.io"
  • canonical_url = "https://esteem.app"
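
To make the effect of this setting more concrete, here is a minimal sketch of how the canonical_url entry ends up in the post metadata. The helper with_canonical_url is hypothetical; it only reproduces the concatenation done in the script and removes trailing slashes from the front-end URL, as the script also requires.

```python
import json


def with_canonical_url(json_metadata, front_end, category, author, permlink):
    """Return a copy of the metadata with canonical_url set (hypothetical helper)."""
    # The front-end URL must not end with "/", otherwise "//" would appear
    front_end = front_end.rstrip("/")
    updated = dict(json_metadata or {})
    updated["canonical_url"] = (
        front_end + "/" + category + "/@" + author + "/" + permlink
    )
    return updated


meta = {"app": "palnet/1.0", "tags": ["beem"]}  # example values for illustration
fixed = with_canonical_url(
    meta, "https://peakd.com/", "hive-139531", "holger80",
    "how-to-fix-canonical-urls-and-links-in-your-pre-fork-posts")
print(json.dumps(fixed, indent=2))
```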

In the next step, all links used in the post are checked. Whenever a link pointing to steemit.com, steempeak.com, busy.org or partiko.app refers to a valid Hive post or a valid Hive user, it is replaced by a relative URL (or by a link to the chosen front-end when use_relative_links is False).
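
How a link is recognized as a user link or a post link can be sketched as follows. The function classify_link is a hypothetical helper that only covers the string handling; the script additionally verifies each result by loading it as an Account or Comment through beem, which is omitted here.

```python
OLD_FRONT_ENDS = ("steemit.com", "steempeak.com", "busy.org", "partiko.app")


def classify_link(link):
    """Return ("account", name), ("post", authorperm) or None (hypothetical helper)."""
    # Only links to the old front-ends are candidates for replacement
    if not any(host in link for host in OLD_FRONT_ENDS):
        return None
    parts = link.split("@")
    if len(parts) == 1:
        # No "@" in the link, so it is neither a user nor a post link
        return None
    target = parts[1]
    if "/" not in target:
        # Nothing follows the account name: link to a user
        return ("account", target)
    # "author/permlink": link to a post
    return ("post", target)


print(classify_link("https://steemit.com/@holger80"))
print(classify_link("https://steemit.com/utopian-io/@holger80/some-post"))
```

The post permlink in the second call is a placeholder, not a real post.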

Test run

You can do a test run to check what the script will change:

This now shows the following information:

The canonical URL that will be set is shown, as well as all links that will be replaced.

Fixing your posts

We can now start to fix all old posts:

Results

All changes have been broadcast:

The links have been corrected, as shown here. There seems to be a bug on hive.blog: steemit.com links are shown as internal links and hive.blog links as external links.

The canonical url is also fixed:

It seems that esteem.app has not changed its canonical URL yet. As far as I know, esteem.app reads the canonical_url parameter (it works for steempress), so it may correct the canonical URLs later.

After a fix on esteem.app, esteem.app is now using the correct canonical URL:

Results on STEEM

Setting canonical_url also works on steemit:

I used seoreviewtools to check the canonical urls.


If you like what I do, consider casting a vote for me as witness on Hivesigner or on PeakD