SteemSQL - Database updated and ready for Communities

The long-awaited Communities have been launched today. That’s really great news as it is in line with what has been announced by Steemit inc.

The trick with communities is that it doesn’t require any change in the Steem blockchain protocol (i.e. it doesn’t require a hard fork) and deploy its new features by leveraging existing technology.

I will not go into too detailed explanations, but technically and basically, a community is a Steem account whose name looks like hive-NNNNNN where NNNNNN is the community identifier. Communities will then be managed with this account by broadcasting custom_json operations.

To publish in a community, you just need to use this identifier as the main tag of your post. It's that simple. Smart isn't it?

Why a change in SteemSQL database?

In this picture taken from my Steem statistics, we can see which tags are the most used. But, as you can see, there is now two intruders in the group: hive-100421, which is the Threespeak community, and hive-174578, which is the OCD community. And guess what, I bet we will soon see only "hive-" tags in this chart.

This means the category column of the Comments table in SteemSQL, which stores the main tag of each post, will now roughly contains only hive-nnnnnn values.

Not only is it less user-friendly (think of the previous categories like pictures, cats, stats, steem, …), but less efficient to search for posts matching a specific tag.

The main reason is that the tags will now all move into the json_metadata column, and this column has no index on it. On top of it, it contains an “unvalidated’ JSON data structure that you need to parse to retrieve information, either using complex and resource-consuming SQL queries or client-side processing.

Enter the new Tags table

A new Tags table has been created that will store all the tags of all the posts published on the steem blockchain since its inception.

It contains only 2 columns: comment_id and tag

comment_id is a foreign key containing the ID of the post in the Comments table. tag is, of course, one of the tags used to publish the post.

To retrieve all posts related to a specific tag, for example.cats, you will issue the following SQL query:

SELECT 
	Comments.author,  
	Comments.permlink 
FROM 
	Tags 
	INNER JOIN Comments ON Tags.comment_id = Comments.ID 
WHERE 
	Tags.tag = 'cats'

High performances

SteemSQL has been designed to provide very high performances. But what about when it comes to extracting all the tags from the millions of posts that have been published for almost four years? For the number lover, here are some.

The Steem blockchain contains, as of writing 84,090,247 posts and comments. Although technically comments may contain tags, we are not interested in these.

That still leaves us 18,106,842 posts and the number of tags it can contain is limited only by the front-end that was used to publish them and by the amount of data that a block can contain.

When spitting apart all the tags from those posts, one found itself with an impressive amount of 76,771,968 tags that may grow exponentially now that communities are out.

Therefore, I had to implement advanced column-based data storage and query processing to provide SteemSQL with the ability to run performant real-time analytics on a transactional workload.

The results are impressive. The above queries yield a list of 32362 posts about cats, filtered out of the 77M tags, in less than 6 seconds. I know some who will enjoy this!

Limits

To achieve such performances, and reduce storage resource consumption, I had to filter some garbage. Therefore, empty tags (yes, there were some) and tags with more than 32 characters have been ignored.

You may wonder why one would use a tag with more than 32 characters as it is impractical and quite unreadable. But FYI, I found thousands of them, the biggest ones being pure garbage strings of 4000 characters. Shitty spammers.

SteemSQL now has its own community!

You are never as well served as by yourself. So I created a SteemSQL community where you can publish any post related to SteemSQL. I hope to see many of you and, who knows, maybe one day the tag hive-146513 will appear in my statistics.

Hmmm, that makes me think that I should update these ones to display some more user-friendly text. Damn... let's go back to work 😅

Support

If you have any questions or need assistance with SteemSQL, support is provided on the dedicated SteemSQL channel on steem.chat,

You can also contact me directly on steem.chat, Discord or Telegram

Enjoy this new feature and thanks for reading!

Support me and my work for the Steem community.

Vote for my witness