Pattern: Join Datasets via Similar Text

Thomas_937381 Posts: 196
edited July 2020 in Show and Tell

If you have textual data without any validation on top of human entry, you may have a need arise where you need to join datasets based on similar text. Take the two tables below:

Table 1:

Celestial Body Diameter (km)
Mercury is the first planet from the sun. 4878
Venus is the second planet from the sun. 12104
Earth is the third planet from the sun. 12756
Mars is the fourth planet from the sun. 6794
Jupiter is the fifth planet from the sun. 142984
Saturn is the sixth planet from the sun. 120536
Uranus is the seventh planet from the sun. 51118
Neptune is the eighth planet from the sun 49523

Table 2:

Random Fact
A Venetian day is longer than our year.
A year on Neptune is 165 years on our planet.
It's not possible to stand on Saturn.
Jupiter has the shortest day of the planets in our solar system.
Mars has the tallest mountain in our solar system.
One's weight on the planet Mercury is 38% of his/her weight on our planet.
Seventy-one percent of the planet Earth is covered in water.
Uranus is the coldest planet in the solar system.

There's no key value or identifier in these two tables that we could use to join the datasets. However, if we look close, most of each of the sentences contains the name of a planet. We can use Tables: Start workflow each row to start an inline workflow, and then search for {{celestial-body}} in table 2 using Tables: Find similar text.

Once we've found the most similar entry, we can record that back to the original dataset. When your workflow has finished running, look at the copy-planets-table field in the Create copy of Planets table step. The one entry that does not match perfectly is Venus vs. Venetian, but Tables: Find similar text also returns a similarity threshold that could be used for subsequent human review of the matches.