Pattern: Join Datasets via Similar Text
If you have textual data without any validation on top of human entry, you may have a need arise where you need to join datasets based on similar text. Take the two tables below:
Table 1:
Celestial Body | Diameter (km) |
---|---|
Mercury is the first planet from the sun. | 4878 |
Venus is the second planet from the sun. | 12104 |
Earth is the third planet from the sun. | 12756 |
Mars is the fourth planet from the sun. | 6794 |
Jupiter is the fifth planet from the sun. | 142984 |
Saturn is the sixth planet from the sun. | 120536 |
Uranus is the seventh planet from the sun. | 51118 |
Neptune is the eighth planet from the sun | 49523 |
Table 2:
Random Fact |
---|
A Venetian day is longer than our year. |
A year on Neptune is 165 years on our planet. |
It's not possible to stand on Saturn. |
Jupiter has the shortest day of the planets in our solar system. |
Mars has the tallest mountain in our solar system. |
One's weight on the planet Mercury is 38% of his/her weight on our planet. |
Seventy-one percent of the planet Earth is covered in water. |
Uranus is the coldest planet in the solar system. |
There's no key value or identifier in these two tables that we could use to join the datasets. However, if we look close, most of each of the sentences contains the name of a planet. We can use Tables: Start workflow each row to start an inline workflow, and then search for {{celestial-body}}
in table 2 using Tables: Find similar text.
Once we've found the most similar entry, we can record that back to the original dataset. When your workflow has finished running, look at the copy-planets-table field in the Create copy of Planets table step. The one entry that does not match perfectly is Venus vs. Venetian, but Tables: Find similar text also returns a similarity threshold that could be used for subsequent human review of the matches.