A British ‘tech bio’ start-up, Basecamp Research, says the combination of machine learning, public datasets, and a knowledge graph built on Neo4j is proving critical to its mission of finding new proteins.
It is also helping power a unique business process that may mean, unlike far too many examples in the past, that the inhabitants of the remote sites where useful proteins are uncovered get properly compensated for any eventual commercialization of what the team finds.
As its Senior Data Engineer, Saif Ur-Rehman, says,
We want to provide biodiversity guardians with a fair and equitable way to share their environmental genomic data with biotechnology innovators to power the emerging bioeconomy.
That ‘emerging bioeconomy’ was identified by The World Business Council for Sustainable Development in 2020 as a $7.7 trillion opportunity for business.
If a bioeconomy takes off, it could be a key element in the fight against climate change, biodiversity loss and resource scarcity.
The basis of Basecamp’s work is going out to places like the Arctic, rainforests and jungles to look for proteins. These are useful in everything from new alternatives to meat, to new biofuels and drugs, as well as the PCR tests that we have all used since the onset of COVID-19.
The problem: finding them. As Ur-Rehman’s colleague, Philipp Lorenz, the firm’s London-based Chief Technology Officer, says:
We’re going out into nature to look for living and chemical things and transform them into a digital record and create an extensive cloud pipeline that takes that raw genetic and biological information and annotates this in hundreds of different ways.
In data terms, a protein can be represented either as a string of text drawn from 20 permissible characters (one per standard amino acid), or as the 3D coordinates of its atomic structure.
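That string representation is simple to work with in code. A minimal Python sketch, using a made-up sequence, validates text against the 20-letter amino-acid alphabet:

```python
# The 20 standard amino acids, one letter each -- the "20 permissible characters".
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")

def is_valid_protein(seq: str) -> bool:
    """Check that a sequence uses only the 20 standard amino-acid letters."""
    return bool(seq) and set(seq.upper()) <= AMINO_ACIDS

# A short, invented sequence fragment:
print(is_valid_protein("MKTAYIAKQR"))  # True
print(is_valid_protein("MKTAB1"))      # False: 'B' and '1' are not standard residues
```

The 3D representation is simply a list of (x, y, z) coordinates per atom; most real pipelines parse it from PDB or mmCIF files rather than building it by hand.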
Basecamp manipulates those data elements in a multidimensional knowledge graph, trying to generate protein design insights by observing how proteins evolve.
This is a large and complex task, says Lorenz. Each protein identity that makes it into the firm’s knowledge graph needs hundreds of different biological label types, like the species it comes from and the genes surrounding it, plus “tons and tons and tons” of biological and geological context.
Other factors like the climate, or the humidity, the pH, the potassium content, and many other markers at the site where the soil was gathered by Basecamp’s field teams, also need to be stored. As Ur-Rehman says,
You quickly end up with a complex data structure that needs to securely connect metadata in a geophysical place to the soil. And once you’ve got that metadata, and the proteins you’ve extracted, and then whatever annotations you’ve added, one piece of paper can very quickly go to millions and millions of data points.
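The structure Ur-Rehman describes, site metadata connected to soil samples, proteins and annotations, can be sketched as a tiny in-memory property graph. Every node label, relationship type and property value below is an illustrative invention, not Basecamp’s actual schema:

```python
# Hypothetical nodes keyed by ID, each with a label and properties.
nodes = {
    "site:arctic-01":  {"label": "Site", "pH": 6.2, "humidity": 0.31, "potassium_ppm": 140},
    "sample:soil-881": {"label": "SoilSample"},
    "protein:P0001":   {"label": "Protein", "sequence": "MKTAYIAKQR"},
    "species:sp-42":   {"label": "Species", "name": "uncultured bacterium"},
}
# Hypothetical directed relationships as (source, type, target) triples.
edges = [
    ("sample:soil-881", "COLLECTED_AT",   "site:arctic-01"),
    ("protein:P0001",   "EXTRACTED_FROM", "sample:soil-881"),
    ("protein:P0001",   "BELONGS_TO",     "species:sp-42"),
]

def neighbours(node_id, rel=None):
    """Follow outgoing relationships, optionally filtered by type."""
    return [dst for src, r, dst in edges if src == node_id and (rel is None or r == rel)]

# From a protein, hop back through the soil sample to the site's geochemistry:
sample = neighbours("protein:P0001", "EXTRACTED_FROM")[0]
site = neighbours(sample, "COLLECTED_AT")[0]
print(nodes[site]["pH"], nodes[site]["humidity"])
```

In Neo4j itself that final traversal would be a single Cypher pattern, something like `MATCH (p:Protein)-[:EXTRACTED_FROM]->(:SoilSample)-[:COLLECTED_AT]->(s:Site) RETURN s`, again with hypothetical labels.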
In addition, the firm contextualizes all this internal data against publicly curated data from numerous biological datasets, which allows community detection algorithms to be applied. The final step is to run all this data through an ETL (Extract, Transform, Load) process into the Knowledge Graph.
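As a toy illustration of what community detection over such a graph means, here is the crudest possible version, connected components over a small similarity graph, in standard-library Python. Real systems would use proper algorithms such as Louvain or label propagation from a graph data science library, and the protein IDs here are invented:

```python
from collections import defaultdict

# Hypothetical undirected similarity edges between proteins.
edges = [("P1", "P2"), ("P2", "P3"), ("P4", "P5")]

adj = defaultdict(set)
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

def communities(adj):
    """Connected components: the simplest notion of a 'community'."""
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            n = stack.pop()
            if n not in comp:
                comp.add(n)
                stack.extend(adj[n] - comp)
        seen |= comp
        comps.append(comp)
    return comps

for comp in communities(adj):
    print(sorted(comp))
```

Groups of proteins that cluster together across internal and public data are candidates for sharing a function, which is what makes the contextualization step worth the effort.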
Why go to all this trouble? Because not only are there huge potential economic rewards, but there just isn’t this level of record available to researchers, says Lorenz. He explains:
We’re a protein design and a protein AI company and we do use the resources and databases out there, but they are flawed because every protein that we know about is derived from less than 1% of what we know of life on Earth.
If all these super-valuable biotech products are derived from that tiny sliver of knowledge we know about biodiversity on our planet, imagine what we could develop if we knew even more like 4 or 8%, rather than just 1%, of the biodiversity on our planet.
Once all the relevant environmental, geological, and chemical conditions are properly stored, as well as all the micro-ecology information, both Basecamp and its commercial customers can look at the DNA they want, to work out what organisms it came from in its full genomic context. This is all then complemented by Basecamp’s ML, which is how it ultimately delivers value.
To get to that level of insight, Lorenz says several data formats, including relational and document, were evaluated but soon rejected. He says:
I wouldn’t want to have to connect a protein to its chemical and geological environment in an SQL database. It would probably take a year just to write the query.
Fundamentally, graph won out because any piece of annotation you have on a molecule is always patchy; you might have five pieces of annotations on one molecule, and none on another, and a relational database would not handle that very well, because you’ll end up with a whole bunch of tables with a whole bunch of annotations on them, which is not particularly great from a querying or performance viewpoint.
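Lorenz’s point about patchy annotation can be made concrete: in a graph model each annotation is an optional relationship, so a heavily annotated protein and an unannotated one coexist without sparse table columns. A schematic sketch, with invented protein IDs and annotation types:

```python
# Hypothetical annotations as (protein, type, value) triples -- optional edges,
# not fixed columns. P2 simply has no rows, rather than a row full of NULLs.
annotations = [
    ("P1", "HAS_FUNCTION", "hydrolase"),
    ("P1", "FOUND_IN",     "arctic soil"),
    ("P1", "SIMILAR_TO",   "P7"),
]

def annotations_for(protein):
    """Collect whatever annotations happen to exist for a protein."""
    return [(rel, val) for p, rel, val in annotations if p == protein]

print(annotations_for("P1"))  # three annotations
print(annotations_for("P2"))  # [] -- unannotated, at no schema cost
```

In Cypher the lookup is one pattern regardless of how many annotations exist, along the lines of `MATCH (p:Protein {id: $id})-[r]->(a) RETURN type(r), a` (hypothetical labels again), whereas a relational design would need a join, or a nullable column, per annotation type.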
Once graph was adopted, progress has been rapid, says the company. Ur-Rehman explains:
We’ve now got a massive, structured Knowledge Graph – I think we’re at something like 120 million proteins and counting, so that’s already 120 million nodes, and we’re up to around a billion relationships.
But does having a billion relationships do that much for you? Basecamp keeps its cards close to its chest when it comes to calling out specific results, but does confirm it recently helped a customer better understand why a particular protein it was interested in seemed to have no evolutionary reason for a specific function.
Basecamp applied graph-based data science to help, by looking for a so-called ‘signal’ in the protein’s surrounding network topology.
By constructing a special dataset of curated positive examples, proteins with the function, set against negative examples, proteins without it, the Knowledge Graph was able to surface a set of new candidate sequences for this particular function.
What’s important to see here, he says, is that these were findings that a human researcher would not have been able to find as “there’s no obvious reason for this to work”. He says:
Literally every one of the hits that we looked at worked, and that was very exciting. None of this was found by lab work, but with a Knowledge Graph, and that’s something that could be a step change in biotechnology moving forward and could be a very, very interesting example of a new kind of in silico detection of protein function.
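The positives-versus-negatives set-up described above is a standard discriminative framing. As a heavily simplified sketch, nothing like Basecamp’s actual method and using invented sequences, candidates can be scored by whether they resemble the positive set more than the negative set, here via Jaccard similarity over k-mers:

```python
def kmers(seq, k=3):
    """All overlapping k-letter substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def similarity(a, b, k=3):
    """Jaccard similarity of k-mer sets -- a crude proxy for relatedness."""
    ka, kb = kmers(a, k), kmers(b, k)
    return len(ka & kb) / len(ka | kb) if ka | kb else 0.0

# Invented sequences: positives known to have the function, negatives known not to.
positives = ["MKTAYIAKQR", "MKTAYIVKQR"]
negatives = ["GGGSSGGSSG"]

def predict(candidate):
    """Label a candidate by whichever curated set it most resembles."""
    best_pos = max(similarity(candidate, p) for p in positives)
    best_neg = max(similarity(candidate, n) for n in negatives)
    return best_pos > best_neg

print(predict("MKTAYIAKQQ"))  # True -- closer to the positive set
```

The in silico step change Ur-Rehman describes is that the ranking signal came from graph topology rather than from lab assays; sequence similarity stands in for that signal here only to keep the sketch self-contained.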
Summing up his organization’s use of graph, Lorenz says:
We want to be the leading company in the world that can predict and design any protein, for any function and any performance – to have a generalizable, AI- and graph-based platform that can design any solution for any biotech problem.
We’re getting there by applying Knowledge Graph to very different challenges across very different markets through customer-directed challenges on many problems, which strengthens our platform.
It’s kind of a big mission – but I think we’re on our way to getting there.