Before you can even think of building an algorithm to read an X-ray or interpret a blood smear, the machine needs to know what’s in an image. All the promise of AI in healthcare, a field that attracted $11.3 billion in private investment in 2021, cannot be realized without carefully labeled datasets that tell machines exactly what they’re looking for.
Creating these labeled datasets is becoming an industry unto itself, with some companies well north of unicorn status. Now Encord, a small startup fresh out of Y Combinator, is looking to get in on the action. To help generate labeled datasets for computer vision projects, Encord has launched a beta version of its AI-assisted labeling program, CordVision. The launch follows pilot programs at Stanford Medicine, Memorial Sloan Kettering and King’s College London. It has also been tested by Kheiron Medical and Viz AI.
Encord has developed a set of tools that let radiologists zoom in on DICOM images, the format universally used to transmit medical images. And instead of having a radiologist sit down and annotate an entire image, the software is designed so that only the key parts of an image need to be labeled.
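For context on the format itself, here is a minimal sketch, assuming the open source pydicom library (not Encord’s code), of how labeling software might open a DICOM file before rendering it for annotation; the file name is a placeholder.

```python
# Minimal sketch (not Encord's code): opening a DICOM file with the
# open source pydicom library. The file name is a hypothetical example.
import pydicom

ds = pydicom.dcmread("chest_xray.dcm")  # placeholder path

# A DICOM file bundles metadata and pixel data in a single object.
print(ds.Modality)          # e.g. "CR" or "DX" for an X-ray
print(ds.Rows, ds.Columns)  # image dimensions in pixels

pixels = ds.pixel_array     # NumPy array a labeling tool would render and zoom into
```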
Encord was founded in 2020 by Eric Landau, who has a background in applied physics, and Ulrik Stig Hansen. Hansen was working on a master’s thesis project at Imperial College London centered on the visualization of large datasets of medical images. It was Hansen who first noticed how time-consuming it was to organize sets of labeled data.
These labeled datasets are important because they provide the “ground truths” that algorithms learn from. There are ways to build AI that don’t require labeled datasets, but much of AI, especially in healthcare, has relied on supervised learning, which requires them.
To create a labeled dataset, several doctors will literally go through the images one by one, drawing polygons around the relevant features (see the illustrative record below); other times the work is done with open source or in-house tools. Either way, the scientific literature suggests this step is a major bottleneck in healthcare AI, especially in radiology, an area where AI was expected to make major progress but has largely failed to deliver paradigm shifts.
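To make the polygon-drawing step concrete, here is roughly what one labeled feature can look like in the widely used COCO annotation format; all values, and the “nodule” category, are invented for illustration.

```python
# Illustrative COCO-style annotation (all values invented): one polygon
# a doctor might draw around a feature in a single image.
annotation = {
    "image_id": 1042,
    "category_id": 3,   # e.g. "nodule" in a hypothetical label schema
    "segmentation": [[  # polygon vertices as a flat list of x, y pairs
        312.5, 210.0,
        340.0, 215.5,
        348.0, 242.0,
        318.5, 246.0,
    ]],
    "bbox": [312.5, 210.0, 35.5, 36.0],  # x, y, width, height
    "iscrowd": 0,
}
```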
“I know there is a lot of skepticism [of AI in the medical world]. We think the progress is really slow,” Landau told TechCrunch. “We think moving to an approach where you really think about training data first will help accelerate the progress of these models.”
As the authors of a 2021 paper in Frontiers in Radiology note, it can take human labelers up to 24 years of work to label a dataset of around 100,000 images. A 2021 position paper from the European Association of Nuclear Medicine (EANM) and the European Association of Cardiovascular Imaging (EACVI) likewise notes that “obtaining labeled data in medical image analysis can be time-consuming and expensive.” But the paper also points out that new techniques are emerging that can speed things up.
Encord’s DICOM labeling platform. Image Credits: Encord
Ironically, these new techniques are themselves forms of artificial intelligence. The 2021 Frontiers in Radiology paper, for example, showed that applying an active learning approach could make the process 87% faster. Returning to the 100,000-image example, that would mean only 3.2 years of work instead of 24.
CordVision is, essentially, a version of an active learning process called micro-modeling. The technique works by having a team label a small, representative sample of the images. A specialized AI is then trained on that sample and applied to the larger pool, which it labels. Human reviewers can then check the AI’s work instead of doing the labeling from scratch.
Landau breaks it down nicely in a blog post on his Medium page: imagine creating an algorithm designed to detect Batman in the Batman movies. One micro-model might be trained on five images of Christian Bale’s Batman. Another might be trained to recognize Ben Affleck’s Batman, and so on. Together, those little parts are combined into a bigger algorithm, which is then let loose on the series as a whole.
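For the technically inclined, here is a rough, self-contained sketch of that loop using scikit-learn, with toy data standing in for images and ground-truth labels standing in for human annotators; it illustrates the general micro-modeling idea, not Encord’s actual implementation.

```python
# Toy sketch of the micro-modeling / active-learning loop described above.
# Not Encord's implementation: synthetic data stands in for images, and
# a hidden ground truth stands in for human annotators.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# 1,000 "images" reduced to 16-dimensional feature vectors, with a hidden
# ground truth that we pretend only human annotators can see.
X = rng.normal(size=(1000, 16))
true_labels = (X[:, 0] + X[:, 1] > 0).astype(int)

# Step 1: humans label only a small representative sample (here, 10%).
n_sample = 100
human_labels = true_labels[:n_sample]  # stands in for manual annotation

# Step 2: train a micro-model on that small sample.
model = LogisticRegression().fit(X[:n_sample], human_labels)

# Step 3: the micro-model proposes labels for the rest of the pool.
proposed = model.predict(X[n_sample:])

# Step 4: reviewers correct the proposals instead of labeling from scratch.
# Here we just estimate how much of the work the model did for them.
agreement = (proposed == true_labels[n_sample:]).mean()
print(f"Model pre-labeled {agreement:.0%} of the pool correctly")
```

In practice, several such micro-models (one per Batman, in Landau’s analogy) would be combined, and the human effort shifts from drawing every polygon to reviewing proposals.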
“It’s something we found worked quite well, because you could get away with doing very, very little labeling to start the process,” he said.
Encord has released data to support Landau’s claims. For example, a study conducted with King’s College London compared CordVision to a labeling program developed by Intel. Five labelers processed 25,744 endoscopy video frames, and the gastroenterologists using CordVision worked 6.4 times faster.
The method was also effective when applied to a test set of 15,521 COVID-19 X-rays. Human annotators reviewed only 5% of the images, and the resulting AI labeling model achieved a final accuracy of 93.7%.
That said, Encord is far from the only company that has identified this bottleneck and is looking to use AI to speed up labeling. Established companies in the space already command significant valuations: Scale AI reached a $7.3 billion valuation in 2021, and Snorkel has hit unicorn status.
The company’s biggest competitor, by Landau’s own admission, is probably Labelbox. Labelbox had about 50 customers when TechCrunch covered it at the Series A stage. In January, the company closed a $110 million Series D, putting it within reach of the $1 billion mark.
CordVision is still a very small fish, but it’s riding a tidal wave of demand for data labeling. Landau says the company is going after customers that still rely on open source or in-house tools for their data labeling.
So far, the company has raised $17.1 million in seed and Series A funding since graduating from Y Combinator, and it has grown from its two founders to a team of 20. Encord, Landau says, is not burning money: the company isn’t looking to raise at the moment and believes its existing funding will be enough to carry the tool through commercialization.