Large language models are all the rage these days, with new ones appearing almost daily. The vast majority of these linguistic behemoths, such as OpenAI’s ChatGPT and Google’s Bard, are trained on text from across the internet – websites, papers, novels, you name it. As a result, they are generalists: decent at a bit of everything, specialized in nothing.
But what if an LLM were trained on the dark web instead of the open web? A team of researchers has done just that with DarkBERT, with surprising results. Let us investigate.
DarkBERT: A group of South Korean researchers published a paper describing how they built an LLM on a large-scale dark web corpus collected by crawling the Tor network. The data included a slew of shady websites across a variety of categories, including cryptocurrency, pornography, hacking, and firearms. The team, however, did not use the data as-is, due to ethical concerns. The researchers refined the pre-training corpus by filtering it before feeding it to DarkBERT, to ensure the model was not trained on sensitive data and that bad actors could not later extract that information from it.
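The paper does not reproduce the exact filtering pipeline, but the general idea of scrubbing a corpus before pretraining can be sketched. The patterns and placeholder tokens below are purely illustrative assumptions, not the authors' actual rules:

```python
import re

# Hypothetical patterns for likely-sensitive strings; the DarkBERT
# authors' real filtering pipeline is not reproduced here.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
BTC_RE = re.compile(r"\b[13][a-km-zA-HJ-NP-Z1-9]{25,34}\b")  # legacy Bitcoin addresses

def scrub(text: str) -> str:
    """Replace likely-sensitive substrings with placeholder tokens."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    text = BTC_RE.sub("<BTC_ADDR>", text)
    return text

# Each document is scrubbed before it ever reaches the pretraining corpus.
corpus = ["contact admin@leaksite.onion for the dump"]
clean = [scrub(doc) for doc in corpus]
```

A real pipeline would also drop whole documents that are mostly sensitive content rather than masking them token by token.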
If you’re curious about the name, DarkBERT is built on the RoBERTa architecture, a transformer-based model introduced by Facebook researchers in 2019.
RoBERTa is a “robustly optimized method for pretraining natural language processing (NLP) systems” that improves on BERT, which Google announced in 2018. After Google open-sourced BERT, Meta’s researchers were able to push its performance further.
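Under the hood, RoBERTa is pretrained with a masked-language-modeling objective: roughly 15% of tokens are hidden and the model learns to predict them, and RoBERTa re-samples the masked positions on every pass (“dynamic masking”) rather than fixing them once. A toy illustration of that sampling step, not the actual training code:

```python
import random

MASK = "<mask>"

def dynamic_mask(tokens, rate=0.15, rng=None):
    """Return a copy of `tokens` with ~rate of positions replaced by <mask>.

    Called fresh on each training pass, so the masked positions differ
    every time -- the "dynamic" part of RoBERTa's dynamic masking.
    """
    rng = rng or random.Random()
    n = max(1, round(len(tokens) * rate))
    positions = set(rng.sample(range(len(tokens)), n))
    return [MASK if i in positions else t for i, t in enumerate(tokens)]

tokens = "stolen credentials posted on a hidden forum".split()
masked = dynamic_mask(tokens, rng=random.Random(0))
```

The model's job during pretraining is then to recover the original token at each `<mask>` position from the surrounding context.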
The Korean researchers then took that model further, feeding it dark web data for around 15 days to arrive at DarkBERT. According to the paper, a machine equipped with an Intel Xeon Gold 6348 CPU and four NVIDIA A100 80GB GPUs was used for the purpose.
Despite its dark moniker, DarkBERT is meant for security and law enforcement applications rather than criminal activities.
Because it was trained on the dark web – home to the shady sites where large dumps of stolen passwords frequently surface – DarkBERT is more effective in cybersecurity and cyber threat intelligence (CTI) applications than previous language models. The model’s creators demonstrated its utility in identifying ransomware leak sites.
Hackers and ransomware groups frequently sell leaked sensitive data, such as passwords and financial information, on the dark web. According to the paper, DarkBERT can help security researchers automatically identify such websites. It can also be used to crawl the plethora of dark web forums and monitor exchanges of illicit information.
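In the paper, leak-site detection is done by fine-tuning DarkBERT as a classifier over page text. As a far cruder stand-in that only shows the shape of the task, one could compare a page's bag-of-words against reference text from known leak pages; everything below (reference text, threshold) is made up for illustration:

```python
from collections import Counter
import math

def bow(text):
    """Bag-of-words: lowercase token counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two Counter vectors."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Toy reference text resembling a ransomware leak page (entirely made up).
LEAK_REF = bow("leaked data dump stolen passwords database ransom deadline payment")

def looks_like_leak_site(page_text, threshold=0.2):
    """Flag pages whose vocabulary resembles the leak-page reference."""
    return cosine(bow(page_text), LEAK_REF) >= threshold
```

A learned model like DarkBERT captures context and paraphrase that this keyword overlap cannot, which is precisely why dark-web-specific pretraining helps here.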
However, while DarkBERT is better suited to “dark web domain-specific tasks” than other models, the researchers acknowledge that certain tasks may require further fine-tuning, owing to the scarcity of publicly available, task-specific dark web data.
Regardless, DarkBERT points to a future in which AI models are trained on narrowly targeted data to perform specific tasks. Where ChatGPT and Google Bard are multi-purpose Swiss Army knives, DarkBERT is a specialized weapon for stopping hackers.