Revisiting Regex Generation for Modeling Industrial Applications by Incorporating Byte Pair Encoder

by Desheng Wang, et al.

Regular expressions are important for many natural language processing tasks, especially those that deal with unstructured and semi-structured data. This work focuses on automatically generating regular expressions and proposes a novel genetic algorithm for the problem. Unlike methods that generate regular expressions at the character level, we first use a byte pair encoder (BPE) to extract frequent items, which are then used to construct regular expressions. The fitness function of our genetic algorithm is multi-objective and is optimized through an evolutionary procedure that includes crossover and mutation operations. The fitness function takes into account the length of the generated regular expression, the number of characters and samples matched in the positive training samples (to be maximized), and the number of characters and samples matched in the negative training samples (to be minimized). In addition, to accelerate training, we apply exponential decay to the population size of the genetic algorithm. Our method and a strong baseline are tested on 13 kinds of challenging datasets. The results demonstrate the effectiveness of our method, which outperforms the baseline on 10 kinds of data and achieves nearly 50 percent improvement on average. With exponential decay, training is approximately 100 times faster than without it. In summary, our method is both effective and efficient, and is suitable for industrial applications.
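The abstract names three ingredients: BPE-extracted frequent items, a multi-objective fitness, and an exponentially decaying population size. The Python sketch below illustrates each in its simplest form. It is not the authors' implementation: the abstract gives no formulas, so the function names (`frequent_items`, `fitness`, `population_size`), the additive way the objectives are combined, the `length_weight` penalty, and the decay parameters are all assumptions made for illustration.

```python
import re
from collections import Counter


def frequent_items(samples, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair
    and return the multi-character tokens that result."""
    seqs = [list(s) for s in samples]
    for _ in range(num_merges):
        pairs = Counter()
        for seq in seqs:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:  # nothing frequent is left to merge
            break
        new_seqs = []
        for seq in seqs:
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                    out.append(a + b)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_seqs.append(out)
        seqs = new_seqs
    return {tok for seq in seqs for tok in seq if len(tok) > 1}


def _match_stats(rx, samples):
    """Return (total matched characters, number of samples with a match)."""
    chars, hits = 0, 0
    for s in samples:
        n = sum(m.end() - m.start() for m in rx.finditer(s))
        chars += n
        hits += 1 if n > 0 else 0
    return chars, hits


def fitness(pattern, positives, negatives, length_weight=0.1):
    """Assumed additive score: reward matches on positives, penalize
    matches on negatives and long patterns. Invalid regexes score -inf."""
    try:
        rx = re.compile(pattern)
    except re.error:
        return float("-inf")
    pos_chars, pos_hits = _match_stats(rx, positives)
    neg_chars, neg_hits = _match_stats(rx, negatives)
    return (pos_chars + pos_hits) - (neg_chars + neg_hits) - length_weight * len(pattern)


def population_size(generation, initial=1000, rate=0.9, floor=10):
    """Exponentially decayed population size (parameters are assumptions)."""
    return max(floor, round(initial * rate ** generation))
```

In a sketch like this, the items returned by `frequent_items` would serve as building blocks for candidate expressions in place of single characters, and `population_size(g)` would cap how many candidates survive into generation `g`, so later generations are cheaper to evaluate.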




The quasispecies regime for the simple genetic algorithm with roulette-wheel selection

We introduce a new parameter to discuss the behavior of a genetic algori...

Let's FACE it. Finnish Poetry Generation with Aesthetics and Framing

We present a creative poem generator for the morphologically rich Finnis...

Generating Modern Poetry Automatically in Finnish

We present a novel approach for generating poetry automatically for the ...

Determination of weight coefficients for additive fitness function of genetic algorithm

The paper presents a solution for the problem of choosing a method for a...

Kernel Density Estimation by Genetic Algorithm

This study proposes a data condensation method for multivariate kernel d...

Data-Driven Regular Expressions Evolution for Medical Text Classification Using Genetic Programming

In medical fields, text classification is one of the most important task...

Performance evaluation and design for variable threshold alarm systems through semi-Markov process

In large industrial systems, alarm management is one of the most importa...
