Using regex to select elements of a string with Python
Imagine you have a list of strings, such as you might get after reading in a text file line by line, that looks as follows:
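For illustration, assume a list along these lines. The values are hypothetical, chosen to be consistent with the examples further down, with each label and its --> end string sitting in the same element:

```python
# Hypothetical sample data (an assumption for illustration):
# one "label::value-->" string per list element.
data = [
    'address::1, Humpty Dumpty Lane, Somewhere -->',
    'month::5-->',
    'day::13-->',
]
```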
From the above, it looks like each value is provided between some label, e.g. month::, and an end string, e.g. -->. If, for example, you wanted to get at the month part, you could access it as follows, from which point you could start messing around with splitting etc.:
    list(filter(lambda k: 'month::' in k, data))
    >>['month::5-->']
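As a rough sketch of the splitting step, one way to isolate the value is to cut away the label and the end string from the matched element. The data list below is a stand-in assumed for illustration, and extract_value is a hypothetical helper name, not something from a library:

```python
# Stand-in data (assumed for illustration).
data = ['month::5-->', 'day::13-->']

def extract_value(item, label, end='-->'):
    """Return the text between `label` and `end` within a single string."""
    # Keep everything after the label, then everything before the end string.
    after_label = item.split(label, 1)[1]
    return after_label.split(end, 1)[0].strip()

# Select the elements containing the label, then pull the value out of the first one.
matches = list(filter(lambda k: 'month::' in k, data))
month = extract_value(matches[0], 'month::')
print(month)
```

This only works while the label and the end string are in the same list element.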
Suppose, though, that the file contained some unexpected gaps, so your data looks like the following:
data=['address::1, Humpty Dumpty Lane,', 'Somewhere -->', 'month::5-->', 'day::13-->']
Notice how the label and the end string that bordered the address are in different list elements now?
    data[0]
    >>'address::1, Humpty Dumpty Lane,'
    data[1]
    >>'Somewhere -->'
If I want to get the address, I could try the filter approach, but it won’t get everything I need, as really I need it to continue up to the -->, which is now in a different list element:
    list(filter(lambda k: 'address::' in k, data))
    >>['address::1, Humpty Dumpty Lane,']
A more efficient and tidy way to deal with this is to use regular expressions through the re package. First, concatenate everything into a string:
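One way to do the concatenation is str.join. The sketch below assumes a single space as the separator, which is an assumption rather than anything the data requires. Joining with a newline instead would stop . from matching across the former element boundaries unless re.DOTALL is passed:

```python
data = ['address::1, Humpty Dumpty Lane,', 'Somewhere -->', 'month::5-->', 'day::13-->']

# Join the elements with a single space (an assumed separator) so that a value
# split across two elements becomes one contiguous piece of text again.
all_content = ' '.join(data)
print(all_content)
# address::1, Humpty Dumpty Lane, Somewhere --> month::5--> day::13-->
```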
Now use the following:
    import re
    address_info = re.findall("address::(.*?)-->", all_content)
    print(address_info)  # the list
    >>['1, Humpty Dumpty Lane, Somewhere']
    print(address_info[0])  # the string
    >>1, Humpty Dumpty Lane, Somewhere
This looks for each instance of address::, then with (.*?) gets as little text as possible, i.e. everything up to the first instance of --> that follows. If you just pass (.*) it will be greedy and get all text up to the last occurrence of -->, which matters if you have a few of them!
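To see the difference between the two patterns side by side, here is a small sketch. It assumes all_content is the example list joined with single spaces, and it strips surrounding whitespace from the lazy match for tidier output:

```python
import re

# Assumed input: the example list joined with single spaces.
all_content = 'address::1, Humpty Dumpty Lane, Somewhere --> month::5--> day::13-->'

# Lazy: stop at the first --> after address::
lazy = [m.strip() for m in re.findall(r"address::(.*?)-->", all_content)]
# Greedy: run on to the last --> in the string
greedy = re.findall(r"address::(.*)-->", all_content)

print(lazy)    # ['1, Humpty Dumpty Lane, Somewhere']
print(greedy)  # ['1, Humpty Dumpty Lane, Somewhere --> month::5--> day::13']
```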
For help with finding and testing regex commands, check out pythex. You can see it working with the example provided here.
Thanks John Stevenson for the help!