Using regex to select elements of a string with Python
Imagine you have a list of strings, such as you might get after reading a text file line by line, that looks as follows:
data=['month::5-->', 'day::13-->']
From the above, it looks like each value sits between a label, e.g. month::, and an end string, e.g. -->. If, for example, you wanted to get at the month part, you could select its element as follows, and from there start splitting the string to pull out the value:
list(filter(lambda k: 'month::' in k, data))
>>['month::5-->']
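As a rough sketch of the "splitting" step mentioned above, the value could be pulled out with two calls to str.split. The variable names month_items, raw_value and month are made up for illustration, and the conversion to int assumes the value is always a whole number:

```python
data = ['month::5-->', 'day::13-->']

# Keep only the elements carrying the 'month::' label (same idea as the filter call).
month_items = [item for item in data if 'month::' in item]

# Split off the label first, then split off the end string, leaving just the value.
raw_value = month_items[0].split('month::')[1].split('-->')[0]

# Assumption: the month is always a whole number, so int() is safe here.
month = int(raw_value)
print(month)  # 5
```

This works only while the label, the value and the end string all sit in the same list element.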
Say your data looks like the following instead, because the file contained some unexpected gaps:
data=['address::1, Humpty Dumpty Lane,', 'Somewhere -->', 'month::5-->', 'day::13-->']
Notice how the label and the end string that border the address now sit in different list elements:
data[0]
>>'address::1, Humpty Dumpty Lane,'
data[1]
>>'Somewhere -->'
If you want to get the address, you could try the filter approach, but it won't return everything you need: the value continues up to the -->, which is now in a different list element:
list(filter(lambda k: 'address::' in k, data))
>> ['address::1, Humpty Dumpty Lane,']
A tidier way to deal with this is to use regular expressions through Python's built-in re module. First, join everything into a single string:
all_content=' '.join(data) 
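To see what the regular expression will be working with, it may help to print the joined string. This is just a quick check using the same data as above:

```python
data = ['address::1, Humpty Dumpty Lane,', 'Somewhere -->', 'month::5-->', 'day::13-->']

# Join the list elements with a single space between each pair.
all_content = ' '.join(data)
print(all_content)
# address::1, Humpty Dumpty Lane, Somewhere --> month::5--> day::13-->
```

The address label and its --> now sit in the same string, so a single pattern can span them.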
Now use the following:
import re
address_info=re.findall("address::(.*?)-->", all_content)
print(address_info) # the list
>>['1, Humpty Dumpty Lane, Somewhere ']
print(address_info[0]) # the string
>>1, Humpty Dumpty Lane, Somewhere
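This looks for address::, then with (.*?) captures all text up to the first instance of --> that follows it. re.findall returns a list of every match it finds, which is why address_info is a list even though there is only one address here. If you just pass (.*), the match will be greedy and capture all text up to the last occurrence of -->, which matters if you have a few of them!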
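A small side-by-side comparison of the two patterns on the example string makes the difference visible. The names lazy, greedy and address are only for illustration, and the .strip() call at the end is an optional extra for dropping the space that sat before the -->:

```python
import re

all_content = 'address::1, Humpty Dumpty Lane, Somewhere --> month::5--> day::13-->'

# Non-greedy: stop at the first --> after address::
lazy = re.findall(r"address::(.*?)-->", all_content)

# Greedy: run on to the last --> in the string
greedy = re.findall(r"address::(.*)-->", all_content)

print(lazy)    # ['1, Humpty Dumpty Lane, Somewhere ']
print(greedy)  # ['1, Humpty Dumpty Lane, Somewhere --> month::5--> day::13']

# Optional tidy-up: remove the trailing space from the captured text.
address = lazy[0].strip()
print(address)  # 1, Humpty Dumpty Lane, Somewhere
```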
For help with finding and testing regex patterns, check out pythex. You can see it working with the example provided here.
Thanks John Stevenson for the help!