Exploring the new frontier of information extraction through large language models in urban analytics
Crooks, A., & Chen, Q. (2024). Exploring the new frontier of information extraction through large language models in urban analytics. Environment and Planning B: Urban Analytics and City Science, 51(3), 565-569.
Artificial intelligence (AI), and Large Language Models (LLMs) in particular, such as Generative Pre-trained Transformers (GPT) and Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018; Brown et al., 2020; Ouyang et al., 2022), have been gaining more and more interest in both academia and among the public at large. While the results from such LLMs might seem plausible, they can also be questionable or totally incorrect, as discussed in a past editorial by Batty (2023), who suggested that "we need to combine AI with our own knowledge and intuitions." Similar notions have been voiced elsewhere by Fu et al. (2023), who compared ChatGPT from OpenAI with manual evaluations of climate change plans and noticed that the machine-generated evaluations struggled with planning-specific jargon.
Nonetheless, over the last year, LLMs have started to be introduced and utilized in GIScience. This is especially the case as newer versions of ChatGPT have been released with more advanced capabilities, including more robust reasoning, longer context windows, and multimodal functionalities (e.g., text and images). These improvements have enabled users to interact with such systems using natural language for diverse tasks, ranging from translation, classification, and image and code generation (e.g., GitHub's Copilot) to information retrieval and mapping. For example, Hu et al. (2023) combined geographical knowledge with GPT models to recognize geographical location descriptions in social media posts to support disaster response and management, while Jang et al. (2023) used ChatGPT and DALL·E 3 (a text-to-image generation model) to study place identity and found that both can capture the salient features of the city of interest. Others, like Tao and Xu (2023), examined the capability of ChatGPT in different map-making tasks (e.g., thematic and mental maps) based on either publicly available geographical data or conversation-based textual descriptions of geographic space. Similarly, Li and Ning (2023) introduced an autonomous GIS prototype by leveraging LLMs for tasks like geographical data collection, analysis, and visualization through natural language prompts. Some are calling this GeoQA (Geographic Question Answering; Feng et al., 2023), whereby researchers utilize LLMs to answer geographic questions in natural language. These studies, and others like them, align with the notion that the "(n)ext generation of GIS: must be easy" (Zhu et al., 2021), emphasizing the shift towards more accessible and user-friendly GIS and spatial analysis technologies.
What does this mean for urban analytics? Again quoting from a past editorial by Batty (2019), the "term analytics implies a set of methods that can be used to explore, understand and predict properties and features of any system, in our case of cities." One example is the use of street view images to extract and understand the properties of the system, or the space, within a city. In the past, the training and segmentation of such data was a time-consuming and rather technical task, as can be seen in many papers, including those published in this journal, that explore feature extraction ranging from sidewalks (e.g., Ning et al., 2022) and street elements to building facades (e.g., Zhou et al., 2023). Others have taken a more manual approach to extracting information from street view images, in the sense of asking people to classify images to explore building styles or street quality (e.g., Date and Allweil, 2022; Li et al., 2022).
This leads to the question: what can LLMs contribute to such approaches, and do they offer a middle ground between the technical and manual labeling of street view images? If so, could this open up the exploration of cities through images to a new user base of researchers? To explore these questions, we used the recent release of ChatGPT-4, which allows users to upload images and, based on prompts, identify objects, colors, and so on. To showcase this, we use two examples, one from Flickr (of the White House in Washington, DC) and another from Mapillary (a city block from Manhattan in New York City (NYC)), as shown in Figure 1. We asked ChatGPT several questions, and the results can be seen in Figure 1. For example, can it answer where the photo was taken? For the White House, the answer was accurate, while for the scene in NYC, it could only suggest an inferred location in Manhattan based on the architectural style of the buildings. Further analysis on our side showed that the inferred location was less than 2 km from where the image was actually taken. We also asked, "What is the place identity shown in this image?" For the NYC image, ChatGPT noted it was an urban residential neighborhood with mixed commercial use on the ground floor. It went on to describe key elements and even extracted text from the image: "the presence of storefronts, such as the 'Asian Taste' restaurant, indicates a mixed-use neighborhood where businesses are integrated into the lower levels of residential buildings." For the White House, it recognized landmarks in the image, such as the Andrew Jackson statue in Lafayette Park, and noted that it is the official residence and workplace of the president of the United States. Using simple prompts, more information can be extracted from the images, such as architectural styles, weather conditions, and physical attributes. We were surprised that it could even detect two-way streets, bicycle lanes, and crosswalks in the NYC image, while for the White House it mentioned greenspaces, walkways, and security measures.
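While we carried out these experiments interactively through the ChatGPT web interface, the same image-and-prompt workflow could, in principle, be scripted. The following is a minimal sketch using OpenAI's Python client and a vision-capable GPT-4 model; the image URL is a hypothetical placeholder, and model names and rate limits change over time, so the current API documentation should be consulted.

```python
# Minimal sketch: posing the "place identity" question about a single
# street view image via OpenAI's Python client (v1+). The image URL is a
# hypothetical placeholder; the model name reflects the vision-capable
# GPT-4 release available at the time of writing.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "What is the place identity shown in this image?"},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/manhattan_block.jpg"}},
        ],
    }],
    max_tokens=300,
)

print(response.choices[0].message.content)
```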
How do these LLMs know how to do this? We asked ChatGPT about this with the prompt "Which algorithm did you use to extract the features from the image?" Its response was rather verbose, but in summary, it could not tell us which specific algorithms or training data were used; more generally, it stated that it utilizes Convolutional Neural Networks (CNNs) and other deep learning architectures and/or Natural Language Processing (NLP) algorithms to generate a coherent and contextually relevant description of the image, but that "the exact technical specifics and architecture of these models are proprietary to OpenAI."
This is just one example of how LLMs could be used by nearly anyone to explore the urban environment. Currently, there is a plethora of georeferenced images of cities, which are constantly being updated (e.g., Mapillary and Flickr) and can be accessed through Application Programming Interfaces (APIs). However, labeling and extracting information from such images is nontrivial, as noted above. With LLMs like ChatGPT-4, a new paradigm is emerging. Imagine if we had images for every street block in NYC and an unlimited number of requests (currently we can only post 40 messages/questions every three hours on the basic plan we have, at a cost of $20/month). We could analyze its responses, which could be combined with geocomputational techniques like spatial analysis or topic modeling, to gain insights into the built environment or, more generally, to explore urban form and function, such as architectural styles, land use patterns, and street infrastructure, to name but a few. The question, however, of how this information can then be related to urban theories of how cities change and evolve remains open.
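As a rough illustration of what such a pipeline could look like, the sketch below retrieves georeferenced street-level image URLs for a bounding box from Mapillary's Graph API; the access token and bounding box are hypothetical placeholders, and the endpoint and field names reflect our reading of Mapillary's public API documentation rather than a definitive implementation. Each retrieved URL could then be passed to the vision query shown earlier, and the responses, together with the image coordinates, handed off to spatial analysis or topic modeling.

```python
# Rough sketch: harvesting georeferenced street-level images from
# Mapillary's Graph API for later LLM-based description. The token and
# bounding box are hypothetical placeholders; field names follow our
# reading of the public API docs and should be checked against them.
import requests

MAPILLARY_TOKEN = "MLY|XXXX"          # hypothetical access token
BBOX = "-74.01,40.70,-73.99,40.72"    # min_lon,min_lat,max_lon,max_lat (Lower Manhattan)

resp = requests.get(
    "https://graph.mapillary.com/images",
    params={
        "access_token": MAPILLARY_TOKEN,
        "bbox": BBOX,
        "fields": "id,thumb_1024_url,computed_geometry",
        "limit": 10,
    },
    timeout=30,
)
resp.raise_for_status()

# Each record pairs a viewable image URL with its coordinates, which can
# be fed to the vision query shown earlier and then mapped or modeled.
for image in resp.json().get("data", []):
    print(image["id"], image["computed_geometry"], image["thumb_1024_url"])
```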
One also has to be cognizant of the limitations of LLMs. For example, they can be considered black boxes, as there is little explanation of the algorithms or data used for training (as noted above). There are also issues with respect to the reliability and accuracy of the answers provided by LLMs, along with the sensitivity of the answers to the phrasing of the questions. This echoes back to what Batty (2023) noted: that when using AI, we also need to use our own knowledge and intuition. Nonetheless, LLMs are opening up a new frontier for information extraction in urban analytics, and we do not expect to see this trend stop but rather only grow. To some extent, LLMs and the technology behind them could be seen as a way to open up access to more advanced urban analytical tools to a greater number of researchers around the world.