Opened 17 years ago
Closed 14 years ago
#484 closed Feature Requests (fixed)
Boost.Iostreams and newline translation
| Reported by: | jpstanley | Owned by: | Jonathan Turkanis |
|---|---|---|---|
| Milestone: | Boost 1.36.0 | Component: | iostreams |
| Version: | None | Severity: | Optimization |
| Keywords: | Cc: |
Description
I like the Boost.Iostreams framework in Boost 1.33. But I haven't found a way to make it do exactly what I want. I maintain software that parses files out of disk images instead of the local file system. I'd like to be able to wrap a C++ istream around my raw binary read/seek APIs to make it easier to parse data from disk images and interface with existing code. Boost.Iostreams makes this job a piece of cake--except that I haven't found a way to make the stream act like an ifstream opened in text mode (I'm on Win32, btw). In other words, I'd like to have CR, LF, and CRLF line endings translated into \n when I read the data. There is a newline_filter class provided which does nearly exactly what I want--but it doesn't support seeking. I still need seekg() and tellg() to work.

I tried writing my own filter, but I ran into some problems. It seems the Boost stream buffer class is built on the assumption that this will hold:

std::streamsize P0 = str.tellg();
str.read(buf, 100);
std::streamsize P1 = str.tellg();
assert(P1 - P0 == 100);

But with an ifstream opened in text mode, this assertion can fail (P1 - P0 > 100) if two-character CRLF combinations were translated into a one-character newline in the intervening data.

Is it possible to implement an input-seekable filter that will let a Boost.Iostream behave this way?
Change History (5)
comment:1 by , 17 years ago
comment:2 by , 17 years ago
Logged In: YES
user_id=1326483
Thank you for your timely response.
The main reason I want seekg() and tellg() to work is so
that I can remember where in the file I found a piece of
data and come back to it later. For example, suppose I want
to build an index of an mbox file (essentially a large text
file with one email message following another). When I have
the read head positioned at the start of a message, I use
tellg() to retrieve a pointer to it. Then I can use seekg()
later on when I want to retrieve the message.
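The indexing pattern described above can be sketched with nothing but the standard stream interface. This is a minimal illustration, not code from the ticket: the function names `index_messages` and `separator_at` are made up for the example, and it assumes that a line beginning with "From " marks the start of a message.

```cpp
#include <cassert>
#include <istream>
#include <sstream>   // only needed for the in-memory usage shown below
#include <string>
#include <vector>

// Record the stream position of every line that starts with "From "
// (assumed here to be the mbox message separator).
std::vector<std::streampos> index_messages(std::istream& in)
{
    std::vector<std::streampos> index;
    std::string line;
    for (;;) {
        std::streampos start = in.tellg();   // position before the line is consumed
        if (!std::getline(in, line))
            break;
        if (line.compare(0, 5, "From ") == 0)
            index.push_back(start);
    }
    in.clear();   // reset eofbit/failbit so later seekg() calls succeed
    return index;
}

// Jump back to a saved position and return the line found there.
std::string separator_at(std::istream& in, std::streampos pos)
{
    assert(pos != std::streampos(-1));   // only positions returned by tellg() are valid
    in.clear();
    in.seekg(pos);
    std::string line;
    std::getline(in, line);
    return line;
}
```

With a `std::istringstream` holding two messages, `index_messages` returns two positions and `separator_at` retrieves either separator line directly. The sketch only ever seeks to positions that `tellg()` handed out earlier, which is the restricted kind of seeking this comment asks for.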
A byte offset in the unfiltered sequence would be the most
natural form of pointer--I could retrieve a single message
quickly, without filtering everything in front of it. Of
course I couldn't compare pointers to learn the size of a
message in filtered characters, but that's not my primary
objective.
To try and make the behavior I expect a little clearer,
here's an example program:
// test.txt contains the sequence "Hello\x0D\x0AWorld!"
// Compile on MSVC++ with CL /GX newline_test.cpp
#include <iostream>
#include <fstream>
int main(int argc, char **argv)
{
    std::ifstream infile("test.txt");
    infile.seekg(5, std::ios::beg);
    std::cout << infile.tellg() << std::endl;
    infile.get();
    std::cout << infile.tellg() << std::endl;
    infile.close();
    return 0;
}
The seekg() statement positions the read head at the CR
character. The first tellg() returns, unsurprisingly, 5.
Then get() extracts both the CR and the LF from the
unfiltered sequence and returns '\n'. The second tellg()
call returns 7, even though there is only one character in
the filtered sequence between the two points.
comment:3 by , 15 years ago
| Component: | None → iostreams |
|---|---|
| Severity: | → Problem |
comment:4 by , 15 years ago
| Milestone: | → Boost 1.36.0 |
|---|---|
| Severity: | Problem → Optimization |
| Type: | Support Requests → Feature Requests |
I'm sorry I let this go so long.
I will consider it as a feature request for a form of seekability weaker than that currently supported, in which you can only request that the file pointer be restored to a location that was previously saved. A weak-seekable newline filter would then not have to worry about seeking relative to the current location, but would just have to remember the file offsets of the downstream device at various previously queried locations.
It may already be possible to implement this in the current library, but introducing a new concept might clarify the situation.
I will consider implementing this in 1.36.
comment:5 by , 14 years ago
| Resolution: | None → fixed |
|---|---|
| Status: | assigned → closed |
I've decided the correct way to solve this problem is to write a seekable filter adapter that provides an implementation of seek in terms of user-supplied implementations of read and write. I have added this idea to the list of possible new filters and devices in the Iostreams Roadmap.
In case the wiki entry changes, here is the current description: "A seekable_filter_adapter that provides an implementation of seek() when the user has defined read() and write(). seek() would work by checking whether the offset is relative to the beginning of the stream, and if so, whether it corresponds to a previously saved offset. If so, it fetches the stored offset in the unfiltered stream and performs a seek on the downstream device. This would solve the problem raised by #484"
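The lookup step in that description can be sketched as a small table from filtered offsets to unfiltered offsets. This is only an illustration of the idea, not the proposed seekable_filter_adapter and not part of Boost.Iostreams: the class name `SavedOffsets`, its members, and the use of -1 to signal an unsupported request are all assumptions made for the example.

```cpp
#include <cassert>
#include <ios>
#include <map>

// Hypothetical bookkeeping for a "weak-seekable" filter: it can only
// return to positions that were saved earlier.
class SavedOffsets {
public:
    // Remember that filtered position `filtered` corresponds to
    // unfiltered (downstream) position `raw`.
    void save(std::streamoff filtered, std::streamoff raw)
    {
        assert(filtered >= 0 && raw >= 0);
        table_[filtered] = raw;
    }

    // Translate a seek request into a downstream offset.
    // Returns -1 when the request cannot be honoured: it is not relative
    // to the beginning of the stream, or the position was never saved.
    std::streamoff resolve(std::streamoff off, std::ios_base::seekdir way) const
    {
        if (way != std::ios_base::beg)
            return -1;
        std::map<std::streamoff, std::streamoff>::const_iterator it = table_.find(off);
        return it == table_.end() ? -1 : it->second;
    }

private:
    std::map<std::streamoff, std::streamoff> table_;
};
```

A filter built this way would call `save()` whenever the current position is queried and would pass the result of `resolve()` to the downstream device's seek. Requests relative to the current position or to the end are rejected, which matches the weaker form of seekability discussed in comment:4.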

Logged In: YES
user_id=811799
Thanks for your interest in Boost.Iostreams.
SourceForge.net wrote:
> Submitted By: jpstanley (jpstanley)
> I maintain software that parses files out of disk
> images instead of the local file system. I'd like to
> be able to wrap a C++ istream around my raw binary
> read/seek APIs to make it easier to parse data from
> disk images and interface with existing code.
> Boost.Iostreams makes this job a piece of cake--except
> that I haven't found a way to make the stream act like
> an ifstream opened in text mode (I'm on Win32, btw).
> In other words, I'd like to have CR, LF, and CRLF line
> endings translated into \n when I read the data. There
> is a newline_filter class provided which does nearly
> exactly what I want--but it doesn't support seeking. I
> still need seekg() and tellg() to work.
>
> I tried writing my own filter, but I ran into some
> problems. It seems the Boost stream buffer class is
> built on the assumption that this will hold:
>
> std::streamsize P0 = str.tellg();
> str.read(buf, 100);
> std::streamsize P1 = str.tellg();
> assert(P1 - P0 == 100);

Not quite: you could reach EOF before reading 100 characters ;-) But I know what you mean.

When you are reading a filtered sequence, there are two file pointers to worry about: there's the current position in the filtered sequence, and the current position in the unfiltered sequence. When you successfully read 100 characters from the filtered sequence, as here:

str.read(buf, 100);

the current position in the *filtered* sequence is advanced by exactly 100 characters. This may correspond to more than 100 or fewer than 100 characters in the unfiltered sequence.

> But with an ifstream opened in text mode, this
> assertion can fail (P1 - P0 > 100) if two-character
> CRLF combinations were translated into a one-character
> newline in the intervening data.

Here you're talking about the current position in the unfiltered sequence. To query this value, you can't use str.tellg(); you have to call seek() on the underlying filter.

> Is it possible to implement an input-seekable filter
> that will let a Boost.Iostream behave this way?

If you want to be able to seek within the filtered sequence, it's possible, but it will be inefficient. If you want to be able to seek with offsets interpreted relative to the unfiltered sequence, I'm not sure it can be done. I'd like you to describe more exactly what you want to do.

-- Jonathan Turkanis www.kangaroologic.com
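
The distinction between the two file pointers can be made concrete with a small reader that counts both. This is a sketch built on a plain std::istream rather than on Boost.Iostreams; the class name `TranslatingReader` and its members are invented for the illustration, and it assumes the underlying stream delivers raw bytes with no newline translation of its own.

```cpp
#include <cassert>
#include <istream>
#include <sstream>   // only needed for the in-memory usage shown below

// Hypothetical reader: translates CR, LF and CRLF to '\n' while
// counting positions in both the filtered and the unfiltered sequence.
class TranslatingReader {
public:
    explicit TranslatingReader(std::istream& raw)
        : raw_(raw), filtered_pos_(0), raw_pos_(0) {}

    // Returns the next translated character, or -1 at end of input.
    int get()
    {
        int c = raw_.get();
        if (c == std::istream::traits_type::eof())
            return -1;
        ++raw_pos_;
        if (c == '\r') {
            if (raw_.peek() == '\n') {   // CRLF: swallow the LF as well
                raw_.get();
                ++raw_pos_;
            }
            c = '\n';
        }
        ++filtered_pos_;                 // exactly one filtered character per call
        return c;
    }

    std::streamoff filtered_pos() const { return filtered_pos_; }
    std::streamoff raw_pos() const { return raw_pos_; }

private:
    std::istream& raw_;
    std::streamoff filtered_pos_;
    std::streamoff raw_pos_;
};
```

Feeding it "Hello\r\nWorld!" through a `std::istringstream` and reading six characters leaves `filtered_pos()` at 6 and `raw_pos()` at 7, the same one-byte discrepancy around the CRLF pair that the test program in comment:2 reports through tellg().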